我们的AI没有护城河，OpenAI也没有丨Google "We Have No Moat, And Neither Does OpenAI"

type

status

date

slug

summary

我们没有护城河

我们一直在关注 OpenAI 的动向。谁将达到下一个里程碑？下一步会是什么？

但不舒服的事实是，我们无法赢得这场军备竞赛，OpenAI 也一样。当我们争吵不休时，第三方一直在悄悄地抢占我们的市场。

我当然是在谈论开源。简单地说，他们正在超越我们。我们认为的“重大开放问题”已经被解决，并且现在已经掌握在人们手中。仅举几个例子：

手机上的LLMs: 人们正在Pixel 6上以每秒5个标记的速度运行基础模型 at 5 tokens / sec.

可扩展的个人人工智能: 您可以在一晚上在您的笔记本电脑上微调个性化的人工智能

这并不是“解决”，而是“避免”: 这个网站充满了没有任何限制的艺术模特，而文本也不远了, and text is not far behind.

多模态: 目前多模态ScienceQA SOTA在一个小时内进行了训练

虽然我们的模型在质量方面仍然稍占优势，但差距正在迅速缩小。开源模型更快、更可定制、更私密，而且在性能上更胜一筹。他们用100美元和13B个参数做的事情，我们用1000万美元和540B个参数仍然很困难。而且他们只需要几周的时间，而不是几个月的时间。这对我们有深远的影响：

我们没有秘密武器。我们最好的希望是从谷歌之外的其他人所做的事情中学习并合作。我们应该优先考虑启用第三方集成。

当免费、不受限制的替代品在质量上相当时，人们不会为受限制的模型付费。我们应该考虑我们真正的价值所在。

巨型模型正在拖慢我们的速度。从长远来看，最好的模型是那些可以快速迭代的。既然我们知道在小于20B参数范围内可能实现什么，我们应该将小变体视为重要的事情，而不是事后的想法。

What Happened

三月初，开源社区得到了他们的第一个真正有能力的基础模型，Meta的LLaMA泄露给了公众。它没有指令或对话调整，也没有RLHF。尽管如此，社区立即理解了他们所得到的东西的重要性。

紧接着是一系列惊人的创新，每个重大进展之间仅相隔数天（详见时间轴）。现在，仅仅一个月后，已经出现了具有指令调整、量化、质量改进、人类评估、多模态、RLHF等变体，其中许多变体都是相互建立的。

最重要的是，他们已经解决了扩展问题，以至于任何人都可以进行调整。许多新想法来自普通人。培训和实验的准入门槛已经从一个主要研究机构的总产出降至一个人、一个晚上和一台强大的笔记本电脑。

为什么我们本可以预见到它的到来

在许多方面，这对任何人来说都不应该是一个惊喜。开源LLM的当前复兴紧随图像生成的复兴之后。社区没有忽视这些相似之处，许多人称这是LLM的“稳定扩散时刻”。

在这两种情况下，低成本的公众参与都得益于一种名为低秩适应（LoRA）的微调机制，以及在规模上取得的重大突破（图像合成的潜在扩散，LLM的Chinchilla）。在这两种情况下，获得足够高质量的模型引发了全球个人和机构的一系列想法和迭代。在这两种情况下，这很快超越了大型参与者。

这些贡献对于图像生成领域至关重要，使得 Stable Diffusion 走上了与 Dall-E 不同的道路。拥有一个开放的模型导致了产品集成、市场、用户界面和创新，这些在 Dall-E 上没有发生。

效果是明显的：在文化影响方面，快速占主导地位，而OpenAI的解决方案变得越来越无关紧要。LLMs是否会发生同样的事情还有待观察，但广泛的结构元素是相同的。

我们错失了什么

最近开源取的成功创新直接解决了我们仍在努力解决的问题，我们应当更加关注他们的工作，这可以帮助我们避免重复造轮子。

LoRA是一种非常强大的技术，我们可能应该更加关注它

LoRA的工作原理是将模型更新表示为低秩分解，这将更新矩阵的大小减少了数千倍。这使得模型微调的成本和时间降低了很多。在消费级硬件上能够在几个小时内个性化一个语言模型是一件大事，特别是对于涉及在几乎实时中融合新的和多样化知识的愿景而言。尽管这项技术直接影响我们最雄心勃勃的项目之一，但它在谷歌内部的利用还不足。

从头开始重新训练模型是一条艰难的道路

LoRA 如此有效的部分原因在于它像其他微调形式一样是可堆叠的。例如，指令微调等改进可以应用，然后随着其他贡献者添加对话、推理或工具使用等内容而发挥作用。虽然单个微调的等级较低，但它们的总和并不低低，因为它们允许模型的全等级更新是随着时间的推移不断累积。

这意味着随着新的和更好的数据集和任务的出现，模型可以便宜地保持最新状态，而无需支付完整运行的成本。

相比之下，从头开始训练巨型模型不仅会丢弃预训练，还会丢弃在其基础上进行的任何迭代改进。在开源世界中，这些改进很快就会占据主导地位，使得完全重新训练变得极其昂贵。

我们应该仔细思考每个新应用或新想法是否真的需要一个全新的模型。如果我们确实有重大的架构改进，使得直接重用模型权重不可行，那么我们应该采取更积极主动的知识提炼方式，以尽可能保留上一代模型的能力。

如果我们可以更快地迭代小模型，那么大模型在长期内并不更具有能力

LoRA更新成本低廉（最受欢迎的模型大小约为100美元），这意味着几乎任何有想法的人都可以生成并分发它。训练时间不到一天就可以完成。在这样的速度下，所有这些微调的累积效应很快就会克服起点劣势。事实上，就工程师小时而言，这些模型的改进速度远远超过我们最大变体的能力，而且最好的模型已经在很大程度上与ChatGPT难以区分。实际上，专注于维护一些全球最大的模型实际上会使我们处于劣势。

数据质量比数据大小更重要

许多这些项目通过在小型、高度筛选的数据集上进行训练来节省时间。这表明数据缩放规律存在一定的灵活性。这些数据集的存在源于《数据不会按你想象的那样工作》一书中的思路，并且它们正在迅速成为谷歌以外进行训练的标准方式。这些数据集是使用合成方法构建的（例如从现有模型中筛选出最佳响应），并从其他项目中进行回收利用，这两种方法在谷歌都没有占主导地位。幸运的是，这些高质量的数据集是开源的，因此可以免费使用。

直接与开源竞争是一种失败的提议

这最近的进展对我们的商业策略有直接、即时的影响。如果有一个没有使用限制的免费高质量替代品，谁会愿意购买谷歌的产品呢？

我们不应该期望能够追赶上来。现代互联网之所以运行在开源软件上，是有其重要优势的，而这些优势我们无法复制。

我们需要他们，比他们需要我们更多

保守我们技术的秘密一直是一个不好建议。谷歌的研究人员会跳槽，因此我们可以假设他们知道我们所知道的一切，并且只要有人跳槽，外部就将继续知道我们的秘密。

但是，由于LLM的尖端研究是负担得起的，因此在技术上保持竞争优势变得更加困难。世界各地的研究机构都在相互努力的基础上，以远远超过我们自身能力的广度优先的方式探索解决方案空间。我们可以在外部创新稀释其价值的同时，努力保守自己的秘密，也可以努力相互学习。

个人受许可证的限制程度不如公司严格

这种创新大部分是在Meta泄露的模型的基础上进行的。虽然随着真正开放的模型变得更好，这种情况不可避免地会发生改变，但关键是他们不必等待。由“个人使用”提供的法律保护和起诉个人的不切实际意味着，个人可以在这些技术处于热门时获得对它们的访问权。

成为自己的客户意味着你理解使用情况

纵观人们在图像生成领域中创建的模型，有着广泛的创造力，从动漫生成器到HDR风景。这些模型是由深度沉浸在特定子流派中的人们使用和创建的，他们拥有我们无法匹敌的知识深度和共情能力。

拥有生态系统：让开源为我们服务

讽刺的是，在所有这些事件中唯一的明显赢家是Meta。因为泄露的模型是他们的，他们有效地获得了整个星球的免费劳动力。由于大多数开源创新都是在他们的架构之上进行的，所以没有什么能阻止他们直接将其纳入他们的产品中。

拥有生态系统的价值不可高估。谷歌本身已经成功地在其开源产品中使用了这种范式，例如Chrome和Android。通过拥有创新发生的平台，谷歌巩固了自己作为思想领袖和方向设定者的地位，赢得了塑造超越自身的思想叙述的能力。

我们越是严格控制我们的模型，就越能使开放式替代方案更具吸引力。Google和OpenAI都倾向于采用防御性的发布模式，以便保持对其模型使用方式的严格控制。但这种控制是虚构的，因为任何试图将LLMs用于未经授权目的的人都可以随意选择免费提供的模型。

谷歌应该在开源社区中确立自己的领导地位，通过与更广泛的对话合作来领导。这可能意味着采取一些不舒服的步骤，比如发布小型ULM变体的模型权重。这必然意味着放弃对我们模型的一些控制。但这种妥协是不可避免的。我们不能希望既推动创新又控制它。

结语：OpenAI 怎么样了？

所有这些关于开源的讨论可能会让人感到不公平，因为OpenAI目前的政策是封闭的。如果他们不分享，为什么我们要分享呢？但事实是，我们已经通过不断流失的高级研究人员与他们分享了一切。在我们阻止这种流失之前，保密是没有意义的。

最终，OpenAI并不重要。他们在开源方面采取的立场和他们保持优势的能力都值得质疑，这些都是我们所犯的同样错误。除非他们改变立场，否则开源替代品最终将超越他们。至少在这方面，我们可以率先采取行动。

作者：BigYe程普链接：https://juejin.cn/post/7229593695653314597来源：稀土掘金著作权归作者所有。商业转载请联系作者获得授权，非商业转载请注明出处。

原文英文：https://www.semianalysis.com/p/google-we-have-no-moat-and-neither

We Have No Moat

And neither does OpenAI

We’ve done a lot of looking over our shoulders at OpenAI. Who will cross the next milestone? What will the next move be?

But the uncomfortable truth is, we aren’t positioned to win this arms race and neither is OpenAI. While we’ve been squabbling, a third faction has been quietly eating our lunch.

I’m talking, of course, about open source. Plainly put, they are lapping us. Things we consider “major open problems” are solved and in people’s hands today. Just to name a few:

LLMs on a Phone: People are running foundation models on a Pixel 6 at 5 tokens / sec.

Scalable Personal AI: You can finetune a personalized AI on your laptop in an evening.

Responsible Release: This one isn’t “solved” so much as “obviated”. There are entire websites full of art models with no restrictions whatsoever, and text is not far behind.

Multimodality: The current multimodal ScienceQA SOTA was trained in an hour.

While our models still hold a slight edge in terms of quality, the gap is closing astonishingly quickly. Open-source models are faster, more customizable, more private, and pound-for-pound more capable. They are doing things with $100 and 13B params that we struggle with at $10M and 540B. And they are doing so in weeks, not months. This has profound implications for us:

We have no secret sauce. Our best hope is to learn from and collaborate with what others are doing outside Google. We should prioritize enabling 3P integrations.

People will not pay for a restricted model when free, unrestricted alternatives are comparable in quality. We should consider where our value add really is.

Giant models are slowing us down. In the long run, the best models are the ones

which can be iterated upon quickly. We should make small variants more than an afterthought, now that we know what is possible in the <20B parameter regime.

https://lmsys.org/blog/2023-03-30-vicuna/

What Happened

At the beginning of March the open source community got their hands on their first really capable foundation model, as Meta’s LLaMA was leaked to the public. It had no instruction or conversation tuning, and no RLHF. Nonetheless, the community immediately understood the significance of what they had been given.

A tremendous outpouring of innovation followed, with just days between major developments (see The Timeline for the full breakdown). Here we are, barely a month later, and there are variants with instruction tuning, quantization, quality improvements, human evals, multimodality, RLHF, etc. etc. many of which build on each other.

Most importantly, they have solved the scaling problem to the extent that anyone can tinker. Many of the new ideas are from ordinary people. The barrier to entry for training and experimentation has dropped from the total output of a major research organization to one person, an evening, and a beefy laptop.

Why We Could Have Seen It Coming

In many ways, this shouldn’t be a surprise to anyone. The current renaissance in open source LLMs comes hot on the heels of a renaissance in image generation. The similarities are not lost on the community, with many calling this the “Stable Diffusion moment” for LLMs.

In both cases, low-cost public involvement was enabled by a vastly cheaper mechanism for fine tuning called low rank adaptation, or LoRA, combined with a significant breakthrough in scale (latent diffusion for image synthesis, Chinchilla for LLMs). In both cases, access to a sufficiently high-quality model kicked off a flurry of ideas and iteration from individuals and institutions around the world. In both cases, this quickly outpaced the large players.

These contributions were pivotal in the image generation space, setting Stable Diffusion on a different path from Dall-E. Having an open model led to product integrations, marketplaces, user interfaces, and innovations that didn’t happen for Dall-E.

The effect was palpable: rapid domination in terms of cultural impact vs the OpenAI solution, which became increasingly irrelevant. Whether the same thing will happen for LLMs remains to be seen, but the broad structural elements are the same.

What We Missed

The innovations that powered open source’s recent successes directly solve problems we’re still struggling with. Paying more attention to their work could help us to avoid reinventing the wheel.

LoRA is an incredibly powerful technique we should probably be paying more attention to

LoRA works by representing model updates as low-rank factorizations, which reduces the size of the update matrices by a factor of up to several thousand. This allows model fine-tuning at a fraction of the cost and time. Being able to personalize a language model in a few hours on consumer hardware is a big deal, particularly for aspirations that involve incorporating new and diverse knowledge in near real-time. The fact that this technology exists is underexploited inside Google, even though it directly impacts some of our most ambitious projects.

Retraining models from scratch is the hard path

Part of what makes LoRA so effective is that - like other forms of fine-tuning - it’s stackable. Improvements like instruction tuning can be applied and then leveraged as other contributors add on dialogue, or reasoning, or tool use. While the individual fine tunings are low rank, their sum need not be, allowing full-rank updates to the model to accumulate over time.

This means that as new and better datasets and tasks become available, the model can be cheaply kept up to date, without ever having to pay the cost of a full run.

By contrast, training giant models from scratch not only throws away the pretraining, but also any iterative improvements that have been made on top. In the open source world, it doesn’t take long before these improvements dominate, making a full retrain extremely costly.

We should be thoughtful about whether each new application or idea really needs a whole new model. If we really do have major architectural improvements that preclude directly reusing model weights, then we should invest in more aggressive forms of distillation that allow us to retain as much of the previous generation’s capabilities as possible.

Large models aren’t more capable in the long run if we can iterate faster on small models

LoRA updates are very cheap to produce (~$100) for the most popular model sizes. This means that almost anyone with an idea can generate one and distribute it. Training times under a day are the norm. At that pace, it doesn’t take long before the cumulative effect of all of these fine-tunings overcomes starting off at a size disadvantage. Indeed, in terms of engineer-hours, the pace of improvement from these models vastly outstrips what we can do with our largest variants, and the best are already largely indistinguishable from ChatGPT. Focusing on maintaining some of the largest models on the planet actually puts us at a disadvantage.

Data quality scales better than data size

Many of these projects are saving time by training on small, highly curated datasets. This suggests there is some flexibility in data scaling laws. The existence of such datasets follows from the line of thinking in Data Doesn't Do What You Think, and they are rapidly becoming the standard way to do training outside Google. These datasets are built using synthetic methods (e.g. filtering the best responses from an existing model) and scavenging from other projects, neither of which is dominant at Google. Fortunately, these high quality datasets are open source, so they are free to use.

Directly Competing With Open Source Is a Losing Proposition

This recent progress has direct, immediate implications for our business strategy. Who would pay for a Google product with usage restrictions if there is a free, high quality alternative without them?

And we should not expect to be able to catch up. The modern internet runs on open source for a reason. Open source has some significant advantages that we cannot replicate.

We need them more than they need us

Keeping our technology secret was always a tenuous proposition. Google researchers are leaving for other companies on a regular cadence, so we can assume they know everything we know, and will continue to for as long as that pipeline is open.

But holding on to a competitive advantage in technology becomes even harder now that cutting edge research in LLMs is affordable. Research institutions all over the world are building on each other’s work, exploring the solution space in a breadth-first way that far outstrips our own capacity. We can try to hold tightly to our secrets while outside innovation dilutes their value, or we can try to learn from each other.

Individuals are not constrained by licenses to the same degree as corporations

Much of this innovation is happening on top of the leaked model weights from Meta. While this will inevitably change as truly open models get better, the point is that they don’t have to wait. The legal cover afforded by “personal use” and the impracticality of prosecuting individuals means that individuals are getting access to these technologies while they are hot.

Being your own customer means you understand the use case

Browsing through the models that people are creating in the image generation space, there is a vast outpouring of creativity, from anime generators to HDR landscapes. These models are used and created by people who are deeply immersed in their particular subgenre, lending a depth of knowledge and empathy we cannot hope to match.

Owning the Ecosystem: Letting Open Source Work for Us

Paradoxically, the one clear winner in all of this is Meta. Because the leaked model was theirs, they have effectively garnered an entire planet's worth of free labor. Since most open source innovation is happening on top of their architecture, there is nothing stopping them from directly incorporating it into their products.

The value of owning the ecosystem cannot be overstated. Google itself has successfully used this paradigm in its open source offerings, like Chrome and Android. By owning the platform where innovation happens, Google cements itself as a thought leader and direction-setter, earning the ability to shape the narrative on ideas that are larger than itself.

The more tightly we control our models, the more attractive we make open alternatives. Google and OpenAI have both gravitated defensively toward release patterns that allow them to retain tight control over how their models are used. But this control is a fiction. Anyone seeking to use LLMs for unsanctioned purposes can simply take their pick of the freely available models.

Google should establish itself a leader in the open source community, taking the lead by cooperating with, rather than ignoring, the broader conversation. This probably means taking some uncomfortable steps, like publishing the model weights for small ULM variants. This necessarily means relinquishing some control over our models. But this compromise is inevitable. We cannot hope to both drive innovation and control it.

Epilogue: What about OpenAI?

All this talk of open source can feel unfair given OpenAI’s current closed policy. Why do we have to share, if they won’t? But the fact of the matter is, we are already sharing everything with them in the form of the steady flow of poached senior researchers. Until we stem that tide, secrecy is a moot point.

And in the end, OpenAI doesn’t matter. They are making the same mistakes we are in their posture relative to open source, and their ability to maintain an edge is necessarily in question. Open source alternatives can and will eventually eclipse them unless they change their stance. In this respect, at least, we can make the first move.