List of large language models
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, trained by self-supervised learning on vast amounts of text.
This page lists notable large language models.
For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64×10^19 FLOP. In addition, only the cost of the largest model is listed.
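To make the unit concrete, here is a minimal Python sketch (illustrative only; the function name and the example figure are ours, not taken from any cited source) that derives the 8.64×10^19 constant and converts a raw FLOP count into petaFLOP-days. A second sketch after the table applies the same conversion to FLOP totals quoted in the notes of two rows.

```python
# 1 petaFLOP-day = 1e15 FLOP/s sustained for one day (86,400 s) = 8.64e19 FLOP.
PETAFLOP_PER_SECOND = 1e15
SECONDS_PER_DAY = 24 * 60 * 60                                   # 86,400
FLOP_PER_PETAFLOP_DAY = PETAFLOP_PER_SECOND * SECONDS_PER_DAY     # 8.64e19

def flop_to_petaflop_days(total_flop: float) -> float:
    """Convert a total training FLOP count into petaFLOP-days."""
    return total_flop / FLOP_PER_PETAFLOP_DAY

if __name__ == "__main__":
    # Hypothetical example: a 2e24 FLOP training run
    # is 2e24 / 8.64e19 ≈ 23,148 petaFLOP-days.
    print(f"{flop_to_petaflop_days(2e24):,.0f} petaFLOP-days")
```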
| Name | Release date[a] | Developer | Number of parameters (billions)[b] | Corpus size | Training cost (petaFLOP-day) | License[c] | Notes |
|---|---|---|---|---|---|---|---|
| Attention Is All You Need | June 2017 | Vaswani et al. at Google | 0.213 | 36 million English-French sentence pairs | 0.09[1] | Unreleased | Trained for 300,000 steps on 8 NVIDIA P100 GPUs. Training and evaluation code released under the Apache 2.0 license.[2] |
| GPT-1 | June 2018 | OpenAI | 0.117 | | 1[3] | MIT[4] | The first GPT model; a decoder-only transformer. Trained for 30 days on 8 P600 GPUs. |
| BERT | October 2018 | Google | 0.340[5] | 3.3 billion words[5] | 9[6] | Apache 2.0[7] | An early and influential language model.[8] Encoder-only, so not built to be prompted or generative.[9] Training took 4 days on 64 TPUv2 chips.[10] |
| T5 | October 2019 | Google | 11[11] | 34 billion tokens[11] | | Apache 2.0[12] | Base model for many Google projects, such as Imagen.[13] |
| XLNet | June 2019 | Google | 0.340[14] | 33 billion words | 330 | Apache 2.0[15] | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.[16] |
| GPT-2 | February 2019 | OpenAI | 1.5[17] | 40 GB[18] (~10 billion tokens)[19] | 28[20] | MIT[21] | Trained for one week on 32 TPU v3 chips.[20] |
| GPT-3 | May 2020 | OpenAI | 175[22] | 300 billion tokens[19] | 3640[23] | Proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public in 2022 through a web interface called ChatGPT.[24] |
| GPT-Neo | March 2021 | EleutherAI | 2.7[25] | 825 GiB[26] | | MIT[27] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[27] |
| GPT-J | June 2021 | EleutherAI | 6[28] | 825 GiB[26] | 200[29] | Apache 2.0 | GPT-3-style language model. |
| Megatron-Turing NLG | October 2021[30] | Microsoft and Nvidia | 530[31] | 338.6 billion tokens[31] | 38000[32] | Restricted web access | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene supercomputer, for over 3 million GPU-hours.[32] |
| Ernie 3.0 Titan | December 2021 | Baidu | 260[33] | 4 Tb | | Proprietary | Chinese-language LLM. Ernie Bot is based on this model. |
| Claude[34] | December 2021 | Anthropic | 52[35] | 400 billion tokens[35] | | beta | Fine-tuned for desirable behavior in conversations.[36] |
| GLaM (Generalist Language Model) | December 2021 | Google | 1200[37] | 1.6 trillion tokens[37] | 5600[37] | Proprietary | Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3. |
| Gopher | December 2021 | DeepMind | 280[38] | 300 billion tokens[39] | 5833[40] | Proprietary | Later developed into the Chinchilla model. |
| LaMDA (Language Models for Dialog Applications) | January 2022 | Google | 137[41] | 1.56T words,[41] 168 billion tokens[39] | 4110[42] | Proprietary | Specialized for response generation in conversations. |
| GPT-NeoX | February 2022 | EleutherAI | 20[43] | 825 GiB[26] | 740[29] | Apache 2.0 | Based on the Megatron architecture. |
| Chinchilla | March 2022 | DeepMind | 70[44] | 1.4 trillion tokens[44][39] | 6805[40] | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law. |
| PaLM (Pathways Language Model) | April 2022 | Google | 540[45] | 768 billion tokens[44] | 29,250[40] | Proprietary | Trained for ~60 days on ~6000 TPU v4 chips.[40] As of October 2024, it is the largest dense Transformer published. |
| OPT (Open Pretrained Transformer) | May 2022 | Meta | 175[46] | 180 billion tokens[47] | 310[29] | Non-commercial research[d] | GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published.[48] |
| YaLM 100B | June 2022 | Yandex | 100[49] | 1.7TB[49] | | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM. |
| Minerva | June 2022 | Google | 540[50] | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server[50] | | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning".[51] Initialized from PaLM models, then finetuned on mathematical and scientific data. |
| BLOOM | July 2022 | Large collaboration led by Hugging Face | 175[52] | 350 billion tokens (1.6TB)[53] | | Responsible AI | Essentially GPT-3 but trained on a multi-lingual corpus (30% English, excluding programming languages). |
| Galactica | November 2022 | Meta | 120 | 106 billion tokens[54] | Unknown | CC-BY-NC-4.0 | Trained on scientific text and modalities. |
| AlexaTM (Teacher Models) | November 2022 | Amazon | 20[55] | 1.3 trillion[56] | | Proprietary[57] | Bidirectional sequence-to-sequence architecture. |
| LLaMA (Large Language Model Meta AI) | February 2023 | Meta AI | 65[58] | 1.4 trillion[58] | 6300[59] | Non-commercial research[e] | Corpus has 20 languages. "Overtrained" (compared to the Chinchilla scaling law) for better performance with fewer parameters.[58] |
| GPT-4 | March 2023 | OpenAI | Unknown[f] (rumored to be 1760)[61] | Unknown | Unknown | Proprietary | Available for ChatGPT Plus users and used in several products. |
| Chameleon | June 2024 | Meta AI | 34[62] | 4.4 trillion | | | |
| Cerebras-GPT | March 2023 | Cerebras | 13[63] | | 270[29] | Apache 2.0 | Trained with the Chinchilla formula. |
| Falcon | March 2023 | Technology Innovation Institute | 40[64] | 1 trillion tokens, from RefinedWeb (filtered web text corpus)[65] plus some "curated corpora"[66] | 2800[59] | Apache 2.0[67] | |
| BloombergGPT | March 2023 | Bloomberg L.P. | 50 | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets[68] | | Proprietary | Trained on financial data from proprietary sources, for financial tasks. |
| PanGu-Σ | March 2023 | Huawei | 1085 | 329 billion tokens[69] | | Proprietary | |
| OpenAssistant[70] | March 2023 | LAION | 17 | 1.5 trillion tokens | | Apache 2.0 | Trained on crowdsourced open data. |
| Jurassic-2[71] | March 2023 | AI21 Labs | Unknown | Unknown | | Proprietary | Multilingual.[72] |
| PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340[73] | 3.6 trillion tokens[73] | 85,000[59] | Proprietary | Was used in the Bard chatbot.[74] |
| Llama 2 | July 2023 | Meta AI | 70[75] | 2 trillion tokens[75] | 21,000 | Llama 2 license | 1.7 million A100-hours.[76] |
| Claude 2 | July 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot.[77] |
| Granite 13b | July 2023 | IBM | Unknown | Unknown | Unknown | Proprietary | Used in IBM Watsonx.[78] |
| Mistral 7B | September 2023 | Mistral AI | 7.3[79] | Unknown | | Apache 2.0 | |
| Claude 2.1 | November 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.[80] |
| Grok-1[81] | November 2023 | xAI | 314 | Unknown | Unknown | Apache 2.0 | Used in the Grok chatbot. Grok-1 has a context length of 8,192 tokens and has access to X (Twitter).[82] |
| Gemini 1.0 | December 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name.[83] |
| Mixtral 8x7B | December 2023 | Mistral AI | 46.7 | Unknown | Unknown | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.[84] Mixture-of-experts model, with 12.9 billion parameters activated per token.[85] |
| Mixtral 8x22B | April 2024 | Mistral AI | 141 | Unknown | Unknown | Apache 2.0 | [86] |
| DeepSeek LLM | November 29, 2023 | DeepSeek | 67 | 2T tokens[87] | 12,000 | DeepSeek License | Trained on English and Chinese text. 1e24 FLOPs for the 67B model, 1e23 FLOPs for the 7B model.[87] |
| Phi-2 | December 2023 | Microsoft | 2.7 | 1.4T tokens | 419[88] | MIT | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs.[88] |
| Gemini 1.5 | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, based on a mixture-of-experts (MoE) architecture. Context window above 1 million tokens.[89] |
| Gemini Ultra | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | | |
| Gemma | February 2024 | Google DeepMind | 7 | 6T tokens | Unknown | Gemma Terms of Use[90] | |
| Claude 3 | March 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes three models: Haiku, Sonnet, and Opus.[91] |
| Nova | October 2024 | Rubik's AI | Unknown | Unknown | Unknown | Proprietary | Includes three models: Nova-Instant, Nova-Air, and Nova-Pro. |
| DBRX | March 2024 | Databricks and MosaicML | 136 | 12T tokens | | Databricks Open Model License | Training cost 10 million USD. |
| Fugaku-LLM | May 2024 | Fujitsu, Tokyo Institute of Technology, et al. | 13 | 380B tokens | | | The largest model ever trained using only CPUs, on Fugaku.[92] |
| Phi-3 | April 2024 | Microsoft | 14[93] | 4.8T tokens | | MIT | Microsoft markets them as "small language models".[94] |
| Granite Code Models | May 2024 | IBM | Unknown | Unknown | Unknown | Apache 2.0 | |
| Qwen2 | June 2024 | Alibaba Cloud | 72[95] | 3T tokens | Unknown | Qwen License | Multiple sizes, the smallest being 0.5B. |
| DeepSeek V2 | June 2024 | DeepSeek | 236 | 8.1T tokens | 28,000 | DeepSeek License | 1.4M hours on H800 GPUs.[96] |
| Nemotron-4 | June 2024 | Nvidia | 340 | 9T tokens | 200,000 | NVIDIA Open Model License | Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024.[97][98] |
| Llama 3.1 | July 2024 | Meta AI | 405 | 15.6T tokens | 440,000 | Llama 3 license | The 405B version took 31 million hours on H100-80GB, at 3.8E25 FLOPs.[99][100] |
| DeepSeek V3 | December 2024 | DeepSeek | 671 | 14.8T tokens | 56,000 | DeepSeek License | Trained for 2.788 million hours on H800 GPUs.[101] |
| Amazon Nova | December 2024 | Amazon | Unknown | Unknown | Unknown | Proprietary | Includes three models: Nova Micro, Nova Lite, and Nova Pro.[102] |
| DeepSeek R1 | January 2025 | DeepSeek | 671 | Unknown | Unknown | MIT | No pretraining; trained via reinforcement learning on top of V3-Base.[103][104] |
| Qwen2.5 | January 2025 | Alibaba | 72 | 18T tokens | Unknown | Qwen License | [105] |
| MiniMax-Text-01 | January 2025 | Minimax | 456 | 4.7T tokens[106] | Unknown | Minimax Model license | [107][106] |
| Gemini 2.0 | February 2025 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Three models released: Flash, Flash-Lite and Pro.[108][109][110] |
| Mistral Large | November 2024 | Mistral AI | 123 | Unknown | Unknown | Mistral Research License | Upgraded over time. The latest version is 24.11.[111] |
| Pixtral | November 2024 | Mistral AI | 123 | Unknown | Unknown | Mistral Research License | Multimodal. There is also a 12B version under the Apache 2.0 license.[111] |
| Grok 3 | February 2025 | xAI | Unknown | Unknown | Unknown, estimated 5,800,000 | Proprietary | Training cost claimed to be "10x the compute of previous state-of-the-art models".[112] |
| Llama 4 | April 5, 2025 | Meta AI | 400 | 40T tokens | | Llama 4 license | [113][114] |
| Qwen3 | April 2025 | Alibaba Cloud | 235 | 36T tokens | Unknown | Apache 2.0 | Multiple sizes, the smallest being 0.6B.[115] |
| GPT-OSS | August 5, 2025 | OpenAI | 117 | Unknown | Unknown | Apache 2.0 | Released in two model sizes, 20B and 120B.[116] |
| Claude 4.1 | August 5, 2025 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes one model, Opus.[117] |
| GPT-5 | August 7, 2025 | OpenAI | Unknown | Unknown | Unknown | Proprietary | Includes three models: GPT-5, GPT-5 mini, and GPT-5 nano. GPT-5 is available in ChatGPT and its API and includes thinking capability.[118][119] |
| DeepSeek-V3.1 | August 21, 2025 | DeepSeek | 671 | 15.639T | | MIT | Training size: the 14.8T tokens of DeepSeek V3 plus 839B tokens from the extension phases (630B + 209B).[120] A hybrid model that can switch between thinking and non-thinking modes.[121] |
| Apertus | September 2, 2025 | ETH Zurich and EPF Lausanne | 70 | 15 trillion[122] | Unknown | Apache 2.0 | Claimed to be the first LLM compliant with the EU AI Act.[123] |
| Claude 4.5 | September 29, 2025 | Anthropic | Unknown | Unknown | Unknown | Proprietary | [124] |
| DeepSeek-V3.2-Exp | September 29, 2025 | DeepSeek | 685 | | | MIT | This experimental model builds on V3.1-Terminus and uses a custom efficiency mechanism called DeepSeek Sparse Attention (DSA).[125][126][127] |
| GLM-4.6 | September 30, 2025 | Zhipu AI | 357 | | | Apache 2.0 | [128][129][130] |
| Kimi K2 Thinking | November 6, 2025 | Moonshot AI | 1000 | | | MIT | [131][132][133] |
| GPT-5.1 | November 12, 2025 | OpenAI | | | | Proprietary | [134] |
| Grok 4.1 | November 17, 2025 | xAI | | | | Proprietary | [135] |
| Gemini 3 | November 18, 2025 | Google DeepMind | | | | Proprietary | [136] |
| Claude Opus 4.5 | November 25, 2025 | Anthropic | | | | Proprietary | [137] |
| DeepSeek-V3.2 | December 1, 2025 | DeepSeek | 685 | | | MIT | Balances reasoning capability against output length; suited to everyday use cases such as Q&A and general agent tasks.[138][139][140][141] |
| DeepSeek-V3.2-Speciale | December 1, 2025 | DeepSeek | 685 | | | MIT | Pushes the reasoning capability of open-source models to its limit to explore the boundaries of model capability; research use only, and does not support tool calling.[142][143][144][145] |
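Several rows quote raw FLOP totals in their notes. As a rough consistency check (a sketch using only figures already cited in the table and the conversion defined above), converting the Llama 3.1 and DeepSeek LLM totals reproduces the rounded petaFLOP-day values in the training cost column:

```python
# Cross-check: convert FLOP totals quoted in the table notes above
# into petaFLOP-days and compare with the training cost column.
FLOP_PER_PETAFLOP_DAY = 1e15 * 86_400   # 8.64e19 FLOP

reported_flop = {
    "Llama 3.1 405B": 3.8e25,    # "3.8E25 FLOPs" per its row
    "DeepSeek LLM 67B": 1e24,    # "1e24 FLOPs for the 67B model" per its row
}

for name, flop in reported_flop.items():
    print(f"{name}: {flop / FLOP_PER_PETAFLOP_DAY:,.0f} petaFLOP-days")
# Llama 3.1 405B: 439,815 petaFLOP-days   (table lists 440,000)
# DeepSeek LLM 67B: 11,574 petaFLOP-days  (table lists 12,000)
```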
See also
Notes
- ^ This is the date that documentation describing the model's architecture was first released.
- ^ In many cases, researchers release or report multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.
- ^ This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated.
- ^ The smaller models including 66B are publicly available, while the 175B model is available on request.
- ^ Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.
- ^ As stated in Technical report: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method ..."[60]
References
- ^ AI and compute. openai.com. 2022-06-09 [2025-04-24] (美国英语).
- ^ Apache License. TensorFlow. [2025-08-06] –通过GitHub (英语).
- ^ Improving language understanding with unsupervised learning. openai.com. June 11, 2018 [2023-03-18]. (原始内容存档于2023-03-18).
- ^ finetune-transformer-lm. GitHub. [2 January 2024]. (原始内容存档于19 May 2023).
- ^ 5.0 5.1 Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 11 October 2018. arXiv:1810.04805v2
[cs.CL].
- ^ Prickett, Nicole Hemsoth. Cerebras Shifts Architecture To Meet Massive AI/ML Models. The Next Platform. 2021-08-24 [2023-06-20]. (原始内容存档于2023-06-20).
- ^ BERT. March 13, 2023 [March 13, 2023]. (原始内容存档于January 13, 2021) –通过GitHub.
- ^ Manning, Christopher D. Human Language Understanding & Reasoning. Daedalus. 2022, 151 (2): 127–138 [2023-03-09]. S2CID 248377870. doi:10.1162/daed_a_01905
. (原始内容存档于2023-11-17).
- ^ Patel, Ajay; Li, Bryan; Rasooli, Mohammad Sadegh; Constant, Noah; Raffel, Colin; Callison-Burch, Chris. Bidirectional Language Models Are Also Few-shot Learners. 2022. arXiv:2209.14500
[cs.LG].
- ^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 11 October 2018. arXiv:1810.04805v2
[cs.CL].
- ^ 11.0 11.1 Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research. 2020, 21 (140): 1–67 [2025-02-11]. ISSN 1533-7928. arXiv:1910.10683
. (原始内容存档于2024-10-05).
- ^ google-research/text-to-text-transfer-transformer, Google Research, 2024-04-02 [2024-04-04], (原始内容存档于2024-03-29)
- ^ Imagen: Text-to-Image Diffusion Models. imagen.research.google. [2024-04-04]. (原始内容存档于2024-03-27).
- ^ Pretrained models — transformers 2.0.0 documentation. huggingface.co. [2024-08-05]. (原始内容存档于2024-08-05).
- ^ xlnet. GitHub. [2 January 2024]. (原始内容存档于2 January 2024).
- ^ Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. 2 January 2020. arXiv:1906.08237
[cs.CL].
- ^ GPT-2: 1.5B Release. OpenAI. 2019-11-05 [2019-11-14]. (原始内容存档于2019-11-14) (英语).
- ^ Better language models and their implications. openai.com. [2023-03-13]. (原始内容存档于2023-03-16).
- ^ 19.0 19.1 OpenAI's GPT-3 Language Model: A Technical Overview. lambdalabs.com. 3 June 2020 [13 March 2023]. (原始内容存档于27 March 2023).
- ^ 20.0 20.1 openai-community/gpt2-xl · Hugging Face. huggingface.co. [2024-07-24]. (原始内容存档于2024-07-24).
- ^ gpt-2. GitHub. [13 March 2023]. (原始内容存档于11 March 2023).
- ^ Wiggers, Kyle. The emerging types of language models and why they matter. TechCrunch. 28 April 2022 [9 March 2023]. (原始内容存档于16 March 2023).
- ^ Table D.1 in Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario. Language Models are Few-Shot Learners. May 28, 2020. arXiv:2005.14165v4
[cs.CL].
- ^ ChatGPT: Optimizing Language Models for Dialogue. OpenAI. 2022-11-30 [2023-01-13]. (原始内容存档于2022-11-30).
- ^ GPT Neo. March 15, 2023 [March 12, 2023]. (原始内容存档于March 12, 2023) –通过GitHub.
- ^ 26.0 26.1 26.2 Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. 31 December 2020. arXiv:2101.00027
[cs.CL].
- ^ 27.0 27.1 Iyer, Abhishek. GPT-3's free alternative GPT-Neo is something to be excited about. VentureBeat. 15 May 2021 [13 March 2023]. (原始内容存档于9 March 2023).
- ^ GPT-J-6B: An Introduction to the Largest Open Source GPT Model | Forefront. www.forefront.ai. [2023-02-28]. (原始内容存档于2023-03-09).
- ^ 29.0 29.1 29.2 29.3 Dey, Nolan; Gosal, Gurpreet; Zhiming; Chen; Khachane, Hemant; Marshall, William; Pathria, Ribhu; Tom, Marvin; Hestness, Joel. Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster. 2023-04-01. arXiv:2304.03208
[cs.LG].
- ^ Alvi, Ali; Kharya, Paresh. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model. Microsoft Research. 11 October 2021 [13 March 2023]. (原始内容存档于13 March 2023).
- ^ 31.0 31.1 Smith, Shaden; Patwary, Mostofa; Norick, Brandon; LeGresley, Patrick; Rajbhandari, Samyam; Casper, Jared; Liu, Zhun; Prabhumoye, Shrimai; Zerveas, George; Korthikanti, Vijay; Zhang, Elton; Child, Rewon; Aminabadi, Reza Yazdani; Bernauer, Julie; Song, Xia. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. 2022-02-04. arXiv:2201.11990
[cs.CL].
- ^ 32.0 32.1 Rajbhandari, Samyam; Li, Conglong; Yao, Zhewei; Zhang, Minjia; Aminabadi, Reza Yazdani; Awan, Ammar Ahmad; Rasley, Jeff; He, Yuxiong, DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, 2022-07-21, arXiv:2201.05596
- ^ Wang, Shuohuan; Sun, Yu; Xiang, Yang; Wu, Zhihua; Ding, Siyu; Gong, Weibao; Feng, Shikun; Shang, Junyuan; Zhao, Yanbin; Pang, Chao; Liu, Jiaxiang; Chen, Xuyi; Lu, Yuxiang; Liu, Weixin; Wang, Xi; Bai, Yangfan; Chen, Qiuliang; Zhao, Li; Li, Shiyong; Sun, Peng; Yu, Dianhai; Ma, Yanjun; Tian, Hao; Wu, Hua; Wu, Tian; Zeng, Wei; Li, Ge; Gao, Wen; Wang, Haifeng. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation. December 23, 2021. arXiv:2112.12731
[cs.CL].
- ^ Product. Anthropic. [14 March 2023]. (原始内容存档于16 March 2023).
- ^ 35.0 35.1 Askell, Amanda; Bai, Yuntao; Chen, Anna; et al. A General Language Assistant as a Laboratory for Alignment. 9 December 2021. arXiv:2112.00861
[cs.CL].
- ^ Bai, Yuntao; Kadavath, Saurav; Kundu, Sandipan; et al. Constitutional AI: Harmlessness from AI Feedback. 15 December 2022. arXiv:2212.08073
[cs.CL].
- ^ 37.0 37.1 37.2 Dai, Andrew M; Du, Nan. More Efficient In-Context Learning with GLaM. ai.googleblog.com. December 9, 2021 [2023-03-09]. (原始内容存档于2023-03-12).
- ^ Language modelling at scale: Gopher, ethical considerations, and retrieval. www.deepmind.com. 8 December 2021 [20 March 2023]. (原始内容存档于20 March 2023).
- ^ 39.0 39.1 39.2 Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; et al. Training Compute-Optimal Large Language Models. 29 March 2022. arXiv:2203.15556
[cs.CL].
- ^ 40.0 40.1 40.2 40.3 Table 20 and page 66 of PaLM: Scaling Language Modeling with Pathways 互联网档案馆的存檔,存档日期2023-06-10.
- ^ 41.0 41.1 Cheng, Heng-Tze; Thoppilan, Romal. LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything. ai.googleblog.com. January 21, 2022 [2023-03-09]. (原始内容存档于2022-03-25).
- ^ Thoppilan, Romal; De Freitas, Daniel; Hall, Jamie; Shazeer, Noam; Kulshreshtha, Apoorv; Cheng, Heng-Tze; Jin, Alicia; Bos, Taylor; Baker, Leslie; Du, Yu; Li, YaGuang; Lee, Hongrae; Zheng, Huaixiu Steven; Ghafouri, Amin; Menegali, Marcelo. LaMDA: Language Models for Dialog Applications. 2022-01-01. arXiv:2201.08239
[cs.CL].
- ^ Black, Sidney; Biderman, Stella; Hallahan, Eric; et al. GPT-NeoX-20B: An Open-Source Autoregressive Language Model. Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models. Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models: 95–136. 2022-05-01 [2022-12-19]. (原始内容存档于2022-12-10).
- ^ 44.0 44.1 44.2 Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Sifre, Laurent. An empirical analysis of compute-optimal large language model training. Deepmind Blog. 12 April 2022 [9 March 2023]. (原始内容存档于13 April 2022).
- ^ Narang, Sharan; Chowdhery, Aakanksha. Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance. ai.googleblog.com. April 4, 2022 [2023-03-09]. (原始内容存档于2022-04-04) (英语).
- ^ Susan Zhang; Mona Diab; Luke Zettlemoyer. Democratizing access to large-scale language models with OPT-175B. ai.facebook.com. [2023-03-12]. (原始内容存档于2023-03-12).
- ^ Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan, Christopher; Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt; Simig, Daniel; Koura, Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke. OPT: Open Pre-trained Transformer Language Models. 21 June 2022. arXiv:2205.01068
[cs.CL].
- ^ metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq. GitHub. [2024-10-18]. (原始内容存档于2024-01-24) (英语).
- ^ 49.0 49.1 Khrushchev, Mikhail; Vasilev, Ruslan; Petrov, Alexey; Zinov, Nikolay, YaLM 100B, 2022-06-22 [2023-03-18], (原始内容存档于2023-06-16)
- ^ 50.0 50.1 Lewkowycz, Aitor; Andreassen, Anders; Dohan, David; Dyer, Ethan; Michalewski, Henryk; Ramasesh, Vinay; Slone, Ambrose; Anil, Cem; Schlag, Imanol; Gutman-Solo, Theo; Wu, Yuhuai; Neyshabur, Behnam; Gur-Ari, Guy; Misra, Vedant. Solving Quantitative Reasoning Problems with Language Models. 30 June 2022. arXiv:2206.14858
[cs.CL].
- ^ Minerva: Solving Quantitative Reasoning Problems with Language Models. ai.googleblog.com. 30 June 2022 [20 March 2023]. (原始内容存档于2022-06-30).
- ^ Ananthaswamy, Anil. In AI, is bigger always better?. Nature. 8 March 2023, 615 (7951): 202–205 [9 March 2023]. Bibcode:2023Natur.615..202A. PMID 36890378. S2CID 257380916. doi:10.1038/d41586-023-00641-w. (原始内容存档于16 March 2023).
- ^ bigscience/bloom · Hugging Face. huggingface.co. [2023-03-13]. (原始内容存档于2023-04-12).
- ^ Taylor, Ross; Kardas, Marcin; Cucurull, Guillem; Scialom, Thomas; Hartshorn, Anthony; Saravia, Elvis; Poulton, Andrew; Kerkez, Viktor; Stojnic, Robert. Galactica: A Large Language Model for Science. 16 November 2022. arXiv:2211.09085
[cs.CL].
- ^ 20B-parameter Alexa model sets new marks in few-shot learning. Amazon Science. 2 August 2022 [12 March 2023]. (原始内容存档于15 March 2023).
- ^ Soltan, Saleh; Ananthakrishnan, Shankar; FitzGerald, Jack; et al. AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model. 3 August 2022. arXiv:2208.01448
[cs.CL].
- ^ AlexaTM 20B is now available in Amazon SageMaker JumpStart | AWS Machine Learning Blog. aws.amazon.com. 17 November 2022 [13 March 2023]. (原始内容存档于13 March 2023).
- ^ 58.0 58.1 58.2 Introducing LLaMA: A foundational, 65-billion-parameter large language model. Meta AI. 24 February 2023 [9 March 2023]. (原始内容存档于3 March 2023).
- ^ 59.0 59.1 59.2 The Falcon has landed in the Hugging Face ecosystem. huggingface.co. [2023-06-20]. (原始内容存档于2023-06-20).
- ^ GPT-4 Technical Report (PDF). OpenAI. 2023 [March 14, 2023]. (原始内容存档 (PDF)于March 14, 2023).
- ^ Schreiner, Maximilian. GPT-4 architecture, datasets, costs and more leaked. THE DECODER. 2023-07-11 [2024-07-26]. (原始内容存档于2023-07-12) (美国英语).
- ^ Dickson, Ben. Meta introduces Chameleon, a state-of-the-art multimodal model. VentureBeat. 22 May 2024 [2025-02-11]. (原始内容存档于2025-02-11).
- ^ Dey, Nolan. Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models. Cerebras. March 28, 2023 [March 28, 2023]. (原始内容存档于March 28, 2023).
- ^ Abu Dhabi-based TII launches its own version of ChatGPT. tii.ae. [2023-04-03]. (原始内容存档于2023-04-03).
- ^ Penedo, Guilherme; Malartic, Quentin; Hesslow, Daniel; Cojocaru, Ruxandra; Cappelli, Alessandro; Alobeidli, Hamza; Pannier, Baptiste; Almazrouei, Ebtesam; Launay, Julien. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. 2023-06-01. arXiv:2306.01116
[cs.CL].
- ^ tiiuae/falcon-40b · Hugging Face. huggingface.co. 2023-06-09 [2023-06-20]. (原始内容存档于2023-06-02).
- ^ UAE's Falcon 40B, World's Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free 互联网档案馆的存檔,存档日期2024-02-08., 31 May 2023
- ^ Wu, Shijie; Irsoy, Ozan; Lu, Steven; Dabravolski, Vadim; Dredze, Mark; Gehrmann, Sebastian; Kambadur, Prabhanjan; Rosenberg, David; Mann, Gideon. BloombergGPT: A Large Language Model for Finance. March 30, 2023. arXiv:2303.17564
[cs.LG].
- ^ Ren, Xiaozhe; Zhou, Pingyi; Meng, Xinfan; Huang, Xinjing; Wang, Yadao; Wang, Weichao; Li, Pengfei; Zhang, Xiaoda; Podolskiy, Alexander; Arshinov, Grigory; Bout, Andrey; Piontkovskaya, Irina; Wei, Jiansheng; Jiang, Xin; Su, Teng; Liu, Qun; Yao, Jun. PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing. March 19, 2023. arXiv:2303.10845
[cs.CL].
- ^ Köpf, Andreas; Kilcher, Yannic; von Rütte, Dimitri; Anagnostidis, Sotiris; Tam, Zhi-Rui; Stevens, Keith; Barhoum, Abdullah; Duc, Nguyen Minh; Stanley, Oliver; Nagyfi, Richárd; ES, Shahul; Suri, Sameer; Glushkov, David; Dantuluri, Arnav; Maguire, Andrew. OpenAssistant Conversations – Democratizing Large Language Model Alignment. 2023-04-14. arXiv:2304.07327
[cs.CL].
- ^ Wrobel, Sharon. Tel Aviv startup rolls out new advanced AI language model to rival OpenAI. www.timesofisrael.com. [2023-07-24]. (原始内容存档于2023-07-24).
- ^ Wiggers, Kyle. With Bedrock, Amazon enters the generative AI race. TechCrunch. 2023-04-13 [2023-07-24]. (原始内容存档于2023-07-24).
- ^ 73.0 73.1 Elias, Jennifer. Google's newest A.I. model uses nearly five times more text data for training than its predecessor. CNBC. 16 May 2023 [18 May 2023]. (原始内容存档于16 May 2023).
- ^ Introducing PaLM 2. Google. May 10, 2023 [May 18, 2023]. (原始内容存档于May 18, 2023).
- ^ 75.0 75.1 Introducing Llama 2: The Next Generation of Our Open Source Large Language Model. Meta AI. 2023 [2023-07-19]. (原始内容存档于2024-01-05).
- ^ llama/MODEL_CARD.md at main · meta-llama/llama. GitHub. [2024-05-28]. (原始内容存档于2024-05-28).
- ^ Claude 2. anthropic.com. [12 December 2023]. (原始内容存档于15 December 2023).
- ^ Nirmal, Dinesh. Building AI for business: IBM's Granite foundation models. IBM Blog. 2023-09-07 [2024-08-11]. (原始内容存档于2024-07-22) (美国英语).
- ^ Announcing Mistral 7B. Mistral. 2023 [2023-10-06]. (原始内容存档于2024-01-06).
- ^ Introducing Claude 2.1. anthropic.com. [12 December 2023]. (原始内容存档于15 December 2023).
- ^ xai-org/grok-1, xai-org, 2024-03-19 [2024-03-19], (原始内容存档于2024-05-28)
- ^ Grok-1 model card. x.ai. [12 December 2023]. (原始内容存档于2023-11-05).
- ^ Gemini – Google DeepMind. deepmind.google. [12 December 2023]. (原始内容存档于8 December 2023).
- ^ Franzen, Carl. Mistral shocks AI community as latest open source model eclipses GPT-3.5 performance. VentureBeat. 11 December 2023 [12 December 2023]. (原始内容存档于11 December 2023).
- ^ Mixtral of experts. mistral.ai. 11 December 2023 [12 December 2023]. (原始内容存档于13 February 2024).
- ^ AI, Mistral. Cheaper, Better, Faster, Stronger. mistral.ai. 2024-04-17 [2024-05-05]. (原始内容存档于2024-05-05).
- ^ 87.0 87.1 DeepSeek-AI; Bi, Xiao; Chen, Deli; Chen, Guanting; Chen, Shanhuang; Dai, Damai; Deng, Chengqi; Ding, Honghui; Dong, Kai, DeepSeek LLM: Scaling Open-Source Language Models with Longtermism, 2024-01-05 [2025-02-11], arXiv:2401.02954
, (原始内容存档于2025-03-29)
- ^ 88.0 88.1 Hughes, Alyssa. Phi-2: The surprising power of small language models. Microsoft Research. 12 December 2023 [13 December 2023]. (原始内容存档于12 December 2023).
- ^ Our next-generation model: Gemini 1.5. Google. 15 February 2024 [16 February 2024]. (原始内容存档于16 February 2024).
This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens.
- ^ Gemma. [2025-02-11]. (原始内容存档于2024-02-21) –通过GitHub.
- ^ Introducing the next generation of Claude. www.anthropic.com. [2024-03-04]. (原始内容存档于2024-03-04).
- ^ Fugaku-LLM/Fugaku-LLM-13B · Hugging Face. huggingface.co. [2024-05-17]. (原始内容存档于2024-05-17).
- ^ Phi-3. azure.microsoft.com. 23 April 2024 [2024-04-28]. (原始内容存档于2024-04-27).
- ^ Phi-3 Model Documentation. huggingface.co. [2024-04-28]. (原始内容存档于2024-05-13).
- ^ Qwen2. GitHub. [2024-06-17]. (原始内容存档于2024-06-17).
- ^ DeepSeek-AI; Liu, Aixin; Feng, Bei; Wang, Bin; Wang, Bingxuan; Liu, Bo; Zhao, Chenggang; Dengr, Chengqi; Ruan, Chong, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, 2024-06-19 [2025-02-11], arXiv:2405.04434
, (原始内容存档于2025-03-30)
- ^ nvidia/Nemotron-4-340B-Base · Hugging Face. huggingface.co. 2024-06-14 [2024-06-15]. (原始内容存档于2024-06-15).
- ^ Nemotron-4 340B | Research. research.nvidia.com. [2024-06-15]. (原始内容存档于2024-06-15).
- ^ "The Llama 3 Herd of Models" (July 23, 2024) Llama Team, AI @ Meta. [2025-02-11]. (原始内容存档于2024-07-24).
- ^ llama-models/models/llama3_1/MODEL_CARD.md at main · meta-llama/llama-models. GitHub. [2024-07-23]. (原始内容存档于2024-07-23) (英语).
- ^ deepseek-ai/DeepSeek-V3, DeepSeek, 2024-12-26 [2024-12-26], (原始内容存档于2025-03-27)
- ^ Amazon Nova Micro, Lite, and Pro - AWS AI Service Cards3, Amazon, 2024-12-27 [2024-12-27], (原始内容存档于2025-02-11)
- ^ deepseek-ai/DeepSeek-R1, DeepSeek, 2025-01-21 [2025-01-21], (原始内容存档于2025-02-04)
- ^ DeepSeek-AI; Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025-01-22 [2025-02-11], arXiv:2501.12948
, (原始内容存档于2025-04-09)
- ^ Qwen; Yang, An; Yang, Baosong; Zhang, Beichen; Hui, Binyuan; Zheng, Bo; Yu, Bowen; Li, Chengyuan; Liu, Dayiheng, Qwen2.5 Technical Report, 2025-01-03 [2025-02-11], arXiv:2412.15115
, (原始内容存档于2025-04-01)
- ^ 106.0 106.1 MiniMax; Li, Aonian; Gong, Bangwei; Yang, Bo; Shan, Boji; Liu, Chang; Zhu, Cheng; Zhang, Chunhao; Guo, Congchao, MiniMax-01: Scaling Foundation Models with Lightning Attention, 2025-01-14 [2025-01-26], arXiv:2501.08313
, (原始内容存档于2025-03-22)
- ^ MiniMax-AI/MiniMax-01, MiniMax, 2025-01-26 [2025-01-26]
- ^ Kavukcuoglu, Koray. Gemini 2.0 is now available to everyone. Google. [6 February 2025]. (原始内容存档于2025-04-10).
- ^ Gemini 2.0: Flash, Flash-Lite and Pro. Google for Developers. [6 February 2025]. (原始内容存档于2025-04-10).
- ^ Franzen, Carl. Google launches Gemini 2.0 Pro, Flash-Lite and connects reasoning model Flash Thinking to YouTube, Maps and Search. VentureBeat. 5 February 2025 [6 February 2025]. (原始内容存档于2025-03-17).
- ^ 111.0 111.1 Models Overview. mistral.ai. [2025-03-03].
- ^ Grok 3 Beta — The Age of Reasoning Agents. x.ai. [2025-02-22] (英语).
- ^ meta-llama/Llama-4-Maverick-17B-128E · Hugging Face. huggingface.co. 2025-04-05 [2025-04-06].
- ^ The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. ai.meta.com. [2025-04-05]. (原始内容存档于2025-04-05) (英语).
- ^ Team, Qwen. Qwen3: Think Deeper, Act Faster. Qwen. 2025-04-29 [2025-04-29] (英语).
- ^ Whitwam, Ryan. OpenAI announces two "gpt-oss" open AI models, and you can download them today. Ars Technica. 2025-08-05 [2025-08-06] (英语).
- ^ Claude Opus 4.1. www.anthropic.com. [8 August 2025] (英语).
- ^ Introducing GPT-5. openai.com. 7 August 2025 [8 August 2025].
- ^ OpenAI Platform: GPT-5 Model Documentation. openai.com. [18 August 2025].
- ^ deepseek-ai/DeepSeek-V3.1 · Hugging Face. huggingface.co. 2025-08-21 [2025-08-25].
- ^ DeepSeek-V3.1 Release | DeepSeek API Docs. api-docs.deepseek.com. [2025-08-25] (英语).
- ^ Apertus: Ein vollständig offenes, transparentes und mehrsprachiges Sprachmodell. Zürich: ETH Zürich. 2025-09-02 [2025-11-07] (德语).
- ^ Kirchner, Malte. Apertus: Schweiz stellt erstes offenes und mehrsprachiges KI-Modell vor. heise online. 2025-09-02 [2025-11-07] (德语).
- ^ Introducing Claude Sonnet 4.5. www.anthropic.com. [29 September 2025] (英语).
- ^ Introducing DeepSeek-V3.2-Exp | DeepSeek API Docs. api-docs.deepseek.com. [2025-10-01] (英语).
- ^ deepseek-ai/DeepSeek-V3.2-Exp · Hugging Face. huggingface.co. 2025-09-29 [2025-10-01].
- ^ DeepSeek-V3.2-Exp/DeepSeek_V3_2.pdf at main · deepseek-ai/DeepSeek-V3.2-Exp (PDF). GitHub. [2025-10-01] (英语).
- ^ GLM-4.6: Advanced Agentic, Reasoning and Coding Capabilities. z.ai. [2025-10-01] (英语).
- ^ zai-org/GLM-4.6 · Hugging Face. huggingface.co. 2025-09-30 [2025-10-01].
- ^ GLM-4.6. modelscope.cn. [2025-10-01].
- ^ Kimi K2 Thinking. moonshotai.github.io. [2025-11-06] (英语).
- ^ moonshotai/Kimi-K2-Thinking · Hugging Face. huggingface.co. 2025-11-06 [2025-11-06].
- ^ Kimi-K2-Thinking. modelscope.cn. [2025-11-09].
- ^ GPT-5.1 全新上线:更智能、更具对话感的 ChatGPT. openai.com. [2025-11-12] (中文).
- ^ Grok 4.1. x.ai. [2025-11-17] (英语).
- ^ Gemini 3: Introducing the latest Gemini AI model from Google. blog.google. [2025-11-18] (中文).
- ^ Introducing Claude Opus 4.5. anthropic.com. [2025-11-25] (英语).
- ^ DeepSeek-V3.2 Release. api-docs.deepseek.com. [2025-12-01] (英语).
- ^ DeepSeek V3.2 正式版:强化 Agent 能力,融入思考推理. mp.weixin.qq.com. [2025-12-01] (中文).
- ^ deepseek-ai/DeepSeek-V3.2 · Hugging Face. huggingface.co. 2025-12-01 [2025-12-01].
- ^ DeepSeek-V3.2. modelscope.cn. [2025-12-01].
- ^ DeepSeek-V3.2 Release. api-docs.deepseek.com. [2025-12-01] (英语).
- ^ DeepSeek V3.2 正式版:强化 Agent 能力,融入思考推理. mp.weixin.qq.com. [2025-12-01] (中文).
- ^ deepseek-ai/DeepSeek-V3.2-Speciale · Hugging Face. huggingface.co. 2025-12-01 [2025-12-01].
- ^ DeepSeek-V3.2-Speciale. modelscope.cn. [2025-12-01].