llama.cpp

GGUF
Filename extension	.gguf
Developed by	Georgi Gerganov and community
Initial release	August 22, 2023
Latest release	v3
Type of format	Machine-learning tensor

llama.cpp
Original author	Georgi Gerganov
Developers	Georgi Gerganov and community
Initial release	March 10, 2023
Repository	github.com/ggml-org/llama.cpp
Written in	C++, C
Type	Library for large language models
License	MIT License

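llama.cpp is an open-source library for running inference on a variety of large language models, such as LLaMA.^[3] The library also includes command-line tools^[4] and a server with a simple web interface.^[5]^[6]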

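Background

In late September 2022, Georgi Gerganov began developing GGML, a C library that implements tensor algebra. His goals for GGML were strict memory management and multithreading. The creation of GGML was inspired by Fabrice Bellard's work on LibNC.^[7]

Before llama.cpp, Gerganov had developed a similar library, whisper.cpp, which implements Whisper, OpenAI's speech-to-text model.^[8]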

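Development

Georgi Gerganov began developing llama.cpp in March 2023 as a pure C/C++ implementation of the LLaMA inference code with no external dependencies. It improved performance on computers without a GPU or other dedicated hardware, which was one of the project's goals.^[3]^[9]^[10] Because it can run on a CPU alone (and even on Android devices), llama.cpp became popular with users who lack specialized hardware.^[9]^[11] Although the project was initially designed for CPUs, GPU inference support was added later.^[12]

In March 2024, Justine Tunney contributed new optimized matrix multiplication kernels for x86 and ARM CPUs, improving prompt evaluation performance for FP16 and 8-bit quantized data types.^[13] Tunney also created llamafile, a tool that bundles a model and llama.cpp into a single file that runs on multiple operating systems by way of the Cosmopolitan Libc library, which Tunney also developed.^[13]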

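Architecture

llama.cpp supports multiple hardware targets, including x86, ARM, CUDA, Metal, Vulkan (version 1.2 or later), and SYCL.^[14]^[15]^[16]^[17] These backends make up the GGML tensor library, which is used by the model-specific code in llama.cpp.^[18] llama.cpp quantizes models ahead of time rather than on the fly.^[19] It also uses several CPU instruction set extensions to optimize performance: AVX, AVX2, and AVX-512 on x86-64, and Neon on ARM. Apple silicon is also an important target for the project.^[20]^[21]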

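GGUF file format

GGUF (GGML Universal File)^[24] is a binary file format that stores both tensors and metadata in a single file and is designed for fast saving and loading of model data.^[25] The llama.cpp project adopted the format in August 2023, and it maintains backward compatibility as support for other model architectures is added.^[12]^[26]

GGUF files are typically created by converting models developed with another machine learning library, such as PyTorch.^[25]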
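To make the "tensors and metadata in one binary file" idea concrete, the following is a minimal Python sketch that reads only the fixed-size header at the start of a GGUF file. It assumes the version 2 and later layout described in the GGUF specification: the four magic bytes `GGUF`, a little-endian 32-bit version number, then 64-bit counts of tensors and of metadata key-value pairs. The function name `read_gguf_header` and the sample bytes are illustrative and are not part of llama.cpp or its `gguf` Python package.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the four magic bytes at the start of the file
HEADER_SIZE = 24      # 4 (magic) + 4 (version) + 8 + 8 (two counts)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size header of a GGUF file (assumed v2 or later)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("too short to contain a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("missing GGUF magic bytes")
    # "<IQQ": little-endian uint32 version, then two uint64 counts.
    version, tensor_count, metadata_kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Synthetic header for demonstration only; not taken from a real model file.
sample = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 2, 5)
header = read_gguf_header(sample)
print(header)  # {'version': 3, 'tensor_count': 2, 'metadata_kv_count': 5}
```

For a real file, the first 24 bytes can be read with `open(path, "rb").read(24)` and passed to the same function. The metadata key-value pairs and tensor descriptions that follow the header are variable-length and are not handled by this sketch.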

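Design

The format focuses on quantization, that is, reducing the precision of the model weights. Quantization lowers memory usage and increases speed, at the cost of reduced model accuracy.^[27]^[26]

GGUF supports 2-bit to 8-bit quantized integer types,^[28] common floating-point data formats such as float32, float16, and bfloat16, and 1.56-bit quantization.^[4]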
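The trade-off between memory and accuracy can be illustrated with a small Python sketch of block-wise symmetric quantization: each block of weights shares one scale factor, and each weight is stored as a small signed integer. This is a simplified illustration, not the exact storage layout of any GGUF quantization type. The function names, the block size of 32, and the 16-bit scale used in the size estimate are assumptions chosen for the example.

```python
def quantize_block(values, bits=8):
    """Quantize one block of floats to signed integers sharing one scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(v) for v in values)         # largest magnitude in the block
    scale = amax / qmax if amax > 0 else 1.0   # avoid dividing by zero
    # Round to the nearest integer level and clamp to the representable range.
    quants = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate float values from the integers and the scale."""
    return [scale * q for q in quants]


def bits_per_weight(block_size, bits, scale_bits=16):
    """Average storage per weight, counting the per-block scale (assumed 16-bit)."""
    return (block_size * bits + scale_bits) / block_size


weights = [0.5, -1.0, 0.25, 0.0]
scale, quants = quantize_block(weights, bits=8)
restored = dequantize_block(scale, quants)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quants)                  # integer levels in the range -127..127
print(max_error)               # rounding error, at most about scale / 2
print(bits_per_weight(32, 8))  # 8.5
print(bits_per_weight(32, 4))  # 4.5
```

Under these assumptions a float32 weight costs 32 bits, while an 8-bit block costs 8.5 bits per weight and a 4-bit block costs 4.5. Fewer bits mean a coarser grid of representable values, which is the source of the accuracy loss described above.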

References

^[1] Initial release · ggerganov/llama.cpp@26c0846. GitHub. Retrieved 2025-07-12.
^[2] llama.cpp/LICENSE at master · ggerganov/llama.cpp. GitHub.
^[3] Connatser, Matthew. How this open source LLM chatbot runner hit the gas on x86, Arm CPUs. theregister.com. Retrieved 2025-07-12. Archived from the original on 2024-05-10.
^[4] Mann, Tobias. Honey, I shrunk the LLM! A beginner's guide to quantization – and testing it. theregister. 2024-07-14. Retrieved 2025-07-12. Archived from the original on 2025-07-06.
^[5] Alden, Daroc. Portable LLMs with llamafile [LWN.net]. lwn.net. Retrieved 2024-07-30. Archived from the original on 2025-03-06.
^[6] Mann, Tobias. Intro to speculative decoding: Cheat codes for faster LLMs. theregister. 2024-12-15.
^[7] Bringing Whisper and LLaMA to the masses with Georgi Gerganov (Changelog Interviews #532). Changelog. 2023-03-22. Retrieved 2025-07-12. Archived from the original on 2025-07-08.
^[8] ggerganov/whisper.cpp. GitHub. Retrieved 2025-07-12. Archived from the original on 2025-04-03.
^[9] Edwards, Benj. You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi. arstechnica.com. 2023-03-13. Retrieved 2025-07-12. Archived from the original on 2024-01-09.
^[10] Wiest, Isabella Catharina; Ferber, Dyke; Zhu, Jiefu; van Treeck, Marko; Meyer, Sonja K.; Juglan, Radhika; Carrero, Zunamys I.; Paech, Daniel; Kleesiek, Jens; Ebert, Matthias P.; Truhn, Daniel; Kather, Jakob Nikolas. Privacy-preserving large language models for structured medical information retrieval. npj Digital Medicine. 2024, 7 (257): 257. PMC 11415382. PMID 39304709. doi:10.1038/s41746-024-01233-2.
^[11] Democratizing AI with open-source language models. lwn.net. Retrieved 2025-07-12. Archived from the original on 2024-07-28.
^[12] Rajput, Saurabhsingh; Sharma, Tushar. Benchmarking Emerging Deep Learning Quantization Methods for Energy Efficiency. 2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C). 2024-06-04: 238–242. ISBN 979-8-3503-6625-9. doi:10.1109/ICSA-C63560.2024.00049.
^[13] Connatser, Matthew. Llamafile LLM driver project boosts performance on CPU cores. www.theregister.com. Retrieved 2024-05-10. Archived from the original on 2024-05-10.
^[14] Gerganov, Georgi; Nguyen, Xuan Son; Slaren. Introduction to ggml. Huggingface. 2024-08-13. Retrieved 2025-07-12. Archived from the original on 2025-06-03.
^[15] Kluska, Piotr; Castelló, Adrián; Scheidegger, Florian; Malossi, A. Cristiano I.; Quintana-Ortí, Enrique. QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers (PDF). Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. June 2024. Retrieved 2025-07-12. Archived (PDF) from the original on 2024-11-10.
^[16] Jianyu, Zhang; Hengyu, Meng; Ying, Hu; Yu, Luo; Xiaoping, Duan; Majumder, Abhilash. Run LLMs on Intel GPUs Using llama.cpp. The Parallel Universe. No. 57 (Intel). 2024-07: 34–37.
^[17] Bolz, Jeff. Machine Learning in Vulkan with Cooperative Matrix 2 (PDF). Cambridge, UK: The Khronos Group/Nvidia. 2025-02-11. Retrieved 2025-07-12. Archived (PDF) from the original on 2025-04-17.
^[18] Pounder, Les. How To Create Your Own AI Chatbot Server With Raspberry Pi 4. tomshardware.com. 2023-03-25. Retrieved 2025-07-12. Archived from the original on 2023-08-15.
^[19] Walkowiak, Bartosz; Walkowiak, Tomasz. Implementation of language models within an infrastructure designed for Natural Language Processing (PDF). International Journal of Electronics and Telecommunications. 2024, 70 (1): 153–159. Retrieved 2025-07-12. doi:10.24425/ijet.2024.149525.
^[20] ggerganov/llama.cpp. GitHub.
^[21] Larabel, Michael. Llamafile 0.7 Brings AVX-512 Support: 10x Faster Prompt Eval Times For AMD Zen 4. www.phoronix.com. 2024-03-31. Retrieved 2025-07-12. Archived from the original on 2025-03-13.
^[22] GGUF by ggerganov · Pull Request #2398 · ggerganov/llama.cpp. GitHub.
^[23] ggml/docs/gguf.md at master · ggerganov/ggml. GitHub. Retrieved 2025-07-12. Archived from the original on 2025-01-31.
^[24] ggerganov/llama.cpp/gguf-py/README.md. GitHub. Retrieved 2025-07-12.
^[25] GGUF. huggingface.co. Retrieved 2025-07-12.
^[26] Mucci, Tim. GGUF versus GGML. www.ibm.com. 2024-07-03. Retrieved 2025-07-12. Archived from the original on 2025-06-04.
^[27] Labonne, Maxime. Quantize Llama models with GGUF and llama.cpp. Medium. Towards Data Science. 2023-11-29. Retrieved 2024-05-09. Archived from the original on 2024-05-09.
^[28] Cabezas, Darío; Fonseca-Delgado, Rigoberto; Reyes-Chacón, Iván; Vizcaino-Imacaña, Paulina; Morocho-Cayamcela, Manuel. Integrating a LLaMa-based Chatbot with Augmented Retrieval Generation as a Complementary Educational Tool for High School and College Students. Proceedings of the 19th International Conference on Software Technologies. 2024: 395–402. ISBN 978-989-758-706-1. doi:10.5220/0012763000003753.
