LEB128

LEB128（Little Endian Base 128）是一種支持將任意大的整數編碼到變長字節流的可變長度編碼。DWARF調試文件格式^[1]^[2]以及WebAssembly中所有整數字面量的二進制編碼都使用了 LEB128。^[3]

編碼格式

LEB128格式與可變長度數量(VLQ) 格式非常相似；兩者的主要區別在於，LEB128採用小端序，而可變長度數量採用大端序。兩者都允許將小數字存儲在單個字節中，同時還允許對任意長度的數字進行編碼。LEB128有兩個版本：無符號LEB128和有符號LEB128。解碼器必須知道編碼值是無符號 LEB128還是有符號LEB128才能正確解碼。

無符號LEB128

要使用無符號LEB128（ULEB128）編碼無符號數，首先要用二進制表示該數，然後用零將該數的二進制表示擴展至長度為7位的倍數（這樣，如果數非零，則最高7位不全為0），將該數分成7位一組。每一個7位的組對應輸出的一個字節，順序從最低有效位到最高有效位，除了最後一個字節外，其他字節都置位最高位。WebAssembly 允許對零進行其他編碼（0x80 0x00、0x80 0x80 0x00，...）。 ^[3]

例如，無符號數 624485 的編碼方式如下：

最高有效位 ------------------ 最低有效位
      10011000011101100101 原始二进制
     010011000011101100101 填充为7位的倍数
   0100110 0001110 1100101 分成7位一组
00100110 10001110 11100101 在除最后一个（最高有效位）组之外的所有组上添加高1位以形成字节
    0x26     0x8E     0xE5 十六进制

→ 0xE5 0x8E 0x26           输出流（LSB 到 MSB）

有符號LEB128

有符號數的表示方式類似：以 $N$ 位二進制補碼表示需編碼的數，其中 $N$ 是7的倍數，然後按照無符號數的方式編碼。

例如有符號數-123456的編碼方式如下：

最高有效位 ------------------ 最低有效位
         11110001001000000 123456 的二进制编码
     000011110001001000000 填充到21位
     111100001110110111111 所有位取反（得到-123456在21位下的反码）
     111100001110111000000 加一（得到-123456在21位下的补码）
   1111000 0111011 1000000 分成7位一组
01111000 10111011 11000000 在除最后一个（最高有效位）组之外的所有组上添加高1位以形成字节
    0x78     0xBB     0xC0 十六进制

→ 0xC0 0xBB 0x78           输出流（LSB 到 MSB）

快速解碼

直接使用純量實現 LEB128 解碼速度相當慢，在分支預測錯誤成本相對較高的現代硬體上更是如此。一系列論文提出了用於加速解碼的 SIMD 技術（在這些論文中，它被稱為 VByte，但實際上是同一種編碼的另一個名稱）。「矢量化 VByte 解碼」論文^[4]提出了「Masked VByte」，它在商用Haswell硬體上實現了每秒 6.5 億到 27 億個整數的速度，具體取決於編碼密度。

後續論文提出了一種變體編碼「Stream VByte：更快的面向字節的整數壓縮」 ^[5] ，其速度提升至每秒超過 40 億個整數。這種流編碼將控制流與編碼數據分離，因此與 LEB128 二進制不兼容。

類 C 偽代碼

編碼無符號整數

do {
  byte = value & 0x7f; /* low-order 7 bits of value */
  value >>= 7;
  if (value != 0) /* more bytes to come */
    byte |= 0x80; /* set high-order bit of byte */
  emit(byte);
} while (value != 0);

編碼有符號整數

more = 1;
negative = (value < 0);

/* the size in bits of the variable "value", e.g., 64 if value's type is int64_t */
size = sizeof(value) * CHAR_BITS; /* no. of bits in signed integer */

while (more) {
 byte = value & 0x7f; /* low-order 7 bits of value */
 value >>= 7;
 /* the following is only necessary if the implementation of >>= uses a
   logical shift rather than an arithmetic shift for a signed left operand
   this does not happen on most programming languages if "value" is in a signed type to begin with */
 if (negative)
  value |= (~0 << (size - 7)); /* sign extend */

 /* sign bit of byte is second high-order bit (0x40) */
 sign_bit = value & 0x40;
 if ((value == 0 && sign_bit == 0) || (value == -1 && sign_bit != 0))
  more = 0;
 else
  byte |= 0x80; /* set high-order bit of byte */
 emit(byte);
}

解碼無符號整數

result = 0; 
shift = 0;
unsigned char byte;
do {
 byte = get_next_byte_in_input();
 result |= (byte & 0x7f << shift); /* low-order 7 bits of value */
 shift += 7;
} while (byte & 0x80 == 0); /* get high-order bit of byte */

解碼有符號整數

result = 0; 
shift = 0;

/* the size in bits of the result variable, e.g., 64 if result's type is int64_t */
size = sizeof(result) * CHAR_BITS; /* no. of bits in signed integer */

/* will be assigned inside the do-while loop, but referenced afterwards */
unsigned char byte;

do {
 byte = get_next_byte_in_input();
 result |= (byte & 0x7f << shift); /* low-order 7 bits of value */
 shift += 7;
} while (byte & 0x80 == 0); /* get high-order bit of byte */

/* sign bit of byte is second high-order bit (0x40) */
if ((shift < size) && (byte & 0x40 != 0))
 /* sign extend */
 result |= (~0 << shift);

JavaScript 代碼

編碼有符號BigInt

const encodeSignedLeb128FromBigInt = (value) => { 
 value = BigInt(value);
 const result = [];
 while (true) {
  const byte_ = Number(value & 0x7fn);
  value >>= 7n;
  if (
   (value === 0n && (byte_ & 0x40) === 0) ||
   (value === -1n && (byte_ & 0x40) !== 0)
  ) {
   result.push(byte_);
   return result;
  }
  result.push(byte_ | 0x80);
 }
};

解碼有符號BigInt

const decodeSignedBigInt = (input) => { 
 let result = 0n;
 let shift = 0;
 while (true) {
  const byte = input.shift();
  result |= BigInt((byte & 0x7f) << shift);
  shift += 7;
  if ((byte & 0x80) === 0) {
   // "sign-extending" does not apply to bigint because it has no fixed size
   // instead, we work by handling it as a two's complement of "shift" bits long,
   // which is provided by BigInt.asIntN
   return BigInt.asIntN(shift, result);
  }
 }
};

用途

Android項目在其 Dalvik 可執行格式 (.dex) 文件格式中使用 LEB128。 ^[6]
惠普 IA-64 異常處理中的壓縮表。 ^[7]
DWARF文件格式使用無符號和有符號的LEB128編碼。 ^[2]
.NET在BinaryReader和BinaryWriter類中支持「7位編碼整數」格式。將字符串寫入BinaryWriter時，字符串長度使用此方法進行編碼。
Minecraft在其協議中使用LEB128來測量數據包內數據的長度。 ^[8]
mpatrol 試工具在其跟蹤文件格式中使用 LEB128。 ^[9]
osu!在其「osu! 重放」（.osr）格式中使用 LEB128。 ^[10]
W3C 高效 XML 交換（EXI）使用 LEB128 表示無符號整數。 ^[11]
WebAssembly中。^[3]
xz文件格式中。^[12]

參考

^ UNIX International, 7.8, DWARF Debugging Information Format Specification Version 2.0, Draft (PDF), July 1993 [2009-07-19]
^ ^2.0 ^2.1 Free Standards Group. DWARF Debugging Information Format Specification Version 3.0 (PDF): 70. December 2005 [2009-07-19]. 引用錯誤：帶有name屬性「dwarfspec3」的<ref>標籤用不同內容定義了多次
^ ^3.0 ^3.1 ^3.2 WebAssembly Community Group. Values — Binary Format — WebAssembly 1.1. 2020-11-12 [2020-12-31]. 引用錯誤：帶有name屬性「wasmint」的<ref>標籤用不同內容定義了多次
^ Plaisance, Jeff; Kurz, Nathan. Vectorized VByte Decoding. 2015. arXiv:1503.07387  [cs.IR].
^ Lemire, Daniel; Kurz, Nathan; Rupp, Christoph. Stream VByte: Faster Byte-Oriented Integer Compression. Information Processing Letters (First International Symposium on Web Algorithms). February 1028, 130: 1–6 (June 2015). S2CID 8265597. arXiv:1709.08990 . doi:10.1016/j.ipl.2017.09.011.
^ Dalvik Executable Format. [2021-05-18].
^ Christophe de Dinechin. C++ Exception Handling for IA-64. October 2000 [2009-07-19].
^ Minecraft Modern Varint & Varlong. wiki.vg. 2020 [2020-11-29]. （原始內容存檔於2024-09-26）.
^ MPatrol documentation. December 2008 [2009-07-19].
^ Osr (file format) - osu!wiki. osu.ppy.sh. [2017-03-18] （英語）.
^ Efficient XML Interchange (EXI) Format 1.0. www.w3.org Second. World Wide Web Consortium. 2014-02-11 [2020-12-31].
^ The .xz File Format. tukaani.org. 2009 [2017-10-30].
^ LLVM Bitcode File Format — LLVM 13 documentation.

參見

DWARF 調試文件格式
UTF-7

[dwarfspec2-1] UNIX International, 7.8, DWARF Debugging Information Format Specification Version 2.0, Draft (PDF), July 1993 [2009-07-19]

[dwarfspec3-2] 2.0 ^2.1 Free Standards Group. DWARF Debugging Information Format Specification Version 3.0 (PDF): 70. December 2005 [2009-07-19]. 引用錯誤：帶有name屬性「dwarfspec3」的<ref>標籤用不同內容定義了多次

[wasmint-3] 3.0 ^3.1 ^3.2 WebAssembly Community Group. Values — Binary Format — WebAssembly 1.1. 2020-11-12 [2020-12-31]. 引用錯誤：帶有name屬性「wasmint」的<ref>標籤用不同內容定義了多次

[4] Plaisance, Jeff; Kurz, Nathan. Vectorized VByte Decoding. 2015. arXiv:1503.07387  [cs.IR].

[5] Lemire, Daniel; Kurz, Nathan; Rupp, Christoph. Stream VByte: Faster Byte-Oriented Integer Compression. Information Processing Letters (First International Symposium on Web Algorithms). February 1028, 130: 1–6 (June 2015). S2CID 8265597. arXiv:1709.08990 . doi:10.1016/j.ipl.2017.09.011.

[dex-6] Dalvik Executable Format. [2021-05-18].

[Christophe-7] Christophe de Dinechin. C++ Exception Handling for IA-64. October 2000 [2009-07-19].

[8] Minecraft Modern Varint & Varlong. wiki.vg. 2020 [2020-11-29]. （原始內容存檔於2024-09-26）.

[tracefile-9] MPatrol documentation. December 2008 [2009-07-19].

[10] Osr (file format) - osu!wiki. osu.ppy.sh. [2017-03-18] （英語）.

[11] Efficient XML Interchange (EXI) Format 1.0. www.w3.org Second. World Wide Web Consortium. 2014-02-11 [2020-12-31].

[12] The .xz File Format. tukaani.org. 2009 [2017-10-30].

[13] LLVM Bitcode File Format — LLVM 13 documentation.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]