
llama.cpp GGML Quantization Type


  • 1. GGML Quantization Type
  • 2. `static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT]`
  • 3. `Q#_K_M` and `Q#_K`
  • References

What gods and demons? They are nothing but shackles forged to bind the fate of other races!

GGUF
https://huggingface.co/docs/hub/gguf

docs/hub/gguf.md
https://github.com/huggingface/hub-docs/blob/main/docs/hub/gguf.md

1. GGML Quantization Type

packages/gguf/src/quant-descriptions.ts
https://github.com/huggingface/huggingface.js/blob/main/packages/gguf/src/quant-descriptions.ts

import { GGMLQuantizationType } from "./types";

export const GGUF_QUANT_DESCRIPTIONS: Record<GGMLQuantizationType, { txt: string; src_url?: string }> = {
  [GGMLQuantizationType.F32]: {
    txt: "32-bit standard IEEE 754 single-precision floating-point number.",
    src_url: "https://en.wikipedia.org/wiki/Single-precision_floating-point_format",
  },
  [GGMLQuantizationType.F16]: {
    txt: "16-bit standard IEEE 754 half-precision floating-point number.",
    src_url: "https://en.wikipedia.org/wiki/Half-precision_floating-point_format",
  },
  [GGMLQuantizationType.Q8_0]: {
    txt: "8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today).",
    src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249",
  },
  [GGMLQuantizationType.Q8_1]: {
    txt: "8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today).",
    src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290",
  },
  [GGMLQuantizationType.Q8_K]: {
    txt: "8-bit quantization (q). Each block has 256 weights. Only used for quantizing intermediate results. All 2-6 bit dot products are implemented for this quantization type. Weight formula: w = q * block_scale.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305",
  },
  [GGMLQuantizationType.Q6_K]: {
    txt: "6-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(8-bit), resulting in 6.5625 bits-per-weight.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305",
  },
  [GGMLQuantizationType.Q5_0]: {
    txt: "5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today).",
    src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249",
  },
  [GGMLQuantizationType.Q5_1]: {
    txt: "5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today).",
    src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290",
  },
  [GGMLQuantizationType.Q5_K]: {
    txt: "5-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 5.5 bits-per-weight.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305",
  },
  [GGMLQuantizationType.Q4_0]: {
    txt: "4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today).",
    src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249",
  },
  [GGMLQuantizationType.Q4_1]: {
    txt: "4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today).",
    src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290",
  },
  [GGMLQuantizationType.Q4_K]: {
    txt: "4-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 4.5 bits-per-weight.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305",
  },
  [GGMLQuantizationType.Q3_K]: {
    txt: "3-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(6-bit), resulting in 3.4375 bits-per-weight.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305",
  },
  [GGMLQuantizationType.Q2_K]: {
    txt: "2-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(4-bit) + block_min(4-bit), resulting in 2.5625 bits-per-weight.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305",
  },
  [GGMLQuantizationType.IQ4_XS]: {
    txt: "4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 4.25 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ3_S]: {
    txt: "3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.44 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ3_XXS]: {
    txt: "3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.06 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ2_S]: {
    txt: "2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.5 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ2_XS]: {
    txt: "2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.31 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ2_XXS]: {
    txt: "2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.06 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ1_S]: {
    txt: "1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.56 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ4_NL]: {
    txt: "4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/5590",
  },
  [GGMLQuantizationType.I8]: {
    txt: "8-bit fixed-width integer number.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/6045",
  },
  [GGMLQuantizationType.I16]: {
    txt: "16-bit fixed-width integer number.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/6045",
  },
  [GGMLQuantizationType.I32]: {
    txt: "32-bit fixed-width integer number.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/6045",
  },
  [GGMLQuantizationType.I64]: {
    txt: "64-bit fixed-width integer number.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/6062",
  },
  [GGMLQuantizationType.F64]: {
    txt: "64-bit standard IEEE 754 double-precision floating-point number.",
    src_url: "https://en.wikipedia.org/wiki/Double-precision_floating-point_format",
  },
  [GGMLQuantizationType.IQ1_M]: {
    txt: "1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.75 bits-per-weight.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/6302",
  },
  [GGMLQuantizationType.BF16]: {
    txt: "16-bit shortened version of the 32-bit IEEE 754 single-precision floating-point number.",
    src_url: "https://en.wikipedia.org/wiki/Bfloat16_floating-point_format",
  },
};
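The recurring legacy formula w = q * block_scale can be made concrete with a minimal pure-Python sketch of a Q8_0-style block. This is an illustration of the formula only, not llama.cpp's actual implementation (which stores the scale as fp16 inside a packed block struct and uses optimized kernels):

```python
import math

def quantize_q8_0_block(weights):
    """Round-to-nearest 8-bit quantization of one 32-weight block (Q8_0-style).

    Stores one float scale plus 32 signed 8-bit integers q; weights are
    reconstructed as w = q * block_scale.
    """
    assert len(weights) == 32
    amax = max(abs(w) for w in weights)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, q

def dequantize_q8_0_block(scale, q):
    # w = q * block_scale
    return [qi * scale for qi in q]

block = [math.sin(i) for i in range(32)]
scale, q = quantize_q8_0_block(block)
restored = dequantize_q8_0_block(scale, q)
# Round-to-nearest keeps the per-weight error within half a quantization step:
print(max(abs(b - r) for b, r in zip(block, restored)) <= 0.5 * scale + 1e-9)
```

Q8_1 differs only in adding a per-block minimum (w = q * block_scale + block_minimum), which lets the 8-bit range cover asymmetric weight distributions.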
| type | source | description |
| --- | --- | --- |
| F64 | Wikipedia | 64-bit standard IEEE 754 double-precision floating-point number. |
| I64 | GH | 64-bit fixed-width integer number. |
| F32 | Wikipedia | 32-bit standard IEEE 754 single-precision floating-point number. |
| I32 | GH | 32-bit fixed-width integer number. |
| F16 | Wikipedia | 16-bit standard IEEE 754 half-precision floating-point number. |
| BF16 | Wikipedia | 16-bit shortened version of the 32-bit IEEE 754 single-precision floating-point number. |
| I16 | GH | 16-bit fixed-width integer number. |
| Q8_0 | GH | 8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today). |
| Q8_1 | GH | 8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today). |
| Q8_K | GH | 8-bit quantization (q). Each block has 256 weights. Only used for quantizing intermediate results. All 2-6 bit dot products are implemented for this quantization type. Weight formula: w = q * block_scale. |
| I8 | GH | 8-bit fixed-width integer number. |
| Q6_K | GH | 6-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(8-bit), resulting in 6.5625 bits-per-weight. |
| Q5_0 | GH | 5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today). |
| Q5_1 | GH | 5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today). |
| Q5_K | GH | 5-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 5.5 bits-per-weight. |
| Q4_0 | GH | 4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today). |
| Q4_1 | GH | 4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today). |
| Q4_K | GH | 4-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 4.5 bits-per-weight. |
| Q3_K | GH | 3-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(6-bit), resulting in 3.4375 bits-per-weight. |
| Q2_K | GH | 2-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(4-bit) + block_min(4-bit), resulting in 2.5625 bits-per-weight. |
| IQ4_NL | GH | 4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix. |
| IQ4_XS | HF | 4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 4.25 bits-per-weight. |
| IQ3_S | HF | 3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.44 bits-per-weight. |
| IQ3_XXS | HF | 3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.06 bits-per-weight. |
| IQ2_XXS | HF | 2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.06 bits-per-weight. |
| IQ2_S | HF | 2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.5 bits-per-weight. |
| IQ2_XS | HF | 2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.31 bits-per-weight. |
| IQ1_S | HF | 1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.56 bits-per-weight. |
| IQ1_M | GH | 1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.75 bits-per-weight. |

GH = GitHub, HF = Hugging Face
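The bits-per-weight figures quoted for the k-quants can be reproduced from the block layouts described above. A small arithmetic sketch: 256 weights per super-block is stated in the descriptions, while the fp16 super-block scales are an assumption consistent with the k-quants PR, not something the table itself states:

```python
def bits_per_weight(total_bits, weights_per_super_block=256):
    """Average storage cost per weight for one super-block."""
    return total_bits / weights_per_super_block

# Q4_K: 8 blocks x 32 weights; 256 4-bit quants, 8 (6-bit scale + 6-bit min)
# pairs, plus two fp16 super-block scales (d and dmin).
q4_k_bits = 256 * 4 + 8 * (6 + 6) + 2 * 16

# Q5_K: same layout as Q4_K but with 5-bit quants.
q5_k_bits = 256 * 5 + 8 * (6 + 6) + 2 * 16

# Q6_K: 16 blocks x 16 weights; 256 6-bit quants, 16 8-bit block scales,
# plus one fp16 super-block scale.
q6_k_bits = 256 * 6 + 16 * 8 + 16

print(bits_per_weight(q4_k_bits))  # 4.5
print(bits_per_weight(q5_k_bits))  # 5.5
print(bits_per_weight(q6_k_bits))  # 6.5625
```

The computed values match the 4.5, 5.5, and 6.5625 bits-per-weight figures in the table, which is why the k-quants cost slightly more than their nominal bit width: the block scales and minimums are amortized over every weight.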

2. static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT]

https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-quants.h
https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-quants.c

https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml.c

static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT] = {
    [GGML_TYPE_I8]  = { .type_name = "i8",  .blck_size = 1, .type_size = sizeof(int8_t),  .is_quantized = false },
    [GGML_TYPE_I16] = { .type_name = "i16", .blck_size = 1, .type_size = sizeof(int16_t), .is_quantized = false },
    [GGML_TYPE_I32] = { .type_name = "i32", .blck_size = 1, .type_size = sizeof(int32_t), .is_quantized = false },
    [GGML_TYPE_I64] = { .type_name = "i64", .blck_size = 1, .type_size = sizeof(int64_t), .is_quantized = false },
    [GGML_TYPE_F64] = { .type_name = "f64", .blck_size = 1, .type_size = sizeof(double),  .is_quantized = false },
    [GGML_TYPE_F32] = { .type_name = "f32", .blck_size = 1, .type_size = sizeof(float),   .is_quantized = false },
    [GGML_TYPE_F16] = {
        .type_name      = "f16",
        .blck_size      = 1,
        .type_size      = sizeof(ggml_fp16_t),
        .is_quantized   = false,
        .to_float       = (ggml_to_float_t) ggml_fp16_to_fp32_row,
        .from_float_ref = (ggml_from_float_t) ggml_fp32_to_fp16_row,
    },
    [GGML_TYPE_Q4_0] = {
        .type_name      = "q4_0",
        .blck_size      = QK4_0,
        .type_size      = sizeof(block_q4_0),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q4_0,
        .from_float_ref = (ggml_from_float_t) quantize_row_q4_0_ref,
    },
    [GGML_TYPE_Q4_1] = {
        .type_name      = "q4_1",
        .blck_size      = QK4_1,
        .type_size      = sizeof(block_q4_1),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q4_1,
        .from_float_ref = (ggml_from_float_t) quantize_row_q4_1_ref,
    },
    [4] = { // GGML_TYPE_Q4_2
        .type_name = "DEPRECATED", .blck_size = 0, .type_size = 0, .is_quantized = false,
    },
    [5] = { // GGML_TYPE_Q4_3
        .type_name = "DEPRECATED", .blck_size = 0, .type_size = 0, .is_quantized = false,
    },
    [GGML_TYPE_Q5_0] = {
        .type_name      = "q5_0",
        .blck_size      = QK5_0,
        .type_size      = sizeof(block_q5_0),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q5_0,
        .from_float_ref = (ggml_from_float_t) quantize_row_q5_0_ref,
    },
    [GGML_TYPE_Q5_1] = {
        .type_name      = "q5_1",
        .blck_size      = QK5_1,
        .type_size      = sizeof(block_q5_1),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q5_1,
        .from_float_ref = (ggml_from_float_t) quantize_row_q5_1_ref,
    },
    [GGML_TYPE_Q8_0] = {
        .type_name      = "q8_0",
        .blck_size      = QK8_0,
        .type_size      = sizeof(block_q8_0),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q8_0,
        .from_float_ref = (ggml_from_float_t) quantize_row_q8_0_ref,
    },
    [GGML_TYPE_Q8_1] = {
        .type_name      = "q8_1",
        .blck_size      = QK8_1,
        .type_size      = sizeof(block_q8_1),
        .is_quantized   = true,
        .from_float_ref = (ggml_from_float_t) quantize_row_q8_1_ref,
    },
    [GGML_TYPE_Q2_K] = {
        .type_name      = "q2_K",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_q2_K),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q2_K,
        .from_float_ref = (ggml_from_float_t) quantize_row_q2_K_ref,
    },
    [GGML_TYPE_Q3_K] = {
        .type_name      = "q3_K",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_q3_K),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q3_K,
        .from_float_ref = (ggml_from_float_t) quantize_row_q3_K_ref,
    },
    [GGML_TYPE_Q4_K] = {
        .type_name      = "q4_K",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_q4_K),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q4_K,
        .from_float_ref = (ggml_from_float_t) quantize_row_q4_K_ref,
    },
    [GGML_TYPE_Q5_K] = {
        .type_name      = "q5_K",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_q5_K),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q5_K,
        .from_float_ref = (ggml_from_float_t) quantize_row_q5_K_ref,
    },
    [GGML_TYPE_Q6_K] = {
        .type_name      = "q6_K",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_q6_K),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q6_K,
        .from_float_ref = (ggml_from_float_t) quantize_row_q6_K_ref,
    },
    [GGML_TYPE_IQ2_XXS] = {
        .type_name      = "iq2_xxs",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_iq2_xxs),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq2_xxs,
        .from_float_ref = NULL,
    },
    [GGML_TYPE_IQ2_XS] = {
        .type_name      = "iq2_xs",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_iq2_xs),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq2_xs,
        .from_float_ref = NULL,
    },
    [GGML_TYPE_IQ3_XXS] = {
        .type_name      = "iq3_xxs",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_iq3_xxs),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq3_xxs,
        .from_float_ref = (ggml_from_float_t) quantize_row_iq3_xxs_ref,
    },
    [GGML_TYPE_IQ3_S] = {
        .type_name      = "iq3_s",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_iq3_s),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq3_s,
        .from_float_ref = (ggml_from_float_t) quantize_row_iq3_s_ref,
    },
    [GGML_TYPE_IQ2_S] = {
        .type_name      = "iq2_s",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_iq2_s),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq2_s,
        .from_float_ref = (ggml_from_float_t) quantize_row_iq2_s_ref,
    },
    [GGML_TYPE_IQ1_S] = {
        .type_name      = "iq1_s",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_iq1_s),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq1_s,
        .from_float_ref = NULL,
    },
    [GGML_TYPE_IQ1_M] = {
        .type_name      = "iq1_m",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_iq1_m),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq1_m,
        .from_float_ref = NULL,
    },
    [GGML_TYPE_IQ4_NL] = {
        .type_name      = "iq4_nl",
        .blck_size      = QK4_NL,
        .type_size      = sizeof(block_iq4_nl),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq4_nl,
        .from_float_ref = (ggml_from_float_t) quantize_row_iq4_nl_ref,
    },
    [GGML_TYPE_IQ4_XS] = {
        .type_name      = "iq4_xs",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_iq4_xs),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq4_xs,
        .from_float_ref = (ggml_from_float_t) quantize_row_iq4_xs_ref,
    },
    [GGML_TYPE_Q8_K] = {
        .type_name      = "q8_K",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_q8_K),
        .is_quantized   = true,
    },
    [GGML_TYPE_BF16] = {
        .type_name      = "bf16",
        .blck_size      = 1,
        .type_size      = sizeof(ggml_bf16_t),
        .is_quantized   = false,
        .to_float       = (ggml_to_float_t) ggml_bf16_to_fp32_row,
        .from_float_ref = (ggml_from_float_t) ggml_fp32_to_bf16_row_ref,
    },
    [31] = { // GGML_TYPE_Q4_0_4_4
        .type_name = "TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking",
        .blck_size = 0, .type_size = 0, .is_quantized = false,
    },
    [32] = { // GGML_TYPE_Q4_0_4_8
        .type_name = "TYPE_Q4_0_4_8 REMOVED, use Q4_0 with runtime repacking",
        .blck_size = 0, .type_size = 0, .is_quantized = false,
    },
    [33] = { // GGML_TYPE_Q4_0_8_8
        .type_name = "TYPE_Q4_0_8_8 REMOVED, use Q4_0 with runtime repacking",
        .blck_size = 0, .type_size = 0, .is_quantized = false,
    },
    [GGML_TYPE_TQ1_0] = {
        .type_name      = "tq1_0",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_tq1_0),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_tq1_0,
        .from_float_ref = (ggml_from_float_t) quantize_row_tq1_0_ref,
    },
    [GGML_TYPE_TQ2_0] = {
        .type_name      = "tq2_0",
        .blck_size      = QK_K,
        .type_size      = sizeof(block_tq2_0),
        .is_quantized   = true,
        .to_float       = (ggml_to_float_t) dequantize_row_tq2_0,
        .from_float_ref = (ggml_from_float_t) quantize_row_tq2_0_ref,
    },
    [36] = { // GGML_TYPE_IQ4_NL_4_4
        .type_name = "TYPE_IQ4_NL_4_4 REMOVED, use IQ4_NL with runtime repacking",
        .blck_size = 0, .type_size = 0, .is_quantized = false,
    },
    [37] = { // GGML_TYPE_IQ4_NL_4_8
        .type_name = "TYPE_IQ4_NL_4_8 REMOVED, use IQ4_NL with runtime repacking",
        .blck_size = 0, .type_size = 0, .is_quantized = false,
    },
    [38] = { // GGML_TYPE_IQ4_NL_8_8
        .type_name = "TYPE_IQ4_NL_8_8 REMOVED, use IQ4_NL with runtime repacking",
        .blck_size = 0, .type_size = 0, .is_quantized = false,
    },
};
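The practical job of type_traits is size computation: a tensor row of n elements stored in type t occupies n / blck_size * type_size bytes, which is what ggml's ggml_row_size() computes. A Python sketch with a few hand-copied (blck_size, type_size) pairs — the byte sizes are assumptions taken from the block struct definitions in ggml-quants.h:

```python
# (blck_size, type_size in bytes); type_size assumed from the block struct
# definitions in ggml-quants.h:
#   block_q4_0: fp16 scale + 16 bytes of packed 4-bit quants      -> 18 bytes / 32 weights
#   block_q8_0: fp16 scale + 32 int8 quants                       -> 34 bytes / 32 weights
#   block_q4_K: 2 fp16 scales + 12 scale bytes + 128 quant bytes  -> 144 bytes / 256 weights
TYPE_TRAITS = {
    "f16":  (1, 2),
    "q4_0": (32, 18),
    "q8_0": (32, 34),
    "q4_K": (256, 144),
}

def row_size(type_name, n_elements):
    """Bytes needed to store n_elements, mirroring ggml_row_size()."""
    blck, size = TYPE_TRAITS[type_name]
    assert n_elements % blck == 0, "a row must hold a whole number of blocks"
    return n_elements // blck * size

def bits_per_weight(type_name):
    blck, size = TYPE_TRAITS[type_name]
    return size * 8 / blck

print(row_size("q4_0", 4096))   # 2304 bytes for one 4096-wide row
print(bits_per_weight("q4_0"))  # 4.5
```

Note how the two "4-bit" types come out at the same 4.5 bits-per-weight by different routes: Q4_0 pays one fp16 scale per 32 weights, Q4_K amortizes its 6-bit scales/mins and super-block scales over 256 weights.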

/home/yongqiang/llm_work/llama_cpp_25_01_05/llama.cpp/ggml/include/ggml.h

    // NOTE: always add types at the end of the enum to keep backward compatibility
    enum ggml_type {
        GGML_TYPE_F32     = 0,
        GGML_TYPE_F16     = 1,
        GGML_TYPE_Q4_0    = 2,
        GGML_TYPE_Q4_1    = 3,
        // GGML_TYPE_Q4_2 = 4, support has been removed
        // GGML_TYPE_Q4_3 = 5, support has been removed
        GGML_TYPE_Q5_0    = 6,
        GGML_TYPE_Q5_1    = 7,
        GGML_TYPE_Q8_0    = 8,
        GGML_TYPE_Q8_1    = 9,
        GGML_TYPE_Q2_K    = 10,
        GGML_TYPE_Q3_K    = 11,
        GGML_TYPE_Q4_K    = 12,
        GGML_TYPE_Q5_K    = 13,
        GGML_TYPE_Q6_K    = 14,
        GGML_TYPE_Q8_K    = 15,
        GGML_TYPE_IQ2_XXS = 16,
        GGML_TYPE_IQ2_XS  = 17,
        GGML_TYPE_IQ3_XXS = 18,
        GGML_TYPE_IQ1_S   = 19,
        GGML_TYPE_IQ4_NL  = 20,
        GGML_TYPE_IQ3_S   = 21,
        GGML_TYPE_IQ2_S   = 22,
        GGML_TYPE_IQ4_XS  = 23,
        GGML_TYPE_I8      = 24,
        GGML_TYPE_I16     = 25,
        GGML_TYPE_I32     = 26,
        GGML_TYPE_I64     = 27,
        GGML_TYPE_F64     = 28,
        GGML_TYPE_IQ1_M   = 29,
        GGML_TYPE_BF16    = 30,
        // GGML_TYPE_Q4_0_4_4 = 31, support has been removed from gguf files
        // GGML_TYPE_Q4_0_4_8 = 32,
        // GGML_TYPE_Q4_0_8_8 = 33,
        GGML_TYPE_TQ1_0   = 34,
        GGML_TYPE_TQ2_0   = 35,
        // GGML_TYPE_IQ4_NL_4_4 = 36,
        // GGML_TYPE_IQ4_NL_4_8 = 37,
        // GGML_TYPE_IQ4_NL_8_8 = 38,
        GGML_TYPE_COUNT   = 39,
    };

    // precision
    enum ggml_prec {
        GGML_PREC_DEFAULT,
        GGML_PREC_F32,
    };

    // model file types
    enum ggml_ftype {
        GGML_FTYPE_UNKNOWN              = -1,
        GGML_FTYPE_ALL_F32              = 0,
        GGML_FTYPE_MOSTLY_F16           = 1,  // except 1d tensors
        GGML_FTYPE_MOSTLY_Q4_0          = 2,  // except 1d tensors
        GGML_FTYPE_MOSTLY_Q4_1          = 3,  // except 1d tensors
        GGML_FTYPE_MOSTLY_Q4_1_SOME_F16 = 4,  // tok_embeddings.weight and output.weight are F16
        GGML_FTYPE_MOSTLY_Q8_0          = 7,  // except 1d tensors
        GGML_FTYPE_MOSTLY_Q5_0          = 8,  // except 1d tensors
        GGML_FTYPE_MOSTLY_Q5_1          = 9,  // except 1d tensors
        GGML_FTYPE_MOSTLY_Q2_K          = 10, // except 1d tensors
        GGML_FTYPE_MOSTLY_Q3_K          = 11, // except 1d tensors
        GGML_FTYPE_MOSTLY_Q4_K          = 12, // except 1d tensors
        GGML_FTYPE_MOSTLY_Q5_K          = 13, // except 1d tensors
        GGML_FTYPE_MOSTLY_Q6_K          = 14, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ2_XXS       = 15, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ2_XS        = 16, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ3_XXS       = 17, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ1_S         = 18, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ4_NL        = 19, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ3_S         = 20, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ2_S         = 21, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ4_XS        = 22, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ1_M         = 23, // except 1d tensors
        GGML_FTYPE_MOSTLY_BF16          = 24, // except 1d tensors
    };
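Because removed types leave permanent gaps in the enum (4, 5, 31-33, 36-38), code that reads a tensor type id from a GGUF file must treat those ids as invalid rather than remap the values. A hypothetical Python helper illustrating this (a subset of the enum above; the function is illustrative, not a llama.cpp API):

```python
# Subset of the ggml_type enum above, keyed by wire value. Ids 4 and 5
# (the removed Q4_2/Q4_3) are deliberately absent.
GGML_TYPE_NAMES = {
    0: "F32", 1: "F16", 2: "Q4_0", 3: "Q4_1",
    6: "Q5_0", 7: "Q5_1", 8: "Q8_0", 9: "Q8_1",
    10: "Q2_K", 11: "Q3_K", 12: "Q4_K", 13: "Q5_K", 14: "Q6_K", 15: "Q8_K",
    30: "BF16",
}

def type_name(type_id):
    """Resolve a tensor type id; gaps left by removed types are errors."""
    try:
        return GGML_TYPE_NAMES[type_id]
    except KeyError:
        raise ValueError(f"unknown or removed ggml_type id: {type_id}")

print(type_name(12))  # Q4_K
```

This is also why the header insists on appending new types at the end: reusing a freed id would silently reinterpret tensors in older GGUF files.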

3. Q#_K_M and Q#_K

https://netraneupane.medium.com/hands-on-llms-quantization-a4c7ab1421c2

In the context of llama.cpp, Q4_K_M refers to a specific quantization method from the "k-quants" family (introduced in llama.cpp PR #1684, and often loosely described as a k-means-style clustering of weights). The naming convention is as follows:

  • Q stands for Quantization.
  • 4 indicates the number of bits used to store each quantized weight.
  • K marks the k-quant (super-block) quantization scheme.
  • M indicates the size/quality variant of the quantized model (S = Small, M = Medium, L = Large); larger variants keep more tensors at higher precision, trading file size for accuracy.

Similarly, Q2_K refers to a specific k-quant method. The naming convention is as follows:

  • Q stands for Quantization.
  • 2 indicates the number of bits used to store each quantized weight.
  • K marks the k-quant (super-block) quantization scheme.
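The convention is mechanical enough to parse. A hypothetical helper (not part of llama.cpp; the function name and return shape are made up for illustration) that splits a k-quant name into its components:

```python
import re

def parse_quant_name(name):
    """Split a k-quant name like 'Q4_K_M' into (bits, scheme, variant).

    variant is 'S', 'M', or 'L' when present, None for the plain Q#_K form.
    Raises ValueError for non-k-quant names such as 'Q8_0'.
    """
    m = re.fullmatch(r"Q(\d)_K(?:_([SML]))?", name)
    if not m:
        raise ValueError(f"not a k-quant name: {name}")
    return int(m.group(1)), "k-quant", m.group(2)

print(parse_quant_name("Q4_K_M"))  # (4, 'k-quant', 'M')
print(parse_quant_name("Q2_K"))    # (2, 'k-quant', None)
```

Legacy names like Q4_0/Q4_1 and the importance-matrix IQ types follow different suffix rules, so they are deliberately rejected here.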

References

[1] Yongqiang Cheng, https://yongqiang.blog.csdn.net/
[2] huggingface/gguf, https://github.com/huggingface/huggingface.js/tree/main/packages/gguf
[3] llama.cpp, https://github.com/ggerganov/llama.cpp
[4] k-quants, https://github.com/ggerganov/llama.cpp/pull/1684
