
llama.cpp GGML Quantization Type


  • 1. GGML Quantization Type
  • 2. `static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT]`
  • 3. `Q#_K_M` and `Q#_K`
  • References

What gods and demons are these? Nothing but the shackles they forged to bind the fates of other races!

GGUF
https://huggingface.co/docs/hub/gguf

docs/hub/gguf.md
https://github.com/huggingface/hub-docs/blob/main/docs/hub/gguf.md

1. GGML Quantization Type

packages/gguf/src/quant-descriptions.ts
https://github.com/huggingface/huggingface.js/blob/main/packages/gguf/src/quant-descriptions.ts

```typescript
import { GGMLQuantizationType } from "./types";

export const GGUF_QUANT_DESCRIPTIONS: Record<GGMLQuantizationType, { txt: string; src_url?: string }> = {
  [GGMLQuantizationType.F32]: {
    txt: "32-bit standard IEEE 754 single-precision floating-point number.",
    src_url: "https://en.wikipedia.org/wiki/Single-precision_floating-point_format",
  },
  [GGMLQuantizationType.F16]: {
    txt: "16-bit standard IEEE 754 half-precision floating-point number.",
    src_url: "https://en.wikipedia.org/wiki/Half-precision_floating-point_format",
  },
  [GGMLQuantizationType.Q8_0]: {
    txt: "8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today).",
    src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249",
  },
  [GGMLQuantizationType.Q8_1]: {
    txt: "8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today).",
    src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290",
  },
  [GGMLQuantizationType.Q8_K]: {
    txt: `8-bit quantization (q). Each block has 256 weights. Only used for quantizing intermediate results. All 2-6 bit dot products are implemented for this quantization type. Weight formula: w = q * block_scale.`,
    src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305",
  },
  [GGMLQuantizationType.Q6_K]: {
    txt: `6-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(8-bit), resulting in 6.5625 bits-per-weight.`,
    src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305",
  },
  [GGMLQuantizationType.Q5_0]: {
    txt: "5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today).",
    src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249",
  },
  [GGMLQuantizationType.Q5_1]: {
    txt: "5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today).",
    src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290",
  },
  [GGMLQuantizationType.Q5_K]: {
    txt: `5-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 5.5 bits-per-weight.`,
    src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305",
  },
  [GGMLQuantizationType.Q4_0]: {
    txt: "4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today).",
    src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249",
  },
  [GGMLQuantizationType.Q4_1]: {
    txt: "4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today).",
    src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290",
  },
  [GGMLQuantizationType.Q4_K]: {
    txt: `4-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 4.5 bits-per-weight.`,
    src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305",
  },
  [GGMLQuantizationType.Q3_K]: {
    txt: `3-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(6-bit), resulting in 3.4375 bits-per-weight.`,
    src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305",
  },
  [GGMLQuantizationType.Q2_K]: {
    txt: `2-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(4-bit) + block_min(4-bit), resulting in 2.5625 bits-per-weight.`,
    src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305",
  },
  [GGMLQuantizationType.IQ4_XS]: {
    txt: "4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 4.25 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ3_S]: {
    txt: "3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.44 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ3_XXS]: {
    txt: "3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.06 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ2_S]: {
    txt: "2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.5 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ2_XS]: {
    txt: "2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.31 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ2_XXS]: {
    txt: "2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.06 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ1_S]: {
    txt: "1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.56 bits-per-weight.",
    src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70",
  },
  [GGMLQuantizationType.IQ4_NL]: {
    txt: "4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/5590",
  },
  [GGMLQuantizationType.I8]: {
    txt: "8-bit fixed-width integer number.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/6045",
  },
  [GGMLQuantizationType.I16]: {
    txt: "16-bit fixed-width integer number.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/6045",
  },
  [GGMLQuantizationType.I32]: {
    txt: "32-bit fixed-width integer number.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/6045",
  },
  [GGMLQuantizationType.I64]: {
    txt: "64-bit fixed-width integer number.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/6062",
  },
  [GGMLQuantizationType.F64]: {
    txt: "64-bit standard IEEE 754 double-precision floating-point number.",
    src_url: "https://en.wikipedia.org/wiki/Double-precision_floating-point_format",
  },
  [GGMLQuantizationType.IQ1_M]: {
    txt: "1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.75 bits-per-weight.",
    src_url: "https://github.com/ggerganov/llama.cpp/pull/6302",
  },
  [GGMLQuantizationType.BF16]: {
    txt: "16-bit shortened version of the 32-bit IEEE 754 single-precision floating-point number.",
    src_url: "https://en.wikipedia.org/wiki/Bfloat16_floating-point_format",
  },
};
```
| type | source | description |
| --- | --- | --- |
| F64 | Wikipedia | 64-bit standard IEEE 754 double-precision floating-point number. |
| I64 | GH | 64-bit fixed-width integer number. |
| F32 | Wikipedia | 32-bit standard IEEE 754 single-precision floating-point number. |
| I32 | GH | 32-bit fixed-width integer number. |
| F16 | Wikipedia | 16-bit standard IEEE 754 half-precision floating-point number. |
| BF16 | Wikipedia | 16-bit shortened version of the 32-bit IEEE 754 single-precision floating-point number. |
| I16 | GH | 16-bit fixed-width integer number. |
| Q8_0 | GH | 8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today). |
| Q8_1 | GH | 8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today). |
| Q8_K | GH | 8-bit quantization (q). Each block has 256 weights. Only used for quantizing intermediate results. All 2-6 bit dot products are implemented for this quantization type. Weight formula: w = q * block_scale. |
| I8 | GH | 8-bit fixed-width integer number. |
| Q6_K | GH | 6-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(8-bit), resulting in 6.5625 bits-per-weight. |
| Q5_0 | GH | 5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today). |
| Q5_1 | GH | 5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today). |
| Q5_K | GH | 5-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 5.5 bits-per-weight. |
| Q4_0 | GH | 4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today). |
| Q4_1 | GH | 4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today). |
| Q4_K | GH | 4-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 4.5 bits-per-weight. |
| Q3_K | GH | 3-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(6-bit), resulting in 3.4375 bits-per-weight. |
| Q2_K | GH | 2-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(4-bit) + block_min(4-bit), resulting in 2.5625 bits-per-weight. |
| IQ4_NL | GH | 4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix. |
| IQ4_XS | HF | 4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 4.25 bits-per-weight. |
| IQ3_S | HF | 3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.44 bits-per-weight. |
| IQ3_XXS | HF | 3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.06 bits-per-weight. |
| IQ2_XXS | HF | 2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.06 bits-per-weight. |
| IQ2_S | HF | 2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.5 bits-per-weight. |
| IQ2_XS | HF | 2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.31 bits-per-weight. |
| IQ1_S | HF | 1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.56 bits-per-weight. |
| IQ1_M | GH | 1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.75 bits-per-weight. |

GH = GitHub, HF = Hugging Face
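The bits-per-weight figures quoted above follow directly from the stated block layouts. As a sanity check, here is a short sketch that recomputes them; it assumes the super-block scale and minimum are each stored as fp16 (16 bits), which is how llama.cpp lays out the K-quant block structs:

```python
# Recompute K-quant bits-per-weight from the block layouts described above.
# Assumption: fp16 (16-bit) super-block scale/min, matching llama.cpp's structs.

def bits_per_weight(n_weights: int, q_bits: int, n_blocks: int,
                    scale_bits: int, min_bits: int = 0,
                    super_scale_bits: int = 16, super_min_bits: int = 0) -> float:
    total_bits = (n_weights * q_bits                    # quantized weights
                  + n_blocks * (scale_bits + min_bits)  # per-block scales/mins
                  + super_scale_bits + super_min_bits)  # per-super-block fp16 values
    return total_bits / n_weights

# Q6_K: 16 blocks x 16 weights, 8-bit block scales, fp16 super-block scale
assert bits_per_weight(256, 6, 16, 8) == 6.5625
# Q5_K: 8 blocks x 32 weights, 6-bit scales + 6-bit mins, fp16 scale + fp16 min
assert bits_per_weight(256, 5, 8, 6, 6, 16, 16) == 5.5
# Q4_K: same layout as Q5_K but with 4-bit weights
assert bits_per_weight(256, 4, 8, 6, 6, 16, 16) == 4.5
# Q3_K: 16 blocks x 16 weights, 6-bit block scales
assert bits_per_weight(256, 3, 16, 6) == 3.4375
```

Note how the super-block overhead (one or two fp16 values per 256 weights) adds well under 0.2 bits per weight, which is why the figures sit so close to the nominal bit width.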

2. `static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT]`

https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-quants.h
https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-quants.c

https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml.c

```c
static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT] = {
    [GGML_TYPE_I8]  = { .type_name = "i8",  .blck_size = 1, .type_size = sizeof(int8_t),  .is_quantized = false },
    [GGML_TYPE_I16] = { .type_name = "i16", .blck_size = 1, .type_size = sizeof(int16_t), .is_quantized = false },
    [GGML_TYPE_I32] = { .type_name = "i32", .blck_size = 1, .type_size = sizeof(int32_t), .is_quantized = false },
    [GGML_TYPE_I64] = { .type_name = "i64", .blck_size = 1, .type_size = sizeof(int64_t), .is_quantized = false },
    [GGML_TYPE_F64] = { .type_name = "f64", .blck_size = 1, .type_size = sizeof(double),  .is_quantized = false },
    [GGML_TYPE_F32] = { .type_name = "f32", .blck_size = 1, .type_size = sizeof(float),   .is_quantized = false },
    [GGML_TYPE_F16] = {
        .type_name = "f16", .blck_size = 1, .type_size = sizeof(ggml_fp16_t), .is_quantized = false,
        .to_float       = (ggml_to_float_t) ggml_fp16_to_fp32_row,
        .from_float_ref = (ggml_from_float_t) ggml_fp32_to_fp16_row,
    },
    [GGML_TYPE_Q4_0] = {
        .type_name = "q4_0", .blck_size = QK4_0, .type_size = sizeof(block_q4_0), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q4_0,
        .from_float_ref = (ggml_from_float_t) quantize_row_q4_0_ref,
    },
    [GGML_TYPE_Q4_1] = {
        .type_name = "q4_1", .blck_size = QK4_1, .type_size = sizeof(block_q4_1), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q4_1,
        .from_float_ref = (ggml_from_float_t) quantize_row_q4_1_ref,
    },
    [4] = { .type_name = "DEPRECATED", .blck_size = 0, .type_size = 0, .is_quantized = false }, // GGML_TYPE_Q4_2
    [5] = { .type_name = "DEPRECATED", .blck_size = 0, .type_size = 0, .is_quantized = false }, // GGML_TYPE_Q4_3
    [GGML_TYPE_Q5_0] = {
        .type_name = "q5_0", .blck_size = QK5_0, .type_size = sizeof(block_q5_0), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q5_0,
        .from_float_ref = (ggml_from_float_t) quantize_row_q5_0_ref,
    },
    [GGML_TYPE_Q5_1] = {
        .type_name = "q5_1", .blck_size = QK5_1, .type_size = sizeof(block_q5_1), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q5_1,
        .from_float_ref = (ggml_from_float_t) quantize_row_q5_1_ref,
    },
    [GGML_TYPE_Q8_0] = {
        .type_name = "q8_0", .blck_size = QK8_0, .type_size = sizeof(block_q8_0), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q8_0,
        .from_float_ref = (ggml_from_float_t) quantize_row_q8_0_ref,
    },
    [GGML_TYPE_Q8_1] = {
        .type_name = "q8_1", .blck_size = QK8_1, .type_size = sizeof(block_q8_1), .is_quantized = true,
        .from_float_ref = (ggml_from_float_t) quantize_row_q8_1_ref,
    },
    [GGML_TYPE_Q2_K] = {
        .type_name = "q2_K", .blck_size = QK_K, .type_size = sizeof(block_q2_K), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q2_K,
        .from_float_ref = (ggml_from_float_t) quantize_row_q2_K_ref,
    },
    [GGML_TYPE_Q3_K] = {
        .type_name = "q3_K", .blck_size = QK_K, .type_size = sizeof(block_q3_K), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q3_K,
        .from_float_ref = (ggml_from_float_t) quantize_row_q3_K_ref,
    },
    [GGML_TYPE_Q4_K] = {
        .type_name = "q4_K", .blck_size = QK_K, .type_size = sizeof(block_q4_K), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q4_K,
        .from_float_ref = (ggml_from_float_t) quantize_row_q4_K_ref,
    },
    [GGML_TYPE_Q5_K] = {
        .type_name = "q5_K", .blck_size = QK_K, .type_size = sizeof(block_q5_K), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q5_K,
        .from_float_ref = (ggml_from_float_t) quantize_row_q5_K_ref,
    },
    [GGML_TYPE_Q6_K] = {
        .type_name = "q6_K", .blck_size = QK_K, .type_size = sizeof(block_q6_K), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_q6_K,
        .from_float_ref = (ggml_from_float_t) quantize_row_q6_K_ref,
    },
    [GGML_TYPE_IQ2_XXS] = {
        .type_name = "iq2_xxs", .blck_size = QK_K, .type_size = sizeof(block_iq2_xxs), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq2_xxs,
        .from_float_ref = NULL,
    },
    [GGML_TYPE_IQ2_XS] = {
        .type_name = "iq2_xs", .blck_size = QK_K, .type_size = sizeof(block_iq2_xs), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq2_xs,
        .from_float_ref = NULL,
    },
    [GGML_TYPE_IQ3_XXS] = {
        .type_name = "iq3_xxs", .blck_size = QK_K, .type_size = sizeof(block_iq3_xxs), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq3_xxs,
        .from_float_ref = (ggml_from_float_t) quantize_row_iq3_xxs_ref,
    },
    [GGML_TYPE_IQ3_S] = {
        .type_name = "iq3_s", .blck_size = QK_K, .type_size = sizeof(block_iq3_s), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq3_s,
        .from_float_ref = (ggml_from_float_t) quantize_row_iq3_s_ref,
    },
    [GGML_TYPE_IQ2_S] = {
        .type_name = "iq2_s", .blck_size = QK_K, .type_size = sizeof(block_iq2_s), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq2_s,
        .from_float_ref = (ggml_from_float_t) quantize_row_iq2_s_ref,
    },
    [GGML_TYPE_IQ1_S] = {
        .type_name = "iq1_s", .blck_size = QK_K, .type_size = sizeof(block_iq1_s), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq1_s,
        .from_float_ref = NULL,
    },
    [GGML_TYPE_IQ1_M] = {
        .type_name = "iq1_m", .blck_size = QK_K, .type_size = sizeof(block_iq1_m), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq1_m,
        .from_float_ref = NULL,
    },
    [GGML_TYPE_IQ4_NL] = {
        .type_name = "iq4_nl", .blck_size = QK4_NL, .type_size = sizeof(block_iq4_nl), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq4_nl,
        .from_float_ref = (ggml_from_float_t) quantize_row_iq4_nl_ref,
    },
    [GGML_TYPE_IQ4_XS] = {
        .type_name = "iq4_xs", .blck_size = QK_K, .type_size = sizeof(block_iq4_xs), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_iq4_xs,
        .from_float_ref = (ggml_from_float_t) quantize_row_iq4_xs_ref,
    },
    [GGML_TYPE_Q8_K] = { .type_name = "q8_K", .blck_size = QK_K, .type_size = sizeof(block_q8_K), .is_quantized = true },
    [GGML_TYPE_BF16] = {
        .type_name = "bf16", .blck_size = 1, .type_size = sizeof(ggml_bf16_t), .is_quantized = false,
        .to_float       = (ggml_to_float_t) ggml_bf16_to_fp32_row,
        .from_float_ref = (ggml_from_float_t) ggml_fp32_to_bf16_row_ref,
    },
    [31] = { .type_name = "TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking", .blck_size = 0, .type_size = 0, .is_quantized = false },
    [32] = { .type_name = "TYPE_Q4_0_4_8 REMOVED, use Q4_0 with runtime repacking", .blck_size = 0, .type_size = 0, .is_quantized = false },
    [33] = { .type_name = "TYPE_Q4_0_8_8 REMOVED, use Q4_0 with runtime repacking", .blck_size = 0, .type_size = 0, .is_quantized = false },
    [GGML_TYPE_TQ1_0] = {
        .type_name = "tq1_0", .blck_size = QK_K, .type_size = sizeof(block_tq1_0), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_tq1_0,
        .from_float_ref = (ggml_from_float_t) quantize_row_tq1_0_ref,
    },
    [GGML_TYPE_TQ2_0] = {
        .type_name = "tq2_0", .blck_size = QK_K, .type_size = sizeof(block_tq2_0), .is_quantized = true,
        .to_float       = (ggml_to_float_t) dequantize_row_tq2_0,
        .from_float_ref = (ggml_from_float_t) quantize_row_tq2_0_ref,
    },
    [36] = { .type_name = "TYPE_IQ4_NL_4_4 REMOVED, use IQ4_NL with runtime repacking", .blck_size = 0, .type_size = 0, .is_quantized = false },
    [37] = { .type_name = "TYPE_IQ4_NL_4_8 REMOVED, use IQ4_NL with runtime repacking", .blck_size = 0, .type_size = 0, .is_quantized = false },
    [38] = { .type_name = "TYPE_IQ4_NL_8_8 REMOVED, use IQ4_NL with runtime repacking", .blck_size = 0, .type_size = 0, .is_quantized = false },
};
```
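Each quantized entry above pairs a `from_float_ref` (quantize) routine with a `to_float` (dequantize) routine, operating on blocks of `blck_size` values. A minimal Python sketch of the Q8_0 scheme (w = q * block_scale, 32 weights per block) illustrates the round trip; the scale choice follows the reference quantizer's amax/127 rule, simplified from the C code (the real implementation stores the scale as fp16 and the quants as int8 arrays):

```python
import math

QK8_0 = 32  # blck_size for q8_0

def quantize_q8_0(xs: list[float]) -> list[tuple[float, list[int]]]:
    """Quantize a row into (block_scale, 32 int8 quants) blocks, sketching quantize_row_q8_0_ref."""
    assert len(xs) % QK8_0 == 0
    blocks = []
    for i in range(0, len(xs), QK8_0):
        block = xs[i:i + QK8_0]
        amax = max(abs(v) for v in block)
        d = amax / 127.0                        # block scale
        inv_d = 1.0 / d if d else 0.0
        qs = [round(v * inv_d) for v in block]  # int8 quants in [-127, 127]
        blocks.append((d, qs))
    return blocks

def dequantize_q8_0(blocks: list[tuple[float, list[int]]]) -> list[float]:
    """Reconstruct weights with w = q * block_scale, sketching dequantize_row_q8_0."""
    return [d * q for d, qs in blocks for q in qs]

row = [math.sin(i / 3.0) for i in range(64)]  # two blocks of 32
restored = dequantize_q8_0(quantize_q8_0(row))
max_err = max(abs(a - b) for a, b in zip(row, restored))
assert max_err < 0.01  # rounding error is bounded by half a quantization step
```

The per-block scale is what keeps the error small: each group of 32 weights gets its own dynamic range instead of sharing one across the whole tensor.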

/home/yongqiang/llm_work/llama_cpp_25_01_05/llama.cpp/ggml/include/ggml.h

```c
// NOTE: always add types at the end of the enum to keep backward compatibility
enum ggml_type {
    GGML_TYPE_F32     = 0,
    GGML_TYPE_F16     = 1,
    GGML_TYPE_Q4_0    = 2,
    GGML_TYPE_Q4_1    = 3,
    // GGML_TYPE_Q4_2 = 4, support has been removed
    // GGML_TYPE_Q4_3 = 5, support has been removed
    GGML_TYPE_Q5_0    = 6,
    GGML_TYPE_Q5_1    = 7,
    GGML_TYPE_Q8_0    = 8,
    GGML_TYPE_Q8_1    = 9,
    GGML_TYPE_Q2_K    = 10,
    GGML_TYPE_Q3_K    = 11,
    GGML_TYPE_Q4_K    = 12,
    GGML_TYPE_Q5_K    = 13,
    GGML_TYPE_Q6_K    = 14,
    GGML_TYPE_Q8_K    = 15,
    GGML_TYPE_IQ2_XXS = 16,
    GGML_TYPE_IQ2_XS  = 17,
    GGML_TYPE_IQ3_XXS = 18,
    GGML_TYPE_IQ1_S   = 19,
    GGML_TYPE_IQ4_NL  = 20,
    GGML_TYPE_IQ3_S   = 21,
    GGML_TYPE_IQ2_S   = 22,
    GGML_TYPE_IQ4_XS  = 23,
    GGML_TYPE_I8      = 24,
    GGML_TYPE_I16     = 25,
    GGML_TYPE_I32     = 26,
    GGML_TYPE_I64     = 27,
    GGML_TYPE_F64     = 28,
    GGML_TYPE_IQ1_M   = 29,
    GGML_TYPE_BF16    = 30,
    // GGML_TYPE_Q4_0_4_4 = 31, support has been removed from gguf files
    // GGML_TYPE_Q4_0_4_8 = 32,
    // GGML_TYPE_Q4_0_8_8 = 33,
    GGML_TYPE_TQ1_0   = 34,
    GGML_TYPE_TQ2_0   = 35,
    // GGML_TYPE_IQ4_NL_4_4 = 36,
    // GGML_TYPE_IQ4_NL_4_8 = 37,
    // GGML_TYPE_IQ4_NL_8_8 = 38,
    GGML_TYPE_COUNT   = 39,
};

// precision
enum ggml_prec {
    GGML_PREC_DEFAULT,
    GGML_PREC_F32,
};

// model file types
enum ggml_ftype {
    GGML_FTYPE_UNKNOWN        = -1,
    GGML_FTYPE_ALL_F32        = 0,
    GGML_FTYPE_MOSTLY_F16     = 1,  // except 1d tensors
    GGML_FTYPE_MOSTLY_Q4_0    = 2,  // except 1d tensors
    GGML_FTYPE_MOSTLY_Q4_1    = 3,  // except 1d tensors
    GGML_FTYPE_MOSTLY_Q4_1_SOME_F16 = 4, // tok_embeddings.weight and output.weight are F16
    GGML_FTYPE_MOSTLY_Q8_0    = 7,  // except 1d tensors
    GGML_FTYPE_MOSTLY_Q5_0    = 8,  // except 1d tensors
    GGML_FTYPE_MOSTLY_Q5_1    = 9,  // except 1d tensors
    GGML_FTYPE_MOSTLY_Q2_K    = 10, // except 1d tensors
    GGML_FTYPE_MOSTLY_Q3_K    = 11, // except 1d tensors
    GGML_FTYPE_MOSTLY_Q4_K    = 12, // except 1d tensors
    GGML_FTYPE_MOSTLY_Q5_K    = 13, // except 1d tensors
    GGML_FTYPE_MOSTLY_Q6_K    = 14, // except 1d tensors
    GGML_FTYPE_MOSTLY_IQ2_XXS = 15, // except 1d tensors
    GGML_FTYPE_MOSTLY_IQ2_XS  = 16, // except 1d tensors
    GGML_FTYPE_MOSTLY_IQ3_XXS = 17, // except 1d tensors
    GGML_FTYPE_MOSTLY_IQ1_S   = 18, // except 1d tensors
    GGML_FTYPE_MOSTLY_IQ4_NL  = 19, // except 1d tensors
    GGML_FTYPE_MOSTLY_IQ3_S   = 20, // except 1d tensors
    GGML_FTYPE_MOSTLY_IQ2_S   = 21, // except 1d tensors
    GGML_FTYPE_MOSTLY_IQ4_XS  = 22, // except 1d tensors
    GGML_FTYPE_MOSTLY_IQ1_M   = 23, // except 1d tensors
    GGML_FTYPE_MOSTLY_BF16    = 24, // except 1d tensors
};
```
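Because `enum ggml_type` values are frozen for backward compatibility, removed types leave permanent holes (4, 5, 31-33, 36-38), and a raw type id read from a GGUF file maps directly to a name. A small Python mirror of the enum, for illustration only (the official `gguf` Python package ships its own `GGMLQuantizationType` enum):

```python
from enum import IntEnum

class GGMLType(IntEnum):
    """Mirror of enum ggml_type; ids 4, 5, 31-33 and 36-38 are retired and left unassigned."""
    F32 = 0; F16 = 1; Q4_0 = 2; Q4_1 = 3
    Q5_0 = 6; Q5_1 = 7; Q8_0 = 8; Q8_1 = 9
    Q2_K = 10; Q3_K = 11; Q4_K = 12; Q5_K = 13; Q6_K = 14; Q8_K = 15
    IQ2_XXS = 16; IQ2_XS = 17; IQ3_XXS = 18; IQ1_S = 19; IQ4_NL = 20
    IQ3_S = 21; IQ2_S = 22; IQ4_XS = 23
    I8 = 24; I16 = 25; I32 = 26; I64 = 27; F64 = 28
    IQ1_M = 29; BF16 = 30
    TQ1_0 = 34; TQ2_0 = 35

assert GGMLType(12).name == "Q4_K"
assert GGMLType(30).name == "BF16"
assert 4 not in {int(t) for t in GGMLType}  # GGML_TYPE_Q4_2 was removed
```

This is also why the comment in ggml.h insists on appending new types at the end: reusing a retired id would silently reinterpret old GGUF files.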

3. `Q#_K_M` and `Q#_K`

https://netraneupane.medium.com/hands-on-llms-quantization-a4c7ab1421c2

In the context of llama.cpp, Q4_K_M refers to a specific type of k-means quantization method. The naming convention is as follows:

  • Q stands for Quantization.
  • 4 indicates the number of bits used in the quantization process.
  • K refers to the use of k-means clustering in the quantization.
  • M represents the size of the quantized model (S = Small, M = Medium, L = Large).

Similarly, Q2_K refers to a specific type of k-means quantization. The naming convention is as follows:

  • Q stands for Quantization.
  • 2 indicates the number of bits used in the quantization process.
  • K refers to the use of k-means clustering in the quantization.
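The convention above can be captured mechanically. Here is a hypothetical helper (the function name and regex are my own, not part of llama.cpp) that splits a quant label into its parts:

```python
import re

# Hypothetical parser for quant labels such as "Q4_K_M", "Q2_K", "Q8_0".
_QUANT_RE = re.compile(r"^Q(?P<bits>\d)_(?P<variant>[A-Z0-9])(?:_(?P<size>[SML]))?$")

def parse_quant_name(name: str) -> dict:
    m = _QUANT_RE.match(name)
    if not m:
        raise ValueError(f"not a Q-style quant name: {name!r}")
    return {
        "bits": int(m.group("bits")),   # bit width used in the quantization
        "variant": m.group("variant"),  # "K" = k-quant family, "0"/"1" = legacy round-to-nearest
        "size": m.group("size"),        # S/M/L model-size suffix, or None for plain Q#_K
    }

assert parse_quant_name("Q4_K_M") == {"bits": 4, "variant": "K", "size": "M"}
assert parse_quant_name("Q2_K")["size"] is None
assert parse_quant_name("Q8_0")["variant"] == "0"
```

The S/M/L suffix only exists for the K family: within the same bit width, the M and L mixes keep a few sensitive tensors (such as attention and output weights) at a higher-precision type, trading file size for quality.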

References

[1] Yongqiang Cheng, https://yongqiang.blog.csdn.net/
[2] huggingface/gguf, https://github.com/huggingface/huggingface.js/tree/main/packages/gguf
[3] llama.cpp, https://github.com/ggerganov/llama.cpp
[4] k-quants, https://github.com/ggerganov/llama.cpp/pull/1684
