[“星瑞” O6 Review]: Comparing llama.cpp CPU speed across build optimization options


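Preface

As large-model applications spread to more scenarios, Arm CPUs are becoming increasingly important for large-model inference. Their performance, power efficiency, and architectural compatibility help bring large models to a wider range of deployments.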

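1. About KleidiAI

Arm Kleidi targets exactly this use case: it provides performance optimization for AI inference workloads running on Arm CPUs. KleidiAI is a lightweight, high-performance, open-source set of Arm routines designed for AI acceleration. The library provides matrix-multiplication kernels optimized for hardware features such as SME, i8mm, and dot-product instructions. It has been integrated into recent versions of mainstream on-device AI frameworks, including ExecuTorch, llama.cpp, LiteRT (via XNNPACK), and MediaPipe, so that millions of developers can pick up a significant AI performance gain with little or no extra work (in llama.cpp it still has to be enabled at build time, see section 3.5).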
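Since the KleidiAI kernels depend on specific CPU features, it can help to check what the board actually reports before building. The following is a minimal sketch, assuming a Linux/arm64 system where /proc/cpuinfo has a "Features" line and uses the kernel's usual flag names (asimddp for dot product, i8mm, sve, sve2, sme). The function name list_arm_features is made up for this example.

```shell
#!/usr/bin/env bash
# list_arm_features FLAGS
# Print, for each flag relevant to the kernels discussed above, whether it
# appears as a whole word in the space-separated FLAGS string.
list_arm_features() {
    local flags=" $1 "   # pad with spaces so whole-word matching works
    local f
    for f in asimddp i8mm sve sve2 sme; do
        case "$flags" in
            *" $f "*) echo "$f: yes" ;;
            *)        echo "$f: no" ;;
        esac
    done
}

# On the board itself: take the first "Features" line from /proc/cpuinfo.
# (On non-Arm machines there is no such line and every flag reports "no".)
if [ -r /proc/cpuinfo ]; then
    flags=$(grep -m1 -i '^Features' /proc/cpuinfo | cut -d: -f2 || true)
    list_arm_features "$flags"
fi
```

A "no" for a flag only means the kernel does not advertise it; which kernels llama.cpp actually selects is decided by the library at build time and run time.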

Here we compare, for one and the same model, how much speedup different CPU build-time optimization options bring.

2. Installing dependencies

sudo apt install cmake libcurl4-openssl-dev

Download the code

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
## Switch to the version I tested (optional)
git checkout b5195

3. Measuring different build-time optimization options

3.1 No optimization options

cmake -B build
cmake --build build --config Release -j

3.2 Download / convert / quantize the model

Download the model from https://www.modelscope.cn/models/Qwen/Qwen2.5-3B-Instruct/files

Convert

pip install -r requirements/requirements-convert_hf_to_gguf.txt
python convert_hf_to_gguf.py /home/radxa/.cache/modelscope/hub/models/Qwen/Qwen2.5-3B-Instruct

Quantize

The model weights can be quantized to Q4_0:

./build/bin/llama-quantize /home/radxa/.cache/modelscope/hub/models/Qwen/Qwen2.5-3B-Instruct/Qwen2.5-3B-Instruct-F16.gguf asserts/Qwen2.5-3B-Instruct-Q4_0.gguf Q4_0

Verify that the model works correctly

taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m asserts/Qwen2.5-3B-Instruct-Q4_0.gguf -c 4096 -t 8 --conversation

Output

> hello
Hello! How can I assist you today? Do you have any questions or topics you'd like to discuss?

> 
llama_perf_sampler_print:    sampling time =       2.79 ms /    32 runs   (    0.09 ms per token, 11477.76 tokens per second)
llama_perf_context_print:        load time =     498.94 ms
llama_perf_context_print: prompt eval time =     592.82 ms /     9 tokens (   65.87 ms per token,    15.18 tokens per second)
llama_perf_context_print:        eval time =    1711.00 ms /    22 runs   (   77.77 ms per token,    12.86 tokens per second)
llama_perf_context_print:       total time =    6498.13 ms /    31 tokens
Interrupted by user
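The source does not explain why taskset pins the run to cores 0,5,6,7,8,9,10,11. One plausible reason (an assumption, not stated by the author) is that these are the higher-clocked cores of the 12-core SoC. A rough way to check on any Linux board is to list each core's maximum frequency from sysfs; the helper name list_cpu_max_freq is hypothetical, and the cpufreq files may be absent on some kernels.

```shell
#!/usr/bin/env bash
# list_cpu_max_freq [SYSFS_CPU_DIR]
# Print "cpuN <max frequency in kHz>" for every core that exposes cpufreq
# information, sorted with the fastest cores first.
list_cpu_max_freq() {
    local root="${1:-/sys/devices/system/cpu}"
    local d f
    for d in "$root"/cpu[0-9]*; do
        f="$d/cpufreq/cpuinfo_max_freq"
        # Skip cores (or non-matching globs) without a readable cpufreq entry.
        [ -r "$f" ] && echo "${d##*/} $(cat "$f")"
    done | sort -k2,2nr
}

list_cpu_max_freq
```

The core numbers at the top of the list are the natural candidates for the taskset -c argument.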

3.3 Benchmark with no optimization options

taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-bench -m asserts/Qwen2.5-3B-Instruct-Q4_0.gguf -p 128 -n 128 -t 8

Results

model	size	params	backend	threads	test	t/s
qwen2 3B Q4_0	1.69 GiB	3.09 B	CPU	8	pp128	17.16 ± 0.08
qwen2 3B Q4_0	1.69 GiB	3.09 B	CPU	8	tg128	12.85 ± 0.09

3.4 Enabling Armv9 optimizations

Build

cmake -B build_armv9 -DCMAKE_CXX_FLAGS="-march=armv9-a" -DCMAKE_C_FLAGS="-march=armv9-a"
cmake --build build_armv9 --config Release -j

Benchmark command: taskset -c 0,5,6,7,8,9,10,11 ./build_armv9/bin/llama-bench -m asserts/Qwen2.5-3B-Instruct-Q4_0.gguf -p 128 -n 128 -t 8

Results

model	size	params	backend	threads	test	t/s
qwen2 3B Q4_0	1.69 GiB	3.09 B	CPU	8	pp128	84.39 ± 0.80
qwen2 3B Q4_0	1.69 GiB	3.09 B	CPU	8	tg128	18.76 ± 0.22

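3.5 Enabling KleidiAI optimizations

KleidiAI is already integrated into the llama.cpp backend; you only need to pass the right option at build time.

The officially documented build command gave me an error: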

cmake -B build_kle -DGGML_CPU_KLEIDIAI=ON
cmake --build build_kle --config Release -j

Error:

/home/radxa/1_AI_models/llama.cpp/ggml/src/ggml-cpu/kleidiai/kernels.cpp:22:30: error: zero-size array ‘gemm_gemv_kernels’
   22 | static ggml_kleidiai_kernels gemm_gemv_kernels[] = {
      |                              ^~~~~~~~~~~~~~~~~
gmake[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/build.make:272: ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/kleidiai/kernels.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....

So I switched to the clang/clang++ compiler:

## Install dependencies
sudo apt install clang libomp-dev
## Build
cmake -B build_kle -DGGML_CPU_KLEIDIAI=ON -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++
cmake --build build_kle --config Release -j

Benchmark command: taskset -c 0,5,6,7,8,9,10,11 ./build_kle/bin/llama-bench -m asserts/Qwen2.5-3B-Instruct-Q4_0.gguf -p 128 -n 128 -t 8

Results

model	size	params	backend	threads	test	t/s
qwen2 3B Q4_0	1.69 GiB	3.09 B	CPU	8	pp128	129.53 ± 6.59
qwen2 3B Q4_0	1.69 GiB	3.09 B	CPU	8	tg128	16.25 ± 0.18
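To put the three llama-bench tables side by side, the ratios against the unoptimized build can be computed directly. This is a small sketch using awk for the floating-point division; the t/s values are copied from the tables in sections 3.3 to 3.5, and the helper name speedup is made up for this example.

```shell
#!/usr/bin/env bash
# speedup BASE NEW: print NEW / BASE rounded to two decimals.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", new / base }'
}

# Mean t/s values from the llama-bench tables (baseline -> optimized build).
echo "pp128 armv9    vs baseline: $(speedup 17.16 84.39)x"
echo "pp128 kleidiai vs baseline: $(speedup 17.16 129.53)x"
echo "tg128 armv9    vs baseline: $(speedup 12.85 18.76)x"
echo "tg128 kleidiai vs baseline: $(speedup 12.85 16.25)x"
```

With these numbers, prompt processing (pp128) comes out at about 4.92x for the Armv9 build and 7.55x for the KleidiAI build, while token generation (tg128) comes out at about 1.46x and 1.26x. In this run the KleidiAI build is therefore faster at prefill but somewhat slower at decode than the Armv9 build.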

The output contains "load_tensors: CPU_KLEIDIAI model buffer size = 1488.38 MiB" and "KLEIDIAI = 1", which shows that the build option was enabled correctly.
Full output:

build: 5195 (2d451c80) with cc (Debian 12.2.0-14) 12.2.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 35 key-value pairs and 434 tensors from asserts/Qwen2.5-3B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = qwen-research
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen2.5 3B
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-3B
llama_model_loader: - kv  13:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  14:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  15:                          qwen2.block_count u32              = 36
llama_model_loader: - kv  16:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  17:                     qwen2.embedding_length u32              = 2048
llama_model_loader: - kv  18:                  qwen2.feed_forward_length u32              = 11008
llama_model_loader: - kv  19:                 qwen2.attention.head_count u32              = 16
llama_model_loader: - kv  20:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  21:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 2
llama_model_loader: - type  f32:  181 tensors
llama_model_loader: - type q4_0:  252 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 1.69 GiB (4.71 BPW) 
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 2048
print_info: n_layer          = 36
print_info: n_head           = 16
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 11008
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 3.09 B
print_info: general.name     = Qwen2.5 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  1720.63 MiB
load_tensors: CPU_KLEIDIAI model buffer size =  1488.38 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1
init:        CPU KV buffer size =   144.00 MiB
llama_context: KV self size  =  144.00 MiB, K (f16):   72.00 MiB, V (f16):   72.00 MiB
llama_context:        CPU compute buffer size =   300.75 MiB
llama_context: graph nodes  = 1338
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

system_info: n_threads = 8 (n_threads_batch = 8) / 12 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | KLEIDIAI = 1 | AARCH64_REPACK = 1 |

main: interactive mode on.
sampler seed: 3948005486
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT
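The two markers mentioned before the log can also be checked mechanically once the output has been saved to a file. This is a minimal sketch; the function name check_kleidiai_log and the file name run.log are placeholders chosen for this example, and the patterns are simply the two strings seen in the log above.

```shell
#!/usr/bin/env bash
# check_kleidiai_log LOGFILE
# Report whether a saved llama-cli log shows both KleidiAI markers:
# the "KLEIDIAI = 1" entry in system_info and the CPU_KLEIDIAI buffer line.
check_kleidiai_log() {
    local log="$1"
    if grep -q 'KLEIDIAI = 1' "$log" &&
       grep -q 'CPU_KLEIDIAI model buffer size' "$log"; then
        echo "KleidiAI active"
    else
        echo "KleidiAI not detected"
        return 1
    fi
}

# Example use after capturing a run (llama-cli logs go to stderr):
#   ./build_kle/bin/llama-cli -m asserts/Qwen2.5-3B-Instruct-Q4_0.gguf -c 4096 -t 8 --conversation 2> run.log
#   check_kleidiai_log run.log
```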

Problem
However, the executable built this way produces broken model output when tested; this still needs investigation.

./build_kle/bin/llama-cli -m asserts/Qwen2.5-3B-Instruct-Q4_0.gguf -c 4096 -t 8 --conversation
## Output
> hello
共和国owan続きMAR composition composition分 mutationorphAug AovOransition""""""""""" "" "" "amyamy.tom Entriesreta_suffix"卫生ventions警MessageBox
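One way to start narrowing this down is to run the same fixed prompt through the Armv9 build and the KleidiAI build with greedy sampling and compare the captured text. The script below is only a sketch under assumptions: the prompt, the output file names, and the helper names run_prompt and same_output are invented for this example, and the llama-cli options used (-no-cnv, --temp, --seed, -n, -p) are assumed to behave in the tested version as they do in recent llama.cpp releases.

```shell
#!/usr/bin/env bash
MODEL=asserts/Qwen2.5-3B-Instruct-Q4_0.gguf

# run_prompt BUILD_DIR OUT_FILE
# Greedy-decode a short fixed prompt with the llama-cli from BUILD_DIR and
# save the generated text (stdout only; logs on stderr are discarded).
run_prompt() {
    "./$1/bin/llama-cli" -m "$MODEL" -t 8 -no-cnv --temp 0 --seed 1 \
        -n 32 -p "Hello, my name is" > "$2" 2>/dev/null
}

# same_output FILE_A FILE_B: say whether two captured outputs are byte-identical.
same_output() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}

# Only attempt the comparison when both builds from sections 3.4 and 3.5 exist.
if [ -x ./build_armv9/bin/llama-cli ] && [ -x ./build_kle/bin/llama-cli ]; then
    run_prompt build_armv9 out_armv9.txt
    run_prompt build_kle   out_kle.txt
    same_output out_armv9.txt out_kle.txt
fi
```

Different kernels can round slightly differently, so "different" alone does not prove a bug; the useful signal is whether out_kle.txt is readable text or the kind of garbage shown above.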