llama.cpp: pinning execution to a specific GPU to fix a ROCm runtime error

Published 2025/11/19 10:23:25 — source: https://www.cnblogs.com/taozebra/p/19241106

In the previous post, llama.cpp was successfully compiled against ROCm 7.0.2 and ran without issue.

Command:

./llama-server -m ~/.lmstudio/models/huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated/ggml-model-Q4_K_M.gguf --port 8080
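Once the server is up, its built-in health endpoint gives a quick smoke test (a sketch assuming the default 127.0.0.1 bind and port from the command above):

```shell
# llama-server exposes a /health endpoint; a running instance answers
# with a small JSON status body.
curl -s http://127.0.0.1:8080/health
```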

After a recent reboot, however, the same command started failing, which was puzzling.

Log output:

zt@zt:~/Downloads/llama.cpp/build/bin$ ./llama-server -m ~/.lmstudio/models/huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated/ggml-model-Q4_K_M.gguf --main-gpu 0 --port 8081
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Radeon Graphics, gfx90c:xnack- (0x90c), VMM: no, Wave Size: 64
main: setting n_parallel = 4 and kv_unified = true (add -kvu to disable this)
build: 0 (unknown) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8081, http threads: 11
main: loading model
srv load_model: loading model '/home/zt/.lmstudio/models/huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated/ggml-model-Q4_K_M.gguf'
llama_model_load_from_file_impl: using device ROCm0 (AMD Instinct MI50/MI60) (0000:03:00.0) - 32732 MiB free
llama_model_load_from_file_impl: using device ROCm1 (AMD Radeon Graphics) (0000:0a:00.0) - 15663 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 707 tensors from /home/zt/.lmstudio/models/huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated/ggml-model-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3vl
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Huihui Qwen3 VL 32B Thinking Abliterated
llama_model_loader: - kv 3: general.finetune str = Thinking-abliterated
llama_model_loader: - kv 4: general.basename str = Huihui-Qwen3-VL
llama_model_loader: - kv 5: general.size_label str = 32B
llama_model_loader: - kv 6: qwen3vl.block_count u32 = 64
llama_model_loader: - kv 7: qwen3vl.context_length u32 = 262144
llama_model_loader: - kv 8: qwen3vl.embedding_length u32 = 5120
llama_model_loader: - kv 9: qwen3vl.feed_forward_length u32 = 25600
llama_model_loader: - kv 10: qwen3vl.attention.head_count u32 = 64
llama_model_loader: - kv 11: qwen3vl.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: qwen3vl.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 13: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: qwen3vl.attention.key_length u32 = 128
llama_model_loader: - kv 15: qwen3vl.attention.value_length u32 = 128
llama_model_loader: - kv 16: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
llama_model_loader: - kv 17: qwen3vl.n_deepstack_layers u32 = 3
llama_model_loader: - kv 18: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 19: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 20: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 21: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 22: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 26: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 27: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - kv 29: general.file_type u32 = 15
llama_model_loader: - type f32: 257 tensors
llama_model_loader: - type q4_K: 385 tensors
llama_model_loader: - type q6_K: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 18.40 GiB (4.82 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3vl
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 5120
print_info: n_embd_inp = 20480
print_info: n_layer = 64
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 25600
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 40
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: mrope sections = [24, 20, 20, 0]
print_info: model type = 32B
print_info: model params = 32.76 B
print_info: general.name = Huihui Qwen3 VL 32B Thinking Abliterated
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 64 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 417.30 MiB
load_tensors: ROCm0 model buffer size = 12180.82 MiB
load_tensors: ROCm1 model buffer size = 6242.83 MiB
................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 2.32 MiB
llama_kv_cache: ROCm0 KV buffer size = 704.00 MiB
llama_kv_cache: ROCm1 KV buffer size = 320.00 MiB
llama_kv_cache: size = 1024.00 MiB ( 4096 cells, 64 layers, 4/1 seqs), K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: Flash Attention was auto, set to enabled
llama_context: ROCm0 compute buffer size = 262.06 MiB
llama_context: ROCm1 compute buffer size = 362.82 MiB
llama_context: ROCm_Host compute buffer size = 42.08 MiB
llama_context: graph nodes = 2247
llama_context: graph splits = 3
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/home/zt/Downloads/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:89: ROCm error
ggml_cuda_compute_forward: MUL_MAT failed
ROCm error: invalid device function
current device: 1, in function ggml_cuda_compute_forward at /home/zt/Downloads/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2727
err
[New LWP 65309]
[New LWP 65312]
[New LWP 65313]
[New LWP 65314]
[New LWP 65315]
[New LWP 65316]
[New LWP 65317]
[New LWP 65318]
[New LWP 65319]
[New LWP 65320]
[New LWP 65321]
[New LWP 65322]
[New LWP 65323]
[New LWP 65324]
[New LWP 65373]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000079b384eea42f in __GI___wait4 (pid=65457, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x000079b384eea42f in __GI___wait4 (pid=65457, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x000079b3855728ab in ggml_print_backtrace () from /home/zt/Downloads/llama.cpp/build/bin/libggml-base.so.0
#2 0x000079b385572a42 in ggml_abort () from /home/zt/Downloads/llama.cpp/build/bin/libggml-base.so.0
#3 0x000079b384065fe2 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/zt/Downloads/llama.cpp/build/bin/libggml-hip.so.0
#4 0x000079b38406da9f in evaluate_and_capture_cuda_graph(ggml_backend_cuda_context*, ggml_cgraph*, bool&, bool&, bool&) () from /home/zt/Downloads/llama.cpp/build/bin/libggml-hip.so.0
#5 0x000079b38406b1af in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/zt/Downloads/llama.cpp/build/bin/libggml-hip.so.0
#6 0x000079b38558ee5f in ggml_backend_sched_graph_compute_async () from /home/zt/Downloads/llama.cpp/build/bin/libggml-base.so.0
#7 0x000079b3856a3581 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/zt/Downloads/llama.cpp/build/bin/libllama.so.0
#8 0x000079b3856a392d in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/zt/Downloads/llama.cpp/build/bin/libllama.so.0
#9 0x000079b3856aa2af in llama_context::decode(llama_batch const&) () from /home/zt/Downloads/llama.cpp/build/bin/libllama.so.0
#10 0x000079b3856ab150 in llama_decode () from /home/zt/Downloads/llama.cpp/build/bin/libllama.so.0
#11 0x000061b8a78a0745 in common_init_from_params(common_params&) ()
#12 0x000061b8a779f0a5 in server_context::load_model(common_params const&) ()
#13 0x000061b8a77374b5 in main ()
[Inferior 1 (process 65296) detached]
Aborted (core dumped)

From these lines of the log:

load_tensors: offloading 64 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 417.30 MiB
load_tensors: ROCm0 model buffer size = 12180.82 MiB
load_tensors: ROCm1 model buffer size = 6242.83 MiB

together with:

ROCm error: invalid device function
current device: 1, in function ggml_cuda_compute_forward at /home/zt/Downloads/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2727
err

it is clear that Device 1 was being used for compute, but Device 1 is the AMD CPU's integrated GPU (gfx90c), which obviously cannot share the work. In ROCm/HIP, "invalid device function" means no kernel was compiled for that device's architecture — gfx90c was evidently not among this build's GPU targets. Note also that --main-gpu 0 alone (as used in the log above) did not help: by default llama.cpp still splits layers across all visible devices, which is why 6242.83 MiB landed in the ROCm1 buffer.
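To double-check which device index maps to which chip, the standard ROCm tools can list each agent's architecture (a sketch assuming a default ROCm install with `rocminfo` and `rocm-smi` on PATH):

```shell
# List the gfx architecture string of every ROCm GPU agent.
# On this machine the MI50 reports gfx906 and the iGPU reports gfx90c.
rocminfo | grep -E "Name:\s+gfx"

# Per-device VRAM usage, to see where model buffers actually landed.
rocm-smi --showmeminfo vram
```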

Solution:

Add launch flags that pin execution to Device 0:

--n-gpu-layers 9999 --main-gpu 0 --split-mode none

Here --n-gpu-layers 9999 offloads all layers, --main-gpu 0 selects the MI50 as the primary device, and --split-mode none is the key part: it disables splitting the model across multiple GPUs, so nothing is placed on the iGPU.

Full launch command:

./llama-server -m ~/.lmstudio/models/huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated/ggml-model-Q4_K_M.gguf --n-gpu-layers 9999 --main-gpu 0 --split-mode none --port 8081
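Another option (not tested in this post) is to hide the iGPU from the runtime entirely, so llama.cpp never enumerates it. `HIP_VISIBLE_DEVICES` takes a comma-separated list of device indices; `ROCR_VISIBLE_DEVICES` does the same one level lower in the stack:

```shell
# Expose only device 0 (the MI50) to the HIP runtime; ggml will then
# report a single ROCm device, and no --split-mode flag is needed.
HIP_VISIBLE_DEVICES=0 ./llama-server \
  -m ~/.lmstudio/models/huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated/ggml-model-Q4_K_M.gguf \
  --n-gpu-layers 9999 --port 8081
```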

 
