7GB显存如何部署bf16精度的DeepSeek-R1 70B大模型？

构建RAG混合开发---PythonAI+JavaEE+Vue.js前端的实践-CSDN博客

服务容错治理框架resilience4j&sentinel基础应用---微服务的限流/熔断/降级解决方案-CSDN博客

conda管理python环境-CSDN博客

快速搭建对象存储服务 - Minio，并解决临时地址暴露ip、短链接请求改变浏览器地址等问题-CSDN博客

大模型LLMs的MCP入门-CSDN博客

使用LangGraph构建多代理Agent、RAG-CSDN博客

大模型LLMs框架Langchain之链详解_langchain.llms.base.llm详解-CSDN博客

大模型LLMs基于Langchain+FAISS+Ollama/Deepseek/Qwen/OpenAI的RAG检索方法以及优化_faiss ollamaembeddings-CSDN博客

大模型LLM基于PEFT的LoRA微调详细步骤---第二篇：环境及其详细流程篇-CSDN博客

大模型LLM基于PEFT的LoRA微调详细步骤---第一篇：模型下载篇_vocab.json merges.txt资源文件下载-CSDN博客使用docker-compose安装Redis的主从+哨兵模式_使用docker部署redis 一主一从一哨兵模式 csdn-CSDN博客

docker-compose安装canal并利用rabbitmq同步多个mysql数据_docker-compose canal-CSDN博客

写在前文

效果展示

输出日志展示

内存使用情况

编辑curl测试

编辑页面请求流式效果

名词解释

显存计算方法

技术体系

为什么选择vLLM

Step1：租用服务器

Step2、环境部署

扩容内存

使用无卡模式启动

更新/下载环境

Step3、下载模型

Step4、脚本参数

32GB显存配置

7GB显存的配置

Step5、启动

Step6、日志分析

正常日志

enforce-eager对性能的影响

编辑异常日志处理

CUDA out of memory.

KV缓存不够(No available memory for the cache blocks)

内存占用

写在前文

以下在AutoDL算力云平台的显卡RTX3080 20G * 2 和 vGPU-32GB * 1测试通过；

---- 没有用真正的7GB显卡，用的是显存32GB的服务器，但是通过“nvidia-smi- l 2”显示显存只用了7GB；

---- 想测试3090ti和4090的，没找到合适的服务器...。

本文核心在于：在保持数值稳定性的前提下通过降低精度bfloat16将模型显存需求降低50%，再执行INT4对称量化，将模型体积压缩至原始尺寸的25%（140G→35G），降低资源分配(通过牺牲一定的生成性能、推理速度、并发数、上下文token长度、生成token长度、每批次处理的序列长度等)将模型所需显存极致压缩，然后启用vLLM PageAttention显存管理引擎，实现显存碎片率降低80%+，关闭CUDA Graph模式，解除计算图固化约束，提升显存复用灵活性。最后结合GPU-CPU交换空间，设置虚拟内存功能，大幅度降低对显存的需求，以达到单机单卡7GB即可启动DeepSeek-R1 70B 的大模型。

效果展示

输出日志展示

输出token：0.4token/s，貌似有点慢~~~[😂，不过可以理解，毕竟只花了7GB显存]

内存使用情况

curl测试

页面请求流式效果

PS：真的好慢....自己玩玩就行了。理论是一样的。

名词解释

DeepSeek-R1 70B FP16：

B：指billion，10亿；70B即70*10=700亿参数；

FP16，大模型精度是16精度，除了FP16以外还有双精度(FP64)、单精度(FP32、TF32)、半精度(FP16、BF16)、8位精度(FP8)、4精度(FP4、NF4)；量化精度(INT8、INT4)

32bit就是1位符号位+8位指数位+23位尾数位，可以表达6-7个有效的小数位；
而16bit可用bfloat16/float16，bfloat16，1位符号位+8位指数位+7位尾数位以牺牲位数位为代价指数位与32bit保持一致，避免在训练时梯度溢出/下降(梯度爆炸)；
float16呢，就是1位符号位+5位指数位+10位尾数位，占用的内存都是32bit的一半；
而8bit、4bit就是将高精度的32bit通过一定算法(比如缩放因子、零点近似原数值等)映射到8bit/4bit上面；
而Q4量化，是Lora中对4bit量化中的一种方法；

显存计算方法

1字节=8位，也就是说，

32bit精度的1参数代表4字节byte；16bit精度的1参数代表2byte；8bit精度的1参数代表1byte；4bit精度的1参数代表0.5byte；显存1GB≈10亿字节；

不采用任何优化的情况下，DeepSeek-R1所需要的显存：

        要部署完整的DeepSeek-R1 70B FP16精度或者4bit量化的话，FP16参数体积70*10*2byte=1400亿字节=140GB，即显存表面最低需140GB显存，加上推理过程损耗(实际情况需要增加20~30%用于计算中间激活值比如KV缓存等)以及不同模型版本FP16精度的不同，实际所需显存一般需要在参数上面增加20~30%，实际显存占用168~210G；
        Q8参数体积70GB，显存占用84~105GB；
        Q4参数体积35GB，显存占用42~42.5GB；

如果要用于微调/训练，那么需要额外存储优化器状态、梯度、参数等信息，显存可能超过3~4倍也就是400~600GB以上。

所以，如果我们在未优化的情况下，要部署一个DeepSeek-R1 70B模型的话，最低显存都需要160G以上；但是我们可以通过4bit量化、PagedAttention、KV缓存、前缀缓存、虚拟/临时内存等优化方法优化，后可以在32GB的内存上运行。

技术体系

云服务器AutoDL + 大模型DeepSeek-R1 70B + 推理框架vLLM + 量化框架bitsandbytes

为什么选择vLLM

vLLM专用于服务器-vLLM只有Linux版本，高吞吐量服务(性能较高、运行速度快)--很适合；
拥有量化功能：AWQ(一种量化方法)、FP8、INT4(采用bitsandbytes)
兼容OpenAI相关API --- 可以部署为服务器，像请求ChatGPT的API一样请求，也可以流输出；
支持缓存：前缀缓存、KV缓存

使用 PagedAttention 有效管理注意力键和值内存

可以一键配置张量并行/流水线并行计算，在单卡单节点压力大的情况下，可以迅速切换多卡甚至多节点

.......

Step1：租用服务器

本文采用AutoDL算力云平台；

Step2、环境部署

# 因为本文采用的模型是Deepseek-R1 70B模型，140G大小，所以需要扩容；

扩容内存

使用无卡模式启动

更新/下载环境

1、查看python/pip版本

python --version

pip --version

2、更新pip

python -m pip install --upgrade pip

3、安装依赖

pip install modelscope vllm torch transformers[torch] bitsandbytes

# 本文主要采用vllm推理框架和bitsandbytes量化方法；

# modelscope是用来下载模型的。

下载完成，可以通过“vllm --version”查看vllm版本

4、下载SSH本地代理

启动其中的AutoDL.exe;

Step3、下载模型

from modelscope import snapshot_download
model_names = ["deepseek-ai/DeepSeek-R1-Distill-Llama-70B"]
local_model_dir = r'/root/autodl-tmp/llms' # 自己指定要下载的目录
for model_name in model_names:model_dir = snapshot_download(model_name,  # 要下载的模型名称cache_dir=local_model_dir,  # 下载到那个目录revision='master',ignore_file_pattern='.pth',  # 配置过滤，不下载“.pth”的原始文件。只下载“.safetensors”的模型文件)print(f"{model_name} is downloaded to {model_dir}")

剩下时间就是等待...

Step4、脚本参数

本文采用的是vllm的config结合.yaml的形式，将启动参数保存在了一个文件中；

新建文件config.yaml，内容如下：

修改参数：dtype(模型数据类型)、quantization(量化类型)、cpu-offload-gb(虚拟内存，也可以把它看成临时内存)、gpu-memory-utilization(预分配内存)、enforce-eager(是否使用混合模式)；

其中核心通过定义“cpu-offload-gb”的大小，即主要作用于推理过程中 KV缓存和临时张量的CPU交换空间。

当显存使用超过 gpu-memory-utilization 阈值时，按 cpu-offload-gb 定义的最大值迁移KV缓存数据到CPU。

比如，我们cpu-offload-gb设置为10G，模型总显存需要20G，其中KV缓存有10G，那么此时我们只需要一个10G显存服务就可以启动这个需要20G显存的模型；

理论上讲，我们要启动DeepSeek-R1 70B模型的话，我们适当调整max-num-batched-tokens、max-model-len、swap-space、max-seq-len-to-capture、block-size结合cpu-offload-gb我们可以在显存仅7GB的显卡上运行---(我在vGPU32G，通过测试，实际显存最低可以达到7173MB，也就是7G作左右显存)

基础参数：model、served-model-name、host、port、api-key，其余参数配置按需配置即可。

32GB显存配置

# config.yaml
model: "/root/autodl-tmp/llms/deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
served-model-name: "DeepSeek-R1" # API中要使用的名称，如果不指定那么在通过API调用的时候必须是和“model”保持一致
host: "0.0.0.0"               # host 0.0.0.0  \  # 服务器监听的主机地址，0.0.0.0表示接受所有网络接口的连接。
port: 6333
api-key: "ltingzx"                # API密钥，用于请求验证，确保服务安全。
# 未开时/开启bfloat16/half，29500M；开启bitsandbytes--->850M；
# vLLM占用的内存由gpu-memory-utilization决定，所以无论如何我们设置dtype或者quantization时，系统都是“显存*gpu-memory-utilization”的大小X
# 此时如果我们的max-model-len参数>X，会导致报错；
dtype: "bfloat16"
quantization: "bitsandbytes"    # 量化：加载时bitsandbytes、fp8；预训练并且使用autoawq---要使用awq必须是awq预训练完成的量化模型
gpu-memory-utilization: 0.9    # 降低显存利用率阈值，默认0.9 --- 如果设置0.9，假如我们服务器32G*0.9=28.8G这个就是vLLM预分配的内存大小，无论模型多小，都会预分配；此时如果使用“nvidia-smi -l 2”查看，其中的“Memory-Usage”就是显示的预分配大小；
cpu-offload-gb: 20             # 虚拟内存大小.每个 GPU 要卸载到 CPU 的空间 (GiB)。默认值为 0，表示不卸载
swap-space: 4                   # 每个 GPU 的 CPU 交换空间大小.默认值4
max-num-seqs: 256               # 提高并发处理数（默认: 256）.每次迭代的最大序列数pipeline-parallel-size: 1        #-pp：管道并行阶段数，默认1 --- 即部署在多节点/服务器分布式推理时需要
tensor-parallel-size: 1         #-tp：张量并行副本数，默认1 --- 单节点/服务器上面有多张卡时需要，可以通过“nvidia-smi -L”查看
max-num-batched-tokens: 8192    # 单批次最大token数 
max-model-len: 8192             # 模型上下文长度 --- 默认是4096，此时如果我们将gpu-memory-utilization设置太小，预分配小于参数的话比如4G(使用Qwen-7B测试时，将gpu-memory-utilization设置为0.2会报错)，那么会报错
enable-prefix-caching: true     # 启用前缀缓存
block-size: 32                  # KV缓存块大小 (默认: 16---必须大余16).32减少内存碎片
max-seq-len-to-capture: 8192    # 匹配max-model-len（原4096 → 1024） 图覆盖的最大序列长度 默认8192
#kv-cache-dtype: fp8     # kv缓存的数据类型auto, fp8(CUDA 11.8+), fp8_e5m2(CUDA 11.8+), fp8_e4m3（AMD GPU）。默认采用和模型保持一致的类型。可能报错“ERROR xxx [engine.py:448] AttributeError: 'Tensor' object has no attribute 'bnb_quant_state'”
#calculate-kv-scales:  # 当 kv-cache-dtype 为 fp8 时，启用 k_scale 和 v_scale 的动态计算。如果 calculate-kv-scales 为 false，则将从模型检查点加载比例（如果可用）。否则，比例将默认为 1.0。enforce-eager: false            # 默认False使用eager模式和CUDA图的混合模式来执行操作，必须关闭以启用PagedAttention。CUDA图是PyTorch中用于优化性能的一种技术。 禁用CUDA图（即设置 enforce_eager 为True）可能会影响性能，但可以减少内存需求
#max-parallel-loading-workers:   #多批次顺序加载模型时，并行加载工作器的最大数量。用于避免在使用张量并行和大型模型时出现 RAM OOM。
disable-custom-all-reduce: true # 禁用多GPU自定义通信，自定义AllReduce（减少多GPU显存开销）默认#uvicorn-log-level: "info"
#disable-log-stats: true         # 禁用统计日志记录。
disable-log-requests: true      # 禁用请求日志记录
trust-remote-code: true         # 允许加载远程代码（来自Hugging Face等），对于某些特定模型可能需要。

7GB显存的配置

# config.yaml
model: "xxxxxdeepseek-ai/DeepSeek-R1-Distill-Llama-70B"
served-model-name: "DeepSeek-R1" # API中要使用的名称，如果不指定那么在通过API调用的时候必须是和“model”保持一致
host: "0.0.0.0"               # host 0.0.0.0  \  # 服务器监听的主机地址，0.0.0.0表示接受所有网络接口的连接。
port: 6333
api-key: "xxxxx"                # API密钥，用于请求验证，确保服务安全。
# 未开时/开启bfloat16/half，29500M；开启bitsandbytes--->850M；
# vLLM占用的内存由gpu-memory-utilization决定，所以无论如何我们设置dtype或者quantization时，系统都是“显存*gpu-memory-utilization”的大小X
# 此时如果我们的max-model-len参数>X，会导致报错；
dtype: "bfloat16"
quantization: "bitsandbytes"    # 量化：加载时bitsandbytes、fp8；预训练并且使用autoawq---要使用awq必须是awq预训练完成的量化模型
gpu-memory-utilization: 0.25    # 降低显存利用率阈值，默认0.9 --- 如果设置0.9，假如我们服务器32G*0.9=28.8G这个就是vLLM预分配的内存大小，无论模型多小，都会预分配；此时如果使用“nvidia-smi -l 2”查看，其中的“Memory-Usage”就是显示的预分配大小；
cpu-offload-gb: 60             # 虚拟内存大小.每个 GPU 要卸载到 CPU 的空间 (GiB)。默认值为 0，表示不卸载
swap-space: 20                   # 每个 GPU 的 CPU 交换空间大小.默认值4
max-num-seqs: 8               # 提高并发处理数（默认: 256）.每次迭代的最大序列数pipeline-parallel-size: 1        #-pp：管道并行阶段数，默认1 --- 即部署在多节点/服务器分布式推理时需要
tensor-parallel-size: 1         #-tp：张量并行副本数，默认1 --- 单节点/服务器上面有多张卡时需要，可以通过“nvidia-smi -L”查看
max-num-batched-tokens: 512    # 单批次最大token数 
max-model-len: 512             # 模型上下文长度 --- 默认是4096，此时如果我们将gpu-memory-utilization设置太小，预分配小于参数的话比如4G(使用Qwen-7B测试时，将gpu-memory-utilization设置为0.2会报错)，那么会报错
enable-prefix-caching: true     # 启用前缀缓存
block-size: 16                  # KV缓存块大小 (默认: 16---必须大余16).32减少内存碎片
max-seq-len-to-capture: 512    # 匹配max-model-len（原4096 → 1024） 图覆盖的最大序列长度 默认8192
#kv-cache-dtype: fp8     # kv缓存的数据类型auto, fp8(CUDA 11.8+), fp8_e5m2(CUDA 11.8+), fp8_e4m3（AMD GPU）。默认采用和模型保持一致的类型。可能报错“ERROR xxx [engine.py:448] AttributeError: 'Tensor' object has no attribute 'bnb_quant_state'”
#calculate-kv-scales:  # 当 kv-cache-dtype 为 fp8 时，启用 k_scale 和 v_scale 的动态计算。如果 calculate-kv-scales 为 false，则将从模型检查点加载比例（如果可用）。否则，比例将默认为 1.0。enforce-eager: true            # 默认False使用eager模式和CUDA图的混合模式来执行操作，必须关闭以启用PagedAttention。CUDA图是PyTorch中用于优化性能的一种技术。 禁用CUDA图（即设置 enforce_eager 为True）可能会影响性能，但可以减少内存需求
#max-parallel-loading-workers:   #多批次顺序加载模型时，并行加载工作器的最大数量。用于避免在使用张量并行和大型模型时出现 RAM OOM。
disable-custom-all-reduce: true # 禁用多GPU自定义通信，自定义AllReduce（减少多GPU显存开销）默认#uvicorn-log-level: "info"
#disable-log-stats: true         # 禁用统计日志记录。
disable-log-requests: true      # 禁用请求日志记录
trust-remote-code: true         # 允许加载远程代码（来自Hugging Face等），对于某些特定模型可能需要。

参数详解

基本配置

--host HOST: 指定服务器主机名。
--port PORT: 指定服务器端口号。
--uvicorn-log-level {debug,info,warning,error,critical,trace}: 设置 Uvicorn 的日志级别。
--allow-credentials: 允许跨域请求时携带凭证。
--allowed-origins ALLOWED_ORIGINS: 允许跨域请求的来源。
--allowed-methods ALLOWED_METHODS: 允许跨域请求的方法。
--allowed-headers ALLOWED_HEADERS: 允许跨域请求的头部。
--api-key API_KEY: 如果提供，该服务器将要求在请求头中提供此密钥。

模型配置

--served-model-name SERVED_MODEL_NAME: API 使用的模型名称。如果未指定，则默认为 Hugging Face 模型名称。
--model MODEL: 使用的 Hugging Face 模型名称或路径。
--tokenizer TOKENIZER: 使用的 Hugging Face 分词器名称或路径。
--revision REVISION: 使用的特定模型版本（分支名、标签名或提交 ID）。若未指定，将使用默认版本。
--code-revision CODE_REVISION: 使用 Hugging Face Hub 上的特定模型代码版本。
--tokenizer-revision TOKENIZER_REVISION: 使用的特定分词器版本（分支名、标签名或提交 ID）。
--tokenizer-mode {auto,slow}: 分词器模式。"auto" 使用快速分词器（如有），"slow" 始终使用慢速分词器。
--trust-remote-code: 信任 Hugging Face 远程代码。

模型加载和量化

--download-dir DOWNLOAD_DIR: 下载和加载权重的目录，默认为 Hugging Face 默认缓存目录。
--load-format {auto,pt,safetensors,npcache,dummy}: 模型权重的加载格式。 "auto" 尝试使用 safetensors 格式加载权重，如果不可用则回退到 pytorch bin 格式。
--dtype {auto,half,float16,bfloat16,float,float32}: 模型权重和激活的数据类型。"auto" 对于 FP32 和 FP16 模型使用 FP16 精度，对于 BF16 模型使用 BF16 精度。
--kv-cache-dtype {auto,fp8_e5m2}: 键值缓存存储的数据类型。 "auto" 使用模型数据类型。注意，FP8 仅在 CUDA 版本高于 11.8 时支持。
--max-model-len MAX_MODEL_LEN: 模型上下文长度。如果未指定，将自动从模型中推导。

性能优化

--pipeline-parallel-size PIPELINE_PARALLEL_SIZE, -pp PIPELINE_PARALLEL_SIZE: 管道并行阶段的数量。
--tensor-parallel-size TENSOR_PARALLEL_SIZE, -tp TENSOR_PARALLEL_SIZE: 张量并行副本的数量。
--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS: 使用多个批次顺序加载模型，以避免使用张量并行和大模型时的内存不足。
--gpu-memory-utilization GPU_MEMORY_UTILIZATION: 模型执行器使用的 GPU 内存百分比，范围为 0 到 1。若未指定，将使用默认值 0.9。
--swap-space SWAP_SPACE: 每个 GPU 的 CPU 交换空间大小（以 GiB 为单位）。
--block-size {8,16,32,128}: 令牌块大小。
--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS: 每次迭代的最大批处理令牌数。
--max-num-seqs MAX_NUM_SEQS: 每次迭代的最大序列数。
--max-logprobs MAX_LOGPROBS: 返回的最大日志概率数量。

SSL 配置

--ssl-keyfile SSL_KEYFILE: SSL 密钥文件的路径。
--ssl-certfile SSL_CERTFILE: SSL 证书文件的路径。
--ssl-ca-certs SSL_CA_CERTS: CA 证书文件。
--ssl-cert-reqs SSL_CERT_REQS: 是否需要客户端证书（参见标准库的 SSL 模块）。

其他配置

--root-path ROOT_PATH: 在基于路径的路由代理后面时使用的 FastAPI 根路径。
--middleware MIDDLEWARE: 应用到应用程序的额外 ASGI 中间件。可以接受多个 --middleware 参数，值应为导入路径。
--max-log-len MAX_LOG_LEN: 日志中打印的最大提示字符或提示 ID 数量，默认无限制。
--enable-prefix-caching: 启用自动前缀缓存。
--use-v2-block-manager: 使用 BlockSpaceMangerV2。
--num-lookahead-slots NUM_LOOKAHEAD_SLOTS: 实验性调度配置，必要时进行投机解码。

分布式和异步配置

--worker-use-ray: 使用 Ray 进行分布式服务，当使用超过 1 个 GPU 时将自动设置。
--engine-use-ray: 使用 Ray 在单独的进程中启动 LLM 引擎。
--ray-workers-use-nsight: 如果指定，使用 nsight 对 Ray 工作者进行分析。
--max-cpu-loras MAX_CPU_LORAS: 存储在 CPU 内存中的最大 LoRAs 数量。必须 >= max_num_seqs，默认为 max_num_seqs。
--device {auto,cuda,neuron,cpu}: vLLM 执行的设备类型。
--enable-lora: 如果为 True，启用 LoRA 适配器处理。

图像输入配置

--image-input-type {pixel_values,image_features}: 传递给 vLLM 的图像输入类型，应为 "pixel_values" 或 "image_features"。
--image-token-id IMAGE_TOKEN_ID: 图像令牌的输入 ID。
--image-input-shape IMAGE_INPUT_SHAPE: 给定输入类型的最大图像输入形状（最差内存占用），仅用于 vLLM 的 profile_run。
--image-feature-size IMAGE_FEATURE_SIZE: 沿上下文维度的图像特征大小。

Step5、启动

监控内存：nvidia-smi -l 2

下面是32GB版本

正式启动：vllm serve --config config.yaml

启动成功后可以直接使用：

查看启动模型的信息：curl -X GET "http://localhost:端口/v1/models" -H "Authorization: Bearer config中设置的api-key"

完成会话：
curl -X POST "http://localhost:6333/v1/chat/completions" \-H "Authorization: Bearer ltingzx" \-H "Content-Type: application/json" \-d '{"model": "DeepSeek-R1","messages": [{"role": "system", "content": "你是一个机器人，名字叫“小花”"},{"role": "user", "content": "你叫啥？"}]}'

网页版本：见我上一篇“构建RAG混合开发---PythonAI+JavaEE+Vue.js前端的实践-CSDN博客”修改Python项目中的self.llm如下即可

self.llm = ChatOpenAI(
model='DeepSeek-R1---启动配置中的served-model-name名称，不然要和model保持一致',
api_key='xxxx',
base_url='http://localhost:6333/v1---需要本地代理',
)

Step6、日志分析

正常日志

INFO 05-16 23:06:03 [__init__.py:239] Automatically detected platform cuda.
INFO 05-16 23:06:07 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-16 23:06:07 [api_server.py:1044] args: Namespace(subparser='serve', model_tag=None, config='', host='0.0.0.0', port=6333, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='ltingzx', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/root/autodl-tmp/llms/Qwen/Qwen3-8B', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=4096, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=32, gpu_memory_utilization=0.5, swap_space=4.0, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=5.0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=['Qwen3-8B'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=4096, max_num_seqs=256, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=True, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f36272bfce0>)
INFO 05-16 23:06:14 [config.py:717] This model supports multiple tasks: {'reward', 'score', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 05-16 23:06:14 [config.py:1770] Defaulting to use mp for distributed inference
INFO 05-16 23:06:14 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=4096.
WARNING 05-16 23:06:14 [cuda.py:93] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 05-16 23:06:18 [__init__.py:239] Automatically detected platform cuda.
INFO 05-16 23:06:21 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='/root/autodl-tmp/llms/Qwen/Qwen3-8B', speculative_config=None, tokenizer='/root/autodl-tmp/llms/Qwen/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen3-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
INFO 05-16 23:06:21 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_49a90570'), local_subscribe_addr='ipc:///tmp/faa9cb44-6d06-4d89-9862-759b0f3d52d4', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 05-16 23:06:25 [__init__.py:239] Automatically detected platform cuda.
INFO 05-16 23:06:25 [__init__.py:239] Automatically detected platform cuda.
WARNING 05-16 23:06:28 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f761b89d880>
(VllmWorker rank=0 pid=2874) INFO 05-16 23:06:28 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_9435994d'), local_subscribe_addr='ipc:///tmp/ad5dbedf-8abd-4e6c-94c3-094c812f8cf0', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 05-16 23:06:29 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f6f9e1a0770>
(VllmWorker rank=1 pid=2875) INFO 05-16 23:06:29 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_33628a58'), local_subscribe_addr='ipc:///tmp/bd547b68-a9d8-4798-9c36-144250db6a30', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=2875) INFO 05-16 23:06:29 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=2875) INFO 05-16 23:06:29 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=2874) INFO 05-16 23:06:29 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=2874) INFO 05-16 23:06:29 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=2874) INFO 05-16 23:06:29 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=2875) INFO 05-16 23:06:29 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=2874) WARNING 05-16 23:06:29 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=1 pid=2875) WARNING 05-16 23:06:29 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=2874) INFO 05-16 23:06:29 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_42d23171'), local_subscribe_addr='ipc:///tmp/f32d862a-0a6a-4675-864d-01e99cc7dac8', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=2874) INFO 05-16 23:06:29 [parallel_state.py:1004] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=1 pid=2875) INFO 05-16 23:06:29 [parallel_state.py:1004] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=0 pid=2874) INFO 05-16 23:06:29 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=2875) INFO 05-16 23:06:29 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=2874) WARNING 05-16 23:06:29 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=2875) WARNING 05-16 23:06:29 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=2875) INFO 05-16 23:06:29 [gpu_model_runner.py:1329] Starting to load model /root/autodl-tmp/llms/Qwen/Qwen3-8B...
(VllmWorker rank=0 pid=2874) INFO 05-16 23:06:29 [gpu_model_runner.py:1329] Starting to load model /root/autodl-tmp/llms/Qwen/Qwen3-8B...
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:01,  2.86it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:00<00:01,  2.18it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:01<00:00,  2.00it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:01<00:00,  2.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00,  2.58it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00,  2.37it/s]
(VllmWorker rank=0 pid=2874) 
(VllmWorker rank=0 pid=2874) INFO 05-16 23:06:35 [loader.py:458] Loading weights took 2.16 seconds
(VllmWorker rank=0 pid=2874) INFO 05-16 23:06:35 [gpu_model_runner.py:1347] Model loading took 2.6077 GiB and 5.417584 seconds
(VllmWorker rank=1 pid=2875) INFO 05-16 23:06:35 [loader.py:458] Loading weights took 2.37 seconds
(VllmWorker rank=1 pid=2875) INFO 05-16 23:06:35 [gpu_model_runner.py:1347] Model loading took 2.6077 GiB and 5.831182 seconds
INFO 05-16 23:06:53 [kv_cache_utils.py:634] GPU KV cache size: 79,040 tokens
INFO 05-16 23:06:53 [kv_cache_utils.py:637] Maximum concurrency for 4,096 tokens per request: 19.30x
INFO 05-16 23:06:53 [kv_cache_utils.py:634] GPU KV cache size: 79,040 tokens
INFO 05-16 23:06:53 [kv_cache_utils.py:637] Maximum concurrency for 4,096 tokens per request: 19.30x
INFO 05-16 23:06:54 [core.py:159] init engine (profile, create kv cache, warmup model) took 19.09 seconds
INFO 05-16 23:06:54 [core_client.py:439] Core engine process 0 ready.
WARNING 05-16 23:06:54 [config.py:1239] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 05-16 23:06:54 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 05-16 23:06:54 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 05-16 23:06:54 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:6333
INFO 05-16 23:06:54 [launcher.py:28] Available routes are:
INFO 05-16 23:06:54 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 05-16 23:06:54 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 05-16 23:06:54 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 05-16 23:06:54 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
.......

enforce-eager对性能的影响

enforce-eager = false

enforce-eager = true

1个并发

8个并发

异常日志处理

(以下为了快速测试，使用的是Qwen7B模型FP16精度+4bit量化)

CUDA out of memory.

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 31.48 GiB of which 447.56 MiB is free. Including non-PyTorch memory, this process has 30.97 GiB memory in use. Of the allocated memory 27.98 GiB is allocated by PyTorch, with 619.88 MiB allocated in private pools (e.g., CUDA Graphs), and 808.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W517 15:35:52.765542648 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):File "/root/miniconda3/bin/vllm", line 8, in <module>原因是：CUDA要分配的内存不够；比如CUDA需要2G，但是你通过“gpu-memory-utilization”将31G显存进行了预分配，但事实上其中可能有28G内存可能实际可能没用到的，这部分内存可以通过预分配内存降低，将更多的内存分配给CUDA；
降低预分配：gpu-memory-utilization: 0.9  由原来得0.95--->0.9

KV缓存不够(No available memory for the cache blocks)

(VllmWorker rank=1 pid=3244) INFO 05-16 23:09:33 [gpu_model_runner.py:1347] Model loading took 1.2216 GiB and 5.555432 seconds
(VllmWorker rank=0 pid=3243) INFO 05-16 23:09:33 [gpu_model_runner.py:1347] Model loading took 1.2216 GiB and 5.552125 seconds
ERROR 05-16 23:09:36 [core.py:396] EngineCore failed to start.
ERROR 05-16 23:09:36 [core.py:396] Traceback (most recent call last):
ERROR 05-16 23:09:36 [core.py:396]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-16 23:09:36 [core.py:396]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-16 23:09:36 [core.py:396]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-16 23:09:36 [core.py:396]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-16 23:09:36 [core.py:396]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-16 23:09:36 [core.py:396]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 05-16 23:09:36 [core.py:396]     self._initialize_kv_caches(vllm_config)
ERROR 05-16 23:09:36 [core.py:396]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 134, in _initialize_kv_caches
ERROR 05-16 23:09:36 [core.py:396]     get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
ERROR 05-16 23:09:36 [core.py:396]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 699, in get_kv_cache_config
ERROR 05-16 23:09:36 [core.py:396]     check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
ERROR 05-16 23:09:36 [core.py:396]   File "/root/miniconda3/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 527, in check_enough_kv_cache_memory
ERROR 05-16 23:09:36 [core.py:396]     raise ValueError("No available memory for the cache blocks. "
ERROR 05-16 23:09:36 [core.py:396] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
ERROR 05-16 23:09:37 [multiproc_executor.py:123] Worker proc VllmWorker-0 died unexpectedly, shutting down executor.1.模型加载显存：
- 每个GPU加载Qwen3-8B模型约占用 1.22 GiB（日志显示 Model loading took 1.2216 GiB）。
- 若你有2个GPU（从日志的 rank=0 和 rank=1 推断），则总模型显存占用约为 2.44 GiB。
2.KV缓存显存需求：
- vLLM默认的 gpu_memory_utilization 参数是 0.9（即90%的GPU显存用于模型和KV缓存）。
- 你需要确保剩余的显存（总显存 × gpu_memory_utilization - 模型占用显存）足够分配给KV缓存。
可用显存 = X × gpu_memory_utilization
KV缓存需求 = 可用显存 - 模型占用显存

model需要1.2G显存，但是我们通过“gpu_memory_utilization=0.12”分配了0.12*20=2.4，按道理来说应该够用，但是系统依然报错“No available memory for the cache blocks.”----KV缓存不够

解决方法：降低KV缓存

降低max-num-batched-tokens(单批次最大token数)；
降低模型上下文长度(max-model-len)
降低并发数(max-num-seqs)
关禁用cuda图enforce-eager=true降低内存消耗
降低最大序列长度(max-seq-len-to-capture)
禁用多GPU自定义通信(disable-custom-all-reduce=true)达到减少多GPU显存开销

内存占用

模型只占用了1.2G，但是nvidia-smi显示占用了4.5G，剩下3G去哪儿了?

(VllmWorker rank=0 pid=5361) 
(VllmWorker rank=0 pid=5361) INFO 05-16 23:43:35 [gpu_model_runner.py:1347] Model loading took 1.2216 GiB and 5.527624 seconds
(VllmWorker rank=1 pid=5362) INFO 05-16 23:43:35 [gpu_model_runner.py:1347] Model loading took 1.2216 GiB and 5.531894 seconds
(VllmWorker rank=1 pid=5362) INFO 05-16 23:43:50 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/c15482133e/rank_1_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=5362) INFO 05-16 23:43:50 [backends.py:430] Dynamo bytecode transform time: 15.34 s
(VllmWorker rank=0 pid=5361) INFO 05-16 23:43:50 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/c15482133e/rank_0_0 for vLLM's torch.compile
(VllmWorker rank=0 pid=5361) INFO 05-16 23:43:50 [backends.py:430] Dynamo bytecode transform time: 15.67 s
(VllmWorker rank=1 pid=5362) INFO 05-16 23:44:02 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 10.311 s
(VllmWorker rank=0 pid=5361) INFO 05-16 23:44:02 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 10.182 s
(VllmWorker rank=1 pid=5362) INFO 05-16 23:44:05 [monitor.py:33] torch.compile takes 15.34 s in total
(VllmWorker rank=0 pid=5361) INFO 05-16 23:44:05 [monitor.py:33] torch.compile takes 15.67 s in total
INFO 05-16 23:44:06 [kv_cache_utils.py:634] GPU KV cache size: 7,216 tokens
INFO 05-16 23:44:06 [kv_cache_utils.py:637] Maximum concurrency for 1,024 tokens per request: 7.05x
INFO 05-16 23:44:06 [kv_cache_utils.py:634] GPU KV cache size: 7,216 tokens
INFO 05-16 23:44:06 [kv_cache_utils.py:637] Maximum concurrency for 1,024 tokens per request: 7.05x
(VllmWorker rank=0 pid=5361) INFO 05-16 23:44:49 [gpu_model_runner.py:1686] Graph capturing finished in 44 secs, took 1.99 GiB
(VllmWorker rank=1 pid=5362) INFO 05-16 23:44:49 [gpu_model_runner.py:1686] Graph capturing finished in 44 secs, took 1.99 GiB
INFO 05-16 23:44:50 [core.py:159] init engine (profile, create kv cache, warmup model) took 74.74 seconds
INFO 05-16 23:44:50 [core_client.py:439] Core engine process 0 ready.
WARNING 05-16 23:44:50 [config.py:1239] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm通过：nvidia-smi -l 2 查看其中的Memory-Usage内存使用+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        On  |   00000000:98:00.0 Off |                  N/A |
|  0%   30C    P8             26W /  320W |    4546MiB /  20480MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3080        On  |   00000000:B1:00.0 Off |                  N/A |
|  0%   30C    P8             21W /  320W |    4546MiB /  20480MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

类别	显存占用	说明
模型参数	1.2 GB	量化后Qwen3-8B的实际占用（bitsandbytes生效）
KV缓存预分配	2.5 GB	block-size=16 + max-model-len=512 的预分配空间（日志显示7216 tokens）
框架运行时开销	0.8 GB	PyTorch CUDA上下文 + vLLM通信缓存 + 动态批处理队列
CUDA内核编译缓存	0.5 GB	torch.compile 生成的编译缓存（/root/.cache/vllm/torch_compile_cache）