AI Inference | vLLM Quick Deployment Guide

This is the first article in the AI Inference series, and more vLLM-related content will follow soon. It starts with deployment, covering how to install vLLM with both the GPU and CPU backends. Later articles will walk through vLLM's core features, such as PD (prefill/decode) disaggregation, Speculative Decoding, and Prefix Caching, so stay tuned.

1 What is vLLM?

vLLM is an efficient, easy-to-use inference and serving framework for large language models (LLMs). It focuses on optimizing inference speed and throughput and is particularly well suited to high-concurrency production environments. Developed by a research team at UC Berkeley, it has become one of the most popular LLM inference engines thanks to its excellent performance.

vLLM can run on both GPUs and CPUs. This article covers how to install and run vLLM with each backend.

2 Prerequisites

2.1 Getting a Virtual Machine

If you do not have a local GPU environment, consider renting a GPU server from a cloud provider (such as Alibaba Cloud or Tencent Cloud).

Ubuntu 22.04 is recommended as the operating system, and the GPU model can be chosen based on your needs. Since large language models usually take up a lot of disk space, it is advisable to allocate extra disk capacity.

2.2 Virtual Environment

uv is recommended for managing Python virtual environments. Install it with the following commands:

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

3 Installation

3.1 Using a GPU as the vLLM Backend

3.1.1 System Requirements

vLLM ships with precompiled C++ and CUDA (12.6) binaries, which require:

  • OS: Linux
  • Python version: 3.9 ~ 3.12
  • GPU: compute capability 7.0 or higher (e.g. V100, T4, RTX 20xx, A10, A100, L4, H100)

Note: Compute capability defines the hardware features and instructions supported by each NVIDIA GPU architecture. It determines whether certain CUDA or Tensor Core features (such as Unified Memory, Tensor Cores, and dynamic parallelism) are available; it does not directly indicate a GPU's computing performance.
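
To check a GPU's compute capability from Python, a quick sanity-check sketch like the one below can be used (it assumes a CUDA-enabled PyTorch is available, which the vLLM installation later in this guide pulls in):

# Print the compute capability of the first visible GPU.
import torch

major, minor = torch.cuda.get_device_capability(0)  # e.g. (8, 6), i.e. 8.6, on an A10
print(f"Compute capability: {major}.{minor}")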

3.1.2 Installing and Configuring GPU Dependencies

You can install the required dependencies with the one-line command below. The script installs the NVIDIA GPU Driver and the NVIDIA Container Toolkit, and configures the NVIDIA Container Runtime (needed later when running vLLM via Docker).

curl -sS https://raw.githubusercontent.com/cr7258/hands-on-lab/refs/heads/main/ai/gpu/setup/docker-only-install.sh | bash

3.1.3 Installing vLLM

Create a Python virtual environment:

# (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment.
uv venv --python 3.12 --seed
source .venv/bin/activate

Install vLLM:

uv pip install vllm
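
To confirm the installation, a quick sanity check can be run inside the virtual environment (a minimal sketch; it assumes the vllm package exposes a __version__ attribute and that a CUDA-capable GPU is visible):

# Sanity check after installation.
import torch
import vllm

print("vLLM version:", vllm.__version__)          # version of the installed package
print("CUDA available:", torch.cuda.is_available())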

3.2 Using a CPU as the vLLM Backend

3.2.1 System Requirements

  • OS: Linux
  • Python version: 3.9 ~ 3.12
  • Compiler: gcc/g++ >= 12.3.0 (optional, recommended)

3.2.2 Installing Build Dependencies

vLLM currently does not provide pre-built packages or images for the CPU backend, so it has to be compiled from source.

uv venv vllm-cpu --python 3.12 --seed
source vllm-cpu/bin/activate

First, install the recommended compiler. Using gcc/g++ >= 12.3.0 as the default compiler is advised to avoid potential problems. For example, on Ubuntu 22.04 you can run:

sudo apt-get update  -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

Then clone the vLLM repository:

git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source

Next, install the Python packages required to build the vLLM CPU backend:

pip install --upgrade pip
pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install intel-extension-for-pytorch

3.2.3 Installing vLLM

Finally, build and install the vLLM CPU backend:

VLLM_TARGET_DEVICE=cpu python setup.py install
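
Once the build finishes, you can try a short offline-inference script on the CPU backend. The sketch below also sets VLLM_CPU_KVCACHE_SPACE (the KV cache size in GiB; as the startup log later in this article notes, it defaults to 4 when unset). The value 8 and the model ID are only examples:

# Minimal offline-inference sketch for the CPU backend.
import os

# Must be set before vLLM is imported; 8 GiB is an arbitrary example value.
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=32)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)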

3.3 Running vLLM with Docker

vLLM provides official Docker images for deployment. Run the following command to start the GPU-backed vLLM with Docker:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-1.5B-Instruct

Here is what the command-line options mean:

  • --runtime nvidia: use the NVIDIA container runtime, which is required for containers that need GPUs.
  • --gpus all: give the container access to all GPUs on the host. To use specific GPUs, pass their IDs instead, e.g. --gpus '"device=0,1"' to use GPUs 0 and 1 (note the extra single quotes around the double-quoted device=...; without them you will see the error: cannot set both Count and DeviceIDs on device request.). To request a fixed number of GPUs, use e.g. --gpus 2 for 2 GPUs.
  • -v ~/.cache/huggingface:/root/.cache/huggingface: mount the host's Hugging Face cache into the container, so models downloaded earlier are reused after container restarts, saving time and network traffic.
  • -p 8000:8000: map port 8000 in the container to port 8000 on the host, so the vLLM service is reachable at http://localhost:8000.
  • --ipc=host (or --shm-size): give the container access to the host's shared memory. vLLM uses PyTorch, which passes data between processes through shared memory under the hood, especially during tensor-parallel inference. This flag is required when running inference across multiple GPUs.
  • vllm/vllm-openai:latest: the Docker image to use; latest is the most recent vLLM OpenAI-compatible server image.
  • --model Qwen/Qwen2.5-1.5B-Instruct: an argument passed to the vLLM server, specifying Qwen/Qwen2.5-1.5B-Instruct as the model to load.

To run vLLM with the CPU backend instead, you can use the image from this repository:

docker run --rm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5 \
    --model=Qwen/Qwen2.5-1.5B-Instruct
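
Once either container is up, a quick way to verify the service is to list the served models. This is a sketch using the requests library and assumes port 8000 is mapped as above:

# List the models exposed by the OpenAI-compatible endpoint.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"], "max_model_len =", model.get("max_model_len"))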

vLLM offers two main inference modes, offline inference and online serving, which target different scenarios and requirements. The differences between them are as follows.

4 Differences Between Offline and Online Inference

The main differences between offline inference and online inference lie in their usage scenarios, latency requirements, and resource scheduling. In short:

4.1 Offline Inference

Definition: a batch of inputs is processed together, usually without requiring results in real time. Characteristics:

  • Batch processing: commonly used to process large volumes of input, such as log analysis or precomputation for recommender systems.
  • Relaxed latency requirements: results may be returned later without hurting the user experience.
  • High resource utilization: the system can make full use of GPU/CPU resources during idle periods.
  • Example scenarios: nightly jobs that build user interest profiles, pre-generating ad copy, and so on.

4.2 Online Inference

Definition: inference is performed on users' real-time requests and results are returned immediately. Characteristics:

  • Real-time responses: response times are usually expected to stay within a few hundred milliseconds.
  • Latency sensitive: high concurrency and low latency are the core metrics.
  • Stable resource allocation: the service must stay online for long periods with fixed, reserved resources.
  • Example scenarios: chatbots, search suggestions, intelligent customer service, and so on.

5 Offline Inference

With vLLM installed, you can start generating text for a list of input prompts (i.e. offline batch inference). The following code is the example provided on the vLLM website:

# basic.py
# https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/basic/basic.py
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)


def main():
    # Create an LLM.
    llm = LLM(model="facebook/opt-125m")
    # Generate texts from the prompts.
    # The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt:    {prompt!r}")
        print(f"Output:    {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()

In the code above, SamplingParams specifies the parameters of the sampling process: the sampling temperature is set to 0.8 and the nucleus sampling probability to 0.95. Here is what these two parameters do.

Sampling temperature and nucleus sampling probability (top-p) are two sampling parameters commonly used when large language models generate text; they control the diversity and quality of the output.

  • Sampling temperature

    • Purpose: controls how "random" or "creative" the generated text is.
    • How it works: the temperature rescales the probability distribution produced by the model. A lower temperature (e.g. 0.5) makes high-probability tokens even more likely to be chosen, so output is more deterministic and repetitive; a higher temperature (e.g. 1.2) gives low-probability tokens a better chance, making the text more diverse but possibly less coherent.
    • Concretely, temperature=0.8 smooths the distribution to some degree: compared with the default of 1.0 it leans more toward high-probability tokens while still keeping some diversity.
  • Nucleus sampling probability (top-p)

    • Purpose: controls the size of the candidate set considered at each generation step, dynamically balancing diversity and plausibility.
    • How it works: top-p (nucleus sampling) sorts all tokens by probability from high to low and accumulates their probabilities until the total first exceeds 0.95; sampling then happens only within this "nucleus" of tokens.

To illustrate how these two parameters relate, suppose the model's next token could be "cat, dog, tiger, elephant, monkey, turtle, eagle, crocodile, ant, …" (tens of thousands of candidates):

  • Temperature adjusts each token's probability (if "cat" has a 30% probability, you can push it higher or lower);
  • Top-p sorts all tokens and keeps only the first few whose cumulative probability reaches 95%, say the top 7, and then samples one of them; see the sketch right after this list.
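
The toy script below makes the two steps concrete. It is only an illustration, not vLLM code, and the five-token vocabulary and logits are made up:

# Toy illustration of temperature scaling followed by top-p truncation.
import numpy as np

def sample_probs(logits, temperature=0.8, top_p=0.95):
    # Temperature rescales the logits before the softmax.
    scaled = np.array(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Top-p keeps the smallest set of tokens whose cumulative probability
    # first reaches top_p, then renormalizes within that "nucleus".
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    kept = order[:cutoff]
    nucleus = np.zeros_like(probs)
    nucleus[kept] = probs[kept] / probs[kept].sum()
    return nucleus

vocab = ["cat", "dog", "tiger", "elephant", "monkey"]
print(dict(zip(vocab, sample_probs([2.0, 1.5, 0.5, 0.2, -1.0]).round(3))))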

Outputs are produced by the llm.generate method. It adds the input prompts to the vLLM engine's waiting queue and invokes the engine to generate them with high throughput. The results are returned as a list of RequestOutput objects, each containing the full output tokens.

Run the program with the following command:

python basic.py

# Output
INFO 05-09 22:19:49 [__init__.py:239] Automatically detected platform cuda.
INFO 05-09 22:20:08 [config.py:717] This model supports multiple tasks: {'embed', 'classify', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
INFO 05-09 22:20:13 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=8192.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 6.99MB/s]
vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 1.90MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 13.1MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 5.21MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.70MB/s]
INFO 05-09 22:20:18 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-09 22:20:19 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x78d0197039e0>
INFO 05-09 22:20:21 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-09 22:20:21 [cuda.py:221] Using Flash Attention backend on V1 engine.
WARNING 05-09 22:20:21 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 05-09 22:20:21 [gpu_model_runner.py:1329] Starting to load model facebook/opt-125m...
INFO 05-09 22:20:22 [weight_utils.py:265] Using model weights format ['*.bin']
pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 251M/251M [00:27<00:00, 8.96MB/s]
INFO 05-09 22:20:51 [weight_utils.py:281] Time spent downloading weights for facebook/opt-125m: 28.962338 seconds
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.84it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.83it/s]
INFO 05-09 22:20:51 [loader.py:458] Loading weights took 0.17 seconds
INFO 05-09 22:20:51 [gpu_model_runner.py:1347] Model loading took 0.2389 GiB and 29.934842 seconds
INFO 05-09 22:20:53 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/4f59edd943/rank_0_0 for vLLM's torch.compile
INFO 05-09 22:20:53 [backends.py:430] Dynamo bytecode transform time: 2.05 s
INFO 05-09 22:20:55 [backends.py:136] Cache the graph of shape None for later use
INFO 05-09 22:20:59 [backends.py:148] Compiling a graph for general shape takes 6.06 s
INFO 05-09 22:21:02 [monitor.py:33] torch.compile takes 8.11 s in total
INFO 05-09 22:21:02 [kv_cache_utils.py:634] GPU KV cache size: 549,216 tokens
INFO 05-09 22:21:02 [kv_cache_utils.py:637] Maximum concurrency for 2,048 tokens per request: 268.17x
INFO 05-09 22:21:17 [gpu_model_runner.py:1686] Graph capturing finished in 15 secs, took 0.20 GiB
INFO 05-09 22:21:17 [core.py:159] init engine (profile, create kv cache, warmup model) took 26.22 seconds
INFO 05-09 22:21:17 [core_client.py:439] Core engine process 0 ready.
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 32.86it/s, est. speed input: 213.68 toks/s, output: 525.96 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    " Paul D. I'm 28 years old. I'm a girl, but I"
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' facing the same problems as a former CIA director, who has been accused of repeatedly'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' exploding with touristy people.\nWhat do you mean?  Did you mean'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    ' set for a big shift in the near term. The world is in the hands'
------------------------------------------------------------

By default, vLLM downloads models from Hugging Face. If you want to use models from ModelScope instead, set the VLLM_USE_MODELSCOPE environment variable.
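
For example, a minimal sketch (the variable must be set before vLLM is imported; the model ID is reused from this article and the value "True" is the commonly documented setting):

# Download the model from ModelScope instead of Hugging Face.
import os

os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")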

6 Online Serving

vLLM can be deployed as a server that implements the OpenAI API protocol. By default the server starts at http://localhost:8000; you can change the address with the --host and --port arguments. The server currently hosts one model at a time and implements endpoints such as list models and create chat completion.

Run the following command to start the vLLM server with the Qwen/Qwen2.5-1.5B-Instruct model:

vllm serve Qwen/Qwen2.5-1.5B-Instruct

The startup output looks like this (GPU backend):

INFO 05-09 22:51:11 [__init__.py:239] Automatically detected platform cuda.
INFO 05-09 22:51:18 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-09 22:51:18 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='Qwen/Qwen2.5-1.5B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-1.5B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=None, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, 
dispatch_function=<function ServeSubcommand.cmd at 0x7e5c1f36b740>)
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 660/660 [00:00<00:00, 6.78MB/s]
INFO 05-09 22:51:33 [config.py:717] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
INFO 05-09 22:51:38 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.30k/7.30k [00:00<00:00, 6.34MB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.78M/2.78M [00:00<00:00, 3.32MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.67M/1.67M [00:00<00:00, 2.64MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.03M/7.03M [00:00<00:00, 8.10MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 242/242 [00:00<00:00, 2.76MB/s]
INFO 05-09 22:51:51 [__init__.py:239] Automatically detected platform cuda.
INFO 05-09 22:51:56 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-09 22:51:57 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x78bea674d970>
INFO 05-09 22:51:58 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-09 22:51:58 [cuda.py:221] Using Flash Attention backend on V1 engine.
WARNING 05-09 22:51:58 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 05-09 22:51:58 [gpu_model_runner.py:1329] Starting to load model Qwen/Qwen2.5-1.5B-Instruct...
INFO 05-09 22:51:59 [weight_utils.py:265] Using model weights format ['*.safetensors']
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.09G/3.09G [03:44<00:00, 13.8MB/s]
INFO 05-09 22:55:44 [weight_utils.py:281] Time spent downloading weights for Qwen/Qwen2.5-1.5B-Instruct: 225.167288 seconds
INFO 05-09 22:55:45 [weight_utils.py:315] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.23it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.23it/s]
INFO 05-09 22:55:45 [loader.py:458] Loading weights took 0.50 seconds
INFO 05-09 22:55:45 [gpu_model_runner.py:1347] Model loading took 2.8871 GiB and 226.791553 seconds
INFO 05-09 22:55:52 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/c822c41d5a/rank_0_0 for vLLM's torch.compile
INFO 05-09 22:55:52 [backends.py:430] Dynamo bytecode transform time: 6.61 s
INFO 05-09 22:55:54 [backends.py:136] Cache the graph of shape None for later use
INFO 05-09 22:56:17 [backends.py:148] Compiling a graph for general shape takes 24.18 s
INFO 05-09 22:56:26 [monitor.py:33] torch.compile takes 30.79 s in total
INFO 05-09 22:56:27 [kv_cache_utils.py:634] GPU KV cache size: 567,984 tokens
INFO 05-09 22:56:27 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 17.33x
INFO 05-09 22:56:47 [gpu_model_runner.py:1686] Graph capturing finished in 21 secs, took 0.45 GiB
INFO 05-09 22:56:47 [core.py:159] init engine (profile, create kv cache, warmup model) took 62.02 seconds
INFO 05-09 22:56:47 [core_client.py:439] Core engine process 0 ready.
WARNING 05-09 22:56:48 [config.py:1239] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 05-09 22:56:48 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-09 22:56:48 [serving_completion.py:61] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-09 22:56:48 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-09 22:56:48 [launcher.py:28] Available routes are:
INFO 05-09 22:56:48 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /docs, Methods: HEAD, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /health, Methods: GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /load, Methods: GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /ping, Methods: POST, GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /version, Methods: GET
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /score, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-09 22:56:48 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [49305]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

The startup output looks like this (CPU backend):

INFO 05-10 11:48:14 [__init__.py:248] Automatically detected platform cpu.
INFO 05-10 11:48:27 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:27 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 05-10 11:48:27 [config.py:2019] max_model_len was is not set. Defaulting to arbitrary value of 8192.
WARNING 05-10 11:48:27 [config.py:2025] max_num_seqs was is not set. Defaulting to arbitrary value of 128.
INFO 05-10 11:48:29 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:29 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-10 11:48:30 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:30 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-10 11:48:31 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:31 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-10 11:48:31 [api_server.py:1044] vLLM API server version 0.8.5.dev572+g246e3e0a3
INFO 05-10 11:48:32 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 05-10 11:48:32 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-10 11:48:32 [cli_args.py:297] non-default args: {'model': 'Qwen/Qwen2.5-1.5B-Instruct'}
INFO 05-10 11:48:41 [config.py:760] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
WARNING 05-10 11:48:41 [arg_utils.py:1533] device type=cpu is not supported by the V1 Engine. Falling back to V0. 
INFO 05-10 11:48:41 [config.py:1857] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 05-10 11:48:41 [cpu.py:118] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
WARNING 05-10 11:48:41 [cpu.py:131] uni is not supported on CPU, fallback to mp distributed executor backend.
INFO 05-10 11:48:41 [api_server.py:247] Started engine process with PID 23528
INFO 05-10 11:48:45 [__init__.py:248] Automatically detected platform cpu.
INFO 05-10 11:48:47 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.dev572+g246e3e0a3) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=True, 
INFO 05-10 11:48:49 [cpu.py:57] Using Torch SDPA backend.
INFO 05-10 11:48:49 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-10 11:48:49 [weight_utils.py:257] Using model weights format ['*.safetensors']
INFO 05-10 11:48:50 [weight_utils.py:307] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.57it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.56it/s]
INFO 05-10 11:48:50 [default_loader.py:278] Loading weights took 0.19 seconds
INFO 05-10 11:48:50 [executor_base.py:112] # cpu blocks: 9362, # CPU blocks: 0
INFO 05-10 11:48:50 [executor_base.py:117] Maximum concurrency for 32768 tokens per request: 4.57x
INFO 05-10 11:48:50 [llm_engine.py:435] init engine (profile, create kv cache, warmup model) took 0.15 seconds
WARNING 05-10 11:48:51 [config.py:1283] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 05-10 11:48:51 [serving_chat.py:116] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-10 11:48:51 [serving_completion.py:61] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 05-10 11:48:51 [api_server.py:1091] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-10 11:48:51 [launcher.py:28] Available routes are:
INFO 05-10 11:48:51 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 05-10 11:48:51 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 05-10 11:48:51 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 05-10 11:48:51 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 05-10 11:48:51 [launcher.py:36] Route: /health, Methods: GET
INFO 05-10 11:48:51 [launcher.py:36] Route: /load, Methods: GET
INFO 05-10 11:48:51 [launcher.py:36] Route: /ping, Methods: GET, POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-10 11:48:51 [launcher.py:36] Route: /version, Methods: GET
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /score, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-10 11:48:51 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [23290]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

The server accepts requests in the same format as the OpenAI API. For example, to list the models:

curl -sS http://localhost:8000/v1/models | jq

# Output
{"object": "list","data": [{"id": "Qwen/Qwen2.5-1.5B-Instruct","object": "model","created": 1746802639,"owned_by": "vllm","root": "Qwen/Qwen2.5-1.5B-Instruct","parent": null,"max_model_len": 32768,"permission": [{"id": "modelperm-63355d89a63946e58de0095fa60cd62b","object": "model_permission","created": 1746802639,"allow_create_engine": false,"allow_sampling": true,"allow_logprobs": true,"allow_search_indices": false,"allow_view": true,"allow_fine_tuning": false,"organization": "*","group": null,"is_blocking": false}]}]
}

6.1 Using vLLM's OpenAI Completions API

You can query the model with an input prompt:

curl -sS http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "prompt": "San Francisco is a"}' | jq

# Output
{
  "id": "cmpl-8bca6ac3579a41178ae70c6c5ba1d6c6",
  "object": "text_completion",
  "created": 1746802658,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " city in the state of California, United States. It is the county seat of",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "total_tokens": 20,
    "completion_tokens": 16,
    "prompt_tokens_details": null
  }
}

Because the server is compatible with the OpenAI API, you can also query it directly with the openai Python SDK.

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
)
print("Completion result:", completion)

Running the code above produces output like the following:

Completion result: Completion(id='cmpl-4b2b13ce66c24de3afc73ee67dc94607', 
choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, 
text=' city of contrasts. It has its share of luxury and wealth, but it also', 
stop_reason=None, prompt_logprobs=None)], created=1746802708, 
model='Qwen/Qwen2.5-1.5B-Instruct', object='text_completion', system_fingerprint=None, 
usage=CompletionUsage(completion_tokens=16, prompt_tokens=4, total_tokens=20, 
completion_tokens_details=None, prompt_tokens_details=None))

6.2 Using vLLM's OpenAI Chat Completions API

vLLM also supports the OpenAI Chat Completions API. This API offers a more dynamic, interactive way of talking to the model: it supports back-and-forth conversations and keeps context in the chat history. It is especially useful for tasks that need contextual continuity or more detailed explanations, because the model can take previous turns into account and give more coherent, personalized responses.

curl -sS http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }' | jq

# Output
{
  "id": "chatcmpl-ff58ca762837455e8388a033245dc254",
  "object": "chat.completion",
  "created": 1746802744,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "The New York Yankees won the World Series in 2020, defeating the Tampa Bay Rays in seven games.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "total_tokens": 56,
    "completion_tokens": 25,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

You can likewise send the request with the OpenAI Python SDK.

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
print("Chat response:", chat_response)

The response looks like this:

Chat response: ChatCompletion(id='chatcmpl-3ba198d6ca6649ea8e621d6e51abc30a', 
choices=[Choice(finish_reason='stop', index=0, logprobs=None, 
message=ChatCompletionMessage(content="Sure, here's one for you:\n\nWhy couldn't the bicycle stand up by itself?\n\nBecause it was two-tired!", 
refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], 
reasoning_content=None), stop_reason=None)], created=1746802826, model='Qwen/Qwen2.5-1.5B-Instruct', 
object='chat.completion', service_tier=None, system_fingerprint=None, 
usage=CompletionUsage(completion_tokens=26, prompt_tokens=24, total_tokens=50, 
completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
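
Because the chat history is just a list of messages, carrying context across turns only requires appending each reply before sending the next request. Below is a minimal multi-turn sketch against the same server; the follow-up question is made up for illustration:

# Multi-turn conversation: append the assistant's reply, then ask a follow-up.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a joke."},
]
first = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct", messages=messages
)
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Explain why that joke is funny."})
second = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct", messages=messages
)
print(second.choices[0].message.content)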

7 Performance Comparison of the GPU and CPU Backends

There is a significant performance gap between the CPU and GPU backends, because vLLM's architecture was originally designed and optimized for GPUs, and the CPU backend needs a series of optimizations to run efficiently. In addition, GPUs offer much more parallel compute, which gives them a clear advantage over CPUs for LLM inference. Below is a brief comparison of vLLM's inference performance on the two backends.

The test server used for this comparison has the following configuration:

  • CPU: 32 vCPU
  • Memory: 188 GiB
  • GPU: NVIDIA A10

The client sends requests to the vLLM server in a loop with the following command:

for i in {1..100}; do
    echo "Request: $i"
    curl -sS http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Tell me a joke"}]}'
    echo ""
done
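
If you prefer to measure throughput on the client side instead of reading the server log, a rough sketch like the one below can be used. It reuses the openai SDK shown earlier and derives tokens per second from the usage field; this is end-to-end client throughput, so it will be somewhat lower than the engine-side numbers:

# Rough client-side throughput estimate over sequential requests.
import time
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

completion_tokens = 0
start = time.time()
for _ in range(10):
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-1.5B-Instruct",
        messages=[{"role": "user", "content": "Tell me a joke"}],
    )
    completion_tokens += resp.usage.completion_tokens
elapsed = time.time() - start
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tokens/s")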

The startup command for the vLLM CPU backend is:

vllm serve Qwen/Qwen2.5-1.5B-Instruct 

With the CPU backend, vLLM generates roughly 12 to 15 tokens per second:

INFO 05-10 11:54:24 [metrics.py:486] Avg prompt throughput: 18.1 tokens/s, Avg generation throughput: 12.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 05-10 11:54:29 [metrics.py:486] Avg prompt throughput: 6.5 tokens/s, Avg generation throughput: 15.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

With the GPU backend, vLLM enables Prefix Caching by default to improve inference performance (it is on by default in the v1 engine and off by default in v0), whereas the CPU backend has it disabled by default. To keep the comparison fair, add the --no-enable-prefix-caching flag to turn it off manually:

vllm serve Qwen/Qwen2.5-1.5B-Instruct --no-enable-prefix-caching

With the GPU backend, vLLM generates around 110 tokens per second (with prompt throughput around 135 tokens/s), roughly ten times the CPU backend:

INFO 05-11 11:31:38 [loggers.py:111] Engine 000: Avg prompt throughput: 135.3 tokens/s, Avg generation throughput: 108.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 05-11 11:31:48 [loggers.py:111] Engine 000: Avg prompt throughput: 135.3 tokens/s, Avg generation throughput: 108.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Summary

This article walked through deploying the high-performance LLM inference framework vLLM, covering environment preparation, GPU and CPU backend setup, and both offline inference and online serving. It closed with hands-on tests comparing the two backends' inference throughput and response speed.
