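Notes on debugging a tricky ONNX-to-TensorRT dynamic-shape conversion

While converting the encoder's ONNX model to TensorRT, I hit this error: "shape tensor must have build-time extent". Judging from the message, the ONNX Range operator is treated as a shape tensor during conversion, and TensorRT requires the dimensions of a shape tensor to be known constants at build time.

Visualizing the model in Netron.app shows that the limit parameter of the Range operator depends on the output of the preceding Cast operator:

To dig further, I added extra output tensors to the ONNX graph and printed the inputs and outputs of the nodes upstream and downstream of Range: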

import onnx
import onnxruntime

def inspect_all_tensors(onnx_path, target_tensor_name, input_feed):
    model = onnx.load(onnx_path)
    # Shape inference fills in value_info so intermediate tensors can be looked up by name.
    inferred_model = onnx.shape_inference.infer_shapes(model)
    all_tensors = (list(inferred_model.graph.value_info)
                   + list(inferred_model.graph.input)
                   + list(inferred_model.graph.output))
    # Expose every requested tensor as an additional graph output.
    for t in all_tensors:
        if t.name in target_tensor_name:
            model.graph.output.append(t)
    modified_model_bytes = model.SerializeToString()
    sess = onnxruntime.InferenceSession(modified_model_bytes)
    output_names = [output.name for output in sess.get_outputs()]
    outputs = sess.run(output_names, input_feed)
    results = {name: value for name, value in zip(output_names, outputs)}
    for name, value in results.items():
        print(f"{name}: {value.shape} {value.dtype} {value}")

The printed results:

onnx::ReduceSum_851: (1,) int64 [1]
onnx::Unsqueeze_953: (1,) int64 [0]
/ReduceMax_output_0: () int64 13
/Range_output_0: (13,) int64 [ 0 1 2 3 4 5 6 7 8 9 10 11 12]
/Unsqueeze_output_0: (1, 13) int64 [[ 0 1 2 3 4 5 6 7 8 9 10 11 12]]
/Where_output_0: (2,) int64 [ 1 13]
/Expand_output_0: (1, 13) int64 [[ 0 1 2 3 4 5 6 7 8 9 10 11 12]]

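The Range output here has length 13. A tensor with that many dimensions is unlikely to appear later in the graph, so this output is probably not being used as a shape. The error log therefore likely means:

The graph is trying to make the length (extent) of a data tensor (the output of Range) depend on a shape tensor, and TensorRT cannot resolve that dependency statically at build time.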

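I tried replacing ReduceMax with a TopK operator so that it would produce a shape tensor rather than a data tensor. Because TopK only supports f32 and f16 tensors, Cast operators have to be added on both the input and output sides of TopK. The structure after the replacement is as follows:

The conversion still failed, though: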

Error[9]: [graph.cpp::computeInputExecutionUses::553] Error Code 9: Internal Error (/TopK_for_/ReduceMax: ITopKLayer cannot be used to compute a shape tensor)

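The output of TopK still cannot be used as a shape tensor, because:

TensorRT can handle dynamic shapes, but with one precondition: every computation that determines a tensor's shape must depend only on the *shapes* of the inputs, never on their *contents/values*.

So the root cause is that the input itself is poorly chosen. The input x_lens is a one-dimensional dynamic tensor holding the input lengths of the batch, and the nodes in the figure above eventually turn it into a mask tensor. Feeding that mask tensor in directly as the input avoids the error above: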

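import numpy as np
import onnx
import onnx_graphsurgeon as gs

def solve_replace_dynamic_range_with_mask_input(onnx_path):
    model = onnx.load(onnx_path)
    ir_version = model.ir_version
    graph = gs.import_onnx(model)

    # New boolean graph input that stands in for the mask computed from x_lens.
    x_len_mask = gs.Variable(name="x_len_mask", dtype=np.bool_, shape=['N', 'L'])

    # Rewire every consumer of the computed mask to read the new input instead.
    for node in graph.nodes:
        for i, input_t in enumerate(node.inputs):
            if input_t.name == "/GreaterOrEqual_output_0":
                print(f"found node: {node}")
                node.inputs[i] = x_len_mask

    # Drop the old length input and the output derived from it.
    inputs_to_delete_set = ["x_lens"]
    outputs_to_delete_set = ["encoder_out_lens"]
    graph.inputs = [inp for inp in graph.inputs if inp.name not in inputs_to_delete_set]
    graph.outputs = [out for out in graph.outputs if out.name not in outputs_to_delete_set]
    graph.inputs.append(x_len_mask)

    # cleanup() removes the now-dangling Range/Expand/GreaterOrEqual chain.
    graph.cleanup()
    graph.toposort()
    onnx.save(gs.export_onnx(graph, ir_version=ir_version), "./onnx/encoder_mask_input_solved.onnx")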
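With x_lens removed, the caller has to build x_len_mask before running the model. The sketch below shows one way to do that in plain Python. It assumes, based on the GreaterOrEqual and Range outputs printed earlier, that the mask is True at padded positions (index >= length) and that the lengths are already expressed in the same frame units as the mask. The helper name `make_pad_mask` is hypothetical.

```python
def make_pad_mask(lens, max_len=None):
    """Return an N x L boolean mask that is True at padded positions.

    Assumed to mirror the Range -> Expand -> GreaterOrEqual chain in the
    original graph; nested lists stand in for tensors.
    """
    if max_len is None:
        # The mask width comes from the *values* in lens, which is exactly
        # the value-dependent shape that TensorRT rejected inside the graph.
        max_len = max(lens)
    return [[i >= n for i in range(max_len)] for n in lens]


mask = make_pad_mask([2, 3])
print(mask)  # [[False, False, True], [False, False, False]]
```

Computing the mask outside the graph moves the value-dependent step to the host, so inside the engine the mask is just another input with dynamic dimensions N and L.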

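I replaced the tensor /GreaterOrEqual_output_0, which was derived from the x_lens input, with the dynamic input tensor x_len_mask, then removed the original input x_lens and the output encoder_out_lens that was computed from it. I then reran the conversion with trtexec. The build ran much longer than before, which suggests this particular problem was most likely solved, but a new one appeared later on: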

Error Code 2: Internal Error (/encoder/Slice_2 requires bool I/O but node can not be handled by Myelin. Dynamic shapes are not equal for slice params.)

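The cause is that the input of the /encoder/Slice_2 node is the dynamic input x_len_mask, so the Slice node had to be replaced as well. I used Range + Gather in place of Slice. The catch is that the limit parameter of Range does not accept INF, so the limit has to be obtained through Shape and Gather operators. The final network structure is as follows:

On the next conversion attempt the process again ran for a long time, so the operator incompatibility above appears to be resolved, but an error occurred while constructing the optimization profile: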
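The idea behind swapping Slice for Range + Gather can be sketched in plain Python. This only illustrates the operator semantics and is not the actual graph edit: nested lists stand in for tensors, and the sliced axis (1) and the start value are assumptions, since the parameters of /encoder/Slice_2 are not shown here.

```python
def shape_of(tensor):
    """Stand-in for ONNX Shape on a 2-D nested list."""
    return [len(tensor), len(tensor[0]) if tensor else 0]


def gather_axis1(tensor, indices):
    """Stand-in for ONNX Gather with axis=1."""
    return [[row[i] for i in indices] for row in tensor]


def slice_tail_via_range_gather(tensor, start):
    """Emulate tensor[:, start:] without an 'infinite' end value."""
    dims = shape_of(tensor)               # Shape
    limit = dims[1]                       # Gather(shape, 1): runtime length replaces INF
    indices = list(range(start, limit))   # Range(start, limit, 1)
    return gather_axis1(tensor, indices)  # Gather(axis=1)


mask = [[False, False, True, True],
        [False, False, False, True]]
assert slice_tail_via_range_gather(mask, 1) == [row[1:] for row in mask]
```

The point of the detour through Shape is that the Range limit then depends only on the input's shape, not on its values.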

[12/13/2025-17:14:45] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[12/13/2025-17:14:45] [V] [TRT] Constructing optimization profile number 0 [1/1].
[12/13/2025-17:14:45] [E] Error[2]: [shapeContext.cpp::checkVolume::2379] Error Code 2: Internal Error (Assertion bound >= 0 failed.)

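Oddly, the ONNX model itself runs fine and produces the same results as the original model, yet the conversion with the following command fails: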

/usr/src/tensorrt/bin/trtexec --onnx=./onnx/encoder.onnx --saveEngine=./tensorrt/encoder.engine --minShapes=x:1x16x80,x_len_mask:1x4 --optShapes=x:8x512x80,x_len_mask:8x252 --maxShapes=x:16x384000x80,x_len_mask:16x191996 --verbose

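At first I assumed minShapes was too small and produced a shape below 0 somewhere, but enlarging it to x:1x128x80,x_len_mask:1x60 did not help. I wanted to check by printing the shape of every tensor, but that never turned out to be convenient. I then used polygraphy to find out which tensor's shape was going wrong: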

polygraphy run --onnxrt ./onnx/encoder_range_gather_mask_input_solved_v1.onnx --input-shapes 'x:[1,16,80]' 'x_len_mask:[1,8]'
polygraphy run --onnxrt ./onnx/encoder-iter-5576000-avg-20-simplified.onnx --input-shapes 'x:[1,16,80]' 'x_lens:[16]'

This time, however, a different error appeared:

'/encoder/0/layers.0/self_attn_weights/Where' Status Message ... axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 4 by 8
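One possible reading of "4 by 8", offered as a guess rather than a confirmed diagnosis: every shape pair used with trtexec (16 and 4, 128 and 60, 512 and 252, 384000 and 191996) fits L = (T - 7) // 2, where T is the frame count of x and L is the mask length. If that relation holds for this encoder, then x:[1,16,80] implies a mask of length 4, and the x_len_mask:[1,8] passed to polygraphy would be inconsistent with it. The helper names below are hypothetical.

```python
def mask_len(num_frames: int) -> int:
    """Mask length for a given frame count.

    The relation (T - 7) // 2 is inferred from the shape pairs in the
    trtexec command; it is an assumption, not read from the model.
    """
    return (num_frames - 7) // 2


def polygraphy_input_shapes(batch: int, num_frames: int, feat_dim: int = 80) -> list[str]:
    """Build mutually consistent --input-shapes arguments."""
    shapes = {
        "x": [batch, num_frames, feat_dim],
        "x_len_mask": [batch, mask_len(num_frames)],
    }
    return [f"{name}:[{','.join(str(d) for d in dims)}]" for name, dims in shapes.items()]


# The relation reproduces every pair used with trtexec.
for frames, expected in [(16, 4), (128, 60), (512, 252), (384000, 191996)]:
    assert mask_len(frames) == expected

print(polygraphy_input_shapes(1, 16))  # ['x:[1,16,80]', 'x_len_mask:[1,4]']
```

Deriving both shapes from a single frame count keeps the two inputs from drifting apart when test shapes are edited by hand.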

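The problems kept piling up, and I began to suspect that the TensorRT version was too old to support dynamic shapes well. I asked SRE to upgrade the CUDA driver to 12.4 and switched to a base image built on TensorRT 10.0. The Dockerfile: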

# nVidia TensorRT Base Image
# https://docs.nvidia.com/deeplearning/frameworks/container-release-notes/index.html#rel-25-06
# tensorrt 10.x support more onnx ops with dynamic shape
ARG TRT_CONTAINER_VERSION=24.05
FROM nvcr.io/nvidia/tensorrt:${TRT_CONTAINER_VERSION}-py3

ARG ONNXRUNTIME_REPO=https://github.com/Microsoft/onnxruntime
ARG ONNXRUNTIME_BRANCH=main

# Adjust as needed
# Check your CUDA arch: https://developer.nvidia.com/cuda-gpus
ARG CMAKE_CUDA_ARCHITECTURES=75

RUN apt-get update &&\
    apt-get install -y sudo git bash unattended-upgrades
RUN unattended-upgrade

WORKDIR /code
ENV PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/code/cmake-3.31.5-linux-x86_64/bin:${PATH}

# Prepare onnxruntime repository & build onnxruntime with TensorRT
RUN git clone --single-branch --branch ${ONNXRUNTIME_BRANCH} --recursive ${ONNXRUNTIME_REPO} onnxruntime &&\
    /bin/sh onnxruntime/dockerfiles/scripts/install_common_deps.sh &&\
    trt_version=${TRT_VERSION:0:4} &&\
    # /bin/sh onnxruntime/dockerfiles/scripts/checkout_submodules.sh ${trt_version} &&\
    cd onnxruntime &&\
    /bin/sh build.sh --allow_running_as_root --parallel --build_shared_lib --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu/ --use_tensorrt --tensorrt_home /usr/lib/x86_64-linux-gnu/ --config Release --build_wheel --skip_tests --skip_submodule_sync --cmake_extra_defines '"CMAKE_CUDA_ARCHITECTURES='${CMAKE_CUDA_ARCHITECTURES}'"' &&\
    pip install /code/onnxruntime/build/Linux/Release/dist/*.whl &&\
    cd ..

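I then ran the conversion directly on the original, unmodified ONNX model. Sure enough, the earlier errors were gone, but a new problem showed up: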

[01/06/2026-12:05:53] [E] Error[4]: [graphShapeAnalyzer.cpp::analyzeShapes::2084] Error Code 4: Miscellaneous (IConditionalOutputLayer /encoder/1/encoder/encoder_pos/If_OutputLayer: /encoder/1/encoder/encoder_pos/If_OutputLayer: dimensions not compatible for if-conditional outputs)

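Netron.app shows that the then branch of this node has an empty output, presumably because an earlier onnxoptimizer pass had optimized it away. In the raw ONNX model exported from PyTorch, both branches have complete outputs. Converting that raw export instead produced this error: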

[01/06/2026-12:19:30] [E] [TRT] ModelImporter.cpp:836: ERROR: ModelImporter.cpp:194 In function parseNode: [6] Invalid Node - /encoder/1/downsample/Softmax

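The reason is that TensorRT requires the input tensor of the Softmax operator to have at least two dimensions, with the first dimension being batch_size. The input here is actually a constant bias, stored as an initializer in ONNX, so I folded the constant input and the Softmax operator into a single constant initializer. After converting again, however, the earlier If error was still there: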

Miscellaneous (IConditionalOutputLayer /encoder/1/encoder/encoder_pos/If_OutputLayer: /encoder/1/encoder/encoder_pos/If_OutputLayer: dimensions not compatible for if-conditional outputs)

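The ONNX model contains an If operator whose else_branch and then_branch subgraphs produce outputs with different shapes, which causes the error above. The then_branch outputs a fixed-size tensor, while the else_branch outputs a dynamically sized one. This is hard to work around: the else_branch output can be larger than the then_branch output, so padding to the fixed size is not an option. One approach that might work is to pick a sufficiently large max_len, pad the outputs of both branches to it, and use a mask to control which part takes part in the computation, but that is clearly too complicated. So this attempt is on hold for now.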
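For reference, the pad-and-mask idea can be sketched in plain Python. This is an untested illustration of the concept, not something that was applied to the model: 1-D lists stand in for the branch outputs, `max_len` is a preset bound chosen by the caller, and the function names are hypothetical. In a real graph the same steps would presumably map to Pad and Where nodes.

```python
def pad_to(seq, max_len, pad_value=0.0):
    """Pad seq to max_len and return (padded, valid_mask)."""
    valid = len(seq)
    if valid > max_len:
        raise ValueError(f"length {valid} exceeds preset max_len {max_len}")
    padding = max_len - valid
    return seq + [pad_value] * padding, [True] * valid + [False] * padding


def select_branch(cond, then_out, else_out, max_len):
    """Pick a branch after both outputs share the same static length."""
    then_padded, then_mask = pad_to(then_out, max_len)
    else_padded, else_mask = pad_to(else_out, max_len)
    # Both candidates now have length max_len, so the two branches no
    # longer disagree on shape; the mask marks which entries are real.
    return (then_padded, then_mask) if cond else (else_padded, else_mask)


values, valid = select_branch(False, [1.0, 2.0], [3.0, 4.0, 5.0], max_len=4)
print(values)  # [3.0, 4.0, 5.0, 0.0]
print(valid)   # [True, True, True, False]
```

The cost is that every downstream consumer has to honor the mask, which is a large part of why this route looks so invasive.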