分布式多卡训练(DDP)踩坑

多卡训练最近在跑yolov10版本的RT-DETR,用来进行目标检测。

单卡训练语句(正常运行):

python main.py

多卡训练语句:

需要通过torch.distributed.launch来启动,一般是单节点,其中CUDA_VISIBLE_DEVICES设置用的显卡编号,也可以不用,直接在main.py里面指定device也行,–nproc_pre_node 每个节点的显卡数量。

python -m torch.distributed.run --nproc_per_node=3 main.pyCUDA_VISIBLE_DEVICES=0,6,7 python -m torch.distributed.run --nproc_per_node=3 main.py

但是运行多卡训练之后,会报错,有的时候训练进程会卡住。错误信息如下,

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/zyy23/yolov10/run_detr.py", line 5, in <module>
[rank0]:     model.train(pretrained=True,
[rank0]:   File "/home/zyy23/yolov10/ultralytics/engine/model.py", line 657, in train
[rank0]:     self.trainer.train()
[rank0]:   File "/home/zyy23/yolov10/ultralytics/engine/trainer.py", line 213, in train
[rank0]:     self._do_train(world_size)
[rank0]:   File "/home/zyy23/yolov10/ultralytics/engine/trainer.py", line 381, in _do_train
[rank0]:     self.loss, self.loss_items = self.model(batch)
[rank0]:   File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1632, in forward
[rank0]:     inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank0]:   File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1523, in _pre_forward
[rank0]:     if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
[rank0]: making sure all `forward` function outputs participate in calculating loss.
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameters which did not receive grad for rank 0: model.28.dec_bbox_head.5.layers.2.bias, model.28.dec_bbox_head.5.layers.2.weight, model.28.dec_bbox_head.5.layers.1.bias, model.28.dec_bbox_head.5.layers.1.weight, model.28.dec_bbox_head.5.layers.0.bias, model.28.dec_bbox_head.5.layers.0.weight, model.28.dec_bbox_head.4.layers.2.bias, model.28.dec_bbox_head.4.layers.2.weight, model.28.dec_bbox_head.4.layers.1.bias, model.28.dec_bbox_head.4.layers.1.weight, model.28.dec_bbox_head.4.layers.0.bias, model.28.dec_bbox_head.4.layers.0.weight, model.28.dec_bbox_head.3.layers.2.bias, model.28.dec_bbox_head.3.layers.2.weight, model.28.dec_bbox_head.3.layers.1.bias, model.28.dec_bbox_head.3.layers.1.weight, model.28.dec_bbox_head.3.layers.0.bias, model.28.dec_bbox_head.3.layers.0.weight, model.28.dec_bbox_head.2.layers.2.bias, model.28.dec_bbox_head.2.layers.2.weight, model.28.dec_bbox_head.2.layers.1.bias, model.28.dec_bbox_head.2.layers.1.weight, model.28.dec_bbox_head.2.layers.0.bias, model.28.dec_bbox_head.2.layers.0.weight, model.28.dec_bbox_head.1.layers.2.bias, model.28.dec_bbox_head.1.layers.2.weight, model.28.dec_bbox_head.1.layers.1.bias, model.28.dec_bbox_head.1.layers.1.weight, model.28.dec_bbox_head.1.layers.0.bias, model.28.dec_bbox_head.1.layers.0.weight, model.28.dec_bbox_head.0.layers.2.bias, model.28.dec_bbox_head.0.layers.2.weight, model.28.dec_bbox_head.0.layers.1.bias, model.28.dec_bbox_head.0.layers.1.weight, model.28.dec_bbox_head.0.layers.0.bias, model.28.dec_bbox_head.0.layers.0.weight, model.28.enc_bbox_head.layers.2.bias, model.28.enc_bbox_head.layers.2.weight, model.28.enc_bbox_head.layers.1.bias, model.28.enc_bbox_head.layers.1.weight, model.28.enc_bbox_head.layers.0.bias, model.28.enc_bbox_head.layers.0.weight, model.28.denoising_class_embed.weight
[rank0]: Parameter indices which did not receive grad for rank 0: 510 521 522 523 524 525 526 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574
[rank1]:[E1122 21:12:02.018431947 ProcessGroupGloo.cpp:143] Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank2]:[E1122 21:12:02.018445283 ProcessGroupGloo.cpp:143] Rank 2 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/zyy23/yolov10/run_detr.py", line 5, in <module>
[rank1]:     model.train(pretrained=True,
[rank1]:   File "/home/zyy23/yolov10/ultralytics/engine/model.py", line 657, in train
[rank1]:     self.trainer.train()
[rank1]:   File "/home/zyy23/yolov10/ultralytics/engine/trainer.py", line 213, in train
[rank1]:     self._do_train(world_size)
[rank1]:   File "/home/zyy23/yolov10/ultralytics/engine/trainer.py", line 389, in _do_train
[rank1]:     self.scaler.scale(self.loss).backward()
[rank1]:   File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/_tensor.py", line 521, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/autograd/__init__.py", line 289, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]: RuntimeError: Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank1]:  Original exception:
[rank1]: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.1.1]:27022
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/zyy23/yolov10/run_detr.py", line 5, in <module>
[rank2]:     model.train(pretrained=True,
[rank2]:   File "/home/zyy23/yolov10/ultralytics/engine/model.py", line 657, in train
[rank2]:     self.trainer.train()
[rank2]:   File "/home/zyy23/yolov10/ultralytics/engine/trainer.py", line 213, in train
[rank2]:     self._do_train(world_size)
[rank2]:   File "/home/zyy23/yolov10/ultralytics/engine/trainer.py", line 389, in _do_train
[rank2]:     self.scaler.scale(self.loss).backward()
[rank2]:   File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/_tensor.py", line 521, in backward
[rank2]:     torch.autograd.backward(
[rank2]:   File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/autograd/__init__.py", line 289, in backward
[rank2]:     _engine_run_backward(
[rank2]:   File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
[rank2]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank2]: RuntimeError: Rank 2 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank2]:  Original exception:
[rank2]: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.1.1]:27022
W1122 21:12:02.606069 139664836297920 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1281666 closing signal SIGTERM
W1122 21:12:02.608416 139664836297920 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1281667 closing signal SIGTERM
E1122 21:12:02.987694 139664836297920 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1281665) of binary: /home/zyy23/anaconda3/envs/mypytorch_3.9/bin/python
Traceback (most recent call last):File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_mainreturn _run_code(code, main_globals, None,File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/runpy.py", line 87, in _run_codeexec(code, run_globals)File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/distributed/run.py", line 905, in <module>main()File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapperreturn f(*args, **kwargs)File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in mainrun(args)File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in runelastic_launch(File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__return launch_agent(self._config, self._entrypoint, list(args))File "/home/zyy23/anaconda3/envs/mypytorch_3.9/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agentraise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_detr.py FAILED
------------------------------------------------------------
Failures:<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:time      : 2024-11-22_21:12:02host      : lab10rank      : 0 (local_rank: 0)exitcode  : 1 (pid: 1281665)error_file: <N/A>traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

发生了runtimerror

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and bymaking sure all `forward` function outputs participate in calculating loss.If you already have done the above, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

看不懂的话,用翻译软件翻译一下

运行时错误:预计在开始新迭代之前已完成前一次迭代的减少。此错误表明您的模块具有未用于产生损耗的参数。您可以通过 (1) 将关键字参数 find_unused_parameters=True 传递给 torch.nn.parallel.DistributedDataParallel 来启用未使用的参数检测; (2) 确保所有 forward 函数输出都参与计算损失。如果您已经完成了上述两个步骤,那么分布式数据并行模块无法在模块的 forward 函数的返回值中定位输出张量。报告此问题时,请包括损失函数和模块 forward 返回值的结构(例如 list、dict、iterable)。

错误原因:

  • 定义了网络层却没在forward()中使用

  • forward()返回的参数未用于梯度计算

  • 使用了不进行梯度回归的参数进行优化

有两种解决方法
一是在to torch.nn.parallel.DistributedDataParallel中加入find_unused_parameters参数并设置初始值为True,find_unused_parameters 是 PyTorch 中的一个参数,用于在分布式训练时优化梯度计算。
self.model = nn.parallel.DistributedDataParallel(self.model, 
device_ids=[RANK],find_unused_parameters=True)

find_unused_parameters参数的作用:

检测未使用的参数:当设置为 True 时,PyTorch 会在每个前向传播过程中检查哪些参数没有被使用。这对于某些模型来说非常有用,特别是当模型的某些部分在特定输入下可能不会被触发时。

减少损失计算时的内存开销,提高性能:通过识别并忽略未使用的参数,PyTorch 可以在计算梯度时减少内存开销,从而提高训练效率。在大型模型或复杂网络中,识别未使用的参数可以加速训练过程,尤其是在分布式设置中,因为可以避免不必要的梯度同步。

适用于动态计算图:对于需要动态变化的模型架构,使用 find_unused_parameters=True 可以确保所有参数都被正确处理。

但是它会增加额外的计算开销用于验证哪些参数是未使用的不用参加到损失的计算中,所以最好是仅在需要时使用此参数,尤其是在模型中确实存在未使用参数的情况下。

由于用的是yolov10,封装的太严密,一直找不到这条语句在哪个位置,找了好久,也尝试在模型初始化的位置和命令行加参数,都没成功,后来在ultralytics/engine里面的trainer.py找到了,在_setup_train函数下面。

def _setup_train(self, world_size):"""Builds dataloaders and optimizer on correct rank process."""# Modelself.run_callbacks("on_pretrain_routine_start")ckpt = self.setup_model()self.model = self.model.to(self.device)self.set_model_attributes()# Freeze layersfreeze_list = (self.args.freezeif isinstance(self.args.freeze, list)else range(self.args.freeze)if isinstance(self.args.freeze, int)else [])always_freeze_names = [".dfl"]  # always freeze these layersfreeze_layer_names = [f"model.{x}." for x in freeze_list] + always_freeze_namesfor k, v in self.model.named_parameters():# v.register_hook(lambda x: torch.nan_to_num(x))  # NaN to 0 (commented for erratic training results)if any(x in k for x in freeze_layer_names):LOGGER.info(f"Freezing layer '{k}'")v.requires_grad = Falseelif not v.requires_grad and v.dtype.is_floating_point:  # only floating point Tensor can require gradientsLOGGER.info(f"WARNING ?? setting 'requires_grad=True' for frozen layer '{k}'. ""See ultralytics.engine.trainer for customization of frozen layers.")v.requires_grad = True# Check AMPself.amp = torch.tensor(self.args.amp).to(self.device)  # True or Falseif self.amp and RANK in (-1, 0):  # Single-GPU and DDPcallbacks_backup = callbacks.default_callbacks.copy()  # backup callbacks as check_amp() resets themself.amp = torch.tensor(check_amp(self.model), device=self.device)callbacks.default_callbacks = callbacks_backup  # restore callbacksif RANK > -1 and world_size > 1:  # DDPdist.broadcast(self.amp, src=0)  # broadcast the tensor from rank 0 to all other ranks (returns None)self.amp = bool(self.amp)  # as booleanself.scaler = torch.cuda.amp.GradScaler(enabled=self.amp)if world_size > 1:self.model = nn.parallel.DistributedDataParallel(self.model, device_ids=[RANK],find_unused_parameters=True)
二是将环境变量TORCH_DISTRIBUTED_DEBUG设置为INFO或DETAIL,打印有关哪些特定参数在此级别上没有收到梯度的信息,作为此错误的一部分。上述报错信息没有显示这个提示,是因为我们已经使用了这条语句,使用之后,会看到哪些参数没有收到梯度信息。
TORCH_DISTRIBUTED_DEBUG=DETAIL python -m torch.distributed.run --nproc_per_node=3 main.py

或者是下面的语句,也可以看哪些参数没有更新

            for name, param in self.model.named_parameters():if param.grad is None:print("The None grad model is:")print(name)

在使用nn.Module的__init__方法时,如果使用self.xx这样的语句定义了层,但是这个层的计算结果后续没有用来计算loss,或者这个self层没有使用,都会导致报错。只需要在模型中仔细检查forward函数和init函数,检查__init__() 方法中是否存在self参数,在 forward 中没有使用的,注释掉即可。(要么使用,要么删除)

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/pingmian/71471.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

LLM大型语言模型(一)

1. 什么是 LLM&#xff1f; LLM&#xff08;大型语言模型&#xff09;是一种神经网络&#xff0c;专门用于理解、生成并对人类文本作出响应。这些模型是深度神经网络&#xff0c;通常训练于海量文本数据上&#xff0c;有时甚至覆盖了整个互联网的公开文本。 LLM 中的 “大” …

确保初始化和销毁操作的线程安全-初始化和销毁

你想为代码中的每行加上注释解释,以下是详细的注释: // 定义初始化函数,接收一个 InitOptions 类型的参数 int initGBB(InitOptions _opts) {// 使用原子操作检查初始化/销毁计数器,并增加计数。如果当前是第一次初始化,执行以下操作if (initFiniCnt_.fetch_add(1, std

蓝桥杯备考:动态规划dp入门题目之数字三角形

依然是按照动态规划dp的顺序来 step1&#xff1a;定义状态表示 f[i][j]表示的是到,j这个坐标的结点时的最大权值 step2: 定义状态方程 i,j坐标可能是从i-1 j-1 到i,j 也可能是从i-1 j到 i,j 所以状态方程应该是 f[i][j] max(f[i-1][j-1],f[i-1][j]) a[i][j] step3:初始化…

HarmonyOS NEXT开发进阶(十一):应用层架构介绍

文章目录 一、前言二、应用与应用程序包三、应用的多Module设计机制四、 Module类型五、Stage模型应用程序包结构六、拓展阅读 一、前言 在应用模型章节&#xff0c;可以看到主推的Stage模型中&#xff0c;多个应用组件共享同一个ArkTS引擎实例&#xff1b;应用组件之间可以方…

深入C语言:指针与数组的经典笔试题剖析

1. sizeof和strlen的对比 1.1 sizeof sizeof 是C语言中的一个操作符&#xff0c;用于计算变量或数据类型所占内存空间的大小&#xff0c;单位是字节。它不关心内存中存储的具体数据内容&#xff0c;只关注内存空间的大小。 #include <stdio.h> int main() {int a 10;…

deepseek+mermaid【自动生成流程图】

成果&#xff1a; 第一步打开deepseek官网(或百度版&#xff08;更快一点&#xff09;)&#xff1a; 百度AI搜索 - 办公学习一站解决 第二步&#xff0c;生成对应的Mermaid流程图&#xff1a; 丢给deepseek代码&#xff0c;或题目要求 生成mermaid代码 第三步将代码复制到me…

Solon AI —— RAG

说明 当前大模型与外部打交道的方式有两种&#xff0c;一种是 Prompt&#xff0c;一种是 Fuction Call。在 Prompt 方面&#xff0c;应用系统可以通过 Prompt 模版和补充上下文的方式&#xff0c;调整用户输入的提示语&#xff0c;使得大模型生成的回答更加准确。 RAG RAG &…

STM32——USART—串口发送

目录 一&#xff1a;USART简介 二&#xff1a;初始化USART 1.开启时钟 2.代码 三&#xff1a;USART发送数据 1.USART发送数据函数 2.获取标志位的状态 3.代码 4.在main.c内调用 5.串口调试 1.串口选择要与设备管理器中的端口保持一致 2.波特率、停止位等要与前面…

基于SpringBoot的在线骑行网站的设计与实现(源码+SQL脚本+LW+部署讲解等)

专注于大学生项目实战开发,讲解,毕业答疑辅导&#xff0c;欢迎高校老师/同行前辈交流合作✌。 技术范围&#xff1a;SpringBoot、Vue、SSM、HLMT、小程序、Jsp、PHP、Nodejs、Python、爬虫、数据可视化、安卓app、大数据、物联网、机器学习等设计与开发。 主要内容&#xff1a;…

通义万相2.1:开启视频生成新时代

文章摘要&#xff1a;通义万相 2.1 是一款在人工智能视频生成领域具有里程碑意义的工具&#xff0c;它通过核心技术的升级和创新&#xff0c;为创作者提供了更强大、更智能的创作能力。本文详细介绍了通义万相 2.1 的背景、核心技术、功能特性、性能评测、用户反馈以及应用场景…

3.3.2 Proteus第一个仿真图

文章目录 文章介绍0 效果图1 新建“点灯”项目2 添加元器件3 元器件布局接线4 补充 文章介绍 本文介绍&#xff1a;使用Proteus仿真软件画第一个仿真图 0 效果图 1 新建“点灯”项目 修改项目名称和路径&#xff0c;之后一直点“下一步”直到完成 2 添加元器件 点击元…

华为OD机试-最长的密码(Java 2024 E卷 100分)

题目描述 小王正在进行游戏大闯关,有一个关卡需要输入一个密码才能通过。密码获得的条件如下: 在一个密码本中,每一页都有一个由26个小写字母组成的密码,每一页的密码不同。需要从这个密码本中寻找这样一个最长的密码,从它的末尾开始依次去掉一位得到的新密码也在密码本…

极狐GitLab 正式发布安全版本17.9.1、17.8.4、17.7.6

本分分享极狐GitLab 补丁版本 17.9.1、17.8.4、17.7.6 的详细内容。这几个版本包含重要的缺陷和安全修复代码&#xff0c;我们强烈建议所有私有化部署用户应该立即升级到上述的某一个版本。对于极狐GitLab SaaS&#xff0c;技术团队已经进行了升级&#xff0c;无需用户采取任何…

Ajax动态加载 和 网页动态渲染 之间的区别及应用场景

更多内容请见: 爬虫和逆向教程-专栏介绍和目录 文章目录 1. Ajax 动态加载2. 动态渲染3. 两者之间的关系和区别3.1 AJAX 动态加载与动态渲染的关系3.2 流程3.3 两者区别4. 实际应用场景4.1 无限滚动4.2 表单提交4.3 单页应用(SPA)4.4 案例5. 总结Ajax 动态加载 和 动态渲染 …

QT——对象树

在上一篇博客我们已经学会了QT的坏境安装以及打印一个hello world&#xff0c;但是如果有细心的朋友看了代码&#xff0c;就会发现有一个严重的问题&#xff0c;从C语法看来存在内存泄漏。 上面的代码实际上并没有发送内存泄漏&#xff0c;是不是觉得有点奇怪&#xff1f;大家有…

深度学习之-“深入理解梯度下降”

梯度下降是机器学习和深度学习的核心优化算法&#xff0c;几乎所有的模型训练都离不开它。然而&#xff0c;梯度下降并不是一个单一的算法&#xff0c;而是一个庞大的家族&#xff0c;包含了许多变体和改进方法。本文将从最基础的梯度下降开始&#xff0c;逐步深入学习&#xf…

力扣-字符串

字符串不能被修改&#xff0c;所以如果有想修改字符串的行为&#xff0c;需要转换为StringBuilder StringBuilder里也有很多封装方法String没有&#xff0c;比如reverse() StringBuilder sb new StringBuilder();// 添加字符串 sb.append("Hello"); sb.append(&qu…

flink重启策略

一、重启策略核心意义 Flink 重启策略&#xff08;Restart Strategy&#xff09;是容错机制的核心组件&#xff0c;用于定义作业在发生故障时如何恢复执行。其核心目标为&#xff1a; 最小化停机时间&#xff1a;快速恢复数据处理&#xff0c;降低业务影响。平衡资源消耗&…

Java TCP 通信:实现简单的 Echo 服务器与客户端

TCP&#xff08;Transmission Control Protocol&#xff09;是一种面向连接的、可靠的传输层协议。与 UDP 不同&#xff0c;TCP 保证了数据的顺序、可靠性和完整性&#xff0c;适用于需要可靠传输的应用场景&#xff0c;如文件传输、网页浏览等。本文将基于 Java 实现一个简单的…

Ollama+Deepseek-R1+Continue本地集成VScode

一、OllamaDeepseek-R1Continue本地集成VScode 1&#xff09;安装前知识点 Continue 介绍 详情可参照官网&#xff1a; continue官网 Continue 是 Visual Studio Code 和 JetBrains 中领先的开源 AI 代码助手。 •在侧边栏中进行聊天以理解和迭代代码。 •自动补全&#…