了解一下InternLM2

  • 大模型的出现和发展得益于增长的数据量、计算能力的提升以及算法优化等因素。这些模型在各种任务中展现出惊人的性能,比如自然语言处理、计算机视觉、语音识别等。这种模型通常采用深度神经网络结构,如 TransformerBERTGPT( Generative Pre-trained Transformer )等。大模型的优势在于其能够捕捉和理解数据中更为复杂、抽象的特征和关系。通过大规模参数的学习,它们可以提高在各种任务上的泛化能力,并在未经过大量特定领域数据训练的情况下实现较好的表现。然而,大模型也面临着一些挑战,比如巨大的计算资源需求、高昂的训练成本、对大规模数据的依赖以及模型的可解释性等问题。因此,大模型的应用和发展也需要在性能、成本和道德等多个方面进行权衡和考量。InternLM-7B 包含了一个拥有 70 亿参数的基础模型和一个为实际场景量身定制的对话模型。该模型具有以下特点:1,利用数万亿的高质量 token 进行训练,建立了一个强大的知识库;2.支持 8k token 的上下文窗口长度,使得输入序列更长并增强了推理能力。基于 InternLM 训练框架,上海人工智能实验室已经发布了两个开源的预训练模型:InternLM-7B 和 InternLM-20B

  • InternLM 是一个开源的轻量级训练框架,旨在支持大模型训练而无需大量的依赖。通过单一的代码库,它支持在拥有数千个 GPU 的大型集群上进行预训练,并在单个 GPU 上进行微调,同时实现了卓越的性能优化。在 1024GPU 上训练时,InternLM 可以实现近 90% 的加速效率。基于 InternLM 训练框架,上海人工智能实验室已经发布了两个开源的预训练模型:InternLM-7BInternLM-20BLagent 是一个轻量级、开源的基于大语言模型的智能体(agent)框架,支持用户快速地将一个大语言模型转变为多种类型的智能体,并提供了一些典型工具为大语言模型赋能。通过 Lagent 框架可以更好的发挥 InternLM 的全部性能。

  • 7B demo 的训练配置文件样例如下:

    • JOB_NAME = "7b_train"
      SEQ_LEN = 2048
      HIDDEN_SIZE = 4096
      NUM_ATTENTION_HEAD = 32
      MLP_RATIO = 8 / 3
      NUM_LAYER = 32
      VOCAB_SIZE = 103168
      MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
      # Ckpt folder format:
      # fs: 'local:/mnt/nfs/XXX'
      SAVE_CKPT_FOLDER = "local:llm_ckpts"
      LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
      # boto3 Ckpt folder format:
      # import os
      # BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
      # SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
      # LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
      CHECKPOINT_EVERY = 50
      ckpt = dict(enable_save_ckpt=False,  # enable ckpt save.save_ckpt_folder=SAVE_CKPT_FOLDER,  # Path to save training ckpt.# load_ckpt_folder=LOAD_CKPT_FOLDER, # Ckpt path to resume training(load weights and scheduler/context states).# load_model_only_folder=MODEL_ONLY_FOLDER, # Path to initialize with given model weights.load_optimizer=True,  # Wheter to load optimizer states when continuing training.checkpoint_every=CHECKPOINT_EVERY,async_upload=True,  # async ckpt upload. (only work for boto3 ckpt)async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/",  # path for temporarily files during asynchronous upload.snapshot_ckpt_folder="/".join([SAVE_CKPT_FOLDER, "snapshot"]),  # directory for snapshot ckpt storage path.oss_snapshot_freq=int(CHECKPOINT_EVERY / 2),  # snapshot ckpt save frequency.
      )
      TRAIN_FOLDER = "/path/to/dataset"
      VALID_FOLDER = "/path/to/dataset"
      data = dict(seq_len=SEQ_LEN,# micro_num means the number of micro_batch contained in one gradient updatemicro_num=4,# packed_length = micro_bsz * SEQ_LENmicro_bsz=2,# defaults to the value of micro_numvalid_micro_num=4,# defaults to 0, means disable evaluatevalid_every=50,pack_sample_into_one=False,total_steps=50000,skip_batches="",rampup_batch_size="",# Datasets with less than 50 rows will be discardedmin_length=50,# train_folder=TRAIN_FOLDER,# valid_folder=VALID_FOLDER,
      )
      grad_scaler = dict(fp16=dict(# the initial loss scale, defaults to 2**16initial_scale=2**16,# the minimum loss scale, defaults to Nonemin_scale=1,# the number of steps to increase loss scale when no overflow occursgrowth_interval=1000,),# the multiplication factor for increasing loss scale, defaults to 2growth_factor=2,# the multiplication factor for decreasing loss scale, defaults to 0.5backoff_factor=0.5,# the maximum loss scale, defaults to Nonemax_scale=2**24,# the number of overflows before decreasing loss scale, defaults to 2hysteresis=2,
      )
      hybrid_zero_optimizer = dict(# Enable low_level_optimzer overlap_communicationoverlap_sync_grad=True,overlap_sync_param=True,# bucket size for nccl communication paramsreduce_bucket_size=512 * 1024 * 1024,# grad clippingclip_grad_norm=1.0,
      )
      loss = dict(label_smoothing=0,
      )
      adam = dict(lr=1e-4,adam_beta1=0.9,adam_beta2=0.95,adam_beta2_c=0,adam_eps=1e-8,weight_decay=0.01,
      )lr_scheduler = dict(total_steps=data["total_steps"],init_steps=0,  # optimizer_warmup_stepwarmup_ratio=0.01,eta_min=1e-5,last_epoch=-1,
      )beta2_scheduler = dict(init_beta2=adam["adam_beta2"],c=adam["adam_beta2_c"],cur_iter=-1,
      )model = dict(checkpoint=False,  # The proportion of layers for activation aheckpointing, the optional value are True/False/[0-1]num_attention_heads=NUM_ATTENTION_HEAD,embed_split_hidden=True,vocab_size=VOCAB_SIZE,embed_grad_scale=1,parallel_output=True,hidden_size=HIDDEN_SIZE,num_layers=NUM_LAYER,mlp_ratio=MLP_RATIO,apply_post_layer_norm=False,dtype="torch.float16",  # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"norm_type="rmsnorm",layer_norm_epsilon=1e-5,use_flash_attn=True,num_chunks=1,  # if num_chunks > 1, interleaved pipeline scheduler is used.
      )
      """
      zero1 parallel:1. if zero1 <= 0, The size of the zero process group is equal to the size of the dp process group,so parameters will be divided within the range of dp.2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.3. zero1 > 1 and zero1 <= dp world size, the world size of zero is a subset of dp world size.For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
      pipeline parallel (dict):1. size: int, the size of pipeline parallel.2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
      tensor parallel: tensor parallel size, usually the number of GPUs per node.
      """
      parallel = dict(zero1=8,pipeline=dict(size=1, interleaved_overlap=True),sequence_parallel=False,
      )cudnn_deterministic = False
      cudnn_benchmark = False
      
  • 30B demo 训练配置文件样例如下:

    • JOB_NAME = "30b_train"
      SEQ_LEN = 2048
      HIDDEN_SIZE = 6144
      NUM_ATTENTION_HEAD = 48
      MLP_RATIO = 8 / 3
      NUM_LAYER = 60
      VOCAB_SIZE = 103168
      MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
      # Ckpt folder format:
      # fs: 'local:/mnt/nfs/XXX'
      SAVE_CKPT_FOLDER = "local:llm_ckpts"
      LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
      # boto3 Ckpt folder format:
      # import os
      # BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
      # SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
      # LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
      CHECKPOINT_EVERY = 50
      ckpt = dict(enable_save_ckpt=False,  # enable ckpt save.save_ckpt_folder=SAVE_CKPT_FOLDER,  # Path to save training ckpt.# load_ckpt_folder=LOAD_CKPT_FOLDER, # Ckpt path to resume training(load weights and scheduler/context states).# load_model_only_folder=MODEL_ONLY_FOLDER, # Path to initialize with given model weights.load_optimizer=True,  # Wheter to load optimizer states when continuing training.checkpoint_every=CHECKPOINT_EVERY,async_upload=True,  # async ckpt upload. (only work for boto3 ckpt)async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/",  # path for temporarily files during asynchronous upload.snapshot_ckpt_folder="/".join([SAVE_CKPT_FOLDER, "snapshot"]),  # directory for snapshot ckpt storage path.oss_snapshot_freq=int(CHECKPOINT_EVERY / 2),  # snapshot ckpt save frequency.
      )
      TRAIN_FOLDER = "/path/to/dataset"
      VALID_FOLDER = "/path/to/dataset"
      data = dict(seq_len=SEQ_LEN,# micro_num means the number of micro_batch contained in one gradient updatemicro_num=4,# packed_length = micro_bsz * SEQ_LENmicro_bsz=2,# defaults to the value of micro_numvalid_micro_num=4,# defaults to 0, means disable evaluatevalid_every=50,pack_sample_into_one=False,total_steps=50000,skip_batches="",rampup_batch_size="",# Datasets with less than 50 rows will be discardedmin_length=50,# train_folder=TRAIN_FOLDER,# valid_folder=VALID_FOLDER,
      )
      grad_scaler = dict(fp16=dict(# the initial loss scale, defaults to 2**16initial_scale=2**16,# the minimum loss scale, defaults to Nonemin_scale=1,# the number of steps to increase loss scale when no overflow occursgrowth_interval=1000,),# the multiplication factor for increasing loss scale, defaults to 2growth_factor=2,# the multiplication factor for decreasing loss scale, defaults to 0.5backoff_factor=0.5,# the maximum loss scale, defaults to Nonemax_scale=2**24,# the number of overflows before decreasing loss scale, defaults to 2hysteresis=2,
      )
      hybrid_zero_optimizer = dict(# Enable low_level_optimzer overlap_communicationoverlap_sync_grad=True,overlap_sync_param=True,# bucket size for nccl communication paramsreduce_bucket_size=512 * 1024 * 1024,# grad clippingclip_grad_norm=1.0,
      )
      loss = dict(label_smoothing=0,
      )
      adam = dict(lr=1e-4,adam_beta1=0.9,adam_beta2=0.95,adam_beta2_c=0,adam_eps=1e-8,weight_decay=0.01,
      )lr_scheduler = dict(total_steps=data["total_steps"],init_steps=0,  # optimizer_warmup_stepwarmup_ratio=0.01,eta_min=1e-5,last_epoch=-1,
      )
      beta2_scheduler = dict(init_beta2=adam["adam_beta2"],c=adam["adam_beta2_c"],cur_iter=-1,
      )model = dict(checkpoint=False,  # The proportion of layers for activation aheckpointing, the optional value are True/False/[0-1]num_attention_heads=NUM_ATTENTION_HEAD,embed_split_hidden=True,vocab_size=VOCAB_SIZE,embed_grad_scale=1,parallel_output=True,hidden_size=HIDDEN_SIZE,num_layers=NUM_LAYER,mlp_ratio=MLP_RATIO,apply_post_layer_norm=False,dtype="torch.float16",  # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"norm_type="rmsnorm",layer_norm_epsilon=1e-5,use_flash_attn=True,num_chunks=1,  # if num_chunks > 1, interleaved pipeline scheduler is used.
      )
      """
      zero1 parallel:1. if zero1 <= 0, The size of the zero process group is equal to the size of the dp process group,so parameters will be divided within the range of dp.2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.3. zero1 > 1 and zero1 <= dp world size, the world size of zero is a subset of dp world size.For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
      pipeline parallel (dict):1. size: int, the size of pipeline parallel.2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
      tensor parallel: tensor parallel size, usually the number of GPUs per node.
      """
      parallel = dict(zero1=-1,tensor=4,pipeline=dict(size=1, interleaved_overlap=True),sequence_parallel=False,
      )
      cudnn_deterministic = False
      cudnn_benchmark = False
      

30B Demo — InternLM 0.2.0 文档

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/610566.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

云平台架构知识点总结

云平台架构 内部特供 版权所有 翻版必究 项目1 云计算的认知与体验 1.1 云计算定义 ​ 维基百科定义:云计算将IT相关的能力以服务的方式提供给用户&#xff0c;允许用户在不了解提供服务的技术﹑没有相关知识以及设备操作能力的情况下&#xff0c;通过Internet 获取需要的…

Guava:StopWatch 计时器

简介 StopWatch 用来计算经过的时间&#xff08;精确到纳秒&#xff09;。 这个类比调用 System.now() 优势在于&#xff1a; 性能 表现形式更丰富 类方法说明 官方文档&#xff1a;Stopwatch (Guava: Google Core Libraries for Java 27.0.1-jre API) 方法名称方法描述…

浅谈有源滤波在某棉纺企业低压配电室节能应用

叶根胜 安科瑞电气股份有限公司 上海嘉定 201801 0引言 在电力系统中&#xff0c;谐波的根本原因是非线性负载。由于电子设备、变频器、整流器等。在棉纺企业的大部分设备中被广泛使用&#xff0c;因此会产生大量的谐波电流。我公司的环锭纱线和筒车间的配电设备受到谐波的影…

c语言:输入成绩,统计不及格人数|练习题

一、题目 输入学生成绩&#xff0c;统计不及格的学生人数 二、代码截图【带注释】 三、源代码【带注释】 #include <stdio.h> //题目&#xff1a;输入成绩&#xff0c;统计不及格人数 //思考分析 //1、由于学生人数是未知数&#xff0c;所以可以在输入时&#xff0c;以0…

TextDiffuser-2:超越DALLE-3的文本图像融合技术

概述 近年来&#xff0c;扩散模型在图像生成领域取得了显著进展&#xff0c;但在文本图像融合方面依然存在挑战。TextDiffuser-2的出现&#xff0c;标志着在这一领域的一个重要突破&#xff0c;它成功地结合了大型语言模型的能力&#xff0c;以实现更高效、多样化且美观的文本…

前端开发Docker了解

1&#xff0c;docker简介 docker主要解决了最初软件开发环境配置的困难&#xff0c;完善了虚拟机部署的资源占用多&#xff0c;启动慢等缺点&#xff0c;保证了一致的运行环境&#xff0c;可以更轻松的维护和扩展。docker在linux容器的基础上进行了进一步的封装&#xff0c;提…

Java--业务场景:获取请求的ip属地信息

文章目录 前言步骤在pom文件中引入下列依赖IpUtil工具类在Controller层编写接口&#xff0c;获取请求的IP属地测试接口 IpInfo类中的方法 前言 很多时候&#xff0c;项目里需要展示用户的IP属地信息&#xff0c;所以这篇文章就记录一下如何在Java Spring boot项目里获取请求的…

springboot集成jsp

首先pom中引入依赖包 <!--引入servlet--> <dependency><groupId>javax.servlet</groupId><artifactId>javax.servlet-api</artifactId> </dependency> <!--引入jstl标签库--> <dependency><groupId>javax.servle…

2023年全国职业院校技能大赛(高职组)“云计算应用”赛项赛卷③

2023年全国职业院校技能大赛&#xff08;高职组&#xff09; “云计算应用”赛项赛卷3 目录 需要竞赛软件包环境以及备赛资源可私信博主&#xff01;&#xff01;&#xff01; 2023年全国职业院校技能大赛&#xff08;高职组&#xff09; “云计算应用”赛项赛卷3 模块一 …

springboot2.7集成sharding-jdbc4.1.1实现业务分表

1、引入maven <dependency><groupId>org.apache.shardingsphere</groupId><artifactId>sharding-jdbc-spring-boot-starter</artifactId><version>4.1.1</version></dependency> 2、基本代码示例 基本逻辑&#xff1a;利用数…

软件测试|Django 入门:构建Python Web应用的全面指南

引言 Django 是一个强大的Python Web框架&#xff0c;它以快速开发和高度可扩展性而闻名。本文将带您深入了解Django的基本概念和核心功能&#xff0c;帮助您从零开始构建一个简单的Web应用。 什么是Django&#xff1f; Django 是一个基于MVC&#xff08;模型-视图-控制器&a…

A preview error may have occurred. Switch to the Log tab to view details.

记录一下当时刚开始学习鸿蒙开发犯的错误 UIAbility内页面间的跳转内容的时候会遇到页面无法跳转的问题 并伴随标题错误 我们跳转页面需要进行注册 路由表路径&#xff1a; entry > src > main > resources > base > profile > main_pages.json 或者是页面…

Docker(网络,网络通信,资源控制)

docker网络 网络实现原理 Docker 网络是指由 Docker 为应用程序创建的虚拟环境的一部分&#xff0c;它允许应用程序从宿主机操作系统的网络环境中独立出来&#xff0c;形成容器自有的网络设备、IP 协议栈、端口套接字、IP 路由表、防火墙等与网络相关的模块。Docker 的网络功能…

鸿鹄电子招投标系统:企业战略布局下的采购寻源解决方案

在数字化采购领域&#xff0c;企业需要一个高效、透明和规范的管理系统。通过采用Spring Cloud、Spring Boot2、Mybatis等先进技术&#xff0c;我们打造了全过程数字化采购管理平台。该平台具备内外协同的能力&#xff0c;通过待办消息、招标公告、中标公告和信息发布等功能模块…

面试宝典进阶之Java线程面试题

T1、【初级】线程和进程有什么区别&#xff1f; &#xff08;1&#xff09;线程是CPU调度的最小单位&#xff0c;进程是计算分配资源的最小单位。 &#xff08;2&#xff09;一个进程至少要有一个线程。 &#xff08;3&#xff09;进程之间的内存是隔离的&#xff0c;而同一个…

Untiy HTC Vive VRTK 开发记录

目录 一.概述 二.功能实现 1.模型抓取 1&#xff09;基础抓取脚本 2&#xff09;抓取物体在手柄上的角度 2.模型放置区域高亮并吸附 1&#xff09;VRTK_SnapDropZone 2&#xff09;VRTK_PolicyList 3&#xff09;VRTK_SnapDropZone_UnityEvents 3.交互滑动条 4.交互旋…

秋招阿里巴巴java笔试试题-精

一、单项选择题 1、以下函数的时间复杂度是 &#xff08; &#xff09; 1 2 3 4 5 6 7 8 9 void func(int x,int y, int z){ if(x<0) printf("%d, %d\n", y, z); else { func(x-1,y1,z); func(x-1,y,z1); } } A.O(x*y*z) B.O(x^2*y^2) C.O(2^x) D.O(2^x*…

coredump+gdb调试

1、什么是coredump Coredump&#xff08;核心转储&#xff09;是操作系统在程序异常终止&#xff08;例如由于段错误或其他严重错误&#xff09;时创建的一种文件。这个文件包含了程序崩溃时刻进程的内存镜像&#xff0c;通常还包括程序计数器、寄存器内容和堆栈内存等信息&am…

nginx 二级目录部署vue项目

主要是vue项目得更改资源路径 通过.env环境变量来设置 修改项目的基础路径&#xff0c;我的是vite项目&#xff0c;所以我要在vite.config.js中修改base属性 为 ‘/threejs/’修改vue-router的base路径为’/threejs’ 1.vite项目的基础路径 getEnvConfig 方法是封装的获取环境…

【Axure视频教程】可视化饼图

今天教大家在Axure制作可视化饼图的原型模板&#xff0c;鼠标移入饼图对应的扇形区域&#xff0c;该区域的扇形会高亮变色&#xff0c;而且显示该区域对应的数据&#xff0c;那这个模板是用Axure的原生元件制作的&#xff0c;不需要联网或者调用接口&#xff0c;通过基础元件和…