[Paper Digest] Week 04, 2025 (Robotics/Embodied AI/LLM)

Contents

  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
    • Abstract
  • Evolving Deeper LLM Thinking
    • Abstract
  • Kimi k1.5: Scaling Reinforcement Learning with LLMs
    • Abstract
  • Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
    • Abstract
  • VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
    • Abstract
  • MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
    • Abstract
  • FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
    • Abstract
  • SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
    • Abstract
  • Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
    • Abstract
  • GameFactory: Creating New Games with Generative Interactive Videos
    • Abstract
  • Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
    • Abstract
  • UI-TARS: Pioneering Automated GUI Interaction with Native Agents
    • Abstract
  • Improving Video Generation with Human Feedback
    • Abstract
  • PaSa: An LLM Agent for Comprehensive Academic Paper Search
    • Abstract
  • Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
    • Abstract
  • TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
    • Abstract
  • InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
    • Abstract
  • Autonomy-of-Experts Models
    • Abstract
  • Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
    • Abstract
  • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
    • Abstract
  • Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
    • Abstract
  • Reasoning Language Models: A Blueprint
    • Abstract
  • Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
    • Abstract
  • VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
    • Abstract
  • O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
    • Abstract

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

  • 作者: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12948

Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.


Evolving Deeper LLM Thinking

  • 作者: Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, Xinyun Chen

  • Date: 2025-01-17

  • Paper link: https://arxiv.org/pdf/2501.09891

Abstract

We explore an evolutionary search strategy for scaling inference time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine and refine candidate responses. The proposed approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver.
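
As a rough illustration of the evolutionary loop described above (generate, evaluate, then recombine and refine complete candidate responses), the sketch below assumes three hypothetical callables, `llm_generate`, `llm_refine`, and `evaluate`, standing in for the language model and the solution evaluator; it is a toy sketch, not the paper's implementation.

```python
import random

def mind_evolution_sketch(task, llm_generate, llm_refine, evaluate,
                          population_size=8, generations=4):
    """Toy evolutionary search over complete candidate responses.

    llm_generate(task) -> str                      : proposes a fresh candidate
    llm_refine(task, parents, feedback) -> str     : recombines / refines parents
    evaluate(task, candidate) -> (score, feedback) : programmatic solution evaluator
    All three callables are hypothetical stand-ins for LLM / evaluator calls.
    """
    population = [llm_generate(task) for _ in range(population_size)]

    for _ in range(generations):
        scored = sorted(population, key=lambda c: evaluate(task, c)[0], reverse=True)
        parents = scored[: population_size // 2]          # selection: keep the top half

        children = []
        while len(children) < population_size - len(parents):
            pair = random.sample(parents, k=min(2, len(parents)))
            feedback = [evaluate(task, p)[1] for p in pair]
            children.append(llm_refine(task, pair, feedback))  # recombine + refine

        population = parents + children

    return max(population, key=lambda c: evaluate(task, c)[0])
```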


Kimi k1.5: Scaling Reinforcement Learning with LLMs

  • 作者: Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang, Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang, Wenhao Wu, Wenyang He, Xianghui Wei, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinxing Zu, Xinyu Zhou, Xuehai Pan, Y. Charles, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yibo Liu, Yidao Qin, Yifeng Liu, Ying Yang, Yiping Bao, Yulun Du, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zhen Zhu, Zheng Zhang, Zhexu Wang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Ziyao Xu, Zonghan Yang

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12599

Abstract

Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities – e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista – matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results – e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on LiveCodeBench – outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).


Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

  • 作者: Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, Jiecao Chen

  • Date: 2025-01-20

  • Paper link: https://arxiv.org/pdf/2501.11425

Abstract

Large Language Models (LLMs) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly due to the inability to recover from errors. However, step-level critique data is difficult and expensive to collect. Automating and dynamically constructing self-critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables language Agents to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recover correct trajectories from erroneous ones. A key challenge of agent reflection lies in the necessity for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from it, we splice it with the adjacent correct path, which shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, therefore yielding better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).
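
The splicing idea above (flag the first error step, then continue along the sibling correct path that shares the same parent node) can be pictured with a toy helper; the step tuple layout and the `reflect` marker below are assumptions for illustration, not Agent-R's actual data format.

```python
def build_reflection_trajectory(failed_traj, good_traj, first_error_step, reflection_msg):
    """Splice a revision trajectory from a failed rollout and a sibling good path.

    failed_traj / good_traj : lists of (state, action) steps sharing the same prefix
                              up to the parent of the first error (e.g., found via MCTS).
    first_error_step        : index of the first bad action flagged by the actor model.
    reflection_msg          : a textual reflection inserted at the revision point.
    The returned trajectory keeps the erroneous prefix, adds the reflection, then
    continues along the correct branch from the shared parent node.
    """
    prefix = failed_traj[: first_error_step + 1]     # include the first erroneous action
    correction = good_traj[first_error_step:]        # correct branch from the shared parent
    return prefix + [("reflect", reflection_msg)] + correction
```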


VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

  • 作者: Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.13106

  • Project link: https://github.com/DAMO-NLP-SG/VideoLLaMA3

Abstract

In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) vision-centric alignment stage, which warms up the vision encoder and projector; 2) vision-language pretraining stage, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, charts) as well as text-only data; 3) multi-task fine-tuning stage, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; 4) video-centric fine-tuning, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into vision tokens with corresponding numbers, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos will be more precise and compact. Benefiting from vision-centric designs, VideoLLaMA3 achieves compelling performances in both image and video understanding benchmarks.
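
The video-token reduction step can be pictured as dropping tokens that are nearly identical to the previously kept one. The sketch below is a minimal illustration with an assumed cosine-similarity threshold; the paper's actual merging rule may differ.

```python
import torch

def prune_similar_video_tokens(tokens: torch.Tensor, threshold: float = 0.95):
    """Drop vision tokens that are nearly identical to the previously kept token.

    tokens    : (num_tokens, dim) video patch/frame embeddings in temporal order.
    threshold : cosine-similarity cutoff above which a token is treated as redundant.
    Returns the kept tokens and their original indices.
    """
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    keep = [0]                                        # always keep the first token
    for i in range(1, tokens.size(0)):
        sim = (normed[i] * normed[keep[-1]]).sum()    # cosine similarity to the last kept token
        if sim < threshold:
            keep.append(i)
    idx = torch.tensor(keep)
    return tokens[idx], idx

# A mostly static clip collapses to far fewer tokens than a fast-changing one.
kept, kept_idx = prune_similar_video_tokens(torch.randn(64, 256))
```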


MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

  • 作者: Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12380

  • Project link: https://mmvu-benchmark.github.io/

Abstract

We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.


FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces

  • 作者: Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12909

Abstract

Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI's text-to-video model Sora and our FilmAgent in filmmaking.


SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

  • 作者: Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.13200

Abstract

Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT), which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.


Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

  • 作者: Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.11873

Abstract

This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as $\mathrm{LBL} = N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of expert $i$. Existing MoE training frameworks usually employ the parallel training strategy so that $f_i$ and the LBL are calculated within a micro-batch and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute the tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a global-batch to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoEs-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.
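
A minimal sketch of the loss and of the micro-batch versus global-batch distinction is given below. It uses top-1 routing for simplicity and simulates the synchronization step by concatenating micro-batch statistics; a real implementation would all-reduce the expert frequencies across data-parallel ranks.

```python
import torch

def load_balancing_loss(gating_probs: torch.Tensor, expert_assign: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """LBL = N_E * sum_i f_i * p_i for one batch of routed tokens.

    gating_probs  : (num_tokens, num_experts) softmax scores from the router.
    expert_assign : (num_tokens,) index of the expert each token was routed to (top-1).
    """
    # f_i: fraction of tokens routed to expert i.
    f = torch.bincount(expert_assign, minlength=num_experts).float() / expert_assign.numel()
    # p_i: average gating score of expert i over the batch.
    p = gating_probs.mean(dim=0)
    return num_experts * torch.sum(f * p)

# Simulated comparison (no real distributed training): averaging per-micro-batch
# losses pushes for balance inside every tiny batch, while computing the loss from
# pooled (synchronized) statistics only asks for balance at the corpus level.
micro_batches = []
for _ in range(16):
    probs = torch.softmax(torch.randn(4, 8), dim=-1)        # 4 tokens, 8 experts
    micro_batches.append((probs, probs.argmax(dim=-1)))     # top-1 routing

micro_lbl = torch.stack([load_balancing_loss(p, a, 8) for p, a in micro_batches]).mean()
global_probs = torch.cat([p for p, _ in micro_batches])
global_assign = torch.cat([a for _, a in micro_batches])
global_lbl = load_balancing_loss(global_probs, global_assign, 8)
```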


GameFactory: Creating New Games with Generative Interactive Videos

  • 作者: Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu

  • Date: 2025-01-14

  • Paper link: https://arxiv.org/pdf/2501.08325

  • Project link: https://yujiwen.github.io/gamefactory/

Abstract

Generative game engines have the potential to revolutionize game development by autonomously creating new content and reducing manual workload. However, existing video-based game generation methods fail to address the critical challenge of scene generalization, limiting their applicability to existing games with fixed styles and scenes. In this paper, we present GameFactory, a framework focused on exploring scene generalization in game video generation. To enable the creation of entirely new and diverse games, we leverage pre-trained video diffusion models trained on open-domain video data. To bridge the domain gap between open-domain priors and the small-scale game dataset, we propose a multi-phase training strategy that decouples game style learning from action control, preserving open-domain generalization while achieving action controllability. Using Minecraft as our data source, we release GF-Minecraft, a high-quality and diverse action-annotated video dataset for research. Furthermore, we extend our framework to enable autoregressive action-controllable game video generation, allowing the production of unlimited-length interactive game videos. Experimental results demonstrate that GameFactory effectively generates open-domain, diverse, and action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and project page are publicly available at https://vvictoryuki.github.io/gamefactory/.


Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

  • 作者: Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12895

Abstract

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of the LLM to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.
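
A toy version of the loop might look like the following; `llm`, `reward_model`, and `critic_llm` are hypothetical stand-ins, and the prompt format is an assumption rather than the paper's exact template.

```python
def test_time_preference_optimization(prompt, llm, reward_model, critic_llm,
                                      num_steps=3, num_candidates=4):
    """Toy TPO-style loop: score candidates, turn the scores into a textual critique,
    and ask the model to revise its answer conditioned on that critique.

    llm(text) -> str                        : hypothetical generation call
    reward_model(prompt, response) -> float : hypothetical numerical reward
    critic_llm(prompt, best, worst) -> str  : hypothetical call that writes the critique
    No model parameters are updated at any point.
    """
    candidates = [llm(prompt) for _ in range(num_candidates)]

    for _ in range(num_steps):
        ranked = sorted(candidates, key=lambda r: reward_model(prompt, r))
        worst, best = ranked[0], ranked[-1]

        # Translate the numerical reward signal into a textual critique.
        critique = critic_llm(prompt, best, worst)

        # Sample revised candidates conditioned on the draft and the critique.
        revision_prompt = (f"{prompt}\n\nDraft answer:\n{best}\n\n"
                           f"Critique:\n{critique}\n\nWrite an improved answer.")
        candidates = [llm(revision_prompt) for _ in range(num_candidates)]

    return max(candidates, key=lambda r: reward_model(prompt, r))
```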


UI-TARS: Pioneering Automated GUI Interaction with Native Agents

  • 作者: Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12326

Abstract

This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc.; (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.


Improving Video Generation with Human Feedback

  • 作者: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang

  • Date: 2025-01-23

  • Paper link: https://arxiv.org/pdf/2501.13918

Abstract

Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies, direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.


PaSa: An LLM Agent for Comprehensive Academic Paper Search

  • 作者: Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E

  • Date: 2025-01-17

  • Paper link: https://arxiv.org/pdf/2501.10120

Abstract

We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4 for paraphrased queries, chatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50. It also exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.


Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models

  • 作者: Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang

  • Date: 2025-01-23

  • Paper link: https://arxiv.org/pdf/2501.13629

Abstract

We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention, and pre-trained on our meticulously collected system domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on the model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q to expand the Q head dimension, which enhances the model's representation capacity with minimal impacts on the inference speed. Rigorous theoretical and empirical analyses reveal that DiffQKV attention significantly enhances efficiency, achieving up to a 33.36% improvement in inference speed over the conventional grouped-query attention (GQA) in long-context scenarios. We pre-train Sigma on 6T tokens from various sources, including 19.5B system domain data that we carefully collect and 1T tokens of synthesized and rewritten data. In general domains, Sigma achieves comparable performance to other state-of-the-art models. In the system domain, we introduce the first comprehensive benchmark AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement up to 52.5%.


TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

  • 作者: Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12224

Abstract

We present TokenVerse – a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. The project's webpage is at https://token-verse.github.io/.


InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

  • 作者: Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12368

Abstract

Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training: integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer.


Autonomy-of-Experts Models

  • 作者: Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.13074

Abstract

Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
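
The norm-based selection can be sketched as below, assuming each expert exposes a low-rank down projection whose output norm is used for ranking; the expert parameterization and activation function here are illustrative, not the paper's exact design.

```python
import torch

def aoe_forward(x: torch.Tensor, experts: list, top_k: int = 2) -> torch.Tensor:
    """Router-free expert selection by internal activation norm (toy version).

    x       : (num_tokens, dim) token representations.
    experts : list of {"down": (dim, rank), "up": (rank, dim)} weight pairs; this
              low-rank split is an assumed parameterization for illustration.
    Every expert computes a cheap low-rank pre-activation for all tokens; each token
    is then fully processed only by the experts whose pre-activation norm ranks top-k.
    """
    pre_acts = [x @ e["down"] for e in experts]                       # (num_tokens, rank) each
    norms = torch.stack([a.norm(dim=-1) for a in pre_acts], dim=-1)   # (num_tokens, num_experts)
    topk = norms.topk(top_k, dim=-1).indices                          # per-token expert ranking

    out = torch.zeros_like(x)
    for t in range(x.size(0)):
        for e_idx in topk[t].tolist():
            h = torch.relu(pre_acts[e_idx][t])        # finish the selected expert's forward pass
            out[t] += h @ experts[e_idx]["up"]
    return out

# Example with 4 experts of rank 8 on 10 tokens of width 64.
dim, rank = 64, 8
experts = [{"down": torch.randn(dim, rank), "up": torch.randn(rank, dim)} for _ in range(4)]
y = aoe_forward(torch.randn(10, dim), experts, top_k=2)
```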


Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

  • 作者: Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12202

Abstract

We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model – Hunyuan3D-DiT, and a large-scale texture synthesis model – Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio – a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including open-source and closed-source models, in geometry details, condition alignment, texture quality, and more. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2


Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step

  • 作者: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng

  • Date: 2025-01-23

  • Paper link: https://arxiv.org/pdf/2501.13926

Abstract

Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct the generated unsatisfactory image. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT


Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong

  • 作者: Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego

  • Date: 2025-01-16

  • Paper link: https://arxiv.org/pdf/2501.09775

Abstract

One of the most widely used methods to evaluate LLMs are Multiple Choice Question (MCQ) tests. MCQ benchmarks enable the testing of LLM knowledge on almost any topic at scale, as the results can be processed automatically. To help the LLM answer, a few examples called few shots can be included in the prompt. Moreover, the LLM can be asked to answer the question directly with the selected option or to first provide the reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of its response as an indication of the confidence of the LLM in the response. In this paper, we study how the LLM confidence in its answer depends on whether the model has been asked to answer directly or to provide the reasoning before answering. The results of the evaluation of questions on a wide range of topics in seven different models show that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior is due to the reasoning that modifies the probability of the selected answer, as the LLM predicts the answer based on the input question and the reasoning that supports the selection made. Therefore, LLM-estimated probabilities seem to have intrinsic limitations that should be understood in order to use them in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.
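
The "LLM-estimated probability" being measured can be read off from per-option log-probabilities at the answer position. Below is a minimal sketch, assuming a hypothetical evaluation harness that already returns those log-probabilities.

```python
import math

def answer_confidence(option_logprobs: dict, options=("A", "B", "C", "D")) -> dict:
    """Turn per-option log-probabilities at the answer position into confidences.

    option_logprobs : option letter -> model log-probability of emitting that letter
                      at the answer slot (hypothetical output of an eval harness).
    Returns probabilities renormalized over the listed options.
    """
    probs = {o: math.exp(option_logprobs.get(o, float("-inf"))) for o in options}
    total = sum(probs.values())
    return {o: p / total for o, p in probs.items()}

# Comparing "answer directly" vs. "reason first, then answer" for the same question:
direct = answer_confidence({"A": -0.9, "B": -1.2, "C": -2.5, "D": -3.0})
after_cot = answer_confidence({"A": -0.1, "B": -2.8, "C": -4.0, "D": -4.5})
# after_cot concentrates far more probability on the chosen option,
# whether or not that option is actually correct.
```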


Reasoning Language Models: A Blueprint

  • 作者: Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler

  • Date: 2025-01-20

  • Paper link: https://arxiv.org/pdf/2501.11223

Abstract

Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have redefined AI's problem-solving capabilities by extending large language models (LLMs) with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures - uniquely combining Reinforcement Learning (RL), search heuristics, and LLMs - present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models and others), and supervision schemes (Output-Based and Process-Based Supervision). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint's versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we outline how RLMs can integrate with a broader LLM ecosystem, including tools and databases. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between "rich AI" and "poor AI" by lowering barriers to RLM development and experimentation.


Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

  • 作者: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji

  • Date: 2025-01-20

  • Paper link: https://arxiv.org/pdf/2501.11733

Abstract

Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents – Perceptor, Operator, Action Reflector, and Notetaker – which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.


VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

  • 作者: Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin

  • Date: 2025-01-16

  • Paper link: https://arxiv.org/pdf/2501.09781

Abstract

This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.


O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

  • 作者: Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12570

Abstract

Recently, long-thought reasoning LLMs, such as OpenAI's O1, adopt extended reasoning processes similar to how humans ponder over complex problems. This reasoning paradigm significantly enhances the model's problem-solving abilities and has achieved promising results. However, the long-thought reasoning process leads to a substantial increase in inference time. A pressing challenge is reducing the inference overhead of long-thought LLMs while ensuring accuracy. In this paper, we experimentally demonstrate that long-thought reasoning models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancies. To address this, we propose Length-Harmonizing Fine-Tuning (O1-Pruner), aiming at minimizing reasoning overhead while maintaining accuracy. This effective fine-tuning method first estimates the LLM's baseline performance through pre-sampling and then uses RL-style fine-tuning to encourage the model to generate shorter reasoning processes under accuracy constraints. This allows the model to achieve efficient reasoning with lower redundancy while maintaining accuracy. Experiments on various mathematical reasoning benchmarks show that O1-Pruner not only significantly reduces inference overhead but also achieves higher accuracy, providing a novel and promising solution to this challenge. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner
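
The RL-style objective can be pictured as rewarding responses that are shorter than the pre-sampled baseline while penalizing accuracy loss; the functional form below is an illustrative assumption, not the paper's exact reward.

```python
def length_harmonizing_reward(is_correct: bool, response_len: int,
                              baseline_acc: float, baseline_len: float,
                              acc_weight: float = 2.0) -> float:
    """Toy reward in the spirit of O1-Pruner's RL-style fine-tuning.

    baseline_acc / baseline_len come from pre-sampling the reference model on the
    same problem; the exact functional form here is an illustrative assumption.
    Being shorter than the baseline is rewarded, losing accuracy is penalized.
    """
    length_gain = (baseline_len - response_len) / max(baseline_len, 1.0)
    accuracy_term = (1.0 if is_correct else 0.0) - baseline_acc
    return length_gain + acc_weight * accuracy_term

# A correct answer at 60% of the baseline length scores well; a wrong short answer does not.
r_good = length_harmonizing_reward(True, 600, baseline_acc=0.7, baseline_len=1000.0)
r_bad = length_harmonizing_reward(False, 400, baseline_acc=0.7, baseline_len=1000.0)
```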

