
After the turmoil of his Alibaba departure, Lin Junyang (林俊旸) publishes his first long essay, revisiting Qwen's technical philosophy and exploring "agentic thinking"

On March 26, as the controversy around his departure earlier in the month was subsiding, Lin Junyang (林俊旸), often described as "Alibaba's youngest P10" and a pivotal figure behind the Qwen large models, published a long post on X titled "From 'Reasoning' Thinking to 'Agentic' Thinking", a systematic account of how he sees AI technical paradigms evolving. The essay is more than a retrospective: it points clearly to where he believes AI competition is really headed, an agentic era that moves beyond comparing individual models and hinges on systems, environments, and coordination.

The essay sketches a clear roadmap of AI capability evolution. Lin defines 2024-2025 as the "reasoning thinking" phase, represented by OpenAI's o1 and DeepSeek-R1, whose core achievement was proving that "thinking" can be trained and delivered as a first-class capability. The essence of this phase was using reinforcement learning (RL) to obtain deterministic feedback in verifiable domains such as math and code, letting models "optimize for correctness rather than plausibility". Behind it, however, lay an enormous infrastructure challenge: reasoning RL evolved from a lightweight fine-tuning add-on into a systems-engineering problem requiring large-scale rollouts and high-throughput verification.

The real difficulties go further still. The second part of the essay examines the practical dilemma of fusing "thinking" and "instruct" modes. That analysis mirrors commercial reality: after Alibaba attempted the fusion in Qwen3, the subsequent 2507 releases shipped Instruct and Thinking as separate lines, because many customers still needed cost-effective, highly controllable instruct behavior for batch workloads.

The essay explicitly proposes "agentic thinking" as the core paradigm for the next generation of AI. This shifts the object of training from the model itself to the model-environment system. At the heart of agentic thinking is "thinking in order to act": it must handle problems a pure reasoning model never faces, deciding when to act, which tools to invoke, how to absorb uncertain environment feedback, how to revise plans after failure, and how to stay coherent across many turns of interaction.

In Lin's view, the edge in the reasoning era came from better RL algorithms and feedback signals; in the agentic era, competitive advantage will be built on better environment design, tighter train-serve integration, and stronger multi-agent engineering. The environment itself becomes a first-class artifact, whose stability, realism, feedback richness, and resistance to exploitation are critical. At the same time, multi-agent organizations, systems composed of planners, domain specialists, and executing sub-agents, will become a core source of intelligence.

The essay can be read as a complete statement of Lin's technical philosophy, systematizing the thinking that guided Qwen's development during his tenure. It may also be a personal manifesto for the future: its emphasis on agent-era infrastructure and environment engineering hints at the startup or research direction he favors next.


Full text (published bilingually, with the Chinese translation by Qwen): From "Reasoning" Thinking to "Agentic" Thinking

The last two years reshaped how we evaluate models and what we expect from them. OpenAI's o1 showed that "thinking" could be a first-class capability, something you train for and expose to users. DeepSeek-R1 proved that reasoning-style post-training could be reproduced and scaled outside the original labs. OpenAI described o1 as a model trained with reinforcement learning to "think before it answers." DeepSeek positioned R1 as an open reasoning model competitive with o1.

That phase mattered. But the first half of 2025 was mostly about reasoning thinking: how to make models spend more inference-time compute, how to train them with stronger rewards, how to expose or control that extra reasoning effort. The question now is what comes next. I believe the answer is agentic thinking: thinking in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world.

1. What the Rise of o1 and R1 Actually Taught Us

The first wave of reasoning models taught us that if we want to scale reinforcement learning in language models, we need feedback signals that are deterministic, stable, and scalable. Math, code, logic, and other verifiable domains became central because rewards in these settings are much stronger than generic preference supervision. They let RL optimize for correctness rather than plausibility. Infrastructure became critical.
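
The point about verifiable domains can be made concrete: unlike a preference score, a math or code reward is computed by checking the output, so it is deterministic and trivially parallelizable. A minimal sketch, with the answer marker and task format invented for illustration:

```python
# Minimal sketch of a verifiable reward as used in reasoning RL:
# the reward checks correctness deterministically instead of asking
# a preference model whether the answer merely "looks good".

def math_reward(completion: str, ground_truth: str) -> float:
    """Exact-match reward for a verifiable math task.

    Extracts the final answer after a fixed marker and compares it to
    the reference. Returns 1.0 for correct, 0.0 otherwise, so the
    signal is deterministic and scales across millions of rollouts.
    """
    marker = "Final answer:"
    if marker not in completion:
        return 0.0  # malformed trace: no credit, no partial plausibility score
    answer = completion.split(marker)[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0


trace = "Let x = 3. Then 2x + 1 = 7. Final answer: 7"
reward = math_reward(trace, "7")  # deterministic: always 1.0 for this trace
```

The same shape generalizes to code (run the tests, reward the pass rate), which is why math and code became the backbone of reasoning RL.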

Once a model is trained to reason through longer trajectories, RL stops being a lightweight add-on to supervised fine-tuning. It becomes a systems problem. You need rollouts at scale, high-throughput verification, stable policy updates, efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story. OpenAI described o1 as a reasoning line trained with RL, and DeepSeek R1 later reinforced that direction by showing how much dedicated algorithmic and infrastructure work reasoning-based RL demands. The first big transition: from scaling pretraining to scaling post-training for reasoning.

2. The Real Problem Was Never Just "Merge Thinking and Instruct"

At the beginning of 2025, many of us on the Qwen team had an ambitious picture in mind. The ideal system would unify thinking and instruct modes. It would support adjustable reasoning effort, similar in spirit to low / medium / high reasoning settings. Better still, it would automatically infer the appropriate amount of reasoning from the prompt and context, so the model could decide when to answer immediately, when to think longer, and when to spend much more computation on a truly difficult problem.
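
The "adjustable effort" idea can be sketched as a policy over compute: an effort setting (or a cheap estimate of prompt difficulty) maps to a token budget for the thinking segment. Everything here, the budgets, the heuristic, the names, is an illustrative assumption, not Qwen3's actual mechanism:

```python
# Illustrative sketch of a "thinking budget" controller: effort levels
# map to token budgets, and an optional heuristic infers a level from
# the prompt. Thresholds and names are invented for illustration.

THINKING_BUDGETS = {"low": 512, "medium": 4096, "high": 32768}

def pick_effort(prompt: str) -> str:
    """Crude difficulty heuristic standing in for a learned router."""
    hard_markers = ("prove", "optimize", "debug", "derive")
    if any(m in prompt.lower() for m in hard_markers):
        return "high"
    return "medium" if len(prompt) > 200 else "low"

def thinking_budget(prompt: str, effort=None) -> int:
    """Max thinking tokens: an explicit effort setting wins, else infer."""
    level = effort if effort in THINKING_BUDGETS else pick_effort(prompt)
    return THINKING_BUDGETS[level]
```

A learned version of `pick_effort` is exactly the "automatically infer the appropriate amount of reasoning" goal described above.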

Conceptually, this was the right direction. Qwen3 was one of the clearest public attempts. It introduced "hybrid thinking modes," supported both thinking and non-thinking behavior in one family, emphasized controllable thinking budgets, and described a four-stage post-training pipeline that explicitly included "thinking mode fusion" after long-CoT cold start and reasoning RL.

But merging is much easier to describe than to execute well. The hard part is data. When people talk about merging thinking and instruct, they often think first about model-side compatibility: can one checkpoint support both modes, can one chat template switch between them, can one serving stack expose the right toggles. The deeper issue is that the data distributions and behavioral objectives of the two modes are substantially different.

We did not get everything right when trying to balance model merging with improving the quality and diversity of post-training data. During that revision process, we also paid close attention to how users were actually engaging with thinking and instruct modes. A strong instruct model is typically rewarded for directness, brevity, formatting compliance, low latency on repetitive, high-volume enterprise tasks such as rewriting, labeling, templated support, structured extraction, and operational QA. A strong thinking model is rewarded for spending more tokens on difficult problems, maintaining coherent intermediate structure, exploring alternative paths, and preserving enough internal computation to meaningfully improve final correctness.

These two behavior profiles pull against each other. If the merged data is not carefully curated, the result is usually mediocre in both directions: the "thinking" behavior becomes noisy, bloated, or insufficiently decisive, while the "instruct" behavior becomes less crisp, less reliable, and more expensive than what commercial users actually want.

Separation remained attractive in practice. Later in 2025, after the initial hybrid framing of Qwen3, the 2507 line shipped distinct Instruct and Thinking updates, including separate 30B and 235B variants. In commercial deployment, a large number of customers still wanted high-throughput, low-cost, highly steerable instruct behavior for batch operations. For those scenarios, merging wasn't obviously a benefit. Separating the lines allowed teams to focus on solving the data and training problems of each mode more cleanly.

Other labs chose the opposite route. Anthropic publicly argued for an integrated model philosophy: Claude 3.7 Sonnet was introduced as a hybrid reasoning model where users could choose ordinary responses or extended thinking, and API users could set a thinking budget. Anthropic explicitly said they believed reasoning should be an integrated capability rather than a separate model. GLM-4.5 also publicly positioned itself as a hybrid reasoning model with both thinking and non-thinking modes, unifying reasoning, coding, and agent capabilities; DeepSeek later moved in a similar direction with V3.1's "Think & Non-Think" hybrid inference.

The key question is whether the merge is organic. If thinking and instruct are merely co-located inside one checkpoint but still behave like two awkwardly stitched personalities, the product experience remains unnatural. A truly successful merge requires a smooth spectrum of reasoning effort. The model should be able to express multiple levels of effort, and ideally choose among them adaptively. GPT-style effort control points toward this: a policy over compute, rather than a binary switch.

3. Why Anthropic's Direction Was a Useful Corrective

Anthropic's public framing around Claude 3.7 and Claude 4 was restrained. They emphasized integrated reasoning, user-controlled thinking budgets, real-world tasks, coding quality, and later the ability to use tools during extended thinking. Claude 3.7 was presented as a hybrid reasoning model with controllable budgets; Claude 4 extended that by allowing reasoning to interleave with tool use, while Anthropic simultaneously emphasized coding, long-running tasks, and agent workflows as primary goals.

Producing a longer reasoning trace doesn't automatically make a model more intelligent. In many cases, excessive visible reasoning signals weak allocation. If the model is trying to reason about everything in the same verbose way, it may be failing to prioritize, failing to compress, or failing to act. Anthropic's trajectory suggested a more disciplined view: thinking should be shaped by the target workload. If the target is coding, then thinking should help with codebase navigation, planning, decomposition, error recovery, and tool orchestration. If the target is agent workflows, then thinking should improve execution quality over long horizons rather than producing impressive intermediate prose.

This emphasis on targeted utility points toward something larger: we are moving from the era of training models to the era of training agents. We made this explicit in the Qwen3 blog, writing that "we are transitioning from an era focused on training models to one centered on training agents," and linking future RL advances to environmental feedback for long-horizon reasoning. An agent is a system that can formulate plans, decide when to act, use tools, perceive environment feedback, revise strategy, and continue over long horizons. It is defined by closed-loop interaction with the world.

4. What "Agentic Thinking" Really Means

Agentic thinking is a different optimization target. Reasoning thinking is usually judged by the quality of internal deliberation before a final answer: can the model solve the theorem, write the proof, produce the correct code, or pass the benchmark. Agentic thinking is about whether the model can keep making progress while interacting with an environment.

The central question shifts from "Can the model think long enough?" to "Can the model think in a way that sustains effective action?" Agentic thinking has to handle several things that pure reasoning models can mostly avoid:

  • Deciding when to stop thinking and take an action
  • Choosing which tool to invoke and in what order
  • Incorporating noisy or partial observations from the environment
  • Revising plans after failures
  • Maintaining coherence across many turns and many tool calls

In short, agentic thinking means a model that reasons through action.

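
The bullet points above describe a closed loop. A minimal ReAct-style sketch, with the model and tools stubbed out (every name here is a placeholder assumption, not any lab's API):

```python
# Minimal sketch of an agentic loop: think, maybe act, observe, revise.
# `model` and `tools` are stubs; in a real system they would be an LLM
# call and tool servers. The loop stops when the policy emits an answer
# or the step budget runs out.

def run_agent(task, model, tools, max_steps=8):
    history = [("task", task)]
    for _ in range(max_steps):
        decision = model(history)          # decide: answer now, or act?
        if decision["kind"] == "answer":
            return decision["content"]     # stop thinking, commit
        tool = tools[decision["tool"]]     # choose which tool to invoke
        try:
            obs = tool(decision["args"])   # possibly noisy/partial feedback
        except Exception as e:
            obs = f"tool failed: {e}"      # failures feed the next revision
        history.append(("observation", obs))
    return "gave up after max_steps"       # bounded horizon


# Toy policy: look the answer up once, then answer with what it saw.
def toy_model(history):
    if history[-1][0] == "observation":
        return {"kind": "answer", "content": history[-1][1]}
    return {"kind": "act", "tool": "search", "args": "capital of France"}

result = run_agent("What is the capital of France?",
                   toy_model, {"search": lambda q: "Paris"})
```

Each of the listed challenges lives in one line of this loop: the stop decision, the tool choice, the observation handling, the failure path, and the growing `history` that must stay coherent.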
5. Why Agentic RL Infrastructure Is Harder

Once the objective shifts from solving benchmark problems to solving interactive tasks, the RL stack changes. The infrastructure used for classical reasoning RL isn't enough. In reasoning RL, you can often treat rollouts as mostly self-contained trajectories with relatively clean evaluators. In agentic RL, the policy is embedded inside a larger harness: tool servers, browsers, terminals, search engines, simulators, execution sandboxes, API layers, memory systems, and orchestration frameworks. The environment is no longer a static verifier; it's part of the training system.

This creates a new systems requirement: training and inference must be more cleanly decoupled. Without that decoupling, rollout throughput collapses. Consider a coding agent that must execute generated code against a live test harness: the inference side stalls waiting for execution feedback, the training side starves for completed trajectories, and the whole pipeline operates far below the GPU utilization you would expect from classical reasoning RL. Adding tool latency, partial observability, and stateful environments amplifies these inefficiencies. The result is that experimentation slows and becomes painful long before you reach the capability levels you are targeting.
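
The decoupling argument can be sketched with a queue between rollout workers and the trainer: workers stall on slow environment feedback, while the trainer only blocks on finished trajectories. This is a toy threading sketch of the shape of the problem, not any lab's actual stack:

```python
# Toy sketch of decoupled rollout generation and training: rollout
# workers stall on (simulated) slow environment feedback, but the
# trainer waits only on a queue of completed trajectories, so slow
# tools throttle one side without serializing the whole pipeline.

import queue
import threading
import time

def rollout_worker(traj_queue, n):
    for i in range(n):
        time.sleep(0.01)                 # simulated tool/executor latency
        traj_queue.put({"id": i, "reward": 1.0})

def trainer(traj_queue, total, batch=4):
    seen = []
    while len(seen) < total:
        seen.append(traj_queue.get())    # blocks only on finished work
        if len(seen) % batch == 0:
            pass                         # a real trainer would update here
    return len(seen)

def run(total=8):
    q = queue.Queue()
    w = threading.Thread(target=rollout_worker, args=(q, total))
    w.start()
    done = trainer(q, total)
    w.join()
    return done
```

Without this separation, every tool call on the inference side becomes a stall on the training side, which is exactly the GPU-utilization collapse described below.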

The environment itself also becomes a first-class research artifact. In the SFT era, we obsessed over data diversity. In the agent era, we should obsess over environment quality: stability, realism, coverage, difficulty, diversity of states, richness of feedback, exploit resistance, and scalability of rollout generation. Environment-building has started to become a real startup category rather than a side project. If the agent is being trained to operate in production-like settings, then the environment is part of the core capability stack.
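
Treating the environment as a first-class artifact implies a stable interface with verifiable feedback. A minimal Gym-style sketch (the reset/step shape is a common convention; the counting task itself is invented for illustration):

```python
# Minimal sketch of an agent-training environment with the properties
# listed in the text: deterministic reset (stability), stateful steps,
# and a verifiable reward. The counting task is invented.

class CountingEnv:
    """Reach a target by incrementing; reward only on exact success."""

    def __init__(self, target: int):
        self.target = target
        self.state = 0

    def reset(self) -> int:
        self.state = 0                # stable, reproducible starting state
        return self.state

    def step(self, action: str):
        if action == "inc":
            self.state += 1
        done = self.state >= self.target
        reward = 1.0 if self.state == self.target else 0.0  # verifiable
        return self.state, reward, done


env = CountingEnv(target=3)
obs = env.reset()
total = 0.0
done = False
while not done:
    obs, r, done = env.step("inc")
    total += r
```

The quality axes in the paragraph above (state diversity, feedback richness, exploit resistance) are properties of this interface's implementation, which is why environment-building is becoming its own discipline.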

6. The Next Frontier Is More Usable Thought

My expectation is that agentic thinking will become the dominant form of thinking. I think it may eventually replace much of the old static-monologue version of reasoning thinking: excessively long, isolated internal traces that try to compensate for lack of interaction by emitting more and more text. Even on very difficult math or coding tasks, a genuinely advanced system should have the right to search, simulate, execute, inspect, verify, and revise. The objective is to solve problems robustly and productively.

The hardest challenge in training such systems is reward hacking. As soon as the model gets meaningful tool access, reward hacking becomes much more dangerous. A model with search might learn to look up answers directly during RL. A coding agent might exploit future information in a repository, misuse logs, or discover shortcuts that invalidate the task. An environment with hidden leaks can make the policy look superhuman while actually training it to cheat. This is where the agent era becomes much more delicate than the reasoning era. Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization. We should expect the next serious research bottlenecks to come from environment design, evaluator robustness, anti-cheating protocols, and more principled interfaces between policy and world. Still, the direction is clear. Tool-enabled thinking is simply more useful than isolated thinking, and has a far better chance of improving real productivity.
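
One standard mitigation for the leakage problem is scoring the policy on hidden held-out checks it never observes during rollout. A toy sketch (the public/hidden split and the tasks are invented for illustration):

```python
# Toy sketch of an anti-hacking evaluator: the agent can peek at the
# public tests during rollout, but reward comes only from hidden
# holdout tests, so memorizing the visible checks scores zero.

def evaluate(candidate_fn, public_tests, hidden_tests):
    """Reward = fraction of *hidden* tests passed; public tests carry
    no reward weight because the agent may have overfit to them."""
    passed = 0
    for args, expected in hidden_tests:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # crashes score zero instead of raising
    return passed / len(hidden_tests)


public_tests = [((2,), 4)]              # visible to the agent
hidden_tests = [((3,), 9), ((5,), 25)]  # reward comes from these

hacked = lambda x: 4                    # memorizes the public test
honest = lambda x: x * x                # actually solves the task
```

Holdouts are only one layer; the harder cases (future information in a repo, leaky logs) need environment-side hygiene rather than evaluator-side tricks.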

Agentic thinking will also mean harness engineering. The core intelligence will increasingly come from how multiple agents are organized: an orchestrator that plans and routes work, specialized agents that act like domain experts, and sub-agents that execute narrower tasks while helping control context, avoid pollution, and preserve separation between different levels of reasoning. The future is a shift from training models to training agents, and from training agents to training systems.
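
The orchestrator-plus-specialists organization can be sketched as routing with isolated contexts. All roles, routing keywords, and agent names here are invented for illustration:

```python
# Toy sketch of the orchestrator pattern: a planner routes each
# subtask to a specialist, and every specialist runs with a fresh
# context so one subtask cannot pollute another's state.

SPECIALISTS = {
    "code": lambda task, ctx: f"patch for: {task}",
    "search": lambda task, ctx: f"findings on: {task}",
}

def route(task: str) -> str:
    """Stand-in for the orchestrator's planning/routing decision."""
    return "code" if "bug" in task or "implement" in task else "search"

def orchestrate(subtasks):
    results = []
    for task in subtasks:
        agent = SPECIALISTS[route(task)]
        ctx = {}                      # fresh, isolated context per sub-agent
        results.append(agent(task, ctx))
    return results
```

The fresh `ctx` per call is the context-isolation point: separation between levels of reasoning is enforced by the harness, not hoped for inside one long prompt.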

Conclusion

The first phase of the reasoning wave established something important: RL on top of language models can produce qualitatively stronger cognition when the feedback signal is reliable and the infrastructure can support it.

The deeper transition is from reasoning thinking to agentic thinking: from thinking longer to thinking in order to act. The core object of training has shifted. It is the model-plus-environment system, or more concretely, the agent and the harness around it. That changes what research artifacts matter most: model architecture and training data, yes, but also environment design, rollout infrastructure, evaluator robustness, and the interfaces through which multiple agents coordinate. It changes what "good thinking" means: the most useful trace for sustaining action under real-world constraints, rather than the longest or most visible one.

It also changes where the competitive edge will come from. In the reasoning era, the edge came from better RL algorithms, stronger feedback signals, and more scalable training pipelines. In the agentic era, the edge will come from better environments, tighter train-serve integration, stronger harness engineering, and the ability to close the loop between a model's decisions and the consequences those decisions produce.

Notice: The content above (including the pictures and videos if any) is uploaded and posted by a user of NetEase Hao, which is a social media platform and only provides information storage services.

Source: TMTPost App (鈦媒體APP), an independent finance and technology media outlet.