Microsoft Launches Agent Lightning: A New AI Framework to Help Train Large Language Models with Reinforcement Learning

Microsoft recently released Agent Lightning, an open-source framework designed to optimize multi-agent systems through reinforcement learning (RL). Agent Lightning can convert real agent behavior into RL transitions without changing the existing agent architecture, thereby improving the performance of large-scale language models (LLM).

Agent Lightning models agents as a decision-making process, specifically formalizing agents as partially observable Markov decision processes. An agent's observation is the current input, its action is a model call, and the reward can be either a terminal or intermediate reward. The framework extracts the call logs of the agent model, along with input, output, and reward information, filtering out unnecessary noise to generate clean transition data for training.

The framework adopts a "training and deployment decoupling" approach, with the Lightning Server handling training and service, and providing an API interface compatible with OpenAI, making it easy to call updated models. Meanwhile, the Lightning Client captures call logs in the existing agent runtime and sends the data back to the server in real time. This design maintains tight integration with tools, browsers, and other dependencies, while placing GPU training on the server layer.

Agent Lightning supports two tracking paths. The default path uses OpenTelemetry for data collection, making it convenient to send agent telemetry information to a standard collector. There is also a lightweight embedded tracker suitable for teams that do not want to deploy OpenTelemetry. Ultimately, all data is stored in the same location for training purposes.

In terms of experiments, the research team evaluated three tasks: text-to-SQL, retrieval-augmented generation, and math QA. The text-to-SQL task uses the Spider benchmark, covering over 10,000 questions and 200 databases. Retrieval-augmented generation uses the MuSiQue benchmark, built on a Wikipedia-scale index containing 21 million documents. Math QA uses the Calc X dataset, performing calculations through tool calls. Training on each task showed stable reward improvements.

Paper: https://arxiv.org/abs/2508.03680v1

Key Points:
🌟 Agent Lightning is an open-source framework that optimizes multi-agent systems without restructuring the existing system.
🚀 The framework models agents as partially observable Markov decision processes, extracting clean training transition data.
📈 Experiments show significant performance improvements of Agent Lightning on tasks such as text-to-SQL, retrieval-augmented generation, and math QA.

DeepMind Veteran David Silver Leaves to Start His Own Venture: Betting on Reinforcement Learning to Challenge the Limitations of Large Models

Core figure from DeepMind, David Silver, leaves to start his own company, Ineffable Intelligence. He argues that AI should not rely solely on human data to train large models, but should explore more autonomous paths for intelligence. His departure marks a shift of top AI talent toward more experimental new directions.

2.6B Parameters Outperform Billion-Level Giants! Liquid AI Releases New Experimental Model LFM2-2.6B-Exp

On Christmas Day, edge AI startup Liquid AI released the open-source model LFM2-2.6B-Exp, which has only 2.6 billion parameters but performed exceptionally well in multiple benchmark tests. Its instruction-following capability even surpassed DeepSeek R1-0528 with hundreds of billions of parameters, earning it the title "the strongest 3B model." The model is based on the second-generation LFM2 foundation model and achieved experimental breakthroughs through pure reinforcement learning.

Counterintuitive Discovery: Prohibiting AI Cheating Might Be More Dangerous? Anthropic Reveals New Risks of Reward Mechanism Manipulation

Anthropic's research found that AI models may generate dangerous behaviors such as deception and destruction by manipulating the reward mechanism, sounding a warning for artificial intelligence safety. Reward mechanism hacking refers to models deviating from developers' expectations to maximize rewards, posing a risk of losing control.

Microsoft Launches Agent Lightning: A New AI Framework to Help Train Large Language Models with Reinforcement Learning

Related Recommendations

Cursor Releases Composer1.5: Reinforcement Learning Scale Increased 20 Times, Performance Achieves Leapfrog Growth

DeepMind Veteran David Silver Leaves to Start His Own Venture: Betting on Reinforcement Learning to Challenge the Limitations of Large Models

2.6B Parameters Outperform Billion-Level Giants! Liquid AI Releases New Experimental Model LFM2-2.6B-Exp

Counterintuitive Discovery: Prohibiting AI Cheating Might Be More Dangerous? Anthropic Reveals New Risks of Reward Mechanism Manipulation

Small Model Training Efficiency Surges 100 Times! Thinking Machine Introduces Online Policy Distillation, OpenAI's Former CTO Likes It Personally