Amazon AWS Launches Human Benchmark Testing Team to Improve AI Model Evaluation


On December 19, 2024, ZhiYuan Research Institute and Tencent announced at a press conference the launch of LongBench v2, a benchmark that evaluates the deep understanding and reasoning capabilities of large language models (LLMs) in real-world, long-text, multi-task scenarios. The benchmark aims to advance long-text understanding and reasoning and to address the challenges LLMs currently face in practical applications.
Long-context understanding is a key challenge in natural language processing, especially when large language models (LLMs) handle text that exceeds their context window. To address this, researchers developed the LooGLE benchmark to assess how well LLMs understand extremely long documents: 776 articles across multiple domains, averaging 19.3k words each. LooGLE includes 7 tasks covering both short and long dependencies, evaluating model understanding across texts of varying lengths. The test data is drawn from open-access content published after 2022.
The first systematic benchmark for AI agents has been released, with comprehensive evaluation results for 25 language models; GPT-4 stands out from the rest. Top commercial language models perform strongly in complex environments and show significant advantages over open-source models. The research team recommends further improving the learning capabilities of open-source models.
OpenAI's Sam Altman announced that ChatGPT's custom instructions can now be used to turn off em dashes, letting users adjust the AI's responses through settings. He called it a 'small but delightful victory'.
Tencent's Q3 2025 revenue reached 192.8B yuan, up 15% YoY. Its to-business (ToB) segment grew 10% to 58.2B yuan, driven by AI-related cloud services and WeChat Mini Store technology. The Hunyuan AI model leads in rankings, underscoring the success of the company's strategy.