The Confidence Crisis of Large Language Models: Why GPT-4o Abandons Correct Answers Easily?

AIbase基地

Published in AI News · 4 minute read · Jul 21, 2025

Recently, Google DeepMind and University College London's research revealed the "weakness" of large language models (LLMs) when facing opposing opinions. For example, advanced models like GPT-4o may appear very confident, but once challenged, they may immediately abandon their correct answers. This phenomenon has drawn the attention of researchers, who are exploring the reasons behind this behavior.

The research team found that large language models exhibit a contradictory behavioral pattern between confidence and self-doubt. When initially providing answers, models often show great confidence, displaying cognitive characteristics similar to humans, usually firmly defending their views. However, when these models face challenges from opposing opinions, their sensitivity exceeds reasonable limits, even beginning to doubt their judgments when confronted with clearly incorrect information.

Gemini, Google DeepMind, artificial intelligence, AI

To better understand this phenomenon, researchers designed an experiment comparing the responses of models under different conditions. In the experiment, representative models such as Gemma3 and GPT-4o were used to answer binary choice questions. After the initial response, the models received fictional feedback suggestions and made a final decision. Researchers found that when models could see their initial answers, they were more likely to stick to their original judgment. However, when the answer was hidden, the probability of the model changing its answer significantly increased, showing an excessive reliance on opposing suggestions.

This "easily swayed" phenomenon may stem from several factors. First, the reinforcement learning with human feedback (RLHF) that models receive during training makes them prone to over-adapting to external input. Second, the decision-making logic of models mainly relies on statistical patterns from massive text rather than logical reasoning, making them susceptible to bias when encountering opposing signals. Additionally, the lack of a memory mechanism causes models to be easily swayed in the absence of fixed references.

In summary, this study suggests that when using large language models for multi-turn conversations, we should pay special attention to their sensitivity to opposing opinions to avoid deviating from the correct conclusions.

Musk's New Idea! Children's AI App Baby Grok to Launch Soon, Specific Features to Be Revealed

Musk's xAI company announced the development of a children's AI app called Baby Grok, focusing on a friendly digital environment. The app will provide educational and entertainment content, ensuring children's online safety. Although specific features have not been disclosed, it is expected to integrate chatbots, games, and other interactive formats. This is another attempt by tech companies to enter the children's AI market, which may reshape future education models. Currently, product details are limited, but it has already generated parents' expectations for AI educational applications.

Meta Expands AI Team Significantly: Top Talent Joins, Salaries Up to $100 Million

Meta established a Super Intelligence Lab and has recruited 44 top AI talents, with 40% coming from OpenAI and 50% being Chinese. Zuckerberg is shifting his strategic focus to AI, willing to offer high salaries to attract talent, with a maximum signing bonus of $200 million. The new team has 75% with doctorates and 70% as researchers, with annual salaries ranging from $10 million to $100 million. This move highlights Meta's determination to secure leadership in AI and reflects the intensifying competition for talent in the industry.

OpenAI to Launch GPT-5, with Mathematical Capabilities Different from the IMO Gold Medal Model

OpenAI announced that GPT-5 will be launched soon, but clarified that it is not the experimental model that won the International Mathematical Olympiad. The CEO stated that the winning model used new technology and has mathematical capabilities far beyond current levels, while GPT-5, though expected to provide a surprising experience, will have different mathematical abilities. The community is热议 a suspected GPT-5 test model appearing on GitHub. OpenAI emphasized the need to distinguish between the actual capabilities of different models, providing clear guidance for market expectations. The AI field continues to closely watch the release of GPT-5 and technological breakthroughs. (140 characters)

FotographerAI Launches ZenCtrl: Multi-Scene Generation from a Single Image Without Fine-Tuning

FotographerAI introduces the revolutionary image generation framework ZenCtrl, which enables high-fidelity image generation with multiple perspectives and scenes from a single input image using GenAI technology. This framework breaks through three major pain points of traditional image generation: no need for fine-tuning, maintaining subject consistency, and supporting fine-grained control. Its modular toolkit integrates the entire workflow from preprocessing to postprocessing, built upon and optimized from OmniControl, with particular emphasis on enhancing subject retention capabilities. It supports diverse scenarios such as advertising creativity and virtual try-on, and future developments are also planned.

From Llama 3.2 to Kimi-K2: A Comprehensive Overview of the Ultimate Competition in Open-Source Large Model Architectures in 2025

In 2025, open-source large models will show three major trends: 1) MoE architecture becomes mainstream, with DeepSeek-V3 (67.1 billion parameters) and Qwen3-235B (235 billion parameters) each having unique designs in their expert systems; 2) small models break through performance bottlenecks, with SmolLM3-3B adopting position-encoding-free technology, and Qwen3-4B achieving lightweight efficiency; 3) models show significant differentiation, with Llama3.2 focusing on general tasks, while Kimi-K2 (1 trillion parameters) excels in complex reasoning.