The 26 AI Predictions That Will Define 2026 - Pt.1
AI Is Scaling Faster Than Reality Can Handle. Here’s What Breaks Next.
Top 3 Things in Today’s Latestly AI Edition
1. China isn’t catching up anymore - it’s competing head-to-head, faster and cheaper, and may soon stop playing open.
2. Multimodal AI is converging, but only where users pay today - not where research roadmaps promised.
3. Agents are improving, but autonomy is stalling, forcing the industry to manage AI instead of unleashing it.
The New Standard in Digital Out of Home Advertising
Turn every head toward your brand with LookAdThat's AI powered digital backpacks, carried by trained Brand Ambassadors who engage and convert your audience.
The impact is immediate because the ads are shown exactly where you want them, allowing you to reach your target audience with precision.
LookAdThat also provides fully anonymised insights on campaign performance through a dedicated dashboard that shows how many people saw your ads, their dwell time, and the perceived gender and age range of the audience.
Get a quote to get your business live on digital backpacks 👇👇👇
or Download the media kit here
AI STORY OF THE WEEK
The 26 AI Predictions That Will Define 2026: An Evidence-Based Analysis (Part-I)
Peter Gostev, LM Arena's AI capability analyst, just published 26 predictions for 2026, each framed as a plausible scenario in the 5-60% confidence range. These aren't certainties. They're bounded bets on where acceleration, constraints, and economic pressure may surface next. Here's what the evidence says about each claim.
China's Open-Source Ascendancy: The Quiet Domination
Prediction 1: A Chinese open model leads Web Dev Arena for 1+ months
Current State: As of December 2025, Chinese open models are already competitive at the top of coding leaderboards. DeepSeek V3.2 ranks 16th on the WebDev Arena leaderboard with an Elo score of 1,361, while MiniMax M2.1 (6th place, 1,445 Elo) and GLM-4.7 (7th place, 1,441 Elo) are both Chinese open models in the top 10. Qwen3-Coder-480B ranks 25th at 1,285 Elo.
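For reference, an Elo gap maps directly onto an expected head-to-head win rate. A quick sketch using the standard Elo formula (LM Arena's exact methodology may differ):

```python
def elo_win_prob(r_a: float, r_b: float) -> float:
    """Probability that a model rated r_a beats one rated r_b,
    under the standard Elo model: 1 / (1 + 10^((r_b - r_a) / 400))."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# MiniMax M2.1 (1,445) vs. DeepSeek V3.2 (1,361): an 84-point gap
print(round(elo_win_prob(1445, 1361), 2))  # ~0.62
```

So the 6th-place model is expected to win only about 62% of head-to-head matchups against 16th place; the top of the leaderboard is tightly packed.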
Evidence Supporting the Prediction: Chinese labs released an unprecedented volume of open-source models in 2025. DeepSeek, Alibaba's Qwen, Baidu's Ernie, and Zhipu's GLM series all went open-weight, with DeepSeek V3 trained for just $5.5 million, less than one-tenth of Western lab costs. The ATOM Project data shows total AI model downloads switched from US-dominant to China-dominant during summer 2025. Stanford's 2025 AI Index confirms that while the US produced 40 notable AI models in 2024, Chinese models closed the performance gap to near-parity on major benchmarks.
The Plausibility Case: Chinese labs now release updates at higher velocity than Western counterparts. DeepSeek shipped V3, V3.2, V3.2-Exp, and R1 between November 2024 and September 2025. If this cadence continues, and if one of these iterations specifically targets web development tasks, a 1+ month lead on WebDev Arena becomes a realistic outcome. The prediction sits at roughly 25-35% confidence: plausible but not guaranteed.
Prediction 2: Chinese labs open-source less than 50% of their top models
Current Paradox: This prediction contradicts 2025's trend. DeepSeek, Alibaba, MiniMax, and Zhipu AI all released their flagship models as open-weight in 2025, with engineers explicitly attributing this strategy to DeepSeek's example. Nearly every notable Chinese model released in 2025 was open-source.
Counter-Evidence: Stanford analysis confirms China's open-weight strategy contrasts sharply with US labs (OpenAI, Anthropic, Google), which keep leading models closed. Chinese tech media reports that open-source became the "defining paradigm" for Chinese AI in 2025.
Why It Could Still Happen: If Chinese labs achieve clear frontier leadership, commercial incentives may shift. Alibaba already keeps some Qwen variants proprietary while open-sourcing others. As valuations rise and revenue models mature, the calculus favoring open-source may weaken.
Prediction 3: Chinese labs take #1 spots in both image and video generation for at least 3 months
Current Standing: As of November 2025, Chinese video generation models are competitive but not dominant. In image generation, no single Chinese model holds the #1 spot on LM Arena's vision leaderboard as of January 2026. However, Chinese labs like Tencent (Hunyuan Video), Alibaba (Wan AI 2.2), and Kuaishou released strong video models in 2025.
Evidence of Momentum: A November 2025 competitive analysis noted that Chinese open-source video models like WAN 2.2 and Hunyuan Video were "aggressively competing on speed, accessibility, and cost-efficiency," with quality approaching commercial-grade. Bloomberg reported in November 2025 that while OpenAI, Google, and xAI led advanced benchmarks, Chinese models were closing gaps rapidly. China's video generation ecosystem released more models in 2025 than US labs, with Wan 2.5, Hunyuan Video, and Seedance all shipping between September and December.
Plausibility: Video and image generation favor rapid iteration and massive compute, areas where China excels. Google's Veo 3 release in late 2025 marked the first time in years the US led on a capability launch, suggesting the gap is narrow. If Chinese labs prioritize these modalities with dedicated training runs in Q1 2026, a 3-month lead is feasible.
Media & Multimodality: The Convergence Accelerates
Prediction 4: No diffusion-only image models in top 5 by mid-2026
Technical Context: Diffusion models dominated image generation in 2023-2024, but hybrid architectures emerged in 2025. Understanding AI's Timothy Lee noted in December 2025 that "diffusion models have several key advantages... they're much faster because they generate many tokens at once". However, autoregressive models integrated visual reasoning more naturally into language model architectures.
Current Leaderboard: As of January 2026, LM Arena's vision leaderboard top 5 includes models from Anthropic, OpenAI, and Google. None are pure diffusion architectures, all integrate transformer-based reasoning with visual generation.
Why This Matters: The shift reflects architectural convergence. Pure diffusion models excel at pixel-level quality but struggle with compositional reasoning and long-context visual understanding. Hybrid models that combine diffusion generation with transformer reasoning dominate because they can handle complex prompts requiring logical scene composition.
Prediction 5: Text, video, audio, music, and speech merge into a single model
Precedent: OpenAI's GPT-4o (released May 2024) integrated text, vision, and audio. Gemini 3 (November 2025) features native multimodal reasoning with a 1 million token context window supporting text, images, and video. Anthropic's Claude Opus 4.5 (November 2025) introduced Chrome and Excel integrations, expanding beyond pure text.
The Technical Barrier: Full modality convergence including music generation and synchronized speech requires massive training datasets and architectural innovations. As of December 2025, no single model handles all five modalities at production quality. Music generation (e.g., Suno, Udio) and speech synthesis (ElevenLabs) remain specialized.
Path to Reality: Meta's Llama team and Google DeepMind both published research in 2025 on joint embedding spaces for audio-visual-language tasks. IBM's Kaoutar El Maghraoui predicted in December 2025 that "models will be able to perceive and act in a world much more like a human... bridging language, vision and action, all together". If a major lab dedicates a 1GW training run specifically to multimodal convergence in H1 2026, this becomes possible.
Prediction 6: Proliferation of "edgy" applications - companions, erotica
Market Evidence: AI companion and adult content apps exploded in 2025. Kinkly's November 2025 review listed 46 AI sex apps, including Replika, Character.AI, CrushOn AI, and specialized platforms like DreamGF and Flirton.ai. Merlio's January 2025 guide noted that "AI companion apps have exploded in 2025... some people use them to flirt and roleplay".
User Adoption: Replika reported millions of users by mid-2025, with a significant portion engaging in romantic or NSFW interactions. Character.AI, despite content restrictions, saw widespread use for roleplay scenarios. Open-source models like DeepSeek and Qwen enabled uncensored local deployments, removing platform moderation barriers entirely.
Regulatory and Platform Dynamics: Apple and Google's app stores maintain NSFW content restrictions, but web-based platforms face no such constraints. The proliferation of open-weight models means developers can deploy adult AI experiences without relying on API providers.
Agents: From Demos to Production
Prediction 8: Computer-use agents break through and go mainstream
Current State: Anthropic released computer-use capabilities in Claude Opus 4.5 (November 2025), enabling the model to control Chrome and Excel programmatically. Google's Gemini 3 scored 54.2% on Terminal-Bench 2.0, which tests terminal-based computer operation. However, adoption remains limited to technical users and pilot programs.
Enterprise Reality Check: CityAM's January 2026 report bluntly stated: "2025 was set to be the year of AI agents. It was not". The article noted that "patchy adoption, weak integration and elusive returns kept them stuck in pilot mode". Enterprise buyers remain cautious due to reliability concerns and unclear ROI.
Counter-Evidence: METR's March 2025 report showed AI agents' time horizons doubling every 7 months from 2019-2024, accelerating to every 4 months in 2024-2025. If this trend continues, agents could handle week-long tasks with 80% reliability by March 2027. Gartner predicted that by 2029, agentic AI will autonomously resolve 80% of common customer service issues, cutting operational costs by 30%.
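The METR trend is simple exponential growth, so it can be extrapolated with back-of-the-envelope math. A sketch, assuming a 4-hour task horizon as of March 2025 and a clean, constant doubling time (both simplifications):

```python
import math

def months_to_reach(current_hours: float, target_hours: float,
                    doubling_months: float) -> float:
    """Months until the task horizon reaches target_hours, assuming
    h(t) = current_hours * 2 ** (t / doubling_months)."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months

# From a 4-hour horizon to a 40-hour work week:
print(round(months_to_reach(4, 40, 4), 1))  # ~13.3 months at a 4-month doubling
print(round(months_to_reach(4, 40, 7), 1))  # ~23.3 months at a 7-month doubling
```

Even the slower 7-month doubling rate lands week-long tasks in early 2027, which is consistent with METR's own projection.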
The Middle Ground: "Mainstream" likely means 20-30% of knowledge workers regularly using agent tools (Cursor, Cline, multi-step automation) rather than universal adoption. Given Cursor's $1 billion ARR in 2025, this threshold may already be met among developers. For broader enterprise adoption, 2026 looks like "continued pilots" rather than "mainstream."
Prediction 9: A model productively works for over 48 hours on a real task
Technical Precedent: Claude Opus 4.5 "codes autonomously for 20-30 minutes at a stretch," according to Reddit practitioner reports. METR's research showed models completing tasks that take humans 4 hours as of March 2025. The question is whether models can maintain coherence, context, and goal-direction for 48+ hours.
The Obstacles: Current models lose focus on multi-day tasks due to:
Context window limitations (even 1M tokens don't solve memory management)
Lack of persistent state tracking across sessions
Inability to self-recover from dead ends without human intervention
Path to 48 Hours: Anthropic's Opus 4.5 introduced "file system-like memory that tracks progress and adjusts strategy mid-task". If this capability matures, and if infrastructure emerges to support persistent agent sessions (prediction #10), 48-hour tasks become possible. Early candidates: large-scale refactoring projects, complex data pipeline construction, or multi-step research synthesis.
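Anthropic hasn't published how this memory works; as a rough illustration of the general pattern, an agent loop can checkpoint its progress to disk so a session survives pauses, crashes, and context resets (all names and steps below are hypothetical):

```python
import json
from pathlib import Path

STATE_FILE = Path("agent_state.json")  # illustrative checkpoint location

def load_state() -> dict:
    """Resume from the last checkpoint, or start a fresh task plan."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"completed_steps": [], "pending_steps": ["plan", "refactor", "test"]}

def checkpoint(state: dict) -> None:
    """Persist progress so a later session can pick up mid-task."""
    STATE_FILE.write_text(json.dumps(state))

state = load_state()
while state["pending_steps"]:
    step = state["pending_steps"].pop(0)
    # ... do the actual work for `step` here ...
    state["completed_steps"].append(step)
    checkpoint(state)  # progress survives multi-day pauses
```

The hard part isn't the persistence itself but deciding what to store: a raw transcript blows past context limits, so production systems would need to summarize and prune as they go.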
Prediction 10: Labs create new product surfaces to accommodate long-running agents
Current Gaps: Existing interfaces (ChatGPT, Claude, Gemini) are designed for short sessions. They lack:
Persistent background execution
Progress dashboards for multi-hour tasks
Graceful pause/resume for agent workflows
Cost management for long-running compute
Emerging Solutions: Cursor's success ($1B ARR) demonstrates demand for purpose-built agent interfaces. IBM's Kareem Yusuf predicted in December 2025 that "in 2026, I see agent control planes and multi-agent dashboards becoming real. You'll kick off tasks from one place, and those agents will operate across environments". flobotics' October 2025 analysis noted platforms like Lindy.ai and Gumloop emerging to orchestrate multi-step agent workflows.
What "New Product Surfaces" Means: Likely: dashboard UIs for monitoring agent progress, CLI tools for background execution, and IDE integrations with agent state persistence. This is infrastructure, not consumer apps.
Research & Capabilities: The Gigawatt Era Begins
Prediction 11: First 1 GW models get 50%+ on hardest benchmarks (FrontierMath L4, ARC-AGI-3)
Infrastructure Timeline: Multiple 1 gigawatt (GW) AI training clusters are under construction, with first deployments targeting late 2026. Meta's Prometheus (1 GW) is set to go online in 2026. OpenAI and NVIDIA's partnership aims to deploy 10 GW by late 2026, with the first phase using NVIDIA Vera Rubin platforms. xAI's Colossus 2 in Memphis approaches 1.1 GW with turbines expected operational by Q2 2027.
Current Benchmark Performance: FrontierMath, designed to test expert-level mathematical reasoning, saw leading models (Claude 3.5 Sonnet, o1-preview, GPT-4o, Gemini 1.5 Pro) score under 2% as of November 2024. OpenAI's o3 achieved 25% on FrontierMath in January 2025. ARC-AGI-3, released July 2025, saw frontier models score 0% while humans achieved 100%.
The Scaling Question: If 1 GW clusters train models 5-10x larger than GPT-5 (estimated at 1-2 trillion parameters), will they crack 50% on these benchmarks? Galois's January 2025 analysis noted that o3's jump from 2% to 25% on FrontierMath "suggests that AI mathematics is improving very rapidly". However, ARC-AGI-3's 0% scores reveal fundamental gaps in interactive reasoning that scale alone may not solve.
Realistic Assessment: 50% on FrontierMath L4 is plausible if 1 GW models incorporate RL breakthroughs. 50% on ARC-AGI-3 is unlikely unless architectural innovations (beyond scale) emerge.
Prediction 12: One fundamental issue gets solved
Candidate Problems:
Long-context reliability: Gemini 3's 1M token window exists but degrades quality beyond ~100K tokens in practice
Hallucinations down 90%: OpenAI claims GPT-5.1 has "far fewer" hallucinations, but practitioners report improvement, not elimination
10× data efficiency: DeepSeek's $5.5M training cost suggests efficiency gains, but not 10× vs. previous generation
Why This Is Hard: These problems are interconnected and may not have single solutions. Long-context reliability requires architectural changes (e.g., new attention mechanisms). Hallucinations stem from model uncertainty and training data quality. Data efficiency requires fundamental RL breakthroughs.
Most Likely Candidate: Data efficiency via synthetic data pipelines and curriculum learning. IBM's Peter Staar noted in December 2025 that "people are getting tired of scaling and are looking for new ideas". If a major lab demonstrates 10× data efficiency on a public benchmark, that counts.
Prediction 13: RL in LLMs saturates, new scaling law emerges
Context: Reinforcement learning (RL) drove major capability jumps in 2024-2025, powering models like o1, o3, and DeepSeek R1. However, Understanding AI's prediction analysis noted concerns about "diminishing returns" in both pre-training and post-training RL.
Saturation Signals: If RL scaling saturates, we'd expect:
Flattening benchmark scores despite increased compute
Labs publicly acknowledging RL limits
Research shifting to alternative paradigms (e.g., test-time compute, constitutional AI)
New Scaling Laws: Historically, new laws emerge when existing approaches plateau. Possible directions: sparse mixture-of-experts scaling, multimodal compute scaling, or agent-environment interaction scaling. IBM's El Maghraoui predicted "cooperative model routing" where "smaller models... delegate to the bigger model when needed".
Timing: Late 2026 is realistic for identifying saturation. Articulating a new scaling law requires 6-12 months of empirical validation.
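Saturation of this kind is typically diagnosed by watching the benchmark gain per doubling of compute shrink. An illustrative sketch with synthetic numbers (not real benchmark data):

```python
import math

def slope(xs: list, ys: list) -> float:
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical scores at doubling RL compute (made-up, flattening curve):
compute = [1, 2, 4, 8, 16, 32]        # relative compute
scores = [40, 52, 61, 66, 68, 69]     # benchmark %

log_c = [math.log2(c) for c in compute]
early = slope(log_c[:3], scores[:3])  # points gained per doubling, early on
late = slope(log_c[3:], scores[3:])   # points gained per doubling, at the frontier
print(early, late)  # ~10.5 vs ~1.5: shrinking returns per doubling
```

A collapsing per-doubling slope like this, sustained across labs and benchmarks, is what "RL saturates" would look like in the data.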
We hope you enjoyed this Latestly AI edition.
We’ll come back with the second part soon.
📧 Got an AI tool for us to review or do you want to collaborate?
Send us a message and let us know!
Was this edition forwarded to you? Sign up here