Public AI benchmarks are now confirmed broken — GPT-5.2, Claude Opus 4.5

The Signal

Public AI benchmarks are now confirmed broken — GPT-5.2, Claude Opus 4.5

If your model selection, vendor contracts, or product architecture decisions were based on public leaderboard scores, those decisions are compromised. The companies building proprietary evaluation frameworks (Harvey, Cursor, Anthropic) are opening a structural competitive gap that will compound for years.

Key intelligence

01
AI Benchmark Collapse and the Evaluation Moat
Public AI benchmarks are structurally compromised by training contamination and fail to capture catastrophic agent failure modes, making custom evaluation frameworks the new competitive moat for AI-native companies.
02
AI Value Chain Fracture: Orchestration Eats the Model Layer
Perplexity's 19-model orchestration layer, Alibaba's $0.50/M token pricing, and Anthropic's multi-surface ecosystem strategy collectively confirm that durable AI value is migrating from model performance to workflow integration and trust architecture.
03
Human-AI Collaboration Is Degrading Judgment — Not Enhancing It
A Nature meta-analysis of 106 experiments shows human-AI collaboration performs worse than either alone on judgment tasks, while 78% of knowledge workers have already adopted shadow AI tools without governance — creating an unmanaged organizational risk.
04
Satellite Broadband Enters Commoditization War
SpaceX is giving away $600 terminals and cutting Starlink to $50/month ahead of its summer 2026 IPO and Amazon Leo's U.S. launch, establishing a $50/month price ceiling that will structurally compress terrestrial broadband margins.
05
Software-Driven Manufacturing Reshoring
SendCutSend's software-automated manufacturing model has reached 60% Fortune 500 penetration by competing on 48-hour speed vs. China's 2-3 weeks, validating a reshoring model that doesn't depend on tariffs.

Deep dives

01
The Benchmark Mirage: Your AI Decisions Are Built on Contaminated Data
02
The AI Value Chain Is Fracturing — Pick Your Layer or Get Squeezed
03
The Judgment Trap: Human-AI Collaboration Is Making Your Organization Dumber
04
Satellite Broadband's $50/Month Price Ceiling Is Coming for Terrestrial ISPs

Quick hits

01Update: Anthropic-Pentagon — Anthropic's $14B run-rate vs. $200M lost contract means the principled stand is financially viable; 90 OpenAI employees signed a letter backing Anthropic, suggesting internal fractures at OpenAI over the replacement deal
02Broadcom earnings March 4 will reveal custom silicon trajectory — 74% AI chip revenue growth vs. 28% overall confirms ASICs are becoming hyperscalers' preferred path, potentially bifurcating the AI infrastructure market away from Nvidia dominance
03Taiwan semiconductor concentration risk (90% of high-end chips) now has a credible 2027 threat window — model 6-month, 12-month, and permanent disruption scenarios against your hardware procurement and infrastructure capacity
04Strait of Hormuz disruption from Operation Epic Fury threatens to reverse mortgage rate decline below 6% — stress-test any portfolio with real estate, energy, or rate-sensitive exposure against oil-shock inflation scenario
05SendCutSend's software-driven manufacturing has reached 60% Fortune 500 penetration and 9/10 private rocket firms — 48-hour delivery vs. China's 2-3 weeks validates reshoring economics based on speed, not tariffs
06Anthropic's head of design left Director role at Figma for IC role at Anthropic — signals the traditional design→mock→iterate process is dying as engineers use AI tools (v0, Claude Code) to bypass designers entirely
07GPT-5.2 scores 93.2% on GPQA Diamond where PhD experts score 65% — the verification gap means your quality assurance processes may already be inadequate for validating AI outputs in high-stakes domains

The Bottom Line

The AI industry's measurement infrastructure just broke: frontier models memorized their own benchmarks, behavioral tests reveal catastrophic agent meltdowns under sustained operation, and a Nature meta-analysis of 106 experiments shows human-AI collaboration actually degrades judgment quality. The organizations that build proprietary evaluation frameworks and segment AI deployment by judgment intensity will compound advantages for years — those still navigating by public leaderboard scores are flying blind into an agentic future where the models fail in ways no standard test would predict.

Public AI benchmarks are now confirmed broken — GPT-5.2, Claude Opus 4.5

AI Benchmark Collapse and the Evaluation Moat

AI Value Chain Fracture: Orchestration Eats the Model Layer

Human-AI Collaboration Is Degrading Judgment — Not Enhancing It

Satellite Broadband Enters Commoditization War

Software-Driven Manufacturing Reshoring

The Benchmark Mirage: Your AI Decisions Are Built on Contaminated Data

The AI Value Chain Is Fracturing — Pick Your Layer or Get Squeezed

The Judgment Trap: Human-AI Collaboration Is Making Your Organization Dumber

Satellite Broadband's $50/Month Price Ceiling Is Coming for Terrestrial ISPs