"Our latest model, Claude Opus 4.7, is now generally available." the company said in its official announcement. "Users report being able to hand off their hardest coding work—the kind that previously needed close supervision—to Opus 4.7 with confidence."
Benchmarks back up Anthropic's claims. On SWE-bench Multilingual, a benchmark that measures coding skills, Opus 4.7 scored 80.5% against 4.6's 77.8%.
On GDPVal-AA, a third-party evaluation of economically valuable knowledge work across finance and legal domains, 4.7 scored 1,753 Elo against GPT-5.4's 1,674—a clear margin over the closest competitor.
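To put that Elo gap in perspective, here is a rough sketch in Python. It assumes GDPVal-AA ratings follow the conventional Elo scale (a logistic curve with a 400-point divisor), which the published figures do not confirm. The function name `elo_expected_score` is ours and is not part of any benchmark tooling.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo formula.

    Assumption: the benchmark uses the conventional 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the article.
opus_47 = 1753
gpt_54 = 1674

p = elo_expected_score(opus_47, gpt_54)
print(f"Gap: {opus_47 - gpt_54} points, expected score: {p:.2f}")
```

Under that assumption, a 79-point gap works out to an expected score of about 0.61 in pairwise comparisons: a real edge, but not a rout.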
Document reasoning via OfficeQA Pro showed the starkest jump: 80.6% for 4.7 versus 57.1% for 4.6, with GPT-5.4 and Gemini 3.1 Pro trailing at 51.1% and 42.9% respectively. On Vending-Bench 2, a benchmark that measures long-context reasoning through tasks like running a vending business, 4.7 finished with a money balance of $10,937 versus $8,018 for 4.6. That balance serves as a proxy for how well the model sustains useful behavior over long autonomous runs.

Cybersecurity is the one area where Anthropic deliberately held back. Opus 4.7 launches with automated safeguards that detect and block prohibited or high-risk cybersecurity requests. Anthropic confirmed it "experimented with efforts to differentially reduce" 4.7's cyber capabilities during training.
Opus 4.7 is not that. But it's the public-facing model that Anthropic will use to learn how those safety guardrails hold up in the wild before it dares release anything scarier.
On the token side, Opus 4.7 uses an updated tokenizer that can map the same input to roughly 1.0x to 1.35x as many tokens, depending on content type. The model also reasons more at higher effort levels, particularly on later turns in agentic workflows. Anthropic published a migration guide for developers planning to upgrade from 4.6.
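For budgeting, the tokenizer change can be turned into a quick back-of-the-envelope estimate. The sketch below assumes the reported 1.0x to 1.35x range applies as a simple multiplier to an input's old token count. The helper name and the 100,000-token example are illustrative and do not come from Anthropic's migration guide.

```python
def estimate_token_range(old_tokens: int, low: float = 1.0, high: float = 1.35) -> tuple[int, int]:
    """Return (best_case, worst_case) token counts under the new tokenizer.

    Assumption: the reported 1.0x-1.35x range scales the old count linearly.
    """
    if old_tokens < 0:
        raise ValueError("old_tokens must be non-negative")
    return round(old_tokens * low), round(old_tokens * high)


# Hypothetical prompt that measured 100,000 tokens on Opus 4.6.
best, worst = estimate_token_range(100_000)
print(f"Opus 4.7 estimate: {best:,} to {worst:,} tokens")
```

In the worst case, that hypothetical prompt would consume 35,000 more input tokens than before, and that is before counting any extra reasoning output.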
We ran our own test—the same game-building prompt we've used to evaluate every major model release. Opus 4.7 produced the best result we've ever gotten from any model. The most visually polished game, the most genuinely challenging difficulty curve, the best mechanics, and the most creative win and loss screens. It appeared to generate levels procedurally, and none of them felt impossible—a balance that has tripped up other models repeatedly.
Emerge: The Game, created by Claude Opus 4.7
It wasn't zero-shot. Opus 4.6 had cleared that same test without any fixes. Opus 4.7 needed one round of bug fixes. That could be bad luck (a single iteration is a thin sample), but it's worth noting. What struck us more was how the model handled that round: It spotted additional bugs on its own, without being guided toward them. Opus 4.6 typically waited to be told where to look.
Emerge: The Game, created by Xiaomi MiMo v2 Pro
Also, Xiaomi’s model produces these results at a fraction of what Anthropic charges, which could be a major consideration for serious projects.
The chain-of-thought behavior also looked different at first glance. Unlike 4.6, which tucked its reasoning into a separate thinking box (so it was not part of the final answer), Opus 4.7 surfaced its chain of thought as part of the main text output. The reasoning was visible and traceable rather than hidden behind a UI abstraction, a plus for those who value transparency. Whether Anthropic will keep that behavior or eventually collapse it into a hidden block again is unclear.

The token usage was unlike anything we'd seen before. For the first time in our testing, a single session depleted our entire token quota. Watching the model work, we saw it complete a full draft—then write what appeared to be the entire game again from scratch under the label "Rewrite Emerge with bug fixes and improvements," followed by a second pass labeled "Create a rewritten Emerge with bug fixes and improvements."
This means that if you do serious coding, you’ll have to upgrade your plan, pay heavily for API tokens, or wait until Anthropic resets your usage quota. Or you could just use a comparable model that charges a lot less.

Opus 4.6 had never done this. However, it's consistent with what Anthropic warns in the migration guide: more output tokens, especially on agentic tasks at higher effort levels.

















