AI Still Can't Beat the On-Call Engineer: Here's Why

By Decrypt
May 19, 2026

"Trillions of dollars are lost each year due to system outages," the researchers write. The benchmark tests whether AI can actually help change that.

“Despite the central role of such question-driven analysis in incident response, it remains unclear whether modern foundation models can reliably answer the kinds of time series questions engineers ask in practice,” the paper reads.

Questions come in three tiers. Tier I: Does an anomaly exist in this chart? Tier II: When did it start, how severe is it, and what type is it?

Tier III, the hardest, requires cross-metric reasoning: Is this chart causing the problem in that other chart? That's where AI falls apart. GPT-5 scores just 47.5% on Tier III questions when measured by F1, a metric that penalizes models for gaming answers by picking the most common class.
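To see why F1 is harder to game than accuracy, here is a minimal sketch in Python. It assumes a macro-averaged F1 (the average of per-class F1 scores), which is one common variant and not necessarily the exact one the benchmark uses. The labels and the "always answer no" strategy are invented toy data, not benchmark results.

```python
def accuracy(y_true, y_pred):
    """Fraction of answers that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed variant)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, imbalanced question set: 8 "no" answers, 2 "yes" answers.
y_true = ["no"] * 8 + ["yes"] * 2
# A model that games the test by always picking the majority class.
always_no = ["no"] * 10

acc = accuracy(y_true, always_no)
f1 = macro_f1(y_true, always_no)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

In this toy case the majority-class strategy looks respectable on accuracy but scores far lower on macro F1, because it never gets the rare class right.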

How every model stacked up

GPT-5 led all existing models at 62.7% accuracy—on a test where random guessing gets 24.5%. Gemini 3 Pro scored 58.1%. Claude Opus 4.6: 54.8%. Claude Sonnet 4.5: 47.2%.

Domain experts scored 72.7% accuracy. Non-domain experts—time series researchers at Datadog without extensive observability experience—still hit 69.7%.

No AI model beat either human baseline.

Image built by Decrypt based on the ARFBench leaderboard CSV

The model that actually topped the full leaderboard was Datadog's own hybrid: Toto—their internal time series forecasting model—combined with Qwen3-VL 32B. Toto-1.0-QA-Experimental scored 63.9% accuracy, edging past GPT-5 while using a fraction of its parameters. On anomaly identification specifically, it outperformed every other model by at least 8.8 percentage points in F1.

A purpose-built domain model, trained on observability data, outperforming a frontier general-purpose system at this specific task is the expected outcome. That's the point.

The most valuable finding isn't which model scored highest.

"We observe substantially different error profiles between leading models and human experts, suggesting that their strengths are complementary," the researchers write. Models hallucinate, miss metadata, and lose domain context. Humans misread precise timestamps and occasionally fail on complex instructions. The mistakes barely overlap.

Simulate a theoretical "Model-Expert Oracle," a perfect judge that always picks the right answer between the AI and the human, and you get 87.2% accuracy and 82.8% F1. That is well above either alone.
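The oracle idea can be sketched in a few lines of Python. This is an illustration under a stated assumption: a perfect judge gets a question right whenever at least one of the two answers is right. The function name and the per-question results below are made up for the example and are not taken from the benchmark.

```python
def oracle_accuracy(model_correct, expert_correct):
    """Accuracy of a perfect judge choosing between two answerers.

    Each argument is a list of booleans, one per question, saying
    whether that answerer got the question right.
    """
    if len(model_correct) != len(expert_correct):
        raise ValueError("both lists must cover the same questions")
    either_right = [m or e for m, e in zip(model_correct, expert_correct)]
    return sum(either_right) / len(either_right)


# Hypothetical results on ten questions (toy data).
model_correct = [True, True, False, False, True,
                 False, True, False, True, True]
expert_correct = [True, False, True, True, False,
                  True, True, False, True, True]

model_acc = sum(model_correct) / len(model_correct)
expert_acc = sum(expert_correct) / len(expert_correct)
oracle_acc = oracle_accuracy(model_correct, expert_correct)
print(f"model={model_acc:.1f} expert={expert_acc:.1f} oracle={oracle_acc:.1f}")
```

The less the two sets of mistakes overlap, the further the oracle score climbs above either individual score, which is the pattern the researchers describe.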
