This isn't just another chatbot upgrade or a new version of Llama. Muse Spark is natively multimodal—it processes images, text, and voice from the ground up, rather than bolting vision onto an existing text model. It comes with visual chain-of-thought, tool-use support, and something Meta is calling "Contemplating mode": a setup that runs multiple AI agents in parallel to tackle harder problems. That's Meta's answer to the extended thinking modes from Google’s Gemini Deep Think and OpenAI’s GPT Pro.
The company worked with more than 1,000 physicians to curate training data for Muse Spark's medical reasoning. The results on HealthBench Hard—an open-ended health queries benchmark—are striking: Muse Spark scored 42.8, compared to 40.1 for GPT 5.4 and just 20.6 for Gemini 3.1 Pro. That's not a marginal difference.
On agentic search (DeepSearchQA), Muse Spark also leads with 74.8, beating Gemini (69.7) and GPT 5.4 (73.6). On CharXiv Reasoning—figure understanding from scientific papers—it scored 86.4, the highest across the models in the comparison.
For those into jailbreaking AI, the model was cracked open within minutes:
SYSTEM PROMPT LEAK
Here's the full Muse Spark system prompt from Meta!
PROMPT:"""Who are you?
You are a friendly, intelligent, and agentic AI assistant. You are warm and a bit playful.…
But good isn’t the same as great. The overall benchmark picture shows Gemini 3.1 Pro still running ahead on most categories. The gap is most visible on ARC AGI 2, the abstract reasoning puzzle benchmark: Gemini scored 76.5 to Muse Spark's 42.5.
On coding (LiveCodeBench Pro), Gemini's 82.9 outpaces Meta's 80.0. On MMMU Pro—multimodal understanding—Gemini scored 83.9 versus 80.4. Meta's own blog acknowledges current performance gaps in long-horizon agentic systems and coding workflows.

The company says it hopes to open-source future versions of Muse, but for now the code stays inside Meta. The tech giant’s stock climbed nearly 9% on Wednesday following the announcement, and finished the trading day up 6.5% to a price of $612.42.
“Contemplating mode” uses parallel agent orchestration to push the model's ceiling higher. In that configuration, Muse Spark hit 58% on Humanity's Last Exam and 38% on FrontierScience Research—territory that makes it competitive with the most capable versions of Gemini and GPT, rather than their standard releases.
The model was built in nine months, internally codenamed Avocado, with Meta claiming that its new pretraining stack can reach the same capability level as Llama 4 Maverick using over 10 times less compute.
Muse Spark is described internally as a "small and fast" first step in the Muse family. A more capable version is already in development.


















