A fleet of eight robot arms at Nvidia's GEAR Lab spent the past few weeks teaching themselves to insert pins, seat graphics cards, and cut zip ties. The only humans involved were the ones who wrote the paper afterward.

Coding agents like OpenAI's Codex, Anthropic's Claude Code, and Moonshot's Kimi Code have spent the past year running what researchers call autoresearch: writing code, testing it, and rewriting it again without a person in the loop. That loop has mostly stayed on a screen, where resetting a failed experiment costs nothing. ENPIRE, the system Nvidia's researchers describe in their paper, drags it into the physical world, where resetting an experiment means moving an actual robot arm.
Building the ‘Enpire’

The system splits the work into two stages. In the first, a human walks the agent through building two permanent tools: a reset routine that returns the workspace to a fresh starting position, and a reward function that watches camera footage to score success, basically a referee that never blinks and never takes a lunch break. That setup happens once, then gets reused for every attempt that follows.
Once those tools exist, the agent takes over completely. It searches published research for ideas, picks between training methods like imitation learning, reinforcement learning, or hand-written rules, then rewrites its own code and tests the result on the robot. Nothing in that loop requires a person to watch, which is either liberating or slightly unsettling depending on how you feel about a robot holding scissors unsupervised.
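For readers who think in code, the two-stage design can be sketched roughly as follows. This is a toy illustration, not Nvidia's implementation: every name here (`Candidate`, `reset_workspace`, `reward`, `run_trial`, `autoresearch`) is hypothetical, the "robot" is a random number generator, and the agent picks from a fixed list of candidates rather than rewriting its own code.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str     # e.g. "imitation", "reinforcement", "hand_written"
    skill: float  # hypothetical stand-in for policy quality, 0.0 to 1.0


def reset_workspace() -> None:
    """Stand-in for the reset routine built once in stage one."""
    # A real routine would move the arms and objects back to a start pose.
    return None


def reward(succeeded: bool) -> float:
    """Stand-in for the camera-based reward function: 1.0 on success."""
    return 1.0 if succeeded else 0.0


def run_trial(candidate: Candidate, rng: random.Random) -> bool:
    """Pretend to run the policy on the robot once (assumption: pure chance)."""
    return rng.random() < candidate.skill


def evaluate(candidate: Candidate, trials: int, rng: random.Random) -> float:
    """Reset, attempt, score; repeat and return the average success rate."""
    total = 0.0
    for _ in range(trials):
        reset_workspace()
        total += reward(run_trial(candidate, rng))
    return total / trials


def autoresearch(candidates, trials=50, seed=0):
    """Stage two: try each idea unattended and keep whichever scores best."""
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for candidate in candidates:
        score = evaluate(candidate, trials, rng)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    ideas = [
        Candidate("hand_written", 0.3),
        Candidate("imitation", 0.6),
        Candidate("reinforcement", 0.8),
    ]
    winner, rate = autoresearch(ideas)
    print(winner.name, rate)
```

The point of the sketch is the shape of the loop: because the reset and the reward are both automatic, nothing between one attempt and the next needs a person.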
Nvidia ran the experiment on eight bimanual robot stations, each with its own hardware, computer, and coding agent. The stations trade progress via Git, the same tool coders use to merge code, so a winning idea spreads fleet-wide within minutes.
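The sharing step can be pictured with a few lines of code. This is only a sketch under stated assumptions: it does not call Git, the `Station` class and `sync_fleet` function are hypothetical, and the article does not describe how the real system branches or merges. It simply shows the idea of every station adopting the fleet's best result.

```python
from dataclasses import dataclass


@dataclass
class Station:
    name: str
    best_score: float  # best success rate this station has found so far
    best_commit: str   # hypothetical label for the code version behind it


def sync_fleet(stations):
    """Copy the fleet-wide best result to every station that is behind it."""
    leader = max(stations, key=lambda s: s.best_score)
    for station in stations:
        if station.best_score < leader.best_score:
            station.best_score = leader.best_score
            station.best_commit = leader.best_commit
    return leader


if __name__ == "__main__":
    fleet = [
        Station("station-1", 0.40, "c1"),
        Station("station-2", 0.90, "c2"),
        Station("station-3", 0.70, "c3"),
    ]
    leader = sync_fleet(fleet)
    print(leader.name, [s.best_commit for s in fleet])
```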
Researchers measured the payoff on “Push-T,” a task where a robot slides a T-shaped block into a target zone using only pushes, and pin insertion, where it threads pins into 4-millimeter holes. Scaling from one robot to eight cut the time to master Push-T from roughly five hours to two, and pin insertion from more than 90 minutes to about 40.
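A quick back-of-the-envelope check puts those numbers in perspective. The figures below are the article's rounded values ("roughly five hours," "more than 90 minutes"), so the results are approximate, and the helper names `speedup` and `parallel_efficiency` are introduced here for illustration only.

```python
def speedup(time_one_robot: float, time_many_robots: float) -> float:
    """How many times faster the larger fleet finished."""
    return time_one_robot / time_many_robots


def parallel_efficiency(observed_speedup: float, robots: int) -> float:
    """Share of the ideal speedup (one full unit per robot) actually achieved."""
    return observed_speedup / robots


if __name__ == "__main__":
    robots = 8
    # Push-T: roughly 5 hours on one robot, roughly 2 hours on eight.
    push_t = speedup(5.0, 2.0)
    # Pin insertion: more than 90 minutes down to about 40 minutes.
    pin = speedup(90.0, 40.0)
    print(f"Push-T: {push_t:.2f}x, efficiency {parallel_efficiency(push_t, robots):.0%}")
    print(f"Pin insertion: {pin:.2f}x, efficiency {parallel_efficiency(pin, robots):.0%}")
```

On these rounded figures, eight robots were about 2.5 times faster on Push-T and at least about 2.25 times faster on pin insertion, or roughly 30% of what perfect eight-way scaling would deliver. More robots help, but not in proportion.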

Across the four real-world tasks tested, the agents drove their policies to a 99% success rate, according to the paper. For pin insertion, the agents reached near-perfect reliability faster than a comparable human-in-the-loop method, the kind that still needs someone to show up every morning.
Nvidia's Jim Fan, the GEAR Lab co-lead who directs the company's AI research, called the project an effort to enable AutoResearch in the physical world for the first time. Fan said the team handed the agents a fleet of robots, a GPU allocation, and a token budget, then stepped back and let the robots take over.
The gap between simulation and reality showed up almost immediately. All three coding agents solved Push-T inside a simulator, but two of the three failed once the same task moved onto a physical robot, the paper notes.
Simulators don't have friction problems. Real tables do.
Nvidia also tested ENPIRE inside RoboCasa, a simulated kitchen benchmark that scores robots by success rate on chores like opening cabinets or turning off stoves, mercifully without any risk of burning the place down. There, ENPIRE outperformed both Nvidia's own end-to-end model GR00T and CaP-X, a tool-using agent that skips the autoresearch loop entirely.
