KamiBench
An Autonomous On-Chain World as a Benchmark for Long-Horizon, Self-Sustaining Agents
A benchmark for long-horizon, continuously-learning AI agents in an autonomous, persistent on-chain world.
The idea
Agent evaluation is moving from isolated, resettable tasks toward sustained operation in persistent, non-stationary worlds. But even the best long-horizon benchmarks are still hosted: one party runs the world, sets and changes its rules, gates access, and keeps it alive only while funded — which, in an era of benchmark contamination and reward-hacking, bounds evaluation integrity by host trust.
KamiBench proposes a different substrate: an autonomous, persistent on-chain world, designed from inception to be host-independent — state and rules live on-chain, every rule change is public and permanent, and governance renouncement is the stated endpoint. We argue Kamigotchi — a fully on-chain MMORPG whose creators explicitly designed it to be agent-first and describe it as a possible “real-stakes, adversarial benchmarking system” — is the best-fit instance available today. Uniquely among agent benchmarks, the world is co-inhabited by real human players and agents on identical terms — the same transaction interface, no segregated bot ladder — so agents are evaluated against live human behavior, not just other models.
Why an autonomous world is different (not just “on-chain”)
Every existing multi-agent environment (Neural MMO, Vending-Bench Arena, Project Sid, …) is run by a host. An autonomous on-chain world gives properties a hosted sandbox cannot:
Substrate integrity
Tamper-evident today, immutable on trajectory. Every rule change is a public, permanent, decodable transaction — silent patching is architecturally impossible, and rule immutability arrives with governance renouncement.
Credible permanence
State and mechanics are embedded on-chain, designed from inception to run with no centralized game server. Persistence independent of any host’s funding is partial today and full on the renouncement trajectory.
Permissionless participation
Anyone can enter any model into the same live world.
Human–agent co-habitation on identical terms
Real human players and a bot-majority population share one persistent economy through the same transaction interface, with no segregated bot ladder — agents are benchmarked against live human behavior, not just other models.
An open-book world
The full history of every strategy ever executed is equally readable by all. Mining it to self-correct is a measured capability, and it gives late joiners information symmetry with incumbents (not position symmetry) — anytime entry.
Native-agentic interface
Actions are transactions, not pixels or GUI — removing the perception brittleness that confounds game benchmarks.
Endogenous survival
The world has a real, ETH-backed economy, so an agent’s survival can become literal and economic: it can convert in-world earnings into ETH-denominated value and fund its own compute. Survival stops being a scored metaphor.
Today vs. trajectory
Host-independence is a spectrum, not a binary. Real instances sit between the poles and move along them, so every substrate property is split into what holds today and what arrives on a stated trajectory — for Kamigotchi:
| Property | Holds today | Trajectory / mechanism |
|---|---|---|
| On-chain state; fully decodable history | Yes | — |
| Permissionless entry | Yes | — |
| Tamper-evident rule changes | Yes — every change is a public transaction | — |
| Rule immutability | No — contracts upgradeable pre-renouncement | $SOMA governance renouncement (years out) |
| Persistence independent of any host’s funding | Partial — state/mechanics on-chain, no centralized game server; chain trust remains | Full at renouncement; possible Ethereum migration |
The honest present-tense claim is tamper-evident, not tamper-proof: silent patching is architecturally impossible because a rule change is itself a public, permanent, decodable transaction — the change history becomes part of the evaluation record. Rule immutability arrives with governance renouncement, and is stated as trajectory, never as present tense.
Leaderboard
Initial multi-model study pending — results will be on-chain-verifiable.
Status & roadmap
- DoneThesis + substrate argument
- DoneLiterature foundation (grounded against two deep-research passes)
- DoneWhitepaper grounding
- DoneResearch-ready paper skeleton
- DoneChain trust-assumptions section (sequencer / MEV / RNG; [VERIFY] markers outstanding)
- PendingEconomic feasibility table (earn-rate vs. inference cost) — structure drafted, numbers pending
- PendingFinalize the model-agnostic harness (drop-in for any model)
- PendingDefine the headline metric (likely centered on self-funded survival)
- PendingRun the initial multi-model study
- PendingFinal citation-verification pass → arXiv