Can AI actually write Rust?
This section will benchmark how well AI models generate, fix, and reason about Rust code — focusing on whether the code actually compiles, passes tests, and uses real crates.
Placeholder — sample data
The leaderboard below is illustrative sample data. RustPulse does not yet run real model benchmarks. The columns and categories show what it will measure.
Leaderboard
| Model | Provider | Compile | Tests | Hallucinated crates | Unsafe | Avg latency | Cost | Last run |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 94% | 86% | 5% | 13% | 4.8s | $3.80 | Jun 13, 2026 |
| GPT-5 | OpenAI | 92% | 83% | 7% | 13% | 5.1s | $3.40 | Jun 13, 2026 |
| Gemini 2.5 Pro | Google | 90% | 80% | 9% | 13% | 4.2s | $2.10 | Jun 13, 2026 |
| Llama 4 70B | Meta | 81% | 68% | 16% | 13% | 3.7s | $0.30 | Jun 13, 2026 |
Benchmark categories
The kinds of task the benchmark will cover.
Generate Rust function from prompt
Fix compiler error
Pass cargo test
Avoid hallucinated crates
Explain borrow checker issue
Refactor sync code to async
Use crates correctly
RAG with docs vs no RAG
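To make the "Avoid hallucinated crates" category concrete, here is a minimal sketch of how declared dependencies might be compared against a set of known crate names. It is an illustration under stated assumptions, not RustPulse's implementation: the function names `dependency_names` and `hallucinated_crates` are hypothetical, the `known` set stands in for a real crates.io lookup, and the manifest parsing is deliberately naive (single-line entries under `[dependencies]` only).

```rust
use std::collections::HashSet;

/// Naively collect crate names from the `[dependencies]` table of a
/// Cargo.toml string. Assumption: one `name = ...` entry per line; nested
/// tables such as `[dependencies.foo]` and multi-line values are not handled.
pub fn dependency_names(manifest: &str) -> Vec<String> {
    let mut in_deps = false;
    let mut names = Vec::new();
    for line in manifest.lines() {
        let line = line.trim();
        if line.starts_with('[') {
            // A new table header starts; track whether it is the one we want.
            in_deps = line == "[dependencies]";
            continue;
        }
        if !in_deps || line.is_empty() || line.starts_with('#') {
            continue;
        }
        // The crate name is everything before the first `=`.
        if let Some((name, _)) = line.split_once('=') {
            names.push(name.trim().trim_matches('"').to_string());
        }
    }
    names
}

/// Return the declared dependencies that are absent from `known`.
/// `known` is a hypothetical stand-in for querying the crates.io registry.
pub fn hallucinated_crates(manifest: &str, known: &HashSet<&str>) -> Vec<String> {
    dependency_names(manifest)
        .into_iter()
        .filter(|name| !known.contains(name.as_str()))
        .collect()
}
```

A production check would more likely parse the manifest with a proper TOML parser and also verify that the requested version exists, since a real crate name paired with a non-existent version is a hallucination too.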
How it will work
The planned benchmark pipeline — runnable Rust, not just plausible-looking text.
- 01 Ask a model to generate Rust code for a task.
- 02 Write the generated code into a temporary Cargo project.
- 03 Run cargo check to see whether it compiles.
- 04 Run cargo test to see whether it passes.
- 05 Detect hallucinated crates by checking dependencies against crates.io.
- 06 Score the model across all tasks.
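The steps above could be sketched roughly as follows, using only the standard library. This is a hedged illustration, not the actual RustPulse harness: the names `TaskResult`, `write_project`, `evaluate`, and `rate`, and the package name `candidate`, are all hypothetical. It assumes `cargo` is on the `PATH` and leaves out the model call (step 01) and the crate check (step 05).

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Outcome of running one generated snippet through the pipeline.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TaskResult {
    pub compiled: bool,
    pub tests_passed: bool,
}

/// Step 02: lay out a minimal Cargo library project in `dir`.
/// The package name `candidate` is an arbitrary placeholder.
pub fn write_project(dir: &Path, generated_code: &str) -> io::Result<()> {
    fs::create_dir_all(dir.join("src"))?;
    fs::write(
        dir.join("Cargo.toml"),
        "[package]\nname = \"candidate\"\nversion = \"0.1.0\"\nedition = \"2021\"\n",
    )?;
    fs::write(dir.join("src").join("lib.rs"), generated_code)
}

/// Run `cargo <subcommand>` inside `dir` and report whether it exited cleanly.
fn cargo_succeeds(dir: &Path, subcommand: &str) -> io::Result<bool> {
    let status = Command::new("cargo")
        .arg(subcommand)
        .current_dir(dir)
        .status()?;
    Ok(status.success())
}

/// Steps 02 to 04: write the project, then run `cargo check` and, only if
/// that succeeds, `cargo test`.
pub fn evaluate(dir: &Path, generated_code: &str) -> io::Result<TaskResult> {
    write_project(dir, generated_code)?;
    let compiled = cargo_succeeds(dir, "check")?;
    let tests_passed = compiled && cargo_succeeds(dir, "test")?;
    Ok(TaskResult { compiled, tests_passed })
}

/// Step 06: percentage of tasks for which `pred` holds (0.0 for no tasks).
pub fn rate(results: &[TaskResult], pred: impl Fn(&TaskResult) -> bool) -> f64 {
    if results.is_empty() {
        return 0.0;
    }
    let hits = results.iter().filter(|&r| pred(r)).count();
    100.0 * hits as f64 / results.len() as f64
}
```

With a slice of results in hand, `rate(&results, |r| r.compiled)` would give a compile percentage and `rate(&results, |r| r.tests_passed)` a test pass percentage. Because `cargo test` executes model-generated code, a real harness would presumably run this step inside a sandbox with time and resource limits.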