Backends and Ensembling
TokenFuzz treats agent backends as interchangeable runners behind the same probe and triage contract. That gives you two operating modes worth knowing:
- Single backend — reproducible, lowest cost.
- Ensemble of hosted backends — rotated across iterations while keeping each backend's evidence in its own result tree.
This page covers backend selection, how the ensemble cycle works, and how to inspect backend-specific result trees.
Backend options
bin/audit --backend <backend> --target <target-name> [--model <model>]
bin/audit --backend all --target <target-name> # cycle installed hosted backends
How to choose:
- --backend all (or omitting --backend entirely) discovers the hosted CLIs installed locally (claude → codex → gemini order) and cycles between them iteration by iteration. Each backend writes to its own output/<target>/<backend>/results/ tree, so the cycle does not race or overwrite. oss is excluded from the hosted cycle because it needs an explicit --model.
- Use an explicit --backend when you want reproducible runs or cost control. <backend> is one of claude, codex, gemini, or oss. --model overrides the model name for claude and codex; for oss, it is required and must name an already-pulled Ollama model. The gemini backend uses Antigravity CLI (agy) by default; agy has no launch-time model selector, so use its interactive /model command. Set USE_GEMINI_CLI=1 to use Google Gemini CLI (gemini) instead; in that mode --model is forwarded at launch time. The per-backend defaults live in Model selection.
- For --backend oss, --model is required. The harness checks ollama list at startup and fails fast if the model is not already pulled.
Containerised backend shell
bin/audit-container-shell opens a clean shell with the hosted backend
CLIs installed and this repository mounted at /root/work.
See Where to run the audit for what the helper builds, the base-image and gVisor flags, and why a container is the recommended default. The helper does not start an audit automatically.
The shell mounts no host CLI credential directories (~/.claude,
~/.codex, ~/.gemini), so it starts logged out. Log in to the
backend you plan to use, or pass --forward-credentials to forward API
key/token environment variables before launching the helper. With
--forward-credentials, ~/.config/gcloud and a
GOOGLE_APPLICATION_CREDENTIALS file are mounted read-only when present
so Google API-key / ADC flows still work.
# codex login # opens an OAuth URL; codex login status to confirm
# claude # run once and follow the login prompt
# agy # run once and follow the printed OAuth URL
Then verify auth before launching an audit:
# codex login status
# claude -p "Reply exactly: tokenfuzz-claude-auth-ok"
# agy -p "Reply exactly: tokenfuzz-gemini-auth-ok"
codex login status is a local check; the claude/agy checks make one
small model request and print the reply. If a check hangs at a prompt,
press Ctrl+C and finish that backend's login before starting
./bin/audit from /root/work. An in-container login lasts only for
that container session, because the container is disposable (--rm) and
is removed when the shell exits.
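The verification step can also be scripted so that a stuck login prompt fails instead of hanging. The following is a minimal sketch, assuming GNU timeout is available in the container; auth_probe_ok is a hypothetical helper, not part of TokenFuzz, and the 60-second limit is an arbitrary choice.

```shell
# Hypothetical helper: run an auth probe with a time limit and check that
# the reply contains the expected sentinel string.
# Usage: auth_probe_ok <sentinel> <command> [args...]
# Assumption: GNU `timeout` is installed; 60 seconds is an arbitrary limit.
auth_probe_ok() (
  sentinel=$1
  shift
  # stdin comes from /dev/null so an interactive login prompt cannot block.
  reply=$(timeout 60 "$@" 2>&1 </dev/null) || exit 1
  case $reply in
    *"$sentinel"*) exit 0 ;;
    *) exit 1 ;;
  esac
)
```

For example, auth_probe_ok tokenfuzz-claude-auth-ok claude -p "Reply exactly: tokenfuzz-claude-auth-ok" would succeed only if the reply arrives in time and contains the sentinel.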
What changes by backend, what does not
| Changes | Does not change |
|---|---|
| Agent CLI process | Target directory layout |
| Model name and provider | target.toml format |
| Latency and cost | Result and log directory layout |
| Context behaviour | Probe execution contract |
| Tool-calling style | Triage rules |
| Local versus hosted execution | |
This separation is intentional. Backends are interchangeable because the harness owns the audit contract.
Ensemble mode
When --backend is omitted or set to all, bin/audit runs in
hosted ensemble mode. At startup it checks the hosted CLIs in
this order: claude, then codex, then gemini.
Only installed and authenticated CLIs join the cycle. oss is excluded
because it requires an explicit local model choice.
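The discovery step can be approximated from outside the harness. The following is a minimal sketch, assuming the hosted CLIs are the binaries named on this page (claude, codex, and agy or gemini for the gemini backend) and that being on PATH is a reasonable proxy for "installed". The function name is hypothetical, authentication is not checked, and this is not the harness's own code.

```shell
# Hypothetical helper: list hosted backends whose CLI is on PATH,
# in the documented claude -> codex -> gemini order.
# Assumption: "installed" is approximated by `command -v`; auth is NOT checked.
installed_hosted_backends() (
  # The gemini backend uses agy by default, or gemini when USE_GEMINI_CLI=1.
  if [ "${USE_GEMINI_CLI:-0}" = "1" ]; then
    gemini_cli=gemini
  else
    gemini_cli=agy
  fi
  for pair in claude:claude codex:codex "gemini:$gemini_cli"; do
    backend=${pair%%:*}   # text before the colon: backend name
    cli=${pair#*:}        # text after the colon: binary to look for
    if command -v "$cli" >/dev/null 2>&1; then
      printf '%s\n' "$backend"
    fi
  done
)
```

An empty result would suggest that no hosted CLI is visible on PATH, which is worth checking before starting a run without --backend.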
Each audit iteration selects the next configured hosted backend in that order and writes into a separate result tree per backend, output/<target>/<backend>/results/.
That separation keeps backend-local state, logs, scratch inputs, and rate-limit cooldowns from interfering with each other. Cross-backend cluster rollups still aggregate at the target root.
You get per-backend evidence directories and an aggregate view of what the ensemble found, without writing anything custom.
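The rotation and the per-backend tree can be pictured with a few lines of shell. This is an illustration under stated assumptions, not the harness's implementation: both function names are hypothetical, selection is assumed to be plain round-robin over the backends that joined the cycle, and the path follows the output/<target>/<backend>/results/ layout described on this page.

```shell
# Hypothetical sketch: pick the backend for a 0-based iteration number by
# round-robin over the backends given as the remaining arguments.
# Usage: next_backend <iteration> <backend>...
next_backend() (
  iteration=$1
  shift
  [ "$#" -gt 0 ] || exit 1              # no backend joined the cycle
  shift "$(( $iteration % $# ))"        # drop the backends before the chosen one
  printf '%s\n' "$1"
)

# Hypothetical sketch: the per-backend result tree for a target.
# Usage: result_tree <target> <backend>
result_tree() {
  printf 'output/%s/%s/results\n' "$1" "$2"
}
```

Under these assumptions, a cycle of claude and gemini would map iterations 0, 1, 2 to claude, gemini, claude, each writing under its own result_tree path.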
What ensemble mode is for
- Operational rotation. Rotation spreads work across configured hosted providers. That can be useful when one provider is rate-limited or temporarily degraded.
- Separate evidence trees. Each backend writes its own results, logs, scratch inputs, and rejected indexes. The target-level rollups still give you one place to review accepted artifacts.
When a single backend is better
- Cost-controlled long runs. A single backend with a known per-token price is easier to budget.
- Reproducibility in published research. Pin both --backend and --model so the run can be replayed.
- Source-sensitivity policies. Use --backend oss to keep target source on the local machine.
Inspecting backend results
When you inspect backend-specific output:
- Record the target source revision.
- Record the target.toml used for the run.
- Run a bounded session per backend when you need isolated result trees.
- Inspect each result tree separately:
for B in claude codex gemini oss; do
R="output/<target>/$B/results"
echo "== $B =="
ls "$R/crashes" "$R/findings" 2>/dev/null | head
done
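For a quick side-by-side comparison, the same walk can print counts instead of listings. The following is a rough sketch with a hypothetical function name; it assumes only the crashes/ and findings/ directories used in the loop above, and it counts directory entries, which may not match the number of clusters.

```shell
# Hypothetical summary: count directory entries under crashes/ and findings/
# for each backend tree below a target's output directory.
# Usage: summarize_results output/<target>
# Assumption: entry counts are a rough signal only, not a cluster count.
summarize_results() (
  target_root=$1
  for backend in claude codex gemini oss; do
    results="$target_root/$backend/results"
    [ -d "$results" ] || continue       # skip backends that never ran
    crashes=$(ls -A "$results/crashes" 2>/dev/null | wc -l | tr -d ' ')
    findings=$(ls -A "$results/findings" 2>/dev/null | wc -l | tr -d ' ')
    printf '%s crashes=%s findings=%s\n' "$backend" "$crashes" "$findings"
  done
)
```

Backends without a result tree are skipped, so the output lists only the backends that actually produced one.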
The useful review artifacts are:
- confirmed crashes;
- accepted findings;
- rejected-index quality.
Token usage and tool counts are recorded when the backend's log format
exposes them. They are operational signals — the artifact quality
still comes from the cluster tables in crashes/ and findings/. See
Cost model for which
numbers to watch.
Local models through Ollama
The local backend runs Codex in OSS mode against an Ollama-hosted model. Before starting a long session, check that Ollama is running and the model is available, for example with ollama list.
The harness checks this at startup for --backend oss and fails fast
if the model is not already listed. That avoids letting Codex trigger
a long implicit download mid-run.
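The same kind of preflight can be run by hand before launching. The following is a minimal sketch, assuming the usual ollama list table with a header row and the model name in the first column; ollama_model_pulled is a hypothetical helper, and the harness's own matching rule may differ.

```shell
# Hypothetical preflight: succeed only if the named model appears in the
# NAME column of `ollama list`. Pass the name exactly as listed, tag included.
# Assumption: first line is a header and column 1 holds the model name.
ollama_model_pulled() (
  model=$1
  ollama list | awk 'NR > 1 { print $1 }' | grep -Fxq -- "$model"
)
```

The function also fails when Ollama is not running or not installed, since the listing is then empty. It could gate a launch such as bin/audit --backend oss --target <target-name> --model <model>.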
Local models are useful for:
- broad experimentation;
- source-sensitive targets;
- running 24/7 without provider rate limits.
They may need shorter task scopes and tighter iteration limits than hosted models.
Backend hygiene
- Authenticate the CLI outside the harness first.
- Prefer explicit --backend in reproducibility notes.
- Record --model overrides alongside --backend in run notes.
- Keep API spend visible before continuous runs; see Cost model.
- Check logs/ for CLI failures before assuming an agent stalled.
- Review runs by crashes/, findings/, state/runs.jsonl, and rejected indexes, not by transcript style.
Cyber access for hosted runs
If you use a hosted model for legitimate defensive security research, register your organisation and use case ahead of long sessions so the provider has the context to reduce false-positive interruptions. The relevant programs and signup links are documented once, in Prerequisites.