Commands¶
This is the operator command reference. It is organised by the order you reach for things:
- target setup;
- audit execution;
- testcase development;
- mid-run checkpoint;
- result review;
- maintenance.
Examples assume you are at the repository root.
The commands you reach for most:
bin/audit --target <slug> --backend <backend> 1 # one bounded iteration
bin/state --results-dir "$RESULTS" show-recent --agent 1 # mid-run check
open "$RESULTS"/crashes/CRASH-CLUSTERS.html # review results
Everything below expands on those three.
To keep the examples consistent, set the target slug and backend once:
export TARGET=<your-target>
export BACKEND="<backend>" # one of: claude, codex, gemini, oss
# Optional convenience path for inspecting results.
export RESULTS="output/$TARGET/$BACKEND/results"
Target setup¶
bin/setup-target "$TARGET" <repo-url>
bin/setup-target "$TARGET"
bin/setup-target "$TARGET" <repo-url> --ref <branch-or-revision>
bin/setup-target "$TARGET" <repo-url> --repo-type hg
This creates or updates:
Run twice for a normal ASan-first setup:
- Before the sanitizer build, with a repository URL — creates
or updates
targets/<target>/and seedsoutput/<target>/target.toml. - After the sanitizer build, without a repository URL — re-inspects build outputs and fills values that could not be inferred earlier.
With no repo URL or ref, bin/setup-target <target> re-inspects
build-asan/ and refreshes generated fields in target.toml.
Advanced flags:
bin/setup-target <target> <repo-url> --no-update
bin/setup-target <target> --force-config
bin/setup-target <target> --bootstrap
bin/setup-target <target> --no-llm-config
What each flag does:
--no-update— skip the pull when passing a repo URL or ref.--force-config— explicitly regeneratetarget.tomland overwrite LLM-suggested[threat_model]/[s6_peers]sections.--bootstrap— run an optional language build step for non-C/C++ targets (Python C extensions,npm install,composer install, …). See Add a target.--no-llm-config— skip the best-effort threat-model and S6 lookalike project suggestions. Not recommended unless you have a specific reason to stay offline; the LLM-suggested values are usually a better starting point than the conservative defaults.
LLM-backed config helpers can be run explicitly after setup:
bin/suggest-threat-model "$TARGET" --apply --force-config
bin/suggest-peers "$TARGET" --apply --force-config
Use these when the deterministic seed is too conservative and you want
to refresh [threat_model] or [s6_peers] without recreating the rest
of target.toml.
Audit runs¶
bin/audit --backend <backend> --target <target-name> [--model <model>]
bin/audit --target "$TARGET" --backend "$BACKEND" 1
bin/audit --target "$TARGET" --backend "$BACKEND" 10
bin/audit --target "$TARGET" --backend "$BACKEND" --strategy S3 1
Notes:
- The optional final argument limits iterations. Omit it, or pass
0, to run continuously. <backend>is one ofclaude,codex,gemini, oross.- Start with
1to verify target config, backend CLI, results directory, and state writer before committing to a long run. - Use
--modelwith theclaudeorcodexbackend. The harness forwards it to that backend's native model flag. Foross,--modelis required and names an already-pulled Ollama model. Thegeminibackend uses Antigravity CLI (agy) by default, which has no launch-time model selector; useagy's interactive/modelcommand. SetUSE_GEMINI_CLI=1to use Google Gemini CLI instead; then--modelis forwarded togemini.
Results are written under:
# Optional convenience paths for inspecting results.
export RESULTS="output/$TARGET/$BACKEND/results"
export LOGS="output/$TARGET/$BACKEND/logs"
- Use
results/for audit progress and evidence. - Use
logs/when startup, backend authentication, or wrapper behaviour needs debugging.
Supported single backends: claude, codex, gemini, oss.
Use --backend all, or omit --backend, to cycle installed hosted
backends across iterations. Use the same explicit backend when
comparing runs — backend names are part of the output path, so each
writes to its own result tree.
Container shell¶
This opens an interactive Docker shell with the supported backend CLIs
installed and the repository mounted. Use it when you want a reproducible
Linux audit environment or to run the image-mode test suite. Credentials
are not forwarded unless you pass --forward-credentials.
bin/probe — run one testcase¶
bin/probe scratch-1/testcase.html
bin/probe --confirm scratch-1/testcase.html
bin/probe --dry-run scratch-1/testcase.dat
PROBE_SANITIZER=ubsan bin/probe scratch-1/testcase.dat
bin/probe is the execution gate for testcases. It discovers the active
target by walking up from the testcase to
output/<slug>/<backend>/results/.session-env, reads target.toml,
chooses the right browser / JS / generic runner, and saves sanitizer or
runner output beside the testcase as
<testcase>.asan.txt.
Use --dry-run first when checking a new target config or HARNESS:
setup; it prints the resolved mode, sanitizer, output path, and command
without executing the testcase. Use --confirm only after an initial
crash or diagnostic, when you need the 5-run reproducibility check.
Testcase headers provide the audit linkage:
TARGET: path/to/file.c:Function:123
HYPOTHESIS-ID: H1
CATEGORY: bounds
MODE: generic # optional; auto/browser/js/generic/js-diff
HARNESS: harness.c # optional sibling API harness
The default sanitizer is the first entry in [sanitizer].enabled, or
runner when the target is findings-only with enabled = []. Override a
single run with PROBE_SANITIZER=asan|ubsan|msan|tsan|race|runner.
Advanced probe escape hatches are listed in
Environment.
Testcase development¶
These commands help an operator or agent avoid duplicate work, seed a testcase from existing inputs, and inspect probe history without opening large scratch directories.
TARGET_ROOT="targets/$TARGET" RESULTS_DIR="$RESULTS" bin/find-seed <path.ext>[:<Function>] [max_results]
bin/scratch-status "$RESULTS/scratch-1"
RESULTS_DIR="$RESULTS" bin/scratch-search <pattern>
bin/probe-history --results-dir "$RESULTS" scratch-1/testcase.html
bin/probe-history --results-dir "$RESULTS" --hypothesis-id H1
bin/find-seedsearches the target tree for in-tree tests or sample inputs likely to exercise a file or function.bin/scratch-statussummarizes testcase/output pairs in one scratch directory and flags orphan testcases that have not been probed.bin/scratch-searchsearches audit scratch/corpus/crash artifacts for prior attempts, hypothesis IDs, harness names, or symbols.bin/probe-historyreadsstate/runs.jsonland shows prior verdicts for a testcase, hypothesis, card, agent, or verdict.
Coverage helpers are useful when you need to debug why a testcase is missing the intended code:
bin/hits --testcase scratch-1/testcase.html --want <symbol-regex> --mode browser
bin/hits --testcase scratch-1/testcase.js --want <symbol-regex> --mode js
bin/coverage-summary --results-dir "$RESULTS"
bin/hitsruns a testcase against the coverage-instrumented ASan build and reports whether the requested symbol was reached.bin/coverage-summaryaggregates per-agent edge journals into a subsystem coverage summary. It is only useful after coverage data exists for the run.
Capped source-reading wrappers¶
Agents read source through wrappers that cap output size so a single search cannot flood a prompt. They are plain CLI tools, and they are just as useful to an operator poking at a large target tree:
bin/rg-safe <pattern> targets/$TARGET/src # rg with line + byte caps
bin/peek targets/$TARGET/src/parser.c 200 260 # clamped line-range view
bin/peek <pattern> targets/$TARGET/src/file.c # clamped grep-with-context
bin/show-patch <commit> [paths...] # git show with narrow context + caps
Each prints a footer when a cap fires, so truncation is visible rather than silent. The caps are tunable — see Environment.
bin/audit-recon — breadth-first survey of the source tree¶
bin/audit runs this automatically on the first audit of a given
target commit; the results are cached and re-used until the source
changes. You only need to invoke it directly if you want a survey
without launching a deep audit, or to refresh recon on demand.
bin/audit-recon --target "$TARGET" --backend "$BACKEND"
bin/audit-recon --target "$TARGET" --backend "$BACKEND" --timeout 2700
Outputs:
output/$TARGET/$BACKEND/results/recon-findings.md— human read.output/$TARGET/$BACKEND/results/recon-hypotheses.jsonl— the candidate findings (validator votes merged in). Seedsbin/audit's work queue at cold start.
See Recon discovery for the recon shape, validator gate, and auto re-roll on all-AUDIT-CLEAN runs.
For focused strategy smoke validation, pass --strategy S1 through
--strategy S8 to bin/audit, not to bin/audit-recon. Recon has
its own scoping flags (--scope, --path, --concurrency,
--validate, and --no-reroll) and always emits work cards for the
later deep audit.
Mid-run checkpoint¶
bin/state exposes a number of slim views into the structured
state. The single most useful is show-recent — it bundles the
agent's recent claims, hypotheses, and probe runs in one call:
The narrower views are also available:
bin/state --results-dir "$RESULTS" recent-runs --agent 1
bin/state --results-dir "$RESULTS" recent-claims --agent 1
bin/state --results-dir "$RESULTS" recent-notes --agent 1
bin/state --results-dir "$RESULTS" resume --agent 1
bin/state --results-dir "$RESULTS" list-cards
bin/state --results-dir "$RESULTS" list-crashes
bin/state --results-dir "$RESULTS" list-findings
bin/state --results-dir "$RESULTS" explain-queue
bin/state --results-dir "$RESULTS" show-card <id>
bin/state --results-dir "$RESULTS" show-crash <id>
bin/state --results-dir "$RESULTS" show-finding <id>
What each one shows:
show-recent— agent's recent claims + hypotheses + probe runs.recent-runs— one pipe-delimited row per recent testcase verdict.recent-claims/recent-notes— narrower listings of those streams.resume— the compact next-iteration brief for one agent.list-cards/list-crashes/list-findings— slim JSONL listings.explain-queue— why the top-N cards in the queue scored where they did.show-card/show-crash/show-finding— full compact JSON for one item.
Result review¶
Open the generated HTML artifacts in a browser. Start with the highest-level page that exists for the scope you are reviewing:
output/$TARGET/CRASH-CLUSTERS.html
output/$TARGET/FINDING-CLUSTERS.html
$RESULTS/crashes/CRASH-CLUSTERS.html
$RESULTS/findings/FINDING-CLUSTERS.html
$RESULTS/crashes-rejected/INDEX.html
Then follow the linked cluster member to the report:
Use the target-level cluster HTML files when comparing multiple
backends. Use the backend-local cluster HTML files when reviewing one
run. crashes-rejected/INDEX.html explains rejected crash candidates
so the next operator does not repeat already-triaged work.
These commands are for regenerating or enriching artifacts, not for reading the normal review output:
bin/export-repro CRASH-001-1 --slug "$TARGET"
bin/reachability --report "$RESULTS/crashes/CRASH-001-1/" --severity-only
bin/validate-finding --finding "$RESULTS/findings/FIND-001" --target-path "targets/$TARGET" --backend "$BACKEND"
bin/enrich-report "$RESULTS/crashes/CRASH-001-1/report.md" --slug "$TARGET"
bin/find-crash-testcase "$RESULTS/crashes/CRASH-001-1"
bin/cluster-crashes "$RESULTS"
bin/cluster-findings "$RESULTS"
bin/export-reprocreates the maintainer bundle for an accepted crash.bin/reachability --severity-onlyrecomputes severity without external queries. Omit--severity-onlyonly when public caller search is intended. The caller search filters by the target's language (derived from the target tree; override with--language, or--language noneto disable the filter).bin/validate-finding— re-runs the finding substance gate on a candidate report or recon row.bin/enrich-report— inlines source snippets, patch excerpts, and report metadata into a crash/finding report.--source-root,--upstream-url, and--pinned-revoverride the snippet source and source-link rewrites;--quietsuppresses the changed/unchanged line.bin/find-crash-testcase— prints the testcase path selected from a crash directory, useful when an old artifact layout is ambiguous.bin/cluster-crashes/bin/cluster-findings— group reports that share a root cause, write theCRASH-CLUSTERS/FINDING-CLUSTERSsummaries, and stampCluster:lines into each member report. Both run automatically during triage; rerun them after manually editing reports or moving artifacts. Pass a backend results dir for one run, oroutput/<target>for the cross-backend rollup.bin/show-exclusions "$RESULTS"— one read-only view of what was kept vs. excluded and why: active crashes, confirmed findings, rejected candidates with reasons, and fuzz-crash noise.
An empty crashes/ directory is normal after a short smoke run. A
rejected index with clear reasons is useful output too — it tells
the next operator what has already been evaluated.
Always inspect findings/FINDING-CLUSTERS.html as well. A run may
produce a concrete security finding without producing a sanitizer
crash.
Maintenance¶
bash tests/run-tests.sh
bash tests/run-tests.sh --image ubuntu:24.04
bash tests/run-tests.sh --image fedora:latest
bin/docs build
bin/docs serve
# Wipe state, logs, or both for a clean restart on the same target.
bin/cleanup_state --target "$TARGET" --backend "$BACKEND"
bin/cleanup_logs --target "$TARGET" --backend "$BACKEND"
Both cleanup helpers accept --backend NAME for one backend or
--backends a,b,c for a comma-separated set, and --dry-run to
print what would be removed without touching anything. With no
--target they sweep every target under output/, so prefer an
explicit target. cleanup_state preserves every crash, finding,
rejected artifact, corpus seed, and cross-session memory file by
default — it removes the transient queue and scratch state only.
To adjust what survives, --keep <name> (repeatable) protects an
extra directory or file, --keep-only <csv> replaces the default
preserve list outright, and --output-root <path> points both
helpers at a non-default output root.
Run the test suite before merging changes to the harness or to docs that describe its behaviour. Use image mode for Linux portability checks: it mounts the repository in a clean container, installs the baseline dependencies for apt, dnf, or microdnf/yum images, and runs the same suite under Docker (see Prerequisites).
Use bin/docs build for the same strict MkDocs build CI expects, and
bin/docs serve for a local preview.
Benchmarking is a separate operator workflow:
It compares the shipped harness against a direct-prompt baseline under
isolated --experiment output directories. Use it for harness
evaluation, not routine target auditing.
After changing a severity or clustering algorithm, re-derive an existing run's results — severity, crash/finding dedup, and the rollup tables — without re-auditing or any API calls:
With --target, --regenerate reuses --run-id / BENCHMARK_RUNID (or the
most recent run when neither is set). With no --target it re-derives every
run under the bench root — all targets and backends — and rebuilds the full
cross-backend page. Either way it refreshes benchmark-result.{md,html} and
the cluster reports and does not rebuild each crash's export-repro report
bundle. See Benchmarking.
Helpers the harness runs for you¶
The remaining commands in bin/ are invoked by the harness itself.
You rarely run them by hand, but knowing what each one is keeps a
directory listing from being mysterious:
| Command | Invoked by | What it does |
|---|---|---|
bin/auto-build-script |
setup-target --bootstrap |
Converges on a working sanitizer build recipe via an LLM and writes it to targets/<target>/.audit/build.sh. |
bin/auto-repair-target-toml |
triage, after repeated C/C++ harness build failures | Proposes a conservative additive repair to includes / defines / link_libs, with a target.toml.bak.<timestamp> backup. Disable with TARGET_TOML_AUTO_REPAIR=0. |
bin/rank-work |
bin/audit |
Builds and ranks the concrete work cards agents claim. |
bin/patch-cards |
bin/audit |
Builds S1 prior-fix cards from the target's recent fix commits. |
bin/peer-fix-cards |
bin/audit |
Builds S6 cards from the [s6_peers] projects in target.toml. |
bin/run-asan, bin/run-ubsan, bin/run-msan, bin/run-tsan |
bin/probe |
Per-sanitizer single-shot execution wrappers (browser, JS, generic, and fuzz modes). |
bin/run-sanitizer-multi |
bin/probe --confirm, export-repro |
Multi-run normalizer (default 5 runs, SANITIZER_RUNS overrides) that measures the reproduction rate. bin/run-asan-multi is a compatibility shim forwarding to it. |
bin/triage-fuzz-crashes |
triage | Triages libFuzzer artifacts under fuzz-crashes/ into a single fuzz-leads.md agents can pick leads from. |
bin/render-md |
triage, clustering, benchmark | Renders a markdown report or index to its styled .html sibling. |