The capability-recall study

This is the strongest quantitative result in the project, measured head-to-head against the tools auditors actually run. The question it answers: does a per-function SBOM give guarantees that scanners and dependency SBOMs cannot? Capa is compared with three real treatments on a corpus of 25 hand-built Python / Capa pairs, and all four are scored on 48 per-function facts against two distinct questions. The decisive finding: under closed-world SBOM semantics, Capa never false-clears a fact, whereas the best dataflow tool false-clears ten. Everything on this page is anchored to the committed study in the compiler repository and is reproducible end to end; the canonical source is evaluation/empirical_study/summary.md.

The question

A Capa program ships a manifest of capability claims: for every function, which authorities (filesystem, network, environment, and the rest) it could possibly reach. An auditor verifies those claims by re-deriving the SBOM from source with the compiler itself, not a second analyser. But a claim is only as useful as what its absence means.

An SBOM is a closed list. What is not listed for a function is implicitly excluded. A consumer reading "this function reaches nothing sensitive" treats that silence as a clearance. That makes the dangerous failure mode not a missing detection in the abstract, but a function the record silently clears while the code can in fact exercise the authority. The study measures exactly that, against the tools that produce such records today.

Method

The corpus is 25 hand-built Python / Capa pairs: 20 covering direct and via-helper authority, plus 5 covering via-dispatch and via-data indirection, where the sink is selected at runtime through a callable or a data table. The unit of measurement is one (python_function, capability) fact, 48 facts in total. Four treatments are scored, each by the same criterion within each question:

Treatment	What it is
T1 dependency / PURL SBOM	Package-granularity SBOM, the Syft / cdxgen style. No per-function granularity at all.
T2 Semgrep	A good-faith pattern heuristic.
T2b CodeQL 2.25.6	Good-faith dataflow analysis (`python-all` 7.1.2), the strongest real tool on the corpus. Direct-fact recall is 36/36 with zero over-attribution, so every dispatcher miss is the tool's limit, not a weak query.
T3 Capa by construction	The capability record the Capa compiler emits, with a sound provably-excluded channel proved in Agda.

The study asks two distinct questions and reports them in separate tables, never collapsed into one number, because a treatment can do well on one and badly on the other. Q1 is positive attribution: does the treatment credit capability C to the named function F? Q2 is false-clearance under closed-world semantics: does the treatment leave a true fact silently cleared? The two are kept apart deliberately, and this is where the honesty of the result lives.
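The two questions can be sketched as two set operations for a tool that only emits positive detections. This is a minimal illustration, not the study's harness: the function names, the `Net` capability label, and the three facts are hypothetical, chosen only to show why Q1 and Q2 are scored separately.

```python
# Hypothetical (function, capability) facts; not taken from the study corpus.
GROUND_TRUTH = {
    ("read_config", "Fs"),
    ("fetch_via_helper", "Net"),
    ("dispatch", "Net"),
}


def q1_attributed(reported, truth):
    """Q1: true facts the treatment credits to the named function."""
    return truth & reported


def q2_false_clearances(reported, truth):
    """Q2 for a positive-only tool: under a closed-world reading,
    every true fact missing from the output reads as cleared."""
    return truth - reported


# Assumed outputs of two positive-only tools on the facts above.
pattern_tool = {("read_config", "Fs")}
dataflow_tool = {("read_config", "Fs"), ("fetch_via_helper", "Net")}

for name, reported in [("pattern", pattern_tool), ("dataflow", dataflow_tool)]:
    print(name,
          "attributed:", len(q1_attributed(reported, GROUND_TRUTH)),
          "false-cleared:", len(q2_false_clearances(reported, GROUND_TRUTH)))
```

For a positive-only tool the two numbers always sum to the number of true facts, which is why such a tool cannot do well on Q2 without also attributing every fact on Q1.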

The decisive result: false-clearance (Q2)

A treatment commits a false-clearance for a true fact (F, C) when it gives the consumer no way to know F can exercise C. Under closed-world SBOM semantics, absence equals exclusion, so a silently blank function reads as cleared. Lower is better.

Treatment	False-clearances (of 48)	What it clears
T1 dependency / PURL SBOM	48 / 48	clears every function (no per-function granularity at all)
T2 Semgrep (pattern heuristic)	12 / 48	clears every fact it cannot see (absence = exclusion)
T2b CodeQL 2.25.6 (dataflow)	10 / 48	clears the ten dispatcher facts it cannot resolve
T3 Capa by construction	0 / 48	clears nothing it has not soundly proved absent

This is the real argument for Capa, and adding the best real dataflow tool sharpens it rather than dulling it. CodeQL false-clears 10/48: the ten via-dispatch / via-data dispatcher facts. It beats Semgrep by exactly the two via-helper facts (it follows the local call edge Semgrep cannot), but on the dispatchers it leaves the function silently blank, and under closed-world semantics that reads as cleared. CodeQL's native output has no explicit-exclusion field, so absence is the only signal it can give, exactly as for Semgrep.

Capa's manifest gives each (F, C) three states: reachable, provably-excluded (sound, proved in Agda), or not-determined. A false-clearance can only arise from the provably-excluded state, and because that state is sound (used ⊆ declared; used ∩ provably-excluded = ∅), it never contains an axis the function actually exercises. The ten dispatcher facts land in not-determined, not excluded, so Capa clears nothing: 0 false-clearances by construction. The dispatcher functions report an empty provably_excluded list, the honest record that their authority depends on what was registered into the table they receive.

The separation is not a softer scoring rule applied to Capa. A consumer who ignored Capa's exclusion field and read its reachable = [] under closed-world semantics, which is the only reading available for Semgrep and CodeQL, would also false-clear all ten dispatchers. What separates Capa is that it offers a sound exclusion channel a consumer can rely on, while both real tools carry only positive detections and no sound way to answer the exclusion question.
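The three-state reading can be sketched as follows. The field names `reachable` and `provably_excluded` follow the record described above; the `FunctionRecord` class, the helper functions, and the `Net` / `Env` labels are assumptions made for illustration and are not the compiler's actual manifest schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionRecord:
    """Hypothetical per-function manifest entry."""
    reachable: frozenset
    provably_excluded: frozenset

    def state(self, cap):
        if cap in self.reachable:
            return "reachable"
        if cap in self.provably_excluded:
            return "provably-excluded"
        return "not-determined"


def false_clears(record, used, cap):
    """Three-state reading: only a used capability listed as excluded is a false-clearance."""
    return cap in used and record.state(cap) == "provably-excluded"


def closed_world_false_clears(record, used, cap):
    """Positive-only reading: anything not listed as reachable reads as cleared."""
    return cap in used and cap not in record.reachable


def is_sound(record, used):
    """used ∩ provably-excluded = ∅"""
    return not (used & record.provably_excluded)


# A dispatcher: empty reachable list, empty provably_excluded list.
dispatcher = FunctionRecord(reachable=frozenset(), provably_excluded=frozenset())
used_at_runtime = frozenset({"Net"})  # assumed: a network handler was registered

print(dispatcher.state("Net"))                                         # not-determined
print(false_clears(dispatcher, used_at_runtime, "Net"))                # False
print(closed_world_false_clears(dispatcher, used_at_runtime, "Net"))   # True
```

The same record yields a false-clearance only when the exclusion field is ignored, which is the point of the paragraph above: the difference lies in the channel the consumer reads, not in the scoring rule.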

The honest result: attribution (Q1)

Does the treatment attribute C to the named function F? Identical criterion for all: C appears in the treatment's output for F, not merely somewhere in the pair.

Treatment	Positive attribution	Recall
T1 dependency / PURL SBOM	0 / 48	0.0 %
T2 Semgrep (pattern heuristic)	36 / 48	75.0 %
T2b CodeQL 2.25.6 (dataflow)	38 / 48	79.2 %
T3 Capa by construction	38 / 48	79.2 %

On positive attribution Capa does not beat the best dataflow tool. It ties it, exactly. CodeQL and Capa both attribute 38/48: the 36 direct facts plus the 2 via-helper facts, and neither attributes any of the 10 dispatcher facts. CodeQL follows the via-helper call edges Semgrep misses (hence 38 versus Semgrep's 36), and Capa carries the same two facts on the caller's type, so the two land on the identical 38.
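The Q1 counts can be rebuilt from the per-regime fact counts quoted above (36 direct, 2 via-helper, 10 dispatcher). This is a small arithmetic sketch; the dictionary names are illustrative and it does not read the study's data files.

```python
# Fact counts per regime, as stated in the text.
REGIME_FACTS = {"direct": 36, "via-helper": 2, "dispatcher": 10}
TOTAL = sum(REGIME_FACTS.values())  # 48

# Regimes each treatment positively attributes, per the Q1 discussion.
ATTRIBUTED_REGIMES = {
    "T1 dependency / PURL SBOM": set(),
    "T2 Semgrep": {"direct"},
    "T2b CodeQL": {"direct", "via-helper"},
    "T3 Capa": {"direct", "via-helper"},
}


def attribution(treatment):
    """Return (hits, recall) for one treatment on Q1."""
    hits = sum(REGIME_FACTS[r] for r in ATTRIBUTED_REGIMES[treatment])
    return hits, f"{hits / TOTAL:.1%}"


for name in ATTRIBUTED_REGIMES:
    hits, recall = attribution(name)
    print(f"{name}: {hits} / {TOTAL} ({recall})")
```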

The crucial honesty point: Capa does not see more than CodeQL on Q1. It does not vouch for which handler a dispatcher runs, so it does not credit the dispatcher with the handler's authority, and neither does CodeQL. The Q1 story is a clean parity at the top between the best dataflow tool and Capa. Capa's advantage is not that it attributes more; it is that it never clears a function incorrectly, which is Q2.

The three regimes

The corpus separates three regimes by how the authority is reached. The columns show Q1 (attributes?) and Q2 (false-clears?) for Semgrep, CodeQL, and Capa.

Regime	T2 Semgrep	T2b CodeQL	T3 Capa
direct	attributes / does not false-clear	attributes / does not false-clear	attributes / does not false-clear
via-helper	misses / false-clears	attributes / does not false-clear	attributes / does not false-clear
via-dispatch / via-data	misses / false-clears	misses / false-clears	misses / does not false-clear (sound)

On direct facts all three attribute and none false-clears.
On via-helper facts Semgrep misses and false-clears; CodeQL attributes the fact (it follows the local call edge) and so does Capa (the helper's authority is on the caller's type). Neither CodeQL nor Capa false-clears. This is the band where dataflow earns its two-fact lead over the pattern heuristic, and it ties Capa exactly.
On via-dispatch / via-data facts neither real tool nor Capa positively attributes the dispatcher, but both real tools false-clear it under closed-world semantics, while Capa reports not-determined and false-clears nothing. This is the crux: the separation is in Q2, not Q1, and it holds against the best dataflow tool, not only the pattern heuristic.
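To make the three regimes concrete, here is a hedged Python sketch of what each shape might look like on the Python side of a pair. These functions are invented for illustration and are not taken from the corpus; the environment authority is used only because it needs no external setup.

```python
import os


def read_env_direct(name):
    # direct: the sink is called in this function's own body
    return os.environ.get(name)


def _lookup(name):
    return os.environ.get(name)


def read_env_via_helper(name):
    # via-helper: the authority is reached through a local call edge
    return _lookup(name)


# via-data: the sink lives in a table built elsewhere
HANDLERS = {"env": _lookup}


def dispatch(table, kind, arg):
    # via-dispatch: the sink is selected at runtime;
    # nothing in this body names the authority it ends up exercising
    return table[kind](arg)
```

The body of `dispatch` contains no reference to `os.environ`, so what it can reach depends entirely on the table it receives. That is the situation in which a positive-only record stays blank for the dispatcher.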

Depth: the same record on two real enterprise programs

The breadth corpus above settles the tool comparison. The depth half reads the real SBOM the Capa build emits (capa --manifest main.capa) on two real enterprise programs that exist only in Capa, to show the richness and scale of the per-function record on real code.

Program	Functions	Provably pure	Provably-excluded facts
capa_paymentguard (PCI / payment core)	70	66 / 70 (94.3 %)	625
capa_claimdesk (claims engine)	213	187 / 213 (87.8 %)	2,295

Between 88 % and 94 % of functions are provably pure, and no sensitive axis in either program is held by more than ~4 % of functions (paymentguard's Fs at 4.3 %, three of seventy, is the worst case). The 625 and 2,295 provably-excluded (function, capability) facts are sound (proved in Agda) and have no counterpart in any dependency SBOM, which names packages and carries zero per-function facts. A dependency SBOM for capa_claimdesk lists six packages; it cannot tell a consumer that exactly two functions reach the network or that 187 functions are provably side-effect-free.
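The percentages in the table and paragraph above follow directly from the function counts. A short check, using only the counts quoted on this page (the helper name `share` is illustrative):

```python
def share(part, total):
    """Percentage of `total` represented by `part`, to one decimal place."""
    return round(100 * part / total, 1)


# (provably pure functions, total functions), as reported in the table.
PROGRAMS = {
    "capa_paymentguard": (66, 70),
    "capa_claimdesk": (187, 213),
}

for name, (pure, total) in PROGRAMS.items():
    print(f"{name}: {pure} / {total} = {share(pure, total)} % provably pure")

# Worst-case sensitive axis: paymentguard's Fs, three of seventy functions.
print(f"paymentguard Fs: {share(3, 70)} %")
```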

Reproducibility

Every number on this page is committed to the compiler repository with a versioned ground truth and a deterministic harness. The CodeQL treatment reads pre-computed facts (CodeQL 2.25.6, python-all 7.1.2) so the harness never invokes the CLI and stays deterministic.

The study, in the repository →

The breadth summary →

`run_study.py` →

`scratch_codeql/REPRODUCE.md` →
