Track record · updated 22 Jun 2026

How well we predict. Published, misses included.

Most planning forecasts are never scored. We score ours in the open. This page shows how closely our evidence engine's approval predictions match what councils actually decide and, just as honestly, where the engine is only modestly better than the base rate. The number we put against a single site is never sold as a verdict; only this aggregate record is published.

Retrospective hold-out: 2,871 decided applications · cut-off 1 Jul 2025 · 32 boroughs · 2022–2026. Prospective ledger opened 22 Jun 2026.

Two numbers, both true. The honest pair.

A prediction can be well-calibrated (when it says 60%, about 60% happen) yet only modestly discriminating (it struggles to separate the eventual winners from the losers on a single site). Ours is exactly that, and we say so.

0.024
Calibration error (ECE)
Low is good. Our stated probabilities track reality closely across the range. See the curve below.
0.67
Discrimination (AUC)
Modest, by design. This is the irreducible ceiling of pre-submission information, confirmed in three separate experiments.
0.225
Brier score
Mean squared error of the probabilistic forecast against the 0/1 outcome (lower is better).
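For readers who want to check figures like these against their own data, here is a minimal sketch of the three metrics in plain Python. It is not the engine's code: the function names, the ten equal-width bins used for ECE, and the pairwise form of AUC are assumptions chosen for illustration.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between predicted and observed rates per bin.

    Assumption: equal-width bins over [0, 1]; the page does not state
    which binning scheme its ECE uses.
    """
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_pred - observed)
    return ece


def auc(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Probability that a random approval outranks a random refusal (ties count half)."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

A forecaster that always quotes the base rate shows why the two headline numbers can diverge: on four cases with two approvals, predicting 0.5 every time gives an ECE of 0 (perfectly calibrated) and an AUC of 0.5 (no discrimination at all).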

In plain terms: anyone who hands you a confident, site-specific approval percentage is overclaiming. The information available before you submit doesn't support it, and here's the evidence, including our own ceiling. What the engine is good for is the gradient and the structure of risk, not a single decimal on one plot.

The calibration curve

[Figure: calibration curve. Horizontal axis: predicted approval probability; vertical axis: actual approval rate; both run from 20% to 100%. The dashed diagonal marks perfect calibration.]

Each dot is a bin of decided applications from the hold-out set; dot size is the number of applications in the bin. The closer the dots sit to the dashed diagonal, the better the calibration. Faded dots are thin-sample bins, shown rather than hidden. Source: time-based hold-out, trained on 9,535 decisions before 1 Jul 2025, tested on 2,871 after.
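The binning behind a curve like this can be sketched as follows. This is an illustration, not the published pipeline: the function name, the ten equal-width bins, and the `thin_threshold` of 30 applications are all hypothetical choices, since the page does not state how many bins it uses or what counts as a thin sample.

```python
from typing import Dict, List, Sequence


def reliability_bins(
    probs: Sequence[float],
    outcomes: Sequence[int],
    n_bins: int = 10,          # assumption: equal-width bins
    thin_threshold: int = 30,  # assumption: bins below this count are "faded"
) -> List[Dict[str, object]]:
    """Return one dict per non-empty bin: the data for one dot on the curve."""
    grouped: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        grouped[idx].append((p, y))

    dots = []
    for members in grouped:
        if not members:
            continue
        count = len(members)
        dots.append({
            "mean_predicted": sum(p for p, _ in members) / count,  # x position
            "observed_rate": sum(y for _, y in members) / count,   # y position
            "count": count,                                        # dot size
            "thin": count < thin_threshold,                        # faded, not hidden
        })
    return dots
```

Thin bins are flagged rather than dropped, which matches the page's choice to show them faded instead of hiding them.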

The prospective ledger: committed before the council decides

A retrospective curve only proves we fit the past. The real test is forward. So when the ledger opened we committed the engine's prediction for every currently undetermined small-site application in the dataset, and timestamped that file so it cannot be quietly rewritten once the decisions land. As councils determine them, each call is scored against the outcome, wins and losses both.

1,670
Predictions committed
Undetermined small sites (≤9 units), awaiting their councils' decisions.
0
Resolved so far
Determinations typically take 60+ weeks. This page fills in as they arrive.
1,670
Still awaiting
Locked predictions whose councils have yet to decide.

Prospective calibration appears here once 30 predictions have resolved. Determinations typically take 60+ weeks, so the forward record opens slowly and on purpose.
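The three counters and the 30-resolution gate can be sketched like this. The `LedgerEntry` structure and `ledger_summary` function are hypothetical, written only to illustrate the rule; the 30-prediction threshold is the one stated above, and the Brier score stands in for whichever prospective metric is eventually reported.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MIN_RESOLVED = 30  # threshold stated on this page


@dataclass(frozen=True)
class LedgerEntry:
    """Hypothetical shape of one committed prediction."""
    application_id: str
    predicted_prob: float
    approved: Optional[bool] = None  # None until the council decides


def ledger_summary(entries: List[LedgerEntry]) -> Dict[str, object]:
    """Count committed/resolved/awaiting; score only once enough have resolved."""
    resolved = [e for e in entries if e.approved is not None]
    summary: Dict[str, object] = {
        "committed": len(entries),
        "resolved": len(resolved),
        "awaiting": len(entries) - len(resolved),
        "brier": None,  # withheld while the resolved sample is too small
    }
    if len(resolved) >= MIN_RESOLVED:
        summary["brier"] = sum(
            (e.predicted_prob - float(e.approved)) ** 2 for e in resolved
        ) / len(resolved)
    return summary
```

With 1,670 entries and none resolved, the summary reports 1,670 committed, 0 resolved, 1,670 awaiting, and no score.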

How the ledger resists tampering (locally, no blockchain)

Each committed file is hashed. Each ledger record contains those file hashes plus the hash of the record before it, and is signed with our Ed25519 key. Any silent edit, deletion, or reordering of a past prediction breaks the link to every record after it, and the signature proves authorship. The public key below lets anyone verify the chain independently. We pin the chain to an outside clock by publishing the head hash and committing the chain to version control.

Public key
a1d803cd2df8b4ad3836e4a15696b7e61a49378d086101f3728ca52a3be903d2
Chain head
1c8462a4b3ad3049b9916e7debd70e422d7ba8a5e7165505a115fe44986ea307
Records
1

Verify the chain:

python3 _scripts/local_attest.py verify

Honest about the method: a local timestamp is self-asserted, so we could in principle re-sign the whole chain against a false clock. The signed chain stops silent tampering; pinning the head to version control and publishing it here is what stops back-dating. It is deliberately simpler and more inspectable than a blockchain anchor, and good enough for what it claims.
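The chaining idea can be illustrated with the standard library alone. This sketch is not `local_attest.py`: it covers only the hash linkage and leaves out the Ed25519 signature step, and the use of SHA-256, the JSON record layout, and every name in it are assumptions made for the example.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumption: placeholder "previous hash" for the first record


def record_hash(file_hashes: List[str], prev_hash: str) -> str:
    """Hash a record's contents together with the previous record's hash."""
    payload = json.dumps(
        {"files": sorted(file_hashes), "prev": prev_hash}, sort_keys=True
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_record(chain: List[Dict[str, object]], file_hashes: List[str]) -> None:
    """Add a record that commits to both its files and the current chain head."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "files": sorted(file_hashes),
        "prev": prev,
        "hash": record_hash(file_hashes, prev),
    })


def verify_chain(chain: List[Dict[str, object]]) -> bool:
    """Recompute every link; any edit, deletion, or reordering fails the check."""
    prev = GENESIS
    for rec in chain:
        if rec["prev"] != prev or rec["hash"] != record_hash(rec["files"], prev):
            return False
        prev = rec["hash"]
    return True
```

On its own, a hash chain can be rebuilt from scratch by whoever holds the files, which is why the real ledger adds a signature and pins the head hash somewhere outside its own control.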

What this record can and cannot tell you

  • It is retrospective until the prospective ledger resolves. Today's calibration comes from a hold-out on past decisions. The forward record is committed but mostly still pending, so treat it as opening, not proven.
  • Base rates are conditioned on submission. The dataset sees applications that were lodged and determined. The sites a developer rejected in due diligence, or never tested because they were hopeless, are invisible, so a real-world "chance of approval across everything you might buy" is lower than any figure here.
  • The per-site probability is internal. We publish the aggregate calibration as an honesty check; we do not sell a single site's percentage as a verdict. The product value is the structure of risk: which boroughs, types and design positions are hard, and why, not a decimal.
  • Planning is non-stationary. A new London Plan, a change of administration, or a policy reform can shift the patterns the model learned. The record is date-stamped for exactly this reason: when calibration drifts, that is the signal of a regime change, not a number to keep trusting.

Method: deterministic logistic model over pre-submission features (borough, site type, PTAL band, conservation status, scale), with a leak-safe design-proximity step; time-based hold-out evaluation. Small sites, ≤9 units. Descriptive planning intelligence, not regulated advice.
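A time-based hold-out of the kind named above can be sketched in a few lines. The record layout (a dict with a `decided_on` date) and the function name are hypothetical, and assigning decisions made on the cut-off date itself to the test set is an assumption; the page says only "before" and "after" 1 Jul 2025.

```python
from datetime import date
from typing import Dict, List, Tuple

CUTOFF = date(2025, 7, 1)  # cut-off date stated on this page


def time_split(
    decisions: List[Dict[str, object]], cutoff: date = CUTOFF
) -> Tuple[List[Dict[str, object]], List[Dict[str, object]]]:
    """Train on decisions before the cut-off, test on those from it onward.

    Splitting by date rather than at random keeps future outcomes out of
    the training set, so the test mimics predicting forward in time.
    """
    train = [d for d in decisions if d["decided_on"] < cutoff]
    test = [d for d in decisions if d["decided_on"] >= cutoff]
    return train, test
```

A random split would let the model see decisions made after some of its test cases, which tends to flatter the scores; the date split avoids that.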