Methodology
How the dataset is built.
15,114 planning applications. 33 London boroughs. Three years. Every refusal reason coded into a 10-category taxonomy (extending to 13) and back-tested before publication.
Dataset scope
The London Small Sites Planning Dataset covers 15,114 planning applications across all 33 London boroughs, from January 2023 to March 2026 — a rolling three-year window that captures the current policy equilibrium. Of these, 11,997 have been determined: 6,774 approved, 5,223 refused. The overall refusal rate is 43.5%, with a 55-point spread between the most permissive borough (Kensington & Chelsea, 15.2% refused) and the strictest (Havering, 70.5% refused) on like-for-like small-site applications.
Every application is classified along nine axes: decision outcome, site type (conversion, demolish-rebuild, end-terrace, mid-terrace, backland, infill, extension), area within the borough, conservation area status, PTAL accessibility level, density, determination time, decision route (committee vs delegated), and case officer.
Sources and ingestion
Primary feed: the Mayor of London’s Planning London Datahub (PLD), queried via its Elasticsearch guest endpoint. Council back-office systems push application validations, decisions, and metadata to the PLD on a daily cadence under the London Planning Data Standard. Perfect Scale’s ingest pipeline pulls each borough by canonical local-planning-authority name, paginating through the full validation history within the analysis window.
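As a sketch of the pagination step, the snippet below builds one Elasticsearch page request per borough using `search_after` for stable, resumable paging. The endpoint URL, index fields (`lpa_name`, `valid_date`), and page size are illustrative assumptions, not the PLD's actual schema.

```python
PLD_URL = "https://planningdata.london.gov.uk/api-guest/applications/_search"  # assumed endpoint, for illustration

def page_query(borough: str, after_sort=None, size: int = 500) -> dict:
    """Build one Elasticsearch page of a borough's validations in the window.
    Field names are illustrative assumptions, not the PLD's real mapping."""
    body = {
        "size": size,
        "query": {"bool": {"filter": [
            {"term": {"lpa_name.keyword": borough}},
            {"range": {"valid_date": {"gte": "2023-01-01", "lte": "2026-03-31"}}},
        ]}},
        # A deterministic sort is what makes search_after pagination resumable.
        "sort": [{"valid_date": "asc"}, {"_id": "asc"}],
    }
    if after_sort:
        body["search_after"] = after_sort  # resume from the last hit's sort values
    return body
```

Each response's final hit supplies the `sort` values passed back as `after_sort` for the next page, so an interrupted pull can restart mid-borough.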
The PLD feed is then augmented by direct harvest of each council’s public planning register for Decision Notices, Officer Reports and Committee minutes. This requires ten different scrapers because London boroughs use ten different portal systems (Idox, Northgate, Council Direct, Arcus/Salesforce, Agile/IEG4, Tascomi/PlaceHub, RBKC Atlas, NECSWS SPA, Aurora and one bespoke build). Each scraper is portal-aware, rate-limited, and resumable.
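The shape of a portal-aware, rate-limited, resumable scraper can be sketched as below. The class name, checkpoint format, and delay value are illustrative; the real scrapers carry portal-specific fetch logic where the commented hook sits.

```python
import json
import time
from pathlib import Path

class PortalScraper:
    """Minimal sketch of a rate-limited, resumable harvester for one portal.
    Checkpoint layout and method names are illustrative, not production code."""

    def __init__(self, portal: str, delay_s: float = 2.0, state_dir: str = "."):
        self.portal = portal
        self.delay_s = delay_s  # politeness delay between requests
        self.state = Path(state_dir) / f"{portal}.checkpoint.json"

    def done(self) -> set:
        """Application refs already harvested, persisted across runs."""
        if self.state.exists():
            return set(json.loads(self.state.read_text()))
        return set()

    def harvest(self, refs: list) -> list:
        seen = self.done()
        fetched = []
        for ref in refs:
            if ref in seen:
                continue  # resumable: skip anything already on disk
            time.sleep(self.delay_s)  # rate limit per portal
            # self.fetch_decision_notice(ref)  # portal-specific request goes here
            fetched.append(ref)
            seen.add(ref)
            self.state.write_text(json.dumps(sorted(seen)))  # checkpoint after each ref
        return fetched
```

Checkpointing after every document means a crash or ban mid-run costs at most one request on restart.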
NLP extraction and the refusal-reason taxonomy
Decision Notices and Officer Reports are parsed for structured refusal reasons. Each numbered reason is extracted as raw text, then classified into a 10-category taxonomy currently in production: Design & Character (DES), Neighbour Amenity (AMN), Daylight/Sunlight (DLT), Space Standards (SPC), Heritage & Conservation (HER), Transport & Parking (TRN), Policy Non-compliance (POL), Infrastructure & Sustainability (INF), Flood Risk (FLD), and Other (OTH). Three further categories — Insufficient Information (INS), Permitted Development Non-compliance (PDD), and Loss of Use / Community (LUC) — are being introduced in the next quarterly refresh to disaggregate patterns previously bundled under OTH. Categories are not mutually exclusive: a single reason can match both Design and Amenity, which is the most common pairing across the dataset.
Each reason carries a primary and secondary code, the source document, and the original text so any classification can be audited back to the council’s own words. Total reasons classified to date: over 18,000 across more than 10,000 source documents.
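The record shape and multi-label assignment described above can be sketched as follows. The keyword cues stand in for the production NLP model and cover only a few of the ten codes; the dataclass fields mirror the audit trail described in the text.

```python
from dataclasses import dataclass
from typing import Optional

# Keyword cues per category — illustrative fragments, not the production classifier.
CUES = {
    "DES": ["character", "design", "appearance", "scale", "massing"],
    "AMN": ["amenity", "overlooking", "privacy", "outlook"],
    "DLT": ["daylight", "sunlight", "overshadowing"],
    "HER": ["heritage", "conservation area", "listed building"],
    "TRN": ["parking", "highway", "transport"],
}

@dataclass
class RefusalReason:
    source_doc: str            # Decision Notice or Officer Report it came from
    text: str                  # the council's original wording, kept for audit
    primary: str
    secondary: Optional[str]

def classify(source_doc: str, text: str) -> RefusalReason:
    """Assign primary/secondary codes, falling back to OTH. Categories are not
    mutually exclusive, so one reason can hit both DES and AMN."""
    lowered = text.lower()
    hits = [code for code, cues in CUES.items() if any(c in lowered for c in cues)]
    return RefusalReason(
        source_doc=source_doc,
        text=text,
        primary=hits[0] if hits else "OTH",
        secondary=hits[1] if len(hits) > 1 else None,
    )
```

Keeping the raw text on the record is what makes every code auditable back to the council's own words.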
Decision routes and officer overturns
Every decision is tagged with its route — committee or delegated — and, where an officer report exists, the officer’s recommendation is captured separately from the committee’s outcome. This lets the dataset surface overturn rates: how often a committee diverges from the officer it asked to write the report, broken down by borough, site type and area. This overturn signal is one of the few in planning data that requires document-level extraction rather than transactional records alone, and it is among the findings most asked about by developers preparing a marginal application.
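The overturn calculation itself is simple once both fields exist on a record. A sketch, with an illustrative record shape (`route`, `officer_rec`, `outcome`):

```python
def overturn_rate(decisions: list) -> float:
    """Share of committee decisions that diverge from the officer's
    recommendation. Record keys are illustrative, not the dataset schema."""
    committee = [
        d for d in decisions
        if d["route"] == "committee" and d.get("officer_rec")  # need both sides
    ]
    if not committee:
        return 0.0
    overturned = sum(1 for d in committee if d["outcome"] != d["officer_rec"])
    return overturned / len(committee)
```

Grouping the input by borough, site type, or area before calling this gives the breakdowns described above.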
Modelling and back-testing
Two model classes sit on the dataset. The first is a density model that estimates the approvable unit count for a given site type, area, and conservation context. Across the borough dashboards, the density model predicts within ±1 unit of the actual approved scheme 73% of the time, and within ±2 units 90% of the time, with no systematic tendency to over- or under-predict. Every borough dashboard publishes the model’s back-test before going live.
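The back-test metrics quoted above reduce to three numbers per borough. A sketch of their computation, assuming paired lists of predicted and actual approved unit counts:

```python
def backtest_density(pred: list, actual: list) -> dict:
    """Share of predictions within ±1 and ±2 units of the approved scheme,
    plus the mean signed error as a check for systematic bias."""
    n = len(pred)
    errs = [p - a for p, a in zip(pred, actual)]
    return {
        "within_1": sum(abs(e) <= 1 for e in errs) / n,
        "within_2": sum(abs(e) <= 2 for e in errs) / n,
        "mean_signed_error": sum(errs) / n,  # near zero means no directional bias
    }
```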
The second is a refusal-probability model that combines area, site type, density, conservation status, and PTAL band with the empirical refusal rates for each cohort. Probabilities are surfaced as cohort comparisons (e.g., “your scheme sits in a cohort that approves at 41% vs the borough average of 56%”) rather than as point estimates without context.
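A cohort comparison of this kind is an empirical filter-and-count rather than a fitted curve. A minimal sketch, with illustrative application fields:

```python
def cohort_comparison(apps: list, cohort: dict) -> dict:
    """Approval rate for the applications matching a cohort definition vs the
    borough-wide rate. Field names (site_type, decision) are illustrative."""
    def rate(rows):
        if not rows:
            return None  # no point estimate without any determined applications
        return sum(r["decision"] == "approved" for r in rows) / len(rows)

    matched = [a for a in apps if all(a.get(k) == v for k, v in cohort.items())]
    return {
        "cohort_n": len(matched),           # surfaced so the reader sees the sample size
        "cohort_approval": rate(matched),
        "borough_approval": rate(apps),
    }
```

Returning the cohort n alongside both rates is what lets the output read as "41% vs the borough average of 56%" with the sample size in view.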
Back-test sample sizes and confidence intervals for both model classes are documented in each Borough Intelligence Report. A consolidated back-test summary, with per-cell n and CI, is being added to this page in the next quarterly refresh.
Evidence tiers
Every finding in every Perfect Scale output carries an evidence tier: Robust, Indicative, Suggestive, or Anecdotal. The tier reflects sample size, the appropriateness of any statistical test (Mann-Whitney U, chi-squared, Spearman correlation), and whether the cohort meets a minimum-n gate. Robust cells empirically out-predict Anecdotal cells by 37% across the dashboard back-tests. The four tiers exist so readers know exactly how much weight to put on each claim — a number drawn from 200 applications is not the same as one drawn from a handful.
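The tier assignment can be sketched as a gate on sample size and test significance. The thresholds below are illustrative assumptions, not Perfect Scale's published gates:

```python
from typing import Optional

def evidence_tier(n: int, p_value: Optional[float] = None) -> str:
    """Map a cohort's sample size and (optional) test p-value to a tier.
    Cut-offs are illustrative, not the production minimum-n gates."""
    if n < 10:
        return "Anecdotal"   # a handful of applications
    if n < 30:
        return "Suggestive"
    if n < 100 or (p_value is not None and p_value >= 0.05):
        return "Indicative"  # decent n but no significant test result
    return "Robust"          # large n and a significant test
```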
Limits and cadence
The dataset is London-only by design, focused on schemes of 1–9 units. Conditions, officer intelligence, and refusal-reason extraction depend on harvested Decision Notices and Officer Reports — coverage varies between boroughs because portals vary in accessibility (Westminster currently caps harvest at ~23% on its POST-blocked Idox; most boroughs sit at 70%+). Where coverage is below 50%, the relevant findings are marked Indicative at best and explicitly footnoted in every report.
Refresh cadence is quarterly: each borough is re-pulled from the PLD, re-harvested for any new refused applications, NLP-classified, and the workbook regenerated. The full pipeline is documented and resumable; any stage can be re-run independently. Borough-level data-window-end dates are published on every dashboard footer.
See the data for yourself.
Download a free Market Snapshot for any London borough. Five findings, one page.