open-source population infrastructure

A nation is millions of households.
We build a faithful copy of all of them.

populace is a stack for constructing weighted synthetic populations from public survey and administrative data — realistic enough to model tax and benefit policy for everyone, private by construction, and improved in the open.

every point above is a synthetic household · brightness ∝ survey weight

01 — the idea

The sampling frame, made executable.

Every population estimate rests on a frame: the list of units a sample is drawn from, and the weights that scale them back up to a country. In most pipelines that frame is implicit — scattered across data files, weight columns, and convention. When the convention breaks, the numbers break silently.

populace makes the frame a first-class datatype. Entity tables — people, households, tax units — with explicit links, typed weights that can never be silently zeroed, and a record of where every row came from. Imputation, calibration, and policy simulation are operators on that one object. The structure is built once and never re-derived.
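A minimal sketch of what "the frame as a datatype" could look like, under stated assumptions. The class and field names (`Weight`, `Household`, `Person`, `Frame`, `source`) are hypothetical illustrations, not populace's actual API, and rejecting non-positive weights at construction is only one possible reading of "never silently zeroed".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weight:
    """A survey weight that must be finite and strictly positive."""
    value: float

    def __post_init__(self):
        # Reject zero, negative, NaN, and infinite weights at construction time.
        if not (0.0 < self.value < float("inf")):
            raise ValueError(f"invalid weight: {self.value!r}")


@dataclass(frozen=True)
class Household:
    household_id: int
    weight: Weight
    source: str  # provenance: which survey or file the row came from


@dataclass(frozen=True)
class Person:
    person_id: int
    household_id: int  # explicit link to the owning household
    age: int


@dataclass
class Frame:
    households: list[Household]
    people: list[Person]

    def __post_init__(self):
        # Every person must link to a household that exists in the frame.
        known = {h.household_id for h in self.households}
        for p in self.people:
            if p.household_id not in known:
                raise ValueError(f"person {p.person_id} has no household")

    def weighted_population(self) -> float:
        # Each person counts with the weight of the household they belong to.
        w = {h.household_id: h.weight.value for h in self.households}
        return sum(w[p.household_id] for p in self.people)


frame = Frame(
    households=[
        Household(1, Weight(1500.0), "survey_a"),
        Household(2, Weight(500.0), "survey_b"),
    ],
    people=[Person(1, 1, 40), Person(2, 1, 38), Person(3, 2, 70)],
)
```

Because the checks run when the object is built, a broken link or a zeroed weight fails loudly at construction instead of surfacing later as a wrong total.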

02 — the stack

Three operators on one frame.

frame · shipped

The kernel

The weighted sampling frame: typed weights with conservation invariants, strata for provenance, weighted accounting, entity structure, and a rules-engine adapter protocol so the tax-benefit model is swappable.

fit · shipped

Conditional models

Weight-aware imputation by construction. Regime-gated, chained quantile forests draw missing variables from the weighted conditional distribution. There are no silent unweighted defaults, a bug class that has quietly distorted survey-based estimates for years.
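To make the weighted-versus-unweighted distinction concrete, here is a small sketch. It uses a weighted hot-deck draw within cells, a much simpler stand-in for chained quantile forests, and every name in it (`weighted_quantile`, `impute`, the `donors` layout) is a hypothetical illustration rather than populace's API.

```python
import random


def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (0 < q <= 1)."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    running = 0.0
    for v, w in pairs:
        running += w
        if running >= q * total:
            return v
    return pairs[-1][0]


def impute(donors, cell, rng):
    """Draw one donor value from the same cell, with probability proportional to weight."""
    pool = [(v, w) for c, v, w in donors if c == cell]
    if not pool:
        raise ValueError(f"no donors in cell {cell!r}")
    values, weights = zip(*pool)
    return rng.choices(values, weights=weights, k=1)[0]


# Illustrative donors: (cell, value, survey weight). The numbers are made up.
donors = [
    ("renter", 10_000, 900.0),
    ("renter", 50_000, 100.0),
    ("owner", 80_000, 500.0),
]

# The same two renter records give different answers once weights are respected.
weighted_q75 = weighted_quantile([10_000, 50_000], [900.0, 100.0], 0.75)
unweighted_q75 = weighted_quantile([10_000, 50_000], [1.0, 1.0], 0.75)
```

Here the weighted 75th percentile is 10,000 while the unweighted one is 50,000: dropping the weights treats a record that stands for 100 households as if it counted as much as one that stands for 900.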

calibrate · shipped

Representation

The one place weights become representative. Sparse calibration with L0 pruning turns a large, diverse pool of candidate records into a population that hits thousands of administrative targets at once.
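As a rough illustration of what "hitting targets" means, the sketch below rescales weights with classic raking (iterative proportional fitting). This is a swapped-in textbook technique, not the sparse L0 method described above, and it does no pruning. The function names, group labels, and target totals are all hypothetical.

```python
def rake(weights, memberships, targets, passes=50):
    """Iterative proportional fitting: rescale each group's weights to hit its target total."""
    w = list(weights)
    for _ in range(passes):
        for group, target in targets.items():
            idx = [i for i, m in enumerate(memberships) if group in m]
            current = sum(w[i] for i in idx)
            if current == 0:
                raise ValueError(f"no weight left to represent {group!r}")
            for i in idx:
                w[i] *= target / current
    return w


def max_relative_error(weights, memberships, targets):
    """Worst relative miss across all targets (0.0 means every target is hit)."""
    worst = 0.0
    for group, target in targets.items():
        total = sum(w for w, m in zip(weights, memberships) if group in m)
        worst = max(worst, abs(total - target) / abs(target))
    return worst


# Four candidate records, each belonging to one region and one tenure group.
memberships = [
    {"north", "own"},
    {"north", "rent"},
    {"south", "own"},
    {"south", "rent"},
]
# Made-up administrative totals for illustration only.
targets = {"north": 600.0, "south": 400.0, "own": 700.0, "rent": 300.0}

start = [250.0] * 4
calibrated = rake(start, memberships, targets)
```

Raking only changes weights. It never invents records, which is one way to read the division of labor above: the candidate pool has to contain the right kinds of households before calibration can make them representative.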

generation owns support · calibration owns representation

03 — evidence

It already matches the data it aims to replace.

PolicyEngine's enhanced Current Population Survey is the microdata behind millions of US policy calculations. The published populace-US release — built entirely from primary sources (the incumbent is the benchmark, never an input), with full variable parity — beats it on training, held-out, and full-surface loss in the matched-sample, symmetric-refit comparison.

training loss · 0.18 vs 1.09 · lower is better
held-out loss · 0.04 vs 0.32 · 739 unseen targets
full-surface loss · 0.21 vs 1.41 · all 3,704 targets

build populace-us-2024-5da5a95 · 2026-06-11 · matched 41,314 households, symmetric refit. Counted target by target, the incumbent still wins more often (2,528 of 3,704 targets to our 1,127): we win big where we win and lose narrowly where we lose. Net short-term capital gains land on the signed PUF-anchored target (−$77.4B), and every donor is a primary survey. Every remaining gap is itemized on the observatory. The two populations share an open-source unit-construction engine, so this comparison measures synthesis quality on a partly shared scaffold. We report the gaps, not just the wins.

04 — the commons

Toward one faithful record per person.

The long-run goal is a communal population that many parties improve — at full scale, one statistically faithful record for every person, carrying no one's private data. Contributions come in three forms, and they are exactly the three operators: records as new strata, conditional models trained on data a contributor holds, and facts as calibration targets.

A contribution merges only if it improves the population's score on held-out, rotating evidence without degrading any protected family. Privacy is enforced by provenance and measurement, not by blurring: public sources can be sharp, private evidence enters only through certified models, and the population must resemble held-out data — never anyone's training data.
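The merge rule can be sketched as a simple gate. This assumes each evidence family has a single loss where lower is better. The function name, the family names, and the `tolerance` parameter are hypothetical, and the numbers are made up for illustration.

```python
def should_merge(before, after, held_out, protected, tolerance=0.0):
    """Accept a contribution only if held-out loss drops and no protected family gets worse.

    before, after: dict mapping family name -> loss (lower is better).
    held_out: name of the held-out family used for scoring.
    protected: names of families that must not degrade.
    """
    improves = after[held_out] < before[held_out]
    no_regression = all(after[f] <= before[f] + tolerance for f in protected)
    return improves and no_regression


before = {"held_out": 0.30, "income": 0.10, "benefits": 0.20}
helps = {"held_out": 0.25, "income": 0.10, "benefits": 0.18}
trades_off = {"held_out": 0.20, "income": 0.15, "benefits": 0.18}

accepted = should_merge(before, helps, "held_out", ["income", "benefits"])
rejected = should_merge(before, trades_off, "held_out", ["income", "benefits"])
```

The second candidate has the better held-out score but is still rejected, because it buys that improvement by degrading a protected family.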

Built in the open. Read it, break it, contribute.