The kernel
The weighted sampling frame: typed weights with conservation invariants, strata for provenance, weighted accounting, entity structure, and a rules-engine adapter protocol so the tax-benefit model is swappable.
open-source population infrastructure
populace is a stack for constructing weighted synthetic populations from public survey and administrative data — realistic enough to model tax and benefit policy for everyone, private by construction, and improved in the open.
every point above is a synthetic household · brightness ∝ survey weight
Every population estimate rests on a frame: the list of units a sample is drawn from, and the weights that scale them back up to a country. In most pipelines that frame is implicit — scattered across data files, weight columns, and convention. When the convention breaks, the numbers break silently.
populace makes the frame a first-class datatype. Entity tables — people, households, tax units — with explicit links, typed weights that can never be silently zeroed, and a record of where every row came from. Imputation, calibration, and policy simulation are operators on that one object. The structure is built once and never re-derived.
The weighted sampling frame: typed weights with conservation invariants, strata for provenance, weighted accounting, entity structure, and a rules-engine adapter protocol so the tax-benefit model is swappable.
Weight-aware imputation by construction. Regime-gated, chained quantile forests draw missing variables from the weighted conditional distribution — no silent unweighted defaults, the bug class that has quietly distorted survey data for years.
The one place weights become representative. Sparse calibration with L0 pruning turns a large, diverse pool of candidate records into a population that hits thousands of administrative targets at once.
generation owns support · calibration owns representation
PolicyEngine's enhanced Current Population Survey is the microdata behind millions of US policy calculations. The published populace-US release — built entirely from primary sources (the incumbent is the benchmark, never an input), with full variable parity — beats it on training, held-out, and full-surface loss in the matched-sample, symmetric-refit comparison.
build populace-us-2024-5da5a95 · 2026-06-11 · matched 41,314 households, symmetric refit. Per individual target the incumbent still wins more often (2,528 of 3,704 to our 1,127) — we win big where we win and lose narrowly where we lose. Net short-term capital gains land on the signed PUF-anchored target (−$77.4B), and every donor is a primary survey; every remaining gap is itemized on the observatory. The two populations share an open-source unit-construction engine, so this measures synthesis quality on a partly shared scaffold. We report the gaps, not just the wins.
The long-run goal is a communal population that many parties improve — at full scale, one statistically faithful record for every person, carrying no one's private data. Contributions come in three forms, and they are exactly the three operators: records as new strata, conditional models trained on data a contributor holds, and facts as calibration targets.
A contribution merges only if it improves the population's score on held-out, rotating evidence without degrading any protected family. Privacy is enforced by provenance and measurement, not by blurring: public sources can be sharp, private evidence enters only through certified models, and the population must resemble held-out data — never anyone's training data.