Introducing pseudopeople: census-scale simulated data for entity resolution

Please join us for a UW Data Science Seminar event on Wednesday, May 17th from 4:30 to 5:20 p.m. PDT. The seminar will feature Abraham D. Flaxman, Associate Professor of Health Metrics Sciences at the UW Institute for Health Metrics and Evaluation (IHME).

Use this Zoom link to join


“Introducing pseudopeople: census-scale simulated data for entity resolution”

Abstract: I will introduce and demo pseudopeople, our new, publicly available Python package that we hope you will use in entity resolution research and development. pseudopeople generates census-scale simulated population data with adjustable parameters to replicate key complexities from real challenges in record linkage work. Typical applications of entity resolution and record linkage rely on sensitive and confidential data, which can be a barrier to reproducible computational research and sometimes even to open communication about innovations and challenges. The value hypothesis of this work is that creating realistic, simulated data (including non-confidential simulated versions of sensitive fields, like name, address, and date of birth) will enable more research in census-scale entity resolution and guide that research towards challenges that the Census Bureau faces in practice.

Our work builds on previous entity resolution data projects, such as FEBRL, GeCO, and SOG, as well as our microsimulation framework, Vivarium. We model individual people and their household, family, and employment relations at USA scale, and include simulated versions of confidential attributes like name, address, income, and social security number. On top of this, we simulate a range of census-relevant data collection mechanisms, including decennial censuses, ACS and CPS surveys, tax records, and social security administrative data. By creating realistic but non-confidential data that includes these attributes, we can make entity resolution research and development easier for ourselves and others.

Biography: Abraham D. Flaxman, PhD, is an Associate Professor of Health Metrics Sciences at the Institute for Health Metrics and Evaluation (IHME) at the University of Washington. He is currently leading the development of a simulation platform to derive “what-if” results from Global Burden of Disease estimates and is engaged in software engineering and development for verbal autopsy and probabilistic record linkage. Dr. Flaxman has previously designed software tools such as DisMod-MR that IHME uses to estimate the Global Burden of Disease, and the Bednet Stock-and-Flow Model, which has produced estimates of insecticide-treated net coverage in sub-Saharan Africa.

The UW Data Science Seminar is an annual lecture series at the University of Washington that hosts scholars working across applied areas of data science, such as the sciences, engineering, humanities, and arts, along with methodological areas in data science, such as computer science, applied math, and statistics. Our presenters come from all domain fields and include occasional external speakers from regional partners, governmental agencies, and industry.


May 17 2023


4:30 pm - 6:00 pm