Improving & Integrating Diversity Estimates

NSF Award #1824005

Co-PI, Scott Althaus, Merriam Professor of Political Science, University of Illinois

Existing estimates of ethnic, religious, and linguistic diversity are correlated cross-sectionally with a number of socio-political and economic outcomes including development, conflict, and social capital. Close examination of these data raises validity concerns: few are based on high-quality official statistics, the majority coming from questionable secondary sources. Further, criteria for group inclusion (i.e., ontologies) are opaque and inconsistently applied. Even where they appear accurate, data are static and aggregated at the country level, although they are often used to explain time-varying and spatially disaggregated outcomes. Ontologies in extant datasets are also incompatible, making comparison and integration difficult.

Our proposal improves existing measures by applying machine learning methods to compare 13.7 million survey responses across 180 countries with a new database of census results. An algorithm will identify the survey design features that maximize accuracy and use them to define a compensatory weighting scheme. The result is a set of survey-based demographic estimates with improved validity, even for countries lacking reliable census data. This method of triangulating surveys and official statistics is generalizable to research areas that use either source and can also inform improved survey design. The project will also develop tools for linking surveys, censuses, and existing datasets based on explicit and transparent decision rules to facilitate their comparison and integration. An online portal will provide access to datasets and code, supporting customized data manipulation and visualization. The methods and tools proposed here, which emphasize accuracy, transparency, and cross-resource integration, should serve as a model for future data collection.

Find the full project description here. More information about the NSF award is available here.

Our proposed method of improving and integrating estimates of diversity worldwide entails a number of steps:

1. Identifying and integrating self-identification data from cross-national surveys

Among all cross-national survey projects that have conducted multiple waves in multiple countries, thirty ask respondents to self-identify in terms of ethnicity (E), religion (R), and/or language (L).

In total, we have information about the self-identification of over 13.7 million individuals across 180 countries between 1962 and 2019.



2. Building a database of national censuses that enumerate ethnicity, religion, and language

A motivating factor behind this project was the observation that most of the demographic statistics used in existing estimates of diversity could not be traced back to the results of a recent national census. This is because the enumeration of identity in national censuses is far from universal, and there are important socio-political predictors of which countries enumerate ethnicity, religion, and language. This raises concerns about systematic measurement error in our existing estimates, correlated with these predictors of enumeration.

Still, there is important information to be gleaned from these official statistics, where they are available. In our case, we propose to use high-quality census results as an input into a machine learning algorithm identifying systematic measurement error across surveys.

Toward this end, we gather information about census enumeration in every country since 1900 to produce a "census of censuses." If a census was administered in a given country-year, we look for information about whether ethnicity, religion, or language was included and how these categories were enumerated. Whenever a category was enumerated, we look for a report of the results. In this last effort, we have been graciously supported by the incredible team at the University of Illinois Library and have made great use of the Inter-Library Loan service.
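As an illustration only, one entry in such a "census of censuses" could be represented as a country-year record with a flag for each enumerated category and a flag for whether the published results have been located. The class name `CensusRecord`, its fields, and the helper `missing_results` are hypothetical and are not taken from the project's actual database; the countries are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CensusRecord:
    """One census in one country-year (hypothetical schema)."""
    country: str
    year: int
    ethnicity: bool = False        # was ethnicity enumerated?
    religion: bool = False         # was religion enumerated?
    language: bool = False         # was language enumerated?
    results_located: bool = False  # has a report of the results been found?

    def enumerated(self):
        """Return the names of the identity categories this census enumerated."""
        names = ("ethnicity", "religion", "language")
        return [n for n in names if getattr(self, n)]


def missing_results(records):
    """Censuses that enumerated a category but whose results are not yet located."""
    return [r for r in records if r.enumerated() and not r.results_located]


# Placeholder records, not real census information.
records = [
    CensusRecord("Country A", 1950, ethnicity=True, language=True, results_located=True),
    CensusRecord("Country A", 1960, religion=True),
    CensusRecord("Country B", 1961),  # census held, no identity questions
]

todo = missing_results(records)
for r in todo:
    print(r.country, r.year, r.enumerated())
```

A flat record like this keeps the three questions from the paragraph above separate: whether a census was held, which categories it enumerated, and whether the results have been found.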

3. Understanding the ethnic, religious, and linguistic structure of countries to connect categories across surveys, censuses, and other existing datasets

A key challenge in making full use of existing data on identity, including surveys and censuses, is that each source tends to use its own set of ethnic, religious, and linguistic categories. Creating linkages across these ontologies is rarely straightforward. The same group can have many different names, and each of these names can have different spellings. Inconsistencies in these naming and spelling practices within and across countries mean that even fuzzy-matching algorithms will struggle to make appropriate matches. Further, the nested structure of identity means that ontologies may operate at different levels of aggregation, with some listing coarser umbrella categories while others include smaller sub-groups.

For these reasons, we rely on a team of human coders to research the ethnic, religious, and linguistic structure of different countries, identifying which categories are one-to-one matches and which are nested inside each other. Their carefully documented work not only helps us to integrate surveys within countries, and to link these to census results where they are available, but also connects all of these back to a range of existing datasets, including Ethnic Power Relations (EPR), Minorities at Risk (MAR), the Composition of Religious and Ethnic Groups (CREG), the World Religion Project, and Ethnologue. The result is a dictionary of ethnic, religious, and linguistic groups worldwide that should prove useful to scholars from a variety of disciplines.
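To make the two kinds of linkage concrete, the following minimal sketch assumes a hand-coded dictionary with two parts: a table of name variants pointing to one canonical group, and a table recording which groups sit inside an umbrella category. The tables `ALIASES` and `PARENT`, the function names, and the group labels are all placeholders invented for illustration; they do not reproduce the project's dictionary.

```python
# Hypothetical, hand-coded dictionary entries (placeholder group names).
ALIASES = {
    "group a": "Group A",
    "groupe a": "Group A",          # alternative spelling
    "subgroup a1": "Subgroup A1",
    "sub-group a-1": "Subgroup A1",
    "group b": "Group B",
}
PARENT = {"Subgroup A1": "Group A"}  # child -> umbrella category


def canonical(label):
    """Map a source-specific label to its canonical group, or None if uncoded."""
    return ALIASES.get(label.strip().lower())


def ancestors(group):
    """Return the group followed by every umbrella category above it."""
    chain = [group]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def relation(label_x, label_y):
    """Classify how two labels from different sources relate to each other."""
    x, y = canonical(label_x), canonical(label_y)
    if x is None or y is None:
        return "unmatched"
    if x == y:
        return "one-to-one"
    if y in ancestors(x):
        return "first nested in second"
    if x in ancestors(y):
        return "second nested in first"
    return "distinct"


print(relation("Groupe A", "group a"))
print(relation("sub-group a-1", "Group A"))
print(relation("Grp A", "Group A"))
```

Labels that no coder has researched come back as "unmatched" rather than being guessed at, which mirrors the reliance on documented human judgment instead of automatic string matching.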

Our team of coders is composed of research interns who take courses through the Department of Political Science (see more information about these courses here). In addition to getting hands-on experience in the data-generation process, the interns learn more about the role of data analytics in political science and gain some preliminary experience in the analysis of quantitative data, working towards conducting their own research. The internship program is generously hosted by the Cline Center for Advanced Social Research, which provides office space and equipment, in addition to welcoming the team into its intellectual community.

4. Constructing a database of survey design characteristics and an algorithm to detect systematic measurement errors

We assume that most systematic measurement error across surveys reflects differences in the way that surveys were designed and administered. To identify and correct for these sources of error, we sift through methodological reports and questionnaires for each included survey, coding information about the design of each sample, the methods used in the survey as a whole, and how questions about identity are asked and answered. The variables in this "survey of surveys" serve as inputs into an algorithm comparing our survey-based estimates to the results of high-quality censuses, where they exist. The algorithm defines a compensatory weighting scheme applied to the full set of surveys.
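The text does not specify the algorithm, so the sketch below is a deliberately simplified stand-in rather than the project's method. It assumes a single categorical design feature, measures each design's mean absolute error against census benchmarks, and then weights surveys in inverse proportion to that error. All function names, the design labels, and the numbers are made up for illustration.

```python
from collections import defaultdict


def design_errors(benchmarked):
    """Mean absolute error per design value.

    `benchmarked` holds (design, survey_share, census_share) tuples for
    country-years where a high-quality census result exists.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for design, survey_share, census_share in benchmarked:
        totals[design][0] += abs(survey_share - census_share)
        totals[design][1] += 1
    return {d: s / n for d, (s, n) in totals.items()}


def compensatory_weights(designs, errors, floor=0.01):
    """Weights inversely proportional to each design's benchmark error.

    Designs never benchmarked get the largest observed error (an assumption).
    `floor` keeps a near-zero error from dominating the weights.
    """
    fallback = max(errors.values())
    raw = [1.0 / max(errors.get(d, fallback), floor) for d in designs]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_share(estimates, weights):
    """Combine several surveys' estimates of one group's population share."""
    return sum(e * w for e, w in zip(estimates, weights))


# Made-up benchmark data: (design, survey share, census share).
benchmarked = [
    ("probability", 0.30, 0.32),
    ("probability", 0.41, 0.40),
    ("quota", 0.25, 0.35),
    ("quota", 0.50, 0.44),
]
errors = design_errors(benchmarked)

# Two surveys of a country with no reliable census, one of each design.
designs = ["probability", "quota"]
estimates = [0.20, 0.30]
weights = compensatory_weights(designs, errors)
combined = weighted_share(estimates, weights)
print(weights, combined)
```

In this toy example the combined estimate is pulled toward the survey whose design performed better against the census benchmarks. A fuller treatment would model many design variables jointly instead of one category at a time.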

5. Developing an online portal through which scholars and the public can engage with the components of the project and its main results

Designed and hosted by the Cline Center for Advanced Social Research, a website provides access to datasets and code and supports customized data manipulation and visualization. Designed with scholars, policy-makers, and members of the public in mind, this user-driven portal should help facilitate a better understanding of diversity worldwide.