After an 11-year hiatus, the long-awaited Census 2022 has arrived, but it has attracted attention for all the wrong reasons. Countless articles and blogs have covered the Census 2022 undercount, estimated at 30%. The undercount is a significant outlier not only in an international context (the US undercount in 2020, during the pandemic, was estimated at 0.2%) but also when compared with previous South African censuses (17% in 2001 and 14% in 2011, which were themselves concerning).
If the 30% who were excluded from Census 2022 had been omitted at random, the remaining 70% would still be a representative sample of South Africa’s population. The dataset would still be very useful, as the required adjustments would be trivial for statisticians. Unfortunately, this is not the case. The post-enumeration survey (which itself undercounted) found that some groups of people were more likely than others to be excluded. Stats SA noted, for example, that enumerators found it more difficult to survey those who live in gated communities. This makes it difficult to determine whether emerging trends in the data reflect real shifts in the population. According to the Gauteng City-Region Observatory[1], Census 2022 indicates that growth in the City of Johannesburg’s population has been slower than the national average. The researchers are unable to determine whether this apparent wave of deurbanisation is real or a function of the undercount.
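To see why the pattern of the undercount matters more than its size, consider a minimal sketch with a made-up two-group population. The group names, sizes, incomes and coverage rates below are all assumptions chosen for illustration; none of them are census figures.

```python
# Hypothetical two-group population; all figures are illustrative, not census data.
POPULATION = {
    "easier_to_count": {"size": 800_000, "mean_income": 5_000},
    "harder_to_count": {"size": 200_000, "mean_income": 40_000},
}


def mean_income_of_counted(coverage):
    """Mean income among the people actually counted.

    `coverage` maps each group to the share of that group that was enumerated.
    """
    counted = {g: POPULATION[g]["size"] * coverage[g] for g in POPULATION}
    total_income = sum(counted[g] * POPULATION[g]["mean_income"] for g in POPULATION)
    return total_income / sum(counted.values())


def overall_coverage(coverage):
    """Share of the whole population that was counted."""
    counted = sum(POPULATION[g]["size"] * coverage[g] for g in POPULATION)
    return counted / sum(v["size"] for v in POPULATION.values())


full_count = {"easier_to_count": 1.0, "harder_to_count": 1.0}
# Everyone has the same 70% chance of being counted.
random_undercount = {"easier_to_count": 0.70, "harder_to_count": 0.70}
# Still 70% counted overall, but one group is missed far more often.
uneven_undercount = {"easier_to_count": 0.75, "harder_to_count": 0.50}

for label, coverage in [
    ("full", full_count),
    ("random", random_undercount),
    ("uneven", uneven_undercount),
]:
    print(label, round(overall_coverage(coverage), 2), round(mean_income_of_counted(coverage)))
```

In this toy example both scenarios miss 30% of people, yet only the uneven one distorts the result: the mean income of those counted falls from 12,000 to 10,000 because the harder-to-count group is under-represented. Correcting for this requires knowing each group's coverage rate, which is exactly what a flawed post-enumeration survey makes uncertain.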
Besides the undercount, non-sampling issues also affect data quality. People who don’t trust Stats SA to keep their data private and secure will be less likely to answer truthfully – if at all. The wealthy in South Africa might underreport their income, possibly to avoid resentment given our high levels of inequality, or to avoid taxes. Likewise, foreign nationals who fear deportation or xenophobic violence[2] might report being South African-born. According to Diego Iturralde of Stats SA[3], this is the reason for the lower-than-expected number of foreign nationals in Census 2022.
Clearly, the census, for all its value, has serious shortcomings. Alternative data sources, such as administrative records and information generated as a by-product of transactions and interactions (often referred to as Big Data), can help to close some data gaps. These sources can be significantly more reliable than survey responses. They are also usually current and relatively easy to access, as no additional data collection is required.
For instance, according to the World Bank’s Global Financial Inclusion Database 2021, 84% of South African adults (aged 15 years or older) had a bank account, while the Pew Research Center[4] estimates that 87% of South African adults have their own phones. This suggests that banks and mobile network providers may well be capturing data on more adults than Census 2022 did. It is theoretically possible to construct a secure data-sharing architecture that would match individuals across banks and mobile networks (to avoid double counting) and then aggregate the results into a national count of banked individuals. More geographically granular data could be derived from mobile network data, based on each user’s most-used cell towers. Technological safeguards can be put in place to protect the identities of individual users and reduce reliance on the discretion of data scientists or executives.
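One possible building block for such matching, sketched here under stated assumptions, is keyed hashing: each provider converts customer ID numbers into tokens using a secret key shared among providers, and only the tokens leave the provider. The function names, the sample identifiers and the idea of a shared key are all hypothetical. A real architecture would need far stronger guarantees (key management, governance, and possibly formal privacy-preserving record linkage).

```python
import hashlib
import hmac


def pseudonymise(id_number: str, shared_key: bytes) -> str:
    """Return an HMAC-SHA256 token for an ID number.

    The same ID and key always give the same token, so records can be matched
    without the raw ID number being shared.
    """
    return hmac.new(shared_key, id_number.encode("utf-8"), hashlib.sha256).hexdigest()


def provider_tokens(customer_ids, shared_key: bytes) -> set:
    """Run inside each provider: turn its customer IDs into a set of tokens."""
    return {pseudonymise(i, shared_key) for i in customer_ids}


def count_unique_individuals(token_sets) -> int:
    """Run by the aggregator: count distinct people across all providers."""
    return len(set().union(*token_sets))


# Made-up identifiers for illustration only.
key = b"hypothetical-shared-secret"
bank_a = provider_tokens(["ID-001", "ID-002", "ID-003"], key)
bank_b = provider_tokens(["ID-002", "ID-004"], key)
network = provider_tokens(["ID-001", "ID-004", "ID-005"], key)

print(count_unique_individuals([bank_a, bank_b, network]))
```

The aggregator sees eight tokens in total but counts only five distinct individuals, because a person who appears at several providers maps to the same token each time. The aggregator never handles raw ID numbers, although anyone holding the key could recompute tokens, which is why this is only a starting point.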
Developments in computing have also reduced the cost of processing so-called big data. Machine learning algorithms can process high-resolution imagery generated by drones to detect buildings and structures in urban settings. In rural areas, the same techniques enable crop mapping, which provides up-to-date regional statistics on the land area used for different crop types.
Alternative data is not without shortcomings. New datasets would initially lack historical data, which would make adjusting for seasonality difficult. They can also be non-representative – those who own cell phones are different from those who don’t, for example. Access to the skills needed to analyse alternative data in all its various forms and formats can also be a challenge. Finally, much of this data resides in private sector institutions that may be unwilling to share it.
These challenges notwithstanding, alternative data can provide an important cross-check for survey and census data, reduce the response burden on stakeholders, and become particularly useful in times of crisis when timely data updates can be vital.
As a first step, administrative data generated by various public sector entities should be more actively explored. SARS[5] already makes available very useful aggregated data on taxpayers in South Africa. Likewise, Unemployment Insurance Fund (UIF) contributions and claims data could be used to gauge changes in employment across industries, firm sizes and locations, while South African Social Security Agency (SASSA) data on grant payments could be used to track trends in poverty. If these institutions were to make use of standardised digital records, it would be possible to set up data pipelines that generate near-real-time aggregated data.
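The aggregation step of such a pipeline could be quite simple. The sketch below assumes a hypothetical record layout (one row per employer declaration, with a month, an industry and a number of employees) and a hypothetical minimum cell size below which results are suppressed to protect confidentiality. Neither reflects how the UIF actually stores or publishes its data.

```python
from collections import defaultdict


def aggregate_employment(records, min_cell_size=10):
    """Sum employees per (month, industry) and drop cells that are too small.

    `records` is an iterable of dicts with the assumed keys
    "month", "industry" and "employees".
    """
    totals = defaultdict(int)
    for record in records:
        totals[(record["month"], record["industry"])] += record["employees"]
    # Suppress small cells so that individual employers cannot be identified.
    return {cell: n for cell, n in totals.items() if n >= min_cell_size}


# Made-up declarations for illustration only.
declarations = [
    {"month": "2023-01", "industry": "retail", "employees": 120},
    {"month": "2023-01", "industry": "retail", "employees": 80},
    {"month": "2023-02", "industry": "retail", "employees": 190},
    {"month": "2023-01", "industry": "mining", "employees": 4},
]

for cell, employees in sorted(aggregate_employment(declarations).items()):
    print(cell, employees)
```

Run on each new batch of standardised records, a step like this would publish only aggregated counts (here, retail employment falling from 200 to 190 between the two months), while the mining cell is withheld because it is too small to release safely.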
At the same time, we need to encourage decision-makers to wean themselves off familiar and on-hand, but too often flawed, sources of data – chief among them, the census. In some cases, they may be required by law to rely on this data. If anything, the recent census provides clear, unambiguous evidence to motivate for a new, very much improved alternative.
[1] https://theconversation.com/south-africas-2022-census-has-johannesburg-stopped-growing-or-are-the-numbers-wrong-215610
[2] https://www.ohchr.org/en/press-releases/2022/07/south-africa-un-experts-condemn-xenophobic-violence-and-racial
[3] https://www.dailymaverick.co.za/article/2023-10-12-how-much-can-we-rely-on-census-2022/
[4] https://www.pewresearch.org/internet/2019/11/20/mobile-divides-in-emerging-economies/
[5] https://www.sars.gov.za/about/sars-tax-and-customs-system/tax-statistics/