## Abstract

**Introduction** The wealth index is widely used as a proxy for a household’s socioeconomic position (SEP) and living standard. This work constructs a wealth index for the Mopeia district in Mozambique using data collected in year 2021 under the BOHEMIA (Broad One Health Endectocide-based Malaria Intervention in Africa) project.

**Methods** We evaluate the performance of three alternative approaches against the Demographic and Health Survey (DHS) method based wealth index: feature selection principal components analysis (PCA), sparse PCA and robust PCA. The internal coherence between four wealth indices is investigated through statistical testing. Validation and an evaluation of the stability of the wealth index are performed with additional household income data from the BOHEMIA Health Economics Survey and the 2018 Malaria Indicator Survey data in Mozambique.

**Results** The Spearman’s rank correlation between wealth index ventiles from four methods is over 0.98, indicating a high consistency in results across methods. Wealth rankings and households’ income show a strong concordance with the area under the curve value of ~0.7 in the receiver operating characteristic analysis. The agreement between the alternative wealth indices and the DHS wealth index demonstrates the stability in rankings from the alternative methods.

**Conclusions** This study creates a wealth index for Mopeia, Mozambique, and shows that DHS method based wealth index is an appropriate proxy for the SEP in low-income regions. However, this research recommends feature selection PCA over the DHS method since it uses fewer asset indicators and constructs a high-quality wealth index.

- public health
- indices of health and disease and standardisation of rates
- health services research

## Data availability statement

Data are available in a public, open access repository. The BOHEMIA consortium has agreed to make the data underlying each manuscript openly available on publication. The data supporting this paper were generated by the BOHEMIA project and can be found in the ISGlobal dataverse repository (https://doi.org/10.34810/data682).

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

#### WHAT IS ALREADY KNOWN ON THIS TOPIC

The wealth index, derived from household assets, is widely used as a proxy indicator for socioeconomic position (SEP) and living standard.

The importance of principal components analysis (PCA) in constructing the wealth index is well accepted by researchers.

#### WHAT THIS STUDY ADDS

This research provides an alternative to Demographic and Health Survey (DHS) methodology for constructing a wealth index in data poor regions, and gives insights into the effectiveness of using alternative approaches for creating a wealth index, including feature selection PCA, sparse PCA and robust PCA.

The feature selection PCA method, when applied to the district of Mopeia in Mozambique, shows an improvement over the DHS methodology since it reduces the required number of input variables by 40% and yet constructs a high-quality wealth index.

The method works for the Mopeia region as well as for the entire country, Mozambique.

#### HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

The proposed method reduces the burden of data collection on investigators and simplifies the construction of the wealth index.

For other low-income and middle-income countries which are data sparse, it will be easier to build the wealth index, which is a key indicator of the SEP of households.

## Introduction

The elimination of extreme poverty in all its forms everywhere by 2030 is one of the major goals of the United Nations’ Sustainable Development Agenda. Despite consistent and widespread progress, poverty remains a major problem in Africa.1 The socioeconomic position (SEP) of households is a key indicator of poverty and is generally measured in terms of income and consumer spending. In low-income and middle-income countries (LMICs) such as African countries, where most of the population lives in rural areas, this information is not readily available. Hence, the wealth index was developed as a proxy measure for SEP in LMICs.2 The wealth index is constructed using asset-based information, including ownership of durable assets, housing characteristics and access to basic services, which is easier to obtain and more reliable in terms of data collection than income and consumption.2

Demographic and Health Survey (DHS) datasets are widely used with the classical PCA (principal components analysis) approach for the wealth index construction, known as the DHS Wealth Index.3 When it comes to explaining variation in education, child mortality, nutrition, fertility and healthcare, classical PCA wealth indices based on the DHS dataset frequently outperform spending data.2 4–7

Despite its long success, questions have been raised about the issues in the construction and interpretation of wealth indices. For example, both Houweling *et al*8 and Howe *et al*9 stated that the wealth index may differ between urban and rural areas. This is because the PCA procedure assigns more weight to indicators of assets owned by more urban households (eg, television, communication tools, electricity), and less or even negative weights to indicators of assets owned by rural households (eg, livestock, agricultural land), leading to a gross underestimation of the wealth of rural households. Thus, a recent development proposed a polychoric PCA wealth index with two principal components, to reduce the urban bias in standard PCA with one component.10 PCA could also be challenging to interpret due to extremely small weights,11 and proxy methods such as combining categorical and sparse PCA (SPCA)12 have been described. Such approaches, however, still have the issues of redundancy caused by hundreds of categories and sensitivity to outliers.

Here, we focus on building a wealth index for the district of Mopeia in Mozambique. With a per capita GDP of US$541.5 in 2022 reported by the World Bank, 64% of the population in Mozambique is still below the extreme poverty line (international poverty line is US$2.15 (2017 PPP) per day per capita).13 For LMIC and data sparse regions, this study offers new insights into the effectiveness of using alternative PCA approaches for creating a wealth index.

In addition to the classical PCA method, we apply three alternative approaches for building the wealth index: (1) Feature selection PCA approach, which only uses a subset of asset indicators for estimating wealth. (2) SPCA approach, which uses the well-known SPCA14 method using LASSO (elastic net) to provide an easily interpretable modified wealth index. (3) Robust PCA approach, which uses the popular robust PCA method, ROBPCA,15 to overcome the sensitivity of classical PCA to outliers.

## Methods

### Data

Data used for this study are drawn from the demographic survey of the population in Mopeia in 2021 conducted under the Broad One Health Endectocide-based Malaria Intervention in Africa (BOHEMIA) study,16 17 including 25 550 households and 131 818 people in Mopeia district, Zambezia province, in Mozambique. Several authors of this study were engaged in the original data collection phase of the BOHEMIA survey. The present research, however, uses data extracted retrospectively from the public dataset provided by the BOHEMIA demographic study.17 This dataset offers an extensive set of 72 indicators that capture detailed information about each participating household. Out of these, we focus on 16 variables (table 1) as the socioeconomic indicators. These particular variables were selected in accordance with relevant literature obtained from the DHS website.3 This method ensures a thorough and well-founded socioeconomic analysis, grounding our research in established methodologies.

The wealth index is validated using the household income, which was collected by the BOHEMIA team through the 2022 Health Economics Survey in Mopeia. The Health Economics survey gathered income information from 537 households for six consecutive months, including labour income from each member of the household, households’ non-farm business income, and households’ agricultural income. The estimation of total household monthly income in our analysis is derived by combining the individual monthly incomes of all members and all types of household monthly income. Incomes were initially retrieved in 2022 Mozambican meticais and later converted to 2022 US dollars (US$) under the 2022 exchange rate of 63.85 metical/US$.18
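The income aggregation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and averaging the six monthly totals with a simple mean is an assumption. Only the exchange rate of 63.85 metical/US$ comes from the text.

```python
MZN_PER_USD = 63.85  # 2022 exchange rate quoted in the text

def monthly_total_mzn(member_labour, nonfarm_business, agricultural):
    """Total household income for one month, in meticais.

    member_labour: labour income of each household member (list of numbers)
    nonfarm_business, agricultural: household-level income components
    """
    return sum(member_labour) + nonfarm_business + agricultural

def average_monthly_income_usd(monthly_totals_mzn):
    """Mean of the monthly totals (assumed simple mean), converted to US$."""
    mean_mzn = sum(monthly_totals_mzn) / len(monthly_totals_mzn)
    return mean_mzn / MZN_PER_USD
```

For example, a household whose totals are 6385 meticais in each of the six survey months has an average monthly income of about US$100.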

### Classical PCA wealth index

We followed the steps used in constructing the DHS Wealth Index19 to calculate our classical PCA wealth index. The very first step is to convert each category of the 16 asset ownership variables into 71 dummy variables to form an assets’ binary dataset. Then PCA20 is carried out on the correlation matrix of the standardised assets’ binary data. The wealth index for each household is a linear combination of all assets with the PCA weights as corresponding coefficients according to the formula described in online supplemental appendix section 1.1.
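The steps above (dummy coding, standardisation, PCA on the correlation matrix, weighted sum) can be sketched as follows. This is a hedged illustration in plain Python, not the authors' R implementation: the first component is found by power iteration rather than by the ‘psych’ package, the toy survey and all function names are hypothetical, and the scaling of the weights may differ from the formula in the supplemental appendix (rescaling the weights does not change household rankings).

```python
import math

def one_hot(records, variables):
    """Expand each categorical variable into one 0/1 dummy column per category."""
    columns = {}
    for var in variables:
        for level in sorted({rec[var] for rec in records}):
            columns[f"{var}={level}"] = [1.0 if rec[var] == level else 0.0 for rec in records]
    return columns

def standardise(values):
    """Centre a column on 0 and scale it to unit standard deviation (assumes it varies)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def first_component_weights(z_columns, anchor, iterations=500):
    """Leading eigenvector of the correlation matrix, found by power iteration.

    `anchor` names an indicator assumed to signal wealth; the eigenvector's
    arbitrary sign is flipped so that this indicator gets a positive weight.
    """
    names = list(z_columns)
    n = len(z_columns[names[0]])
    corr = [[sum(a * b for a, b in zip(z_columns[p], z_columns[q])) / n for q in names]
            for p in names]
    vec = [1.0 / (i + 1) for i in range(len(names))]  # arbitrary asymmetric start
    for _ in range(iterations):
        new = [sum(c * v for c, v in zip(row, vec)) for row in corr]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    weights = dict(zip(names, vec))
    if weights[anchor] < 0:
        weights = {name: -w for name, w in weights.items()}
    return weights

def wealth_scores(z_columns, weights):
    """Score per household: weighted sum of its standardised indicators."""
    n = len(next(iter(z_columns.values())))
    return [sum(w * z_columns[name][i] for name, w in weights.items()) for i in range(n)]

# Hypothetical toy survey: two categorical asset variables, six households.
households = [
    {"lighting": "electricity", "phone": "yes"},
    {"lighting": "electricity", "phone": "yes"},
    {"lighting": "electricity", "phone": "no"},
    {"lighting": "firewood", "phone": "no"},
    {"lighting": "firewood", "phone": "no"},
    {"lighting": "firewood", "phone": "yes"},
]
dummies = one_hot(households, ["lighting", "phone"])
z = {name: standardise(col) for name, col in dummies.items()}
weights = first_component_weights(z, anchor="lighting=electricity")
scores = wealth_scores(z, weights)
```

In this toy example, households with electricity and a phone receive the highest scores and households with neither receive the lowest. Power iteration is used only to keep the sketch dependency-free; any eigendecomposition routine would serve.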

### Feature selection PCA wealth index

The feature selection PCA selects a much smaller number of asset indicators to build the wealth index. In the classical PCA wealth index, several asset variables have extremely small absolute weights indicating that they are only weakly correlated to the first component. By ignoring variables with small-magnitude loadings, important features can be retained without losing much information.21 Here an absolute weight threshold of 0.01, which is approximately equal to the median of all PCA weights, is applied to filter out the negligible asset variables. A sensitivity analysis of choice of threshold was conducted and is described in online supplemental appendix section 2. Results suggest that researchers should carefully consider the threshold to ensure enough indicators are retained, thereby maintaining the robustness of the wealth quintiles.
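The filtering step can be illustrated with a short sketch. The function names and the example weights are hypothetical; whether the boundary case (an absolute weight exactly equal to the threshold) is kept, and whether the median is taken over absolute weights, are assumptions made here for illustration.

```python
from statistics import median

def median_abs_weight(weights):
    """Median of the absolute PCA weights, one candidate for the threshold."""
    return median(abs(w) for w in weights.values())

def select_indicators(weights, threshold=0.01):
    """Keep indicators whose absolute first-component weight reaches the threshold."""
    return {name: w for name, w in weights.items() if abs(w) >= threshold}

# Hypothetical weights from a classical PCA run.
example_weights = {
    "lighting=electricity": 0.21,
    "wall=paper": 0.002,
    "lighting=firewood": -0.18,
    "cows_owned": -0.004,
}
retained = select_indicators(example_weights, threshold=0.01)
```

Here `retained` holds only the two lighting indicators. The wealth index would then be rebuilt from the retained indicators alone.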

### SPCA wealth index

Another popular variable selection technique that produces accurate yet sparse models is LASSO (elastic net). Zou and Hastie14 proposed SPCA by imposing the LASSO (elastic net) constraint on the regression optimisation problem such that the modified PCA produces sparse loadings (explained in online supplemental appendix section 1.2). This efficient approach is integrated into wealth index construction in our paper, producing SPCA weights and a more interpretable wealth index.
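For orientation, the SPCA criterion for a single component can be written as below. This is the standard form of the elastic net formulation as generally presented in the SPCA literature; the notation ($x_i$ for the standardised indicator vector of household $i$, $\alpha$ and $\beta$ for the two coefficient vectors, $\lambda$ and $\lambda_1$ for the ridge and LASSO penalties) is chosen here and may differ from that in the supplemental appendix.

$$
(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha,\,\beta} \sum_{i=1}^{n} \left\lVert x_i - \alpha \beta^{\top} x_i \right\rVert_2^2 + \lambda \lVert \beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 \quad \text{subject to } \lVert \alpha \rVert_2 = 1, \qquad \hat{v} = \frac{\hat{\beta}}{\lVert \hat{\beta} \rVert_2}
$$

The $\ell_1$ penalty drives some entries of $\hat{\beta}$ exactly to zero, so the normalised loading vector $\hat{v}$ assigns no weight to the corresponding asset indicators. With $\lambda_1 = 0$ the solution is proportional to the ordinary first principal component.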

### Robust PCA wealth index

Classical PCA, feature selection PCA and SPCA methods are sensitive to anomalous observations. This is because the sample covariance or correlation matrix is very sensitive to outliers. Robust PCA is an effective way of obtaining principal components with little impact from outliers.

A well-known robust PCA method, called ROBPCA,15 determines a robust subspace by obtaining an outlier-free subset. The data are then projected onto this subspace to robustly estimate the eigenvectors and eigenvalues. However, ROBPCA is typically suited to roughly symmetrically distributed data, which are uncommon among binary asset data, especially in LMICs. Hubert *et al*22 proposed an improved ROBPCA algorithm, skewness-adjusted ROBPCA, to address the issue of imbalanced data. In this study, the robust PCA wealth index is constructed using skewness-adjusted ROBPCA, owing to the imbalance in the BOHEMIA data.

### Statistical analysis of wealth indices

Per DHS wealth index methodology, missing data in our analysis are replaced by the average value of the respective variables, and all the asset variables are standardised before applying the PCA algorithm. As for the parameter setting, the number of principal components is set to one in all four approaches as suggested by the DHS method.19 The robustness parameter is set as 0.5 to yield maximal robustness in the robust PCA process.
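The mean-replacement rule for missing data can be sketched as follows. The function name is hypothetical, and representing a missing value as `None` is an assumption of this sketch.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Hypothetical column: number of sleeping rooms, one household not answering.
rooms = [1, 2, None, 3]
rooms_filled = impute_mean(rooms)
```

The imputed column is then standardised like any other before PCA. Note that for a 0/1 dummy variable the imputed value is the ownership proportion, so it need not be 0 or 1.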

Because the wealth index presents only a relative ranking of households, it is difficult to interpret and compare the values of scores. To address these issues, one of the popular approaches is to transform the wealth index into wealth quintiles. Wealth quintiles are calculated by dividing all households into equal quintiles (20%) based on the wealth index. Households are categorised into five ranks from ‘rank 1’ to ‘rank 5’ with the wealth index scores from the first quintile to the last quintile.
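A minimal sketch of the quintile assignment is given below. The function name is hypothetical, and ties in the wealth score are broken by input order, which is an assumption rather than a rule stated in the text. Setting `groups=20` yields ventiles with the same logic.

```python
def quantile_ranks(scores, groups=5):
    """Assign each household a rank from 1 (lowest scores) to `groups` (highest).

    Households are sorted by score and split into groups that are as close to
    equal in size as possible; tied scores are ordered by input position.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0] * n
    for position, idx in enumerate(order):
        ranks[idx] = position * groups // n + 1
    return ranks

# Hypothetical wealth scores for ten households.
example_scores = [0.3, -1.2, 2.5, 0.0, -0.4, 1.1, -2.0, 0.8, 1.9, -0.9]
example_quintiles = quantile_ranks(example_scores)
```

With ten households, each quintile contains exactly two of them.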

We examine the reliability of the wealth index from two perspectives: its internal coherence and its consistency with other wealth indices. The internal coherence is examined using summary statistics (the percentage of households owning each asset, or the average value of each numerical asset indicator) that show how asset ownership varies across the five quintiles. A heatmap is used to visualise the agreement between the four indices, where misclassification between quintiles can be observed directly. The association between different PCA wealth indices is determined by Spearman’s rank correlation coefficient, a typical non-parametric measure of rank correlation.
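Spearman's rank correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below is a plain-Python illustration with hypothetical function names; it does not compute the p value reported in the paper.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values all receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because quintiles and ventiles contain many ties by construction, the average-rank treatment matters when correlating grouped wealth ranks.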

We validate the wealth indices through Spearman’s rank correlation coefficient to measure the association between wealth ventiles (calculated by dividing all households into equal 5% quantiles based on the wealth index scores) and logarithmic household income. The validation of wealth indices is also evaluated by the receiver operating characteristic (ROC) analysis which is commonly used to calculate predictive capacity of a classification model. An ROC curve is obtained by plotting sensitivity against 1-specificity for all possible cut-off points of wealth indices. Thus, the area under the ROC curve (AUC) can be used as an informative measure of the discriminating capacity of wealth indices, and the closer the AUC is to one, the better is the performance.23
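The AUC can be computed without tracing the curve, because it equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The sketch below uses that equivalence; the function name is hypothetical, and treating households above the poverty line as the positive class is an assumption of this illustration.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: wealth index (or wealth rank) per household
    labels: truthy for the positive class (assumed here: income above the
            poverty line), falsy otherwise
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to a wealth index that ranks households no better than chance, and 1 to perfect separation of the two income classes.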

In terms of the stability of the wealth index on the other dataset, the three alternative PCA approaches are applied to the 2018 Malaria Indicator Survey (MIS) data in Mozambique and are compared with the original DHS wealth index reported on the DHS website.24

Data analysis in this paper is performed in R V.4.1.2. The classical PCA algorithm is implemented with the ‘principal’ function in the ‘psych’ package.25 For the alternative PCA techniques, we use the ‘elasticnet’ package26 and the ‘robpca’ package27 to carry out the SPCA and skewness-adjusted ROBPCA algorithms, respectively.

### Patient and public involvement statement

It was not appropriate or possible to involve patients or the public in the design, or conduct, or reporting, or dissemination plans of our research.

### Reflexivity statement

A structured reflexivity statement is provided in online supplemental appendix.

## Results

### Importance of different asset indicators across methods

Figure 1 reports the PCA weights for each of the asset indicators for each of the four PCA approaches using all 69 variables from the BOHEMIA demographic survey data. The weights signify the relative contribution different assets make to the wealth index.

In the classical PCA result (see the first plot in figure 1), almost all wealth index coefficients have expected signs. Variables indicating wealth (eg, lighting by electricity, wall material made of zinc) have positive weights while those representing poverty (eg, living in a hut, lighting by firewood) have negative weights. However, there are also some unexpected results. For instance, the ‘number of members per sleeping room’ variable carries a positive weight, indicating that a wealthier household has more people per sleeping room.

The more parsimonious PCA approaches (feature selection PCA, SPCA and robust PCA) produced a succinct list of relevant household assets, trimming off variables with negligible weights. Interestingly, none of these methods assigned any weight to the ‘cows’ ownership’ variable. Moreover, some variables that applied to only a minority of households (eg, paper for wall material, natural gas for lighting) were also ignored by the three alternative PCA methods due to sparseness. It is worth noting that while the threshold choice in feature selection PCA may seem subjective, its outcomes align with those of SPCA. This agreement supports setting 0.01 as the threshold value in the filtering criterion of feature selection PCA.

### Evaluation of wealth indices

All households are ranked into five levels: rank 1 (poorest), rank 2, rank 3, rank 4, rank 5 (richest). Each group represents 20% of total households. In this section, we investigate the internal coherence and cross-method agreement of the wealth indices.

#### Internal coherence of wealth indices

Table 1 compares the average asset ownership across the households in five wealth quintiles based on the classical PCA approach. Since the other three techniques give results similar to the classical one, their data are not shown here (see online supplemental appendix section 3). The percentage of households owning assets with positive weights generally rises from rank 1 to rank 5. Conversely, the opposite trend is observed for assets with negative weights. For example, only 8.4% of households in rank 1 own cell phones, but this percentage increases for higher ranks, reaching 66.9% in rank 5.

Table 1 also presents the regression analysis between each asset indicator and classical PCA (DHS) rank. The linear regression and the logistic regression are used to examine the association between wealth ranks with assets for numerical indicators and binary indicators, respectively. The signs and the magnitude of regression coefficients are consistent with those of the PCA weights, particularly when the absolute PCA weights are more than 0.001. Conversely, indicators with tiny absolute weights (<0.001) exhibit limited discriminatory capacity between wealth groups, as indicated by the statistically insignificant results of the regression analysis.

#### Cross-method agreement of wealth indices

Figure 2 compares the wealth ranks according to the four PCA techniques. The value in each cell represents the number of households under the corresponding combination of wealth quintiles from three approaches. For instance, among all households that have rank 1 in classical PCA, 99.75% are also classified in rank 1 by robust PCA, with the remaining 0.25% classified into rank 2. Most households are classified in the same group by all four indices. Only about 10% of the households appear in a different group, and in those cases it is an adjacent group, differing by only one rank. Additionally, the strong Spearman’s correlation coefficients (all >0.98 with p<2.2e−16) across all methods confirm this agreement (the table of Spearman’s rank correlations is shown in online supplemental appendix table S4). Consequently, the wealth quintiles derived from the four different PCA approaches are robust to insignificant asset indicators and outliers.
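The kind of cross-classification that underlies such a heatmap can be sketched as follows. This is an illustrative reconstruction with hypothetical function names, not the code used to produce figure 2.

```python
from collections import Counter

def agreement_table(ranks_a, ranks_b, groups=5):
    """Cross-tabulate two rankings: cell [a-1][b-1] counts households with
    rank a under the first method and rank b under the second."""
    counts = Counter(zip(ranks_a, ranks_b))
    return [[counts[(a, b)] for b in range(1, groups + 1)]
            for a in range(1, groups + 1)]

def row_percentages(table):
    """Express each row as percentages of that row's total (0 for empty rows)."""
    result = []
    for row in table:
        total = sum(row)
        result.append([100.0 * c / total if total else 0.0 for c in row])
    return result

def share_same_rank(table):
    """Proportion of households on the diagonal, ie, given the same rank by both methods."""
    total = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / total
```

Row percentages correspond to statements such as "among households in rank 1 under one method, x% are in rank 1 under the other", while the diagonal share summarises overall agreement between two methods.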

### External validation of wealth indices

To further validate our indices, we cross-checked against household income data and the 2018 DHS wealth index.

#### Consistency with household income classification

Income information for 537 households in the Mopeia district is available from the Health Economics Survey data collected by BOHEMIA project in 2022. However, nearly all households reported zero personal income (93.67%, N=503), and a majority of households reported zero household income (77.47%, N=416). Here, we only illustrate the association between the wealth indices and 138 households with non-zero total income. The analysis for households with zero total income is demonstrated in online supplemental appendix section 6.

The feature selection PCA wealth ventiles showed a moderate yet statistically significant positive association (Spearman’s rank correlation=0.26, p<0.02) with the average monthly income, on a log scale, of the 138 households with non-zero total income. The wealth ventiles from all methods exhibit similar results, which are included in online supplemental appendix table S5.

Classifying these households as ‘rich’ or ‘poor’ according to the 2022 international poverty line (US$2.15 (2017 PPP) per day per capita) from World Bank Report,13 we used ROC curves to compare income classification with wealth indices. Figure 3 shows the ROC curves, with the feature selection PCA wealth index demonstrating a high AUC value of 0.76 for the classification results, suggesting its robust discriminating capacity. Notably, all other wealth indices also exhibit strong performance, with AUC values exceeding 0.75 (see online supplemental figure S3). Both the Spearman’s correlation and ROC analysis show significant coherence between wealth indices and households’ average monthly income.

#### Consistency with DHS wealth index based on 2018 MIS data in Mozambique

The DHS household wealth index for Mozambique has been estimated and reported on the DHS website based on the 2018 MIS data.24 Applying all three alternatives to the 2018 MIS data, we investigate the consistency between the alternative PCA wealth indices and the original DHS wealth index, and thereby assess the stability of the alternative methods across datasets. Figure 4 demonstrates a strong correlation (Spearman’s rank correlation=0.99, p<2.2e−16) between the DHS wealth index and the feature selection PCA wealth index as well as between the wealth quintiles (Spearman’s rank correlation=0.95, p<2.2e−16). Using a threshold of 0.02 (approximately equal to the median of all PCA weights), feature selection PCA filtered out 63 of 107 asset indicators in the 2018 MIS data from Mozambique, without significantly affecting the quality of the wealth index.

## Discussion

Wealth indices have been widely used as a proxy measure of SEP for households in LMICs. One popular way of creating these indices within DHS datasets is the PCA approach, despite criticisms regarding its reliability and the data burden it creates.9 28 This work offers a new way of calculating the wealth index for the Mopeia district in Mozambique. This alternative methodology removes less significant features and handles outliers more efficiently.

The analysis shows that wealth indices produced by four PCA techniques exhibit strong internal coherence and offer a viable indicator of the SEP of households. The three alternative indices are highly consistent with the classical PCA wealth index, a benchmark for the standard wealth index. Each wealth index exhibits a significant correlation with household income and hence an ability to discriminate between households’ levels of poverty.

The alternative methods, feature selection PCA, SPCA and robust PCA, use fewer asset indicators (about 40% fewer), while still being able to construct a reliable wealth index. A concise set of indicators and a simpler model could make it easier for researchers to identify and comprehend the major contributors. These results are in line with other studies29 30 which support the idea of designing a simpler questionnaire with fewer questions to assess SEP. A shorter questionnaire has a positive effect on response rate and response quality, which further improves the accuracy of the follow-up studies or strategies.31

Both SPCA and robust PCA, while guaranteeing the high quality of the wealth index, may compromise computational simplicity. Therefore, the feature selection PCA wealth index is recommended over the classical DHS wealth index. Its efficiency is demonstrated by its ability to generate comparable results while requiring 40% less information, a significant reduction in the burden of data collection and in computational load. However, the optimal threshold may not be universally applicable across regions. If a wealth index is needed for another region, the DHS data from previous years can be analysed, if available, to identify valid thresholds and to eliminate insignificant asset indicators before the questionnaire-development stage, reducing the burden on investigators and survey respondents.

### Limitations

There could be some limitations in the proposed method and analysis. First, the first principal component from all PCA methods only explains a fraction of variation in the data. Some studies consider multiple principal components to capture more wealth effect in the data.10 32 However, there is an inevitable trade-off between the higher explained variance and a clear interpretation of the contribution of each asset indicator. Moreover, the total proportion of variance explained may not be considerably higher since the successive higher-order components always explain smaller proportions than the first component.28

Another drawback is possibly the lack of generalisability due to the high level of rurality in Mozambique. Pursuant to the DHS method, we initially created a composite wealth index by combining different urban and rural wealth indices.19 However, the analysis revealed that it was indistinguishable from a unified wealth index. Hence, we developed only one wealth index for Mozambique, without differentiating between urban and rural areas. Although our method works well for Mozambique (as shown in figure 4), its performance in more urbanised LMICs is unclear. Further analysis using diverse datasets from various socioeconomic settings is warranted to evaluate the generalisability of our method.

## Conclusions

This research presents a new approach for calculating a wealth index for Mopeia, Mozambique. The commonly used DHS method based wealth index is effective, but we find the feature selection PCA approach achieves comparable performance while using 40% fewer variables. We identify variables that make minimal contribution in calculating the wealth index, omit them and show that their elimination does not affect the quality of the wealth index. This simplifies the data collection process and reduces the cost of data collection while improving the quality of the survey results. Despite using fewer asset indicators, feature selection PCA delivers a stable and robust wealth index, and shows consistency in performance with other methods, including the DHS method. Thus, we recommend the feature selection PCA approach as a practical alternative for wealth index calculations in similar LMIC regions.

## Ethics statements

### Patient consent for publication

### Ethics approval

The study protocol for the BOHEMIA study was approved by the Internal Scientific Committee and Institutional Review board from the Centro de Investigacao em Saude de Manhica (Ref: CIBS-CISM/004/2021), Hospital Clinic of Barcelona Clinical Research Ethics Committee (Ref: HCB/2019/0938) and The Ethics Research Committee of the WHO (Protocol ID: ERC.0003265).

## Acknowledgments

We acknowledge support from the Spanish Ministry of Science, Innovation and Universities through the 'Centro de Excelencia Severo Ochoa 2019-2023' Program (CEX2018-000806-S), and support from the Generalitat de Catalunya through the CERCA Program. CISM is supported by the Government of Mozambique and the Spanish Agency for International Development (AECID). We thank the residents and authorities of Mopeia for their support. We acknowledge the work and dedication of over 360 local staff during implementation and data collection.

## Footnotes

Handling editor Seye Abimbola

Contributors KX, AM, XD, CJC and CR conceived the idea presented. The methodology was developed by KX, AM, XD and CS. The project implementation and social science study are supported by SI, VM, MS, PN, JM, EJ, HM, FM, AC and RR. The investigation process and the project administration were conducted by AM, XD, PR-C, SI, EE, VM, MS, PN, FS and CS. EE coordinated the data curation and software. Writing the original draft and preparing the figures and tables was performed by KX and all authors contributed to the review and editing of the final draft. All authors contributed cognitively to various areas of this work, participated in later changes, and read and approved the final publication. All authors had access to all study data and held the ultimate responsibility for the publication decision. AM is responsible for the overall content as guarantor of this paper.

Competing interests None declared.

Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

Provenance and peer review Not commissioned; externally peer reviewed.

Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.