Rachid Laajaj, Omar Arias, Karen Macours, Marta Rubio-Codina & Renos Vakis

This study evaluates how well the Big 5 (B5) personality traits, commonly used to proxy for non-cognitive (or more properly socio-emotional) skills, are measured in the STEP surveys, which use self-reported questionnaires and have so far been applied in 15 developing countries. Several commonly used indicators are used to assess the validity and reliability of the measures, including Cronbach’s alpha, acquiescence bias (AB, often referred to as "yea-saying", i.e., the tendency of respondents to agree with a statement when in doubt), the explanatory power of enumerator fixed effects, the number of factors identified by exploratory factor analysis, and the factor structure. Results suggest that in most aspects, the indicators of internal consistency for the B5 measures are well below the norm observed in similar studies in developed economies. Low internal consistency is partly related to the functional literacy of the respondents and to the lower number of items used in the STEP surveys' B5 module. Yet these factors do not explain all the differences in internal consistency compared to data from the USA, which is used as a benchmark. The results suggest that the use of self-reported questionnaires to measure personality traits through household surveys in developing countries has to contend with biases arising from systematic response patterns such as AB, the mediating role of enumerators, and possibly other issues such as reference biases if different groups have different standards or reference points when assessing their behavioral traits. More work is needed to further explore these issues and test potential solutions and alternative approaches to non-cognitive skills measurement in developing settings.

1 Based on work by Rachid Laajaj (Universidad de los Andes), Omar Arias (World Bank), Karen Macours (Paris School of Economics), Marta Rubio-Codina (IADB), and Renos Vakis (World Bank). This research is part of a joint work program on methodological improvements of skills measures in household surveys co-led by the Mind, Behavior, and Development Unit (eMBeD), the Skills Global Solutions Group of the Education Global Practice and Social Protection & Labor Global Practice, and the Living Standards Measurement Study team of the DEC Survey unit.

Standard skills indicators leave an information gap on workforce skill characteristics, job skill requirements, and quality of worker-job matches that prevents policymakers from making informed and timely decisions on training and education for current and future employees. Thus, there is growing interest in measures of cognitive, socio-emotional (non-cognitive), and technical skills to analyze determinants of skills formation, economic decisions and labor market outcomes. But how can we do this at a national scale in the typical household surveys that development researchers collect? Existing measures have mostly been validated in developed countries, and with data collected in relatively controlled settings, far from the context typical of developing countries.

Recent work from Kenya (Laajaj and Macours, 2017) highlighted some of the challenges of collecting reliable and valid data for a set of commonly used skills measures. Using a survey with skills measurements administered to more than 900 farmers in western Kenya and then re-administered three weeks later, Laajaj and Macours found that cognitive skill measures were reliable and consistent, but non-cognitive skills measures (captured through a Big Five personality traits inventory) were rife with measurement error. Do these findings hold across different settings and countries?

To expand on this work, we conducted a systematic analysis of the Skills Towards Employability and Productivity (STEP) surveys, a research initiative run by the World Bank that aims to better understand the interplay between skills and employability in developing countries. The STEP program developed household survey instruments tailored to collect data on cognitive, socio-emotional, and job-relevant skills in low- and middle-income country contexts. The module on cognitive skills measures functional literacy linked to the scale of the OECD’s Programme for the International Assessment of Adult Competencies (PIAAC). For socio-emotional skills, the STEP instrument uses a variation of the B5 personality traits inventory commonly used in the personality psychology literature.

Motivated by the results from Kenya, we examined how well the STEP surveys measure socio-emotional skills. Using data for over 50,000 working-age individuals from 15 STEP surveys collected in Sri Lanka, Yunnan (China), Lao PDR, Vietnam, Philippines, Bulgaria, Armenia, Georgia, Macedonia, Ukraine, Serbia, Kenya, Ghana, Bolivia and Colombia, as well as additional data from the USA for benchmarking, the analysis compiles a set of measures of validity and reliability around the Big-5 personality traits.

The findings confirm that in all the STEP countries analyzed, the reliability and validity of the Big-5 measures are below norms commonly accepted in the psychological literature. Just like the results of Laajaj and Macours (2017) for Kenya, the indicators typically used to assess the quality of the measures (such as acquiescence bias and Cronbach’s alpha) fall short of the minimum standards that psychometrics requires for a measure to be considered usable. For instance, a Cronbach’s alpha coefficient of at least 0.7 is a common threshold used to judge a measure as internally consistent, but the alpha coefficients for the STEP measures in all countries are generally in the range of 0.3 to 0.4.

In addition, the low reliability of the measures does not appear to be random. For example, reliability and validity tend to be higher in countries with higher scores on STEP cognitive (literacy) tests, although even for countries with the highest cognitive scores in our sample (specifically, Eastern European countries), reliability levels remain below US levels. Literacy scores tend to explain more variation in reliability between countries than within country, suggesting that other factors at the country level are at play. Similarly, when restricting the sample (in each country) to people who completed some tertiary education, we find almost no improvement in the reliability indicators.

Figure 1: Big 5 and factors associated for each item

Country          Openness    Conscientiousness  Extraversion  Agreeableness  Neuroticism
                 Q1 Q2 Q3    Q1 Q2R Q3          Q1 Q2R Q3     Q1 Q2 Q3       Q1 Q2R Q3R
US               1  1  1     2  2   2           3  3   3      4  4  4        5  5   5
Sri Lanka        2  2  3     2  2   2           3  3   3      4  4  4        5  1   1
Yunnan (China)   1  1  4     4  2   4           3  4   3      4  4  1        5  5   5
Lao PDR          1  1  1     4  2   1           1  3   4      4  4  4        5  5   1
Vietnam          1  1  1     3  2   3           1  3   4      4  3  4        5  5   5
Philippines      1  1  1     2  2   1           5  3   4      1  1  1        5  5   3
Bulgaria         3  1  1     2  1   4           2  3   4      1  4  4        5  5   4
Armenia          1  1  1     1  2   1           3  3   3      4  1  4        5  5   5
Georgia          1  2  2     2  4   2           1  3   4      4  2  4        5  5   5
Macedonia        1  4  4     4  2   4           3  3   3      4  4  4        5  1   5
Ukraine          3  1  1     2  2   2           3  3   3      4  1  4        5  5   5
Serbia           1  1  1     5  2   2           3  3   3      4  4  4        5  5   5
Kenya            2  2  4     2  2   2           3  3   3      4  4  4        1  5   5
Ghana            1  1  1     1  2   1           3  4   3      1  1  1        4  5   5
Bolivia          1  1  4     1  2   1           3  3   3      4  4  4        5  5   5
Colombia         3  1  1     1  2   1           3  3   3      4  1  4        5  5   5

Q1, Q2, Q3 refer to items (questions) in the STEP B5 personality scale. R: Negatively worded items. Each entry is the number of the factor (1 to 5) to which the item was assigned: using a varimax-rotated five-factor structure for the Big Five items, each item was assigned to the factor on which it had the highest loading. For example, the US follows exactly the expected factor structure of the Big Five, but none of the other countries has this perfect match. Items corrected for Acquiescence Bias.
One of the biggest concerns is that the factor structure arising from the data does not conform well with the model of the Big Five personality factors. That is, in most countries several of the items in the Big Five scale do not statistically align with the personality factors they purport to measure. As Figure 1 illustrates, many items load on a factor other than the intended one (highlighted in red in the original figure). Therefore, two items of different constructs (personality factors) can be more correlated than items of the same construct. By contrast, the five factors emerge very clearly in US data, as seen in the top line of Figure 1. It may be that in developing country contexts, responses to the B5 items do not capture well the underlying personality factors, or that the underlying factor structure is altogether different in these countries than it is in the United States.

The results from the analysis of the STEP surveys call for caution in interpreting existing measures and underscore the need for further research to improve nation-wide measurement of socio-emotional skills in developing countries. Taken together with the earlier work and additional analysis from the STEP surveys not shown, the results offer five takeaways and a path for future research:

• Check, check, check! Before utilizing non-cognitive skills measures, it should be customary practice to assess that they behave the way they are supposed to. Most of the reliability and consistency tests done in both sets of studies are common practice in studies that use these data, are easy to do with standard statistical packages like Stata, and thus should be done regularly by applied researchers.

• Correct for Acquiescence Bias (or yea-saying). Corrections for AB lead to considerable improvements in the reliability and validity of the measures (even if not fully to commonly acceptable levels).

• You cannot cut corners: include a sufficient number of items per construct. The third wave of STEP surveys uses seven questions per personality factor instead of the three used in the first two phases. We find that including more questions improves some of the reliability indicators.

• Don’t assume standard scales measure the same underlying constructs everywhere or are best suited for purpose. Start by linking the policy question at hand with the right constructs you should aim to collect. But even then, use factor analysis to let the data tell you which latent traits are being measured. As we show, even after correcting for AB or having a larger number of items, different factor structures may emerge in developing countries.

• Consider self-administration when feasible. We find that enumerator fixed effects are jointly significant predictors of the values of the Big Five measures. As a tentative implication, when possible, we suggest considering self-administered surveys instead of oral surveys for socio-emotional skill measures. The absence of interaction with an enumerator may also reduce issues related to properly listening, being influenced, and/or social desirability bias (trying to give an answer that gives a better impression). This is of course a challenge to implement with low literacy populations.

• More methodological testing please! Looking ahead, more work is needed to test potential solutions and alternative approaches to non-cognitive skills measurement in developing settings. As an example, task-based measures may be one research avenue that can minimize some of the issues discussed above. Methodological research in this area is urgently needed.

Rachid Laajaj, Professor, Universidad de los Andes
Omar Arias, Practice Manager, World Bank’s Education Global Practice, oarias@worldbank.org
Karen Macours, Professor, Ecole d'économie de Paris, karen.macourspsemail.eu
Marta Rubio-Codina, Economist, Inter-American Development Bank (IADB)
Renos Vakis, Lead Economist, World Bank’s Poverty and Equity Global Practice (GPVGE), rvakis@worldbank.org

This note series is intended to summarize good practices and key policy findings on Poverty-related topics. The views expressed in the notes are those of the authors and do not necessarily reflect those of the World Bank, its board or its member countries. Copies of this note series are available on www.worldbank.org/poverty