Estimation of Normal Mixtures in a Nested Error Model with an Application to Small Area Estimation of Poverty and Inequality

Chris Elbers and Roy van der Weide

Policy Research Working Paper 6962, Development Research Group, Poverty and Inequality Team, July 2014

Abstract

This paper proposes a method for estimating distribution functions that are associated with the nested errors in linear mixed models. The estimator incorporates Empirical Bayes prediction while making minimal assumptions about the shape of the error distributions. The application presented in this paper is the small area estimation of poverty and inequality, although this is by no means the only application. Monte Carlo simulations show that estimates of poverty and inequality can be severely biased when the non-normality of the errors is ignored. The bias can be as high as 2 to 3 percent on a poverty rate of 20 to 30 percent. Most of this bias is resolved when using the proposed estimator. The approach is applicable to both survey-to-census and survey-to-survey prediction.

This paper is a product of the Poverty and Inequality Team, Development Research Group. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The authors may be contacted at rvanderweide@worldbank.org. The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Keywords: Normal mixtures, linear mixed models, small area estimation, Empirical Bayes, poverty, inequality
JEL Classification: I32, C31, C42

Chris Elbers (c.t.m.elbers@vu.nl) is at VU University Amsterdam and the Tinbergen Institute. Roy van der Weide (rvanderweide@worldbank.org) is at the World Bank. We gratefully acknowledge financial support from VU University Amsterdam and the World Bank's Knowledge for Change Program (KCP II). A thank you also goes to Peter Lanjouw for providing comments on earlier versions of this paper.

1 Introduction

We propose a method for estimating distribution functions that are associated with the nested errors in linear mixed models. The proposed estimator accommodates Empirical Bayes prediction and makes minimal assumptions about the shape of the error distributions. Our objective is to accurately predict nonlinear functions of the dependent variable, where the shape of the error distribution functions potentially plays an important role. The application we have in mind is the small area estimation of poverty and inequality, although this is by no means the only application.
This particular application has been made popular by the work of Elbers, Lanjouw and Lanjouw (2003; henceforward ELL). The approach they put forward has since been applied to obtain maps of poverty and inequality in over 60 countries worldwide. The highly disaggregated estimates of poverty and inequality have in turn inspired a range of other applications, for example: Demombynes and Ozler (2005) investigate whether inequality at the small area level has an impact on local crime rates, and find that it does using data from South Africa; Elbers et al. (2007) conduct an empirical experiment in order to estimate by how much one could potentially lower the costs of getting resources to the poor if one had access to a poverty map, and conclude that the gains can be substantial; Araujo et al. (2008) examine whether villages with higher levels of inequality are less likely to invest in public goods that would benefit the poor, and using data from Ecuador they find empirical support for this hypothesis, which they attribute to elite capture. Recently, Fujii (2010) modified the approach to make it better suited for the small area estimation of child malnutrition outcomes, with an application to Cambodia.

1.1 Problem statement

Consider the following standard linear mixed model for log income of household $h$ residing in area $a$:

$y_{ah} = x_{ah}\beta + u_a + \varepsilon_{ah}$.

Let $y_d$ denote a vector of log incomes for all households from domain $d$ (where "domain" typically refers to a small geographic area). We are interested in estimating $W(y_d)$, where $W$ is some (possibly nonlinear) function of $y_d$. It is assumed that we have data on $x$ for the entire population, while data on both $x$ and $y$ is available for a sample $S$ of households only. ELL propose to estimate $W$ by $E[W(y_d)\,|\,x \in d;\ (x,y) \in S]$, where the expectation is taken over the unobserved errors $u_a$ and $\varepsilon_{ah}$. Because $W$ is nonlinear, the shape of the error distributions will matter for the expected value of $W$. Estimation of $\beta$ is obviously of primary importance when estimating $E[W]$, but in this paper we will concentrate on the error distributions and assume, except for the application in Section 5, that $\beta$ is known. We argue that even if $\beta$ is estimated perfectly, getting the error distribution wrong still has the potential to introduce a significant bias.

ELL felt their approach would be most convincing if they made minimal assumptions about the errors $u_a$ and $\varepsilon_{ah}$. They proceed by first obtaining estimates of the errors $u_a$ and $\varepsilon_{ah}$, which then allows them to sample from the empirical errors $\hat u_a$ and $\hat\varepsilon_{ah}$. $u_a$ is estimated as the simple area average of the total residuals, appropriately re-scaled so that the sample variance of $\hat u_a$ equals the estimate of $\sigma_u^2$ that corrects for the contribution of the area average of $\varepsilon_{ah}$. The estimate for $\varepsilon_{ah}$ is obtained by subtracting $\hat u_a$ from the total residual, re-scaled so that its sample variance matches $\sigma_\varepsilon^2$. (Note that ELL allow for heteroskedasticity, so that $\sigma_\varepsilon^2$ can be household-specific.)

This non-parametric approach to estimating the error distributions has two limitations. Firstly, the procedure does not adequately account for the fact that $\hat u_a$ equals a sum of $u_a$ and $\bar\varepsilon_a$. This latter term is often large enough to affect the ability of the empirical distribution of $\hat u_a$ to reproduce the shape of the actual distribution of $u_a$. Simply ignoring this "contamination" may result in a significant bias. If the empirical distribution of $u_a$ is biased, then this bias will also have implications for the empirical distribution of $\varepsilon_{ah}$.
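To see the nature of the contamination, the following minimal sketch (ours, not part of ELL's procedure, with purely hypothetical parameter values) simulates a skewed area effect and shows that the area averages of the total residuals, $\hat u_a = \bar e_a = u_a + \bar\varepsilon_a$, are more dispersed and less skewed than the true $u_a$:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
A, n_a = 500, 15                              # hypothetical: areas, sampled households per area

# Skewed area effect (centred exponential) and normal household errors
u = rng.exponential(scale=0.2, size=A) - 0.2  # E[u] = 0, right-skewed
eps = rng.normal(scale=0.4, size=(A, n_a))

e_bar = u + eps.mean(axis=1)                  # area averages of total residuals: u_a + eps_bar_a

print("var(u), var(e_bar):", u.var(), e_bar.var())      # variance inflated by sigma_eps^2 / n_a
print("skew(u), skew(e_bar):", skew(u), skew(e_bar))    # skewness attenuated relative to u
```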
This is also referred to as a "convolution problem", where the objective is to estimate the distribution function of a random variable that is observed with error. Secondly, the approach adopted by ELL does not easily lend itself to Empirical Bayes (EB) estimation, where the distribution of $u_a$ is tightened by conditioning on the household data $y$ and $x$ available for domain $d$ (i.e. in the event that some households in domain $d$ are included in the sample). Working out the conditional distribution is not a trivial exercise without making further distributional assumptions. Consequently ELL decided to forego EB estimation altogether. In doing so, they have accepted a certain loss in efficiency by not fully utilizing all available information.

Molina and Rao (2010) recently picked up on this and put forward an alternative approach that does implement EB estimation. They take ELL as a point of departure but then assume that both $u_a$ and $\varepsilon_{ah}$ are normally distributed, in which case the conditional distribution too will be normally distributed. Where ELL accept a loss in precision by not implementing EB estimation, Molina and Rao (2010) accept a loss in precision that might stem from a misspecification of the error distribution functions. The data at hand will ultimately determine which of the two will be accepting the larger loss. ELL are arguably most interested in estimating poverty and inequality in developing countries, where the number of small areas (or domains) covered by the income surveys is often small, say 5 to 25 percent of all domains in the population. In this case the benefits of EB estimation will be modest (as survey data are available for only few areas). However, in more developed countries, or countries where the travel costs incurred when covering all small areas are manageable, income surveys often cover a much larger number of the domains. In fact, there are numerous examples where surveys cover between 50 and 100 percent of all domains in the country. In those instances, there may be clear benefits to adopting EB estimation.

The approach presented in this paper improves on both ELL and Molina and Rao (2010). Like ELL we make no restrictive assumptions about the error distributions. Our estimator for the distribution functions will generally be more accurate than the estimator adopted by ELL, however, as we explicitly account for the nested error structure (which is responsible for the "convolution problem"). Unlike ELL we also accommodate EB estimation. We achieve this by fitting finite normal mixtures (NM) to the error distribution functions. Normal mixtures are extremely flexible; they are able to fit any well-behaved distribution function, and are ideally suited for accommodating EB estimation. If the marginal distributions of $u_a$ and $\varepsilon_{ah}$ can be described by normal mixtures, then the conditional distribution too can be described by a normal mixture with known parameters (that are functions of these parameters and of the data on which one conditions). Estimation of the normal mixtures for $u_a$ and $\varepsilon_{ah}$ is complicated by the fact that neither $u_a$ nor $\varepsilon_{ah}$ is observed. Our estimator for the NM parameters may be viewed as a modified version of the EM algorithm. Monte Carlo simulations indicate that estimates of poverty and inequality can be severely biased when ignoring the non-normality of the errors.
The bias can be as high as 2 to 3 percent on a poverty rate of 20 to 30 percent. Most of this bias is resolved when implementing our estimator. This is confirmed by an empirical application to US data.

1.2 Normal mixtures in nested-error models

There are a number of other studies that have explored different ways of relaxing the normality assumption in mixed linear models. Verbeke and Lesaffre (1996) is an early example that also considers normal mixtures as a "non-parametric" representation of the error distribution function. However, they impose a number of important restrictions. First, only the area random effects $u_a$ are allowed to be non-normal; a normal mixture is fitted to the distribution of $u_a$ under the assumption that $\varepsilon_{ah}$ is normally distributed. Second, it is assumed that the "component distributions" that make up the normal mixture share a common variance, which noticeably simplifies estimation but at the same time significantly limits the flexibility of the normal mixture to fit any given distribution function. This approach has also been followed by Cordy and Thomas (1997), who work with the same setup and adopt the same set of restrictions. There is another strand of the literature that permits both error terms to be non-normally distributed by imposing an alternative parametric family for the distribution functions. See for example Zhou and He (2008), who fit skewed t-distributions to the nested errors of the linear mixed model. Recently, there have also been efforts to explore the impact of misspecifications in the error distributions on Empirical Bayes predictions derived from linear mixed models; see for example Skrondal and Rabe-Hesketh (2009) and McCulloch and Neuhaus (2011). They conclude that the bias is reasonably small. It should be noted however that those studies focus on prediction of the dependent variable itself, in which case the Empirical Bayes estimates of the area random effects $u_a$ denote the only source of bias. Misspecifications in the error distributions become considerably more important when predicting nonlinear functions of the dependent variable, such as measures of poverty and inequality, as we will show in this paper.

The remainder of this paper is organized as follows. In Section 2 we briefly discuss how the distribution functions associated with the errors in a linear mixed model matter for prediction, with an application to poverty and inequality measurement. In this section we will also introduce normal mixtures as a flexible "non-parametric" representation of any given well-behaved distribution function, and demonstrate the implications for EB estimation. Estimation of the normal mixture distributions for both errors from the linear mixed model is presented in Section 3. A modest Monte Carlo simulation study followed by an equally modest empirical application is provided in Sections 4 and 5, respectively. Finally, Section 6 concludes.

2 Estimation of poverty and inequality: Distributions matter

2.1 Linear mixed model for income

Suppose that at the household level the data generating process (DGP) satisfies the equation already mentioned above:

$y_{ah} = x_{ah}^T\beta + u_a + \varepsilon_{ah}$,   (1)

where $x_{ah}$ denotes a vector with independent variables, and where $u_a$ and $\varepsilon_{ah}$ denote zero-expectation error terms that are independent of each other. The subscripts indicate the target area (or domain) $a$ and the household $h$. (These areas may refer to geographic areas such as districts or municipalities, but also to non-geographic domains such as ethnic groups or age groups, say.) In this paper we assume that errors are homoskedastic, so that for each household $h$ and area $a$ we have $\operatorname{var}[y_{ah}|x_{ah}] = \sigma_u^2 + \sigma_\varepsilon^2$.
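For reference in the sections that follow, a minimal sketch (with hypothetical parameter values and, purely for illustration, normal errors) of generating a census population from the nested error model in equation (1) and drawing a survey sample of $n_a$ households per area:

```python
import numpy as np

rng = np.random.default_rng(1)
A, N_a, n_a = 100, 2000, 15            # hypothetical: areas, households per area, sampled per area
beta = np.array([1.0, 0.5])            # hypothetical coefficients (intercept, slope)
sigma_u, sigma_eps = 0.15, 0.45        # hypothetical standard deviations of u_a and eps_ah

# Census: covariates, area effects and household errors
x = np.column_stack([np.ones(A * N_a), rng.normal(size=A * N_a)])
area = np.repeat(np.arange(A), N_a)
u = rng.normal(scale=sigma_u, size=A)
eps = rng.normal(scale=sigma_eps, size=A * N_a)
y = x @ beta + u[area] + eps           # equation (1)

# Survey: keep the first n_a households of each area (y is observed only for these)
in_sample = np.concatenate([np.arange(a * N_a, a * N_a + n_a) for a in range(A)])
y_s, x_s, area_s = y[in_sample], x[in_sample], area[in_sample]
```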
Throughout the paper it is assumed that consistent estimators for the variance parameters are available, which we shall denote by $\hat\sigma_u^2$ and $\hat\sigma_\varepsilon^2$. (A commonly used estimator for the variance parameters of a nested error model is Henderson's method III estimator (see Henderson, 1953; and Searle et al., 1992), which may be viewed as a method of moments estimator that does not require any assumptions about the shape of the error distributions. Alternative estimators are restricted maximum likelihood, minimum norm quadratic unbiased estimation (MINQUE; see e.g. Westfall, 1987; Searle et al., 1992), and so-called spectral decomposition estimation (see e.g. Wang and Yin, 2002; and Wu et al., 2009).) We will not make any assumptions about the shape of the error distributions.

Let $A$ denote the total number of areas covered by the income survey and let $n_a$ denote the number of households that have been sampled in area $a$, so that $n = \sum_{a=1}^A n_a$ denotes the total sample size of the survey. We shall denote the total household error by $e_{ah} = y_{ah} - x_{ah}^T\beta$, and its area-$a$ average by $\bar e_a = \bar y_a - \bar x_a^T\beta$, where $\bar e_a = \sum_h e_{ah}/n_a$. We shall also use the notation $e_a = (e_{a,1}, \ldots, e_{a,n_a})$, which denotes the vector of length $n_a$ with residuals for all households from area $a$. With a slight abuse of terminology we will at times refer to the errors $e_{ah}$ and $\bar e_a$ as data (as if we know the parameter vector $\beta$).

Let the probability distribution functions for $u_a$ and $\varepsilon_{ah}$ be denoted by $F_u$ and $G_\varepsilon$. We will propose a computationally attractive method for estimating these distribution functions, where we make no restrictive assumptions concerning their functional form, in particular allowing these functions to be other than normal distribution functions. Another appealing feature of our non-parametric estimator for the error distribution functions is that it can easily accommodate Empirical Bayes estimation.

2.2 Empirical Bayes estimation

Empirical Bayes (EB) estimation, also known as Empirical Best estimation, tightens the error distributions by conditioning on all available data in the survey for the purpose of prediction. The "observation" $\bar e_a$ (provided that area $a$ is covered by the survey) clearly carries information about the area random effect $u_a$. In the extreme, for example, where $n_a$ tends to infinity, $\bar e_a$ will perfectly reveal $u_a$. In effect EB estimates are obtained by integrating out the area errors $u_a$ using the probability density $p(u_a|e_a)$, i.e. the probability density of $u_a$ conditional on $e_a$ observed from the survey. (Non-EB estimates are obtained by using the unconditional density $p(u_a)$.) The challenge is to work out $p(u_a|e_a)$ along with the marginals $p(u_a)$ and $p(\varepsilon_{ah})$ without imposing restrictions on their form, so that the resulting conditional density $p(u_a|e_a)$ can also take on any form. Currently, the literature on EB estimation avoids this challenge by assuming normally distributed marginals, in which case the conditional distribution will be normal too.

For normally distributed errors it can be shown that $p(u_a|e_a)$ can be written as a function of $\bar e_a$, the mean of sample errors from domain $a$. This is not true for general error distributions; conditioning on the full vector $e_a$, however, becomes computationally intractable.
Therefore we will condition on $\bar e_a$ rather than the full vector $e_a$ even in the case of non-normal errors.

A second simplification concerns the known regression residuals for sample households. When conditioning on $e_a$, we ignore that for these households the total error $e_{ah} = u_a + \varepsilon_{ah}$ is known (since it is one of the components of $e_a$). Instead we will assume that $\varepsilon_{ah}$ is independent of $e_a$ even for sample households. The reason is that in practice sample household $h$ cannot be traced among the households in the target domain $a$ (households for which only data on $x$ is available). In most practical settings, the error thus introduced will be negligible. In fact, this error tends to zero as the size of the sample relative to the population size tends to zero.

2.3 Distributions matter

Let $y_{(a)}$ and $e_{(a)}$ denote vectors of length $N_a$ with elements $y_{ah}$ and $e_{ah}$ for all households from the population. Similarly, $x_{(a)}$ will denote a matrix with rows given by $x_{ah}^T$ for all $N_a$ households. $y_a$, $e_a$ and $x_a$ will denote the survey sample analogues. As mentioned above, our objective is to estimate:

$E[W(y_{(a)})\,|\,x_{(a)}, y_a] = \int W(x_{(a)}\beta + e)\, p(e|e_a)\, de$   (2)
$\qquad\qquad\qquad\quad\;\; = \int\!\!\int W(x_{(a)}\beta + u + \varepsilon)\, p(\varepsilon)\, p(u|e_a)\, d\varepsilon\, du$,   (3)

where the function $W$ will generally be non-linear. Non-normality of the errors $u$ and $\varepsilon$ will affect the estimates of $E[W]$ via two different channels. Firstly, when the function $W$ is indeed non-linear, the expected value $E[W]$ will be a function of the higher moments of the distributions of $u$ and $\varepsilon$. Getting these moments wrong, in other words getting the distributions wrong, is then likely to introduce bias. Secondly, if the distributions of $u$ and $\varepsilon$ are wrong, then the conditional density $p(u|e_a)$, which concerns EB estimation, will also be wrong. This will affect all moments of the conditional distribution of $u$, including the first, and hence has the potential to introduce bias in the estimate of $E[W]$ even if $W$ is linear. (Note that in the case of linear $W$, non-EB estimates of $E[W]$ are unbiased even if the distributions for $u$ and $\varepsilon$ are wrong.) Recently, Skrondal and Rabe-Hesketh (2009) and McCulloch and Neuhaus (2011) have explored the magnitude of the bias that is introduced by getting the first moment of $p(u|e_a)$ wrong, in the case of linear $W$, and found it to be modest. Our estimator would therefore be most relevant for the case of nonlinear $W$.

2.4 Empirical Bayes estimation with normal mixtures

Let us assume that $F_u$ and $G_\varepsilon$ can be represented by mixture distributions:

$F_u = \sum_{i=1}^{m_u} \pi_i F_i$   (4)
$G_\varepsilon = \sum_{j=1}^{m_\varepsilon} \lambda_j G_j$,   (5)

where the $F_i$'s and $G_j$'s denote a basis of distribution functions which we will also refer to as components or component distribution functions. The $\pi_i$'s and $\lambda_j$'s denote unknown non-negative probabilities that satisfy $\sum_i \pi_i = 1$ and $\sum_j \lambda_j = 1$, which we will also refer to as mixing probabilities. $m_u$ and $m_\varepsilon$ denote the number of components used to represent $F_u$ and $G_\varepsilon$, respectively. We will denote the probability density functions associated with $F_i$ and $G_j$ by $f_i$ and $g_j$, respectively. Mixture distributions are remarkably well equipped to fit any well-behaved distribution function. For example, kernel density estimators are closely related to mixture distributions. We will be working with normal component distributions, so that the mixture distributions are normal mixtures.

Assumption 1 The components $F_i$ are normal distribution functions with mean $\mu_i$ and variance $\sigma_i^2$. Similarly, the components $G_j$ are normal distribution functions with mean $\nu_j$ and variance $\omega_j^2$.
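To fix ideas, a minimal sketch of the objects in equations (4)-(5) and Assumption 1: a finite normal mixture defined by mixing probabilities and component means and variances, with its density and a sampler. The class name and all parameter values are ours and purely illustrative.

```python
import numpy as np
from scipy.stats import norm

class NormalMixture:
    """Finite normal mixture with density sum_i pi_i * N(mu_i, sigma2_i)."""
    def __init__(self, pi, mu, sigma2):
        self.pi, self.mu, self.sigma2 = (np.asarray(v, dtype=float) for v in (pi, mu, sigma2))

    def pdf(self, x):
        x = np.asarray(x, dtype=float)[..., None]
        return np.sum(self.pi * norm.pdf(x, loc=self.mu, scale=np.sqrt(self.sigma2)), axis=-1)

    def sample(self, size, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        comp = rng.choice(len(self.pi), size=size, p=self.pi)   # latent component indicators z
        return rng.normal(self.mu[comp], np.sqrt(self.sigma2[comp]))

# Example: a skewed two-component mixture for F_u (hypothetical parameters)
F_u = NormalMixture(pi=[0.7, 0.3], mu=[-0.05, 0.1167], sigma2=[0.01, 0.03])
draws = F_u.sample(10_000, rng=np.random.default_rng(2))
print(draws.mean(), draws.var())   # close to the implied mixture mean and variance
```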
Note that the modeler is at liberty to work with a different basis of component distributions. This choice does not have real implications for the ability of the mixture to fit a given distribution function.

If $p(u_a)$ and $p(\varepsilon_{ah})$ are normal mixtures, then $p(u_a|\bar e_a)$ is a normal mixture too. This is a powerful result, as the integral in equation (3) will generally have to be computed by simulation, and sampling from normal mixtures is straightforward. Lemma 2 shows how the parameters that define $p(u_a|\bar e_a)$ can be obtained as a function of the parameters of the normal mixtures $p(u_a)$ and $p(\varepsilon_{ah})$. Implementing EB estimation is thus as easy as sampling the area errors from the normal mixture $p(u_a|\bar e_a)$ whenever $\bar e_a$ is observed in the survey sample.

Lemma 2 The probability density function of $u_a$ conditional on $\bar e_a$, which we denote by $p(u_a|\bar e_a)$, is a normal mixture with known parameters:

$p(u_a|\bar e_a) = \sum_{i=1}^{m_u}\sum_{j=1}^{m_\varepsilon}\sum_{k=1}^{m_\varepsilon} w_{ijk}\,\varphi(u_a;\ m_{ijk},\ s^2_{ijk})$,   (6)

where:

$m_{ijk} = \dfrac{\sigma_i^2}{\sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a}\Big(\bar e_a - \dfrac{\nu_j+\nu_k}{n_a}\Big) + \dfrac{\omega_j^2+\omega_k^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}\,\mu_i$

$s^2_{ijk} = \dfrac{(\omega_j^2+\omega_k^2)\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}$,

and where $w_{ijk} = \tilde w_{ijk}/\sum_{ijk}\tilde w_{ijk}$ with:

$\tilde w_{ijk} = \pi_i\lambda_j\lambda_k\,\varphi\big(\bar e_a;\ \mu_i + (\nu_j+\nu_k)/n_a,\ \sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a\big)$,   (7)

where $\varphi$ denotes the normal probability density function, and where $(\pi_i, \mu_i, \sigma_i^2)$ and $(\lambda_j, \nu_j, \omega_j^2)$ denote the parameters associated with the normal-mixture distributions $F_u$ and $G_\varepsilon$, respectively.

If we apply the Central Limit Theorem to approximate the marginal distribution of $\bar\varepsilon_a$ by a normal distribution with mean zero and variance equal to $\sigma_\varepsilon^2/n_a$, the expression for $p(u_a|\bar e_a)$ simplifies considerably:

$p(u_a|\bar e_a) \approx \sum_{i=1}^{m_u} \alpha_i\,\varphi\big(u_a;\ \gamma_{ai}\bar e_a + (1-\gamma_{ai})\mu_i,\ (1/\sigma_i^2 + n_a/\sigma_\varepsilon^2)^{-1}\big)$,   (8)

with $\gamma_{ai} = \sigma_i^2/(\sigma_i^2 + \sigma_\varepsilon^2/n_a)$, and where $\alpha_i = \tilde\alpha_i/\sum_i\tilde\alpha_i$ with:

$\tilde\alpha_i = \pi_i\,\varphi(\bar e_a;\ \mu_i,\ \sigma_i^2 + \sigma_\varepsilon^2/n_a)$.   (9)

The expected value of $u_a$ conditional on $\bar e_a$, given the density function $p(u_a|\bar e_a)$ from Lemma 2, is seen to solve:

$E[u_a|\bar e_a] \approx \sum_i \alpha_i(\bar e_a)\big(\gamma_{ai}\bar e_a + (1-\gamma_{ai})\mu_i\big)$,   (10)

where $\gamma_{ai} = \sigma_i^2/(\sigma_i^2 + \sigma_\varepsilon^2/n_a)$, and where $\alpha_i(\bar e_a)$ denotes the mixing probabilities of $p(u_a|\bar e_a)$.

Note that the standard assumption of normal errors is nested as a special case, where there is just one component with $\mu_i = 0$ and $\sigma_i^2 = \sigma_u^2$. The first and second moments of the conditional density $p(u_a|\bar e_a)$ in this case are seen to solve:

$E[u_a|\bar e_a] = \gamma_a \bar e_a$   (11)
$\operatorname{var}[u_a|\bar e_a] = (1-\gamma_a)\sigma_u^2$,   (12)

where $\gamma_a = \sigma_u^2/(\sigma_u^2 + \sigma_\varepsilon^2/n_a)$ (see e.g. Molina and Rao, 2010). (Note that the unconditional variance solves $\operatorname{var}[u_a] = \operatorname{var}[\gamma_a\bar e_a] + (1-\gamma_a)\sigma_u^2 = \gamma_a^2(\sigma_u^2 + \sigma_\varepsilon^2/n_a) + (1-\gamma_a)\sigma_u^2$, which equals $\gamma_a\sigma_u^2 + (1-\gamma_a)\sigma_u^2 = \sigma_u^2$, since $\sigma_u^2 + \sigma_\varepsilon^2/n_a = \sigma_u^2/\gamma_a$.) For non-normal errors, $E[u_a|\bar e_a]$ is generally a non-linear function of $\bar e_a$, and $\operatorname{var}[u_a|\bar e_a]$ will generally be a function of $\bar e_a$.

2.5 Application to small area estimation of poverty and inequality

For our application let $y_{ah}$ measure per capita income (or expenditure) for household $h$ residing in area $a$, and let $s_{ah}$ denote the number of household members for that same household. In vector notation, let $y_{(a)}$ and $s_{(a)}$ be vectors with elements $y_{ah}$ and $s_{ah}$ for all households from area $a$. The objective is to determine the level of welfare for area $a$, which can be expressed as a function of $y_{(a)}$ and $s_{(a)}$: $W(y_{(a)}, s_{(a)})$. The welfare function $W$ is typically non-linear. Popular examples are the share of individuals whose income falls below a pre-specified poverty line (also known as the head-count poverty rate), or the Gini index of income inequality.
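For concreteness, a minimal sketch of two such welfare functions, the head-count poverty rate and the Gini index, computed from a vector of log per capita incomes and household sizes (person weights). This is a standard textbook implementation, not code from ELL or MR; the function names are ours.

```python
import numpy as np

def headcount(log_y, size, z_log):
    """Head-count poverty rate: share of persons with log per capita income below z_log."""
    log_y, size = np.asarray(log_y), np.asarray(size, dtype=float)
    return np.sum(size * (log_y < z_log)) / np.sum(size)

def gini(log_y, size):
    """Gini index of per capita income, persons weighted by household size (trapezoid Lorenz curve)."""
    y = np.exp(np.asarray(log_y, dtype=float))
    w = np.asarray(size, dtype=float)
    order = np.argsort(y)
    y, w = y[order], w[order]
    p = np.concatenate(([0.0], np.cumsum(w) / w.sum()))            # cumulative population shares
    L = np.concatenate(([0.0], np.cumsum(y * w) / np.sum(y * w)))  # cumulative income shares
    return 1.0 - 2.0 * np.sum((L[1:] + L[:-1]) / 2.0 * np.diff(p))

# Usage with hypothetical data
rng = np.random.default_rng(3)
log_y = rng.normal(1.0, 0.6, size=1000)
size = rng.integers(1, 6, size=1000)
print(headcount(log_y, size, z_log=0.5), gini(log_y, size))
```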
Collecting data on income (or expenditure) $y_{ah}$ for any given household is generally found to be expensive relative to collecting data on demographics, education, employment status, and housing. This is particularly true for developing countries, where much of the income does not come from wage employment. Consequently, income data is often only available in the form of so-called income surveys. The sample size of these surveys is sufficient to estimate national and possibly sub-regional welfare, but too small to estimate welfare directly at the level of much smaller areas (i.e. the target areas $a$).

Elbers et al. (2003; henceforward ELL) advocate an approach that combines the income survey with unit record population census data. The census has data on the independent variables $x_{ah}$ from equation (1), such as demographics, education, employment and housing, but not the household income variable $y_{ah}$. Crucially, the data on $x_{ah}$ are also collected by the income survey. The idea is to use the income survey to estimate the parameters from equation (1), and then use the model to predict income for every household in the census. With these predicted incomes we can subsequently estimate welfare $W$ for each target area $a$.

Standard errors can be obtained by means of simulation, which is ideally suited for estimating quantities that are non-linear functions of the random variables at hand, as is the case with measures of poverty and inequality. Let $R$ denote the number of simulations. The estimator then takes the form:

$\hat\mu = \dfrac{1}{R}\sum_{r=1}^{R} W\big(\tilde y_{(a)}^{(r)}, s_{(a)}\big)$,   (13)

where $\tilde y_{(a)}^{(r)}$ denotes the $r$-th simulated (or predicted) income vector with elements $\tilde y_{ah}^{(r)} = x_{ah}^T\tilde\beta^{(r)} + \tilde u_a^{(r)} + \tilde\varepsilon_{ah}^{(r)}$. With each simulation, both the model parameters $\tilde\beta^{(r)}$ and the errors $\tilde u_a^{(r)}$ and $\tilde\varepsilon_{ah}^{(r)}$ are drawn from their estimated distributions. (Our preferred method is to draw $\tilde\beta^{(r)}$ by re-estimating the model parameters using the $r$-th bootstrap version of the survey sample. Alternatively, $\tilde\beta^{(r)}$ may be drawn from its estimated asymptotic distribution. The difference between these two alternatives is expected to be modest, unless the survey sample is particularly small so that finite sample effects may play a role.) In the end this gives $R$ simulated poverty rates. The point estimates and their corresponding standard errors are obtained by computing respectively the average and the standard deviation over these simulated values.

It should be noted that ELL draw the area error $\tilde u_a$ from the estimated unconditional distribution, which is estimated non-parametrically. (The distribution for $\tilde\varepsilon_{ah}$ too is estimated non-parametrically.) The advantage of this approach is that it is fully flexible in that it does not restrict the shape of the error distributions. A possible shortcoming is that it does not take full advantage of all the available data. Ideally one would want to draw $\tilde u_a$ from a distribution that is conditioned on all relevant data that has been sampled from area $a$.

Molina and Rao (2010; henceforward MR) do exactly that: they closely follow ELL but then draw the area error from the conditional distribution. This is referred to as Empirical Bayes (EB) estimation. Importantly, MR also differ from ELL in that they restrict the errors to be normally distributed.
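A schematic sketch of the simulation estimator in equation (13) follows. The drawing functions are placeholders: in ELL they would return draws from the non-parametric unconditional error distributions, in MR from normal distributions (conditional on $\bar e_a$ for sampled areas), and in our approach from the estimated normal mixtures; the simple normal draws in the commented usage lines are there only to make the sketch concrete.

```python
import numpy as np

def simulate_welfare(X, area, size, draw_beta, draw_u, draw_eps, W, R=200, rng=None):
    """Equation (13): average a welfare function W over R simulated census income vectors."""
    rng = np.random.default_rng() if rng is None else rng
    areas = np.unique(area)
    sims = []
    for _ in range(R):
        beta_r = draw_beta(rng)                          # parameter uncertainty (e.g. bootstrap)
        u_r = {a: draw_u(a, rng) for a in areas}         # one area-effect draw per target area
        y_r = X @ beta_r + np.array([u_r[a] for a in area]) + draw_eps(len(area), rng)
        sims.append(W(y_r, size))
    sims = np.asarray(sims)
    return sims.mean(), sims.std()                       # point estimate and simulation standard error

# Placeholder draw functions (simple normal draws, for illustration only):
#   draw_beta = lambda rng: beta_hat
#   draw_u    = lambda a, rng: rng.normal(0.0, sigma_u_hat)
#   draw_eps  = lambda n, rng: rng.normal(0.0, sigma_eps_hat, size=n)
#   W         = lambda y, s: headcount(y, s, z_log=np.log(poverty_line))
```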
As we have seen, under this assumption the conditional distribution of $\tilde u_a$ is normal too and its parameters can be easily determined. (This is arguably why errors are generally assumed normal in the literature on EB estimation.) When errors are non-normal, it is not obvious what form the conditional distribution for $\tilde u_a$ will take; it will generally be of a different form than the unconditional distribution.

As is noted in section 2.3, getting the error distributions right is not merely a matter of efficiency. When the welfare function $W$ is a non-linear function of the error terms, using wrong error distributions will also introduce a bias in the welfare estimates. Whether the magnitude of this bias due to misspecification is important in practice is an empirical question.

The choice between non-normal errors combined with non-EB estimation (which is more flexible, but does not fully utilize all available data) or normal errors combined with EB estimation (which is less flexible, but fully utilizes the available data) may be determined/motivated by: (a) the degree of non-normality found in the data, and (b) how much information one stands to ignore/lose. The latter depends largely on: (i) how many areas have been sampled by the income survey (as for areas not represented in the survey EB and non-EB estimation are equivalent), and (ii) the size of the area error relative to the total error. The approach developed in this paper aims to combine the best of both worlds; we adopt EB estimation while permitting non-normal distribution functions. The non-parametric estimator for the distribution functions used by ELL does not lend itself to EB estimation. We propose a non-parametric estimator that does.

Remark 3 Note that an added advantage of representing the non-normal error distributions by normal mixtures is that $E[w(y_{ah})]$ can be evaluated as the sum of expectations over normal errors, i.e. $E[w(y_{ah})] = E[w(x_{ah}^T\beta + u_a + \varepsilon_{ah})] = \sum_{ij}\kappa_{ij} E[w(x_{ah}^T\beta + u_a^{(i)} + \varepsilon_{ah}^{(j)})]$ (with $\kappa_{ij} = \pi_i\lambda_j$), where $u_a^{(i)}$ and $\varepsilon_{ah}^{(j)}$ are normally distributed with means $\mu_i$ and $\nu_j$, and variances $\sigma_i^2$ and $\omega_j^2$. This means that evaluating $E[w(y_{ah})]$ analytically under normal-mixture errors is no more difficult than under normal errors.

3 Estimation of normal mixtures when errors are nested

3.1 Estimation of unrestricted normal mixtures

The objective is to estimate the parameters $(\pi_i, \mu_i, \sigma_i^2)$ and $(\lambda_j, \nu_j, \omega_j^2)$, which determine the normal mixtures (NM) $F_u(u) = \sum_{i=1}^{m_u} \pi_i F_i(u; \mu_i, \sigma_i^2)$ and $G_\varepsilon(\varepsilon) = \sum_{j=1}^{m_\varepsilon} \lambda_j G_j(\varepsilon; \nu_j, \omega_j^2)$, respectively, where $m_u$ and $m_\varepsilon$ denote the number of components.

Estimation of a NM from observed data is a routine task, mostly using the EM algorithm; see e.g. Dempster et al. (1977) and McLachlan and Peel (2000). Suppose that some data $y$ is drawn from a NM with $m$ components and parameter vector $\theta$. The algorithm introduces a latent random variable $z_{ij}$ that equals 1 if observation $y_j$ has been drawn from component $i$, and 0 otherwise. Let the log-likelihood function that treats both $y$ and $z$ as data be denoted by $L(\theta; y, z)$, and its derivative with respect to $\theta$ by $L_\theta(\theta; y, z)$. The corresponding maximum-likelihood (ML) estimator solves the following moment condition:

$E[L_\theta(\theta; y, z)] = E\big[E[L_\theta(\theta; y, z)\,|\,y]\big] = 0$.   (14)

The EM algorithm essentially solves the same moment condition but iteratively:

$E\big[E[L_\theta(\theta; y, z)\,|\,y, \hat\theta^{(k)}]\big] = 0$,   (15)

where $\hat\theta^{(k)}$ denotes the iteration-$k$ estimate of $\theta$. Solving equation (15) yields $\hat\theta^{(k+1)}$. The procedure results in a non-decreasing sequence of likelihoods $L(y; \hat\theta^{(k)})$. The advantage of the EM algorithm is that solving equation (15) is often considerably easier than solving equation (14). Note that the unobserved data $z$ is integrated out conditional on the observed data $y$ and the current estimate of $\theta$.
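For reference, a minimal sketch of the standard EM iteration for a univariate normal mixture fitted to directly observed data, i.e. the textbook algorithm that is modified below; this is not the paper's estimator, and the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def em_normal_mixture(y, pi, mu, sigma2, iters=100):
    """Standard EM for a univariate m-component normal mixture on observed data y."""
    y = np.asarray(y, dtype=float)
    pi, mu, sigma2 = (np.asarray(v, dtype=float).copy() for v in (pi, mu, sigma2))
    for _ in range(iters):
        # E-step: responsibilities P(z_ij = 1 | y_j, current parameters)
        dens = pi * norm.pdf(y[:, None], loc=mu, scale=np.sqrt(sigma2))
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of mixing probabilities, means and variances
        nk = tau.sum(axis=0)
        pi = nk / len(y)
        mu = (tau * y[:, None]).sum(axis=0) / nk
        sigma2 = (tau * (y[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, sigma2
```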
The nested error structure however poses a challenge that prevents a straightforward application of the EM algorithm. Conventionally, mixture distributions are estimated from data that are observed directly. We wish to estimate the distributions of $u_a$ and $\varepsilon_{ah}$, but we do not observe either of them. In some sense $u_a$ and $\varepsilon_{ah}$ are both observed with error (even if the total error $e_{ah}$ is known with certainty), which makes this a convolution problem.

We propose a modification of the EM algorithm that is able to deal with this convolution problem. In a nutshell, we treat the measurement error as a latent variable and integrate it out the same way as the latent variable $z$ is integrated out by the EM algorithm. When estimating $F_u$ (the NM for $u_a$), the observable data includes $\bar e_a = u_a + \bar\varepsilon_a$, where $\bar\varepsilon_a = \sum_h \varepsilon_{ah}/n_a$ will act as measurement error. The modified EM algorithm will then integrate out $\bar\varepsilon_a$ along with $z_a$. When estimating $G_\varepsilon$ (the NM for $\varepsilon_{ah}$), the observable data includes $e_{ah} = u_a + \varepsilon_{ah}$. Now $u_a$ takes on the role of measurement error and will hence be integrated out jointly with $z_a$. Note that we would need some initial estimate of $G_\varepsilon$ when estimating $F_u$ this way (as we have to integrate out $\bar\varepsilon$). Similarly, we need some initial estimate of $F_u$ when estimating $G_\varepsilon$ (to integrate out $u$). This by itself is very much in the spirit of the EM algorithm, which requires an initial estimate of $\theta$ to determine the distribution of $z$ conditional on the observed data.

Our modified EM algorithm solves the following moment condition when estimating $F_u$:

$\sum_a E\big[L_{\theta_u}(\theta_u;\ \bar e_a, \bar\varepsilon_a, z_a)\,\big|\,\bar e_a, \hat\theta_u^{(k)}\big] = 0$,   (16)

which is equivalent to:

$\sum_a E\Big[E\big[L_{\theta_u}(\theta_u;\ \bar e_a, \bar\varepsilon_a, z_a)\,\big|\,\bar e_a, \bar\varepsilon_a, \hat\theta_u^{(k)}\big]\,\Big|\,\bar e_a\Big] = 0$.   (17)

Similarly, it solves the following moment condition when estimating $G_\varepsilon$:

$\sum_a E\big[L_{\theta_\varepsilon}(\theta_\varepsilon;\ e_a, u_a, z_a)\,\big|\,e_a, \hat\theta_\varepsilon^{(k)}\big] = 0$,   (18)

which is equivalent to:

$\sum_a E\Big[E\big[L_{\theta_\varepsilon}(\theta_\varepsilon;\ e_a, u_a, z_a)\,\big|\,e_a, u_a, \hat\theta_\varepsilon^{(k)}\big]\,\Big|\,e_a\Big] = 0$,   (19)

where $e_a = (e_{a1}, \ldots, e_{a,n_a})$.

Specifically, in order to integrate out $\bar\varepsilon_a$ conditional on the observed $\bar e_a$ (see equation 17), we need an initial estimate of the probability distribution function of $\bar\varepsilon_a|\bar e_a$. It can be verified that $p(\bar\varepsilon_a|\bar e_a)$ is again a normal-mixture distribution.

Lemma 4 The probability density function of $\bar\varepsilon_a$ conditional on $\bar e_a$, which we denote by $p(\bar\varepsilon_a|\bar e_a)$, is a normal mixture with known parameters:

$p(\bar\varepsilon_a|\bar e_a) = \sum_{i=1}^{m_u}\sum_{j=1}^{m_\varepsilon}\sum_{k=1}^{m_\varepsilon} w_{ijk}\,\varphi(\bar\varepsilon_a;\ m_{ijk},\ s^2_{ijk})$,   (20)

where:

$m_{ijk} = \dfrac{\omega_j^2+\omega_k^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}\,(\bar e_a - \mu_i) + \dfrac{n_a\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}\,\dfrac{\nu_j+\nu_k}{n_a}$

$s^2_{ijk} = \dfrac{(\omega_j^2+\omega_k^2)\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}$,

and where $w_{ijk} = \tilde w_{ijk}/\sum_{ijk}\tilde w_{ijk}$ with:

$\tilde w_{ijk} = \pi_i\lambda_j\lambda_k\,\varphi\big(\bar e_a;\ \mu_i + (\nu_j+\nu_k)/n_a,\ \sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a\big)$,   (21)

where $\varphi$ denotes the normal probability density function, and where $(\pi_i, \mu_i, \sigma_i^2)$ and $(\lambda_j, \nu_j, \omega_j^2)$ denote the parameters associated with the normal-mixture distributions $F_u$ and $G_\varepsilon$, respectively.

Similarly, to integrate out $u_a$ conditional on $e_a$ (see equation 19), we need the probability distribution of $u_a|e_a$. As mentioned earlier, however, we use the distribution of $u_a|\bar e_a$, assuming that $p(u_a|e_a) \approx p(u_a|\bar e_a)$.
It follows that $u_a|\bar e_a$ too is normal-mixture distributed.

Lemma 5 The probability density function of $u_a$ conditional on $\bar e_a$, which we denote by $p(u_a|\bar e_a)$, is a normal mixture with known parameters:

$p(u_a|\bar e_a) = \sum_{i=1}^{m_u}\sum_{j=1}^{m_\varepsilon}\sum_{k=1}^{m_\varepsilon} w_{ijk}\,\varphi(u_a;\ m_{ijk},\ s^2_{ijk})$,   (22)

where:

$m_{ijk} = \dfrac{\sigma_i^2}{\sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a}\Big(\bar e_a - \dfrac{\nu_j+\nu_k}{n_a}\Big) + \dfrac{\omega_j^2+\omega_k^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}\,\mu_i$

$s^2_{ijk} = \dfrac{(\omega_j^2+\omega_k^2)\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}$,

and where $w_{ijk} = \tilde w_{ijk}/\sum_{ijk}\tilde w_{ijk}$ with:

$\tilde w_{ijk} = \pi_i\lambda_j\lambda_k\,\varphi\big(\bar e_a;\ \mu_i + (\nu_j+\nu_k)/n_a,\ \sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a\big)$,   (23)

where $\varphi$ denotes the normal probability density function, and where $(\pi_i, \mu_i, \sigma_i^2)$ and $(\lambda_j, \nu_j, \omega_j^2)$ denote the parameters associated with the normal-mixture distributions $F_u$ and $G_\varepsilon$, respectively.

It should be noted that by working with $p(u_a|\bar e_a)$ instead of $p(u_a|e_a)$ we will be solving a modified moment condition, namely:

$\sum_a E\Big[E\big[L_{\theta_\varepsilon}(\theta_\varepsilon;\ e_a, u_a, z_a)\,\big|\,\bar e_a, u_a, \hat\theta_\varepsilon^{(k)}\big]\,\Big|\,\bar e_a\Big] = 0$.   (24)

This denotes a genuine departure from the original EM algorithm, where one conditions on all the data that features in the log-likelihood function. We will refer to the resulting estimator as a pseudo-EM estimator. We expect the loss in precision to be minor, while the gain in practicality is substantial.

The resulting estimators, in the form of iterative equations, are presented below (see Annex 7.3 for a derivation). Given some initial estimate of $p(\bar\varepsilon_a|\bar e_a)$, we may implement the estimator for $F_u$. Subsequently, given this estimate for $F_u$, we may implement the estimator for $G_\varepsilon$. The newly obtained estimates can in turn be used to update our estimates of $p(\bar\varepsilon_a|\bar e_a)$ and $p(u_a|\bar e_a)$, after which we may obtain a new round of estimates for $F_u$ and $G_\varepsilon$. This is continued until convergence. (In practice, one iteration is found to be sufficient to obtain accurate estimates.)

Estimator for $F_u$

The fixed-point solution to the following set of iterative equations yields the estimator $(\hat\pi_i, \hat\mu_i, \hat\sigma_i^2)$ for $(\pi_i, \mu_i, \sigma_i^2)$, $i = 1, \ldots, m_u$:

$\hat\pi_i^{(k+1)} = \sum_a E\big[\hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big]\,/\,A$   (25)

$\hat\mu_i^{(k+1)} = \dfrac{\sum_a E\big[\hat\tau_{ai}^{(k)}(\bar\varepsilon_a)(\bar e_a - \bar\varepsilon_a)\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big]}{\sum_a E\big[\hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big]}$   (26)

$\hat\sigma_i^{2(k+1)} = \dfrac{\sum_a E\big[\hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\big(\bar e_a - \bar\varepsilon_a - \hat\mu_i^{(k+1)}\big)^2\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big]}{\sum_a E\big[\hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big]}$,   (27)

with:

$\hat\tau_{ai}^{(k)}(\bar\varepsilon_a) = \dfrac{\hat\pi_i^{(k)}\,\varphi\big(\bar e_a - \bar\varepsilon_a;\ \hat\mu_i^{(k)}, \hat\sigma_i^{2(k)}\big)}{\sum_i \hat\pi_i^{(k)}\,\varphi\big(\bar e_a - \bar\varepsilon_a;\ \hat\mu_i^{(k)}, \hat\sigma_i^{2(k)}\big)}$,   (28)

where $\varphi$ denotes the normal probability density function, and where the expectations are taken over $\bar\varepsilon_a$ conditional on $\bar e_a$ using the iteration-$k$ estimate of the conditional density function $\hat p^{(k)}(\bar\varepsilon_a|\bar e_a)$. Lemma 4 shows how $\hat p^{(k)}(\bar\varepsilon_a|\bar e_a)$ can be obtained as a function of $(\hat\pi_i^{(k)}, \hat\mu_i^{(k)}, \hat\sigma_i^{2(k)})$.
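A minimal sketch of one update of equations (25)-(28). Since $\hat\tau_{ai}$ depends on $\bar e_a - \bar\varepsilon_a = u_a$ only, the expectation over $\bar\varepsilon_a|\bar e_a$ is taken here by Monte Carlo draws of $u_a$ from the CLT-based conditional density of Corollary 7 given below; the exact Lemma 4 mixture could be substituted. The function name, variable names and the number of draws S are ours.

```python
import numpy as np
from scipy.stats import norm

def update_Fu(e_bar, n_a, pi, mu, sig2, sig_eps2, S=500, rng=None):
    """One pseudo-EM update of (pi_i, mu_i, sigma_i^2), equations (25)-(28)."""
    rng = np.random.default_rng() if rng is None else rng
    pi, mu, sig2 = (np.asarray(v, dtype=float) for v in (pi, mu, sig2))
    A, m = len(e_bar), len(pi)
    num_pi, num_mu = np.zeros(m), np.zeros(m)
    draws, taus = [], []
    for a in range(A):
        # p(u_a | e_bar_a) under the CLT approximation (Corollary 7 below)
        v = sig_eps2 / n_a[a]
        w = pi * norm.pdf(e_bar[a], loc=mu, scale=np.sqrt(sig2 + v))
        w /= w.sum()
        post_mean = (sig2 / (sig2 + v)) * e_bar[a] + (v / (sig2 + v)) * mu
        post_var = 1.0 / (1.0 / sig2 + n_a[a] / sig_eps2)
        comp = rng.choice(m, size=S, p=w)
        u = rng.normal(post_mean[comp], np.sqrt(post_var[comp]))  # draws of u_a = e_bar_a - eps_bar_a
        # Responsibilities (28), evaluated at u_a
        dens = pi * norm.pdf(u[:, None], loc=mu, scale=np.sqrt(sig2))
        tau = dens / dens.sum(axis=1, keepdims=True)
        num_pi += tau.mean(axis=0)                                # Monte Carlo E[tau_ai | e_bar_a]
        num_mu += (tau * u[:, None]).mean(axis=0)                 # Monte Carlo E[tau_ai * u_a | e_bar_a]
        draws.append(u); taus.append(tau)
    pi_new = num_pi / A                                           # equation (25)
    mu_new = num_mu / num_pi                                      # equation (26)
    num_s2 = sum((tau * (u[:, None] - mu_new) ** 2).mean(axis=0) for u, tau in zip(draws, taus))
    sig2_new = num_s2 / num_pi                                    # equation (27)
    return pi_new, mu_new, sig2_new
```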
Estimator for $G_\varepsilon$

The fixed-point solution to the following set of iterative equations yields the estimator $(\hat\lambda_j, \hat\nu_j, \hat\omega_j^2)$ for $(\lambda_j, \nu_j, \omega_j^2)$, $j = 1, \ldots, m_\varepsilon$:

$\hat\lambda_j^{(k+1)} = \sum_{ah} E\big[\hat\tau_{ahj}^{(k)}(u_a)\,\big|\,\bar e_a\big]\,/\,n$   (29)

$\hat\nu_j^{(k+1)} = \dfrac{\sum_{ah} E\big[\hat\tau_{ahj}^{(k)}(u_a)(e_{ah} - u_a)\,\big|\,\bar e_a\big]}{\sum_{ah} E\big[\hat\tau_{ahj}^{(k)}(u_a)\,\big|\,\bar e_a\big]}$   (30)

$\hat\omega_j^{2(k+1)} = \dfrac{\sum_{ah} E\big[\hat\tau_{ahj}^{(k)}(u_a)\big(e_{ah} - u_a - \hat\nu_j^{(k+1)}\big)^2\,\big|\,\bar e_a\big]}{\sum_{ah} E\big[\hat\tau_{ahj}^{(k)}(u_a)\,\big|\,\bar e_a\big]}$,   (31)

with:

$\hat\tau_{ahj}^{(k)}(u_a) = \dfrac{\hat\lambda_j^{(k)}\,\varphi\big(e_{ah} - u_a;\ \hat\nu_j^{(k)}, \hat\omega_j^{2(k)}\big)}{\sum_j \hat\lambda_j^{(k)}\,\varphi\big(e_{ah} - u_a;\ \hat\nu_j^{(k)}, \hat\omega_j^{2(k)}\big)}$,   (32)

where $\varphi$ denotes the normal probability density function, and where the expectations are taken over $u_a$ conditional on $\bar e_a$ using the probability density function $p(u_a|\bar e_a)$. See Lemma 5 for the probability density function of $u_a|\bar e_a$.

In order to get the iterative scheme started, we need suitable initial estimates of $p(\bar\varepsilon_a|\bar e_a)$ and $p(u_a|\bar e_a)$. Note that even without any prior knowledge about the distributions of $u_a$ and $\varepsilon_{ah}$, we should be able to obtain a reasonable estimate of the distribution of $\bar\varepsilon_a$ by appealing to the Central Limit Theorem (CLT). Under the assumption that the $\varepsilon_{ah}$ are independent across households, the distribution of $\bar\varepsilon_a$ tends to a normal distribution with mean zero and variance equal to $\sigma_\varepsilon^2/n_a$. In a typical household income survey, $n_a$ will be in the range of 10 to 100, which is sufficiently large for the CLT to take effect. The corollaries below derive the initial estimates of $p(\bar\varepsilon_a|\bar e_a)$ and $p(u_a|\bar e_a)$ under the assumption of normally distributed $\bar\varepsilon_a$ (with known variance).

Corollary 6 The probability density function of $\bar\varepsilon_a$ conditional on $\bar e_a$, which we denote by $p(\bar\varepsilon_a|\bar e_a)$, is a normal mixture with known parameters:

$p(\bar\varepsilon_a|\bar e_a) = \sum_{i=1}^{m_u} w_i\,\varphi\Big(\bar\varepsilon_a;\ \dfrac{\sigma_\varepsilon^2/n_a}{\sigma_\varepsilon^2/n_a + \sigma_i^2}(\bar e_a - \mu_i),\ \dfrac{\sigma_i^2\,\sigma_\varepsilon^2/n_a}{\sigma_i^2 + \sigma_\varepsilon^2/n_a}\Big)$,   (33)

where $w_i = \tilde w_i/\sum_i \tilde w_i$ with:

$\tilde w_i = \pi_i\,\varphi(\bar e_a;\ \mu_i,\ \sigma_i^2 + \sigma_\varepsilon^2/n_a)$,   (34)

where $\varphi$ denotes the normal probability density function, and where $\pi_i$, $\mu_i$, and $\sigma_i^2$ denote the parameters associated with the normal-mixture distribution $F_u$.

Corollary 7 The probability density function of $u_a$ conditional on $\bar e_a$, which we denote by $p(u_a|\bar e_a)$, is a normal mixture with known parameters:

$p(u_a|\bar e_a) = \sum_{i=1}^{m_u} \alpha_i\,\varphi\big(u_a;\ \gamma_{ai}\bar e_a + (1-\gamma_{ai})\mu_i,\ (1/\sigma_i^2 + n_a/\sigma_\varepsilon^2)^{-1}\big)$,   (35)

with $\gamma_{ai} = \sigma_i^2/(\sigma_i^2 + \sigma_\varepsilon^2/n_a)$, and where $\alpha_i = \tilde\alpha_i/\sum_i\tilde\alpha_i$ with:

$\tilde\alpha_i = \pi_i\,\varphi(\bar e_a;\ \mu_i,\ \sigma_i^2 + \sigma_\varepsilon^2/n_a)$,   (36)

where $\varphi$ denotes the normal probability density function, and where $\pi_i$, $\mu_i$, and $\sigma_i^2$ denote the parameters associated with the normal-mixture distribution $F_u$.

The following propositions provide some properties of the estimators.

Proposition 8 At every iteration $k$, the mean and variance of the probability density function $\hat g_\varepsilon^{(k)}(\varepsilon) = \sum_j \hat\lambda_j^{(k)}\varphi(\varepsilon;\ \hat\nu_j^{(k)}, \hat\omega_j^{2(k)})$ solve:

$E[\varepsilon;\ \hat g_\varepsilon^{(k)}] = \dfrac{1}{n}\sum_{ah}\Big(e_{ah} - E\big[u_a\,\big|\,\bar e_a;\ \hat p^{(k)}(u_a|\bar e_a)\big]\Big)$

$\operatorname{var}[\varepsilon;\ \hat g_\varepsilon^{(k)}] = \dfrac{1}{n}\sum_{ah} E\big[(e_{ah} - u_a)^2\,\big|\,\bar e_a;\ \hat p^{(k)}(u_a|\bar e_a)\big] - E^2[\varepsilon;\ \hat g_\varepsilon^{(k)}]$.

Proposition 9 At every iteration $k$, the mean and variance of the probability density function $\hat f_u^{(k)}(u) = \sum_i \hat\pi_i^{(k)}\varphi(u;\ \hat\mu_i^{(k)}, \hat\sigma_i^{2(k)})$ solve:

$E[u;\ \hat f_u^{(k)}] = \dfrac{1}{A}\sum_a\Big(\bar e_a - E\big[\bar\varepsilon_a\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big]\Big)$

$\operatorname{var}[u;\ \hat f_u^{(k)}] = \dfrac{1}{A}\sum_a E\big[(\bar e_a - \bar\varepsilon_a)^2\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big] - E^2[u;\ \hat f_u^{(k)}]$.
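A compact, self-contained sketch of Corollary 7 (equations (35)-(36)) together with the corresponding conditional mean in equation (10), which is the EB "shrinkage" of $\bar e_a$ toward the component means; function and variable names are ours.

```python
import numpy as np
from scipy.stats import norm

def u_given_ebar(e_bar_a, n_a, pi, mu, sig2, sig_eps2):
    """Corollary 7: mixture weights, component means and variances of p(u_a | e_bar_a)."""
    pi, mu, sig2 = (np.asarray(v, dtype=float) for v in (pi, mu, sig2))
    gamma = sig2 / (sig2 + sig_eps2 / n_a)                      # per-component shrinkage factor
    alpha = pi * norm.pdf(e_bar_a, loc=mu, scale=np.sqrt(sig2 + sig_eps2 / n_a))
    alpha /= alpha.sum()                                        # equation (36), normalized
    cond_mean = gamma * e_bar_a + (1.0 - gamma) * mu            # component means in (35)
    cond_var = 1.0 / (1.0 / sig2 + n_a / sig_eps2)              # component variances in (35)
    return alpha, cond_mean, cond_var

def eb_mean(e_bar_a, n_a, pi, mu, sig2, sig_eps2):
    """Equation (10): E[u_a | e_bar_a] under the CLT approximation."""
    alpha, cond_mean, _ = u_given_ebar(e_bar_a, n_a, pi, mu, sig2, sig_eps2)
    return np.sum(alpha * cond_mean)

# With a single zero-mean component (mu = [0], sig2 = [sigma_u^2]) this reduces to the
# familiar normal-error shrinkage E[u_a | e_bar_a] = gamma_a * e_bar_a of equation (11).
```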
3.2 Some practical parameter restrictions

Let us also consider the case where the component distributions are assumed to be given, so that only the mixing probability vectors $\pi$ and $\lambda$ have to be estimated. This is obviously a special case of the more general approach where the probabilities are jointly estimated with the parameters of the component distributions. Keeping the latter fixed markedly benefits numerical convergence, so the approach deserves to be considered a distinct option in its own right. This is precisely the approach advocated by Cordy and Thomas (1997). The estimators for $\pi$ and $\lambda$ are obtained as solutions to the following iterative equations:

$\tilde\pi_i^{(k+1)} = \dfrac{1}{A}\sum_a \dfrac{\tilde\pi_i^{(k)}\,\varphi(\bar e_a;\ \mu_i,\ \sigma_i^2 + \sigma_\varepsilon^2/n_a)}{\sum_i \tilde\pi_i^{(k)}\,\varphi(\bar e_a;\ \mu_i,\ \sigma_i^2 + \sigma_\varepsilon^2/n_a)}$,   (37)

and:

$\tilde\lambda_j^{(k+1)} = \dfrac{1}{n}\sum_{ah} \dfrac{\tilde\lambda_j^{(k)}\sum_i \tilde\pi_i\,\varphi(e_{ah};\ \mu_i+\nu_j,\ \sigma_i^2+\omega_j^2)}{\sum_j \tilde\lambda_j^{(k)}\sum_i \tilde\pi_i\,\varphi(e_{ah};\ \mu_i+\nu_j,\ \sigma_i^2+\omega_j^2)}$,   (38)

where $\tilde\pi_i$ denotes the estimator for $\pi_i$. Note that Cordy and Thomas (1997) only consider the estimator for $F_u$ (i.e. the estimator for $\pi_i$, $i = 1, \ldots, m_u$), as they are not interested in the distribution function of $\varepsilon_{ah}$.

Since the parameters of the component distributions are not being estimated, the modeler will have to set the values of the means $\mu_i$ and $\nu_j$, and of the variances $\sigma_i^2$ and $\omega_j^2$, beforehand. Cordy and Thomas (1997) recommend setting a common variance (as is also done in kernel density estimation, where the common variance is referred to as the bandwidth parameter). A natural choice is to set this common variance equal to $\sigma_i^2 = \hat\sigma_u^2/m_u$ for all $i$ (and hence $\omega_j^2 = \hat\sigma_\varepsilon^2/m_\varepsilon$). Note that for any given common variance $\bar\sigma^2$ we have $\operatorname{var}[u] = \bar\sigma^2 + \sum_i \pi_i\mu_i^2$, with $\sum_i \pi_i\mu_i^2 \geq 0$. This means that at a minimum the common variance must be chosen so that $\operatorname{var}[u] - \bar\sigma^2 \geq 0$. Obviously this is satisfied for $\bar\sigma^2 = \sigma_u^2/m_u$, which yields $\sum_i \pi_i\mu_i^2 = \frac{m_u-1}{m_u}\sigma_u^2$.

A convenient choice for the mean parameters is to take equally spaced $\mu_i$ (and $\nu_j$) with a range that respects the values attained by the observed data. Note that the estimated mixing probabilities are expected to satisfy $E[u] = \sum_i \hat\pi_i\mu_i = \sum_a \bar e_a/A$ (matching the first moment), as well as $\sum_i \hat\pi_i\mu_i^2 = \frac{m_u-1}{m_u}\hat\sigma_u^2$ (matching the second moment), although this is not guaranteed by the estimation approach. This is in contrast with the approach where the component distribution parameters are also being estimated, in which case the first and second moments are guaranteed.

Remark 10 One possible extension to the set of component distributions proposed above (with different $\mu_i$ but common $\sigma^2$) can be obtained by adding components with common zero mean (i.e. $\mu_r = 0$) but different variances $\sigma_r^2$.

4 A small simulation study

This section presents a modest Monte Carlo simulation experiment. We will focus our attention on the following two questions: (1) How effective is our approach in fitting non-normal error distributions (that are not necessarily normal mixtures)? (2) What are the implications for the estimation of poverty and inequality? Do distinct deviations from normality have the potential to introduce a meaningful bias in estimates of poverty and inequality if this non-normality is ignored?

We make the following assumptions. The simulated "country" consists of 500 domains (or target areas), which is the level at which measures of poverty and inequality will be estimated. Each domain is home to 3,000 households. Household per capita incomes are generated by means of the following model:

$y_{ah} = \beta x_{ah} + u_a + \varepsilon_{ah}$,   (39)

where $y_{ah}$ denotes the log of household income, and where $x_{ah}$ represents a single covariate.
For the model parameters we consider the values $\operatorname{var}[u_a + \varepsilon_{ah}] = \sigma_u^2 + \sigma_\varepsilon^2 = 0.3$, $\beta = 1$, $E[x_{ah}] = 0$, and $\operatorname{var}[x_{ah}]$ is chosen so that $R^2 = 0.4$ (which represents a goodness-of-fit that is typical for empirical household income models).

It will be convenient to introduce a parameter that measures the size of the random area effect relative to the total error term. Let us denote this parameter by $\rho = \sigma_u^2/(\sigma_u^2 + \sigma_\varepsilon^2)$. We will consider both the case of a small "area effect" ($\rho = 0.05$) and a medium/large "area effect" ($\rho = 0.25$).

The non-normal errors $u_a$ and $\varepsilon_{ah}$ will be drawn from a Log-Dagum distribution with probability density function:

$f(x) = \dfrac{a\,p\,e^{a p (x - \log b)}}{\big(1 + e^{a(x - \log b)}\big)^{p+1}}$,   (40)

where $p > 0$ is the parameter that determines the shape (or skewness) of the distribution. Smaller values of $p$ result in larger deviations from normality. The remaining two parameters $a$ and $b$ are fixed by imposing zero mean and setting the variance to $\sigma_u^2$ or $\sigma_\varepsilon^2$. Note that the Dagum distribution is not an uncommon choice when modeling income data (see e.g. Kleiber (2007), and the references therein). The covariate $x_{ah}$ will be drawn from a normal distribution. Finally, our artificial survey samples 15 households from each domain.

4.1 Estimation of $F_u$ and $G_\varepsilon$

The first test is whether we can successfully uncover the probability density functions of $u_a$ and $\varepsilon_{ah}$. We will keep the shape parameter for $F_u$ fixed at $p_u = 0.5$, but will consider three different values of the shape parameter for $G_\varepsilon$: $p_\varepsilon = (0.10, 0.25, 0.50)$. The benchmark density functions are obtained as a kernel density estimate applied to the actual realizations of the errors in the census. Our estimators for $F_u$ and $G_\varepsilon$ are based on normal mixtures with 3 and 2 component distributions, respectively.

Figure 1 presents our estimates of the probability density function of $u_a$. The estimates in the right panel show a remarkably good fit. These correspond to the case where the random area effect is large, with $\rho = 0.25$, in which case the contribution of $u_a$ matters. The imperfect fit shown in the left panel suggests that it is harder to estimate $F_u$ when the random area effect is small (in this case $\rho = 0.05$). This is arguably due to the poor signal-to-noise ratio in this case; $u_a$ will only make up a small fraction of the total residual, which is what is observed. However, this lack of precision will also have little to no implication for estimates of poverty and inequality, precisely because of the smallness of $u_a$. In sum, in this example estimates of $F_u$ are precise when they matter, and less precise when they matter less.

Figure 1: Probability density function for $u$. Normal-mixture density (black) versus benchmark density (red): (a) $\rho = 0.05$ (left panel); (b) $\rho = 0.25$ (right panel)

Our estimates of the probability density function of $\varepsilon_{ah}$ are presented in Figure 2. What is apparent is that, despite using no more than two components, our estimates provide remarkably accurate fits regardless of the size of the random area effect. When $\rho$ is small, $\varepsilon_{ah}$ is large relative to $u_a$, and so the signal-to-noise ratio is in our favour when estimating $G_\varepsilon$. It matters less in this case that we do not have a precise estimate of $F_u$. On the other hand, when $\rho$ is large and hence $\varepsilon_{ah}$ is of a more modest magnitude relative to $u_a$, our estimation of $G_\varepsilon$ may be helped by having a precise estimate of $F_u$. (Note that we first estimate $F_u$, and then use this estimate to subsequently estimate $G_\varepsilon$.)

Figure 2: Probability density function for $\varepsilon$. Normal-mixture density (black) versus benchmark density (red): (a) $\rho = 0.05$ (left panel); (b) $\rho = 0.25$ (right panel)
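Before turning to the implications for poverty, a brief note on simulation of the errors: if $Y$ is a Dagum$(a, b, p)$ variate, then $X = \log Y$ has the density in equation (40), so the Log-Dagum errors can be drawn by inverse-CDF sampling. In the minimal sketch below the recentring and rescaling to zero mean and a target variance is done empirically on a large draw rather than by solving for $a$ and $b$ analytically as in the paper (the two are equivalent up to sampling error, since $a$ and $b$ are scale and location parameters of $\log Y$); the parameter values are hypothetical.

```python
import numpy as np

def log_dagum(p, a, b, size, rng):
    """Inverse-CDF draws of X = log Y with Y ~ Dagum(a, b, p); X has density (40)."""
    u = rng.uniform(1e-12, 1.0, size=size)
    y = b * (u ** (-1.0 / p) - 1.0) ** (-1.0 / a)
    return np.log(y)

def standardized_log_dagum(p, var_target, size, rng, a=2.0, b=1.0):
    """Zero-mean Log-Dagum draws with variance var_target (empirical standardization)."""
    x = log_dagum(p, a, b, size, rng)
    return (x - x.mean()) / x.std() * np.sqrt(var_target)

rng = np.random.default_rng(4)
eps = standardized_log_dagum(p=0.25, var_target=0.285, size=100_000, rng=rng)  # hypothetical sigma_eps^2
print(eps.mean(), eps.var())   # approximately 0 and 0.285
```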
4.2 Implications for estimates of poverty

Next we investigate whether having more precise estimates of $F_u$ and $G_\varepsilon$ also gives us more precise estimates of poverty, and whether any gain in precision is economically meaningful. For ease of exposition we will focus on the percentage of households with incomes below the poverty line as the measure of poverty which we wish to estimate.

Tables 1 and 2 provide estimates of the bias for different values of the shape parameter for $G_\varepsilon$ (the shape parameter for $F_u$ is fixed at $p_u = 0.5$), and for different values of the (log) poverty line. The poverty rates associated with the different (log) poverty lines are roughly 15, 20, 30, and 45 percent. It should be noted that the bias in this case is estimated as the difference between the estimated and the true poverty rate averaged over the 500 target areas for one given replication of the "census".

At least two observations stand out. First, estimates obtained under the (incorrect) assumption of normal errors can be severely biased, with a bias of 3 to 4 percent on a poverty rate that ranges between 20 and 30 percent, depending on the shape parameter for $G_\varepsilon$. Our approach provides far superior estimates of poverty in these cases, with a bias of less than a percent. Second, the benefit of accommodating non-normal errors as opposed to assuming normal errors changes with the value of the poverty line. At the far left tail of the (log) income distribution (i.e. for particularly low values of the poverty line), it will be hard to empirically separate the two distributions. The difference will be more pronounced where the distributions have more "mass", i.e. for intermediate to higher values of the poverty line.

4.3 Implications for estimates of inequality

Table 5 compares the bias for estimates of income inequality for different values of the shape parameter and different values of the area location effect. We focus on the Mean Log Deviation (MLD) as the measure of inequality which we wish to estimate.
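For concreteness, a minimal sketch of the MLD of a vector of log per capita incomes (household-size weights omitted for brevity). As the sketch verifies numerically, adding a common area effect to all log incomes in an area leaves the within-area MLD unchanged, which is the point made next.

```python
import numpy as np

def mld(log_y):
    """Mean Log Deviation: log of mean income minus mean of log income."""
    log_y = np.asarray(log_y, dtype=float)
    return np.log(np.exp(log_y).mean()) - log_y.mean()

rng = np.random.default_rng(5)
log_y = rng.normal(1.0, 0.5, size=2000)
print(mld(log_y))          # within-area inequality
print(mld(log_y + 0.3))    # identical: a common area effect u_a drops out of the MLD
```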
Note that by definition the realizations of the area error $u_a$ drop out of the true measure of MLD for that area, which varies around 20 for our simulated data (we multiplied the numbers by a factor of 100). Note however that the size of the "location effect" may still have a bearing on our estimates of inequality, to the extent that it impacts our estimates of the distribution of $\varepsilon_{ah}$. Similarly to what we observed for poverty, the potential gain from accommodating non-normal errors as opposed to incorrectly assuming normal errors can be quite large. We observe biases of 2 to 3 with true inequality around 20 when assuming normal errors, compared to biases of roughly 0.5 when using our approach.

Table 1: Bias for EB estimates of head-count poverty: (a) Normal-Mixture distribution (left number); (b) Normal distribution (right number, in parentheses)

                         ρ = 0.05                                  ρ = 0.25
Poverty line / shape     0.50         0.25         0.10            0.50         0.25         0.10
-0.75                    0.67 (1.61)  0.64 (1.69)  1.32 (2.06)     1.38 (2.51)  1.40 (2.70)  1.06 (2.55)
-0.50                    0.76 (2.51)  0.82 (3.29)  1.35 (4.01)     1.39 (3.09)  1.55 (3.78)  1.32 (3.92)
-0.25                    0.67 (2.73)  0.92 (4.16)  1.14 (5.15)     1.01 (2.82)  1.32 (3.92)  1.34 (4.37)
 0                       0.31 (1.96)  0.73 (3.60)  0.60 (4.51)     0.20 (1.53)  0.61 (2.73)  0.89 (3.32)

Table 2: Bias for NON-EB estimates of head-count poverty: (a) Normal-Mixture distribution (left number); (b) Normal distribution (right number, in parentheses)

                         ρ = 0.05                                  ρ = 0.25
Poverty line / shape     0.50         0.25         0.10            0.50         0.25         0.10
-0.75                    0.19 (0.98)  0.12 (1.07)  0.91 (1.45)     0.23 (0.79)  0.22 (0.97)  -0.08 (0.85)
-0.50                    0.25 (1.95)  0.22 (2.73)  0.82 (3.47)     0.36 (1.63)  0.41 (2.31)   0.16 (2.48)
-0.25                    0.30 (2.40)  0.44 (3.83)  0.62 (4.82)     0.41 (2.03)  0.57 (3.13)   0.53 (3.60)
 0                       0.23 (19.7)  0.58 (3.60)  0.35 (4.51)     0.28 (1.66)  0.57 (2.86)   0.79 (3.47)

Table 3: RMSE for EB estimates of head-count poverty: (a) Normal-Mixture distribution (left number); (b) Normal distribution (right number, in parentheses)

                         ρ = 0.05                                  ρ = 0.25
Poverty line / shape     0.50         0.25         0.10            0.50         0.25         0.10
-0.75                    0.03 (0.03)  0.03 (0.03)  0.03 (0.03)     0.04 (0.05)  0.04 (0.05)  0.04 (0.05)
-0.50                    0.04 (0.05)  0.04 (0.05)  0.04 (0.06)     0.05 (0.06)  0.05 (0.06)  0.05 (0.06)
-0.25                    0.05 (0.06)  0.05 (0.07)  0.05 (0.07)     0.06 (0.07)  0.06 (0.07)  0.06 (0.08)
 0                       0.06 (0.06)  0.06 (0.07)  0.06 (0.07)     0.07 (0.07)  0.07 (0.07)  0.07 (0.08)

Table 4: RMSE for NON-EB estimates of head-count poverty: (a) Normal-Mixture distribution (left number); (b) Normal distribution (right number, in parentheses)

                         ρ = 0.05                                  ρ = 0.25
Poverty line / shape     0.50         0.25         0.10            0.50         0.25         0.10
-0.75                    0.04 (0.04)  0.04 (0.04)  0.04 (0.04)     0.10 (0.10)  0.09 (0.09)  0.09 (0.09)
-0.50                    0.05 (0.06)  0.05 (0.06)  0.05 (0.06)     0.12 (0.12)  0.12 (0.12)  0.12 (0.12)
-0.25                    0.07 (0.07)  0.07 (0.08)  0.07 (0.08)     0.14 (0.15)  0.14 (0.15)  0.14 (0.15)
 0                       0.07 (0.08)  0.07 (0.08)  0.08 (0.09)     0.15 (0.15)  0.16 (0.16)  0.16 (0.16)

Table 5: Bias for estimates of inequality (MLD): (a) Normal-Mixture distribution (left number); (b) Normal distribution (right number, in parentheses)

MLD / shape    0.50         0.25         0.10
ρ = 0.05       0.36 (1.67)  0.52 (2.59)  1.12 (3.71)
ρ = 0.25       0.37 (1.26)  0.61 (2.18)  0.47 (2.25)

5 A small empirical study

For this small empirical application we will treat the 2010 micro census from the United States, a 1 percent sample, as a "new" country (i.e. as if the data constitutes a full population census). The data is publicly available at IPUMS USA. There are a total of 1.25 million households (1,242,967 to be exact) residing in 422 counties, which is the level at which we will be estimating poverty and inequality (i.e. the "small area level").
The census includes individual income data as well as data on a wide range of individual characteristics. We use the income data to compute household income per capita. All other individual-level variables are also aggregated at the household level, as this is the level at which we will build the income model. We define the head of the household as the member of the household with the highest level of individual income. (In many empirical applications much of the available data is at the household level.) Table 6 provides some descriptive statistics of the data.

Table 6: Descriptive statistics of variables used in the empirical application (derived from 2010 IPUMS USA)

Variable        Description                                              Mean     Std. Dev.
lnycap          Log of household income per capita                       9.939    0.998
metro suburb    In suburb of major city (dummy)                          0.290    0.454
metro none      Away from major city (dummy)                             0.201    0.401
hh size 1       Household of size one (dummy)                            0.300    0.458
hh size 2       Household of size two (dummy)                            0.342    0.474
hh size 3       Household of size three (dummy)                          0.147    0.354
share child15   Share of children aged younger than 15                   0.108    0.199
hh hedu2        At least one member with a masters degree (dummy)        0.366    0.482
hh employed     Number of employed household members                     1.104    0.936
hd female       Head of household is female (dummy)                      0.417    0.493
hd logage       Log of age of household head                             3.889    0.371
hd ledu         Household head has lower education only (dummy)          0.048    0.214
hd hedu1        Household head has college degree (dummy)                0.226    0.418
hd hedu2        Household head has a masters degree (dummy)              0.312    0.463
hd employed     Household head is employed (dummy)                       0.660    0.474
hd hisp         Household head is Hispanic (dummy)                       0.091    0.288
hd black        Household head is African-American (dummy)               0.106    0.308

We draw a balanced sample of 6,330 households (15 households per county). The log of household per capita income will be our dependent variable. Our independent variables include: urbanity, household size, age composition, gender, ethnicity, education, and employment. There are a total of 16 regressors (excluding the constant). The regression model is shown in Table 7. We obtain an adjusted R-squared of 0.432, which is a typical level for this type of income (or consumption) regression model. The location effect is estimated at 0.027, which is not unusual for developed nations but rather small for developing countries, where one might find location effects in the range of 0.05 to 0.25.

Table 7: Regression model with log household income per capita (lnycap) as dependent variable (sample from 2010 IPUMS USA); t statistics in parentheses; * p < 0.10, ** p < 0.05, *** p < 0.01

metro suburb     0.171***   (8.21)
metro none      -0.0542*    (-1.68)
hh size 1        0.733***   (16.32)
hh size 2        0.620***   (16.00)
hh size 3        0.344***   (9.73)
share child15   -0.160**    (-2.20)
hh hedu2         0.277***   (6.47)
hh employed      0.257***   (14.24)
hd female       -0.212***   (-10.80)
hd logage        0.898***   (30.46)
hd ledu         -0.331***   (-6.40)
hd hedu1         0.175***   (7.06)
hd hedu2         0.450***   (9.81)
hd employed      0.484***   (15.25)
hd hisp         -0.159***   (-4.36)
hd black        -0.248***   (-7.78)
constant         5.201***   (39.18)
N                6330
adj. R^2         0.432

Figure 3 shows the estimates of the probability density functions associated with the area error $u_a$ and the household error $\varepsilon_{ah}$, respectively. The red line denotes the normal-mixture density estimate; the blue line denotes the normal density function that best fits the data.
Figure 3 shows the estimates of the probability density functions associated with the area error ua and the household error εah, respectively. The red line denotes the normal-mixture density estimate; the blue line denotes the normal density function that best fits the data.

[Figure 3: Normal-mixture density (red) versus normal density (blue) estimates for: (a) area error ua (left panel); (b) household error εah (right panel).]

As we are dealing with empirical data rather than simulated data, we do not know the distributions from which the errors are drawn, and hence we cannot include the "true" density function as a benchmark. The estimates suggest that the household errors are drawn from a distribution that deviates visibly from the normal. We do not see a similar deviation from the normal density for the area error. However, the small variance of the area error relative to the total error implies a low signal-to-noise ratio, which makes it difficult to identify the underlying density function. The small variance also means that the area error makes only a minor contribution to the total error, so that its density function is less important; it is clearly more important that our estimate of the density associated with the household error is accurate.

The log poverty line is set at 9.5, which yields an aggregate poverty rate of about 28.3 percent. County-level estimates of poverty and inequality are obtained using both EB and non-EB estimates, under the assumption of normal errors and under the less restrictive assumption of normal-mixture errors. For each specification we obtain 422 estimates of poverty and inequality, which we then compare to the true values derived from the full census. We judge the accuracy of the estimates on the basis of two summary statistics: the aggregate bias and the RMSE, both obtained by aggregating the county-level differences between our estimates and the true rates over the 422 counties (a short computational sketch of these two statistics is given at the end of this section). The results are presented in Table 8.

            Bias             RMSE
EB          0.029 (0.048)    0.045 (0.059)
NON-EB      0.024 (0.044)    0.042 (0.056)

Table 8: Bias and RMSE for estimates of county-level poverty using 2010 data from the United States: (a) Normal-Mixture distribution (left number); (b) Normal distribution (right number)

We observe a rather large bias when estimating county-level poverty under the assumption of normally distributed errors: almost 5 percentage points given a national poverty rate of about 28 percent. By relaxing the normality assumption and assuming normal-mixture distributions instead, we are able to reduce this bias to just below 3 percentage points, which is a non-trivial improvement. We see similar reductions in the RMSE. Working with EB or non-EB estimates makes little to no difference in this empirical application; this was to be expected given the smallness of the location effect. The fact that some bias remains suggests that a degree of model misspecification persists; possible sources include misspecification of the structural model as well as forms of heteroskedasticity that are currently ignored.
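The two summary statistics used above can be read as follows. This is a minimal sketch under the assumption that the aggregate bias is the average county-level difference and the RMSE its root mean square; the arrays `estimates` and `truth` (county-level head-count ratios, here of length 422) are hypothetical inputs.

```python
import numpy as np

def bias_and_rmse(estimates, truth):
    """Aggregate bias and RMSE over small areas (here: 422 counties),
    computed from the county-level differences between estimated and
    true poverty (or inequality) rates."""
    d = np.asarray(estimates, float) - np.asarray(truth, float)
    return d.mean(), np.sqrt((d ** 2).mean())

# Hypothetical usage: est_mixture, est_normal, and true_rates are arrays of
# length 422 holding the county-level head-count ratios.
# bias_mix, rmse_mix = bias_and_rmse(est_mixture, true_rates)
# bias_nrm, rmse_nrm = bias_and_rmse(est_normal, true_rates)
```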
6 Concluding remarks

We have developed an estimator for the error distribution functions associated with the nested errors from a linear mixed model. An attractive feature of the estimator is that it accommodates Empirical Bayes prediction. This is achieved without making any restrictive assumptions about the shape of the distribution functions. Monte-Carlo simulations presented in this paper show that estimates of poverty and inequality can be severely biased when non-normality of the errors is ignored. The bias can be as high as 2 to 3 percent on a poverty rate of 20 to 30 percent. Most of this bias is resolved when using our approach.

7 Appendix

7.1 A lemma and corollary that are used in the proofs of Lemmas 4 and 5

Lemma 11 Let $v_1$ and $v_2$ be two independent normal random variates, with $E v_i = \mu_i$ and $\mathrm{var}\, v_i = \sigma_i^2$ for $i = 1, 2$. Let $v_3 = v_1 + v_2$. Denote the normal density function with expectation $\mu$ and variance $\sigma^2$ by $\phi(\cdot; \mu, \sigma^2)$. Then for the joint density $p(v_1, v_3)$ and the conditional density $p(v_1|v_3)$ we have

$p(v_1, v_3) = \phi(v_1; \mu_1, \sigma_1^2)\,\phi(v_3 - v_1; \mu_2, \sigma_2^2) = p(v_1|v_3)\,p(v_3) = \phi\!\left(v_1;\ \frac{\sigma_1^2}{\sigma_1^2+\sigma_2^2}(v_3 - \mu_2) + \frac{\sigma_2^2}{\sigma_1^2+\sigma_2^2}\mu_1,\ \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2}\right) \times \phi\!\left(v_3;\ \mu_1 + \mu_2,\ \sigma_1^2+\sigma_2^2\right).$

Proof This can be verified directly. Alternatively, note that both $v_3$ and $v_1|v_3$ are normally distributed. The distribution of $v_1|v_3$ has expectation

$E(v_1|v_3) = \frac{\sigma_1^2}{\sigma_1^2+\sigma_2^2}(v_3 - \mu_2) + \frac{\sigma_2^2}{\sigma_1^2+\sigma_2^2}\mu_1$

and variance

$\mathrm{var}(v_1|v_3) = \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2}.$

Corollary 12 Let $x$ and $y$ be independent random variables distributed as normal mixtures, and let $q = x + y$. Then (i) $q$ is also distributed as a normal mixture, and (ii) the conditional distributions $p(x|q)$ and $p(y|q)$ are also normal mixtures. Moreover, the mixing parameters of $p(q)$, $p(x|q)$ and $p(y|q)$ are closed expressions in terms of the distributional parameters of $x$ and $y$. In particular, if the density function of $x$ is

$f(x) = \sum_{k=1}^{a} \pi_k\, \phi(x; \mu_k, \sigma_k^2)$

(with $\pi_k > 0$ and $\sum_k \pi_k = 1$) and the density function of $y$ is

$g(y) = \sum_{j=1}^{b} \lambda_j\, \phi(y; \nu_j, \tau_j^2)$

(with $\lambda_j > 0$ and $\sum_j \lambda_j = 1$), then the conditional distribution of $x$ given $q$ is

$h(x|q) = \sum_{jk} \kappa_{jk}\, \phi(x; m_{jk}, s_{jk}^2),$

where the mixing probabilities $\kappa_{jk}$ are proportional to

$\kappa_{jk} \propto \pi_k \lambda_j\, \phi\!\left(q;\ \mu_k + \nu_j,\ \sigma_k^2 + \tau_j^2\right)$

and

$m_{jk} = \frac{\sigma_k^2}{\sigma_k^2+\tau_j^2}(q - \nu_j) + \frac{\tau_j^2}{\sigma_k^2+\tau_j^2}\mu_k, \qquad s_{jk}^2 = \frac{\sigma_k^2\tau_j^2}{\sigma_k^2+\tau_j^2}.$

Proof The random variable $x$ can be expressed as

$x = \sum_{k=1}^{a} z_k x_k,$

where the $x_k$ are independent (latent) random variables, and also independent of $(z_1, \ldots, z_a)$. $x_k$ is normally distributed according to component distribution $k$, with density $\phi(x_k; \mu_k, \sigma_k^2)$. The $z_k$ are indicator random variables taking values 0 and 1 with probability $\pi_k$. One and only one indicator takes the value 1, the others are 0 (in other words, the $(z_1, \ldots, z_a)$ have a multinomial distribution with class probabilities $\pi_k$ and a single draw). Likewise, $y$ can be expressed as

$y = \sum_{j=1}^{b} w_j y_j,$

where the $(w_j, y_j)$ are independent of the $(z_k, x_k)$. Note that $z_k = z_k \sum_j w_j$ and $w_j = w_j \sum_k z_k$, so that

$q = x + y = \sum_k z_k x_k + \sum_j w_j y_j = \sum_{jk} z_k w_j (x_k + y_j).$

Part (i) of the Corollary now immediately follows; the mixing probabilities for $q$ are $P[z_k w_j = 1] = \pi_k \lambda_j$ and the component distribution for component $(k, j)$ has expectation $\mu_k + \nu_j$ and variance $\sigma_k^2 + \tau_j^2$.

For the conditional distribution $p(x|q)$ we have

$p(x|q) \propto p(x, q) = \sum_{jk} \pi_k \lambda_j\, \phi(x; \mu_k, \sigma_k^2)\, \phi(q - x; \nu_j, \tau_j^2) = \sum_{jk} \pi_k \lambda_j\, \phi\!\left(x;\ \frac{\sigma_k^2}{\sigma_k^2+\tau_j^2}(q - \nu_j) + \frac{\tau_j^2}{\sigma_k^2+\tau_j^2}\mu_k,\ \frac{\sigma_k^2\tau_j^2}{\sigma_k^2+\tau_j^2}\right) \times \phi\!\left(q;\ \mu_k + \nu_j,\ \sigma_k^2 + \tau_j^2\right).$

The last equality follows from the Lemma. The conditional distribution $p(x|q)$ is therefore again a normal mixture with mixing probabilities for the $(k, j)$ component proportional to $\pi_k \lambda_j\, \phi(q; \mu_k + \nu_j, \sigma_k^2 + \tau_j^2)$, and corresponding component distribution

$\phi\!\left(x;\ \frac{\sigma_k^2}{\sigma_k^2+\tau_j^2}(q - \nu_j) + \frac{\tau_j^2}{\sigma_k^2+\tau_j^2}\mu_k,\ \frac{\sigma_k^2\tau_j^2}{\sigma_k^2+\tau_j^2}\right).$

The distribution of $p(y|q)$ follows in exactly the same fashion.
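As a numerical illustration of Corollary 12 (ours, with made-up parameter values), the sketch below computes the mixing weights, component means, and component variances of p(x|q) for two small mixtures, and checks that the result agrees with the conditional density obtained directly from the joint density p(x, q).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical mixture for x: weights pi_, means mu, variances sig2
pi_, mu, sig2 = np.array([0.6, 0.4]), np.array([-0.5, 0.8]), np.array([0.3, 0.9])
# Hypothetical mixture for y: weights lam, means nu, variances tau2
lam, nu, tau2 = np.array([0.7, 0.3]), np.array([0.0, 1.5]), np.array([0.5, 1.2])

q = 0.4  # the observed value of q = x + y

# Mixing weights of p(x|q): kappa_{jk} proportional to pi_k lam_j phi(q; mu_k + nu_j, sig2_k + tau2_j)
K, J = np.meshgrid(np.arange(2), np.arange(2), indexing="ij")  # k index, j index
kappa = pi_[K] * lam[J] * norm.pdf(q, loc=mu[K] + nu[J], scale=np.sqrt(sig2[K] + tau2[J]))
kappa /= kappa.sum()

# Component means and variances of p(x|q), per Corollary 12
m = (sig2[K] * (q - nu[J]) + tau2[J] * mu[K]) / (sig2[K] + tau2[J])
s2 = sig2[K] * tau2[J] / (sig2[K] + tau2[J])

# Conditional density implied by Corollary 12
grid = np.linspace(-6.0, 6.0, 4001)
dens = sum(kappa[k, j] * norm.pdf(grid, m[k, j], np.sqrt(s2[k, j]))
           for k in range(2) for j in range(2))

# Cross-check against p(x|q) = p(x, q) / p(q) computed from the joint density
joint = sum(pi_[k] * lam[j] * norm.pdf(grid, mu[k], np.sqrt(sig2[k]))
            * norm.pdf(q - grid, nu[j], np.sqrt(tau2[j]))
            for k in range(2) for j in range(2))
print(np.max(np.abs(dens - joint / np.trapz(joint, grid))))  # close to zero
```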
7.2 Proofs

Lemma 4: The probability density function of $\bar\varepsilon_a$ conditional on $\bar e_a$, which we denote by $p(\bar\varepsilon_a|\bar e_a)$, is a normal mixture with known parameters:

$p(\bar\varepsilon_a|\bar e_a) = \sum_{i=1}^{m_u}\sum_{j=1}^{m_\varepsilon}\sum_{k=1}^{m_\varepsilon} w_{ijk}\,\phi\bigl(\bar\varepsilon_a;\ m_{ijk},\ s^2_{ijk}\bigr),$   (41)

where

$m_{ijk} = \frac{\omega_j^2+\omega_k^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}(\bar e_a - \mu_i) + \frac{n_a\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}\,\frac{\nu_j+\nu_k}{n_a}, \qquad s^2_{ijk} = \frac{(\omega_j^2+\omega_k^2)\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2},$

and where $w_{ijk} = \tilde w_{ijk}/\sum_{ijk}\tilde w_{ijk}$ with:

$\tilde w_{ijk} = \pi_i\lambda_j\lambda_k\,\phi\bigl(\bar e_a;\ \mu_i + (\nu_j+\nu_k)/n_a,\ \sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a\bigr),$   (42)

where $\phi$ denotes the normal probability density function, and where $(\pi_i, \mu_i, \sigma_i^2)$ and $(\lambda_j, \nu_j, \omega_j^2)$ denote the parameters associated with the normal-mixture distributions $F_u$ and $G_\varepsilon$, respectively.

Proof The result stated in this lemma follows from a direct application of Corollary 12.

Lemma 5: The probability density function of $u_a$ conditional on $\bar e_a$, which we denote by $p(u_a|\bar e_a)$, is a normal mixture with known parameters:

$p(u_a|\bar e_a) = \sum_{i=1}^{m_u}\sum_{j=1}^{m_\varepsilon}\sum_{k=1}^{m_\varepsilon} w_{ijk}\,\phi\bigl(u_a;\ m_{ijk},\ s^2_{ijk}\bigr),$   (43)

where

$m_{ijk} = \frac{\sigma_i^2}{\sigma_i^2+(\omega_j^2+\omega_k^2)/n_a}\bigl(\bar e_a - (\nu_j+\nu_k)/n_a\bigr) + \frac{\omega_j^2+\omega_k^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}\,\mu_i, \qquad s^2_{ijk} = \frac{(\omega_j^2+\omega_k^2)\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2},$

and where $w_{ijk} = \tilde w_{ijk}/\sum_{ijk}\tilde w_{ijk}$ with:

$\tilde w_{ijk} = \pi_i\lambda_j\lambda_k\,\phi\bigl(\bar e_a;\ \mu_i + (\nu_j+\nu_k)/n_a,\ \sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a\bigr),$   (44)

where $\phi$ denotes the normal probability density function, and where $(\pi_i, \mu_i, \sigma_i^2)$ and $(\lambda_j, \nu_j, \omega_j^2)$ denote the parameters associated with the normal-mixture distributions $F_u$ and $G_\varepsilon$, respectively.

Proof The result stated in this lemma follows from a direct application of Corollary 12.

Proposition 9: At every iteration $k$, the mean and variance of the probability density function $\hat f_u^{(k)}(u) = \sum_i \hat\pi_i^{(k)}\phi(u; \hat\mu_i^{(k)}, \hat\sigma_i^{2(k)})$ solve:

$E[u; \hat f_u^{(k)}] = \frac{1}{A}\sum_a \Bigl(\bar e_a - E\bigl[\bar\varepsilon_a \mid \bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\bigr]\Bigr)$

$\mathrm{var}[u; \hat f_u^{(k)}] = \frac{1}{A}\sum_a E\bigl[(\bar e_a - \bar\varepsilon_a)^2 \mid \bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\bigr] - E^2[u; \hat f_u^{(k)}].$

Proof Let us first evaluate the first moment:

$E_k[u] = \sum_i \hat\pi_i^{(k)}\hat\mu_i^{(k)},$   (45)

where $E_k[\cdot]$ takes expectations using the iteration-$k$ estimate of the pdf for $u$: $\hat f_u^{(k)}(u) = \sum_i \hat\pi_i^{(k)}\phi(u; \hat\mu_i^{(k)}, \hat\sigma_i^{2(k)})$. Substituting the expressions for $\hat\pi_i^{(k)}$ and $\hat\mu_i^{(k)}$, we obtain:

$\sum_i \hat\pi_i^{(k)}\hat\mu_i^{(k)} = \frac{1}{A}\sum_i E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)} \Big| \bar e_a\Bigr]\, \frac{E_k\bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a) \big| \bar e_a\bigr]}{E_k\bigl[\sum_a \hat\tau_{ai}^{(k)} \big| \bar e_a\bigr]}$   (46)

$= \frac{1}{A}\sum_i E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a) \Big| \bar e_a\Bigr]$   (47)

$= \frac{1}{A}\sum_a E_k\Bigl[\sum_i \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a) \Big| \bar e_a\Bigr].$   (48)

By definition we have that $\sum_i \hat\tau_{ai}^{(k)} = 1$, so that the latter simplifies to:

$E_k[u] = \frac{1}{A}\sum_a \bigl(\bar e_a - E_k[\bar\varepsilon_a|\bar e_a]\bigr).$   (49)

Next is the expression for the variance of $u$:

$\mathrm{var}_k[u] = \sum_i \hat\pi_i^{(k)}\bigl(\hat\sigma_i^{2(k)} + \hat\mu_i^{2(k)}\bigr) - E_k^2[u].$   (50)

Substituting the expressions for $\hat\pi_i^{(k)}$, $\hat\mu_i^{(k)}$ and $\hat\sigma_i^{2(k)}$ yields:

$\sum_i \hat\pi_i^{(k)}\bigl(\hat\sigma_i^{2(k)} + \hat\mu_i^{2(k)}\bigr) = \frac{1}{A}\sum_i \Biggl( E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)} \Big| \bar e_a\Bigr]\, \frac{E_k\bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a - \hat\mu_i^{(k)})^2 \big| \bar e_a\bigr]}{E_k\bigl[\sum_a \hat\tau_{ai}^{(k)} \big| \bar e_a\bigr]} + E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)} \Big| \bar e_a\Bigr]\hat\mu_i^{2(k)}\Biggr)$

$= \frac{1}{A}\sum_i \Bigl( E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a - \hat\mu_i^{(k)})^2 \Big| \bar e_a\Bigr] + E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)} \Big| \bar e_a\Bigr]\hat\mu_i^{2(k)}\Bigr)$

$= \frac{1}{A}\sum_i E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a)^2 + \sum_a \hat\tau_{ai}^{(k)}\hat\mu_i^{2(k)} - 2\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a)\hat\mu_i^{(k)} \Big| \bar e_a\Bigr].$

From the expression for $\hat\tau_{ai}^{(k)}$ we know that $E_k[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a)|\bar e_a] = E_k[\sum_a \hat\tau_{ai}^{(k)}|\bar e_a]\,\hat\mu_i^{(k)}$. Plugging this into the last obtained equation, the last two terms are seen to cancel out, and we obtain:

$\sum_i \hat\pi_i^{(k)}\bigl(\hat\sigma_i^{2(k)} + \hat\mu_i^{2(k)}\bigr) = \frac{1}{A}\sum_i E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a)^2 \Big| \bar e_a\Bigr]$   (51)

$= \frac{1}{A}\sum_a E_k\Bigl[\Bigl(\sum_i \hat\tau_{ai}^{(k)}\Bigr)(\bar e_a - \bar\varepsilon_a)^2 \Big| \bar e_a\Bigr]$   (52)

$= \frac{1}{A}\sum_a E_k\bigl[(\bar e_a - \bar\varepsilon_a)^2 \big| \bar e_a\bigr].$   (53)

This completes the proof.
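To make Lemma 5 concrete, the following sketch evaluates the weights, means, and variances of the conditional mixture p(u_a | ē_a) for hypothetical parameter values of F_u and G_ε; the function name and all numerical values are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def conditional_u_given_ebar(ebar_a, n_a, pi, mu, sig2, lam, nu, om2):
    """Mixture representation of p(u_a | ebar_a) as in Lemma 5:
    components indexed by (i, j, k) with weights w_ijk, means m_ijk,
    and variances s2_ijk. Inputs: ebar_a (observed area mean residual),
    n_a (households in area a), (pi, mu, sig2) the F_u mixture and
    (lam, nu, om2) the G_eps mixture parameters (hypothetical values)."""
    I, J, K = np.meshgrid(np.arange(len(pi)), np.arange(len(lam)),
                          np.arange(len(lam)), indexing="ij")
    mean_e = mu[I] + (nu[J] + nu[K]) / n_a        # component mean of ebar_a
    var_e = sig2[I] + (om2[J] + om2[K]) / n_a     # component variance of ebar_a
    w = pi[I] * lam[J] * lam[K] * norm.pdf(ebar_a, mean_e, np.sqrt(var_e))
    w /= w.sum()                                  # eq. (44), normalized
    denom = om2[J] + om2[K] + n_a * sig2[I]
    m = (n_a * sig2[I] * (ebar_a - (nu[J] + nu[K]) / n_a)
         + (om2[J] + om2[K]) * mu[I]) / denom
    s2 = (om2[J] + om2[K]) * sig2[I] / denom
    return w, m, s2

# Hypothetical parameters; E[u_a | ebar_a] is the weight-averaged component mean.
pi, mu, sig2 = np.array([0.5, 0.5]), np.array([-0.2, 0.2]), np.array([0.04, 0.09])
lam, nu, om2 = np.array([0.6, 0.4]), np.array([-0.3, 0.45]), np.array([0.25, 0.64])
w, m, s2 = conditional_u_given_ebar(ebar_a=0.1, n_a=15, pi=pi, mu=mu, sig2=sig2,
                                    lam=lam, nu=nu, om2=om2)
print((w * m).sum())  # E[u_a | ebar_a]; Lemma 4's E[epsbar_a | ebar_a] equals ebar_a minus this value
```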
7.3 A derivation of the EM-type estimator for $F_u$

Below we derive the EM-type estimator for the normal-mixture distribution $F_u$. The estimator for $G_\varepsilon$ is obtained in a very similar fashion.

Let $\Psi$ denote the vector that contains the parameters of this normal mixture, which includes the mixing probabilities $\pi_i$ as well as the parameters $\mu_i$ and $\sigma_i^2$ of the component distributions. Assume for the moment that $u_a$ denotes observed data. The EM algorithm would then proceed as follows. It introduces the random variable $z_{ai}$ that equals 1 if $u_a$ comes from component $i$ and 0 otherwise, and defines $L(\Psi; u, z)$ as the log-likelihood function that would apply if both $u$ and $z$ were observed. It then integrates out $z$ by taking expectations over $z$ conditional on an estimate of $\Psi$ that is available at iteration $k$, which we shall denote $\hat\Psi^{(k)}$. The solution $\hat\Psi^{(k+1)}$ that maximizes the resulting log-likelihood function $Q(\Psi; \hat\Psi^{(k)}) = E[L(\Psi; u, z)|\hat\Psi^{(k)}]$ can be obtained analytically, unlike the solution to the original log-likelihood function. The EM estimator for $\Psi$ in this case is obtained as the fixed point of this iterative equation.

Unfortunately, we do not observe $u_a$. According to the nested error model, $u_a = \bar e_a - \bar\varepsilon_a$. We observe $\bar e_a$, but not $\bar\varepsilon_a$. The EM algorithm, however, can still be applied. Instead of integrating out $z$, we will be integrating out both $z$ and $\bar\varepsilon_a$. With slight abuse of notation, the resulting log-likelihood function is now given by: $Q(\Psi; \hat\Psi^{(k)}) = E[L(\Psi; \bar e, \bar\varepsilon, z)|\hat\Psi^{(k)}]$. Note that this function $Q(\Psi; \hat\Psi^{(k)})$ will be different from the function that is obtained when $u$ is treated as observed data and only $z$ is integrated out. (But it is obtained respecting the principles of the EM algorithm.)

It will be convenient to postpone taking the expectations over $\bar\varepsilon$ until after we have evaluated the partial derivatives of $Q(\Psi; \hat\Psi^{(k)})$ with respect to $\Psi$. It follows that:

$Q(\Psi; \hat\Psi^{(k)}) = E\bigl[Q(\Psi; \hat\Psi^{(k)}, \bar\varepsilon)\,\big|\,\bar e\bigr],$   (54)

where expectations are taken over $\bar\varepsilon$ conditional on $\bar e$, and where:

$Q(\Psi; \hat\Psi^{(k)}, \bar\varepsilon) = \sum_i \sum_a \hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\log\bigl[\pi_i\,\phi(\bar e_a - \bar\varepsilon_a; \mu_i, \sigma_i^2)\bigr],$   (55)

with:

$\hat\tau_{ai}^{(k)}(\bar\varepsilon_a) = \frac{\hat\pi_i^{(k)}\,\phi\bigl(\bar e_a - \bar\varepsilon_a;\ \hat\mu_i^{(k)},\ \hat\sigma_i^{2(k)}\bigr)}{\sum_i \hat\pi_i^{(k)}\,\phi\bigl(\bar e_a - \bar\varepsilon_a;\ \hat\mu_i^{(k)},\ \hat\sigma_i^{2(k)}\bigr)}.$   (56)

Evaluating the partial derivatives of $Q(\Psi; \hat\Psi^{(k)})$ with respect to $\Psi = (\pi, \mu, \sigma^2)$ and bringing these inside the expectation operator yields:

$\frac{\partial Q}{\partial \pi_i} = E\Bigl[\frac{\partial Q(\Psi; \hat\Psi^{(k)}, \bar\varepsilon)}{\partial \pi_i} \Big| \bar e\Bigr] = E\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\frac{1}{\pi_i} \Big| \bar e_a\Bigr]$

$\frac{\partial Q}{\partial \mu_i} = E\Bigl[\frac{\partial Q(\Psi; \hat\Psi^{(k)}, \bar\varepsilon)}{\partial \mu_i} \Big| \bar e\Bigr] = E\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\frac{1}{\sigma_i^2}(\bar e_a - \bar\varepsilon_a - \mu_i) \Big| \bar e_a\Bigr]$

$\frac{\partial Q}{\partial \sigma_i^2} = E\Bigl[\frac{\partial Q(\Psi; \hat\Psi^{(k)}, \bar\varepsilon)}{\partial \sigma_i^2} \Big| \bar e\Bigr] = E\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\frac{1}{2\sigma_i^2}\Bigl(\frac{(\bar e_a - \bar\varepsilon_a - \mu_i)^2}{\sigma_i^2} - 1\Bigr) \Big| \bar e_a\Bigr].$

The expressions for $\hat\mu_i^{(k+1)}$ and $\hat\sigma_i^{2(k+1)}$ presented in Estimation Method 1-A are readily obtained after setting the partial derivatives with respect to $\mu_i$ and $\sigma_i^2$ equal to zero and solving for $\mu_i$ and $\sigma_i^2$. Finding the value of $\pi_i$ that maximizes $Q(\Psi; \hat\Psi^{(k)})$ is a Lagrange optimization problem that incorporates the constraint that $\sum_i \pi_i = 1$. The corresponding first-order condition solves:

$E\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\frac{1}{\pi_i} \Big| \bar e_a\Bigr] = \lambda,$   (57)

where $\lambda$ denotes the Lagrange multiplier. Rearranging terms and summing both sides over $i$ yields:

$\sum_i E\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar\varepsilon_a) \Big| \bar e_a\Bigr] = \lambda \sum_i \pi_i.$   (58)

Since $\sum_i \hat\tau_{ai}^{(k)}(\bar\varepsilon_a) = \sum_i \pi_i = 1$ by definition, we are left with:

$\lambda = \sum_a 1 = A.$   (59)

By substituting this into eq. (57) and solving for $\pi_i$, we obtain the desired expression for $\hat\pi_i^{(k+1)}$.
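For reference, the sketch below implements the standard EM iteration described at the beginning of this subsection for the benchmark case in which u_a is treated as observed data; the full estimator derived above additionally integrates out ε̄_a. The starting values and the simulated sample are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def em_normal_mixture(u, pi, mu, sig2, n_iter=200):
    """Standard EM for a univariate normal mixture fitted to an observed
    sample u (standing in for the area effects u_a if they were observed).
    pi, mu, sig2 are starting values for the mixing probabilities,
    component means, and component variances."""
    u = np.asarray(u, float)[:, None]                  # shape (A, 1)
    pi, mu, sig2 = map(lambda v: np.asarray(v, float).copy(), (pi, mu, sig2))
    for _ in range(n_iter):
        # E-step: tau_ai = posterior probability that u_a comes from component i
        dens = pi * norm.pdf(u, mu, np.sqrt(sig2))     # shape (A, m_u)
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates of (pi, mu, sig2)
        pi = tau.mean(axis=0)
        mu = (tau * u).sum(axis=0) / tau.sum(axis=0)
        sig2 = (tau * (u - mu) ** 2).sum(axis=0) / tau.sum(axis=0)
    return pi, mu, sig2

# Hypothetical usage: draw u_a from a two-component mixture and recover it
# (estimates should be close to the generating values, up to label switching).
rng = np.random.default_rng(0)
comp = rng.random(5000) < 0.3
u = np.where(comp, rng.normal(-1.0, 0.4, 5000), rng.normal(0.5, 0.8, 5000))
print(em_normal_mixture(u, pi=[0.5, 0.5], mu=[-0.5, 0.5], sig2=[1.0, 1.0]))
```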
References

Araujo, M., Ferreira, F., Lanjouw, P. and Ozler, B. (2008). Local inequality and project choice: Theory and evidence from Ecuador. Journal of Public Economics, 92, 1022–1046.

Cordy, C. and Thomas, D. (1997). Deconvolution of a distribution function. Journal of the American Statistical Association, 92, 1459–1465.

Demombynes, G. and Ozler, B. (2005). Crime and local inequality in South Africa. Journal of Development Economics, 76, 265–292.

Dempster, A., Laird, N. and Rubin, D. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Elbers, C., Fujii, T., Lanjouw, P., Ozler, B. and Yin, W. (2007). Poverty alleviation through geographic targeting: How much does disaggregation help? Journal of Development Economics, 83, 198–213.

Elbers, C., Lanjouw, J. and Lanjouw, P. (2003). Micro-level estimation of poverty and inequality. Econometrica, 71, 355–364.

Fujii, T. (2010). Micro-level estimation of child undernutrition indicators in Cambodia. The World Bank Economic Review, 24, 520–553.

Henderson, C. (1953). Estimation of variance and covariance components. Biometrics, 9, 226–253.

Kleiber, C. (2007). A guide to the Dagum distributions. WWZ Working Paper, 23/07.

McCulloch, C. and Neuhaus, J. (2011). Misspecifying the shape of a random effects distribution: Why getting it wrong may not matter. Statistical Science, 26, 388–402.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. New York: John Wiley & Sons.

Molina, I. and Rao, J. (2010). Small area estimation of poverty indicators. Canadian Journal of Statistics, 38, 369–385.

Searle, S., Casella, G. and McCulloch, C. (1992). Variance Components. New York: Wiley.

Skrondal, A. and Rabe-Hesketh, S. (2009). Prediction in multilevel generalized linear models. Journal of the Royal Statistical Society, 172, 659–687.

Verbeke, G. and Lesaffre, E. (1996). A linear mixed-effects model with heterogeneity in the random effects population. Journal of the American Statistical Association, 91, 217–221.

Wang, S. and Yin, S. (2002). A new estimate of the parameters in linear mixed models. Science in China Series A: Mathematics, 45, 1301–1311.

Westfall, P. (1987). A comparison of variance component estimates for arbitrary underlying distributions. Journal of the American Statistical Association, 82, 866–874.

Wu, M., Yu, K. and Liu, A. (2009). Estimation of variance components in the mixed effects models: A comparison between analysis of variance and spectral decomposition. Journal of Statistical Planning and Inference, 139, 3962–3973.

Zhou, T. and He, X. (2008). Three-step estimation in linear mixed models with skew-t distributions. Journal of Statistical Planning and Inference, 138, 1542–1555.