Estimation of Normal Mixtures in a Nested Error Model with an Application to Small Area Estimation of Poverty and Inequality

Chris Elbers and Roy van der Weide

Policy Research Working Paper 6962, Development Research Group, Poverty and Inequality Team, July 2014

Abstract

This paper proposes a method for estimating distribution functions that are associated with the nested errors in linear mixed models. The estimator incorporates Empirical Bayes prediction while making minimal assumptions about the shape of the error distributions. The application presented in this paper is the small area estimation of poverty and inequality, although this is by no means the only application. Monte Carlo simulations show that estimates of poverty and inequality can be severely biased when the non-normality of the errors is ignored. The bias can be as high as 2 to 3 percent on a poverty rate of 20 to 30 percent. Most of this bias is resolved when using the proposed estimator. The approach is applicable to both survey-to-census and survey-to-survey prediction.

This paper is a product of the Poverty and Inequality Team, Development Research Group. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The authors may be contacted at rvanderweide@worldbank.org. The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Keywords: Normal mixtures, linear mixed models, small area estimation, Empirical Bayes, poverty, inequality
JEL Classification: I32, C31, C42

Chris Elbers (c.t.m.elbers@vu.nl) is at VU University Amsterdam and the Tinbergen Institute. Roy van der Weide (rvanderweide@worldbank.org) is at the World Bank. We gratefully acknowledge financial support from VU University Amsterdam and the World Bank's Knowledge for Change Program (KCP II). A thank you also goes to Peter Lanjouw for providing comments on earlier versions of this paper.

1 Introduction

We propose a method for estimating distribution functions that are associated with the nested errors in linear mixed models. The proposed estimator accommodates Empirical Bayes prediction and makes minimal assumptions about the shape of the error distributions. Our objective is to accurately predict nonlinear functions of the dependent variable, where the shape of the error distribution functions potentially plays an important role. The application we have in mind is the small area estimation of poverty and inequality, although this is by no means the only application.
This particular application has been made popular by the work of Elbers, Lanjouw and Lanjouw (2003; henceforward ELL). The approach they put forward has since been applied to obtain maps of poverty and inequality in over 60 countries worldwide. The highly disaggregated estimates of poverty and inequality have in turn inspired a range of other applications, for example: Demombynes and Ozler (2005) investigate whether inequality at the small area level has an impact on local crime rates, and find that it does using data from South Africa; Elbers et al. (2007) conduct an empirical experiment in order to estimate by how much one could potentially lower the costs of getting resources to the poor if one had access to a poverty map, and conclude that the gains can be substantial; Araujo et al. (2008) examine whether villages with higher levels of inequality are less likely to invest in public goods that would benefit the poor, and using data from Ecuador they find empirical support for this hypothesis, which they attribute to elite capture. Recently, Fujii (2010) modified the approach to make it better suited for the small area estimation of child malnutrition outcomes, with an application to Cambodia.

1.1 Problem statement

Consider the following standard linear mixed model for log income of household $h$ residing in area $a$:

$y_{ah} = x_{ah}\beta + u_a + \varepsilon_{ah}$.

Let $y_d$ denote a vector of log incomes for all households from domain $d$ (where "domain" typically refers to a small geographic area). We are interested in estimating $W(y_d)$, where $W$ is some (possibly nonlinear) function of $y_d$. It is assumed that we have data on $x$ for the entire population, while data on both $x$ and $y$ is available for a sample $S$ of households only. ELL propose to estimate $W$ by $E[W(y_d)\,|\,x \in d;\ (x,y) \in S]$, where the expectation is taken over the unobserved errors $u_a$ and $\varepsilon_{ah}$. Because $W$ is nonlinear, the shape of the error distributions will matter for the expected value of $W$. Estimation of $\beta$ is obviously of primary importance when estimating $E[W]$, but in this paper we will concentrate on the error distributions and assume, except for the application in Section 5, that $\beta$ is known. We argue that even if $\beta$ is estimated perfectly, getting the error distribution wrong still has the potential to introduce a significant bias.

ELL felt their approach would be most convincing if they made minimal assumptions about the errors $u_a$ and $\varepsilon_{ah}$. They proceed by first obtaining estimates of the errors $u_a$ and $\varepsilon_{ah}$, which then allows them to sample from the empirical errors $\hat u_a$ and $\hat\varepsilon_{ah}$. $u_a$ is estimated as the simple area average of the total residuals, appropriately re-scaled so that the sample variance of $\hat u_a$ equals the estimate of $\sigma_u^2$ that corrects for the contribution of the area average of $\varepsilon_{ah}$. The estimate for $\varepsilon_{ah}$ is obtained by subtracting $\hat u_a$ from the total residual, re-scaled so that its sample variance matches $\sigma_\varepsilon^2$. (Note that ELL allow for heteroskedasticity, so that $\sigma_\varepsilon^2$ can be household-specific.)

This non-parametric approach to estimating the error distributions has two limitations. Firstly, the procedure does not adequately account for the fact that $\hat u_a$ equals a sum of $u_a$ and $\bar\varepsilon_a$. This latter term is often large enough to affect the ability of the empirical distribution of $\hat u_a$ to reproduce the shape of the actual distribution of $u_a$. Simply ignoring this "contamination" may result in a significant bias. If the empirical distribution of $u_a$ is biased, then this bias will also have implications for the empirical distribution of $\varepsilon_{ah}$.
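To see the nature of the contamination, the following minimal sketch (ours, not part of ELL's procedure, with purely hypothetical parameter values) simulates a skewed area effect and shows that the area averages of the total residuals, $\hat u_a = \bar e_a = u_a + \bar\varepsilon_a$, are more dispersed and less skewed than the true $u_a$:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
A, n_a = 500, 15                              # hypothetical: areas, sampled households per area

# Skewed area effect (centred exponential) and normal household errors
u = rng.exponential(scale=0.2, size=A) - 0.2  # E[u] = 0, right-skewed
eps = rng.normal(scale=0.4, size=(A, n_a))

e_bar = u + eps.mean(axis=1)                  # area averages of total residuals: u_a + eps_bar_a

print("var(u), var(e_bar):", u.var(), e_bar.var())      # variance inflated by sigma_eps^2 / n_a
print("skew(u), skew(e_bar):", skew(u), skew(e_bar))    # skewness attenuated relative to u
```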
This is also referred to as a "convolution problem", where the objective is to estimate the distribution function of a random variable that is observed with error. Secondly, the approach adopted by ELL does not easily lend itself to Empirical Bayes (EB) estimation, where the distribution of $u_a$ is tightened by conditioning on the household data $y$ and $x$ available for domain $d$ (i.e. in the event that some households in domain $d$ are included in the sample). Working out the conditional distribution is not a trivial exercise without making further distributional assumptions. Consequently ELL decided to forego EB estimation altogether. In doing so, they have accepted a certain loss in efficiency by not fully utilizing all available information.

Molina and Rao (2010) recently picked up on this and put forward an alternative approach that does implement EB estimation. They take ELL as a point of departure but then assume that both $u_a$ and $\varepsilon_{ah}$ are normally distributed, in which case the conditional distribution too will be normally distributed. Where ELL accept a loss in precision by not implementing EB estimation, Molina and Rao (2010) accept a loss in precision that might stem from a misspecification of the error distribution functions. The data at hand will ultimately determine which of the two will be accepting the larger loss. ELL are arguably most interested in estimating poverty and inequality in developing countries, where the number of small areas (or domains) covered by the income surveys is often small, say 5 to 25 percent of all domains in the population. In this case the benefits of EB estimation will be modest (as survey data are available for only few areas). However, in more developed countries, or countries where the travel costs incurred when covering all small areas are manageable, income surveys often cover a much larger number of the domains. In fact, there are numerous examples where surveys cover between 50 and 100 percent of all domains in the country. In those instances, there may be clear benefits to adopting EB estimation.

The approach presented in this paper improves on both ELL and Molina and Rao (2010). Like ELL we make no restrictive assumptions about the error distributions. Our estimator for the distribution functions will generally be more accurate than the estimator adopted by ELL, however, as we explicitly account for the nested error structure (which is responsible for the "convolution problem"). Unlike ELL we also accommodate EB estimation. We achieve this by fitting finite normal mixtures (NM) to the error distribution functions. Normal mixtures are extremely flexible; they are able to fit any well-behaved distribution function, and are ideally suited for accommodating EB estimation. If the marginal distributions of $u_a$ and $\varepsilon_{ah}$ can be described by normal mixtures, then the conditional distribution too can be described by a normal mixture with known parameters (that are functions of these parameters and of the data on which one conditions). Estimation of the normal mixtures for $u_a$ and $\varepsilon_{ah}$ is complicated by the fact that neither $u_a$ nor $\varepsilon_{ah}$ is observed. Our estimator for the NM parameters may be viewed as a modified version of the EM algorithm. Monte Carlo simulations indicate that estimates of poverty and inequality can be severely biased when ignoring the non-normality of the errors.
The bias can be as high as 2 to 3 percent on a poverty rate of 20 to 30 percent. Most of this bias is resolved when implementing our estimator. This is confirmed by an empirical application to US data.

1.2 Normal mixtures in nested-error models

There are a number of other studies that have explored different ways of relaxing the normality assumption in mixed linear models. Verbeke and Lesaffre (1996) is an early example that also considers normal mixtures as a "non-parametric" representation of the error distribution function. However, they impose a number of important restrictions. First, only the area random effects $u_a$ are allowed to be non-normal; a normal mixture is fitted to the distribution of $u_a$ under the assumption that $\varepsilon_{ah}$ is normally distributed. Second, it is assumed that the "component distributions" that make up the normal mixture share a common variance, which noticeably simplifies estimation but at the same time significantly limits the flexibility of the normal mixture to fit any given distribution function. This approach has also been followed by Cordy and Thomas (1997), who work with the same setup and adopt the same set of restrictions. There is another strand of the literature that permits both error terms to be non-normally distributed by imposing an alternative parametric family for the distribution functions. See for example Zhou and He (2008), who fit skewed t-distributions to the nested errors of the linear mixed model. Recently, there have also been efforts to explore the impact of misspecifications in the error distributions on Empirical Bayes predictions derived from linear mixed models; see for example Skrondal and Rabe-Hesketh (2009) and McCulloch and Neuhaus (2011). They conclude that the bias is reasonably small. It should be noted however that those studies focus on prediction of the dependent variable itself, in which case the Empirical Bayes estimates of the area random effects $u_a$ denote the only source of bias. Misspecifications in the error distributions become considerably more important when predicting nonlinear functions of the dependent variable, such as measures of poverty and inequality, as we will show in this paper.

The remainder of this paper is organized as follows. In Section 2 we briefly discuss how the distribution functions associated with the errors in a linear mixed model matter for prediction, with an application to poverty and inequality measurement. In this section we will also introduce normal mixtures as a flexible "non-parametric" representation of any given well-behaved distribution function, and demonstrate the implications for EB estimation. Estimation of the normal mixture distributions for both errors from the linear mixed model is presented in Section 3. A modest Monte Carlo simulation study followed by an equally modest empirical application is provided in Sections 4 and 5, respectively. Finally, Section 6 concludes.

2 Estimation of poverty and inequality: Distributions matter

2.1 Linear mixed model for income

Suppose that at the household level the data generating process (DGP) satisfies the equation already mentioned above:

$y_{ah} = x_{ah}^T\beta + u_a + \varepsilon_{ah}$,   (1)

where $x_{ah}$ denotes a vector with independent variables, and where $u_a$ and $\varepsilon_{ah}$ denote zero-expectation error terms that are independent of each other. The subscripts indicate the target area (or domain) $a$ and the household $h$. (These areas may refer to geographic areas such as districts or municipalities, but also to non-geographic domains such as ethnic groups or age groups, say.) In this paper we assume that errors are homoskedastic, so that for each household $h$ and area $a$ we have $\operatorname{var}[y_{ah}|x_{ah}] = \sigma_u^2 + \sigma_\varepsilon^2$.
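For reference in the sections that follow, a minimal sketch (with hypothetical parameter values and, purely for illustration, normal errors) of generating a census population from the nested error model in equation (1) and drawing a survey sample of $n_a$ households per area:

```python
import numpy as np

rng = np.random.default_rng(1)
A, N_a, n_a = 100, 2000, 15            # hypothetical: areas, households per area, sampled per area
beta = np.array([1.0, 0.5])            # hypothetical coefficients (intercept, slope)
sigma_u, sigma_eps = 0.15, 0.45        # hypothetical standard deviations of u_a and eps_ah

# Census: covariates, area effects and household errors
x = np.column_stack([np.ones(A * N_a), rng.normal(size=A * N_a)])
area = np.repeat(np.arange(A), N_a)
u = rng.normal(scale=sigma_u, size=A)
eps = rng.normal(scale=sigma_eps, size=A * N_a)
y = x @ beta + u[area] + eps           # equation (1)

# Survey: keep the first n_a households of each area (y is observed only for these)
in_sample = np.concatenate([np.arange(a * N_a, a * N_a + n_a) for a in range(A)])
y_s, x_s, area_s = y[in_sample], x[in_sample], area[in_sample]
```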
Throughout the paper it is assumed that consistent estimators for the variance parameters are available, which we shall denote by $\hat\sigma_u^2$ and $\hat\sigma_\varepsilon^2$. (A commonly used estimator for the variance parameters of a nested error model is Henderson's method III estimator (see Henderson, 1953; and Searle et al., 1992), which may be viewed as a method of moments estimator that does not require any assumptions about the shape of the error distributions. Alternative estimators are restricted maximum likelihood, minimum norm quadratic unbiased estimation (MINQUE; see e.g. Westfall, 1987; Searle et al., 1992), and so-called spectral decomposition estimation (see e.g. Wang and Yin, 2002; and Wu et al., 2009).) We will not make any assumptions about the shape of the error distributions.

Let $A$ denote the total number of areas covered by the income survey and let $n_a$ denote the number of households that have been sampled in area $a$, so that $n = \sum_{a=1}^A n_a$ denotes the total sample size of the survey. We shall denote the total household error by $e_{ah} = y_{ah} - x_{ah}^T\beta$, and its area-$a$ average by $\bar e_a = \bar y_a - \bar x_a^T\beta$, where $\bar e_a = \sum_h e_{ah}/n_a$. We shall also use the notation $e_a = (e_{a,1}, \ldots, e_{a,n_a})$, which denotes the vector of length $n_a$ with residuals for all households from area $a$. With a slight abuse of terminology we will at times refer to the errors $e_{ah}$ and $\bar e_a$ as data (as if we know the parameter vector $\beta$).

Let the probability distribution functions for $u_a$ and $\varepsilon_{ah}$ be denoted by $F_u$ and $G_\varepsilon$. We will propose a computationally attractive method for estimating these distribution functions, where we make no restrictive assumptions concerning their functional form, in particular allowing these functions to be other than normal distribution functions. Another appealing feature of our non-parametric estimator for the error distribution functions is that it can easily accommodate Empirical Bayes estimation.

2.2 Empirical Bayes estimation

Empirical Bayes (EB) estimation, also known as Empirical Best estimation, tightens the error distributions by conditioning on all available data in the survey for the purpose of prediction. The "observation" $\bar e_a$ (provided that area $a$ is covered by the survey) clearly carries information about the area random effect $u_a$. In the extreme, for example, where $n_a$ tends to infinity, $\bar e_a$ will perfectly reveal $u_a$. In effect EB estimates are obtained by integrating out the area errors $u_a$ using the probability density $p(u_a|e_a)$, i.e. the probability density of $u_a$ conditional on $e_a$ observed from the survey. (Non-EB estimates are obtained by using the unconditional density $p(u_a)$.) The challenge is to work out $p(u_a|e_a)$ along with the marginals $p(u_a)$ and $p(\varepsilon_{ah})$ without imposing restrictions on their form, so that the resulting conditional density $p(u_a|e_a)$ can also take on any form. Currently, the literature on EB estimation avoids this challenge by assuming normally distributed marginals, in which case the conditional distribution will be normal too.

For normally distributed errors it can be shown that $p(u_a|e_a)$ can be written as a function of $\bar e_a$, the mean of sample errors from domain $a$. This is not true for general error distributions; conditioning on the full vector $e_a$, however, becomes computationally intractable.
Therefore we will condition on $\bar e_a$ rather than the full vector $e_a$ even in the case of non-normal errors.

A second simplification concerns the known regression residuals for sample households. When conditioning on $e_a$, we ignore that for these households the total error $e_{ah} = u_a + \varepsilon_{ah}$ is known (since it is one of the components of $e_a$). Instead we will assume that $\varepsilon_{ah}$ is independent of $e_a$ even for sample households. The reason is that in practice sample household $h$ cannot be traced among the households in the target domain $a$ (households for which only data on $x$ is available). In most practical settings, the error thus introduced will be negligible. In fact, this error tends to zero as the size of the sample relative to the population size tends to zero.

2.3 Distributions matter

Let $y_{(a)}$ and $e_{(a)}$ denote vectors of length $N_a$ with elements $y_{ah}$ and $e_{ah}$ for all households from the population. Similarly, $x_{(a)}$ will denote a matrix with rows given by $x_{ah}^T$ for all $N_a$ households. $y_a$, $e_a$ and $x_a$ will denote the survey sample analogues. As mentioned above, our objective is to estimate:

$E[W(y_{(a)})\,|\,x_{(a)}, y_a] = \int W(x_{(a)}\beta + e)\, p(e|e_a)\, de$   (2)
$\qquad\qquad\qquad\quad\;\; = \int\!\!\int W(x_{(a)}\beta + u + \varepsilon)\, p(\varepsilon)\, p(u|e_a)\, d\varepsilon\, du$,   (3)

where the function $W$ will generally be non-linear. Non-normality of the errors $u$ and $\varepsilon$ will affect the estimates of $E[W]$ via two different channels. Firstly, when the function $W$ is indeed non-linear, the expected value $E[W]$ will be a function of the higher moments of the distributions of $u$ and $\varepsilon$. Getting these moments wrong, in other words getting the distributions wrong, is then likely to introduce bias. Secondly, if the distributions of $u$ and $\varepsilon$ are wrong, then the conditional density $p(u|e_a)$, which concerns EB estimation, will also be wrong. This will affect all moments of the conditional distribution of $u$, including the first, and hence has the potential to introduce bias in the estimate of $E[W]$ even if $W$ is linear. (Note that in the case of linear $W$, non-EB estimates of $E[W]$ are unbiased even if the distributions for $u$ and $\varepsilon$ are wrong.) Recently, Skrondal and Rabe-Hesketh (2009) and McCulloch and Neuhaus (2011) have explored the magnitude of the bias that is introduced by getting the first moment of $p(u|e_a)$ wrong, in the case of linear $W$, and found it to be modest. Our estimator would therefore be most relevant for the case of nonlinear $W$.

2.4 Empirical Bayes estimation with normal mixtures

Let us assume that $F_u$ and $G_\varepsilon$ can be represented by mixture distributions:

$F_u = \sum_{i=1}^{m_u} \pi_i F_i$   (4)
$G_\varepsilon = \sum_{j=1}^{m_\varepsilon} \lambda_j G_j$,   (5)

where the $F_i$'s and $G_j$'s denote a basis of distribution functions which we will also refer to as components or component distribution functions. The $\pi_i$'s and $\lambda_j$'s denote unknown non-negative probabilities that satisfy $\sum_i \pi_i = 1$ and $\sum_j \lambda_j = 1$, which we will also refer to as mixing probabilities. $m_u$ and $m_\varepsilon$ denote the number of components used to represent $F_u$ and $G_\varepsilon$, respectively. We will denote the probability density functions associated with $F_i$ and $G_j$ by $f_i$ and $g_j$, respectively. Mixture distributions are remarkably well equipped to fit any well-behaved distribution function. For example, kernel density estimators are closely related to mixture distributions. We will be working with normal component distributions, so that the mixture distributions are normal mixtures.

Assumption 1 The components $F_i$ are normal distribution functions with mean $\mu_i$ and variance $\sigma_i^2$. Similarly, the components $G_j$ are normal distribution functions with mean $\nu_j$ and variance $\omega_j^2$.
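To fix ideas, a minimal sketch of the objects in equations (4)-(5) and Assumption 1: a finite normal mixture defined by mixing probabilities and component means and variances, with its density and a sampler. The class name and all parameter values are ours and purely illustrative.

```python
import numpy as np
from scipy.stats import norm

class NormalMixture:
    """Finite normal mixture with density sum_i pi_i * N(mu_i, sigma2_i)."""
    def __init__(self, pi, mu, sigma2):
        self.pi, self.mu, self.sigma2 = (np.asarray(v, dtype=float) for v in (pi, mu, sigma2))

    def pdf(self, x):
        x = np.asarray(x, dtype=float)[..., None]
        return np.sum(self.pi * norm.pdf(x, loc=self.mu, scale=np.sqrt(self.sigma2)), axis=-1)

    def sample(self, size, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        comp = rng.choice(len(self.pi), size=size, p=self.pi)   # latent component indicators z
        return rng.normal(self.mu[comp], np.sqrt(self.sigma2[comp]))

# Example: a skewed two-component mixture for F_u (hypothetical parameters)
F_u = NormalMixture(pi=[0.7, 0.3], mu=[-0.05, 0.1167], sigma2=[0.01, 0.03])
draws = F_u.sample(10_000, rng=np.random.default_rng(2))
print(draws.mean(), draws.var())   # close to the implied mixture mean and variance
```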
Note that the modeler is at liberty to work with a different basis of component distributions. This choice does not have real implications for the ability of the mixture to fit a given distribution function.

If $p(u_a)$ and $p(\varepsilon_{ah})$ are normal mixtures, then $p(u_a|\bar e_a)$ is a normal mixture too. This is a powerful result, as the integral in equation (3) will generally have to be computed by simulation, and sampling from normal mixtures is straightforward. Lemma 2 shows how the parameters that define $p(u_a|\bar e_a)$ can be obtained as a function of the parameters of the normal mixtures $p(u_a)$ and $p(\varepsilon_{ah})$. Implementing EB estimation is thus as easy as sampling the area errors from the normal mixture $p(u_a|\bar e_a)$ whenever $\bar e_a$ is observed in the survey sample.

Lemma 2 The probability density function of $u_a$ conditional on $\bar e_a$, which we denote by $p(u_a|\bar e_a)$, is a normal mixture with known parameters:

$p(u_a|\bar e_a) = \sum_{i=1}^{m_u}\sum_{j=1}^{m_\varepsilon}\sum_{k=1}^{m_\varepsilon} w_{ijk}\,\varphi(u_a;\ m_{ijk},\ s^2_{ijk})$,   (6)

where:

$m_{ijk} = \dfrac{\sigma_i^2}{\sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a}\Big(\bar e_a - \dfrac{\nu_j+\nu_k}{n_a}\Big) + \dfrac{\omega_j^2+\omega_k^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}\,\mu_i$

$s^2_{ijk} = \dfrac{(\omega_j^2+\omega_k^2)\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}$,

and where $w_{ijk} = \tilde w_{ijk}/\sum_{ijk}\tilde w_{ijk}$ with:

$\tilde w_{ijk} = \pi_i\lambda_j\lambda_k\,\varphi\big(\bar e_a;\ \mu_i + (\nu_j+\nu_k)/n_a,\ \sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a\big)$,   (7)

where $\varphi$ denotes the normal probability density function, and where $(\pi_i, \mu_i, \sigma_i^2)$ and $(\lambda_j, \nu_j, \omega_j^2)$ denote the parameters associated with the normal-mixture distributions $F_u$ and $G_\varepsilon$, respectively.

If we apply the Central Limit Theorem to approximate the marginal distribution of $\bar\varepsilon_a$ by a normal distribution with mean zero and variance equal to $\sigma_\varepsilon^2/n_a$, the expression for $p(u_a|\bar e_a)$ simplifies considerably:

$p(u_a|\bar e_a) \approx \sum_{i=1}^{m_u} \alpha_i\,\varphi\big(u_a;\ \gamma_{ai}\bar e_a + (1-\gamma_{ai})\mu_i,\ (1/\sigma_i^2 + n_a/\sigma_\varepsilon^2)^{-1}\big)$,   (8)

with $\gamma_{ai} = \sigma_i^2/(\sigma_i^2 + \sigma_\varepsilon^2/n_a)$, and where $\alpha_i = \tilde\alpha_i/\sum_i\tilde\alpha_i$ with:

$\tilde\alpha_i = \pi_i\,\varphi(\bar e_a;\ \mu_i,\ \sigma_i^2 + \sigma_\varepsilon^2/n_a)$.   (9)

The expected value of $u_a$ conditional on $\bar e_a$, given the density function $p(u_a|\bar e_a)$ from Lemma 2, is seen to solve:

$E[u_a|\bar e_a] \approx \sum_i \alpha_i(\bar e_a)\big(\gamma_{ai}\bar e_a + (1-\gamma_{ai})\mu_i\big)$,   (10)

where $\gamma_{ai} = \sigma_i^2/(\sigma_i^2 + \sigma_\varepsilon^2/n_a)$, and where $\alpha_i(\bar e_a)$ denotes the mixing probabilities of $p(u_a|\bar e_a)$.

Note that the standard assumption of normal errors is nested as a special case, where there is just one component with $\mu_i = 0$ and $\sigma_i^2 = \sigma_u^2$. The first and second moments of the conditional density $p(u_a|\bar e_a)$ in this case are seen to solve:

$E[u_a|\bar e_a] = \gamma_a \bar e_a$   (11)
$\operatorname{var}[u_a|\bar e_a] = (1-\gamma_a)\sigma_u^2$,   (12)

where $\gamma_a = \sigma_u^2/(\sigma_u^2 + \sigma_\varepsilon^2/n_a)$ (see e.g. Molina and Rao, 2010). (Note that the unconditional variance solves $\operatorname{var}[u_a] = \operatorname{var}[\gamma_a\bar e_a] + (1-\gamma_a)\sigma_u^2 = \gamma_a^2(\sigma_u^2 + \sigma_\varepsilon^2/n_a) + (1-\gamma_a)\sigma_u^2$, which equals $\gamma_a\sigma_u^2 + (1-\gamma_a)\sigma_u^2 = \sigma_u^2$, since $\sigma_u^2 + \sigma_\varepsilon^2/n_a = \sigma_u^2/\gamma_a$.) For non-normal errors, $E[u_a|\bar e_a]$ is generally a non-linear function of $\bar e_a$, and $\operatorname{var}[u_a|\bar e_a]$ will generally be a function of $\bar e_a$.

2.5 Application to small area estimation of poverty and inequality

For our application let $y_{ah}$ measure per capita income (or expenditure) for household $h$ residing in area $a$, and let $s_{ah}$ denote the number of household members for that same household. In vector notation, let $y_{(a)}$ and $s_{(a)}$ be vectors with elements $y_{ah}$ and $s_{ah}$ for all households from area $a$. The objective is to determine the level of welfare for area $a$, which can be expressed as a function of $y_{(a)}$ and $s_{(a)}$: $W(y_{(a)}, s_{(a)})$. The welfare function $W$ is typically non-linear. Popular examples are the share of individuals whose income falls below a pre-specified poverty line (also known as the head-count poverty rate), or the Gini index of income inequality.
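For concreteness, a minimal sketch of two such welfare functions, the head-count poverty rate and the Gini index, computed from a vector of log per capita incomes and household sizes (person weights). This is a standard textbook implementation, not code from ELL or MR; the function names are ours.

```python
import numpy as np

def headcount(log_y, size, z_log):
    """Head-count poverty rate: share of persons with log per capita income below z_log."""
    log_y, size = np.asarray(log_y), np.asarray(size, dtype=float)
    return np.sum(size * (log_y < z_log)) / np.sum(size)

def gini(log_y, size):
    """Gini index of per capita income, persons weighted by household size (trapezoid Lorenz curve)."""
    y = np.exp(np.asarray(log_y, dtype=float))
    w = np.asarray(size, dtype=float)
    order = np.argsort(y)
    y, w = y[order], w[order]
    p = np.concatenate(([0.0], np.cumsum(w) / w.sum()))            # cumulative population shares
    L = np.concatenate(([0.0], np.cumsum(y * w) / np.sum(y * w)))  # cumulative income shares
    return 1.0 - 2.0 * np.sum((L[1:] + L[:-1]) / 2.0 * np.diff(p))

# Usage with hypothetical data
rng = np.random.default_rng(3)
log_y = rng.normal(1.0, 0.6, size=1000)
size = rng.integers(1, 6, size=1000)
print(headcount(log_y, size, z_log=0.5), gini(log_y, size))
```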
Collecting data on income (or expenditure) $y_{ah}$ for any given household is generally found to be expensive relative to collecting data on demographics, education, employment status, and housing. This is particularly true for developing countries, where much of the income does not come from wage employment. Consequently, income data is often only available in the form of so-called income surveys. The sample size of these surveys is sufficient to estimate national and possibly sub-regional welfare, but too small to estimate welfare directly at the level of much smaller areas (i.e. the target areas $a$).

Elbers et al. (2003; henceforward ELL) advocate an approach that combines the income survey with unit record population census data. The census has data on the independent variables $x_{ah}$ from equation (1), such as demographics, education, employment and housing, but not the household income variable $y_{ah}$. Crucially, the data on $x_{ah}$ are also collected by the income survey. The idea is to use the income survey to estimate the parameters from equation (1), and then use the model to predict income for every household in the census. With these predicted incomes we can subsequently estimate welfare $W$ for each target area $a$.

Standard errors can be obtained by means of simulation, which is ideally suited for estimating quantities that are non-linear functions of the random variables at hand, as is the case with measures of poverty and inequality. Let $R$ denote the number of simulations. The estimator then takes the form:

$\hat\mu = \dfrac{1}{R}\sum_{r=1}^{R} W\big(\tilde y_{(a)}^{(r)}, s_{(a)}\big)$,   (13)

where $\tilde y_{(a)}^{(r)}$ denotes the $r$-th simulated (or predicted) income vector with elements $\tilde y_{ah}^{(r)} = x_{ah}^T\tilde\beta^{(r)} + \tilde u_a^{(r)} + \tilde\varepsilon_{ah}^{(r)}$. With each simulation, both the model parameters $\tilde\beta^{(r)}$ and the errors $\tilde u_a^{(r)}$ and $\tilde\varepsilon_{ah}^{(r)}$ are drawn from their estimated distributions. (Our preferred method is to draw $\tilde\beta^{(r)}$ by re-estimating the model parameters using the $r$-th bootstrap version of the survey sample. Alternatively, $\tilde\beta^{(r)}$ may be drawn from its estimated asymptotic distribution. The difference between these two alternatives is expected to be modest, unless the survey sample is particularly small so that finite sample effects may play a role.) In the end this gives $R$ simulated poverty rates. The point estimates and their corresponding standard errors are obtained by computing respectively the average and the standard deviation over these simulated values.

It should be noted that ELL draw the area error $\tilde u_a$ from the estimated unconditional distribution, which is estimated non-parametrically. (The distribution for $\tilde\varepsilon_{ah}$ too is estimated non-parametrically.) The advantage of this approach is that it is fully flexible in that it does not restrict the shape of the error distributions. A possible shortcoming is that it does not take full advantage of all the available data. Ideally one would want to draw $\tilde u_a$ from a distribution that is conditioned on all relevant data that has been sampled from area $a$.

Molina and Rao (2010; henceforward MR) do exactly that: they closely follow ELL but then draw the area error from the conditional distribution. This is referred to as Empirical Bayes (EB) estimation. Importantly, MR also differ from ELL in that they restrict the errors to be normally distributed.
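A schematic sketch of the simulation estimator in equation (13) follows. The drawing functions are placeholders: in ELL they would return draws from the non-parametric unconditional error distributions, in MR from normal distributions (conditional on $\bar e_a$ for sampled areas), and in our approach from the estimated normal mixtures; the simple normal draws in the commented usage lines are there only to make the sketch concrete.

```python
import numpy as np

def simulate_welfare(X, area, size, draw_beta, draw_u, draw_eps, W, R=200, rng=None):
    """Equation (13): average a welfare function W over R simulated census income vectors."""
    rng = np.random.default_rng() if rng is None else rng
    areas = np.unique(area)
    sims = []
    for _ in range(R):
        beta_r = draw_beta(rng)                          # parameter uncertainty (e.g. bootstrap)
        u_r = {a: draw_u(a, rng) for a in areas}         # one area-effect draw per target area
        y_r = X @ beta_r + np.array([u_r[a] for a in area]) + draw_eps(len(area), rng)
        sims.append(W(y_r, size))
    sims = np.asarray(sims)
    return sims.mean(), sims.std()                       # point estimate and simulation standard error

# Placeholder draw functions (simple normal draws, for illustration only):
#   draw_beta = lambda rng: beta_hat
#   draw_u    = lambda a, rng: rng.normal(0.0, sigma_u_hat)
#   draw_eps  = lambda n, rng: rng.normal(0.0, sigma_eps_hat, size=n)
#   W         = lambda y, s: headcount(y, s, z_log=np.log(poverty_line))
```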
As we have seen, under this assumption the conditional distribution of $\tilde u_a$ is normal too and its parameters can be easily determined. (This is arguably why errors are generally assumed normal in the literature on EB estimation.) When errors are non-normal, it is not obvious what form the conditional distribution for $\tilde u_a$ will take; it will generally be of a different form than the unconditional distribution.

As is noted in section 2.3, getting the error distributions right is not merely a matter of efficiency. When the welfare function $W$ is a non-linear function of the error terms, using wrong error distributions will also introduce a bias in the welfare estimates. Whether the magnitude of this bias due to misspecification is important in practice is an empirical question.

The choice between non-normal errors combined with non-EB estimation (which is more flexible, but does not fully utilize all available data) or normal errors combined with EB estimation (which is less flexible, but fully utilizes the available data) may be determined/motivated by: (a) the degree of non-normality found in the data, and (b) how much information one stands to ignore/lose. The latter depends largely on: (i) how many areas have been sampled by the income survey (as for areas not represented in the survey EB and non-EB estimation are equivalent), and (ii) the size of the area error relative to the total error. The approach developed in this paper aims to combine the best of both worlds; we adopt EB estimation while permitting non-normal distribution functions. The non-parametric estimator for the distribution functions used by ELL does not lend itself to EB estimation. We propose a non-parametric estimator that does.

Remark 3 Note that an added advantage of representing the non-normal error distributions by normal mixtures is that $E[w(y_{ah})]$ can be evaluated as the sum of expectations over normal errors, i.e. $E[w(y_{ah})] = E[w(x_{ah}^T\beta + u_a + \varepsilon_{ah})] = \sum_{ij}\kappa_{ij} E[w(x_{ah}^T\beta + u_a^{(i)} + \varepsilon_{ah}^{(j)})]$ (with $\kappa_{ij} = \pi_i\lambda_j$), where $u_a^{(i)}$ and $\varepsilon_{ah}^{(j)}$ are normally distributed with means $\mu_i$ and $\nu_j$, and variances $\sigma_i^2$ and $\omega_j^2$. This means that evaluating $E[w(y_{ah})]$ analytically under normal-mixture errors is no more difficult than under normal errors.

3 Estimation of normal mixtures when errors are nested

3.1 Estimation of unrestricted normal mixtures

The objective is to estimate the parameters $(\pi_i, \mu_i, \sigma_i^2)$ and $(\lambda_j, \nu_j, \omega_j^2)$, which determine the normal mixtures (NM) $F_u(u) = \sum_{i=1}^{m_u} \pi_i F_i(u; \mu_i, \sigma_i^2)$ and $G_\varepsilon(\varepsilon) = \sum_{j=1}^{m_\varepsilon} \lambda_j G_j(\varepsilon; \nu_j, \omega_j^2)$, respectively, where $m_u$ and $m_\varepsilon$ denote the number of components.

Estimation of a NM from observed data is a routine task, mostly using the EM algorithm; see e.g. Dempster et al. (1977) and McLachlan and Peel (2000). Suppose that some data $y$ is drawn from a NM with $m$ components and parameter vector $\theta$. The algorithm introduces a latent random variable $z_{ij}$ that equals 1 if observation $y_j$ has been drawn from component $i$, and 0 otherwise. Let the log-likelihood function that treats both $y$ and $z$ as data be denoted by $L(\theta; y, z)$, and its derivative with respect to $\theta$ by $L_\theta(\theta; y, z)$. The corresponding maximum-likelihood (ML) estimator solves the following moment condition:

$E[L_\theta(\theta; y, z)] = E\big[E[L_\theta(\theta; y, z)\,|\,y]\big] = 0$.   (14)

The EM algorithm essentially solves the same moment condition but iteratively:

$E\big[E[L_\theta(\theta; y, z)\,|\,y, \hat\theta^{(k)}]\big] = 0$,   (15)

where $\hat\theta^{(k)}$ denotes the iteration-$k$ estimate of $\theta$. Solving equation (15) yields $\hat\theta^{(k+1)}$. The procedure results in a non-decreasing sequence of likelihoods $L(y; \hat\theta^{(k)})$. The advantage of the EM algorithm is that solving equation (15) is often considerably easier than solving equation (14). Note that the unobserved data $z$ is integrated out conditional on the observed data $y$ and the current estimate of $\theta$.
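For reference, a minimal sketch of the standard EM iteration for a univariate normal mixture fitted to directly observed data, i.e. the textbook algorithm that is modified below; this is not the paper's estimator, and the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def em_normal_mixture(y, pi, mu, sigma2, iters=100):
    """Standard EM for a univariate m-component normal mixture on observed data y."""
    y = np.asarray(y, dtype=float)
    pi, mu, sigma2 = (np.asarray(v, dtype=float).copy() for v in (pi, mu, sigma2))
    for _ in range(iters):
        # E-step: responsibilities P(z_ij = 1 | y_j, current parameters)
        dens = pi * norm.pdf(y[:, None], loc=mu, scale=np.sqrt(sigma2))
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of mixing probabilities, means and variances
        nk = tau.sum(axis=0)
        pi = nk / len(y)
        mu = (tau * y[:, None]).sum(axis=0) / nk
        sigma2 = (tau * (y[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, sigma2
```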
The nested error structure however poses a challenge that prevents a straightforward application of the EM algorithm. Conventionally, mixture distributions are estimated from data that are observed directly. We wish to estimate the distributions of $u_a$ and $\varepsilon_{ah}$, but we do not observe either of them. In some sense $u_a$ and $\varepsilon_{ah}$ are both observed with error (even if the total error $e_{ah}$ is known with certainty), which makes this a convolution problem.

We propose a modification of the EM algorithm that is able to deal with this convolution problem. In a nutshell, we treat the measurement error as a latent variable and integrate it out the same way as the latent variable $z$ is integrated out by the EM algorithm. When estimating $F_u$ (the NM for $u_a$), the observable data includes $\bar e_a = u_a + \bar\varepsilon_a$, where $\bar\varepsilon_a = \sum_h \varepsilon_{ah}/n_a$ will act as measurement error. The modified EM algorithm will then integrate out $\bar\varepsilon_a$ along with $z_a$. When estimating $G_\varepsilon$ (the NM for $\varepsilon_{ah}$), the observable data includes $e_{ah} = u_a + \varepsilon_{ah}$. Now $u_a$ takes on the role of measurement error and will hence be integrated out jointly with $z_a$. Note that we would need some initial estimate of $G_\varepsilon$ when estimating $F_u$ this way (as we have to integrate out $\bar\varepsilon$). Similarly, we need some initial estimate of $F_u$ when estimating $G_\varepsilon$ (to integrate out $u$). This by itself is very much in the spirit of the EM algorithm, which requires an initial estimate of $\theta$ to determine the distribution of $z$ conditional on the observed data.

Our modified EM algorithm solves the following moment condition when estimating $F_u$:

$\sum_a E\big[L_{\theta_u}(\theta_u;\ \bar e_a, \bar\varepsilon_a, z_a)\,\big|\,\bar e_a, \hat\theta_u^{(k)}\big] = 0$,   (16)

which is equivalent to:

$\sum_a E\Big[E\big[L_{\theta_u}(\theta_u;\ \bar e_a, \bar\varepsilon_a, z_a)\,\big|\,\bar e_a, \bar\varepsilon_a, \hat\theta_u^{(k)}\big]\,\Big|\,\bar e_a\Big] = 0$.   (17)

Similarly, it solves the following moment condition when estimating $G_\varepsilon$:

$\sum_a E\big[L_{\theta_\varepsilon}(\theta_\varepsilon;\ e_a, u_a, z_a)\,\big|\,e_a, \hat\theta_\varepsilon^{(k)}\big] = 0$,   (18)

which is equivalent to:

$\sum_a E\Big[E\big[L_{\theta_\varepsilon}(\theta_\varepsilon;\ e_a, u_a, z_a)\,\big|\,e_a, u_a, \hat\theta_\varepsilon^{(k)}\big]\,\Big|\,e_a\Big] = 0$,   (19)

where $e_a = (e_{a1}, \ldots, e_{a,n_a})$.

Specifically, in order to integrate out $\bar\varepsilon_a$ conditional on the observed $\bar e_a$ (see equation 17), we need an initial estimate of the probability distribution function of $\bar\varepsilon_a|\bar e_a$. It can be verified that $p(\bar\varepsilon_a|\bar e_a)$ is again a normal-mixture distribution.

Lemma 4 The probability density function of $\bar\varepsilon_a$ conditional on $\bar e_a$, which we denote by $p(\bar\varepsilon_a|\bar e_a)$, is a normal mixture with known parameters:

$p(\bar\varepsilon_a|\bar e_a) = \sum_{i=1}^{m_u}\sum_{j=1}^{m_\varepsilon}\sum_{k=1}^{m_\varepsilon} w_{ijk}\,\varphi(\bar\varepsilon_a;\ m_{ijk},\ s^2_{ijk})$,   (20)

where:

$m_{ijk} = \dfrac{\omega_j^2+\omega_k^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}\,(\bar e_a - \mu_i) + \dfrac{n_a\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}\,\dfrac{\nu_j+\nu_k}{n_a}$

$s^2_{ijk} = \dfrac{(\omega_j^2+\omega_k^2)\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}$,

and where $w_{ijk} = \tilde w_{ijk}/\sum_{ijk}\tilde w_{ijk}$ with:

$\tilde w_{ijk} = \pi_i\lambda_j\lambda_k\,\varphi\big(\bar e_a;\ \mu_i + (\nu_j+\nu_k)/n_a,\ \sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a\big)$,   (21)

where $\varphi$ denotes the normal probability density function, and where $(\pi_i, \mu_i, \sigma_i^2)$ and $(\lambda_j, \nu_j, \omega_j^2)$ denote the parameters associated with the normal-mixture distributions $F_u$ and $G_\varepsilon$, respectively.

Similarly, to integrate out $u_a$ conditional on $e_a$ (see equation 19), we need the probability distribution of $u_a|e_a$. As mentioned earlier, however, we use the distribution of $u_a|\bar e_a$, assuming that $p(u_a|e_a) \approx p(u_a|\bar e_a)$.
It follows that $u_a|\bar e_a$ too is normal-mixture distributed.

Lemma 5 The probability density function of $u_a$ conditional on $\bar e_a$, which we denote by $p(u_a|\bar e_a)$, is a normal mixture with known parameters:

$p(u_a|\bar e_a) = \sum_{i=1}^{m_u}\sum_{j=1}^{m_\varepsilon}\sum_{k=1}^{m_\varepsilon} w_{ijk}\,\varphi(u_a;\ m_{ijk},\ s^2_{ijk})$,   (22)

where:

$m_{ijk} = \dfrac{\sigma_i^2}{\sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a}\Big(\bar e_a - \dfrac{\nu_j+\nu_k}{n_a}\Big) + \dfrac{\omega_j^2+\omega_k^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}\,\mu_i$

$s^2_{ijk} = \dfrac{(\omega_j^2+\omega_k^2)\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}$,

and where $w_{ijk} = \tilde w_{ijk}/\sum_{ijk}\tilde w_{ijk}$ with:

$\tilde w_{ijk} = \pi_i\lambda_j\lambda_k\,\varphi\big(\bar e_a;\ \mu_i + (\nu_j+\nu_k)/n_a,\ \sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a\big)$,   (23)

where $\varphi$ denotes the normal probability density function, and where $(\pi_i, \mu_i, \sigma_i^2)$ and $(\lambda_j, \nu_j, \omega_j^2)$ denote the parameters associated with the normal-mixture distributions $F_u$ and $G_\varepsilon$, respectively.

It should be noted that by working with $p(u_a|\bar e_a)$ instead of $p(u_a|e_a)$ we will be solving a modified moment condition, namely:

$\sum_a E\Big[E\big[L_{\theta_\varepsilon}(\theta_\varepsilon;\ e_a, u_a, z_a)\,\big|\,\bar e_a, u_a, \hat\theta_\varepsilon^{(k)}\big]\,\Big|\,\bar e_a\Big] = 0$.   (24)

This denotes a genuine departure from the original EM algorithm, where one conditions on all the data that features in the log-likelihood function. We will refer to the resulting estimator as a pseudo-EM estimator. We expect the loss in precision to be minor, while the gain in practicality is substantial.

The resulting estimators, in the form of iterative equations, are presented below (see Annex 7.3 for a derivation). Given some initial estimate of $p(\bar\varepsilon_a|\bar e_a)$, we may implement the estimator for $F_u$. Subsequently, given this estimate for $F_u$, we may implement the estimator for $G_\varepsilon$. The newly obtained estimates can in turn be used to update our estimates of $p(\bar\varepsilon_a|\bar e_a)$ and $p(u_a|\bar e_a)$, after which we may obtain a new round of estimates for $F_u$ and $G_\varepsilon$. This is continued until convergence. (In practice, one iteration is found to be sufficient to obtain accurate estimates.)

Estimator for $F_u$

The fixed-point solution to the following set of iterative equations yields the estimator $(\hat\pi_i, \hat\mu_i, \hat\sigma_i^2)$ for $(\pi_i, \mu_i, \sigma_i^2)$, $i = 1, \ldots, m_u$:

$\hat\pi_i^{(k+1)} = \sum_a E\big[\hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big]\,/\,A$   (25)

$\hat\mu_i^{(k+1)} = \dfrac{\sum_a E\big[\hat\tau_{ai}^{(k)}(\bar\varepsilon_a)(\bar e_a - \bar\varepsilon_a)\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big]}{\sum_a E\big[\hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big]}$   (26)

$\hat\sigma_i^{2(k+1)} = \dfrac{\sum_a E\big[\hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\big(\bar e_a - \bar\varepsilon_a - \hat\mu_i^{(k+1)}\big)^2\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big]}{\sum_a E\big[\hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big]}$,   (27)

with:

$\hat\tau_{ai}^{(k)}(\bar\varepsilon_a) = \dfrac{\hat\pi_i^{(k)}\,\varphi\big(\bar e_a - \bar\varepsilon_a;\ \hat\mu_i^{(k)}, \hat\sigma_i^{2(k)}\big)}{\sum_i \hat\pi_i^{(k)}\,\varphi\big(\bar e_a - \bar\varepsilon_a;\ \hat\mu_i^{(k)}, \hat\sigma_i^{2(k)}\big)}$,   (28)

where $\varphi$ denotes the normal probability density function, and where the expectations are taken over $\bar\varepsilon_a$ conditional on $\bar e_a$ using the iteration-$k$ estimate of the conditional density function $\hat p^{(k)}(\bar\varepsilon_a|\bar e_a)$. Lemma 4 shows how $\hat p^{(k)}(\bar\varepsilon_a|\bar e_a)$ can be obtained as a function of $(\hat\pi_i^{(k)}, \hat\mu_i^{(k)}, \hat\sigma_i^{2(k)})$.
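A minimal sketch of one update of equations (25)-(28). Since $\hat\tau_{ai}$ depends on $\bar e_a - \bar\varepsilon_a = u_a$ only, the expectation over $\bar\varepsilon_a|\bar e_a$ is taken here by Monte Carlo draws of $u_a$ from the CLT-based conditional density of Corollary 7 given below; the exact Lemma 4 mixture could be substituted. The function name, variable names and the number of draws S are ours.

```python
import numpy as np
from scipy.stats import norm

def update_Fu(e_bar, n_a, pi, mu, sig2, sig_eps2, S=500, rng=None):
    """One pseudo-EM update of (pi_i, mu_i, sigma_i^2), equations (25)-(28)."""
    rng = np.random.default_rng() if rng is None else rng
    pi, mu, sig2 = (np.asarray(v, dtype=float) for v in (pi, mu, sig2))
    A, m = len(e_bar), len(pi)
    num_pi, num_mu = np.zeros(m), np.zeros(m)
    draws, taus = [], []
    for a in range(A):
        # p(u_a | e_bar_a) under the CLT approximation (Corollary 7 below)
        v = sig_eps2 / n_a[a]
        w = pi * norm.pdf(e_bar[a], loc=mu, scale=np.sqrt(sig2 + v))
        w /= w.sum()
        post_mean = (sig2 / (sig2 + v)) * e_bar[a] + (v / (sig2 + v)) * mu
        post_var = 1.0 / (1.0 / sig2 + n_a[a] / sig_eps2)
        comp = rng.choice(m, size=S, p=w)
        u = rng.normal(post_mean[comp], np.sqrt(post_var[comp]))  # draws of u_a = e_bar_a - eps_bar_a
        # Responsibilities (28), evaluated at u_a
        dens = pi * norm.pdf(u[:, None], loc=mu, scale=np.sqrt(sig2))
        tau = dens / dens.sum(axis=1, keepdims=True)
        num_pi += tau.mean(axis=0)                                # Monte Carlo E[tau_ai | e_bar_a]
        num_mu += (tau * u[:, None]).mean(axis=0)                 # Monte Carlo E[tau_ai * u_a | e_bar_a]
        draws.append(u); taus.append(tau)
    pi_new = num_pi / A                                           # equation (25)
    mu_new = num_mu / num_pi                                      # equation (26)
    num_s2 = sum((tau * (u[:, None] - mu_new) ** 2).mean(axis=0) for u, tau in zip(draws, taus))
    sig2_new = num_s2 / num_pi                                    # equation (27)
    return pi_new, mu_new, sig2_new
```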
Estimator for $G_\varepsilon$

The fixed-point solution to the following set of iterative equations yields the estimator $(\hat\lambda_j, \hat\nu_j, \hat\omega_j^2)$ for $(\lambda_j, \nu_j, \omega_j^2)$, $j = 1, \ldots, m_\varepsilon$:

$\hat\lambda_j^{(k+1)} = \sum_{ah} E\big[\hat\tau_{ahj}^{(k)}(u_a)\,\big|\,\bar e_a\big]\,/\,n$   (29)

$\hat\nu_j^{(k+1)} = \dfrac{\sum_{ah} E\big[\hat\tau_{ahj}^{(k)}(u_a)(e_{ah} - u_a)\,\big|\,\bar e_a\big]}{\sum_{ah} E\big[\hat\tau_{ahj}^{(k)}(u_a)\,\big|\,\bar e_a\big]}$   (30)

$\hat\omega_j^{2(k+1)} = \dfrac{\sum_{ah} E\big[\hat\tau_{ahj}^{(k)}(u_a)\big(e_{ah} - u_a - \hat\nu_j^{(k+1)}\big)^2\,\big|\,\bar e_a\big]}{\sum_{ah} E\big[\hat\tau_{ahj}^{(k)}(u_a)\,\big|\,\bar e_a\big]}$,   (31)

with:

$\hat\tau_{ahj}^{(k)}(u_a) = \dfrac{\hat\lambda_j^{(k)}\,\varphi\big(e_{ah} - u_a;\ \hat\nu_j^{(k)}, \hat\omega_j^{2(k)}\big)}{\sum_j \hat\lambda_j^{(k)}\,\varphi\big(e_{ah} - u_a;\ \hat\nu_j^{(k)}, \hat\omega_j^{2(k)}\big)}$,   (32)

where $\varphi$ denotes the normal probability density function, and where the expectations are taken over $u_a$ conditional on $\bar e_a$ using the probability density function $p(u_a|\bar e_a)$. See Lemma 5 for the probability density function of $u_a|\bar e_a$.

In order to get the iterative scheme started, we need suitable initial estimates of $p(\bar\varepsilon_a|\bar e_a)$ and $p(u_a|\bar e_a)$. Note that even without any prior knowledge about the distributions of $u_a$ and $\varepsilon_{ah}$, we should be able to obtain a reasonable estimate of the distribution of $\bar\varepsilon_a$ by appealing to the Central Limit Theorem (CLT). Under the assumption that the $\varepsilon_{ah}$ are independent across households, the distribution of $\bar\varepsilon_a$ tends to a normal distribution with mean zero and variance equal to $\sigma_\varepsilon^2/n_a$. In a typical household income survey, $n_a$ will be in the range of 10 to 100, which is sufficiently large for the CLT to take effect. The corollaries below derive the initial estimates of $p(\bar\varepsilon_a|\bar e_a)$ and $p(u_a|\bar e_a)$ under the assumption of normally distributed $\bar\varepsilon_a$ (with known variance).

Corollary 6 The probability density function of $\bar\varepsilon_a$ conditional on $\bar e_a$, which we denote by $p(\bar\varepsilon_a|\bar e_a)$, is a normal mixture with known parameters:

$p(\bar\varepsilon_a|\bar e_a) = \sum_{i=1}^{m_u} w_i\,\varphi\Big(\bar\varepsilon_a;\ \dfrac{\sigma_\varepsilon^2/n_a}{\sigma_\varepsilon^2/n_a + \sigma_i^2}(\bar e_a - \mu_i),\ \dfrac{\sigma_i^2\,\sigma_\varepsilon^2/n_a}{\sigma_i^2 + \sigma_\varepsilon^2/n_a}\Big)$,   (33)

where $w_i = \tilde w_i/\sum_i \tilde w_i$ with:

$\tilde w_i = \pi_i\,\varphi(\bar e_a;\ \mu_i,\ \sigma_i^2 + \sigma_\varepsilon^2/n_a)$,   (34)

where $\varphi$ denotes the normal probability density function, and where $\pi_i$, $\mu_i$, and $\sigma_i^2$ denote the parameters associated with the normal-mixture distribution $F_u$.

Corollary 7 The probability density function of $u_a$ conditional on $\bar e_a$, which we denote by $p(u_a|\bar e_a)$, is a normal mixture with known parameters:

$p(u_a|\bar e_a) = \sum_{i=1}^{m_u} \alpha_i\,\varphi\big(u_a;\ \gamma_{ai}\bar e_a + (1-\gamma_{ai})\mu_i,\ (1/\sigma_i^2 + n_a/\sigma_\varepsilon^2)^{-1}\big)$,   (35)

with $\gamma_{ai} = \sigma_i^2/(\sigma_i^2 + \sigma_\varepsilon^2/n_a)$, and where $\alpha_i = \tilde\alpha_i/\sum_i\tilde\alpha_i$ with:

$\tilde\alpha_i = \pi_i\,\varphi(\bar e_a;\ \mu_i,\ \sigma_i^2 + \sigma_\varepsilon^2/n_a)$,   (36)

where $\varphi$ denotes the normal probability density function, and where $\pi_i$, $\mu_i$, and $\sigma_i^2$ denote the parameters associated with the normal-mixture distribution $F_u$.

The following propositions provide some properties of the estimators.

Proposition 8 At every iteration $k$, the mean and variance of the probability density function $\hat g_\varepsilon^{(k)}(\varepsilon) = \sum_j \hat\lambda_j^{(k)}\varphi(\varepsilon;\ \hat\nu_j^{(k)}, \hat\omega_j^{2(k)})$ solve:

$E[\varepsilon;\ \hat g_\varepsilon^{(k)}] = \dfrac{1}{n}\sum_{ah}\Big(e_{ah} - E\big[u_a\,\big|\,\bar e_a;\ \hat p^{(k)}(u_a|\bar e_a)\big]\Big)$

$\operatorname{var}[\varepsilon;\ \hat g_\varepsilon^{(k)}] = \dfrac{1}{n}\sum_{ah} E\big[(e_{ah} - u_a)^2\,\big|\,\bar e_a;\ \hat p^{(k)}(u_a|\bar e_a)\big] - E^2[\varepsilon;\ \hat g_\varepsilon^{(k)}]$.

Proposition 9 At every iteration $k$, the mean and variance of the probability density function $\hat f_u^{(k)}(u) = \sum_i \hat\pi_i^{(k)}\varphi(u;\ \hat\mu_i^{(k)}, \hat\sigma_i^{2(k)})$ solve:

$E[u;\ \hat f_u^{(k)}] = \dfrac{1}{A}\sum_a\Big(\bar e_a - E\big[\bar\varepsilon_a\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big]\Big)$

$\operatorname{var}[u;\ \hat f_u^{(k)}] = \dfrac{1}{A}\sum_a E\big[(\bar e_a - \bar\varepsilon_a)^2\,\big|\,\bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\big] - E^2[u;\ \hat f_u^{(k)}]$.
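A compact, self-contained sketch of Corollary 7 (equations (35)-(36)) together with the corresponding conditional mean in equation (10), which is the EB "shrinkage" of $\bar e_a$ toward the component means; function and variable names are ours.

```python
import numpy as np
from scipy.stats import norm

def u_given_ebar(e_bar_a, n_a, pi, mu, sig2, sig_eps2):
    """Corollary 7: mixture weights, component means and variances of p(u_a | e_bar_a)."""
    pi, mu, sig2 = (np.asarray(v, dtype=float) for v in (pi, mu, sig2))
    gamma = sig2 / (sig2 + sig_eps2 / n_a)                      # per-component shrinkage factor
    alpha = pi * norm.pdf(e_bar_a, loc=mu, scale=np.sqrt(sig2 + sig_eps2 / n_a))
    alpha /= alpha.sum()                                        # equation (36), normalized
    cond_mean = gamma * e_bar_a + (1.0 - gamma) * mu            # component means in (35)
    cond_var = 1.0 / (1.0 / sig2 + n_a / sig_eps2)              # component variances in (35)
    return alpha, cond_mean, cond_var

def eb_mean(e_bar_a, n_a, pi, mu, sig2, sig_eps2):
    """Equation (10): E[u_a | e_bar_a] under the CLT approximation."""
    alpha, cond_mean, _ = u_given_ebar(e_bar_a, n_a, pi, mu, sig2, sig_eps2)
    return np.sum(alpha * cond_mean)

# With a single zero-mean component (mu = [0], sig2 = [sigma_u^2]) this reduces to the
# familiar normal-error shrinkage E[u_a | e_bar_a] = gamma_a * e_bar_a of equation (11).
```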
3.2 Some practical parameter restrictions

Let us also consider the case where the component distributions are assumed to be given, so that only the mixing probability vectors $\pi$ and $\lambda$ have to be estimated. This is obviously a special case of the more general approach where the probabilities are jointly estimated with the parameters of the component distributions. Keeping the latter fixed markedly benefits numerical convergence, so the approach deserves to be considered a distinct option in its own right. This is precisely the approach advocated by Cordy and Thomas (1997). The estimators for $\pi$ and $\lambda$ are obtained as solutions to the following iterative equations:

$\tilde\pi_i^{(k+1)} = \dfrac{1}{A}\sum_a \dfrac{\tilde\pi_i^{(k)}\,\varphi(\bar e_a;\ \mu_i,\ \sigma_i^2 + \sigma_\varepsilon^2/n_a)}{\sum_i \tilde\pi_i^{(k)}\,\varphi(\bar e_a;\ \mu_i,\ \sigma_i^2 + \sigma_\varepsilon^2/n_a)}$,   (37)

and:

$\tilde\lambda_j^{(k+1)} = \dfrac{1}{n}\sum_{ah} \dfrac{\tilde\lambda_j^{(k)}\sum_i \tilde\pi_i\,\varphi(e_{ah};\ \mu_i+\nu_j,\ \sigma_i^2+\omega_j^2)}{\sum_j \tilde\lambda_j^{(k)}\sum_i \tilde\pi_i\,\varphi(e_{ah};\ \mu_i+\nu_j,\ \sigma_i^2+\omega_j^2)}$,   (38)

where $\tilde\pi_i$ denotes the estimator for $\pi_i$. Note that Cordy and Thomas (1997) only consider the estimator for $F_u$ (i.e. the estimator for $\pi_i$, $i = 1, \ldots, m_u$), as they are not interested in the distribution function of $\varepsilon_{ah}$.

Since the parameters of the component distributions are not being estimated, the modeler will have to set the values of the means $\mu_i$ and $\nu_j$, and of the variances $\sigma_i^2$ and $\omega_j^2$, beforehand. Cordy and Thomas (1997) recommend setting a common variance (as is also done in kernel density estimation, where the common variance is referred to as the bandwidth parameter). A natural choice is to set this common variance equal to $\sigma_i^2 = \hat\sigma_u^2/m_u$ for all $i$ (and hence $\omega_j^2 = \hat\sigma_\varepsilon^2/m_\varepsilon$). Note that for any given common variance $\bar\sigma^2$ we have $\operatorname{var}[u] = \bar\sigma^2 + \sum_i \pi_i\mu_i^2$, with $\sum_i \pi_i\mu_i^2 \geq 0$. This means that at a minimum the common variance must be chosen so that $\operatorname{var}[u] - \bar\sigma^2 \geq 0$. Obviously this is satisfied for $\bar\sigma^2 = \sigma_u^2/m_u$, which yields $\sum_i \pi_i\mu_i^2 = \frac{m_u-1}{m_u}\sigma_u^2$.

A convenient choice for the mean parameters is to take equally spaced $\mu_i$ (and $\nu_j$) with a range that respects the values attained by the observed data. Note that the estimated mixing probabilities are expected to satisfy $E[u] = \sum_i \hat\pi_i\mu_i = \sum_a \bar e_a/A$ (matching the first moment), as well as $\sum_i \hat\pi_i\mu_i^2 = \frac{m_u-1}{m_u}\hat\sigma_u^2$ (matching the second moment), although this is not guaranteed by the estimation approach. This is in contrast with the approach where the component distribution parameters are also being estimated, in which case the first and second moments are guaranteed.

Remark 10 One possible extension to the set of component distributions proposed above (with different $\mu_i$ but common $\sigma^2$) can be obtained by adding components with common zero mean (i.e. $\mu_r = 0$) but different variances $\sigma_r^2$.

4 A small simulation study

This section presents a modest Monte Carlo simulation experiment. We will focus our attention on the following two questions: (1) How effective is our approach in fitting non-normal error distributions (that are not necessarily normal mixtures)? (2) What are the implications for the estimation of poverty and inequality? Do distinct deviations from normality have the potential to introduce a meaningful bias in estimates of poverty and inequality if this non-normality is ignored?

We make the following assumptions. The simulated "country" consists of 500 domains (or target areas), which is the level at which measures of poverty and inequality will be estimated. Each domain is home to 3,000 households. Household per capita incomes are generated by means of the following model:

$y_{ah} = \beta x_{ah} + u_a + \varepsilon_{ah}$,   (39)

where $y_{ah}$ denotes the log of household income, and where $x_{ah}$ represents a single covariate.
For the model parameters we consider the values $\operatorname{var}[u_a + \varepsilon_{ah}] = \sigma_u^2 + \sigma_\varepsilon^2 = 0.3$, $\beta = 1$, $E[x_{ah}] = 0$, and $\operatorname{var}[x_{ah}]$ is chosen so that $R^2 = 0.4$ (which represents a goodness-of-fit that is typical for empirical household income models).

It will be convenient to introduce a parameter that measures the size of the random area effect relative to the total error term. Let us denote this parameter by $\rho = \sigma_u^2/(\sigma_u^2 + \sigma_\varepsilon^2)$. We will consider both the case of a small "area effect" ($\rho = 0.05$) and a medium/large "area effect" ($\rho = 0.25$).

The non-normal errors $u_a$ and $\varepsilon_{ah}$ will be drawn from a Log-Dagum distribution with probability density function:

$f(x) = \dfrac{a\,p\,e^{a p (x - \log b)}}{\big(1 + e^{a(x - \log b)}\big)^{p+1}}$,   (40)

where $p > 0$ is the parameter that determines the shape (or skewness) of the distribution. Smaller values of $p$ result in larger deviations from normality. The remaining two parameters $a$ and $b$ are fixed by imposing zero mean and setting the variance to $\sigma_u^2$ or $\sigma_\varepsilon^2$. Note that the Dagum distribution is not an uncommon choice when modeling income data (see e.g. Kleiber (2007), and the references therein). The covariate $x_{ah}$ will be drawn from a normal distribution. Finally, our artificial survey samples 15 households from each domain.

4.1 Estimation of $F_u$ and $G_\varepsilon$

The first test is whether we can successfully uncover the probability density functions of $u_a$ and $\varepsilon_{ah}$. We will keep the shape parameter for $F_u$ fixed at $p_u = 0.5$, but will consider three different values of the shape parameter for $G_\varepsilon$: $p_\varepsilon = (0.10, 0.25, 0.50)$. The benchmark density functions are obtained as a kernel density estimate applied to the actual realizations of the errors in the census. Our estimators for $F_u$ and $G_\varepsilon$ are based on normal mixtures with 3 and 2 component distributions, respectively.

Figure 1 presents our estimates of the probability density function of $u_a$. The estimates in the right panel show a remarkably good fit. These correspond to the case where the random area effect is large, with $\rho = 0.25$, in which case the contribution of $u_a$ matters. The imperfect fit shown in the left panel suggests that it is harder to estimate $F_u$ when the random area effect is small (in this case $\rho = 0.05$). This is arguably due to the poor signal-to-noise ratio in this case; $u_a$ will only make up a small fraction of the total residual, which is what is observed. However, this lack of precision will also have little to no implication for estimates of poverty and inequality, precisely because of the smallness of $u_a$. In sum, in this example estimates of $F_u$ are precise when they matter, and less precise when they matter less.

Figure 1: Probability density function for $u$. Normal-mixture density (black) versus benchmark density (red): (a) $\rho = 0.05$ (left panel); (b) $\rho = 0.25$ (right panel)

Our estimates of the probability density function of $\varepsilon_{ah}$ are presented in Figure 2. What is apparent is that, despite using no more than two components, our estimates provide remarkably accurate fits regardless of the size of the random area effect. When $\rho$ is small, $\varepsilon_{ah}$ is large relative to $u_a$, and so the signal-to-noise ratio is in our favour when estimating $G_\varepsilon$. It matters less in this case that we do not have a precise estimate of $F_u$. On the other hand, when $\rho$ is large and hence $\varepsilon_{ah}$ is of a more modest magnitude relative to $u_a$, our estimation of $G_\varepsilon$ may be helped by having a precise estimate of $F_u$. (Note that we first estimate $F_u$, and then use this estimate to subsequently estimate $G_\varepsilon$.)

Figure 2: Probability density function for $\varepsilon$. Normal-mixture density (black) versus benchmark density (red): (a) $\rho = 0.05$ (left panel); (b) $\rho = 0.25$ (right panel)
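Before turning to the implications for poverty, a brief note on simulation of the errors: if $Y$ is a Dagum$(a, b, p)$ variate, then $X = \log Y$ has the density in equation (40), so the Log-Dagum errors can be drawn by inverse-CDF sampling. In the minimal sketch below the recentring and rescaling to zero mean and a target variance is done empirically on a large draw rather than by solving for $a$ and $b$ analytically as in the paper (the two are equivalent up to sampling error, since $a$ and $b$ are scale and location parameters of $\log Y$); the parameter values are hypothetical.

```python
import numpy as np

def log_dagum(p, a, b, size, rng):
    """Inverse-CDF draws of X = log Y with Y ~ Dagum(a, b, p); X has density (40)."""
    u = rng.uniform(1e-12, 1.0, size=size)
    y = b * (u ** (-1.0 / p) - 1.0) ** (-1.0 / a)
    return np.log(y)

def standardized_log_dagum(p, var_target, size, rng, a=2.0, b=1.0):
    """Zero-mean Log-Dagum draws with variance var_target (empirical standardization)."""
    x = log_dagum(p, a, b, size, rng)
    return (x - x.mean()) / x.std() * np.sqrt(var_target)

rng = np.random.default_rng(4)
eps = standardized_log_dagum(p=0.25, var_target=0.285, size=100_000, rng=rng)  # hypothetical sigma_eps^2
print(eps.mean(), eps.var())   # approximately 0 and 0.285
```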
4.2 Implications for estimates of poverty

Next we investigate whether having more precise estimates of $F_u$ and $G_\varepsilon$ also gives us more precise estimates of poverty, and whether any gain in precision is economically meaningful. For ease of exposition we will focus on the percentage of households with incomes below the poverty line as the measure of poverty which we wish to estimate.

Tables 1 and 2 provide estimates of the bias for different values of the shape parameter for $G_\varepsilon$ (the shape parameter for $F_u$ is fixed at $p_u = 0.5$), and for different values of the (log) poverty line. The poverty rates associated with the different (log) poverty lines are roughly 15, 20, 30, and 45 percent. It should be noted that the bias in this case is estimated as the difference between the estimated and the true poverty rate averaged over the 500 target areas for one given replication of the "census".

At least two observations stand out. First, estimates obtained under the (incorrect) assumption of normal errors can be severely biased, with a bias of 3 to 4 percent on a poverty rate that ranges between 20 and 30 percent, depending on the shape parameter for $G_\varepsilon$. Our approach provides far superior estimates of poverty in these cases, with a bias of less than a percent. Second, the benefit of accommodating non-normal errors as opposed to assuming normal errors changes with the value of the poverty line. At the far left tail of the (log) income distribution (i.e. for particularly low values of the poverty line), it will be hard to empirically separate the two distributions. The difference will be more pronounced where the distributions have more "mass", i.e. for intermediate to higher values of the poverty line.

4.3 Implications for estimates of inequality

Table 5 compares the bias for estimates of income inequality for different values of the shape parameter and different values of the area location effect. We focus on the Mean Log Deviation (MLD) as the measure of inequality which we wish to estimate.
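For concreteness, a minimal sketch of the MLD of a vector of log per capita incomes (household-size weights omitted for brevity). As the sketch verifies numerically, adding a common area effect to all log incomes in an area leaves the within-area MLD unchanged, which is the point made next.

```python
import numpy as np

def mld(log_y):
    """Mean Log Deviation: log of mean income minus mean of log income."""
    log_y = np.asarray(log_y, dtype=float)
    return np.log(np.exp(log_y).mean()) - log_y.mean()

rng = np.random.default_rng(5)
log_y = rng.normal(1.0, 0.5, size=2000)
print(mld(log_y))          # within-area inequality
print(mld(log_y + 0.3))    # identical: a common area effect u_a drops out of the MLD
```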
Note that by definition the realizations of the area error $u_a$ drop out of the true measure of MLD for that area, which varies around 20 for our simulated data (we multiplied the numbers by a factor of 100). Note however that the size of the "location effect" may still have a bearing on our estimates of inequality, to the extent that it impacts our estimates of the distribution of $\varepsilon_{ah}$. Similarly to what we observed for poverty, the potential gain from accommodating non-normal errors as opposed to incorrectly assuming normal errors can be quite large. We observe biases of 2 to 3 with true inequality around 20 when assuming normal errors, compared to biases of roughly 0.5 when using our approach.

Table 1: Bias for EB estimates of head-count poverty: (a) Normal-Mixture distribution (left number); (b) Normal distribution (right number, in parentheses)

                         ρ = 0.05                                  ρ = 0.25
Poverty line / shape     0.50         0.25         0.10            0.50         0.25         0.10
-0.75                    0.67 (1.61)  0.64 (1.69)  1.32 (2.06)     1.38 (2.51)  1.40 (2.70)  1.06 (2.55)
-0.50                    0.76 (2.51)  0.82 (3.29)  1.35 (4.01)     1.39 (3.09)  1.55 (3.78)  1.32 (3.92)
-0.25                    0.67 (2.73)  0.92 (4.16)  1.14 (5.15)     1.01 (2.82)  1.32 (3.92)  1.34 (4.37)
 0                       0.31 (1.96)  0.73 (3.60)  0.60 (4.51)     0.20 (1.53)  0.61 (2.73)  0.89 (3.32)

Table 2: Bias for NON-EB estimates of head-count poverty: (a) Normal-Mixture distribution (left number); (b) Normal distribution (right number, in parentheses)

                         ρ = 0.05                                  ρ = 0.25
Poverty line / shape     0.50         0.25         0.10            0.50         0.25         0.10
-0.75                    0.19 (0.98)  0.12 (1.07)  0.91 (1.45)     0.23 (0.79)  0.22 (0.97)  -0.08 (0.85)
-0.50                    0.25 (1.95)  0.22 (2.73)  0.82 (3.47)     0.36 (1.63)  0.41 (2.31)   0.16 (2.48)
-0.25                    0.30 (2.40)  0.44 (3.83)  0.62 (4.82)     0.41 (2.03)  0.57 (3.13)   0.53 (3.60)
 0                       0.23 (19.7)  0.58 (3.60)  0.35 (4.51)     0.28 (1.66)  0.57 (2.86)   0.79 (3.47)

Table 3: RMSE for EB estimates of head-count poverty: (a) Normal-Mixture distribution (left number); (b) Normal distribution (right number, in parentheses)

                         ρ = 0.05                                  ρ = 0.25
Poverty line / shape     0.50         0.25         0.10            0.50         0.25         0.10
-0.75                    0.03 (0.03)  0.03 (0.03)  0.03 (0.03)     0.04 (0.05)  0.04 (0.05)  0.04 (0.05)
-0.50                    0.04 (0.05)  0.04 (0.05)  0.04 (0.06)     0.05 (0.06)  0.05 (0.06)  0.05 (0.06)
-0.25                    0.05 (0.06)  0.05 (0.07)  0.05 (0.07)     0.06 (0.07)  0.06 (0.07)  0.06 (0.08)
 0                       0.06 (0.06)  0.06 (0.07)  0.06 (0.07)     0.07 (0.07)  0.07 (0.07)  0.07 (0.08)

Table 4: RMSE for NON-EB estimates of head-count poverty: (a) Normal-Mixture distribution (left number); (b) Normal distribution (right number, in parentheses)

                         ρ = 0.05                                  ρ = 0.25
Poverty line / shape     0.50         0.25         0.10            0.50         0.25         0.10
-0.75                    0.04 (0.04)  0.04 (0.04)  0.04 (0.04)     0.10 (0.10)  0.09 (0.09)  0.09 (0.09)
-0.50                    0.05 (0.06)  0.05 (0.06)  0.05 (0.06)     0.12 (0.12)  0.12 (0.12)  0.12 (0.12)
-0.25                    0.07 (0.07)  0.07 (0.08)  0.07 (0.08)     0.14 (0.15)  0.14 (0.15)  0.14 (0.15)
 0                       0.07 (0.08)  0.07 (0.08)  0.08 (0.09)     0.15 (0.15)  0.16 (0.16)  0.16 (0.16)

Table 5: Bias for estimates of inequality (MLD): (a) Normal-Mixture distribution (left number); (b) Normal distribution (right number, in parentheses)

MLD / shape    0.50         0.25         0.10
ρ = 0.05       0.36 (1.67)  0.52 (2.59)  1.12 (3.71)
ρ = 0.25       0.37 (1.26)  0.61 (2.18)  0.47 (2.25)

5 A small empirical study

For this small empirical application we will treat the 2010 micro census from the United States, a 1 percent sample, as a "new" country (i.e. as if the data constitutes a full population census). The data is publicly available at IPUMS USA. There are a total of 1.25 million households (1,242,967 to be exact) residing in 422 counties, which is the level at which we will be estimating poverty and inequality (i.e. the "small area level").
The census includes individual income data as well as data on a wide range of individual characteristics. We use the income data to compute household income per capita. All other individual-level variables are also aggregated at the household level, as this is the level at which we will build the income model. We define the head of the household as the member of the household with the highest level of individual income. (In many empirical applications much of the available data is at the household level.) Table 6 provides some descriptive statistics of the data.

Table 6: Descriptive statistics of variables used in the empirical application (derived from 2010 IPUMS USA)

Variable        Description                                              Mean     Std. Dev.
lnycap          Log of household income per capita                       9.939    0.998
metro suburb    In suburb of major city (dummy)                          0.290    0.454
metro none      Away from major city (dummy)                             0.201    0.401
hh size 1       Household of size one (dummy)                            0.300    0.458
hh size 2       Household of size two (dummy)                            0.342    0.474
hh size 3       Household of size three (dummy)                          0.147    0.354
share child15   Share of children aged younger than 15                   0.108    0.199
hh hedu2        At least one member with a masters degree (dummy)        0.366    0.482
hh employed     Number of employed household members                     1.104    0.936
hd female       Head of household is female (dummy)                      0.417    0.493
hd logage       Log of age of household head                             3.889    0.371
hd ledu         Household head has lower education only (dummy)          0.048    0.214
hd hedu1        Household head has college degree (dummy)                0.226    0.418
hd hedu2        Household head has a masters degree (dummy)              0.312    0.463
hd employed     Household head is employed (dummy)                       0.660    0.474
hd hisp         Household head is Hispanic (dummy)                       0.091    0.288
hd black        Household head is African-American (dummy)               0.106    0.308

We draw a balanced sample of 6,330 households (15 households per county). The log of household per capita income will be our dependent variable. Our independent variables include: urbanity, household size, age composition, gender, ethnicity, education, and employment. There are a total of 16 regressors (excluding the constant). The regression model is shown in Table 7. We obtain an adjusted R-squared of 0.432, which is a typical level for this type of income (or consumption) regression model. The location effect is estimated at 0.027, which is not unusual for developed nations but rather small for developing countries, where one might find location effects in the range of 0.05 to 0.25.

Table 7: Regression model with log household income per capita (lnycap) as dependent variable (sample from 2010 IPUMS USA); t statistics in parentheses; * p < 0.10, ** p < 0.05, *** p < 0.01

metro suburb     0.171***   (8.21)
metro none      -0.0542*    (-1.68)
hh size 1        0.733***   (16.32)
hh size 2        0.620***   (16.00)
hh size 3        0.344***   (9.73)
share child15   -0.160**    (-2.20)
hh hedu2         0.277***   (6.47)
hh employed      0.257***   (14.24)
hd female       -0.212***   (-10.80)
hd logage        0.898***   (30.46)
hd ledu         -0.331***   (-6.40)
hd hedu1         0.175***   (7.06)
hd hedu2         0.450***   (9.81)
hd employed      0.484***   (15.25)
hd hisp         -0.159***   (-4.36)
hd black        -0.248***   (-7.78)
constant         5.201***   (39.18)
N                6330
adj. R^2         0.432

Figure 3 shows the estimates of the probability density functions associated with the area error $u_a$ and the household error $\varepsilon_{ah}$, respectively. The red line denotes the normal-mixture density estimate; the blue line denotes the normal density function that best fits the data.
Figure 3 shows the estimates of the probability density functions associated with the area error ua and the household error εah, respectively. The red line denotes the normal-mixture density estimate; the blue line denotes the normal density function that best fits the data.

[Figure 3: Normal-mixture density (red) versus normal density (blue) estimates for: (a) area error ua (left panel); (b) household error εah (right panel).]

As we are dealing with empirical data rather than simulated data, we do not know the distributions from which the errors are drawn, and hence we cannot include the "true" density function as a benchmark. The estimates suggest that the household errors are drawn from a distribution that deviates visibly from the normal. We do not see a similar deviation from the normal density for the area error. However, the small variance of the area error relative to the total error implies a low signal-to-noise ratio, which makes it difficult to identify the underlying density function. The small variance also means that the area error makes only a minor contribution to the total error, so that its density function is less important; it is clearly more important that our estimate of the density associated with the household error is accurate.

The log poverty line is set at 9.5, which yields an aggregate poverty rate of about 28.3 percent. County-level estimates of poverty and inequality are obtained using both EB and non-EB estimates, under the assumption of normal errors and under the less restrictive assumption of normal-mixture errors. For each specification we obtain 422 estimates of poverty and inequality, which we then compare to the true values derived from the full census. We judge the accuracy of the estimates on the basis of two summary statistics: the aggregate bias and the RMSE, both obtained by aggregating the county-level differences between our estimates and the true rates over the 422 counties (a short computational sketch of these two statistics is given at the end of this section). The results are presented in Table 8.

            Bias             RMSE
EB          0.029 (0.048)    0.045 (0.059)
NON-EB      0.024 (0.044)    0.042 (0.056)

Table 8: Bias and RMSE for estimates of county-level poverty using 2010 data from the United States: (a) Normal-Mixture distribution (left number); (b) Normal distribution (right number)

We observe a rather large bias when estimating county-level poverty under the assumption of normally distributed errors: almost 5 percentage points given a national poverty rate of about 28 percent. By relaxing the normality assumption and assuming normal-mixture distributions instead, we are able to reduce this bias to just below 3 percentage points, which is a non-trivial improvement. We see similar reductions in the RMSE. Working with EB or non-EB estimates makes little to no difference in this empirical application; this was to be expected given the smallness of the location effect. The fact that some bias remains suggests that a degree of model misspecification persists; possible sources include misspecification of the structural model as well as forms of heteroskedasticity that are currently ignored.
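The two summary statistics used above can be read as follows. This is a minimal sketch under the assumption that the aggregate bias is the average county-level difference and the RMSE its root mean square; the arrays `estimates` and `truth` (county-level head-count ratios, here of length 422) are hypothetical inputs.

```python
import numpy as np

def bias_and_rmse(estimates, truth):
    """Aggregate bias and RMSE over small areas (here: 422 counties),
    computed from the county-level differences between estimated and
    true poverty (or inequality) rates."""
    d = np.asarray(estimates, float) - np.asarray(truth, float)
    return d.mean(), np.sqrt((d ** 2).mean())

# Hypothetical usage: est_mixture, est_normal, and true_rates are arrays of
# length 422 holding the county-level head-count ratios.
# bias_mix, rmse_mix = bias_and_rmse(est_mixture, true_rates)
# bias_nrm, rmse_nrm = bias_and_rmse(est_normal, true_rates)
```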
6 Concluding remarks

We have developed an estimator for the error distribution functions associated with the nested errors from a linear mixed model. An attractive feature of the estimator is that it accommodates Empirical Bayes prediction. This is achieved without making any restrictive assumptions about the shape of the distribution functions. Monte-Carlo simulations presented in this paper show that estimates of poverty and inequality can be severely biased when non-normality of the errors is ignored. The bias can be as high as 2 to 3 percent on a poverty rate of 20 to 30 percent. Most of this bias is resolved when using our approach.

7 Appendix

7.1 A lemma and corollary that are used in the proofs of Lemmas 4 and 5

Lemma 11 Let $v_1$ and $v_2$ be two independent normal random variates, with $E v_i = \mu_i$ and $\mathrm{var}\, v_i = \sigma_i^2$ for $i = 1, 2$. Let $v_3 = v_1 + v_2$. Denote the normal density function with expectation $\mu$ and variance $\sigma^2$ by $\phi(\cdot; \mu, \sigma^2)$. Then for the joint density $p(v_1, v_3)$ and the conditional density $p(v_1|v_3)$ we have

$p(v_1, v_3) = \phi(v_1; \mu_1, \sigma_1^2)\,\phi(v_3 - v_1; \mu_2, \sigma_2^2) = p(v_1|v_3)\,p(v_3) = \phi\!\left(v_1;\ \frac{\sigma_1^2}{\sigma_1^2+\sigma_2^2}(v_3 - \mu_2) + \frac{\sigma_2^2}{\sigma_1^2+\sigma_2^2}\mu_1,\ \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2}\right) \times \phi\!\left(v_3;\ \mu_1 + \mu_2,\ \sigma_1^2+\sigma_2^2\right).$

Proof This can be verified directly. Alternatively, note that both $v_3$ and $v_1|v_3$ are normally distributed. The distribution of $v_1|v_3$ has expectation

$E(v_1|v_3) = \frac{\sigma_1^2}{\sigma_1^2+\sigma_2^2}(v_3 - \mu_2) + \frac{\sigma_2^2}{\sigma_1^2+\sigma_2^2}\mu_1$

and variance

$\mathrm{var}(v_1|v_3) = \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2}.$

Corollary 12 Let $x$ and $y$ be independent random variables distributed as normal mixtures, and let $q = x + y$. Then (i) $q$ is also distributed as a normal mixture, and (ii) the conditional distributions $p(x|q)$ and $p(y|q)$ are also normal mixtures. Moreover, the mixing parameters of $p(q)$, $p(x|q)$ and $p(y|q)$ are closed expressions in terms of the distributional parameters of $x$ and $y$. In particular, if the density function of $x$ is

$f(x) = \sum_{k=1}^{a} \pi_k\, \phi(x; \mu_k, \sigma_k^2)$

(with $\pi_k > 0$ and $\sum_k \pi_k = 1$) and the density function of $y$ is

$g(y) = \sum_{j=1}^{b} \lambda_j\, \phi(y; \nu_j, \tau_j^2)$

(with $\lambda_j > 0$ and $\sum_j \lambda_j = 1$), then the conditional distribution of $x$ given $q$ is

$h(x|q) = \sum_{jk} \kappa_{jk}\, \phi(x; m_{jk}, s_{jk}^2),$

where the mixing probabilities $\kappa_{jk}$ are proportional to

$\kappa_{jk} \propto \pi_k \lambda_j\, \phi\!\left(q;\ \mu_k + \nu_j,\ \sigma_k^2 + \tau_j^2\right)$

and

$m_{jk} = \frac{\sigma_k^2}{\sigma_k^2+\tau_j^2}(q - \nu_j) + \frac{\tau_j^2}{\sigma_k^2+\tau_j^2}\mu_k, \qquad s_{jk}^2 = \frac{\sigma_k^2\tau_j^2}{\sigma_k^2+\tau_j^2}.$

Proof The random variable $x$ can be expressed as

$x = \sum_{k=1}^{a} z_k x_k,$

where the $x_k$ are independent (latent) random variables, and also independent of $(z_1, \ldots, z_a)$. $x_k$ is normally distributed according to component distribution $k$, with density $\phi(x_k; \mu_k, \sigma_k^2)$. The $z_k$ are indicator random variables taking values 0 and 1 with probability $\pi_k$. One and only one indicator takes the value 1, the others are 0 (in other words, the $(z_1, \ldots, z_a)$ have a multinomial distribution with class probabilities $\pi_k$ and a single draw). Likewise, $y$ can be expressed as

$y = \sum_{j=1}^{b} w_j y_j,$

where the $(w_j, y_j)$ are independent of the $(z_k, x_k)$. Note that $z_k = z_k \sum_j w_j$ and $w_j = w_j \sum_k z_k$, so that

$q = x + y = \sum_k z_k x_k + \sum_j w_j y_j = \sum_{jk} z_k w_j (x_k + y_j).$

Part (i) of the Corollary now immediately follows; the mixing probabilities for $q$ are $P[z_k w_j = 1] = \pi_k \lambda_j$ and the component distribution for component $(k, j)$ has expectation $\mu_k + \nu_j$ and variance $\sigma_k^2 + \tau_j^2$.

For the conditional distribution $p(x|q)$ we have

$p(x|q) \propto p(x, q) = \sum_{jk} \pi_k \lambda_j\, \phi(x; \mu_k, \sigma_k^2)\, \phi(q - x; \nu_j, \tau_j^2) = \sum_{jk} \pi_k \lambda_j\, \phi\!\left(x;\ \frac{\sigma_k^2}{\sigma_k^2+\tau_j^2}(q - \nu_j) + \frac{\tau_j^2}{\sigma_k^2+\tau_j^2}\mu_k,\ \frac{\sigma_k^2\tau_j^2}{\sigma_k^2+\tau_j^2}\right) \times \phi\!\left(q;\ \mu_k + \nu_j,\ \sigma_k^2 + \tau_j^2\right).$

The last equality follows from the Lemma. The conditional distribution $p(x|q)$ is therefore again a normal mixture with mixing probabilities for the $(k, j)$ component proportional to $\pi_k \lambda_j\, \phi(q; \mu_k + \nu_j, \sigma_k^2 + \tau_j^2)$, and corresponding component distribution

$\phi\!\left(x;\ \frac{\sigma_k^2}{\sigma_k^2+\tau_j^2}(q - \nu_j) + \frac{\tau_j^2}{\sigma_k^2+\tau_j^2}\mu_k,\ \frac{\sigma_k^2\tau_j^2}{\sigma_k^2+\tau_j^2}\right).$

The distribution of $p(y|q)$ follows in exactly the same fashion.
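As a numerical illustration of Corollary 12 (ours, with made-up parameter values), the sketch below computes the mixing weights, component means, and component variances of p(x|q) for two small mixtures, and checks that the result agrees with the conditional density obtained directly from the joint density p(x, q).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical mixture for x: weights pi_, means mu, variances sig2
pi_, mu, sig2 = np.array([0.6, 0.4]), np.array([-0.5, 0.8]), np.array([0.3, 0.9])
# Hypothetical mixture for y: weights lam, means nu, variances tau2
lam, nu, tau2 = np.array([0.7, 0.3]), np.array([0.0, 1.5]), np.array([0.5, 1.2])

q = 0.4  # the observed value of q = x + y

# Mixing weights of p(x|q): kappa_{jk} proportional to pi_k lam_j phi(q; mu_k + nu_j, sig2_k + tau2_j)
K, J = np.meshgrid(np.arange(2), np.arange(2), indexing="ij")  # k index, j index
kappa = pi_[K] * lam[J] * norm.pdf(q, loc=mu[K] + nu[J], scale=np.sqrt(sig2[K] + tau2[J]))
kappa /= kappa.sum()

# Component means and variances of p(x|q), per Corollary 12
m = (sig2[K] * (q - nu[J]) + tau2[J] * mu[K]) / (sig2[K] + tau2[J])
s2 = sig2[K] * tau2[J] / (sig2[K] + tau2[J])

# Conditional density implied by Corollary 12
grid = np.linspace(-6.0, 6.0, 4001)
dens = sum(kappa[k, j] * norm.pdf(grid, m[k, j], np.sqrt(s2[k, j]))
           for k in range(2) for j in range(2))

# Cross-check against p(x|q) = p(x, q) / p(q) computed from the joint density
joint = sum(pi_[k] * lam[j] * norm.pdf(grid, mu[k], np.sqrt(sig2[k]))
            * norm.pdf(q - grid, nu[j], np.sqrt(tau2[j]))
            for k in range(2) for j in range(2))
print(np.max(np.abs(dens - joint / np.trapz(joint, grid))))  # close to zero
```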
7.2 Proofs

Lemma 4: The probability density function of $\bar\varepsilon_a$ conditional on $\bar e_a$, which we denote by $p(\bar\varepsilon_a|\bar e_a)$, is a normal mixture with known parameters:

$p(\bar\varepsilon_a|\bar e_a) = \sum_{i=1}^{m_u}\sum_{j=1}^{m_\varepsilon}\sum_{k=1}^{m_\varepsilon} w_{ijk}\,\phi\bigl(\bar\varepsilon_a;\ m_{ijk},\ s^2_{ijk}\bigr),$   (41)

where

$m_{ijk} = \frac{\omega_j^2+\omega_k^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}(\bar e_a - \mu_i) + \frac{n_a\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}\,\frac{\nu_j+\nu_k}{n_a}, \qquad s^2_{ijk} = \frac{(\omega_j^2+\omega_k^2)\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2},$

and where $w_{ijk} = \tilde w_{ijk}/\sum_{ijk}\tilde w_{ijk}$ with:

$\tilde w_{ijk} = \pi_i\lambda_j\lambda_k\,\phi\bigl(\bar e_a;\ \mu_i + (\nu_j+\nu_k)/n_a,\ \sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a\bigr),$   (42)

where $\phi$ denotes the normal probability density function, and where $(\pi_i, \mu_i, \sigma_i^2)$ and $(\lambda_j, \nu_j, \omega_j^2)$ denote the parameters associated with the normal-mixture distributions $F_u$ and $G_\varepsilon$, respectively.

Proof The result stated in this lemma follows from a direct application of Corollary 12.

Lemma 5: The probability density function of $u_a$ conditional on $\bar e_a$, which we denote by $p(u_a|\bar e_a)$, is a normal mixture with known parameters:

$p(u_a|\bar e_a) = \sum_{i=1}^{m_u}\sum_{j=1}^{m_\varepsilon}\sum_{k=1}^{m_\varepsilon} w_{ijk}\,\phi\bigl(u_a;\ m_{ijk},\ s^2_{ijk}\bigr),$   (43)

where

$m_{ijk} = \frac{\sigma_i^2}{\sigma_i^2+(\omega_j^2+\omega_k^2)/n_a}\bigl(\bar e_a - (\nu_j+\nu_k)/n_a\bigr) + \frac{\omega_j^2+\omega_k^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2}\,\mu_i, \qquad s^2_{ijk} = \frac{(\omega_j^2+\omega_k^2)\sigma_i^2}{\omega_j^2+\omega_k^2+n_a\sigma_i^2},$

and where $w_{ijk} = \tilde w_{ijk}/\sum_{ijk}\tilde w_{ijk}$ with:

$\tilde w_{ijk} = \pi_i\lambda_j\lambda_k\,\phi\bigl(\bar e_a;\ \mu_i + (\nu_j+\nu_k)/n_a,\ \sigma_i^2 + (\omega_j^2+\omega_k^2)/n_a\bigr),$   (44)

where $\phi$ denotes the normal probability density function, and where $(\pi_i, \mu_i, \sigma_i^2)$ and $(\lambda_j, \nu_j, \omega_j^2)$ denote the parameters associated with the normal-mixture distributions $F_u$ and $G_\varepsilon$, respectively.

Proof The result stated in this lemma follows from a direct application of Corollary 12.

Proposition 9: At every iteration $k$, the mean and variance of the probability density function $\hat f_u^{(k)}(u) = \sum_i \hat\pi_i^{(k)}\phi(u; \hat\mu_i^{(k)}, \hat\sigma_i^{2(k)})$ solve:

$E[u; \hat f_u^{(k)}] = \frac{1}{A}\sum_a \Bigl(\bar e_a - E\bigl[\bar\varepsilon_a \mid \bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\bigr]\Bigr)$

$\mathrm{var}[u; \hat f_u^{(k)}] = \frac{1}{A}\sum_a E\bigl[(\bar e_a - \bar\varepsilon_a)^2 \mid \bar e_a;\ \hat p^{(k)}(\bar\varepsilon_a|\bar e_a)\bigr] - E^2[u; \hat f_u^{(k)}].$

Proof Let us first evaluate the first moment:

$E_k[u] = \sum_i \hat\pi_i^{(k)}\hat\mu_i^{(k)},$   (45)

where $E_k[\cdot]$ takes expectations using the iteration-$k$ estimate of the pdf for $u$: $\hat f_u^{(k)}(u) = \sum_i \hat\pi_i^{(k)}\phi(u; \hat\mu_i^{(k)}, \hat\sigma_i^{2(k)})$. Substituting the expressions for $\hat\pi_i^{(k)}$ and $\hat\mu_i^{(k)}$, we obtain:

$\sum_i \hat\pi_i^{(k)}\hat\mu_i^{(k)} = \frac{1}{A}\sum_i E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)} \Big| \bar e_a\Bigr]\, \frac{E_k\bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a) \big| \bar e_a\bigr]}{E_k\bigl[\sum_a \hat\tau_{ai}^{(k)} \big| \bar e_a\bigr]}$   (46)

$= \frac{1}{A}\sum_i E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a) \Big| \bar e_a\Bigr]$   (47)

$= \frac{1}{A}\sum_a E_k\Bigl[\sum_i \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a) \Big| \bar e_a\Bigr].$   (48)

By definition we have that $\sum_i \hat\tau_{ai}^{(k)} = 1$, so that the latter simplifies to:

$E_k[u] = \frac{1}{A}\sum_a \bigl(\bar e_a - E_k[\bar\varepsilon_a|\bar e_a]\bigr).$   (49)

Next is the expression for the variance of $u$:

$\mathrm{var}_k[u] = \sum_i \hat\pi_i^{(k)}\bigl(\hat\sigma_i^{2(k)} + \hat\mu_i^{2(k)}\bigr) - E_k^2[u].$   (50)

Substituting the expressions for $\hat\pi_i^{(k)}$, $\hat\mu_i^{(k)}$ and $\hat\sigma_i^{2(k)}$ yields:

$\sum_i \hat\pi_i^{(k)}\bigl(\hat\sigma_i^{2(k)} + \hat\mu_i^{2(k)}\bigr) = \frac{1}{A}\sum_i \Biggl( E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)} \Big| \bar e_a\Bigr]\, \frac{E_k\bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a - \hat\mu_i^{(k)})^2 \big| \bar e_a\bigr]}{E_k\bigl[\sum_a \hat\tau_{ai}^{(k)} \big| \bar e_a\bigr]} + E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)} \Big| \bar e_a\Bigr]\hat\mu_i^{2(k)}\Biggr)$

$= \frac{1}{A}\sum_i \Bigl( E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a - \hat\mu_i^{(k)})^2 \Big| \bar e_a\Bigr] + E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)} \Big| \bar e_a\Bigr]\hat\mu_i^{2(k)}\Bigr)$

$= \frac{1}{A}\sum_i E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a)^2 + \sum_a \hat\tau_{ai}^{(k)}\hat\mu_i^{2(k)} - 2\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a)\hat\mu_i^{(k)} \Big| \bar e_a\Bigr].$

From the expression for $\hat\tau_{ai}^{(k)}$ we know that $E_k[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a)|\bar e_a] = E_k[\sum_a \hat\tau_{ai}^{(k)}|\bar e_a]\,\hat\mu_i^{(k)}$. Plugging this into the last obtained equation, the last two terms are seen to cancel out, and we obtain:

$\sum_i \hat\pi_i^{(k)}\bigl(\hat\sigma_i^{2(k)} + \hat\mu_i^{2(k)}\bigr) = \frac{1}{A}\sum_i E_k\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar e_a - \bar\varepsilon_a)^2 \Big| \bar e_a\Bigr]$   (51)

$= \frac{1}{A}\sum_a E_k\Bigl[\Bigl(\sum_i \hat\tau_{ai}^{(k)}\Bigr)(\bar e_a - \bar\varepsilon_a)^2 \Big| \bar e_a\Bigr]$   (52)

$= \frac{1}{A}\sum_a E_k\bigl[(\bar e_a - \bar\varepsilon_a)^2 \big| \bar e_a\bigr].$   (53)

This completes the proof.
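To make Lemma 5 concrete, the following sketch evaluates the weights, means, and variances of the conditional mixture p(u_a | ē_a) for hypothetical parameter values of F_u and G_ε; the function name and all numerical values are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def conditional_u_given_ebar(ebar_a, n_a, pi, mu, sig2, lam, nu, om2):
    """Mixture representation of p(u_a | ebar_a) as in Lemma 5:
    components indexed by (i, j, k) with weights w_ijk, means m_ijk,
    and variances s2_ijk. Inputs: ebar_a (observed area mean residual),
    n_a (households in area a), (pi, mu, sig2) the F_u mixture and
    (lam, nu, om2) the G_eps mixture parameters (hypothetical values)."""
    I, J, K = np.meshgrid(np.arange(len(pi)), np.arange(len(lam)),
                          np.arange(len(lam)), indexing="ij")
    mean_e = mu[I] + (nu[J] + nu[K]) / n_a        # component mean of ebar_a
    var_e = sig2[I] + (om2[J] + om2[K]) / n_a     # component variance of ebar_a
    w = pi[I] * lam[J] * lam[K] * norm.pdf(ebar_a, mean_e, np.sqrt(var_e))
    w /= w.sum()                                  # eq. (44), normalized
    denom = om2[J] + om2[K] + n_a * sig2[I]
    m = (n_a * sig2[I] * (ebar_a - (nu[J] + nu[K]) / n_a)
         + (om2[J] + om2[K]) * mu[I]) / denom
    s2 = (om2[J] + om2[K]) * sig2[I] / denom
    return w, m, s2

# Hypothetical parameters; E[u_a | ebar_a] is the weight-averaged component mean.
pi, mu, sig2 = np.array([0.5, 0.5]), np.array([-0.2, 0.2]), np.array([0.04, 0.09])
lam, nu, om2 = np.array([0.6, 0.4]), np.array([-0.3, 0.45]), np.array([0.25, 0.64])
w, m, s2 = conditional_u_given_ebar(ebar_a=0.1, n_a=15, pi=pi, mu=mu, sig2=sig2,
                                    lam=lam, nu=nu, om2=om2)
print((w * m).sum())  # E[u_a | ebar_a]; Lemma 4's E[epsbar_a | ebar_a] equals ebar_a minus this value
```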
7.3 A derivation of the EM-type estimator for $F_u$

Below we derive the EM-type estimator for the normal-mixture distribution $F_u$. The estimator for $G_\varepsilon$ is obtained in a very similar fashion.

Let $\Psi$ denote the vector that contains the parameters of this normal mixture, which includes the mixing probabilities $\pi_i$ as well as the parameters $\mu_i$ and $\sigma_i^2$ of the component distributions. Assume for the moment that $u_a$ denotes observed data. The EM algorithm would then proceed as follows. It introduces the random variable $z_{ai}$ that equals 1 if $u_a$ comes from component $i$ and 0 otherwise, and defines $L(\Psi; u, z)$ as the log-likelihood function that would apply if both $u$ and $z$ were observed. It then integrates out $z$ by taking expectations over $z$ conditional on an estimate of $\Psi$ that is available at iteration $k$, which we shall denote $\hat\Psi^{(k)}$. The solution $\hat\Psi^{(k+1)}$ that maximizes the resulting log-likelihood function $Q(\Psi; \hat\Psi^{(k)}) = E[L(\Psi; u, z)|\hat\Psi^{(k)}]$ can be obtained analytically, unlike the solution to the original log-likelihood function. The EM estimator for $\Psi$ in this case is obtained as the fixed point of this iterative equation.

Unfortunately, we do not observe $u_a$. According to the nested error model, $u_a = \bar e_a - \bar\varepsilon_a$. We observe $\bar e_a$, but not $\bar\varepsilon_a$. The EM algorithm, however, can still be applied. Instead of integrating out $z$, we will be integrating out both $z$ and $\bar\varepsilon_a$. With slight abuse of notation, the resulting log-likelihood function is now given by: $Q(\Psi; \hat\Psi^{(k)}) = E[L(\Psi; \bar e, \bar\varepsilon, z)|\hat\Psi^{(k)}]$. Note that this function $Q(\Psi; \hat\Psi^{(k)})$ will be different from the function that is obtained when $u$ is treated as observed data and only $z$ is integrated out. (But it is obtained respecting the principles of the EM algorithm.)

It will be convenient to postpone taking the expectations over $\bar\varepsilon$ until after we have evaluated the partial derivatives of $Q(\Psi; \hat\Psi^{(k)})$ with respect to $\Psi$. It follows that:

$Q(\Psi; \hat\Psi^{(k)}) = E\bigl[Q(\Psi; \hat\Psi^{(k)}, \bar\varepsilon)\,\big|\,\bar e\bigr],$   (54)

where expectations are taken over $\bar\varepsilon$ conditional on $\bar e$, and where:

$Q(\Psi; \hat\Psi^{(k)}, \bar\varepsilon) = \sum_i \sum_a \hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\log\bigl[\pi_i\,\phi(\bar e_a - \bar\varepsilon_a; \mu_i, \sigma_i^2)\bigr],$   (55)

with:

$\hat\tau_{ai}^{(k)}(\bar\varepsilon_a) = \frac{\hat\pi_i^{(k)}\,\phi\bigl(\bar e_a - \bar\varepsilon_a;\ \hat\mu_i^{(k)},\ \hat\sigma_i^{2(k)}\bigr)}{\sum_i \hat\pi_i^{(k)}\,\phi\bigl(\bar e_a - \bar\varepsilon_a;\ \hat\mu_i^{(k)},\ \hat\sigma_i^{2(k)}\bigr)}.$   (56)

Evaluating the partial derivatives of $Q(\Psi; \hat\Psi^{(k)})$ with respect to $\Psi = (\pi, \mu, \sigma^2)$ and bringing these inside the expectation operator yields:

$\frac{\partial Q}{\partial \pi_i} = E\Bigl[\frac{\partial Q(\Psi; \hat\Psi^{(k)}, \bar\varepsilon)}{\partial \pi_i} \Big| \bar e\Bigr] = E\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\frac{1}{\pi_i} \Big| \bar e_a\Bigr]$

$\frac{\partial Q}{\partial \mu_i} = E\Bigl[\frac{\partial Q(\Psi; \hat\Psi^{(k)}, \bar\varepsilon)}{\partial \mu_i} \Big| \bar e\Bigr] = E\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\frac{1}{\sigma_i^2}(\bar e_a - \bar\varepsilon_a - \mu_i) \Big| \bar e_a\Bigr]$

$\frac{\partial Q}{\partial \sigma_i^2} = E\Bigl[\frac{\partial Q(\Psi; \hat\Psi^{(k)}, \bar\varepsilon)}{\partial \sigma_i^2} \Big| \bar e\Bigr] = E\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\frac{1}{2\sigma_i^2}\Bigl(\frac{(\bar e_a - \bar\varepsilon_a - \mu_i)^2}{\sigma_i^2} - 1\Bigr) \Big| \bar e_a\Bigr].$

The expressions for $\hat\mu_i^{(k+1)}$ and $\hat\sigma_i^{2(k+1)}$ presented in Estimation Method 1-A are readily obtained after setting the partial derivatives with respect to $\mu_i$ and $\sigma_i^2$ equal to zero and solving for $\mu_i$ and $\sigma_i^2$. Finding the value of $\pi_i$ that maximizes $Q(\Psi; \hat\Psi^{(k)})$ is a Lagrange optimization problem that incorporates the constraint that $\sum_i \pi_i = 1$. The corresponding first-order condition solves:

$E\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar\varepsilon_a)\,\frac{1}{\pi_i} \Big| \bar e_a\Bigr] = \lambda,$   (57)

where $\lambda$ denotes the Lagrange multiplier. Rearranging terms and summing both sides over $i$ yields:

$\sum_i E\Bigl[\sum_a \hat\tau_{ai}^{(k)}(\bar\varepsilon_a) \Big| \bar e_a\Bigr] = \lambda \sum_i \pi_i.$   (58)

Since $\sum_i \hat\tau_{ai}^{(k)}(\bar\varepsilon_a) = \sum_i \pi_i = 1$ by definition, we are left with:

$\lambda = \sum_a 1 = A.$   (59)

By substituting this into eq. (57) and solving for $\pi_i$, we obtain the desired expression for $\hat\pi_i^{(k+1)}$.
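For reference, the sketch below implements the standard EM iteration described at the beginning of this subsection for the benchmark case in which u_a is treated as observed data; the full estimator derived above additionally integrates out ε̄_a. The starting values and the simulated sample are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def em_normal_mixture(u, pi, mu, sig2, n_iter=200):
    """Standard EM for a univariate normal mixture fitted to an observed
    sample u (standing in for the area effects u_a if they were observed).
    pi, mu, sig2 are starting values for the mixing probabilities,
    component means, and component variances."""
    u = np.asarray(u, float)[:, None]                  # shape (A, 1)
    pi, mu, sig2 = map(lambda v: np.asarray(v, float).copy(), (pi, mu, sig2))
    for _ in range(n_iter):
        # E-step: tau_ai = posterior probability that u_a comes from component i
        dens = pi * norm.pdf(u, mu, np.sqrt(sig2))     # shape (A, m_u)
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates of (pi, mu, sig2)
        pi = tau.mean(axis=0)
        mu = (tau * u).sum(axis=0) / tau.sum(axis=0)
        sig2 = (tau * (u - mu) ** 2).sum(axis=0) / tau.sum(axis=0)
    return pi, mu, sig2

# Hypothetical usage: draw u_a from a two-component mixture and recover it
# (estimates should be close to the generating values, up to label switching).
rng = np.random.default_rng(0)
comp = rng.random(5000) < 0.3
u = np.where(comp, rng.normal(-1.0, 0.4, 5000), rng.normal(0.5, 0.8, 5000))
print(em_normal_mixture(u, pi=[0.5, 0.5], mu=[-0.5, 0.5], sig2=[1.0, 1.0]))
```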
References

Araujo, M., Ferreira, F., Lanjouw, P. and Ozler, B. (2008). Local inequality and project choice: Theory and evidence from Ecuador. Journal of Public Economics, 92, 1022–1046.

Cordy, C. and Thomas, D. (1997). Deconvolution of a distribution function. Journal of the American Statistical Association, 92, 1459–1465.

Demombynes, G. and Ozler, B. (2005). Crime and local inequality in South Africa. Journal of Development Economics, 76, 265–292.

Dempster, A., Laird, N. and Rubin, D. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Elbers, C., Fujii, T., Lanjouw, P., Ozler, B. and Yin, W. (2007). Poverty alleviation through geographic targeting: How much does disaggregation help? Journal of Development Economics, 83, 198–213.

Elbers, C., Lanjouw, J. and Lanjouw, P. (2003). Micro-level estimation of poverty and inequality. Econometrica, 71, 355–364.

Fujii, T. (2010). Micro-level estimation of child undernutrition indicators in Cambodia. The World Bank Economic Review, 24, 520–553.

Henderson, C. (1953). Estimation of variance and covariance components. Biometrics, 9, 226–253.

Kleiber, C. (2007). A guide to the Dagum distributions. WWZ Working Paper, 23/07.

McCulloch, C. and Neuhaus, J. (2011). Misspecifying the shape of a random effects distribution: Why getting it wrong may not matter. Statistical Science, 26, 388–402.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. New York: John Wiley & Sons.

Molina, I. and Rao, J. (2010). Small area estimation of poverty indicators. Canadian Journal of Statistics, 38, 369–385.

Searle, S., Casella, G. and McCulloch, C. (1992). Variance Components. New York: Wiley.

Skrondal, A. and Rabe-Hesketh, S. (2009). Prediction in multilevel generalized linear models. Journal of the Royal Statistical Society, 172, 659–687.

Verbeke, G. and Lesaffre, E. (1996). A linear mixed-effects model with heterogeneity in the random effects population. Journal of the American Statistical Association, 91, 217–221.

Wang, S. and Yin, S. (2002). A new estimate of the parameters in linear mixed models. Science in China Series A: Mathematics, 45, 1301–1311.

Westfall, P. (1987). A comparison of variance component estimates for arbitrary underlying distributions. Journal of the American Statistical Association, 82, 866–874.

Wu, M., Yu, K. and Liu, A. (2009). Estimation of variance components in the mixed effects models: A comparison between analysis of variance and spectral decomposition. Journal of Statistical Planning and Inference, 139, 3962–3973.

Zhou, T. and He, X. (2008). Three-step estimation in linear mixed models with skew-t distributions. Journal of Statistical Planning and Inference, 138, 1542–1555.