RICHARD W. MADSEN
The following exercise can illustrate the problem of bias in estimators to students in statistics courses. In some advanced courses an alternative estimator may be presented and properties of this estimator may be investigated via Monte Carlo studies.
Assume that an individual (perhaps a demographic or social scientist)
would like to know the mean number of children per family for families
living in a certain area. How can we use the students in a statistics class
to obtain an estimate of this mean? Recognizing that the class members
do not constitute a random sample from the population of interest, ask
members of the class to suggest a method of obtaining an estimate. One
solution likely to be proposed (or which the instructor might suggest for
consideration) is to ask each student to report the total number of children
in his or her family (counting the student) and then to use the average response
as a point estimate of the mean. For example, in a class of 23 students
I observed the following:
[Table of the 23 student responses not preserved in this copy.]
It is less obvious that the average obtained as an estimate is biased because larger families tend to be over-represented by the method used for obtaining the sample. This would be true even if interest were centred on family size conditioned on at least one child in the family. The problem hinges on the fact that children rather than parents are being interviewed. The effects of this can best be seen by an example.
Consider a small town inhabited by 25 families. Assume that the number
of children per family is as follows:
[Table of the number of children in each of the 25 families not preserved in this copy.]
The true value of the mean number of children per family is 72/25 = 2.880. Also the true mean number for families having 1 or more children is 72/23 = 3.130. Hence if Y = number of children in a randomly chosen family, E(Y) = 2.880 and E(Y | Y > 0) = 3.130.
Assume further that all children in this town are currently of elementary
school age and that a simple random sample of n students is to be chosen
(with replacement) from the children in the school. Just as in the class
experiment, each child will be asked the total number of children in his/her
family. If X denotes the response of a randomly chosen child, then the
probability distribution of X will be given by
[Table of P(X = k) not preserved in this copy; each entry is k times the number of families having k children, divided by 72.]
Since $E(\bar{X}) = E(X) = 292/72 = 4.056$, it is clear that the sample mean $\bar{X}$ will be a positively biased estimator
for either E(Y) or E(Y | Y > 0).
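The arithmetic of this example can be checked with a short script. The sketch below is an illustration only: the family counts in `town` are hypothetical, chosen merely to agree with the totals quoted in the text (25 families, 23 of them with children, 72 children, and a sum of squared family sizes of 292). They are not the values from the author's table.

```python
from fractions import Fraction

# HYPOTHETICAL town: family size -> number of families. Chosen only so
# that the totals match those quoted in the text; not the author's data.
town = {0: 2, 1: 4, 2: 6, 3: 5, 4: 3, 5: 2, 6: 2, 7: 1}

families = sum(town.values())                    # 25 families
children = sum(k * n for k, n in town.items())   # 72 children
with_children = families - town[0]               # 23 families with a child

mean_y = Fraction(children, families)            # E(Y)
mean_y_pos = Fraction(children, with_children)   # E(Y | Y > 0)

# A child chosen at random reports k with probability k * n_k / children.
dist_x = {k: Fraction(k * n, children) for k, n in town.items() if k > 0}
mean_x = sum(k * p for k, p in dist_x.items())   # E(X)

print(float(mean_y), float(mean_y_pos), float(mean_x))
```

Exact fractions are used so that E(X) comes out as 292/72 without rounding error; any other set of counts with the same totals would give the same three means.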
The point to be made to our students is that if a sampling procedure systematically ignores a portion of the target population (e.g. families having no children) or if certain portions of the population are sampled more heavily than others (e.g. families having large numbers of children rather than small numbers of children), then the resulting estimator, though intuitively appealing, can be severely biased.
In a more advanced course, one might investigate alternative methods
of estimating E(Y | Y > 0). This can be done by finding a suitable function
of the X's other than $\bar{X}$.
We do this by considering a more general example. Consider a town consisting
of N families, each having one or more children. (By doing this we
avoid the necessity of conditioning.) Let $k_0$ denote the
maximum family size and let $p_k$ be the proportion of families having k
children. If one family is selected at random and Y denotes
the number of children in that family, then the expected number of children per family
is
$$E(Y) = \sum_{k=1}^{k_0} k\,p_k .$$
Let K be the total number of children in the town, so that
$$K = N \sum_{k=1}^{k_0} k\,p_k = N\,E(Y).$$
Now assume one child is selected at random from among the K and that the child is asked the total number of children in his/her family. Let X be the number given by the child. Then
$$P(X = k) = \frac{\text{no. of children having } k \text{ in family}}{\text{total no. of children}} = \frac{k\,N\,p_k}{K} = \frac{k\,p_k}{E(Y)}. \tag{1}$$
Consequently
$$E(X) = \sum_{k=1}^{k_0} k\,P(X = k) = \frac{\sum_{k=1}^{k_0} k^2 p_k}{E(Y)} = \frac{E(Y^2)}{E(Y)}.$$
From this expression we can easily find the bias when using $\bar{X}$
as an estimator for E(Y). We have
$$E(\bar{X}) = E(X) = \frac{V(Y) + [E(Y)]^2}{E(Y)} = E(Y) + \frac{V(Y)}{E(Y)},$$
so the bias is V(Y)/E(Y), no matter what sample size is used.
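A small Monte Carlo study can make the last point concrete for students. The sketch below is one possible way to do it, not the author's procedure: the distribution `pmf_y`, the function names, the seed and the number of replications are all assumptions made for illustration. It draws children with replacement according to equation (1) and compares the average of the sample means with E(Y) + V(Y)/E(Y) for two sample sizes.

```python
import random

def moments(pmf):
    """Return (E(Y), V(Y)) for a pmf given as {k: p_k}."""
    mean = sum(k * p for k, p in pmf.items())
    var = sum(k * k * p for k, p in pmf.items()) - mean ** 2
    return mean, var

def child_report_pmf(pmf):
    """Distribution of X from equation (1): P(X = k) = k * p_k / E(Y)."""
    mean, _ = moments(pmf)
    return {k: k * p / mean for k, p in pmf.items()}

def simulated_mean_of_xbar(pmf, n, reps, rng):
    """Average of X-bar over `reps` samples of n children (with replacement)."""
    x_pmf = child_report_pmf(pmf)
    sizes, weights = list(x_pmf), list(x_pmf.values())
    total = 0.0
    for _ in range(reps):
        total += sum(rng.choices(sizes, weights=weights, k=n)) / n
    return total / reps

# HYPOTHETICAL family-size distribution, for illustration only.
pmf_y = {1: 0.5, 2: 0.3, 3: 0.2}
mean_y, var_y = moments(pmf_y)
expected_xbar = mean_y + var_y / mean_y   # E(Y) + V(Y)/E(Y)

rng = random.Random(0)   # fixed seed so that runs are repeatable
for n in (5, 50):
    estimate = simulated_mean_of_xbar(pmf_y, n, 4000, rng)
    print(n, round(estimate, 3), round(expected_xbar, 3))
```

For both sample sizes the simulated average should sit close to E(Y) + V(Y)/E(Y) rather than E(Y): increasing n reduces the variability of the sample mean but does nothing to the bias.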
As a specific example of the possible magnitude of the bias when using
actual data, consider the following. This information was taken from the
Canada Year Book 1978-79 produced by Statistics Canada. It is adapted
from Table 4.34. (The adaptation was necessary since the table grouped
together mothers having 10 or more children.) Here, we let Y be the number
of children borne by a randomly chosen Canadian mother who gave birth in
the last year. The probability distribution of Y is approximately as follows:
[Table of P(Y = y) not preserved in this copy.]
From this we determine that E(Y) = 1.929 and V(Y) = 1.370, so the bias in using $\bar{X}$
(the average number of children as reported by the children themselves) to estimate E(Y) is
V(Y)/E(Y) = .710. That is, $E(\bar{X}) = 2.639$ while the true value is E(Y) = 1.929.
As a second example, we consider the following data based on the number
of children born to women in the United States 15 years old and over who
have ever been married. The information is adapted from 1970 Census
of Population, Subject Reports, Women by Number of Children Ever Born,
Table 2, Bureau of the Census. Since the table grouped together mothers bearing
12 or more children, a slight adaptation has been necessary. In this case
we let Y' denote the number of children born to a woman randomly chosen
from the set just described and let Y denote the number of children born
to such women conditioned on Y' > 0. The distributions are as follows:
| y' | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| P(Y' = y') | .163 | .181 | .243 | .173 | .104 | .057 | .032 |
| y' | 7 | 8 | 9 | 10 | 11 | 12 | |
| P(Y' = y') | .019 | .012 | .007 | .005 | .003 | .001 | |
| y | 1 | 2 | 3 | 4 | 5 | 6 |
| P(Y = y) | .217 | .290 | .207 | .124 | .068 | .038 |
| y | 7 | 8 | 9 | 10 | 11 | 12 |
| P(Y = y) | .022 | .014 | .009 | .006 | .003 | .002 |
Here we have E(Y') = 2.466 (the expected number of children
born to these women) while E(Y) = 2.946 (the expected number given at least
one child). If $\bar{X}$ is used, the additional bias is V(Y)/E(Y) = 3.567/2.946
= 1.211, so $E(\bar{X}) = 4.157$.
This last value of course differs markedly from both E(Y') and E(Y).
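These figures can be reproduced directly from the tabulated probabilities. The sketch below assumes that the twelve probabilities listed for Y correspond to y = 1, ..., 12 (Y is conditioned on at least one child) and that the thirteen listed for Y' correspond to y' = 0, ..., 12. The variable names are illustrative.

```python
# P(Y' = y'): children born to ever-married women, y' = 0, ..., 12.
pmf_y_prime = {0: .163, 1: .181, 2: .243, 3: .173, 4: .104, 5: .057, 6: .032,
               7: .019, 8: .012, 9: .007, 10: .005, 11: .003, 12: .001}

# P(Y = y): the same, conditioned on at least one child, y = 1, ..., 12.
pmf_y = {1: .217, 2: .290, 3: .207, 4: .124, 5: .068, 6: .038,
         7: .022, 8: .014, 9: .009, 10: .006, 11: .003, 12: .002}

mean_y_prime = sum(y * p for y, p in pmf_y_prime.items())         # E(Y')
mean_y = sum(y * p for y, p in pmf_y.items())                     # E(Y)
var_y = sum(y * y * p for y, p in pmf_y.items()) - mean_y ** 2    # V(Y)
bias = var_y / mean_y                                             # V(Y)/E(Y)
mean_xbar = mean_y + bias                                         # E(X-bar)

print(round(mean_y_prime, 3), round(mean_y, 3), round(var_y, 3),
      round(bias, 3), round(mean_xbar, 3))
# prints: 2.466 2.946 3.567 1.211 4.157
```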
University of Missouri, Columbia.