Making Students Aware of Bias

RICHARD W. MADSEN

The following exercise can illustrate the problem of bias in estimators to students in statistics courses. In some advanced courses an alternative estimator may be presented and properties of this estimator may be investigated via Monte Carlo studies.

Assume that an individual (perhaps a demographer or social scientist) would like to know the mean number of children per family for families living in a certain area. How can we use the students in a statistics class to obtain an estimate of this mean? Recognizing that the class members do not constitute a random sample from the population of interest, ask members of the class to suggest a method of obtaining an estimate. One solution likely to be proposed (or which the instructor might suggest for consideration) is to ask each student for the total number of children in his or her family (counting the student) and then to use the average response as a point estimate of the mean. For example, in a class of 23 students I observed the following:
 
No. of children in family:  1  2  3  4  5  6  7
Frequency:                  0  6  7  4  3  2  1

Based on these data, the mean number of children per family would be estimated as

total no. of children / total no. of families = 83/23 = 3.609.
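The arithmetic behind the class estimate can be reproduced in a few lines. This is only an illustrative sketch: the variable names are hypothetical, and the counts are the 23 responses listed above.

```python
from fractions import Fraction

# Class data from the text: family size -> number of students reporting it.
class_counts = {1: 0, 2: 6, 3: 7, 4: 4, 5: 3, 6: 2, 7: 1}

n_students = sum(class_counts.values())
total_reported = sum(size * count for size, count in class_counts.items())

# The naive point estimate: average of the students' responses.
naive_estimate = Fraction(total_reported, n_students)

print(n_students, total_reported, round(float(naive_estimate), 3))
# prints: 23 83 3.609
```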
Having reached this stage, students in the class could be asked whether this method of estimation has any deficiencies. One obvious answer is that families with no children are systematically excluded from the sample, so the estimates will tend to be positively biased.

It is less obvious that the average obtained as an estimate is biased because larger families tend to be over-represented by the method used for obtaining the sample. This would be true even if interest were centred on family size conditioned on at least one child in the family. The problem hinges on the fact that children rather than parents are being interviewed. The effects of this can best be seen by an example.

Consider a small town inhabited by 25 families. Assume that the number of children per family is as follows:
 
No. of children in family:  0  1  2  3  4  5  6  7
Frequency:                  2  4  6  5  3  2  2  1

 

The true value of the mean number of children per family is 72/25 = 2.880. Also, the true mean number for families having 1 or more children is 72/23 = 3.130. Hence if Y = number of children in a randomly chosen family, E(Y) = 2.880 and E(Y | Y > 0) = 3.130.

Assume further that all children in this town are currently of elementary school age and that a simple random sample of n students is to be chosen (with replacement) from the children in the school. Just as in the class experiment, each child will be asked the total number of children in his/her family. If X denotes the response of a randomly chosen child, then the probability distribution of X will be given by
 
x:        1     2      3      4      5      6      7
P[X=x]:   4/72  12/72  15/72  12/72  10/72  12/72  7/72

Since E(X̄) = E(X) = 292/72 = 4.056, it is obvious that the sample mean X̄ will be a positively biased estimator for either E(Y) or E(Y | Y > 0).
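The three quantities for the small town can be checked exactly with rational arithmetic. The sketch below assumes that the 2 families not accounted for among the 23 with children are childless (25 families in total); the names are hypothetical.

```python
from fractions import Fraction

# Town from the text: family size -> number of families.
# The 2 childless families are inferred (25 families, 23 of them with children).
town = {0: 2, 1: 4, 2: 6, 3: 5, 4: 3, 5: 2, 6: 2, 7: 1}

n_families = sum(town.values())
n_children = sum(k * f for k, f in town.items())
n_with_children = sum(f for k, f in town.items() if k > 0)

mean_all = Fraction(n_children, n_families)                 # E(Y)
mean_with_children = Fraction(n_children, n_with_children)  # E(Y | Y > 0)

# A randomly chosen child reports size k with probability k * f_k / n_children.
child_pmf = {k: Fraction(k * f, n_children) for k, f in town.items() if k > 0}
mean_child_report = sum(k * p for k, p in child_pmf.items())  # E(X)

for label, value in [("E(Y)", mean_all),
                     ("E(Y|Y>0)", mean_with_children),
                     ("E(X)", mean_child_report)]:
    print(f"{label} = {float(value):.3f}")
```

Weighting each family size k by k times its frequency is what moves the mean from 3.130 to 4.056: the single seven-child family supplies 7 of the 72 possible respondents, while the four one-child families together supply only 4.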

The point to be made to our students is that if a sampling procedure systematically ignores a portion of the target population (e.g. families having no children) or if certain portions of the population are sampled more heavily than others (e.g. families having large numbers of children rather than small numbers of children), then the resulting estimator, though intuitively appealing, can be severely biased.
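The same point can be made by simulation, in the spirit of a small Monte Carlo study. The sketch below draws children with replacement from the town described earlier and averages their responses. The sample size, number of replications, and seed are arbitrary illustrative choices, not values from the text, and the 2 childless families are inferred from the total of 25.

```python
import random
from statistics import mean

# Town from the text: family size -> number of families (2 childless inferred).
town = {0: 2, 1: 4, 2: 6, 3: 5, 4: 3, 5: 2, 6: 2, 7: 1}

# One list entry per child, holding that child's family size.
# Childless families contribute no entries, so they can never be sampled.
children = [k for k, f in town.items() for _ in range(k * f)]

def sample_mean(n, rng):
    """Mean family size reported by n children drawn with replacement."""
    return mean(rng.choice(children) for _ in range(n))

rng = random.Random(0)        # fixed seed, chosen arbitrarily
n, replications = 20, 5000    # illustrative values, not from the text
estimates = [sample_mean(n, rng) for _ in range(replications)]
average_estimate = mean(estimates)

print(f"children in town: {len(children)}")
print(f"average of {replications} sample means: {average_estimate:.2f}")
```

With these settings the average of the simulated sample means should settle near 4.06, well above both 2.880 and 3.130, and increasing n narrows the spread of the estimates without moving their centre.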

In a more advanced course, one might investigate alternative methods of estimating E(Y | Y > 0). This can be done by finding a suitable function of the X's other than X̄. We do this by considering a more general example. Consider a town consisting of N families, each having one or more children. (By doing this we avoid the necessity of conditioning.) Let k_0 denote the maximum family size, and let p_k = proportion of families having k children. If one family is selected at random and if Y denotes the number of children in that family, then the expected number of children per family is

$$E(Y) = \sum_{k=1}^{k_0} k\,p_k .$$

Let K = total number of children in town, so that

$$K = \sum_{k=1}^{k_0} k\,N p_k = N\,E(Y).$$

Now assume one child is selected from among the K and that the child is asked the total number of children in his/her family. Let X = the number given by the child. Then

$$P(X = k) = \frac{\text{no. of children having } k \text{ in family}}{\text{total no. of children}} = \frac{k\,N p_k}{K} = \frac{k\,p_k}{E(Y)}. \tag{1}$$

Consequently

$$E(X) = \sum_{k=1}^{k_0} k\,P(X = k) = \frac{1}{E(Y)} \sum_{k=1}^{k_0} k^2 p_k = \frac{E(Y^2)}{E(Y)}.$$

From this expression we can easily find the bias when using X̄ as an estimator for E(Y). We have

$$E(\bar{X}) = E(X) = \frac{V(Y) + [E(Y)]^2}{E(Y)} = E(Y) + \frac{V(Y)}{E(Y)},$$

so the bias is V(Y)/E(Y), no matter what sample size is used.
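The bias result can be checked exactly for any distribution of family sizes. The sketch below reads (1) as P(X = k) = k p_k / E(Y), that is, children in k-child families as a share of all children, and uses a made-up distribution purely for illustration. It also verifies that E(1/X) = 1/E(Y) under the same reading, which suggests the harmonic mean of the responses as one possible "suitable function of the X's"; the text does not say which function it has in mind, so this is only one candidate.

```python
from fractions import Fraction

def moments(pmf):
    """Return (mean, variance) of a pmf given as {value: probability}."""
    m = sum(k * p for k, p in pmf.items())
    v = sum(k * k * p for k, p in pmf.items()) - m * m
    return m, v

def child_report_pmf(family_pmf):
    """Distribution of X, the size reported by a random child: k * p_k / E(Y)."""
    m, _ = moments(family_pmf)
    return {k: k * p / m for k, p in family_pmf.items()}

# Hypothetical town in which every family has at least one child.
family_pmf = {1: Fraction(1, 2), 2: Fraction(3, 10), 4: Fraction(1, 5)}

mean_y, var_y = moments(family_pmf)
x_pmf = child_report_pmf(family_pmf)
mean_x, _ = moments(x_pmf)
mean_inv_x = sum(p / k for k, p in x_pmf.items())  # E(1/X)

print(mean_x == mean_y + var_y / mean_y)  # True: bias is V(Y)/E(Y)
print(mean_inv_x == 1 / mean_y)           # True: E(1/X) = 1/E(Y)
```

Exact fractions are used so that both identities hold with equality rather than up to rounding error.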

As a specific example of the possible magnitude of the bias when using actual data, consider the following. This information was taken from the Canada Year Book 1978-79 produced by Statistics Canada. It is adapted from Table 4.34. (The adaptation was necessary since the table grouped together mothers having 10 or more children.) Here, we let Y be the number of children borne by a randomly chosen Canadian mother who gave birth in the last year. The probability distribution of Y is approximately as follows:
 
y:        1     2     3     4     5     6     7     8     9     10
P[Y=y]:   .442  .337  .139  .048  .017  .008  .004  .003  .001  .001

 

From this we determine that E(Y) = 1.929 and V(Y) = 1.370, so the bias in using X̄ (the average number of children as reported by the children themselves) to estimate E(Y) is V(Y)/E(Y) = .710. That is, E(X̄) = E(X) = 2.639, while the true value is E(Y) = 1.929.
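These figures follow directly from the tabulated distribution. A minimal check, using the probabilities exactly as printed above (variable names are hypothetical):

```python
# Approximate distribution of Y from the text (Canada Year Book adaptation).
pmf_y = {1: .442, 2: .337, 3: .139, 4: .048, 5: .017,
         6: .008, 7: .004, 8: .003, 9: .001, 10: .001}

mean_y = sum(y * p for y, p in pmf_y.items())
var_y = sum(y * y * p for y, p in pmf_y.items()) - mean_y ** 2

bias = var_y / mean_y    # V(Y)/E(Y)
mean_x = mean_y + bias   # expected response of a randomly chosen child

print(f"E(Y)={mean_y:.3f} V(Y)={var_y:.3f} bias={bias:.3f} E(X)={mean_x:.3f}")
# prints: E(Y)=1.929 V(Y)=1.370 bias=0.710 E(X)=2.639
```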

As a second example, we consider the following data based on the number of children born to women in the United States 15 years old and over who have ever been married. The information is adapted from 1970 Census of Population, Subject Reports, Women by Number of Children Ever Born, Table 2, Bureau of the Census. Since the table grouped together mothers bearing 12 or more children, a slight adaptation has been necessary. In this case we let Y' denote the number of children born to a woman randomly chosen from the set just described, and let Y denote the number of children born to such a woman given that Y' > 0. The distributions are as follows:
 
y':          0     1     2     3     4     5     6
P(Y' = y'):  .163  .181  .243  .173  .104  .057  .032
y':          7     8     9     10    11    12
P(Y' = y'):  .019  .012  .007  .005  .003  .001

y:           1     2     3     4     5     6
P(Y = y):    .217  .290  .207  .124  .068  .038
y:           7     8     9     10    11    12
P(Y = y):    .022  .014  .009  .006  .003  .002

 

Here we have E(Y') = 2.466 (the expected number of children born to these women) while E(Y) = 2.946 (the expected number given at least one child). If X̄ is used, the additional bias is V(Y)/E(Y) = 3.567/2.946 = 1.211, so E(X̄) = 4.157. This last value of course differs markedly from both E(Y') and E(Y).
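The two layers of bias in this example, excluding childless women and then over-sampling large families, can be separated numerically. The sketch below uses the probabilities as printed and adds one consistency check that is not in the text: E(Y) should be close to E(Y') / P(Y' > 0), up to rounding in the tables. The helper name is hypothetical.

```python
# Distributions as printed in the text (1970 U.S. Census adaptation).
pmf_y_prime = {0: .163, 1: .181, 2: .243, 3: .173, 4: .104, 5: .057, 6: .032,
               7: .019, 8: .012, 9: .007, 10: .005, 11: .003, 12: .001}
pmf_y = {1: .217, 2: .290, 3: .207, 4: .124, 5: .068, 6: .038,
         7: .022, 8: .014, 9: .009, 10: .006, 11: .003, 12: .002}

def mean_var(pmf):
    """Return (mean, variance) of a pmf given as {value: probability}."""
    m = sum(k * p for k, p in pmf.items())
    return m, sum(k * k * p for k, p in pmf.items()) - m * m

mean_y_prime, _ = mean_var(pmf_y_prime)   # E(Y'), childless women included
mean_y, var_y = mean_var(pmf_y)           # E(Y) and V(Y), given Y' > 0

# Conditioning check: dropping Y' = 0 rescales the mean by 1 / P(Y' > 0).
mean_y_from_prime = mean_y_prime / (1 - pmf_y_prime[0])

bias = var_y / mean_y    # additional bias from interviewing children
mean_x = mean_y + bias   # E(X)

print(f"E(Y')={mean_y_prime:.3f} E(Y)={mean_y:.3f} "
      f"(from Y': {mean_y_from_prime:.3f})")
print(f"V(Y)={var_y:.3f} bias={bias:.3f} E(X)={mean_x:.3f}")
```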

University of Missouri, Columbia.

Back to contents of The Best of Teaching Statistics