Simple Random Sampling and Confidence Intervals in Statistics, Lab Reports of Statistics

These notes cover simple random sampling, in which each subset of units of a given size has an equal chance of being selected from the population. They also discuss the use of histograms to understand the sampling distribution, the calculation of confidence intervals for population means and totals, a rule of thumb for deciding when the normal approximation is reasonable, and the limitations of simple random sampling, which lead to the introduction of stratified random sampling.

Uploaded on 08/18/2009

koofers-user-kpf
Sampling
Last updated April 8, 2008

Chapter 2: Environmental Sampling

This chapter discusses means of obtaining data for environmental studies. The data will come either from a planned experiment in the lab or from sampling done in the field. This chapter discusses several methodologies for obtaining data in a scientifically valid way via sampling. A key point is that a valid sampling plan is needed in order to obtain useful data. If the scientist simply goes out into the field and picks sites to sample with no plan ahead of time, then biases and other problems can lead to poor or worthless data.

Example: Estimate the number of trees in a forest with a particular disease. How can we do this? One idea is to divide the forest into plots of size 1 acre, say, and then obtain a random sample of these acres. Count the number of diseased trees in each sampled acre. From this sample, we can use statistical principles to estimate the number of trees in the forest with the disease.

Some of the most well-known sampling designs used in practice and discussed here are as follows:

• Simple Random Sampling
• Stratified Random Sampling
• Systematic Sampling
• Double Sampling
• Multistage Sampling

2.1 Introduction

First, we introduce some terminology and basic ideas.

Census: This occurs when one samples the entire population of interest. The United States government tries to do this every 10 years. However, in practical problems, a true census is almost never possible.

In most practical problems, instead of obtaining a census, a sample is obtained by observing the population of interest, hopefully without disturbing the population. The sample will generally be a very tiny fraction of the whole population.

One must of course determine the population of interest; this is not always an easy problem.
Also, the variable(s) of interest need to be decided upon.

Element: an object on which a measurement is taken.

Sampling Units: non-overlapping (usually) collections of elements from the population. In some situations, it is easy to determine the sampling units (households, hospitals, etc.) and in others there may not be well-defined sampling units (acre plots in a forest, for example).

Example. Suppose we want to determine the concentration of a chemical in the soil at a site of interest. One way to do this is to subdivide the region into a grid. The sampling units then consist of the points making up the grid. The obvious question then becomes how to determine the grid size. One can think of the actual chemical concentration in the soil at the site as varying over continuous spatial coordinates. Any grid that is used will provide a discrete approximation to the true soil contamination. Therefore, the finer the grid, the better the approximation to the truth.

Frame: A list of the sampling units.

Sample: A collection of sampling units from the frame.

Notation:
N   Number of units in the population
n   Sample size (number of units sampled)
y   Variable of interest

Two Types of Errors.

• Sampling Errors: these result from the fact that we generally do not sample the entire population. For example, the sample mean will generally not equal the population mean. This statistical error is fine and expected. Statistical theory can be used to ascertain the degree of this error by way of standard error estimates.
• Non-Sampling Errors: this is a catchall phrase for all errors other than sampling errors, such as non-response and clerical errors.

Sampling errors cannot be avoided (unless a census is taken). However, every effort should be made to avoid non-sampling errors by properly training those who do the sampling, carefully entering the data into a database, etc.

2.2 Simple Random Sampling (SRS)

One of the simplest sampling designs available is the simple random sample.
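As a minimal sketch of how such a sample might be drawn in practice, the Python snippet below selects n distinct units from a frame so that every subset of n units is equally likely to be chosen. The frame of numbered plots, the helper name `draw_srs`, and the seed are assumptions made for illustration; they do not come from the notes.

```python
import random


def draw_srs(frame, n, seed=None):
    """Return a simple random sample of n distinct units from the frame.

    random.Random.sample draws without replacement, so each subset of
    size n has the same chance of being selected.
    """
    rng = random.Random(seed)
    return rng.sample(list(frame), n)


# Hypothetical frame: one-acre plots labeled 1..3000 (an assumption).
frame = range(1, 3001)
sample = draw_srs(frame, n=100, seed=2008)
print(len(sample), "plots selected; all distinct:", len(set(sample)) == len(sample))
```

Fixing the seed only makes the illustration reproducible; a field study would typically record whatever random numbers were actually used.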
Sample Size Requirements. When using a confidence interval to estimate $\mu$ or the total $T_y$, we may require that our estimate lie within $d$ units of the true population parameter. How large a sample is required so that the half-width of the confidence interval is $d$? The following two formulas give the (approximate) sample size required for the population mean and total.

For the mean $\mu$:

$$ n \ge \frac{N \sigma^2 z_{\alpha/2}^2}{\sigma^2 z_{\alpha/2}^2 + N d^2}, $$

and for the total $T_y$:

$$ n \ge \frac{N^2 \sigma^2 z_{\alpha/2}^2}{N \sigma^2 z_{\alpha/2}^2 + d^2}, $$

where $z_{\alpha/2}$ is the standard normal critical value (for instance, if $\alpha = 0.05$, then $z_{0.025} = 1.96$). These two formulas are obtained by setting the half-width of the confidence interval equal to $d$ and solving algebraically for $n$. Note that these formulas require that we plug in a value for $\sigma^2$, which is unknown in practice. To overcome this problem, one can use an estimate of $\sigma^2$ from a previous study or a pilot study. Alternatively, one can use a reasonable range of values for the variable of interest to get an estimate of $\sigma$: $\sigma \approx \text{Range}/6$.

Example. Suppose a study is done to estimate the number of ash trees in a state forest consisting of $N = 3000$ acres. A sample of $n = 100$ one-acre plots is selected at random and the number of ash trees in each selected acre is counted. Suppose the average number of trees per acre was found to be $\bar{y} = 5.6$ with standard deviation $s = 3.2$. Find a 95% confidence interval for the total number of ash trees in the state forest.

The estimated total is $t_y = N\bar{y} = 3000(5.6) = 16800$ ash trees in the forest. The 95% confidence interval is

$$ 16800 \pm 1.96\,(3.2/\sqrt{100})\sqrt{3000(3000 - 100)} = 16800 \pm 1849.97. $$

A Note of Caution. The confidence interval formulas given above for the mean and total will be approximately valid if the sampling distributions of the sample mean and total are approximately normal. However, the approximate normality may not hold if the sample size is too small and/or if the distribution of the variable is strongly skewed. The following example illustrates the problem.
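Returning briefly to the sample-size formulas and the ash-tree example above, the arithmetic can be checked with a short Python sketch. The function names and the target half-widths passed to the sample-size helpers are illustrative assumptions; the formulas themselves are the ones stated in the notes.

```python
import math


def srs_total_ci(N, n, ybar, s, z=1.96):
    """Estimated total and CI half-width under simple random sampling."""
    total = N * ybar
    half_width = z * (s / math.sqrt(n)) * math.sqrt(N * (N - n))
    return total, half_width


def n_for_mean(N, sigma, d, z=1.96):
    """Smallest n giving a half-width of at most d for the mean."""
    return math.ceil(N * sigma**2 * z**2 / (sigma**2 * z**2 + N * d**2))


def n_for_total(N, sigma, d, z=1.96):
    """Smallest n giving a half-width of at most d for the total."""
    return math.ceil(N**2 * sigma**2 * z**2 / (N * sigma**2 * z**2 + d**2))


# Ash-tree example: N = 3000 acres, n = 100 plots, ybar = 5.6, s = 3.2.
total, half_width = srs_total_ci(N=3000, n=100, ybar=5.6, s=3.2)
print(f"{total:.0f} +/- {half_width:.2f}")

# Hypothetical planning targets (assumptions): mean within 0.5 trees per
# acre, which is the same requirement as the total within 0.5 * 3000 trees.
print(n_for_mean(3000, sigma=3.2, d=0.5), n_for_total(3000, sigma=3.2, d=1500))
```

The two sample-size helpers agree whenever the target for the total is N times the target for the mean, which is one way to see that the two formulas are the same requirement on different scales.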
Suppose a survey is to be conducted to estimate the total number of students in Ohio public schools suffering from asthma. Let us take each county as a sampling unit. Then N = 88 for the eighty-eight counties in Ohio. For the sake of illustration, suppose we know the number of students in each county suffering from asthma and that the data are given in the following table (county number, county name as abbreviated in the source, number of students):

 1 Adams        359    2 Allen       1296    3 Ashlan       520    4 Ashtab      1274
 5 Athens       580    6 Auglaize     558    7 Belmont      638    8 Brown        679
 9 Butler      3980   10 Carrol       249   11 Champaign    549   12 Clark       1748
13 Clermo      2083   14 Clinton      586   15 Columb      1221   16 Coshocton    415
17 Crawford     522   18 Cuyahoga   14570   19 Darke        637   20 Defian       447
21 Delaware    1448   22 Erie        1012   23 Fairfield   1710   24 Fayett       373
25 Frankl     13440   26 Fulton       658   27 Gallia       389   28 Geauga       941
29 Greene      1550   30 Guerns       464   31 Hamilton    8250   32 Hancock      888
33 Hardin       448   34 Harris       209   35 Henry        346   36 Highland     601
37 Hockin       264   38 Holmes       380   39 Huron        867   40 Jackson      383
41 Jefferson    778   42 Knox         613   43 Lake        2499   44 Lawren       822
45 Lickin      1979   46 Logan        558   47 Lorain      3618   48 Lucas       4632
49 Madison      517   50 Mahoni      2608   51 Marion       824   52 Medina      2250
53 Meigs        264   54 Mercer       602   55 Miami       1192   56 Monroe       185
57 Montgo      5459   58 Morgan       178   59 Morrow       413   60 Muskin      1206
61 Noble        181   62 Ottawa       436   63 Pauldi       267   64 Perry        440
65 Pickaw       699   66 Pike         406   67 Portage     1812   68 Preble       572
69 Putnam       435   70 Richla      1473   71 Ross         893   72 Sandus       713
73 Scioto       849   74 Seneca       601   75 Shelby       684   76 Stark       4576
77 Summit      6205   78 Trumbu      2556   79 Tuscarawas  1117   80 Union        572
81 VanWert      289   82 Vinton       179   83 Warren      2404   84 Washington   784
85 Wayne       1279   86 Willia       499   87 Wood        1363   88 Wyando       247

Figure 1 shows the actual distribution of students with asthma for the N = 88 counties and we see a very strongly skewed distribution. The reason for the skewness is that most counties are rural with small populations and hence relatively small numbers of children with asthma.
Counties encompassing urban areas have very large populations and hence large numbers of students with asthma.

[Figure 1: Histogram of Number of Students with Asthma (horizontal axis: Number of Students; vertical axis: Frequency). Actual distribution of student totals per county. Note that the distribution is very strongly skewed to the right.]

To illustrate the sampling distribution of the estimated total $t_y = N\bar{y}$, 10,000 samples of size $n$ were obtained and, for each sample, the total was estimated. The histograms show the sampling distribution of $t_y$ for sample sizes of $n = 5$, 25, and 50 in Figure 2, Figure 3, and Figure 4 respectively. The long vertical line denotes the true total of $T = 131{,}260$. Clearly the sampling distribution of $t_y$, the estimated total, is not nearly normal for $n = 5$. We see a bimodal distribution, which results from the presence of both lightly populated and heavily populated counties.

Cochran (1977) gives the following rule of thumb for populations with positive skewness: the normal approximation will be reasonable provided the sample size $n$ satisfies

$$ n \ge 25\,G_1^2, $$

where $G_1$ is the population skewness,

$$ G_1 = \sum_{i=1}^{N} (y_i - \mu)^3 / (N\sigma^3). $$

For this particular example, we find $25\,G_1^2 = 357$, which is much bigger than the entire number of sampling units (counties)!

In order to get an idea of how well the 95% confidence interval procedure works for this data, we performed the sampling 10,000 times for various sample sizes and

If we obtain a sample of size $n$ from a population of size $N$, and each unit in the population either has or does not have a particular attribute of interest (e.g. disease or no disease), then the number of items in the sample that have the attribute is a random variable having a hypergeometric distribution. If $N$ is considerably larger than $n$, then the hypergeometric distribution is approximated by the binomial distribution. We omit the details of these two probability distributions.
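Although the details of the two distributions are omitted, the closeness of the approximation is easy to check numerically. The sketch below compares the hypergeometric and binomial probabilities for a hypothetical population; the values of N, K (the number of units with the attribute), and n are assumptions chosen only for illustration.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(k sampled units have the attribute) when n units are drawn
    without replacement from N units, K of which have the attribute."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def binom_pmf(k, n, p):
    """P(k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def max_gap(N, K, n):
    """Largest absolute difference between the two pmfs over k = 0..n."""
    return max(
        abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, n, K / N))
        for k in range(n + 1)
    )


# Hypothetical populations with the same proportion p = 0.3 (assumptions).
gap_small_pop = max_gap(N=20, K=6, n=10)        # n is half the population
gap_large_pop = max_gap(N=10_000, K=3_000, n=10)  # N much larger than n
print(gap_small_pop, gap_large_pop)
```

With the sample making up half of the small population the two distributions differ visibly, while for the large population the gap is negligible, which is the situation described in the text.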
The data for experiments such as these look like $y_1, y_2, \ldots, y_n$, where

$$ y_i = \begin{cases} 1 & \text{if the } i\text{th unit has the attribute} \\ 0 & \text{if the } i\text{th unit does not have the attribute.} \end{cases} $$

The population proportion is denoted by $p$ and is given by

$$ p = \frac{1}{N}\sum_{i=1}^{N} y_i. $$

We can estimate $p$ using the sample proportion $\hat{p}$ given by

$$ \hat{p} = \frac{1}{n}\sum_{i=1}^{n} y_i. $$

Note that in statistics, it is common to denote the estimator of a parameter such as $p$ by $\hat{p}$ ("p-hat"). This goes for other parameters as well. Using simple random sampling, one can show that

$$ \mathrm{var}(\hat{p}) = \left(\frac{N-n}{N-1}\right)\frac{p(1-p)}{n}. $$

An unbiased estimator of this variance is given by

$$ \widehat{\mathrm{var}}(\hat{p}) = \left(\frac{N-n}{N}\right)\frac{\hat{p}(1-\hat{p})}{n-1}. $$

An approximate $(1-\alpha)100\%$ confidence interval for the population proportion is given by

$$ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{(N-n)\,\hat{p}(1-\hat{p})}{N(n-1)}}. $$

This confidence interval is justified by assuming that the sample proportion behaves like a normal random variable, which follows from the central limit theorem. The approximation is better when the true value of $p$ is near 1/2. If $p$ is close to zero or one, the distribution of $\hat{p}$ tends to be skewed quite strongly unless the sample size is very large.

The sample size required to estimate $p$ with confidence level $(1-\alpha)$ and half-width $d$ is given by

$$ n \ge \frac{z_{\alpha/2}^2\, p(1-p)\, N}{z_{\alpha/2}^2\, p(1-p) + d^2 (N-1)}. $$

Note that this formula requires knowing $p$, which is what we are trying to estimate! There are a couple of ways around this problem. (1) Plug in $p = 1/2$ in the formula. Because $p(1-p)$ is largest at $p = 1/2$, this guarantees a sample size at least as large as necessary. (2) Use a guess for $p$, perhaps based on a previous study.

2.7 Stratified Random Sampling

Data is often expensive and time consuming to collect. Statistical ideas can be used to determine efficient sampling plans that will provide the same level of accuracy for estimating parameters with smaller sample sizes. The simple random sample works just fine, but we can often do better in terms of efficiency. There are numerous sampling designs that do a better job than simple random sampling.
In this section we look at perhaps the most popular alternative to simple random sampling: Stratified Random Sampling. The idea is to partition the population into $K$ different strata. Often the units within a stratum will be more homogeneous. For stratified random sampling, one simply obtains a simple random sample in each stratum. Of course, the problem arises as to how many observations to allocate to each stratum. Another issue is how to define the strata in the first place.

There are three advantages to stratifying:

1. Parameter estimation can be more precise with stratification.
2. Sometimes stratifying reduces sampling cost, particularly if the strata are based on geographical considerations.
3. We can obtain separate estimates of parameters in each of the strata, which may be of interest in and of itself.

Examples.

• Estimate the mean PCB level in a particular species of fish. We could stratify the population of fish based on sex and also on the lakes in which the fish live.
• Estimate the proportion of farms in Ohio that use a particular pesticide. We could stratify on the basis of the size of the farm (small, medium, large) and/or on geographical location, etc.

These two examples illustrate a couple of points about stratification. Sometimes the units fall naturally into different strata and sometimes they do not.

Notation. Let $N_i$ denote the size of the $i$th stratum for $i = 1, 2, \ldots, K$, where $K$ is the number of strata. Then the overall population size is

$$ N = \sum_{i=1}^{K} N_i. $$

If we obtain a random sample of size $n_i$ from the $i$th stratum, we can estimate the mean $\mu_i$ of the $i$th stratum by $\bar{y}_i$, the average of the data in the $i$th stratum. The estimated variance of $\bar{y}_i$ is

$$ (s_i^2/n_i)(1 - n_i/N_i), $$

where $s_i^2$ is the sample variance in the $i$th stratum. The population mean is given by

$$ \mu = \sum_{i=1}^{K} N_i \mu_i / N, $$

which can be estimated by

$$ \bar{y}_s = \sum_{i=1}^{K} N_i \bar{y}_i / N, $$

with an estimated variance given by

$$ \hat{\sigma}^2_{\bar{y}_s} = \sum_{i=1}^{K} \left(\frac{N_i}{N}\right)^2 (s_i^2/n_i)(1 - n_i/N_i). $$
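The stratified estimate of the mean and its estimated variance can be sketched in a few lines of Python. The function name and argument layout are assumptions made for illustration; the inputs are the stratum sizes, sample sizes, sample means, and sample standard deviations reported for the honeysuckle survey in this section (new growth and old growth).

```python
import math


def stratified_mean(N_sizes, n_sizes, means, sds):
    """Stratified estimate of the population mean and its standard error.

    N_sizes: stratum sizes N_i        n_sizes: sample sizes n_i
    means:   sample means ybar_i      sds:     sample standard deviations s_i
    """
    N = sum(N_sizes)
    ybar_s = sum(Ni * yi for Ni, yi in zip(N_sizes, means)) / N
    var = sum(
        (Ni / N) ** 2 * (si**2 / ni) * (1 - ni / Ni)
        for Ni, ni, si in zip(N_sizes, n_sizes, sds)
    )
    return ybar_s, math.sqrt(var)


# Honeysuckle survey summaries from the notes: new growth, old growth.
ybar_s, se = stratified_mean(
    N_sizes=[86, 72],
    n_sizes=[14, 12],
    means=[63.36, 192.83],
    sds=[32.738, 80.782],
)
z = 1.96  # standard normal critical value for a 95% interval
print(f"mean per acre: {ybar_s:.2f}, SE: {se:.2f}")
print(f"95% CI: ({ybar_s - z * se:.2f}, {ybar_s + z * se:.2f})")
```

Multiplying both the estimate and the standard error by N = 158 gives the corresponding estimate and standard error for the total.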
The estimated standard error of $\bar{y}_s$, $\widehat{SE}(\bar{y}_s)$, is the square root of the estimated variance $\hat{\sigma}^2_{\bar{y}_s}$.

The population total $T = N\mu$ can be estimated using

$$ t_s = N\bar{y}_s $$

with estimated standard error

$$ \widehat{SE}(t_s) = N \cdot \widehat{SE}(\bar{y}_s). $$

Approximate $(1-\alpha)100\%$ confidence intervals for the population mean and total using stratified random sampling are given by

Population Mean: $\bar{y}_s \pm z_{\alpha/2}\,\widehat{SE}(\bar{y}_s)$,

and

Population Total: $t_s \pm z_{\alpha/2}\,\widehat{SE}(t_s)$.

Example. A survey was done to estimate the average number of invasive honeysuckle plants per acre in a forest. The forest is partitioned into 158 one-acre plots. $N_1 = 86$ acres of the forest are new growth and $N_2 = 72$ acres are old growth. A sample of $n_1 = 14$ acres of new growth and $n_2 = 12$ acres of old growth forest was obtained, yielding the following data:

New Growth:  97  67  42 125  25  92 105  86  27  43  45  59  53  21
Old Growth: 125 155 130 111 242 101 310 236 220 352 142 190

New Growth: $\bar{y}_1 = 63.36$, $s_1 = 32.738$
Old Growth: $\bar{y}_2 = 192.83$, $s_2 = 80.782$

This can be beneficial in some situations. In addition, a systematic sample may yield more precise estimators when the correlation between pairs of observations in the systematic sample is negative. However, if this correlation is positive, then the simple random sample will be more precise.

We can use the same formulas for estimating the population mean and total as were used for a simple random sample. These estimators will be approximately unbiased for the population mean and variance. If the units in the population are assumed to be arranged in a random order, then the variance of the sample mean from a systematic sample is, on average, the same as the variance from a simple random sample. In this case, the variance of $\bar{y}$ from a systematic sample can be estimated using the same formula as for a simple random sample: $(N-n)s^2/(Nn)$.

An alternative way of estimating the variability is to consider the order of the observations in the systematic sample, $y_1, y_2, \ldots, y_n$, and then note that for consecutive neighboring points $y_i$ and $y_{i-1}$ we have

$$ E[(y_i - y_{i-1})^2] = 2\sigma^2, $$

assuming that neighboring points are independent. From this, it follows that

$$ s_L^2 = 0.5 \sum_{i=2}^{n} (y_i - y_{i-1})^2 / (n-1) $$

can be used to estimate the variance, and therefore the standard error of the mean $\bar{y}$ can be estimated using

$$ \widehat{SE}(\bar{y}) = s_L/\sqrt{n}. $$

If the population has some periodic variation, then the systematic sampling approach may lead to poor estimates. Suppose you decide to use a systematic sample to monitor river water and you plan on obtaining samples every seventh day (a 1-in-7 systematic sample). Then this sampling plan reduces to taking a sample of water on the same day of the week for a number of weeks. If a plant upstream discharges waste on a particular day of the week, then the systematic sample may very likely produce a poor estimate of the population mean.

Systematic sampling can be used to estimate proportions as well as means and totals.

Systematic sampling can be used in conjunction with stratified random sampling. The idea is to stratify the population based on some criterion and then obtain a systematic sample within each stratum.

2.10 Other Design Strategies

There are many different sampling designs used in practice and the choice will often be dictated by the type of survey that is required. We have discussed simple random sampling, stratified random sampling and systematic sampling. Now we briefly discuss a few other well-known sampling methodologies.

Cluster Sampling. The situation for cluster sampling is that the population consists of groups of units that are close in some sense (clusters). These groups are known as primary units. The idea of cluster sampling is to obtain a simple random sample of primary units and then to sample every unit within each selected cluster. For example, suppose a survey of schools in the state is to be conducted to study the prevalence of lead paint.
One could obtain a simple random sample of schools throughout the state. But this could lead to high costs due to a lot of travel. Instead, one could treat school districts as clusters and obtain a simple random sample of school districts. Once an investigator is in a particular school district, she could sample every school in the district.

A rule of thumb for determining appropriate clusters is that the number of elements in a cluster (e.g. schools per district) should be small relative to the population size and the number of clusters should be large.

Note that one of the difficulties in sampling is obtaining a frame. Cluster sampling often makes this task much easier since it is often easy to compile a list of the primary sampling units (e.g. school districts).

Cluster sampling is often less efficient than simple random sampling because units within a cluster often tend to be similar. Thus, if we sample every unit within a cluster, we are in a sense obtaining redundant information. However, if the cost of sampling an entire cluster is not too high, then cluster sampling becomes appealing for the sake of convenience. Note that we can increase the efficiency of cluster sampling by increasing the variability within clusters. That is, when deciding on how to form clusters, say over a spatial region, one could choose clusters that are long and thin as opposed to square or circular so that there will be more variability within each cluster.

Estimation and standard error formulas for cluster sampling can be found in most textbooks on sampling (e.g. Scheaffer, Mendenhall, and Ott 1996).

Notation.
$N$ = the number of clusters
$n$ = number of clusters selected in a simple random sample
$m_i$ = number of elements in cluster $i$
$M = \sum_{i=1}^{N} m_i$ = total number of elements in the population
$y_i$ = the total of all observations in the $i$th cluster

The population mean $\mu$ is estimated by

$$ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}. $$
This estimator is a special case of a ratio estimator, which we shall introduce a bit later. The estimated variance of $\bar{y}$ is given by

$$ \widehat{\mathrm{var}}(\bar{y}) = \{(N-n)/(N n \bar{M}^2)\}\, s_r^2, $$

where

$$ s_r^2 = \sum_{i=1}^{n} (y_i - \bar{y} m_i)^2 / (n-1), $$

and $\bar{M} = M/N$ is the average size of a cluster for the population. Note that often in practice $M$, and hence $\bar{M}$, is unknown, in which case $\bar{M}$ can be estimated by

$$ \bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i. $$

Estimating the Population Total in Cluster Sampling. An estimate of the population total in cluster sampling can be obtained in much the same way it was obtained in simple random sampling:

$$ t_y = M\bar{y}. $$

The estimated variance of $t_y$ is simply $M^2\,\widehat{\mathrm{var}}(\bar{y})$. What is wrong with using this estimator of the population total? The problem is that it requires that we know $M$, which is often unknown. Alternatively, if we do not know $M$, we could estimate the population total using $N\bar{y}_t$, where

$$ \bar{y}_t = \frac{1}{n}\sum_{i=1}^{n} y_i $$

is the average of the cluster totals for the sampled clusters. The estimated variance of $N\bar{y}_t$ is

$$ \widehat{\mathrm{var}}(N\bar{y}_t) = N(N-n)\,s_t^2/n, $$

where

$$ s_t^2 = \sum_{i=1}^{n} (y_i - \bar{y}_t)^2 / (n-1). $$

$N\bar{y}_t$ is an unbiased estimator of the population total, but because it does not use the information on the cluster sizes (i.e. the $m_i$'s), the variance of $N\bar{y}_t$ tends to be bigger than the variance of $t_y$.

Example. Roberts et al. (2004) used a cluster sampling approach to estimate the number of additional deaths in Iraq that resulted from the Iraq war that started in 2003. Based on this article, it was widely reported that the number of Iraqis killed in the war (so far) is 100,000. Their estimate of Iraqi deaths due to the war was 98,000 (not including Falluja, which had a very high number of deaths). A 95% confidence interval for this total was given as (8000, 194000). Thirty-three clusters were sampled based on Governorates and 30 households were interviewed in each cluster. The 33 clusters were sampled using a systematic sampling approach. Additional details can be found in the article.
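The cluster sampling formulas above can be sketched as a small Python function. The function name, the dictionary keys, and the three-cluster data set are hypothetical and chosen only so the arithmetic is easy to follow; when M is not supplied, the sketch substitutes the sample average cluster size for the population average, as the notes suggest.

```python
def cluster_estimates(y, m, N, M=None):
    """Cluster-sample estimates from the n sampled clusters.

    y: cluster totals y_i          m: cluster sizes m_i
    N: number of clusters in the population
    M: total number of elements in the population (optional)
    """
    n = len(y)
    ybar = sum(y) / sum(m)                      # ratio-type mean per element
    M_bar = M / N if M is not None else sum(m) / n
    s2_r = sum((yi - ybar * mi) ** 2 for yi, mi in zip(y, m)) / (n - 1)
    var_ybar = (N - n) / (N * n * M_bar**2) * s2_r

    ybar_t = sum(y) / n                         # mean of the cluster totals
    s2_t = sum((yi - ybar_t) ** 2 for yi in y) / (n - 1)
    return {
        "mean": ybar,
        "var_mean": var_ybar,
        "total_unbiased": N * ybar_t,
        "var_total_unbiased": N * (N - n) * s2_t / n,
    }


# Hypothetical data (assumption): 3 of N = 10 clusters sampled, with
# cluster totals exactly proportional to cluster sizes.
est = cluster_estimates(y=[10, 20, 30], m=[2, 4, 6], N=10)
print(est)
```

In this contrived data set every cluster has the same per-element mean, so the size-adjusted estimator has zero estimated variance while the estimator based on cluster totals alone does not, which illustrates why ignoring the cluster sizes tends to inflate the variance.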
Letting $\mu_x$ and $\mu_u$ denote the population means of $x$ and $u$ respectively, we would expect that

$$ \frac{\mu_x}{\mu_u} \approx \frac{\bar{x}}{\bar{u}}, $$

in which case $\mu_x \approx r\mu_u$. Using this relationship, we can define the ratio estimator of the mean $\mu_x$ as

$$ \bar{x}_{ratio} = r\mu_u, $$

and if $N$ is the total population size, then the ratio estimator of the total $\tau$ is

$$ t_x = rN\mu_u. $$

What is the intuition behind the ratio estimator? If the estimated ratio remains fairly constant regardless of the sample obtained, then there will be little variability in the estimated ratio and hence little variability in the estimated mean (or total) using the ratio estimator. Another way of thinking of the ratio estimator is as follows: suppose one obtains a sample and estimates $\mu_x$ using $\bar{x}$ and, for this particular sample, $\bar{x}$ underestimates the true mean $\mu_x$. Then the corresponding mean $\bar{u}$ will also tend to underestimate $\mu_u$ for this sample if $x$ and $u$ are positively correlated. In other words, $\mu_u/\bar{u}$ will be greater than one. The ratio estimator of $\mu_x$ is

$$ \bar{x}_{ratio} = r\mu_u = \bar{x}\left(\frac{\mu_u}{\bar{u}}\right). $$

From this relationship, we see that the ratio estimator takes the usual estimator $\bar{x}$ and scales it upwards by a factor of $\mu_u/\bar{u}$, which will help correct the under-estimation by $\bar{x}$.

There is a problem with the ratio estimator: it is biased. In other words, the ratio estimator of $\mu_x$ does not come out to $\mu_x$ on average. One can show that

$$ E[\bar{x}_{ratio}] = \mu_x - \mathrm{cov}(r, \bar{u}). $$

However, the variability of the ratio estimator often tends to be smaller than the variability of the usual estimator $\bar{x}$, indicating that it may still be preferable.

An estimate of the variance of the ratio estimator $\bar{x}_{ratio}$ is given by the following formula:

$$ \widehat{\mathrm{var}}(\bar{x}_{ratio}) = (1 - n/N) \sum_{i=1}^{n} (x_i - r u_i)^2 / [n(n-1)]. \qquad (2) $$

By the central limit theorem applied to the ratio estimator, $\bar{x}_{ratio}$ follows an approximate normal distribution for large sample sizes. In order to guarantee a good approximation, a rule of thumb in practice is to have $n \ge 30$ and the coefficient of variation $\sigma_x/\mu_x < 0.10$.
If the coefficient of variation is large, then the variability of the ratio estimator tends to be large as well.

An approximate confidence interval for the population mean using the ratio estimator is

$$ \bar{x}_{ratio} \pm z_{\alpha/2}\,\widehat{se}(\bar{x}_{ratio}), $$

where $\widehat{se}(\bar{x}_{ratio})$ is the square root of the estimated variance of the ratio estimator in (2). An approximate confidence interval for the population total using the ratio estimator is given by

$$ t_x \pm z_{\alpha/2}\,\widehat{se}(t_x), $$

where

$$ \widehat{se}(t_x) = N\,\widehat{se}(\bar{x}_{ratio}). $$

When estimating the mean or total of a population when an auxiliary variable is available, one needs to decide between using the usual estimator $\bar{x}$ and the ratio estimator. If the correlation between $x$ and $u$ is substantial, then the ratio estimator should be preferred. A rough rule of thumb in this regard is to use the ratio estimator when the correlation between $x$ and $u$ exceeds 0.5. There is a theoretical justification for this given in Cochran (1977, page 157), based on assuming the coefficients of variation for $x$ and $u$ are approximately equal.

Example. A study of acid rain was undertaken by examining samples of water in 32 lakes in 1977 (Mohn and Volden 1985). In 1976, the pH was measured in the population of all $N = 68$ lakes, which gave a mean value of $\mu_u = 5.715$. Figure 5 shows a scatterplot of the pH values from the sample of $n = 32$ lakes in 1977. The goal is to estimate the mean pH level $\mu_x$ for all $N = 68$ lakes for 1977. The data for the $n = 32$ lakes are given in the following table:

1976   1977
____________
4.32   4.23
4.97   4.74
4.58   4.55
4.72   4.81
4.53   4.70
4.96   5.35
5.31   5.14
5.42   5.15
4.87   4.76
5.87   5.95
6.27   6.28
6.67   6.44
5.38   5.32
5.41   5.94
5.60   6.10
4.93   4.94
5.60   5.69
6.72   6.59
5.97   6.02
4.68   4.72
6.23   6.34
6.15   6.23
4.82   4.77
5.42   4.82
5.31   5.77
6.26   5.03
5.99   6.10
4.88   4.99
4.60   4.88
4.85   4.65
5.97   5.82
6.05   5.97

The sample means for the $n = 32$ lakes are $\bar{x} = 5.3997$ and $\bar{u} = 5.4159$, which gives an estimated ratio of

$$ r = \frac{\bar{x}}{\bar{u}} = \frac{5.3997}{5.4159} = 0.9970. $$
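The quantities used in this example can be reproduced directly from the table. The Python sketch below copies the 32 pairs of pH values from the table above and applies the ratio estimator and variance formula (2); the variable names are assumptions made for illustration.

```python
import math

# (1976 pH, 1977 pH) for the n = 32 sampled lakes, copied from the table.
ph = [
    (4.32, 4.23), (4.97, 4.74), (4.58, 4.55), (4.72, 4.81),
    (4.53, 4.70), (4.96, 5.35), (5.31, 5.14), (5.42, 5.15),
    (4.87, 4.76), (5.87, 5.95), (6.27, 6.28), (6.67, 6.44),
    (5.38, 5.32), (5.41, 5.94), (5.60, 6.10), (4.93, 4.94),
    (5.60, 5.69), (6.72, 6.59), (5.97, 6.02), (4.68, 4.72),
    (6.23, 6.34), (6.15, 6.23), (4.82, 4.77), (5.42, 4.82),
    (5.31, 5.77), (6.26, 5.03), (5.99, 6.10), (4.88, 4.99),
    (4.60, 4.88), (4.85, 4.65), (5.97, 5.82), (6.05, 5.97),
]
u = [a for a, _ in ph]  # auxiliary variable: 1976 pH
x = [b for _, b in ph]  # variable of interest: 1977 pH
N, n, mu_u = 68, len(ph), 5.715  # population size, sample size, 1976 mean

x_bar = sum(x) / n
u_bar = sum(u) / n
r = x_bar / u_bar                      # estimated ratio
x_ratio = r * mu_u                     # ratio estimate of the 1977 mean
ss = sum((xi - r * ui) ** 2 for xi, ui in zip(x, u))
var_ratio = (1 - n / N) * ss / (n * (n - 1))   # formula (2)
se_ratio = math.sqrt(var_ratio)

print(f"x_bar={x_bar:.4f} u_bar={u_bar:.4f} r={r:.4f}")
print(f"ratio estimate={x_ratio:.4f} se={se_ratio:.4f}")
```

Because the notes round the estimated variance to 0.0017 before taking the square root, the unrounded standard error from this sketch can differ from the reported 0.0412 in the third decimal place.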
The ratio estimator of $\mu_x$, the average pH in the 68 lakes, is

$$ \bar{x}_{ratio} = r\mu_u = (0.9970)(5.715) = 5.6979, $$

which is higher than the simple estimate of $\bar{x} = 5.3997$. Therefore, the ratio estimate takes the usual estimate of 5.3997 and scales it up by a factor of $\mu_u/\bar{u} = 5.715/5.4159 = 1.0552$. The sample correlation between pH in 1976 and 1977 for the 32 lakes is 0.883, which indicates that the ratio estimator will be more efficient than the usual simple random sample estimator of the mean. The estimated coefficients of variation for 1976 and 1977 are respectively 0.1234 and 0.1244. Although the coefficient of variation for 1977 exceeds our rule of thumb value of 0.10, it does not exceed it by much. The estimated variance for the ratio estimator can be computed as

$$ \widehat{\mathrm{var}}(\bar{x}_{ratio}) = (1 - n/N)\sum_{i=1}^{32} (x_i - 0.9970\,u_i)^2/[32(31)] = (1 - 32/68)(3.2473)/[32(31)] = 0.0017. $$

The standard error of $\bar{x}_{ratio}$ is obtained by taking the square root of this quantity, which gives $\widehat{se}(\bar{x}_{ratio}) = \sqrt{0.0017} = 0.0412$. A 95% confidence interval for $\mu_x$ is

$$ 5.6979 \pm 1.96(0.0412) = 5.6979 \pm 0.0808. $$

Note that if we had just used the sample mean to estimate the population mean (obtaining $\bar{x} = 5.3997$), the associated standard error would be

$$ \widehat{se}(\bar{x}) = (s/\sqrt{n})\sqrt{1 - n/N} = (0.6716/\sqrt{32})\sqrt{1 - 32/68} = 0.0864 $$

interest. Then another sample (often a sub-sample of the first sample) is obtained where the variable $x$ of interest is measured. Some examples of easy-to-measure auxiliary variables are

• Examine aerial photographs of sampling units to get rough counts of trees, animals, etc.
• Published data from past surveys.
• A quick computer search of files using a keyword, for example.

In order to perform a double sampling, one first obtains a preliminary sample of size $n'$, say, and measures the variable $u$. From this preliminary sample, we can get an estimate of $\mu_u$ using

$$ \hat{\mu}'_u = \sum_{i=1}^{n'} u'_i / n'. $$

Then one obtains the usual sample of size $n$, perhaps as a sub-sample of the preliminary sampled units.
From this sample, we can compute the ratio as in a ratio sample:

$$ r = \frac{\bar{x}}{\bar{u}}. $$

Then, the population total for $x$ can be estimated using

$$ t_x = N r \hat{\mu}'_u. $$

The variance for the estimated total using double sampling is more complicated than the variance of the ratio estimator because we have an extra source of variability with double sampling, namely the variability associated with the preliminary sample. The estimated variance of the double sampling total estimator is given by

$$ \widehat{\mathrm{var}}(t_x) = N(N - n')s^2/n' + \frac{N^2(n' - n)}{n n'}\, s_r^2, $$

where

$$ s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - r u_i)^2. $$

Notice that if $n' = N$, that is, if the preliminary sample is of the entire population (i.e. a census), then the first term in this variance formula becomes zero and we end up with the same formula as the ratio estimator variance.

2.14 Unequal Probability Sampling

The sampling procedures discussed up to this point involve simple random sampling of sampling units, in which case each unit has the same chance of being selected for the sample. Even with sampling designs more complicated than simple random sampling, such as stratified random sampling, a simple random sample was obtained in each stratum. In many situations, a simple random sample is either not possible or not preferable.

In line-intercept sampling, for example, a line is more likely to intercept larger units than smaller units. If we divide an area into plots of sampling units, the plots may not all have the same size. In these cases, the probability that a unit is selected into the sample will depend on the size of the unit. This is sometimes known as probability proportional to size estimation.

Let $p_i$ denote the probability that the $i$th unit will be selected.

Hansen-Hurwitz Estimator: Suppose sampling is done with replacement. Recall that when using simple random sampling, the population total is estimated by $t_y = N\bar{y}$. We can rewrite this as

$$ t_y = \frac{1}{n}\sum_{i=1}^{n} y_i/(1/N). $$
If we are sampling with replacement when each unit has the same chance of being selected, then the probability that a unit is selected at any given draw is $1/N$. For the Hansen-Hurwitz estimator, we simply replace the $1/N$ by $p_i$ for the $i$th unit:

$$ t_{HH} = \frac{1}{n}\sum_{i=1}^{n} y_i/p_i \qquad \text{(Hansen-Hurwitz estimation of total)} $$

Horvitz-Thompson Estimator: Sampling with replacement, as required by the Hansen-Hurwitz estimator, is not often done in practice. With the Horvitz-Thompson estimator, the sampling can be done either with or without replacement. We shall consider the case when the sampling is done without replacement. Let $\pi_i$ denote the probability that the $i$th sampling unit is selected in the sample. (Note that if all units have the same chance of being selected and we sample without replacement, then $\pi_i = n/N$. Can you explain why?) The estimator of the population total is given by

$$ t_{HT} = \sum_{i=1}^{n} y_i/\pi_i \qquad \text{(Horvitz-Thompson Estimator)}. $$

The population mean can be estimated using

$$ \hat{\mu}_{HT} = t_{HT}/N, $$

assuming the $n$ units selected are all distinct (this will not necessarily be the case when sampling with replacement). The variance formula for the Horvitz-Thompson estimator is quite complicated and involves probabilities of the form $\pi_{ij}$, which denotes the probability that units $i$ and $j$ are both selected. Recent research into simpler variance formulas that do not require knowing the $\pi_{ij}$ has been published; see for example Berger (2004). If sampling is done proportional to size and the sizes of the units vary, then the $\pi_{ij}$ will vary in value as well.

Detectability

In some sampling cases, the elements may be difficult to detect within the sampling units. This may be the case in certain wildlife populations (e.g. fish, birds, etc.). If one is obtaining a simple random sample from a population of $N$ units, then whether or not an animal in the unit is detected may not be certain; instead, a probability is associated with the chance that the animal is detected.
A non-animal example could occur when soil samples are assessed for a particular contaminant: some of the material may be missed due to sparsity of the contaminant.

Definition. The probability that an object in a selected unit is observed is termed its detectability.

For the sake of discussion, we shall refer to the objects as “animals.” The following is some notation:

$y$ = number of animals observed
$\tau$ = total number of animals
$p$ = probability that an animal is observed.

If we assume independence between observations and a constant detectability probability $p$ throughout a region, then

$$Y \sim \mathrm{Binomial}(\tau, p),$$

that is, $Y$, the number of animals observed, follows a binomial distribution with $\tau$ trials and success probability $p$. Therefore, the expected value of $Y$ is

$$E[Y] = \tau p,$$

which indicates that we can estimate the total number of animals by solving for $\tau$ and using the observed count $y$ as an estimate of the mean:

$$\hat{\tau} = y/p.$$

The variance of the binomial random variable $Y$ is $\tau p(1 - p)$, and thus

$$\mathrm{var}(\hat{\tau}) = \frac{\tau p (1 - p)}{p^2} = \frac{\tau (1 - p)}{p},$$

which can be estimated by substituting $\hat{\tau}$ for $\tau$ to get

$$\widehat{\mathrm{var}}(\hat{\tau}) = \frac{y (1 - p)}{p^2}.$$

Notice that if the probability $p$ of detection is small, then this variance becomes large. If the area of the region of interest is $A$, then we can define the animal density as

$$D = \tau / A,$$

[...] selected $n$ points at random from the entire two-dimensional region and then select transects based on perpendicular lines that go through these selected points from the baseline.

2.15 The Data Quality Objectives Process

The collection of data can be time consuming and expensive. Therefore, it is important to plan carefully before undertaking a survey or experiment. If too small a sample size is used, then there may not be enough information to make the resulting statistical analysis useful. For instance, confidence intervals may be too wide to be of any use, or a statistical test may yield statistically insignificant results even if there is a real effect.
On the other hand, one does not want to spend more money and resources than necessary by obtaining more data than is needed to make a decision.

The U.S. Environmental Protection Agency (EPA) developed the Data Quality Objectives (DQO) process to ensure that the data collection process will be successful. Details can be found on the web at http://www.epa.gov/quality/qs-docs/g4-final.pdf. The steps of the DQO process can be summarized as follows:

1. State the problem: describe the problem, review prior work, and understand important factors.
2. Identify the decision: what questions need to be answered?
3. Identify the inputs to the decision: determine what data is needed to answer the questions.
4. Define the boundaries of the study: the time periods and spatial areas to which the decisions will apply. Determine when and where data is to be gathered.
5. Develop a decision rule: define the parameter(s) of interest and specify action limits.
6. Specify tolerable limits on decision errors: this often involves issues of type I and type II error probabilities in hypothesis testing.
7. Optimize the design for obtaining data: consider a variety of designs and attempt to determine which design will be the most resource-efficient.

This process may very well end up being iterative. Not only will later steps depend on the earlier steps, but the later steps may make it necessary to rethink earlier steps as the process evolves. For instance, one may initially set unrealistic error bounds (type I and/or type II) and then come to realize that these constraints would make the project go way over budget.

References

Berger, Y. G. (2004), “A Simple Variance Estimator for Unequal Probability Sampling without Replacement,” Journal of Applied Statistics, 31, 305–315.
Cochran, W. G. (1977), Sampling Techniques, 3rd edition, New York: Wiley.
Mohn, E. and Volden, R.
(1985), “Acid precipitation: effects on small lake chemistry,” in Data Analysis in Real Life Environment: Ins and Outs of Solving Problems (Eds. J. F. Marcotorchino, J. M. Proth and J. Janssen), pp. 191–196, Amsterdam: Elsevier.
Roberts, L., Lafta, R., Garfield, R., Khudhairi, J. and Burnham, G. (2004), “Mortality before and after the 2003 invasion of Iraq: cluster sample survey,” The Lancet, 364, 1857–1864.
Scheaffer, R., Mendenhall, W. and Ott, R. (1996), Elementary Survey Sampling, 5th edition, New York: Duxbury Press.
Thompson, S. K. (1992), Sampling, New York: Wiley.