Rating: 8.3/10.
These are my notes from the second edition of the textbook (2009). It provides a solid foundation in sampling, survey design, and statistical methods for analyzing errors and variance. Some weak points: all code examples are in SAS rather than R, which is more popular for statistical computing, and many examples involve dated methods, such as randomly dialing phone numbers, rather than contemporary web use cases like A/B testing (not mentioned anywhere).
Chapter 1. There are numerous ways in which a survey can be biased: e.g., if you survey those who are easily reachable or who volunteer themselves, if you do not account for non-response, or if you allow substitutions, the results will not generalize to the population, even if the sample size is large. Best practices for survey questions: the exact wording of a question has a significant effect on the result, so keep the language as simple as possible and ask one thing at a time in the clearest way; do not ask loaded questions; instead, ask in a neutral manner and force the participant to choose in one direction or the other. Ensure that there are no terms in a question that are potentially ambiguous, such as the word “you,” which can refer to an individual or a household.
Chapter 2. The expected value of the estimator E[\hat{t}] is the average of the estimate over all possible samples, weighted by each sample's probability of being selected. Bias is the difference between the expected value of the estimator and the population statistic t. The simplest method of sampling is simple random sampling (SRS), either with or without replacement. When the sample size is close to the population size, a finite population correction must be added. In an analysis, we are interested in a variety of statistics about the sample mean \bar y, such as its standard error, variance, and coefficient of variation. Note that SRS requires every unit to have the same probability of being selected, independently of the others (this does not hold when there is dependency, e.g., under cluster or stratified sampling).
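A minimal sketch in Python of the basic SRS quantities, with made-up data and a hypothetical population size N = 1000:

```python
import numpy as np

# Hypothetical SRS of n = 50 incomes from a population of N = 1000 units.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=10, sigma=0.5, size=50)
n, N = len(y), 1000

ybar = y.mean()                  # sample mean, estimates the population mean
s2 = y.var(ddof=1)               # sample variance
fpc = 1 - n / N                  # finite population correction
se_ybar = np.sqrt(fpc * s2 / n)  # standard error of the sample mean under SRS
cv = se_ybar / ybar              # coefficient of variation of the estimator
print(ybar, se_ybar, cv)
```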
By the central limit theorem, the distribution of the sample mean \bar{y} is approximately normal, which allows us to derive the confidence interval. At least 30-50 samples are required for a decent approximation, and more if the data are skewed. Sample size estimation is a function of the desired precision: typically, the variance is guessed from a pilot study, previous results, or informed guesswork, and from it the sample size needed to achieve the desired precision in the real study is computed.
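A sketch of that sample size calculation (the variance guess and margin of error below are hypothetical):

```python
import numpy as np
from scipy import stats

def srs_sample_size(s2_guess, margin, N, conf=0.95):
    """Sample size for an SRS so the CI half-width is about `margin`."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    n0 = z**2 * s2_guess / margin**2        # ignoring the finite population correction
    return int(np.ceil(n0 / (1 + n0 / N)))  # then adjust for the FPC

# e.g., a pilot study suggests S^2 ≈ 400 and we want the mean within ±2
print(srs_sample_size(400, 2, N=10_000))
```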
The design-based approach to sampling (or randomization theory) involves a binary indicator random variable representing whether each unit in the population is sampled or not. From this, we can analyze the expected value and variance of the statistic; there are no assumptions about the distribution or properties of the target variable, but the analysis depends heavily on how the sampling is done (SRS or something else). The model-based approach involves predicting the unobserved values given the observed ones and relies on strong assumptions, such as independence and the distribution of the target variable. Under this approach the sampling method is unimportant, since the observations are assumed independent as part of the model. In the case of SRS, the design-based and model-based approaches give the same results because the model-based approach assumes all units are independent.
Chapter 3: Stratified sampling balances the number of samples between subgroups (e.g., male and female), sampling randomly within each, and this generally results in lower variance. In stratified random sampling, perform SRS within each stratum. The overall estimate of the mean \bar y is the weighted average of the stratum means \bar y_h, and the variance of the estimator is the weighted sum of the stratum variances. If the inclusion probabilities differ between strata, then each unit's weight toward the overall mean is the reciprocal of its inclusion probability, intuitively how many population units are represented by one sampled unit.
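A sketch of the stratified estimate and its standard error in Python, with two hypothetical strata sampled by SRS:

```python
import numpy as np

# Hypothetical strata: population sizes N_h and SRS data within each stratum.
N_h = np.array([4000, 6000])
samples = [np.array([12.0, 15.0, 11.0, 14.0]),
           np.array([22.0, 25.0, 20.0, 23.0, 24.0])]

N = N_h.sum()
n_h = np.array([len(y) for y in samples])
ybar_h = np.array([y.mean() for y in samples])
s2_h = np.array([y.var(ddof=1) for y in samples])

ybar_str = np.sum(N_h / N * ybar_h)  # weighted average of the stratum means
var_str = np.sum((N_h / N) ** 2 * (1 - n_h / N_h) * s2_h / n_h)
print(ybar_str, np.sqrt(var_str))
```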
How to choose the allocation between strata? In proportional allocation, the number of samples in each stratum is proportional to the size of the stratum, making the sample self-weighting, so the sample mean does not require any other weighting mechanism; this almost always results in a lower variance than SRS, as long as the differences in means between strata are not too small. The optimal allocation formula minimizes the variance by considering the size of each stratum, the cost of sampling it, and its variance: if a stratum is large, has high variance, or is inexpensive to sample, then we should sample more from it. Ideally, we should define strata where within-stratum variability is expected to be small and between-strata variability is large. Quota sampling is when, after defining the strata, sampling is done not randomly but in whatever way is convenient to meet a quota within each stratum; like convenience sampling, this may be heavily biased.
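A sketch of the optimal allocation rule n_h ∝ N_h S_h / sqrt(c_h), with hypothetical stratum sizes, standard deviation guesses, and costs:

```python
import numpy as np

def optimal_allocation(N_h, S_h, c_h, n_total):
    """Allocate n_total across strata proportionally to N_h * S_h / sqrt(c_h)."""
    raw = N_h * S_h / np.sqrt(c_h)
    return np.round(n_total * raw / raw.sum()).astype(int)  # rounding may shift the total slightly

N_h = np.array([4000, 6000, 2000])   # stratum sizes
S_h = np.array([10.0, 4.0, 25.0])    # guessed stratum standard deviations
c_h = np.array([1.0, 1.0, 4.0])      # relative cost per sampled unit
print(optimal_allocation(N_h, S_h, c_h, n_total=300))
```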
Chapter 4. Ratio estimation is useful when you have an auxiliary variable relevant to the variable of interest, with the total or average of the auxiliary variable \bar{x} known for the population. The target variable y is only available from the sample; since \bar{x} is known for the population, estimating the ratio B = \bar{y}/\bar{x} is equivalent to estimating the mean or total of y. Using auxiliary information increases precision compared to SRS; the ratio estimator \hat{\bar{y}}_r is slightly biased, but the bias is usually small and decreases as the sample size grows. Ratio estimation is more effective when the correlation between x and y is higher.
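A sketch of the ratio estimator with made-up numbers (x = last year's sales, known for the whole population; y = this year's sales, measured only in the sample):

```python
import numpy as np

x = np.array([10.0, 20.0, 15.0, 30.0, 25.0])  # auxiliary variable in the sample
y = np.array([12.0, 23.0, 16.0, 33.0, 29.0])  # target variable in the sample
xbar_U = 21.0                                 # population mean of x, assumed known

B_hat = y.mean() / x.mean()   # estimated ratio
ybar_r = B_hat * xbar_U       # ratio estimate of the population mean of y
print(B_hat, ybar_r)
```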
Another instance where ratio estimation arises is estimating means within sampled domains, e.g., male vs. female income, where the number of males and females in the sample is itself random. Another case is poststratification, where the weights of each stratum are adjusted to match the population. Regression estimation is similar to ratio estimation, but adds an intercept term to be estimated. When using ratio estimation on stratified samples, we can either use the combined statistics to estimate a single ratio or estimate a ratio for each stratum and then combine them; it is generally better to use the combined ratio unless the ratio is expected to differ greatly between strata.
Ratio and regression estimation are model-assisted, meaning they are motivated by models, but the confidence interval will still be correct even if the data does not fit the model assumptions. A different approach is model-based ratio or regression estimation, basically the textbook linear regression model where we assume the data fits the model and is independent, so the sampling method does not matter. The model-based and design-based methods will agree on the point value of the estimate, but the model-based method will have lower variance. However, this will underestimate the variance if the model assumptions are incorrect, so it is important to check the model assumptions, whereas in the design-based approach, the confidence interval is always correct.
Chapter 5. Cluster sampling is useful when it is impractical to take a random sample of individual units, for example when the population is geographically dispersed. Instead, randomly sample some clusters and then take either all or some of the units from each sampled cluster. It is somewhat similar to stratified sampling but has a larger variance than a random sample because the units within each cluster tend to be correlated. We use the notation “psu” for primary sampling unit (clusters) and “ssu” for secondary sampling unit (individuals). The notation in this chapter is relatively messy because of the two levels of sampling.
The simpler case is one-stage cluster sampling, where we take all of the units in each cluster that we decide to sample. Cluster sampling is generally less efficient than SRS when units within a cluster are more similar to each other than to units in other clusters, which is almost always true in natural scenarios. However, this is offset by being cheaper to sample than SRS. In two-stage cluster sampling, we take an SRS of clusters and then an SRS of units within each sampled cluster. When clusters are of unequal sizes, the simple estimators can perform poorly because the variance can be very large.
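A sketch of one-stage cluster estimation with hypothetical data (an SRS of 4 classrooms from N = 100; every student in a sampled classroom is measured):

```python
import numpy as np

clusters = [np.array([3, 5, 4]), np.array([6, 7]),
            np.array([2, 3, 3, 4]), np.array([5])]
N_psu, n_psu = 100, len(clusters)

t_i = np.array([c.sum() for c in clusters])   # total within each sampled cluster
t_hat = N_psu / n_psu * t_i.sum()             # estimate of the population total
var_t_hat = N_psu**2 * (1 - n_psu / N_psu) * t_i.var(ddof=1) / n_psu
print(t_hat, np.sqrt(var_t_hat))
```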
The optimal number of clusters, and the number of units to sample within each cluster, can be determined by calculus when the per-cluster and per-unit sampling costs and an estimate of the variance are known, e.g., from a pilot study. Systematic sampling is a special case of cluster sampling where you take samples at a regular interval: if the ordering of the list is random, this is equivalent to SRS; if the list is ordered by a relevant variable, it is more efficient than SRS; but if there is a periodic pattern, there is a risk of being very wrong if the sampling interval coincides with the period. For a model-based analysis, an appropriate model is a linear mixed-effects model with a random effect for each cluster (and unit-level error within clusters). However, check that the model's assumptions are met for the results to hold.
Chapter 6. Sampling with unequal probability is sometimes helpful – it is better to sample with higher probability the units that contribute more to the total than to sample uniformly at random. E.g., sales of stores (to be estimated) can be sampled proportionally to each store's area (a known quantity). We can use the cumulative size method to sample with a defined probability by generating a random number and mapping it to the unit whose cumulative-size interval contains it. Another way is Lahiri's method, which is better if there are many units; it uses rejection sampling: generate a random unit and accept it with probability proportional to the desired sampling probability. The mathematically simplest setup is sampling with replacement, so a unit can be sampled more than once, and a count variable denotes how many times each unit is sampled. A more natural way is sampling without replacement, but this is more complicated since the sampling probability for each unit depends on what has been sampled already.
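A sketch of both selection methods for a single probability-proportional-to-size draw, using hypothetical store areas:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = np.array([120, 45, 300, 80, 200])   # hypothetical store areas (known)

def cumulative_size_draw(sizes):
    """Pick one unit with probability proportional to size via cumulative totals."""
    cum = np.cumsum(sizes)
    u = rng.integers(1, cum[-1] + 1)      # random integer in 1..total size
    return int(np.searchsorted(cum, u))   # unit whose cumulative interval covers u

def lahiri_draw(sizes):
    """Rejection sampling: accept unit i with probability sizes[i] / max(sizes)."""
    m = sizes.max()
    while True:
        i = rng.integers(len(sizes))
        if rng.integers(1, m + 1) <= sizes[i]:
            return int(i)

print(cumulative_size_draw(sizes), lahiri_draw(sizes))
```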
The Horvitz-Thompson (H-T) estimator avoids the complications of what was sampled before by weighting each sampled unit by the reciprocal of its inclusion probability \pi_i. This gives an unbiased point estimate of the total, but its variance can be written in several algebraically equivalent forms. All of them require the joint inclusion probability for every pair of sampled observations to estimate the variance, and the variance estimate can even come out negative in some situations. The joint inclusion probabilities are often not known, or are expensive to work with since there are so many of them, so it is sometimes simpler to assume the units were sampled with replacement: this makes variance estimation easier, but it will overestimate the variance if the sampling fraction is large.
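The point estimate itself is simple; a sketch with hypothetical y values and inclusion probabilities:

```python
import numpy as np

y = np.array([40.0, 10.0, 22.0])     # sampled values
pi = np.array([0.20, 0.05, 0.10])    # their inclusion probabilities
t_ht = np.sum(y / pi)                # Horvitz-Thompson estimate of the total
print(t_ht)
```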
Some algorithms for selecting samples with unequal probability without replacement are not covered in this book but are implemented in statistical packages. The HT estimator is general enough to cover previously discussed sampling designs such as stratified and cluster sampling. 3-P sampling (probability proportional to prediction) is sometimes used; e.g., in forestry, predict the volume of each tree, sample with probability proportional to the prediction, and measure the actual volume of any tree that is selected. This is a special case of Poisson sampling, in which the sample size is not known until sampling is finished.
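A sketch of Poisson sampling with hypothetical inclusion probabilities (each unit is included independently, so the realized sample size is random):

```python
import numpy as np

rng = np.random.default_rng(2)
pi = np.array([0.9, 0.1, 0.5, 0.3, 0.7])   # per-unit inclusion probabilities
sampled = rng.random(len(pi)) < pi         # independent Bernoulli draws
print(sampled, sampled.sum())              # realized sample and its (random) size
```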
Chapter 7. Complex survey designs may combine multiple strategies such as cluster sampling, stratification, and ratio estimation. This can be modelled as producing sampling weights that differ for each unit – stratified and cluster sampling both give the correct point estimate if the correct weights are used, but if only the weights are known and the design is not, then results like confidence intervals cannot be inferred correctly. If the design is ignored and the data are analyzed as if they came from an SRS, the results will be invalid.
If the sampling weights are known, they can be used for plotting the CDF, histogram, box plots, density functions, etc. A weighted scatterplot can be drawn by making dot size or shading proportional to the weight, or by subsampling points with probability proportional to their weights. Fitting any trend must also account for the weights. The design effect (deff) measures how much the variance changes under the design compared to SRS (the deff of SRS is 1).
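A sketch of a weighted scatterplot in Python, with made-up data and weights, where the marker area is proportional to the sampling weight:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x, y = rng.normal(size=200), rng.normal(size=200)
w = rng.uniform(1, 50, size=200)                   # hypothetical sampling weights

plt.scatter(x, y, s=20 * w / w.mean(), alpha=0.4)  # dot area proportional to weight
plt.xlabel("x"); plt.ylabel("y")
plt.show()
```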
Chapter 8. Non-response. In most surveys, some proportion of the participants will not respond, and estimating the population mean from only those who respond produces a bias that is by definition unknown. The best approach is to find ways to reduce the non-response rate. Many factors cause people not to respond, such as the medium of the survey, the timing, questionnaire design, and ordering of questions. Even if participants respond only to some questions, the partial answers are still useful for imputing the missing variables from the data that is available for that participant.
If the variables are missing at random, i.e., the non-response is unrelated to the participant's characteristics, then you can simply weight the responses according to the response ratio. Poststratification uses the known population counts of different strata to adjust for non-response; if any group is too small, it is recommended to combine it with a neighboring group to reduce variance. However, it is dangerous to assume the data are missing at random, since the non-response rate is usually correlated with some attributes and is not entirely random.
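A sketch of a weighting-class adjustment with made-up weights: within each class, respondents' weights are scaled up by (total class weight) / (respondent class weight):

```python
import numpy as np

base_w = np.array([10.0, 10.0, 10.0, 20.0, 20.0, 20.0])        # design weights
cls = np.array([0, 0, 0, 1, 1, 1])                              # weighting class
responded = np.array([True, True, False, True, False, False])

adj_w = base_w.copy()
for c in np.unique(cls):
    in_c = cls == c
    factor = base_w[in_c].sum() / base_w[in_c & responded].sum()
    adj_w[in_c & responded] *= factor   # respondents absorb the non-respondents' weight
adj_w[~responded] = 0.0
print(adj_w)
```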
Some imputation methods include grouping units into classes and substituting the mean of the class, or copying the value of a randomly chosen member of the class. If releasing a public dataset with imputation, it is important to distinguish between original and imputed values; often the collector of the dataset can do a better job of imputation than anybody else because they have access to variables that will be scrubbed for privacy reasons in the public version of the dataset. A model-based analysis of non-response is possible – one way is to assume a geometric distribution for the number of attempts needed to sample a unit, which makes sense in certain natural-science settings, such as counting animals in a region. To validate this assumption, a chi-squared test may be used.
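A sketch of class-mean and random-donor imputation on a toy vector with missing values:

```python
import numpy as np

rng = np.random.default_rng(4)
cls = np.array([0, 0, 0, 1, 1, 1])
y = np.array([5.0, 7.0, np.nan, 20.0, np.nan, 22.0])

y_mean = y.copy()     # class-mean imputation
y_donor = y.copy()    # copy the value of a random respondent in the same class
for c in np.unique(cls):
    in_c = cls == c
    donors = y[in_c & ~np.isnan(y)]
    missing = in_c & np.isnan(y)
    y_mean[missing] = donors.mean()
    y_donor[missing] = rng.choice(donors, size=missing.sum())
print(y_mean, y_donor)
```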
Chapter 9. The variance of complex formulas involving random variables is difficult to analyze in general, as a simple formula for the variance only exists for linear combinations of random variables. If it’s not a linear combination, for example, a ratio, you can use a Taylor approximation to linearize, ie, obtain a linear approximation around a point. This can be done for the ratio estimator to obtain the variance formulas in Ch4.
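A sketch of that linearization for the ratio \hat{B} = \bar{y}/\bar{x}, expanding around the population means (\bar{x}_U, \bar{y}_U): \hat{B} = \frac{\bar{y}}{\bar{x}} \approx B + \frac{1}{\bar{x}_U}\left(\bar{y} - B\,\bar{x}\right), so under SRS V(\hat{B}) \approx \frac{1}{\bar{x}_U^{2}} \cdot \frac{1 - n/N}{n}\, S_e^{2}, where e_i = y_i - B x_i.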
The problem is estimating the variance under complex survey designs: the random group method splits the sample into random miniature replicates, each with the same design as the original. The overall variance can then be estimated from the variability among the random-group estimates. However, you need a large dataset for this to work effectively and must split it into enough random groups (>= 10) to be effective. Balanced repeated replication (BRR) repeatedly splits the dataset into two halves in different ways; if there is stratification, each split maintains the stratification, and averaging the variance estimates across the splits approximates the overall variance.
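A sketch of the random group variance estimate, given the statistic recomputed on each of R hypothetical random groups:

```python
import numpy as np

def random_group_var(theta_r):
    """Variance estimate from R replicate estimates, one per random group."""
    theta_r = np.asarray(theta_r, dtype=float)
    R = len(theta_r)
    return np.sum((theta_r - theta_r.mean()) ** 2) / (R * (R - 1))

# hypothetical estimates from 10 random groups that mimic the full design
print(random_group_var([4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0, 4.4, 3.7]))
```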
The Jackknife method involves running the procedure while deleting one data point at a time, and the variance in the results can be used to estimate the overall variance. This requires a lot of computation, n times more than the original, as it is done once for every point. Once the variance is estimated, the confidence interval can be derived from the mean and variance estimates. Usually, it is reasonable to assume that the confidence interval is centered around the mean. We can also obtain confidence intervals for the median and percentiles.
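A sketch of the delete-one jackknife for an SRS (ignoring the fpc); for the sample mean it reproduces the usual s/sqrt(n) standard error:

```python
import numpy as np

def jackknife_se(y, statistic):
    """Delete-one jackknife standard error of `statistic` for an SRS."""
    n = len(y)
    theta_i = np.array([statistic(np.delete(y, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((theta_i - theta_i.mean()) ** 2))

rng = np.random.default_rng(5)
y = rng.exponential(size=40)
print(jackknife_se(y, np.mean), y.std(ddof=1) / np.sqrt(len(y)))  # nearly identical
```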
Chapter 10. The chi-square test is useful in various scenarios: whenever we have expected counts versus observed counts, it can test whether the deviation is significant. This can be applied to testing independence between two variables, homogeneity between categories, or the fit of a model. Whenever a clustered survey design is used, you must be careful not to analyze it as a simple random sample, which will give incorrectly low p-values. The Rao-Scott test adjusts the statistic so that its mean (and possibly its variance) matches the reference chi-squared distribution, accounting for the design effect. Independence implies a multiplicative relationship between cell probabilities; you can fit a log-linear model and check how well the no-interaction model fits the data to infer whether the variables are independent.
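A sketch of a first-order correction of this kind on a hypothetical 2x2 table of survey counts; the design effect value here is made up, not estimated from data:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

table = np.array([[120, 80], [60, 140]])                 # hypothetical weighted counts
stat, p_srs, dof, _ = chi2_contingency(table, correction=False)

deff = 1.8                                # hypothetical mean design effect
p_corrected = chi2.sf(stat / deff, dof)   # rescale the statistic before comparing to chi-squared
print(p_srs, p_corrected)                 # the corrected p-value is larger
```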
Chapter 11. The standard textbook linear regression assumes a linear relationship where the errors are independent and normally distributed. For unequal-probability sampling, you can use weighted linear regression, but it's a good idea to plot the data and check the independence assumptions, which are quite strong. Linearization techniques may be applied to get variances and confidence intervals for the regression coefficients B (slope and intercept). Model-based approaches work for survey data as well, but only if the model assumptions are reasonably correct. One useful heuristic to check model fit is to fit the model with the sampling weights and again without them; if the model is correct, the two fits should be similar. Linear mixed models are good for fitting cluster-sampled data, and logistic regression can be used similarly to linear regression in complex survey designs and can also be weighted. The generalized regression estimator (GREG) is useful when auxiliary information x with a known population total is available; the estimate of the population total of y is then adjusted using a regression model linking x and y.
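A sketch of that weighted-vs-unweighted heuristic using statsmodels, with simulated data and hypothetical weights (design-based standard errors would still need the methods from Ch9):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(size=200)
w = rng.uniform(1, 30, size=200)          # hypothetical sampling weights

X = sm.add_constant(x)
fit_w = sm.WLS(y, X, weights=w).fit()     # weighted fit
fit_u = sm.OLS(y, X).fit()                # unweighted fit
print(fit_w.params, fit_u.params)         # if the model is right, these should be close
```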
Chapter 12: Two-Phase Sampling. In phase one, take a relatively large sample and measure attributes x that are inexpensive to measure; in phase two, take a subsample of the phase-one sample and measure the target attribute y, which is expensive to measure. Then use ratio estimation to estimate y for the population. This can be combined with stratification based on information collected in phase one, e.g., using SRS for phase one and then stratified sampling for phase two. Other techniques such as the jackknife can also be applied in two-phase sampling. This technique is most useful when there is a high correlation between the target and auxiliary variables; otherwise, it is more efficient to spend your resources on a single phase of sampling.
Chapter 13. Population size, e.g., the number of fish in a lake, can be estimated by sampling twice and observing how many fish tagged in the first sample are captured again in the second. This method assumes independence between the two samples and that each unit has an equal probability of being captured. A confidence interval for the population size may be derived by finding the smallest and largest population sizes that are not rejected by a chi-squared test of independence. This technique is useful for counting humans as well: if you have multiple lists of people, each of them incomplete, you can estimate the total number of people.
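A sketch of the resulting point estimate (the Lincoln-Petersen estimator) with made-up counts:

```python
# Tag n1 fish, later capture n2 fish, and observe m tagged fish among them.
n1, n2, m = 200, 150, 30
N_hat = n1 * n2 / m    # estimated population size
print(N_hat)           # 1000.0
```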
Chapter 14. When sampling a characteristic that is rare in the population, you can sample with unequal probability, targeting those more likely to have the characteristic, and also sample from multiple frames and combine them later: e.g., if sampling Alzheimer's patients, focus on nursing homes rather than the general population. In snowball and adaptive cluster sampling, when a rare unit is found, its neighbors are sampled next since they are likely to contain more rare units, and the sampling probabilities are adjusted for after the data are collected. Small area estimation is used when you want to estimate something for many subpopulations, many of which are too small for direct estimation to be reliable due to high variance. In these cases, ratio estimation may be useful, or the Fay-Herriot model, which combines the unreliable direct small-area estimate with a more reliable regression-based estimate.
Chapter 15. Total survey quality accounts for several different types of error. Coverage error occurs when units in the target population are not part of the sampling frame, e.g., if you randomly dial phone numbers, you cannot reach people who don't have a phone number. Measurement error occurs when a participant's response differs from the truth, especially likely if the question is about a sensitive topic or illegal activities. One way to mitigate this is to randomize which question is answered: using a randomization device, the participant answers either the sensitive question or an innocuous question with a known average value; the researcher doesn't know which question was answered, only the aggregate values. The total survey error is the sum of multiple such errors, including coverage error, nonresponse error, measurement error, processing error, and sampling error. Only the sampling error can easily be corrected and analyzed mathematically. The other errors are better avoided by designing the survey more effectively, as it is hard to quantify their extent after data collection and difficult to apply statistical corrections reliably.
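A sketch of how the sensitive proportion is backed out under such a randomized response design, with made-up numbers; p is the probability of getting the sensitive question and pi0 the known "yes" rate of the innocuous question:

```python
p, pi0 = 0.7, 0.5      # hypothetical design parameters
lam_hat = 0.44         # observed overall proportion of "yes" answers
pi_sensitive = (lam_hat - (1 - p) * pi0) / p   # solve lam = p*pi_s + (1-p)*pi0 for pi_s
print(pi_sensitive)    # estimated prevalence of the sensitive attribute
```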