Bayesian Analysis of Respondent-Driven Sampling

Bachelor dissertation entitled Prevalence Estimation and binary regression methods for respondent-driven sampling with outcome uncertainty.

The work was carried out at the School of Applied Mathematics of Fundação Getulio Vargas (FGV/EMAp), under the supervision of Luiz Max Carvalho, as part of the requirements for obtaining a Bachelor’s degree in Applied Mathematics. The object of study was regression-based prevalence estimation under a snowball-like sampling scheme called respondent-driven sampling (RDS), with Bayesian inference as the main tool.

Download the text here.

Abstract

Members of hard-to-reach populations are difficult for researchers to access or refuse to enrol in public health surveys, which makes enumeration and sampling challenging. Respondent-driven sampling (RDS) is a chain-referral technique used to recruit individuals from such populations. The survey encourages participants to recruit their peers, offering incentives for each recruitment and for participation. Since there is no enumeration of the subjects, RDS is a non-probabilistic sampling strategy. Moreover, the graphical structure of RDS suffers from missing data, and several assumptions about the recruitment process are necessary.

Once individuals have been sampled, understanding their characteristics is a focus in epidemiology, given that these are usually high-risk populations for some diseases. Therefore, estimating the disease prevalence (the proportion of infected individuals) and its dependence on other observed variables is a critical step for public decision making. Diagnostic tests for disease identification are subject to misclassification, and incorporating their accuracy corrects biases in the prevalence estimation problem. This work proposes the use of regression techniques for prevalence estimation in respondent-driven samples. We use conditionally autoregressive models to represent the correlation among individuals induced by recruitment.
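To illustrate why test accuracy matters, the sketch below uses the standard relation between true prevalence and the probability of a positive test, apparent = prevalence × sensitivity + (1 − prevalence) × (1 − specificity), and its inversion into a corrected point estimate. This is only a minimal illustration and not the dissertation's model, which is a Bayesian regression; the function names and the example values of sensitivity and specificity are assumptions chosen for the example.

```python
def apparent_prevalence(theta, sensitivity, specificity):
    """Probability of a positive test result when the true prevalence is theta.

    True positives contribute theta * sensitivity; false positives
    contribute (1 - theta) * (1 - specificity).
    """
    return theta * sensitivity + (1.0 - theta) * (1.0 - specificity)


def corrected_prevalence(apparent, sensitivity, specificity):
    """Invert the relation above to recover a point estimate of true prevalence.

    Assumes the test is informative (sensitivity + specificity > 1).
    The result is clipped to [0, 1], since sampling noise in the apparent
    prevalence can push the raw estimate outside that range.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("requires sensitivity + specificity > 1")
    estimate = (apparent + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Hypothetical values: 10% true prevalence, imperfect test.
    p = apparent_prevalence(0.10, sensitivity=0.90, specificity=0.95)
    print(f"apparent prevalence: {p:.3f}")
    print(f"corrected prevalence: {corrected_prevalence(p, 0.90, 0.95):.3f}")
```

With these hypothetical values, the uncorrected proportion of positive tests overstates the true prevalence, because false positives among the large uninfected group outnumber the missed infections. A Bayesian treatment would instead place priors on sensitivity and specificity and propagate their uncertainty, rather than plugging in fixed values.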

In modern statistics, understanding situations with unknown information and quantifying them plays a significant role. We use Bayesian inference to quantify uncertainty in our models. In the Bayesian paradigm, probability distributions for quantities of interest represent the belief about them. We discuss different prior specification approaches for the parameters and examine uncertainty about the graph structure using a graphical model of RDS. To sample from the posterior distribution of the parameters, we used the Hamiltonian Monte Carlo sampler. Diagnostics of this method helped us improve the model implementation. Verification of the model through simulation and external datasets showed robust results, and we propose model extensions to address the limitations of this work.

Keywords: respondent-driven sampling, regression analysis, Bayesian inference, prevalence estimation, misclassification, sensitivity, specificity

GitHub repository: All the work is available on GitHub.