What Is an RDD? Spark, Stats, and Survey Research

RDD is an acronym with three common meanings depending on the field: Resilient Distributed Dataset in big data computing, Regression Discontinuity Design in statistics and research, and Random Digit Dialing in survey methodology. Each refers to something completely different, so the meaning depends on where you encountered the term.

Resilient Distributed Dataset (Computing)

In Apache Spark, the open-source framework for processing massive datasets, an RDD is an immutable distributed collection of data elements, partitioned across nodes in a cluster and operated on in parallel. It’s the foundational data structure that made Spark possible, and understanding it helps explain how big data processing works at scale.

The name captures three core properties. “Resilient” means the dataset can recover from failures. “Distributed” means it’s split across multiple machines. “Dataset” means it’s a collection of records you can process. RDDs don’t need to exist in memory at all times. Instead, each RDD stores a record of how it was derived from other datasets, a concept called its lineage. If a partition of an RDD is lost due to a machine failure, the system uses that lineage information to recompute only the lost partition, in parallel on other machines, without rolling back the entire program.

This lineage-based recovery is what set RDDs apart from earlier distributed computing approaches. Many earlier systems achieved fault tolerance by replicating or checkpointing data, which meant repeatedly writing copies to disk or across the network as a safety net. RDDs avoid most of that overhead because the instructions for recreating any piece of data are always available, although Spark still offers optional checkpointing for cases where a lineage chain grows very long.
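The recovery idea can be sketched in a few lines of plain Python. This is a toy model, not Spark’s API: the class ToyDataset, its lose_partition method, and the compute_log list are all hypothetical names invented for illustration. Each dataset keeps a recipe for building any one of its partitions, so a lost partition can be rebuilt on its own.

```python
from typing import Callable, Dict, List


class ToyDataset:
    """Toy stand-in for an RDD (not Spark's API): keeps a recipe per partition."""

    def __init__(self, num_partitions: int,
                 compute: Callable[[int], List[int]]) -> None:
        self.num_partitions = num_partitions
        self._compute = compute                    # lineage: how to build partition i
        self._in_memory: Dict[int, List[int]] = {}
        self.compute_log: List[int] = []           # which partitions were (re)built

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        parent = self
        # Child partition i depends only on parent partition i.
        return ToyDataset(self.num_partitions,
                          lambda i: [fn(x) for x in parent.partition(i)])

    def partition(self, i: int) -> List[int]:
        # Rebuild from the recipe whenever the partition is not held in memory.
        if i not in self._in_memory:
            self.compute_log.append(i)
            self._in_memory[i] = self._compute(i)
        return self._in_memory[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a machine failure that drops one partition.
        self._in_memory.pop(i, None)


source_blocks = [[1, 2], [3, 4], [5, 6]]           # made-up input data
base = ToyDataset(3, lambda i: list(source_blocks[i]))
squared = base.map(lambda x: x * x)

before = [squared.partition(i) for i in range(3)]  # builds partitions 0, 1, 2
squared.lose_partition(1)                          # "failure" loses partition 1
recovered = squared.partition(1)                   # only partition 1 is rebuilt
print(recovered, squared.compute_log)
```

In this sketch the log shows partition 1 being built a second time while partitions 0 and 2 are left alone, which is the behavior the lineage description implies. A real cluster would also spread the rebuilding work across machines, which this single-process toy does not attempt.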

Transformations and Actions

You interact with RDDs through two types of operations. Transformations produce a new RDD from an existing one. For example, a filter transformation applies a test to each record and keeps only the ones that pass, while a map transformation applies a function to every record to produce a modified version. Transformations are “lazy,” meaning they don’t actually execute until you need the result.

Actions are what trigger the computation. A reduce action, for instance, combines all records down to a single value by repeatedly applying a function. When you call an action, Spark traces back through the chain of transformations, builds an execution plan that groups them into stages, and runs the computation across the cluster.
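The split between lazy transformations and eager actions can be illustrated with a small pure-Python sketch. The class LazyPipeline below is hypothetical and runs on one machine. Its method names mirror the map, filter, reduce, and collect operations on Spark RDDs, but nothing here calls Spark itself.

```python
from functools import reduce as fold


class LazyPipeline:
    """Toy model of lazy transformations plus eager actions (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)      # recorded transformations, not yet run

    # Transformations: return a new pipeline and do no work.
    def map(self, fn):
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        # Apply every recorded step to each record, in order.
        for record in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    # Actions: these actually execute the recorded steps.
    def reduce(self, fn):
        return fold(fn, self._run())

    def collect(self):
        return list(self._run())


calls = []

def noisy_double(x):
    calls.append(x)                     # record when the function really runs
    return x * 2

pipeline = LazyPipeline(range(1, 6)).filter(lambda x: x % 2 == 1).map(noisy_double)
calls_before_action = len(calls)        # still 0: nothing has executed yet
total = pipeline.reduce(lambda a, b: a + b)
print(calls_before_action, total, calls)
```

Building the pipeline only records the steps; noisy_double is not called until reduce runs. Spark goes further by planning the recorded steps into stages and distributing them, which this sketch leaves out.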

RDDs vs. DataFrames in Modern Spark

While RDDs remain available, modern Spark development has largely shifted toward DataFrames and Datasets. These higher-level interfaces include schema information (meaning the system knows the structure of your data) and benefit from Spark’s built-in query optimizer, which can automatically find more efficient execution plans. DataFrames also store data in a compact binary format rather than as general-purpose objects, which reduces memory use. RDDs are still useful when you need fine-grained control over exactly how your data is partitioned and processed, but for most workloads, DataFrames deliver better performance with less code.

Regression Discontinuity Design (Statistics)

In research and statistics, RDD stands for Regression Discontinuity Design, a method for estimating cause-and-effect relationships from observational data. It applies in situations where a treatment or intervention is assigned based on whether someone falls above or below a specific cutoff on a continuous measure.

A common illustration uses blood pressure. Many clinical guidelines have recommended starting blood pressure medication when systolic pressure reaches 140 mmHg. A patient at 139 mmHg and a patient at 140 mmHg are, for practical purposes, nearly identical. Whether someone lands just above or just below that threshold on any given measurement is largely a matter of chance, similar to random assignment in a clinical trial. Researchers can then compare outcomes (like rates of cardiovascular disease) between patients just above and just below the threshold to estimate the causal effect of treatment.

The variable used to assign treatment, such as blood pressure in this example, is called the forcing variable or running variable. It must be continuous (or nearly so), and the cutoff must be predetermined rather than chosen after looking at the data. Crucially, people near the threshold must not be able to manipulate their score to land on a preferred side. If patients could deliberately lower their blood pressure reading to avoid medication, the comparison would break down.
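One way to turn this comparison into a number is to fit a straight line to the outcomes on each side of the cutoff and measure the gap between the two lines at the cutoff itself. The sketch below does that on simulated data. The function names, the bandwidth of 10 mmHg, the risk score, and the built-in treatment effect of -3 are all assumptions chosen for illustration, not values from any study, and real analyses typically use more careful bandwidth selection and standard errors.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rd_estimate(points, cutoff, bandwidth):
    """Gap between the two fitted lines at the cutoff (treated minus untreated)."""
    # Center the running variable so each intercept is the fitted value at the cutoff.
    below = [(x - cutoff, y) for x, y in points
             if cutoff - bandwidth <= x < cutoff]
    above = [(x - cutoff, y) for x, y in points
             if cutoff <= x <= cutoff + bandwidth]
    intercept_below, _ = fit_line(below)
    intercept_above, _ = fit_line(above)
    return intercept_above - intercept_below


# Simulated patients: treatment starts at 140 mmHg and shifts the outcome by -3.
rng = random.Random(0)
TRUE_EFFECT = -3.0
patients = []
for _ in range(2000):
    pressure = rng.uniform(120, 160)
    treated = pressure >= 140
    risk = (10 + 0.2 * (pressure - 140)
            + (TRUE_EFFECT if treated else 0.0)
            + rng.gauss(0, 1))
    patients.append((pressure, risk))

estimate = rd_estimate(patients, cutoff=140, bandwidth=10)
print(round(estimate, 2))
```

Because the simulation builds in a known effect, the printed estimate can be compared against TRUE_EFFECT to see how well the method recovers it. Only patients within the bandwidth contribute, which reflects the idea that the comparison is most credible close to the threshold.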

Why Researchers Use It

RDD has an important advantage over other observational methods. Techniques like propensity score matching require assuming there are no unmeasured factors influencing the results, an assumption that’s essentially impossible to guarantee in real-world research. RDD’s assumptions are more concrete and, critically, several of their implications can be checked directly in the data. Researchers can check whether people just above and below the cutoff look similar on other characteristics, providing visible evidence that the comparison is valid.
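A simple version of that similarity check compares the average of some other characteristic, such as age, among people just below and just above the cutoff. The sketch below assumes each record is a dictionary with a "score" key for the running variable. The function name covariate_gap and the sample records are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def covariate_gap(records, cutoff, bandwidth, covariate):
    """Difference in a covariate's mean just above vs. just below the cutoff."""
    below = [r[covariate] for r in records
             if cutoff - bandwidth <= r["score"] < cutoff]
    above = [r[covariate] for r in records
             if cutoff <= r["score"] <= cutoff + bandwidth]
    return mean(above) - mean(below)


# Made-up records: "score" is the running variable, "age" is the covariate.
records = [
    {"score": 138, "age": 50},
    {"score": 139, "age": 52},
    {"score": 140, "age": 51},
    {"score": 141, "age": 53},
    {"score": 160, "age": 90},   # outside the window, so ignored
]
gap = covariate_gap(records, cutoff=140, bandwidth=2, covariate="age")
print(gap)
```

A gap close to zero is consistent with the two groups being comparable, while a large gap would be a warning sign. In practice researchers would also attach a measure of uncertainty to the gap rather than reading the raw difference alone.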

Health researchers have used this design to study topics ranging from the effect of state handgun purchase age minimums on adolescent suicide rates (using age as the forcing variable) to the impact of national obesity intervention programs on cardiovascular outcomes (using BMI thresholds). Any policy or clinical guideline that creates a sharp eligibility boundary is a potential candidate for this approach.

Random Digit Dialing (Survey Research)

In polling and public health surveillance, RDD refers to Random Digit Dialing, a method for selecting participants in telephone surveys. Rather than working from a phone directory, which would miss unlisted numbers, researchers generate phone numbers randomly within known working area codes and exchanges. Every telephone number in the target area has a known chance of being selected, which makes the resulting sample more representative of the general population.
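The number-generation step can be sketched as follows, assuming ten-digit North American style numbers made of a six-digit prefix (area code plus exchange) and four random digits. The function name random_digit_dial and the example prefixes are hypothetical, and a real survey operation would add steps such as screening out nonworking or business numbers.

```python
import random


def random_digit_dial(prefixes, sample_size, seed=None):
    """Draw distinct numbers: a known 6-digit prefix plus 4 random digits."""
    rng = random.Random(seed)
    frame_size = len(prefixes) * 10_000   # every prefix + suffix combination
    if sample_size > frame_size:
        raise ValueError("sample_size exceeds the number of possible numbers")
    # Sampling indices without replacement gives every number in the frame
    # the same known selection probability: sample_size / frame_size.
    picks = rng.sample(range(frame_size), sample_size)
    return [f"{prefixes[i // 10_000]}{i % 10_000:04d}" for i in picks]


# Hypothetical prefixes (area code + exchange), for illustration only.
working_prefixes = ["202555", "303555"]
numbers = random_digit_dial(working_prefixes, sample_size=5, seed=1)
print(numbers)
```

Because the suffix is generated rather than looked up, unlisted numbers are just as likely to be drawn as listed ones, which is the point of the method.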

This technique became standard for large-scale health surveys and political polls. A random digit dialing survey of chronic fatigue syndrome prevalence, for example, estimated that nearly 2.2 million American adults were suffering from CFS-like illness. The method’s strength is its ability to reach a wide range of households without relying on any pre-existing list.

The Cell Phone Problem

Traditional RDD was designed for landlines, and the mass migration to cell phones has created significant challenges. People who use only cell phones and have no landline are systematically excluded from landline-based RDD surveys. This group tends to be disproportionately young, male, single, and living in rental housing. As the cell-phone-only population has grown, the gap between who these surveys reach and who actually makes up the population has widened.

This noncoverage bias isn’t just a statistical nuisance. If survey estimates about health behaviors or insurance coverage systematically miss younger and more mobile populations, policymakers may get a distorted picture of actual public health needs. Modern survey operations now typically include cell phone samples alongside landline samples to reduce this bias, though doing so increases costs and introduces new logistical challenges like reaching people across time zones and area codes that no longer reflect where someone lives.
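The size of the noncoverage bias described above follows from simple arithmetic: the error in an estimate based only on covered households equals the share of the population that is not covered multiplied by the difference between the covered and uncovered groups. The sketch below works through that identity. The function names and the figures (30 percent cell-only, insurance rates of 90 and 80 percent) are invented for illustration and do not come from any survey.

```python
def population_mean(share_not_covered, mean_covered, mean_not_covered):
    """True mean across both the covered and the uncovered groups."""
    return ((1 - share_not_covered) * mean_covered
            + share_not_covered * mean_not_covered)


def noncoverage_bias(share_not_covered, mean_covered, mean_not_covered):
    """Covered-only estimate minus the true population mean."""
    return share_not_covered * (mean_covered - mean_not_covered)


# Hypothetical figures: 30% of adults are cell-only; 90% of landline
# households are insured versus 80% of cell-only households.
share, covered, uncovered = 0.30, 0.90, 0.80
bias = noncoverage_bias(share, covered, uncovered)
truth = population_mean(share, covered, uncovered)
print(round(bias, 3), round(truth, 3))
```

The identity shows why the problem grew over time: the bias scales with both the size of the excluded group and how different that group is, so a growing cell-only population that differs from landline households pushes estimates further off.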