What Is a Multiple Baseline Design and How Does It Work?

A multiple baseline design is a type of single-case experiment that tests whether an intervention works by introducing it at different times to different people, behaviors, or settings. Instead of removing a treatment to see if a behavior reverts (as in a reversal design), this approach staggers the start of the intervention across at least three separate “baselines” and watches for a consistent pattern: each baseline should change only when the intervention reaches it, not before.

How the Design Works

Imagine you want to test whether a new teaching strategy improves reading fluency. You set up three students (or three different skills, or three different classrooms) and begin collecting baseline data on all of them at the same time. While all three are still in baseline, you introduce the intervention to the first student only. The second and third students keep doing what they were doing. You wait until the first student’s data are stable, then introduce the intervention to the second student. Finally, once the second student’s data have stabilized as well, you introduce it to the third.

The key question at each step is: did the behavior change right when the intervention started, and only then? If reading fluency jumps for Student 1 while Students 2 and 3 stay flat, that’s a good sign. If the same pattern repeats when Student 2 gets the intervention, and again for Student 3, you have strong evidence that the intervention, not something else happening at the same time, caused the change. This staggered introduction is what separates a multiple baseline design from simply running the same before-and-after study three times.
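The staggered schedule described above can be sketched in a few lines of Python. This is only an illustration: the function name `staggered_phases`, the session count, and the start sessions (4, 7, and 10) are hypothetical choices, not values from any particular study.

```python
def staggered_phases(n_sessions, intervention_starts):
    """Label every session 'baseline' or 'intervention' for each tier.

    intervention_starts maps a tier name to the 0-based session index
    at which the intervention begins for that tier (hypothetical values).
    """
    starts = list(intervention_starts.values())
    if len(set(starts)) != len(starts):
        # Two tiers starting together would defeat the staggering.
        raise ValueError("intervention start points must all differ")
    return {
        tier: ["baseline" if s < start else "intervention"
               for s in range(n_sessions)]
        for tier, start in intervention_starts.items()
    }


schedule = staggered_phases(
    12, {"Student 1": 4, "Student 2": 7, "Student 3": 10}
)
for tier, phases in schedule.items():
    # A = baseline session, B = intervention session
    print(tier, "".join("A" if p == "baseline" else "B" for p in phases))
```

Printing the rows one under another gives the familiar staircase shape: each tier stays in baseline a little longer than the one above it.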

The Logic Behind It: Prediction, Verification, Replication

Three concepts drive the design’s logic. First, each baseline phase establishes the existing level and pattern of behavior, which lets you predict what would happen if you did nothing. Second, when an untreated baseline stays flat while a treated one changes, that verifies the prediction: without the intervention, behavior would have continued as before. Third, repeating this same pattern across additional baselines provides replication within a single study. This built-in replication is the primary basis for concluding that the results are internally valid.

A stable baseline before introducing the intervention is critical. If a baseline is already trending upward, you can’t tell whether a post-intervention improvement was caused by the treatment or was simply a continuation of the existing trend. Standard practice is to keep collecting baseline data until the pattern levels off before moving to the next phase.
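One rough way to screen a baseline for trend is to fit a straight line to it and look at the slope. The sketch below does that with an ordinary least-squares fit. The cutoff `max_abs_slope=0.5` and the minimum of three points are assumptions made for illustration; in practice stability is usually judged visually and the acceptable slope depends on the measurement scale.

```python
def ols_slope(values):
    """Least-squares slope of values against session index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points to estimate a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def baseline_is_stable(values, max_abs_slope=0.5, min_points=3):
    """Crude screen: enough points and a near-flat fitted line.

    max_abs_slope and min_points are illustrative assumptions,
    not published criteria.
    """
    return len(values) >= min_points and abs(ols_slope(values)) <= max_abs_slope


flat = [40, 41, 40, 41, 40]      # words per minute, hypothetical data
rising = [40, 43, 46, 49, 52]    # already improving before any intervention

print(baseline_is_stable(flat))    # flat baseline passes the screen
print(baseline_is_stable(rising))  # rising baseline does not
```

A check like this only flags linear drift. It does not replace looking at the graph, since a baseline can have a flat fitted slope and still be too variable to predict from.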

Three Common Variations

The design comes in three main forms, depending on what you stagger the intervention across:

Across participants. The same intervention targets the same behavior in three or more different people. This is the most common version and tests whether the treatment works for multiple individuals.
Across behaviors. One person receives the intervention for three or more different behaviors, introduced one at a time. This is useful when you want to show the treatment affects specific skills independently.
Across settings. One person and one behavior, but the intervention is introduced in different environments (classroom, home, playground) at staggered times. This tests whether the treatment’s effect holds across contexts or is tied to a particular one.

In all three versions, the core logic is identical: each new introduction of the intervention is a chance to replicate the effect.

Multiple Probe Design: A Practical Variation

A multiple probe design follows the same staggered logic but collects far fewer data points during the baseline phases. Instead of measuring every session, researchers take periodic “probes,” brief check-ins on the behavior, to confirm it hasn’t changed. All participants (or behaviors, or settings) get an initial probe before anyone starts the intervention. After the first participant completes a phase of intervention, everyone gets probed again. This second round confirms that the untreated baselines remain stable.
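The ordering of probes and intervention phases can be written out as a simple sequence. The sketch below is a simplified reading of the description above: every tier is probed before each new tier begins intervention. The function name `probe_schedule` is hypothetical, and whether a final probe round follows the last tier is left out because the description does not say.

```python
def probe_schedule(tiers):
    """Return the ordered steps of a multiple probe design.

    Each step is a (kind, tiers_involved) tuple. Before every tier's
    intervention phase, all tiers receive a probe.
    """
    if len(set(tiers)) != len(tiers):
        raise ValueError("tier names must be unique")
    steps = []
    for tier in tiers:
        steps.append(("probe", tuple(tiers)))      # check everyone
        steps.append(("intervention", (tier,)))    # then treat one tier
    return steps


steps = probe_schedule(["Student 1", "Student 2", "Student 3"])
for kind, who in steps:
    print(f"{kind:12s} {', '.join(who)}")
```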

This variation is especially practical when repeated baseline testing could be tedious, produce practice effects, or feel pointless to participants who are waiting their turn. It gives up some data density in exchange for a more realistic and less burdensome testing schedule.

Why Researchers Choose It Over Reversal Designs

The most common alternative for single-case experiments is the reversal (or ABAB) design, where you introduce a treatment, withdraw it, and reintroduce it to see if the behavior follows. That works well when the behavior can actually reverse, like attention in a classroom. But many interventions teach skills that can’t be unlearned. Once a child learns to read a set of words, you can’t ask them to forget. Withdrawing treatment in those cases won’t produce a clean reversal, so a reversal design cannot show that the treatment was responsible for the change.

Multiple baseline designs solve this problem because they never require withdrawing the intervention. Each baseline serves as its own control, and the staggered timing does the work that withdrawal would have done. This also avoids the ethical discomfort of deliberately taking away a treatment that’s helping someone just to prove it was working.

How Many Baselines You Need

Design quality guidelines recommend at least three baselines (three participants, three behaviors, or three settings). With fewer than three, you don’t have enough replication opportunities to rule out coincidence.

A study published in the Journal of Applied Behavior Analysis examined this question statistically and found that when a multiple baseline design includes at least three tiers and two or more of those tiers show a clear change, the false positive rate stays below 5% while statistical power exceeds 80%, both standard benchmarks for reliable research. Requiring every single tier to show a perfect effect can actually reduce power, so researchers should weigh the overall pattern rather than demanding flawless results from each baseline.

How Results Are Analyzed

Most multiple baseline studies rely on visual analysis. Researchers graph the data for each baseline as a separate panel stacked vertically, sharing the same time axis. They look at four features within and across phases: the average level of the behavior, the trend (is it going up, down, or flat?), the variability (how much the data bounce around), and the immediacy of change when the intervention starts. An ideal result shows a clear, immediate shift in level right at the point of intervention, with the untreated baselines remaining stable until their turn comes.
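The four features can also be put into numbers for one tier, as a supplement to the graph rather than a substitute for it. The sketch below uses the phase mean for level, a least-squares slope for trend, and the standard deviation for variability. Immediacy is taken as the difference between the mean of the first three intervention points and the mean of the last three baseline points. That is one common operationalization, and the window `k=3` is an assumption here. The data are made up.

```python
from statistics import mean, stdev


def ols_slope(values):
    """Least-squares slope of values against session index 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = mean(values)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def phase_summary(values):
    """Level, trend, and variability for one phase (needs >= 2 points)."""
    return {
        "level": mean(values),
        "trend": ols_slope(values),
        "variability": stdev(values),
    }


def immediacy(baseline, intervention, k=3):
    """Mean of the first k intervention points minus the mean of the
    last k baseline points (k=3 is an illustrative assumption)."""
    return mean(intervention[:k]) - mean(baseline[-k:])


baseline = [20, 22, 21, 21]        # hypothetical words per minute
intervention = [35, 36, 38, 37]

print(phase_summary(baseline))
print(phase_summary(intervention))
print(immediacy(baseline, intervention))
```

Running the same summary on each tier, and on the still-untreated tiers at the moment another tier changes, mirrors what a reader does by eye when scanning the stacked panels.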

Beyond visual inspection, researchers sometimes calculate overlap-based effect sizes to quantify how much the intervention and baseline data separate from each other. The most common metrics include Percentage of Nonoverlapping Data (PND), which is the share of intervention-phase data points that exceed the single highest baseline data point (or fall below the lowest one, when the goal is to reduce a behavior), and Nonoverlap of All Pairs (NAP), which compares every possible pairing of a baseline point with an intervention point and reports the proportion of pairs in which the intervention point is better, with ties counted as half. Higher nonoverlap means a stronger, more consistent effect. These numbers are especially useful for combining results across studies in meta-analyses, where visual analysis alone isn’t practical.
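Both metrics are simple enough to compute by hand, and a short sketch makes the definitions concrete. The function names and the `increase_expected` flag are illustrative choices, and the data are invented. Ties in NAP are scored as half a pair, which is the usual convention.

```python
def pnd(baseline, intervention, increase_expected=True):
    """Percentage of Nonoverlapping Data, on a 0-100 scale."""
    if increase_expected:
        threshold = max(baseline)
        count = sum(1 for y in intervention if y > threshold)
    else:
        threshold = min(baseline)
        count = sum(1 for y in intervention if y < threshold)
    return 100.0 * count / len(intervention)


def nap(baseline, intervention, increase_expected=True):
    """Nonoverlap of All Pairs, on a 0-1 scale (ties count as 0.5)."""
    score = 0.0
    for a in baseline:
        for b in intervention:
            diff = b - a if increase_expected else a - b
            if diff > 0:
                score += 1.0
            elif diff == 0:
                score += 0.5
    return score / (len(baseline) * len(intervention))


baseline = [20, 22, 21, 25]            # hypothetical data
intervention = [24, 30, 31, 33, 35]

print(pnd(baseline, intervention))     # 80.0
print(nap(baseline, intervention))     # 0.95
```

In this example a single high baseline point (25) pulls PND down to 80%, while NAP, which uses all 20 pairs, is less affected. That sensitivity to one extreme baseline value is a known weakness of PND.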

Limitations to Watch For

The design’s biggest vulnerability is interdependence between baselines. If introducing the intervention for one participant somehow affects the others, perhaps because they share a classroom and observe each other, the untreated baselines won’t stay flat. When that happens, you lose the comparison that makes the design work. This is sometimes called “covariation” among baselines, and it can make an effective treatment look ambiguous or an ineffective one look successful.

Extended baselines create a second practical problem. The later baselines have to wait longer before receiving the intervention, which can feel frustrating for participants and raises ethical questions about delaying a potentially helpful treatment. The multiple probe variation helps reduce this burden but doesn’t eliminate it entirely.

Finally, because the design typically involves small numbers of participants, generalization depends on replication across separate studies rather than large sample sizes within one. A single well-run multiple baseline study can demonstrate that an intervention works under specific conditions, but broader claims require the same result to appear across different researchers, populations, and settings.