Spatial statistics is a branch of statistics designed to analyze data where location matters. It provides the mathematical tools to detect patterns, relationships, and trends in geographic data, built on a simple but powerful idea known as the First Law of Geography: “everything is related to everything else, but near things are more related than distant things.” That principle, articulated by geographer Waldo Tobler, is the foundation for nearly every method in the field. Rather than treating each data point as independent, spatial statistics explicitly accounts for where observations fall on a map and how their proximity to one another shapes the patterns we see.
Why Location Changes the Math
Most traditional statistics assume that one observation doesn’t influence another. Spatial data violates that assumption constantly. Housing prices in a neighborhood cluster together. Air pollution readings at nearby sensors are more similar than readings taken miles apart. Disease cases concentrate around contamination sources. This tendency for nearby observations to resemble each other is called spatial autocorrelation, and ignoring it leads to misleading results, inflated confidence in findings, and missed patterns.
Spatial statistics corrects for this by building location directly into the analysis. Instead of asking “what is the average?” it asks “where is the average higher, and is that geographic pattern real or just chance?” That shift from asking what to asking where opens up an entirely different set of questions and tools.
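The standard summary of spatial autocorrelation is Moran’s I, which compares each value’s deviation from the mean with the deviations of its neighbors. The sketch below is a minimal numpy implementation on toy data; the four-cell line and its binary adjacency matrix are illustrative assumptions, not anything from a real dataset.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I: near +1 means similar values cluster together,
    near 0 means spatial randomness, negative means a checkerboard-like
    pattern of dissimilar neighbors."""
    z = values - values.mean()                 # deviations from the mean
    num = (weights * np.outer(z, z)).sum()     # cross-products of neighboring deviations
    den = (z ** 2).sum()
    n = len(values)
    return (n / weights.sum()) * (num / den)

# Toy example: four cells in a row, high values clustered on the left.
# W is binary adjacency: 1 where two cells touch, 0 otherwise.
values = np.array([10.0, 9.0, 2.0, 1.0])
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
I = morans_i(values, W)   # positive: high values sit next to high values
```

In practice the weights matrix comes from shared borders or distance bands rather than being typed by hand, and significance is judged against a permutation distribution.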
Three Types of Spatial Data
The methods used in spatial statistics depend heavily on what kind of data you’re working with. A UNESCO overview of the field identifies three primary types, each with its own analytical tradition.
- Geostatistical data: Point measurements of something that varies continuously across space. Think of soil samples measuring lead contamination, weather station readings for temperature, or groundwater depth measurements taken at scattered wells. The variable exists everywhere, but you’ve only measured it at specific spots.
- Lattice (areal) data: Counts or averages tied to defined geographic units like counties, census tracts, or grid cells. Crime counts per neighborhood, election results by district, and cancer rates by county all fall into this category. The data comes pre-aggregated into areas rather than existing as individual points.
- Point pattern data: The locations themselves are the data. Rather than measuring a value at each point, you’re studying the arrangement of events: where earthquakes occurred, where trees are growing in a forest, or where 911 calls originated. The central question is whether the points are clustered, dispersed, or randomly scattered.
Predicting Values Between Sample Points
One of the most widely used spatial statistical techniques is kriging, a method for estimating values at locations where no measurement was taken. If you’ve sampled soil contamination at 50 points across a site, kriging lets you create a complete map of estimated contamination levels everywhere on the property.
Kriging works in two steps. First, it examines every pair of sample points and plots half the average squared difference between their values (the semivariance) against how far apart they sit. This plot, called a variogram, captures the spatial structure of the data: how quickly similarity drops off with distance. Close-together samples might be very similar, while far-apart samples diverge sharply, and the variogram quantifies that relationship.
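This first step can be sketched directly: compute the semivariance for every pair of samples and average it within distance bins. The toy transect below (five samples on a line with smoothly increasing values) is an illustrative assumption; real data would come from field measurements.

```python
import numpy as np

def empirical_variogram(coords, values, bin_width=1.0, n_bins=5):
    """Average the pairwise semivariances 0.5*(z_i - z_j)^2 within
    distance bins to get the empirical variogram."""
    n = len(values)
    dists, semis = [], []
    for i in range(n):
        for j in range(i + 1, n):
            dists.append(np.linalg.norm(coords[i] - coords[j]))
            semis.append(0.5 * (values[i] - values[j]) ** 2)
    dists, semis = np.array(dists), np.array(semis)
    edges = np.arange(n_bins) * bin_width
    gamma = []
    for lo in edges:
        mask = (dists >= lo) & (dists < lo + bin_width)
        gamma.append(semis[mask].mean() if mask.any() else np.nan)
    return edges + bin_width / 2, np.array(gamma)   # bin centers, semivariances

# Toy transect: values change smoothly, so semivariance grows with lag.
coords = np.array([[0., 0.], [1., 0.], [2., 0.], [3., 0.], [4., 0.]])
values = np.array([1., 2., 3., 4., 5.])
lags, gamma = empirical_variogram(coords, values)
```

A parametric model (spherical, exponential, Gaussian) is then fitted to these binned points before any prediction is done.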
Second, the variogram is used to assign weights to each sample point when predicting an unknown location. Points that are closer get more weight, but not in a simplistic way. Kriging accounts for the overall spatial arrangement: if several sample points are clustered together near the prediction location, they share their influence rather than each contributing fully. This makes kriging the best linear unbiased predictor: among all weighted averages of the samples, it minimizes the expected squared prediction error. It also produces an uncertainty estimate alongside each prediction, so you know where your map is confident and where it’s guessing.
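The second step amounts to solving a small linear system: the fitted variogram fills a matrix of sample-to-sample semivariances, a Lagrange multiplier forces the weights to sum to one, and the solution gives both the estimate and its variance. The sketch below assumes a spherical variogram with hand-picked parameters (in practice these are fitted to the empirical variogram) and uses four toy samples at the corners of a unit square.

```python
import numpy as np

def spherical(h, nugget=0.0, sill=1.0, rng=10.0):
    """Spherical variogram model; parameters here are assumed, not fitted."""
    h = np.asarray(h, dtype=float)
    g = np.where(h < rng,
                 nugget + (sill - nugget) * (1.5 * h / rng - 0.5 * (h / rng) ** 3),
                 sill)
    return np.where(h == 0.0, 0.0, g)   # semivariance is zero at distance zero

def ordinary_kriging(coords, values, target):
    """Ordinary kriging at one location: solve for weights plus a
    Lagrange multiplier that forces the weights to sum to 1."""
    n = len(values)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = spherical(d)          # sample-to-sample semivariances
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = spherical(np.linalg.norm(coords - target, axis=1))
    sol = np.linalg.solve(A, b)
    w, mu = sol[:n], sol[n]
    return w @ values, w @ b[:n] + mu   # estimate, kriging variance

coords = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
values = np.array([1., 2., 3., 4.])
est, var = ordinary_kriging(coords, values, np.array([0.5, 0.5]))
# by symmetry every sample gets weight 0.25, so est is the mean, 2.5
```

Note the two promised properties: predicting at a sampled location returns that sample exactly with zero variance, and everywhere else the variance reports how far the map is from its data.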
Finding Clusters and Hot Spots
Hot spot analysis identifies areas where values are unusually high or low compared to what you’d expect if the data were spread randomly. A hot spot is an area with a higher concentration of events than random chance would produce. A cold spot is the opposite.
The most common approach, the Getis-Ord Gi* statistic, calculates a score for each location by comparing the values in its neighborhood (including the location itself) to the global average. If a location and its neighbors all have high values, the score will be large and positive, flagging it as a hot spot. If a location and its neighbors all have low values, it gets flagged as a cold spot. Statistical significance is typically set at a strict threshold (often 99% or higher) to avoid false alarms.
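In most software this score is the Getis-Ord Gi* statistic, a z-score per location. Below is a minimal numpy sketch; the one-dimensional toy data and the neighbor-plus-self weights matrix are illustrative assumptions.

```python
import numpy as np

def getis_ord_gstar(values, W):
    """Gi* z-score per location: large positive -> hot spot, large
    negative -> cold spot. W must include each location as its own
    neighbor (the 'star' in Gi*)."""
    n = len(values)
    xbar = values.mean()
    s = np.sqrt((values ** 2).mean() - xbar ** 2)   # global std (population form)
    Wx = W @ values                                 # neighborhood sums
    Wsum = W.sum(axis=1)
    W2sum = (W ** 2).sum(axis=1)
    num = Wx - xbar * Wsum
    den = s * np.sqrt((n * W2sum - Wsum ** 2) / (n - 1))
    return num / den

# Toy data: a run of high values on the left, low values on the right.
values = np.array([8., 9., 8., 1., 1., 1., 1., 1.])
n = len(values)
W = np.zeros((n, n))   # each location plus its immediate left/right neighbor
for i in range(n):
    for j in range(n):
        if abs(i - j) <= 1:
            W[i, j] = 1.0
gi = getis_ord_gstar(values, W)
# gi is positive inside the high cluster and negative in the low run
```

Real analyses compare these z-scores to critical values (and usually apply a multiple-testing correction) before declaring a hot spot.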
A related family of methods, called Local Indicators of Spatial Association (LISA), goes a step further. These statistics, of which the local Moran’s I is the best known, evaluate each individual observation for evidence of significant spatial clustering of similar values. They can identify not just areas of high or low concentration, but also spatial outliers: a high-value location surrounded by low-value neighbors, or vice versa.
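The local Moran’s I captures this in one line per location: the product of a location’s deviation from the mean with its neighbors’ average deviation. Positive values mean the location resembles its neighbors (high-high or low-low); negative values flag outliers. The toy line of values below, with one high value stranded among lows, is an illustrative assumption.

```python
import numpy as np

def local_morans_i(values, W):
    """Local Moran's I (a LISA statistic). Positive: location resembles
    its neighbors; negative: a spatial outlier."""
    z = values - values.mean()
    m2 = (z ** 2).mean()           # variance normalizer
    lag = W @ z                    # sum of neighbors' deviations
    return (z / m2) * lag

# Toy data: a high cluster, a low cluster, and one high-value outlier
# (index 5) surrounded by low neighbors.
values = np.array([10., 10., 10., 1., 1., 10.])
n = len(values)
W = np.zeros((n, n))               # adjacency along a line, no self-links
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
li = local_morans_i(values, W)
# li[0] > 0: high value among high neighbors (cluster member)
# li[5] < 0: high value among low neighbors (spatial outlier)
```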
Analyzing Point Patterns
When the locations themselves are the data, the core question is deceptively simple: are these points clustered, dispersed, or random? Ripley’s K function, developed by statistician Brian Ripley, provides the standard approach. It works by counting how many other points fall within increasing distances of each point, then comparing those counts to what you’d expect under complete spatial randomness.
If there are more nearby points than expected at a given distance, the pattern is clustered at that scale. If there are fewer, the pattern is regular or dispersed (points are spaced apart, like trees competing for sunlight). The power of this approach is that it evaluates clustering at every distance simultaneously, so you can detect patterns that only emerge at certain scales. Trees in a forest might be clustered at a small scale (growing in patches of good soil) but dispersed at a larger scale (patches spaced evenly apart).
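The counting logic above can be sketched in a few lines. This naive version ignores edge correction (real implementations like spatstat’s adjust for points near the study-area boundary), and the two-clump point pattern is simulated purely for illustration.

```python
import numpy as np

def ripleys_k(points, radii, area):
    """Naive Ripley's K (no edge correction): for each radius r, the
    density-scaled average number of neighbors within r of a point.
    Under complete spatial randomness, K(r) is about pi * r**2."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)    # a point is not its own neighbor
    counts = (d[None, :, :] <= radii[:, None, None]).sum(axis=(1, 2))
    return area * counts / (n * n)

# Simulated pattern: two tight clumps of 20 points in a unit square.
rng = np.random.default_rng(0)
clump1 = rng.normal(loc=[0.25, 0.25], scale=0.01, size=(20, 2))
clump2 = rng.normal(loc=[0.75, 0.75], scale=0.01, size=(20, 2))
points = np.vstack([clump1, clump2])
radii = np.array([0.05, 0.2])
K = ripleys_k(points, radii, area=1.0)
# K far exceeds pi * r**2 at these radii: strong small-scale clustering
```

Plotting K(r) minus the expected π r² across many radii is how the scale-dependent patterns described above (clustered in patches, dispersed between patches) show up.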
Modeling Data Across Regions
When your data is tied to geographic units like counties or districts, spatial dependence gets modeled through adjacency. The most common framework, the Conditional Autoregressive (CAR) model, sets each region’s expected value, conditional on the rest of the map, as a function of the values in its neighboring regions. Two regions are considered neighbors if they share a boundary.
This mirrors how many real processes work. A county’s unemployment rate is influenced by economic conditions in surrounding counties. Disease rates in one district correlate with rates next door because people, pathogens, and environmental exposures don’t respect political boundaries. CAR models capture this by building a matrix that encodes which regions are adjacent and using that structure to model the spatial correlation in the data. On a regular grid, adjacency is straightforward. For irregular regions like counties or zip codes, it’s determined by shared borders.
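The machinery is concrete: build an adjacency matrix, then express each region’s conditional mean in terms of its neighbors. The sketch below uses a regular grid (the straightforward case the text mentions) and a simple, commonly used CAR form in which the conditional mean is a fraction ρ of the neighbors’ average; both the grid and ρ = 0.9 are illustrative assumptions.

```python
import numpy as np

def grid_adjacency(rows, cols):
    """Binary adjacency for a regular grid: cells are neighbors
    if they share an edge (rook adjacency)."""
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:                  # right neighbor
                W[i, i + 1] = W[i + 1, i] = 1.0
            if r + 1 < rows:                  # neighbor below
                W[i, i + cols] = W[i + cols, i] = 1.0
    return W

def car_conditional_mean(x, W, rho=0.9):
    """Simple CAR conditional expectation: each region's expected value
    is rho times the average of its neighbors' current values."""
    return rho * (W @ x) / W.sum(axis=1)

W = grid_adjacency(3, 3)
x = np.arange(9.0)
mu = car_conditional_mean(x, W)
# center cell (index 4) has four neighbors (1, 3, 5, 7), so its
# conditional mean is 0.9 * 4 = 3.6
```

For irregular units like counties, only `grid_adjacency` changes: W is derived from shared borders (e.g., with spdep’s neighbor lists in R), while the conditional-mean structure stays the same.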
Real-World Applications in Public Health
Spatial statistics has a long history in epidemiology, dating back to the 1800s when maps of disease rates first emerged to track outbreaks of yellow fever and cholera. The field accelerated dramatically in the 1970s when the first cancer mortality atlases in the United States revealed distinctive geographic patterns for different cancers, sparking a wave of investigations.
Those atlas-driven studies produced concrete discoveries. A regional excess of oral and throat cancer among women led to the identification of smokeless tobacco as a previously unknown risk factor. A cluster of sinus cancer pointed to occupational hazards in the furniture industry. A pocket of elevated lung cancer rates was linked to living near or working in the arsenic industry. None of these connections would have been visible without mapping the data spatially.
More recent work has used spatial statistics to study cancer incidence near waste incinerators, leukemia risk near oil refineries and power lines, liver cancer near vinyl chloride plants, and birth defects near landfill sites. Scandinavian countries, with their comprehensive health registries, have been especially productive in conducting national-scale spatial studies of environmental exposures. Spatial clustering of Hodgkin disease, combined with laboratory evidence, has even suggested a possible infectious origin for the disease.
Software for Spatial Statistics
The R programming language is the most widely used platform for spatial statistics, with a mature ecosystem of specialized packages. The sf package handles spatial data encoding and has become the standard, replacing the older sp package. For geostatistical modeling and kriging, gstat supports prediction and simulation in up to three dimensions. Point pattern analysis is handled by spatstat, one of the most comprehensive packages in all of R. The spdep package creates the spatial weights matrices that underpin areal data analysis and autocorrelation testing. For raster (gridded) data, the raster package and its successor terra provide the necessary tools.
Outside of R, spatial statistics is also accessible through ArcGIS (which has a dedicated Spatial Statistics toolbox), Python’s PySAL library, and GeoDa, a free tool specifically designed for exploratory spatial data analysis. The choice of platform often depends on whether you need the flexibility of a programming language or prefer a graphical interface.