What Is a Fault Domain in Cloud Infrastructure?

A fault domain is a set of hardware components that share a single point of failure. If one piece of shared infrastructure goes down, like a power supply, a network switch, or even an entire server rack, everything in that fault domain goes down with it. The concept exists so engineers can deliberately spread workloads across multiple fault domains, ensuring that no single hardware failure takes out an entire application.

How Fault Domains Map to Physical Hardware

In a data center, fault domains correspond to real, physical boundaries. At the smallest level, a single disk is its own fault domain. One level up, a server node is a fault domain because all the disks inside it share the same motherboard and power. A server rack is a larger fault domain: every machine in the rack typically shares the same top-of-rack network switch and the same power distribution unit. If that switch fails or the PDU trips, every server in the rack loses connectivity or power simultaneously.

The hierarchy typically looks like this:

  • Disk: A single drive fails, affecting only the data on that drive.
  • Node: A server fails, taking all its disks offline.
  • Rack: A rack-level failure (power, top-of-rack switch, or network partition) knocks out every server in the rack.
  • Data center: A building-wide event, such as a cooling failure or a site-wide power outage, affects all racks.

To be fault tolerant at any given level, your servers and data need to be distributed across multiple fault domains at that level. If you want to survive a rack failure, your application needs to run on servers in at least two different racks. If you want to survive a data center failure, you need presence in multiple data centers.
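The survival rule above can be expressed mechanically. In this minimal sketch (not any cloud provider's API; the data center, rack, and node names are hypothetical), each replica's placement is a (data_center, rack, node) tuple, and an application survives a failure at a given level only if its replicas span at least two distinct fault domains at that level:

```python
LEVELS = {"data_center": 0, "rack": 1, "node": 2}

def survives_failure_at(level, placements):
    """True if losing any single fault domain at `level` leaves a replica standing."""
    depth = LEVELS[level] + 1
    # Two placements share a fault domain at this level when every component
    # down to `depth` matches (a rack is identified by its (dc, rack) prefix).
    domains = {p[:depth] for p in placements}
    return len(domains) >= 2

replicas = [("dc1", "rack1", "node1"),
            ("dc1", "rack2", "node7")]

print(survives_failure_at("node", replicas))         # True: different nodes
print(survives_failure_at("rack", replicas))         # True: different racks
print(survives_failure_at("data_center", replicas))  # False: both in dc1
```

The same two replicas that comfortably survive a rack failure are wiped out together by a data-center failure, which is exactly why each level of tolerance requires distribution at that level.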

Limiting the Blast Radius

The core strategy behind fault domains is limiting what engineers call the “blast radius” of a failure. Instead of letting a single broken component cascade through an entire system, you segment your infrastructure into isolated groups. A power supply failure in one fault domain doesn’t touch servers in another fault domain, because they draw power from a completely different source.

This segmentation is what makes fault domains different from simple redundancy. Redundancy means having backup copies. Fault domain isolation means ensuring those backup copies don’t share the same physical vulnerabilities as the originals. Two copies of your data on two different disks in the same server aren’t truly protected, because one motherboard failure destroys both. Two copies on servers in different racks, connected to different power and network infrastructure, are genuinely independent.
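The redundancy-versus-isolation distinction can be made concrete by listing which fault domains two copies have in common. This is an illustrative sketch with hypothetical hardware names, not a provider API:

```python
def shared_fault_domains(a, b):
    """Return the fault-domain levels two placements have in common."""
    levels = ("data_center", "rack", "node")
    shared = []
    for depth, level in enumerate(levels, start=1):
        # Compare prefixes: sharing a node implies sharing its rack and DC.
        if a[:depth] == b[:depth]:
            shared.append(level)
    return shared

# Two disks in the same server: redundant, but one motherboard kills both.
same_node = shared_fault_domains(("dc1", "rack1", "node3"),
                                 ("dc1", "rack1", "node3"))
# Two servers in different racks: independent below the building level.
diff_rack = shared_fault_domains(("dc1", "rack1", "node3"),
                                 ("dc1", "rack4", "node9"))
print(same_node)  # ['data_center', 'rack', 'node']
print(diff_rack)  # ['data_center']
```

The fewer levels two copies share, the fewer single failures can destroy both at once.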

Fault Domains in Azure

Azure makes fault domains an explicit, configurable concept. When you place virtual machines into an availability set, Azure assigns each VM to a fault domain. Each fault domain shares a common power source and network switch, and by default, an availability set spreads VMs across up to three fault domains. This means a power outage or switch failure can only affect roughly one-third of the VMs in that set.

Azure also has a separate concept called update domains, which control planned maintenance rather than unplanned failures. An availability set can have up to 20 update domains, and Azure restarts only one update domain at a time during maintenance, giving each 30 minutes to recover before moving to the next. Fault domains protect against hardware failures; update domains protect against maintenance-related downtime. They work together but solve different problems.
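A toy model makes the two-axis idea concrete. The round-robin assignment below is an illustrative assumption, not Azure's documented placement algorithm (the platform controls actual placement), but it shows why fault domains and update domains bound different kinds of loss:

```python
FAULT_DOMAINS = 3    # default fault domain count in an availability set
UPDATE_DOMAINS = 20  # maximum update domain count

def place(vm_index):
    """Assign a VM both a fault domain and an update domain (simplified)."""
    return {"fault_domain": vm_index % FAULT_DOMAINS,
            "update_domain": vm_index % UPDATE_DOMAINS}

vms = [place(i) for i in range(9)]

# A hardware failure in fault domain 0 takes out a third of the set...
lost_to_hw = sum(1 for vm in vms if vm["fault_domain"] == 0)
# ...while planned maintenance restarts only one update domain at a time.
lost_to_maint = sum(1 for vm in vms if vm["update_domain"] == 0)
print(lost_to_hw, lost_to_maint)  # 3 1
```

With nine VMs, an unplanned hardware failure costs three of them, while a maintenance pass touches only one at a time.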

For larger deployments using virtual machine scale sets, Azure supports configurable fault domain counts. Regional (non-zonal) scale sets default to five fault domains, while zone-spanning deployments use “max spreading,” which distributes VMs as widely as possible across the available hardware.

Fault Domains in AWS and Oracle

AWS doesn’t use the term “fault domain” directly, but the concept exists through placement groups. A spread placement group strictly places instances across distinct underlying hardware, so no two instances share the same physical server. A partition placement group goes further: it divides instances into logical partitions, where each partition runs on hardware that’s completely independent of the others. This is the approach used for large distributed systems like Hadoop and Kafka, where correlated hardware failures could mean losing an entire data partition.
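The value of partition placement for systems like Kafka can be sketched as a placement rule: if each topic partition's replicas land in distinct hardware partitions, no single correlated hardware failure wipes out every copy. The names and the offset scheme below are illustrative, not AWS's or Kafka's actual assignment logic:

```python
HW_PARTITIONS = 3  # independent hardware groups in the placement group

def place_replicas(topic_partition, replication_factor=3):
    """Spread one topic partition's replicas across distinct hardware partitions."""
    # Offset by the topic partition id so first replicas rotate across groups.
    return [(topic_partition + r) % HW_PARTITIONS
            for r in range(replication_factor)]

for tp in range(4):
    print(tp, place_replicas(tp))
# 0 [0, 1, 2]
# 1 [1, 2, 0]
# 2 [2, 0, 1]
# 3 [0, 1, 2]
```

Every topic partition keeps one replica in each hardware partition, so losing any one group of hardware still leaves two live copies of every partition.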

Oracle Cloud uses the term explicitly. In Oracle’s model, each availability domain (essentially a data center) contains exactly three fault domains. When you launch instances, you distribute them across these three fault domains so that a hardware failure or maintenance event in one doesn’t affect the others. This gives you rack-level isolation within a single data center without needing to architect across multiple geographic locations.

Fault Domains vs. Availability Zones

Fault domains and availability zones operate at different scales. A fault domain is a grouping of hardware within a single data center, typically a rack or a set of racks sharing power and networking. An availability zone is an entire data center (or group of data centers) that’s physically separated from other zones, with independent power, cooling, and internal networking. A failure in one availability zone is unlikely to affect another because they share no infrastructure at all.

The hierarchy works like nesting dolls: a cloud region contains multiple availability zones, and each availability zone contains multiple fault domains. Fault domains protect you from localized hardware failures within a building. Availability zones protect you from building-level disasters. Regions protect you from events that could affect an entire metropolitan area.

Fault Domains in Kubernetes

Container orchestration platforms like Kubernetes bring fault domains into the software layer. Kubernetes lets you define topology spread constraints that control how pods (the smallest deployable units) are distributed across failure domains like regions, zones, or individual nodes. You label your nodes with their topology information, such as which zone or rack they belong to, and then configure rules that force the scheduler to spread your workload evenly across those labels.

For example, if you have nodes in three zones, you can set a constraint that keeps the number of pods in each zone balanced within a maximum difference of one. If zones currently have 2, 2, and 1 matching pods, the next pod gets scheduled into the zone with only 1. This keeps your application running even if an entire zone becomes unavailable, without requiring you to think about the underlying physical hardware directly.
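The scheduling decision in that example can be modeled in a few lines. This is a toy model of the skew check, not the real kube-scheduler: with maxSkew=1, a new pod may only go where placing it keeps max(count) − min(count) within 1:

```python
def feasible_zones(counts, max_skew=1):
    """Zones where placing one more pod keeps the spread within max_skew."""
    ok = []
    for zone in counts:
        after = dict(counts)
        after[zone] += 1  # simulate scheduling the pod into this zone
        if max(after.values()) - min(after.values()) <= max_skew:
            ok.append(zone)
    return ok

counts = {"zone-a": 2, "zone-b": 2, "zone-c": 1}
print(feasible_zones(counts))  # ['zone-c']
```

With counts of 2, 2, and 1, only the lagging zone keeps the skew within bounds, which is why the scheduler sends the next pod there.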

Practical Decisions Around Fault Domains

When you’re designing a cloud deployment, the number of fault domains you need depends on your tolerance for downtime. Three fault domains (the Azure availability set default, and the fixed count in Oracle’s availability domains) means any single hardware failure affects at most one-third of your capacity. If your application can handle running at two-thirds capacity while the failed domain recovers, three is enough. If you need higher resilience, combining fault domains with availability zones gives you protection against both rack-level and building-level failures.

The tradeoff is complexity and cost. Spreading across more fault domains means more network communication between components, potentially higher latency, and more infrastructure to manage. For a simple web application, placing VMs in an availability set with three fault domains is straightforward. For a globally distributed database, you’re layering fault domains inside availability zones inside regions, and each layer adds operational overhead. The goal is matching your isolation strategy to the actual risk: protect against the failures that would hurt your users, without over-engineering against scenarios that are vanishingly unlikely.