Live migration is the process of moving a running virtual machine from one physical server to another without shutting it down or interrupting the services it provides. The entire transfer happens in the background while the virtual machine continues to operate, typically with a pause of only a fraction of a second, brief enough that users rarely notice. It’s a foundational technology in modern data centers and cloud computing, enabling everything from routine hardware maintenance to automatic load balancing across thousands of servers.
How Live Migration Works
A virtual machine is essentially a software-based computer running inside a physical server. It has its own operating system, applications, and a chunk of memory assigned to it. When a live migration begins, the system needs to move all of that, including the contents of memory, over a network connection to a different physical server, all while the virtual machine keeps running.
The most common approach is called pre-copy migration. The system starts by copying all of the virtual machine’s memory pages to the destination server while the VM continues running on the original host. The problem is that while this copying is happening, the running VM keeps changing some of those memory pages. These changed pages are called “dirty pages,” and they need to be copied again. So the system runs through multiple rounds, each time re-sending only the pages that changed since the last round. With each pass, the number of dirty pages shrinks.
Once the remaining changes are small enough, the system briefly pauses the virtual machine, copies the final batch of dirty memory pages along with the processor state, and resumes the VM on the destination server. This pause is the only moment of true downtime, and it’s typically so short that network connections don’t even time out.
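The iterative loop described above can be sketched as a toy simulation in Python. This is illustrative only, not hypervisor code: `pre_copy_rounds` and its parameters are invented names, and the model assumes pages are dirtied at a constant rate while each round's data is in flight.

```python
def pre_copy_rounds(mem_pages, dirty_rate, bandwidth, stop_threshold,
                    max_rounds=30):
    """Simulate iterative pre-copy migration.

    dirty_rate and bandwidth are in pages per second. Returns
    (rounds_needed, pages_in_final_pause), or (None, backlog) if the
    dirty set never shrinks below stop_threshold.
    """
    to_send = mem_pages                      # round 1 copies all of memory
    for round_no in range(1, max_rounds + 1):
        seconds = to_send / bandwidth        # time spent sending this round
        to_send = int(dirty_rate * seconds)  # pages dirtied meanwhile
        if to_send <= stop_threshold:        # small enough: pause and finish
            return round_no, to_send
    return None, to_send                     # workload outpaces the network

# Converges: the VM dirties pages at 20% of the link speed, so each
# round's dirty set is a fifth of the previous one.
rounds, pause_pages = pre_copy_rounds(1_000_000, 20_000, 100_000, 1_000)

# Never converges: the dirty rate exceeds the link speed.
stuck, backlog = pre_copy_rounds(1_000_000, 150_000, 100_000, 1_000)
```

Because each round's dirty set is roughly (dirty rate ÷ bandwidth) times the previous one, the loop converges geometrically when that ratio is below 1 and diverges otherwise, which is exactly the failure mode discussed later under "What Can Go Wrong."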
An alternative approach, post-copy migration, works in the opposite direction. It moves the VM to the destination server first and starts running it there immediately, then pulls over memory pages on demand as the VM needs them. A third method, hybrid-copy, combines elements of both. Each approach makes different tradeoffs between total migration time and the length of that final pause.
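Post-copy's demand paging can be modeled in a similarly hedged sketch. The class name and structure here are invented for illustration; real hypervisors trap page faults in hardware, but the bookkeeping is the same idea: run at the destination immediately, and fetch each page from the source the first time it is touched.

```python
class PostCopyDest:
    """Toy model of post-copy migration: the VM runs at the destination
    right away, pulling memory pages from the source host on first use."""

    def __init__(self, source_pages):
        self.source = source_pages   # pages still held by the old host
        self.local = {}              # pages already transferred
        self.network_fetches = 0     # how many reads crossed the network

    def read(self, page_no):
        if page_no not in self.local:               # a "page fault"
            self.local[page_no] = self.source[page_no]  # pull on demand
            self.network_fetches += 1
        return self.local[page_no]

dest = PostCopyDest(source_pages={n: f"data-{n}" for n in range(100)})
dest.read(7)
dest.read(7)   # second access is served locally
dest.read(42)
# Only first-touch reads cross the network: 2 fetches for 3 reads.
```

The tradeoff is visible even in the toy: downtime is near zero because the VM starts immediately, but until every page has been pulled, any memory access may stall on a network round trip.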
Why Organizations Use Live Migration
The most visible use case is hardware maintenance. Physical servers eventually need firmware updates, component replacements, or full hardware refreshes. Without live migration, every virtual machine on that server would need to be shut down, causing service interruptions. With it, VMs are quietly moved to other hosts, the maintenance happens on the now-empty server, and users experience no disruption. Major cloud providers rely on this constantly. AWS, for example, uses live migration to move customer instances to replacement servers during host maintenance events, preserving the instance’s ID and IP address throughout the process.
Load balancing is another primary benefit. When one physical server becomes overloaded with too many demanding workloads, VMs can be migrated to servers with more available resources. This relieves congestion and can improve application performance by giving workloads access to the CPU and memory they actually need.
Power management takes the opposite approach. During low-demand periods, VMs running light workloads can be consolidated onto fewer physical servers, allowing the emptied servers to be powered down. For large data centers running thousands of servers, this meaningfully reduces electricity costs and cooling requirements.
Live migration also enables proactive fault tolerance. If monitoring systems detect early signs of a hardware failure, like a degrading disk or rising temperatures, VMs can be moved off that server before anything actually breaks. This turns what would have been an unplanned outage into a routine, invisible migration.
What Can Go Wrong
Live migration isn’t foolproof. The biggest challenge is the dirty page rate: how fast the running VM is modifying its memory. If an application is writing to memory faster than the network can transfer those changes, the pre-copy process never converges. Each round of copying produces just as many dirty pages as the last, and the migration either takes an unacceptably long time or fails entirely.
This is why network bandwidth between servers matters so much. A VM running a write-heavy database workload on a slow network link is a worst-case scenario for live migration. The amount of data transferred in the first round is the VM’s entire memory footprint, which can be tens or hundreds of gigabytes. If dirty pages keep accumulating across subsequent rounds, those pages get retransmitted repeatedly, wasting network bandwidth and dragging out the total migration time.
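A quick back-of-the-envelope check makes the worst case concrete. The numbers below are illustrative, not from any particular platform:

```python
# Rough feasibility check for pre-copy migration (illustrative numbers).
mem_gb = 64                    # VM memory footprint
link_gbps = 10                 # migration network link
dirty_gb_per_s = 2.0           # workload's memory write rate

bandwidth_gb_per_s = link_gbps / 8           # 10 Gbps ≈ 1.25 GB/s
first_round_s = mem_gb / bandwidth_gb_per_s  # ≈ 51.2 s for the full copy
ratio = dirty_gb_per_s / bandwidth_gb_per_s  # must be < 1 to converge

print(f"first round: {first_round_s:.1f}s, dirty/bandwidth ratio: {ratio:.2f}")
# A ratio >= 1 means each round produces at least as much dirty memory
# as it manages to send, so the pre-copy loop never converges.
```

Here the workload dirties memory at 1.6x the link's transfer rate, so no number of pre-copy rounds will ever shrink the dirty set, and the migration must either fail or fall back to a long stop-and-copy pause.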
The other key metric is the total migration time itself. Even though the VM stays running throughout most of the process, a long migration ties up network resources and leaves the system in a partially migrated state that’s vulnerable to failures on either the source or destination host.
Live Migration Across Platforms
Different virtualization platforms implement live migration with their own tooling and terminology, though the core concept is the same.
- VMware vMotion is the most established implementation. It works across different storage types, including network-attached storage, storage area networks, and local disks. VMware’s Distributed Resource Scheduler can automatically trigger vMotion events to rebalance workloads across hosts based on resource availability. You can also set up multiple dedicated vMotion networks, which helps optimize transfers for different workload types.
- Microsoft Hyper-V Live Migration provides similar functionality using a different data transfer protocol (SMB 3.0). It generally operates over a single network path unless you configure network load balancing separately. Recent versions added Storage Migration capabilities, though real-time data movement options are more limited compared to VMware.
- KVM, the open-source hypervisor built into the Linux kernel, supports live migration natively and is the foundation for most OpenStack deployments and several public cloud platforms.
Live Migration in Public Clouds
If you’re running workloads on AWS, Google Cloud, or Azure, live migration is happening to your instances regularly, whether you realize it or not. Cloud providers maintain massive fleets of physical hardware that constantly needs patching and replacing. Live migration lets them do this transparently.
Google Cloud was an early pioneer of transparent live migration for its Compute Engine instances and performs it routinely during infrastructure maintenance. AWS uses live migration on Dedicated Hosts to move instances to replacement hardware, retaining attributes like instance IDs and IP addresses, with migrations completing within 24 hours of a maintenance event being triggered. In cases where live migration isn’t possible, AWS falls back to reboot-based maintenance, which requires stopping and restarting the instance.
For most cloud users, the practical takeaway is simple: your virtual machines are not permanently tied to a specific piece of hardware. They move around, and the entire system is designed so you never need to care when they do.