How Accurate Are Smartwatches? Health Metrics Tested

Smartwatches are surprisingly accurate for some health metrics and unreliable for others. Heart rate and atrial fibrillation detection perform closest to clinical standards, while calorie estimates can be off by 30% or more. The accuracy you get depends heavily on what you’re measuring, which device you’re wearing, and what you’re doing at the time.

Step Counting: Accurate on Walks, Less So at Home

Step counting is the oldest trick in the wearable playbook, and it works well when you’re actually walking. During treadmill walking, wrist-worn monitors typically land within about 6% of a manual count. That margin balloons in real-world conditions. A CDC-published study found that during daily activities like cooking, cleaning, and moving around the house, the difference between wrist-worn and criterion monitors jumped to 22%.

The pattern makes sense once you understand how step detection works. These devices rely on the swinging motion of your wrist to infer that you’re taking steps. When you push a shopping cart, a stroller, or a wheelchair, your wrist stays relatively still, and studies show step counts can drop 35% to 95% below actual steps in those situations. Slower walking speeds also tend to produce undercounts. Across a whole day, though, the errors run the other way: free-living conditions generally lead to overestimates of 10% to 35%, meaning your watch is more likely to give you extra credit than shortchange you on a typical day.
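The percentages in this section are relative errors against a reference count. A minimal Python sketch of that arithmetic follows. The function name and every step count are hypothetical, chosen only to fall inside the ranges quoted above.

```python
def percent_error(device_steps: int, actual_steps: int) -> float:
    """Signed percent error; positive means the device overcounted."""
    if actual_steps <= 0:
        raise ValueError("actual_steps must be positive")
    return 100.0 * (device_steps - actual_steps) / actual_steps


# Hypothetical counts for illustration, not measurements from any study.
treadmill = percent_error(device_steps=940, actual_steps=1000)
shopping_cart = percent_error(device_steps=400, actual_steps=1000)
typical_day = percent_error(device_steps=9600, actual_steps=8000)

print(f"Treadmill walk: {treadmill:+.0f}%")
print(f"Pushing a cart: {shopping_cart:+.0f}%")
print(f"Typical day:    {typical_day:+.0f}%")
```

A negative result is an undercount and a positive one is an overcount, which is why the same watch can look stingy behind a shopping cart and generous over a full day.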

Heart Rate and Atrial Fibrillation Detection

Heart-related features are where smartwatches genuinely shine. A large meta-analysis covering 26 studies and over 17,000 patients found that smartwatch ECG features detect atrial fibrillation with 95% sensitivity and 97% specificity overall. In practical terms, that means these devices catch the vast majority of true cases while rarely flagging a healthy rhythm as abnormal.

Performance does vary by brand. Samsung devices led with 97% sensitivity and 96% specificity. The Apple Watch came in at 94% sensitivity and 97% specificity. The Withings ScanWatch scored lower, at 89% sensitivity and 95% specificity. These numbers are strong enough that some cardiologists now use smartwatch ECG recordings as a starting point for further evaluation, though a clinical ECG remains necessary for a formal diagnosis.
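To see what sensitivity and specificity mean in headcounts, the sketch below applies the pooled 95% and 97% figures to a made-up group. The group size, the 100/900 split, and the function name are assumptions for illustration. They are not taken from the meta-analysis and do not reflect how common atrial fibrillation is in the general population.

```python
def expected_outcomes(n_with_afib, n_healthy, sensitivity, specificity):
    """Expected results when everyone in a group records a watch ECG."""
    true_positives = sensitivity * n_with_afib
    true_negatives = specificity * n_healthy
    return {
        "caught": true_positives,
        "missed": n_with_afib - true_positives,
        "correctly_cleared": true_negatives,
        "false_alarms": n_healthy - true_negatives,
    }


# Hypothetical group: 100 people with atrial fibrillation, 900 without.
pooled = expected_outcomes(100, 900, sensitivity=0.95, specificity=0.97)
for label, count in pooled.items():
    print(f"{label}: {count:.0f}")
```

Under these assumptions the watch would catch about 95 of the 100 true cases and raise about 27 false alarms among the 900 healthy wearers, which is why a flagged reading is treated as a prompt for a clinical ECG.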

Sleep Tracking: Good for Trends, Imprecise on Stages

When researchers compared three popular devices against polysomnography (the gold-standard sleep study performed in a lab), the results were a mixed bag. Each device had strengths and blind spots in identifying specific sleep stages.

The Apple Watch was best at detecting light sleep, correctly identifying it 86% of the time, but caught only about half of deep sleep periods. The Oura Ring was the most balanced, hitting 78% for light sleep, 80% for deep sleep, and 76% for REM sleep. The Fitbit fell in between, performing reasonably on light sleep (78%) but dropping to 62% for deep sleep and 67% for REM.

One useful way to think about these numbers: when the Oura Ring said you were in deep sleep, clinical equipment confirmed it about 77% of the time. When the Apple Watch flagged deep sleep, it was confirmed 88% of the time, meaning it was conservative but more often correct when it did identify deep periods. For tracking whether your sleep is generally improving or worsening over weeks and months, these devices offer useful directional data. For diagnosing a sleep disorder, they’re not a substitute for a clinical study.
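The two kinds of percentages quoted for sleep stages answer different questions, and a small Python sketch can make the difference concrete. The epoch counts below are hypothetical, picked so that the results resemble the Apple Watch deep-sleep figures. They are not data from the study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of real deep-sleep epochs that the device labeled deep sleep."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos: int, false_pos: int) -> float:
    """Share of device-labeled deep-sleep epochs that really were deep sleep."""
    return true_pos / (true_pos + false_pos)


# Hypothetical night with 100 epochs of deep sleep in the lab recording.
true_pos = 50   # device said deep sleep, lab agreed
false_neg = 50  # lab said deep sleep, device said something else
false_pos = 7   # device said deep sleep, lab disagreed

print(f"Sensitivity: {sensitivity(true_pos, false_neg):.0%}")
print(f"Precision:   {precision(true_pos, false_pos):.0%}")
```

With these counts the device finds only half of the real deep sleep yet is right about 88% of the time when it does report it, the "conservative but usually correct" pattern described above.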

Calorie Burn: The Least Reliable Metric

If there’s one number you should take with a heavy grain of salt, it’s the calorie estimate on your wrist. Smartwatches estimate energy expenditure using a combination of heart rate, motion data, and user profile information like age and weight. The problem is that individual variation in metabolism is enormous, and no wrist sensor can measure it directly.

In controlled testing against indirect calorimetry (which measures the oxygen you inhale and the carbon dioxide you exhale to calculate actual energy use), the Apple Watch Series 4 underestimated calorie burn by roughly 30%. The Fitbit Versa overestimated it by about 45% in people without disabilities. Those are not small margins. A workout your Fitbit says burned 500 calories may have actually burned closer to 350. Conversely, a workout your Apple Watch logs at 350 calories may have actually burned closer to 500.

These errors are consistent enough that your watch can still be useful for relative comparisons. If Tuesday’s run shows higher calorie burn than Monday’s, that directional trend is probably real even if the absolute numbers are off. But basing your food intake on what your watch says you burned is a recipe for frustration.
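The arithmetic behind the 30% and 45% errors can be sketched in a few lines of Python. The function name is hypothetical, and the bias values are the study averages quoted above for two specific models. Your own error may differ, so treat this as an illustration of scale rather than a correction to apply.

```python
def implied_actual_calories(displayed: float, bias: float) -> float:
    """Back out actual energy use from a reading with a known relative bias.

    bias is a fraction of the actual value: +0.45 means the device reads
    45% high, -0.30 means it reads 30% low.
    """
    if bias <= -1.0:
        raise ValueError("bias must be greater than -1")
    return displayed / (1.0 + bias)


# Average biases from the text; individual error is an unknown.
fitbit = implied_actual_calories(500, bias=0.45)
apple = implied_actual_calories(350, bias=-0.30)

print(f"Fitbit shows 500, implied actual: {fitbit:.0f}")
print(f"Apple Watch shows 350, implied actual: {apple:.0f}")
```

The first line works out to about 345 calories and the second to about 500, a gap large enough to matter if you are planning meals around it.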

VO2 Max: A Reasonable Ballpark

VO2 max, the measure of your body’s peak ability to use oxygen during exercise, is increasingly featured on devices from Garmin, Apple, and others. These estimates skip the lab entirely, using heart rate response during outdoor runs or walks to calculate a predicted value.

The accuracy is decent but not precise. Studies generally show wearable VO2 max estimates fall within about 7% to 10% of lab-measured values. A study on the Garmin Fenix 6 found a mean absolute percentage error of about 7% in athletic users, with a moderate-to-good correlation of 0.70 against laboratory results. In practice, a 7% error means that if your true VO2 max is 45 mL/kg/min, a typical reading would fall somewhere between roughly 42 and 48. That is good enough to track fitness trends over months, but not a replacement for metabolic testing if you need a precise number for training zones or medical assessment.
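Mean absolute percentage error is the average of each estimate's distance from the lab value, expressed as a share of the lab value. The Python sketch below shows the calculation. The function names and the paired readings are hypothetical and are not data from the Garmin study.

```python
def mape(estimates, references):
    """Mean absolute percentage error of estimates against lab references."""
    if len(estimates) != len(references) or not references:
        raise ValueError("need two equal-length, non-empty sequences")
    errors = [abs(e - r) / r for e, r in zip(estimates, references)]
    return 100.0 * sum(errors) / len(errors)


def typical_band(true_value, error_pct):
    """Range covered by plus or minus error_pct around a true value."""
    delta = true_value * error_pct / 100.0
    return true_value - delta, true_value + delta


# Hypothetical VO2 max values in mL/kg/min for four runners.
lab = [45.0, 52.0, 38.0, 60.0]
watch = [48.0, 49.0, 40.0, 56.0]
print(f"MAPE: {mape(watch, lab):.1f}%")

low, high = typical_band(45.0, 7.0)
print(f"A 7% error around 45 spans roughly {low:.0f} to {high:.0f}")
```

Because the measure is an average, individual readings can land outside the band. It describes typical error, not a guaranteed limit.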

Blood Pressure: Promising but Not Ready

Cuffless blood pressure monitoring is the newest frontier for wearables, and the technology is still catching up to the promise. A systematic review comparing cuffless wrist devices to standard 24-hour blood pressure monitors found that daytime readings were reasonably close, with average differences of about 1 mmHg for the top number (systolic) and less than 1 mmHg for the bottom number (diastolic).

Nighttime is where things fall apart. Systolic readings drifted by about 4.5 mmHg and diastolic readings by nearly 6 mmHg during sleep. That matters because nighttime blood pressure is particularly important for assessing cardiovascular risk. Devices using light-based sensors (photoplethysmography) showed better accuracy across all time periods than other approaches, but researchers note this finding still needs validation in larger studies. For now, cuffless blood pressure features are best treated as a rough trend indicator, not a replacement for a standard cuff.

What Affects Accuracy Across All Metrics

Several factors shift accuracy regardless of what you’re measuring. Fit is the most basic: a loose watch that slides around your wrist produces noisier optical heart rate data, which cascades into worse sleep staging, calorie estimates, and blood oxygen readings. Wearing the device snug, about one finger-width above your wrist bone, helps.

Skin tone and tattoos can interfere with the green LED sensors used for heart rate monitoring, though newer devices with additional sensor wavelengths have narrowed this gap. Movement artifacts, the jostling that happens during high-intensity exercise or activities with lots of wrist motion, remain a persistent source of error across all brands.

Your body type also matters. Most algorithms are trained on datasets that skew toward younger, healthier, average-weight adults. If you fall outside that profile, expect wider error margins, particularly for calorie estimates and VO2 max predictions. Data on wheelchair users offer a stark example: errors roughly doubled compared with ambulatory users on the same devices.

The bottom line is that smartwatches are best used as trend-tracking tools rather than diagnostic instruments. A single reading on any given metric can be off by a meaningful amount. But consistent patterns over days and weeks, like a rising resting heart rate, declining sleep quality, or improving fitness estimates, tend to reflect real changes worth paying attention to.
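One simple way to read a trend rather than a single number is to compare weekly averages. The Python sketch below does that for resting heart rate. The function name, the seven-day window, and the readings are all assumptions for illustration, not a method taken from any device or study.

```python
from statistics import mean


def weekly_shift(daily_values, days_per_week=7):
    """Difference between the mean of the latest week and the week before."""
    if len(daily_values) < 2 * days_per_week:
        raise ValueError("need at least two full weeks of readings")
    latest = daily_values[-days_per_week:]
    previous = daily_values[-2 * days_per_week:-days_per_week]
    return mean(latest) - mean(previous)


# Hypothetical resting heart rate readings in beats per minute.
resting_hr = [58, 60, 57, 59, 61, 58, 60,   # earlier week
              62, 61, 63, 60, 64, 62, 63]   # latest week

print(f"Week-over-week change: {weekly_shift(resting_hr):+.1f} bpm")
```

Averaging over a week smooths out single-day noise, so a shift of a few beats per minute that persists is more informative than any one morning's reading.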