A depth camera is a sensor that measures the distance between the camera and every object in its field of view, producing a full 3D map of the scene rather than a flat photograph. Where a regular camera captures color and brightness across a grid of pixels, a depth camera records how far away each point is. This extra layer of distance data (often called the Z-axis) is what lets devices like smartphones recognize your face, robots navigate a room, and augmented reality apps place virtual furniture on your real floor.
How It Differs From a Regular Camera
A standard RGB camera captures three channels of information per pixel: red, green, and blue. That’s enough to recreate the colors and shapes you see, but the image is fundamentally flat. It has no built-in way to tell whether a person in the frame is two feet away or twenty. A depth camera solves this by adding distance information to every pixel, so the output contains not just what something looks like but where it is in three-dimensional space.
Cameras that combine both color and depth data are sometimes called RGB-D sensors. A typical RGB-D capture pairs the three color channels with a per-pixel depth channel, and software can project that depth into the X, Y, and Z coordinates of a 3D point cloud. That combination lets software recover both the appearance and the geometry of a scene from a single capture, which is why depth cameras have become essential in fields ranging from facial recognition to autonomous navigation.
Three Main Technologies
Most depth cameras rely on one of three approaches: time of flight, structured light, or stereo vision. Each has trade-offs in speed, accuracy, and cost.
Time of Flight
A time-of-flight (ToF) camera illuminates the scene with modulated near-infrared light, typically at a wavelength around 850 nanometers, which is invisible to the human eye. The camera then measures how long the light takes to bounce back from each object. More precisely, it detects the phase shift between the outgoing light wave and the returning one, then converts that shift into a distance measurement for every pixel. Because a single image is enough to create a correctly scaled 3D model, ToF cameras are fast and relatively simple to process. They also handle dim environments well, since they supply their own illumination.
The light source is usually a solid-state laser or LED, and the sensor is a specialized CMOS chip tuned to respond only to that narrow infrared band. In the continuous-wave method, which is common in consumer devices, the sensor takes four samples per measurement, each shifted by 90 degrees, to calculate the phase offset. This happens so quickly that ToF cameras can deliver real-time depth video without heavy computational overhead.
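The four-sample phase measurement described above can be sketched in a few lines. This is a simplified, hedged illustration of the standard four-phase continuous-wave demodulation; the sample names (`a0`..`a3`) and the 20 MHz modulation frequency in the comment are illustrative assumptions, not the parameters of any particular sensor.

```python
import math

C = 299_792_458.0  # speed of light, m/s

def tof_distance(a0, a1, a2, a3, mod_freq_hz):
    """Estimate distance for one pixel from four continuous-wave
    samples taken 90 degrees apart (a0..a3).

    The phase offset between the emitted and returned light is
    recovered with an arctangent, then scaled to meters."""
    phase = math.atan2(a3 - a1, a0 - a2)  # radians, in (-pi, pi]
    if phase < 0:
        phase += 2 * math.pi              # wrap into [0, 2*pi)
    # One full phase cycle corresponds to half the modulation
    # wavelength, because the light travels out and back.
    return (C * phase) / (4 * math.pi * mod_freq_hz)

# A phase shift of pi at 20 MHz modulation corresponds to
# c / (8 * f) ~ 3.75 m -- half the sensor's unambiguous range.
```

Note the ambiguity this implies: distances beyond one phase cycle wrap around, which is why ToF cameras have a maximum unambiguous range tied to their modulation frequency.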
Structured Light
A structured light camera projects a known pattern, often thousands of infrared dots, onto the scene. A separate infrared sensor reads how those dots deform as they land on surfaces at different distances. By comparing the observed dot positions to the original projected pattern, the system calculates depth at each point through a process called triangulation.
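One common simplified model of that triangulation, used in Kinect-style sensors, relates each dot's shift against a reference pattern recorded at a known distance to depth. This is a hedged sketch, not any vendor's actual pipeline; the parameter values in the test comment are made up for illustration.

```python
def structured_light_depth(disparity_px, f_px, baseline_m, ref_depth_m):
    """Depth from a projected dot's observed shift (in pixels)
    relative to a reference pattern captured at ref_depth_m.

    Uses the simplified inverse-depth relation
        1/Z = 1/Z0 + d / (f * b)
    where f is the focal length in pixels and b the baseline
    between projector and sensor in meters. Sign conventions
    vary by device; here positive disparity means closer."""
    return 1.0 / (1.0 / ref_depth_m + disparity_px / (f_px * baseline_m))
```

With a zero shift the point sits exactly at the reference distance; larger shifts place it proportionally closer, which matches the intuition that nearby surfaces deform the dot pattern the most.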
Apple’s Face ID system is the most widely known example. First introduced in the iPhone X, it was the first structured-light depth sensor built into a smartphone. The module contains a dot projector (a type of laser called a VCSEL), a flood illuminator that bathes the face in uniform infrared light, a near-infrared image sensor, and a diffractive optical element that shapes the dot pattern. The entire assembly fits within a few millimeters, with about 6 mm of baseline distance between the emitter and the detector. That compact design is why it can sit inside the slim bezel of a phone.
Stereo Vision
Stereo vision mimics human binocular depth perception. Two cameras, spaced a known distance apart, capture the same scene from slightly different angles. Software identifies matching features in both images and measures how far each feature shifts horizontally between the two views. This horizontal shift is called disparity, and it’s inversely proportional to the distance of that point from the cameras. Objects close to the cameras show a large disparity; distant objects show almost none.
The math boils down to a simple relationship: depth equals the focal length times the baseline distance between the two cameras, divided by the disparity. Stereo vision works with visible light, so it doesn’t need an infrared emitter. The downside is that it requires significant computational power to match features across images, and it struggles in scenes with poor lighting or flat, textureless surfaces where there’s nothing distinctive for the algorithm to latch onto.
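That relationship is compact enough to write down directly. The sketch below assumes rectified images (so matching features shift only horizontally) and uses illustrative numbers; `focal_px` is the focal length expressed in pixels, as calibration tools report it.

```python
def stereo_depth(disparity_px, focal_px, baseline_m):
    """Depth from the stereo relationship Z = f * B / d.

    disparity_px: horizontal shift of a feature between the two
                  rectified views, in pixels
    focal_px:     focal length in pixels
    baseline_m:   distance between the two cameras in meters"""
    if disparity_px <= 0:
        return float("inf")  # no measurable shift: point is "at infinity"
    return focal_px * baseline_m / disparity_px

# e.g. f = 700 px, B = 0.12 m, d = 42 px  ->  Z = 2.0 m
```

The division by disparity is why stereo accuracy degrades with distance: far objects produce disparities of a pixel or less, so a small matching error translates into a large depth error.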
What the Data Looks Like
Depth cameras produce two primary types of output. The first is a depth map: a 2D image, often displayed in grayscale, where the brightness of each pixel represents distance. Bright pixels might indicate nearby surfaces and dark pixels distant ones (or vice versa, depending on the convention). A depth map is easy to store and process because it has the same structure as a regular photo, but it only shows the scene from one viewpoint.
The second format is a point cloud, which is a collection of data points positioned at X, Y, and Z coordinates in three-dimensional space. Each point represents a spot on a surface that the camera detected. You can rotate and view a point cloud from any angle, making it far more useful for applications like 3D modeling, mapping, or measuring real-world objects. Many depth cameras generate both formats, producing a depth map in real time and then converting it into a point cloud for downstream processing.
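The conversion from depth map to point cloud is a straightforward back-projection through the pinhole camera model. The sketch below assumes the intrinsics (`fx`, `fy`, `cx`, `cy`) come from a prior calibration and that the depth map is a row-major grid of distances in meters; real SDKs provide equivalent helpers.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows, meters per pixel)
    into a point cloud using the pinhole model:

        X = (u - cx) * Z / fx
        Y = (v - cy) * Z / fy
        Z = depth at pixel (u, v)

    fx, fy are focal lengths in pixels; cx, cy is the principal
    point. Pixels with no measurement (depth <= 0) are skipped."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # sensor reported no valid depth here
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

This is also why a point cloud can contain fewer points than the depth map has pixels: dropouts from reflective or out-of-range surfaces simply produce no point at all.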
Where Depth Cameras Are Used
Robotics is one of the biggest application areas. Depth cameras let robots build maps of their surroundings and avoid obstacles in real time, a process called simultaneous localization and mapping (SLAM). In one recent agricultural robotics system, a depth camera served as the primary sensor for autonomous navigation inside a commercial chicken farming house. The robot achieved a positioning error of just 0.053% over a 500-meter path and could dodge a moving person and return to its planned route in 3.2 seconds, with a final positional error of only about 3 centimeters.
In consumer electronics, depth cameras enable facial recognition, portrait-mode photography (where the background is blurred based on actual distance data), and augmented reality. AR apps use the depth map to understand which surfaces are floors, walls, or tables, so virtual objects can be placed convincingly in a real space. Gaming systems like the original Microsoft Kinect popularized depth cameras for full-body motion tracking, capturing skeletal movements without any controller.
Other uses include 3D scanning for manufacturing quality control, gesture recognition in automotive interfaces, volume measurement in logistics, and body measurement for online clothing fitting. Anywhere a machine needs to understand the shape and arrangement of physical objects, a depth camera is likely involved.
Limitations to Know About
Because structured light and time-of-flight cameras rely on infrared emitters, they’re susceptible to interference from other infrared sources, including direct sunlight. Outdoors on a bright day, the ambient infrared radiation can overwhelm the camera’s own signal, degrading accuracy or causing the sensor to fail entirely. This is why most IR-based depth cameras perform best indoors or in controlled lighting.
Stereo vision cameras avoid the sunlight problem since they use visible light, but they introduce their own weaknesses: poor performance in low light and difficulty with uniform, featureless surfaces like white walls.
Another consideration is motion. Many depth cameras use a global shutter, which captures the entire sensor at once. Cameras with a rolling shutter instead read out the image row by row, introducing a slight time delay from top to bottom. When objects are moving quickly, that delay can cause skewing or distortion in the depth data. For high-speed applications, a global shutter sensor is typically necessary to keep measurements accurate.
Range is also a factor. Consumer-grade depth cameras (like those in phones and tablets) typically work best within a few meters. Industrial and automotive depth sensors can reach tens of meters or more, but accuracy generally decreases with distance regardless of the technology.