How to Do Mocap: Methods, Setup, and Data Cleanup

Motion capture records real human movement and translates it into digital animation data. Whether you’re a solo creator rigging characters for a game or part of a team producing film-quality performances, the process follows the same core steps: choose a capture method, prepare your space and performer, record the data, then clean and apply it to a 3D character. The technology ranges from $750 smartphone-paired setups to $50,000+ professional studio rigs, so your entry point depends on your budget and quality needs.

Choose Your Capture Method

There are three main approaches to motion capture, each with different hardware, accuracy, and cost tradeoffs.

Optical (Marker-Based) Systems

This is the gold standard used in film and AAA game studios. Multiple infrared cameras surround a capture volume and track small reflective markers attached to the performer’s body. The cameras triangulate each marker’s 3D position many times per second. Systems from Qualisys and Vicon start at $50,000 or more for a full setup, and they require a dedicated room with controlled lighting. The tradeoff for that cost is sub-millimeter accuracy and extremely reliable tracking across fast, complex movements.
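The triangulation step can be sketched with the standard linear (DLT) method. Everything below is illustrative: the toy projection matrices assume pre-calibrated cameras, and real systems combine many cameras with nonlinear refinement, not just two views.

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation: recover one marker's 3D position
    from its 2D image coordinates in two calibrated cameras."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector with the
    # smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Two toy cameras: identical intrinsics, second one shifted along x.
P1 = np.array([[1., 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]])
P2 = np.array([[1., 0, 0, -1], [0, 1, 0, 0], [0, 0, 1, 0]])
marker = np.array([0.2, 0.3, 2.0, 1.0])      # true position (homogeneous)
uv1 = (P1 @ marker)[:2] / (P1 @ marker)[2]   # projected image coords
uv2 = (P2 @ marker)[:2] / (P2 @ marker)[2]
recovered = triangulate(P1, P2, uv1, uv2)    # ≈ [0.2, 0.3, 2.0]
```

With more than two cameras you simply stack two more rows into A per view, which is why adding cameras makes the solve more robust to occlusion.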

Markers are placed at specific anatomical landmarks. The widely used Helen Hayes marker set, for example, positions markers on the lateral sides of joints and uses small wands protruding from the thigh and shank to help calculate where the knee and ankle centers sit inside the body. Hip joint centers are computed mathematically from pelvic markers and leg length. Before a recording session, the performer stands in a static T-pose so the software can map how all the markers relate to the skeleton underneath.

Inertial (Sensor-Based) Suits

Inertial systems skip the cameras entirely. Instead, small sensor modules are strapped to the performer’s body segments. Each module contains an accelerometer (measuring linear acceleration), a gyroscope (measuring rotational speed), and a magnetometer (measuring orientation relative to Earth’s magnetic field). Software fuses these three signals together to estimate each sensor’s 3D orientation in real time.
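The fusion idea can be sketched with the simplest possible version, a one-axis complementary filter. Real suits fuse all three sensors into full 3D orientations (typically with a Kalman-style filter), so the numbers and the single-axis simplification here are purely illustrative:

```python
def fuse(prev_angle, gyro_rate, accel_angle, dt, alpha=0.98):
    """One-axis complementary filter: trust the gyro's smooth short-term
    integration, but pull slowly toward the drift-free accelerometer
    estimate of tilt."""
    gyro_angle = prev_angle + gyro_rate * dt   # integrate angular rate
    return alpha * gyro_angle + (1 - alpha) * accel_angle

# Stationary sensor tilted 10 degrees; the gyro has a 0.5 deg/s bias
# that would drift unboundedly if integrated alone.
angle, dt = 0.0, 0.01
for _ in range(2000):
    angle = fuse(angle, gyro_rate=0.5, accel_angle=10.0, dt=dt)
# angle settles near 10 degrees despite the gyro bias
```

The accelerometer correction is what keeps tilt from drifting; heading (rotation about gravity) has no such correction from the accelerometer, which is why inertial suits lean on the magnetometer for yaw and why magnetic interference matters so much.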

The big advantage is portability. You can capture outdoors, in a warehouse, or in a living room with no camera setup. Prices are far more accessible: the Perception Neuron 3 starts around $750 for a basic set and goes up to about $1,600 for a full-body kit. The Rokoko Smartsuit Pro II, popular with indie creators, runs $2,745 to $3,495. For professional-grade inertial capture, Xsens systems range from $5,000 to $13,000 or more.

The main limitation is drift. Gyroscopes accumulate small errors over time, and magnetometers can be thrown off by nearby metal objects, electronics, or reinforced concrete floors. This means inertial data often needs more cleanup than optical data, and very long recording sessions may require periodic recalibration.

Markerless (Camera + AI) Systems

Markerless capture uses standard video cameras and computer vision algorithms to estimate body poses directly from footage. No suit, no markers. The performer just moves naturally in front of one or more cameras. Recent benchmarking shows markerless systems achieving strong agreement with optical systems for major joint movements: hip and knee angles in the forward-backward plane typically land within 3 to 6 degrees of optical measurements, and walking speed errors average around 0.16 meters per second.

These systems shine in environments where marker-based setups struggle. Reflective gym floors and polished surfaces create infrared reflections that confuse optical cameras, but markerless systems using visible light avoid that problem. The tradeoff is reduced accuracy for subtle movements, particularly side-to-side pelvic motion, where agreement with optical systems drops significantly. For many animation and game development purposes, though, the accuracy is more than sufficient.

Set Up Your Capture Space

Your environment matters more than most beginners expect. For optical systems, you need a room where you can control the lighting completely. Infrared cameras are sensitive to stray reflections, so any shiny surface in the capture volume (glossy floors, glass panels, metallic equipment) can create false marker readings. Studios typically cover reflective floors with anti-reflective matting and use camera masking in software to ignore problem areas. Fluorescent and LED lights that flicker at certain frequencies can also interfere, so consistent, non-flickering overhead lighting is ideal.

For inertial suits, the space requirements are much simpler. You mainly need enough room for the performer to move freely. Avoid areas with strong magnetic interference: large metal structures, speakers, and electronic equipment can distort the magnetometer readings that keep the sensors oriented correctly. A clear space of roughly 3 by 3 meters works for most standing performances. Larger areas are needed for walking, running, or fight choreography.

Markerless setups fall somewhere in between. You don’t need to worry about infrared reflections, but the AI algorithms are sensitive to lighting variations. Fluctuating sunlight, harsh shadows, and very bright backlighting can all degrade tracking quality. A well-lit room with even, diffused lighting gives the best results. If you’re shooting outdoors, overcast days produce more consistent tracking than direct sunlight.

Prepare the Performer

For marker-based capture, the performer wears a tight-fitting suit (often a simple compression suit) so markers stay fixed relative to the skin. Marker placement follows a specific protocol depending on what you’re tracking. A full-body setup typically involves 30 to 50 markers on joints, limb segments, and the torso. Each marker sits on a precise anatomical landmark: the bony bump on the outside of your ankle, the point of your shoulder, the base of your spine. Getting these placements right is critical because the software calculates joint centers based on where it sees the markers.

For inertial suits, preparation is faster. You strap on the suit, power up the sensors, and run a calibration routine. This usually involves standing in a T-pose (arms straight out to the sides) and sometimes walking a few steps in a straight line so the software can map each sensor to the correct body segment and establish a baseline orientation.

Markerless systems require the least preparation. The performer can wear normal clothing, though very loose or flowing garments can confuse the tracking algorithms. High-contrast clothing that’s distinct from the background tends to produce cleaner results.

Record the Performance

Before recording your actual performance takes, capture a reference pose. Nearly every system requires a static T-pose as its starting point. This tells the software the performer’s proportions: limb lengths, shoulder width, and the neutral position of every joint. Some systems also ask for a range-of-motion calibration where the performer rotates each joint through its full movement to help the software understand individual body mechanics.
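The proportion measurement from a T-pose frame amounts to distances between marker pairs. The marker names and positions below are hypothetical, just to show the shape of the computation:

```python
import numpy as np

# One captured T-pose frame: hypothetical marker names -> positions in meters.
tpose = {
    "l_shoulder": np.array([-0.20, 1.45, 0.0]),
    "l_elbow":    np.array([-0.48, 1.45, 0.0]),
    "l_wrist":    np.array([-0.74, 1.45, 0.0]),
}

def length(a, b):
    """Distance between two markers in the calibration frame."""
    return float(np.linalg.norm(tpose[a] - tpose[b]))

# The skeleton the software builds stores these as fixed bone lengths.
proportions = {
    "l_upper_arm": length("l_shoulder", "l_elbow"),   # ~0.28 m
    "l_forearm":   length("l_elbow", "l_wrist"),      # ~0.26 m
}
```

Because bone lengths are treated as constant after calibration, a sloppy T-pose (bent elbows, shrugged shoulders) bakes errors into every subsequent frame, which is why systems insist on a careful neutral pose.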

During the actual recording, optical systems typically sample at 100 to 240 frames per second for body tracking (higher for fast sports or stunts), while inertial sensors commonly record at 60 to 100 Hz. At the start of each take, it helps to clap or make a distinct gesture: the resulting spike acts as a sync point that makes specific moments easy to find later.

For the best raw data, keep takes relatively short (30 seconds to 2 minutes per action) rather than recording long continuous sessions. This limits drift in inertial systems and keeps file sizes manageable for optical systems that generate massive point-cloud data.

Facial Motion Capture

Body and face are typically captured separately using different methods. For facial performance, a common approach uses a head-mounted camera (HMC): a small camera on a boom arm attached to a headset, pointed directly at the performer’s face. This lets the actor move freely while the camera maintains a consistent view of their expressions.

Apple’s ARKit has become a popular entry point for facial capture. It uses the iPhone’s depth-sensing camera to track 52 standard facial shapes in real time: individual eyebrow raises, lip curls, jaw movements, cheek puffs, and eye blinks. The system outputs a stream of blend shape weights, essentially a set of numbers describing how much each facial shape is activated at any given moment. You author matching shapes on your 3D character, connect the weight channels, and the character’s face follows the actor’s performance.
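Driving a character from that weight stream is a weighted sum of vertex deltas. The tiny four-vertex mesh and the delta values below are invented for illustration, though "jawOpen" and "browInnerUp" are two of ARKit's actual 52 shape names:

```python
import numpy as np

# Stand-in mesh: 4 vertices in their neutral positions (all zeros here).
neutral = np.zeros((4, 3))

# Per-shape vertex deltas: a fully activated shape is neutral + delta.
deltas = {
    "jawOpen":     np.array([[0, 0, 0], [0, -1.0, 0], [0, 0, 0], [0, 0, 0]]),
    "browInnerUp": np.array([[0, 0.5, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]),
}

def apply_weights(neutral, deltas, weights):
    """Deform the neutral mesh by the weighted sum of shape deltas,
    which is how an incoming stream of weights drives a character face."""
    mesh = neutral.copy()
    for name, w in weights.items():
        mesh += w * deltas[name]
    return mesh

# One frame of incoming weights: jaw half open, brows slightly raised.
frame = apply_weights(neutral, deltas, {"jawOpen": 0.5, "browInnerUp": 0.2})
```

A real pipeline evaluates this sum per frame for all 52 shapes across thousands of vertices, usually on the GPU, but the math is exactly this linear combination.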

The catch is that ARKit’s internal solver expects very specific relationships between those 52 shapes. If your character’s facial rig doesn’t match how ARKit combines shapes (for instance, two shapes that partially cancel each other out), the resulting animation can break down in subtle but noticeable ways. Professional studios often scan the actor’s face in around 100 distinct expressions and then use a mathematical solver to find the best-fit set of blend shapes that match ARKit’s expectations. This involves solving a large system of equations across thousands of facial mesh points to produce shapes that respond correctly to the weight data.
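That best-fit solve is, at its core, a linear least-squares problem: given scanned expressions and the weights the solver assigns to each scan, find the shape deltas that reproduce the scans. The sketch below shrinks the mesh to a toy size and uses synthetic data; the dimensions and variable names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
# 100 scanned expressions, 52 shapes, 300 mesh coordinates (tiny for the sketch;
# a real face has thousands of vertices, i.e. tens of thousands of coordinates).
n_scans, n_shapes, n_coords = 100, 52, 300

# W: the solver-estimated activation of each shape in each scan.
W = rng.uniform(0.0, 1.0, size=(n_scans, n_shapes))

# B_true: the per-shape vertex deltas we want to recover.
B_true = rng.normal(size=(n_shapes, n_coords))

# E: each scanned expression's offset from the neutral mesh.
E = W @ B_true

# Solve W @ B = E for B in the least-squares sense:
# 100 equations in 52 unknowns, independently for every coordinate.
B_fit, residuals, rank, _ = np.linalg.lstsq(W, E, rcond=None)
```

With noiseless synthetic data the fit recovers the shapes exactly; real scans add noise and regularization terms, but the overdetermined structure (more scans than shapes) is the same.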

Clean and Retarget the Data

Raw mocap data is never ready to use straight out of the recording. Every capture session produces artifacts that need fixing: gaps where markers were briefly hidden from cameras, jittery frames from sensor noise, or foot-sliding where the character’s feet drift across the ground when they should be planted.

The cleanup process typically happens in specialized software. You’ll fill gaps (interpolating missing marker positions from surrounding frames), filter noise (smoothing out high-frequency jitter without losing the natural feel of the movement), and correct ground contact. Most mocap software includes semi-automated tools for these tasks, but complex performances still require manual attention, especially around fast hand movements and close body contact where markers tend to get lost.
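The two most mechanical cleanup steps, gap filling and noise smoothing, can be sketched on a single marker coordinate track. This is a minimal version: production tools fit splines or rigid-body constraints for gaps and typically use a low-pass Butterworth filter rather than the moving average shown here.

```python
import numpy as np

def fill_gaps(track):
    """Linearly interpolate NaN gaps (occluded frames) in one
    marker coordinate track."""
    t = np.arange(len(track))
    ok = ~np.isnan(track)
    return np.interp(t, t[ok], track[ok])

def smooth(track, win=5):
    """Moving-average filter to knock down high-frequency jitter;
    larger windows smooth more but soften fast motion."""
    kernel = np.ones(win) / win
    return np.convolve(track, kernel, mode="same")

# A marker's x-coordinate with a two-frame occlusion gap:
x = np.array([0.0, 0.1, np.nan, np.nan, 0.4, 0.5])
filled = fill_gaps(x)   # the gap becomes 0.2, 0.3
cleaned = smooth(filled)
```

Linear interpolation is fine for short gaps on slow movement; the longer the gap and the faster the motion, the more the straight-line assumption breaks down, which is exactly when manual cleanup takes over.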

Once the data is clean, you retarget it to your 3D character. Retargeting maps the recorded motion onto a different skeleton, adjusting for differences in proportions. Your actor might be 5’10” with long arms, while your character is a 4-foot robot with stubby limbs. The retargeting software reads the motion data, extracts the movement patterns into a standardized skeleton structure, then applies those patterns to your target character’s rig. Common motion data formats include C3D (raw 3D point data from optical systems) and BVH (a skeleton hierarchy with per-joint rotation channels). Retargeting tools rely on a shared reference pose, usually a T-pose, to define how the recorded skeleton’s joints map onto your character’s joints.
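The simplest proportion adjustment a retargeter makes is scaling the root translation curve, since joint rotations transfer between skeletons as-is. This is a deliberately naive sketch with made-up numbers; real retargeters also run IK passes to preserve foot plants and hand contacts:

```python
def retarget_root(root_positions, source_hip_height, target_hip_height):
    """Scale the root (hip) translation curve by the hip-height ratio so a
    short character doesn't float above the ground or a tall one sink into
    it. Joint rotations are copied over unchanged."""
    s = target_hip_height / source_hip_height
    return [(x * s, y * s, z * s) for (x, y, z) in root_positions]

# A 5'10" actor (hips ~0.95 m up) driving a 4-foot robot (hips ~0.62 m):
actor_root = [(0.0, 0.95, 0.0), (0.1, 0.93, 0.0)]
robot_root = retarget_root(actor_root,
                           source_hip_height=0.95,
                           target_hip_height=0.62)
```

Copying rotations while scaling translation is why retargeted motion keeps its character even across wildly different proportions, and also why contact-heavy moments still need the hand-polish pass described below.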

The retargeted animation often needs a final polish pass by hand. An animator will adjust moments where the proportional differences create awkward poses, refine hand positions for interactions with props or the environment, and blend between different takes to assemble a complete sequence.

Picking the Right Setup for Your Budget

If you’re a solo creator exploring mocap for the first time, a markerless phone-based or webcam-based solution costs nothing beyond the software (some options are free) and lets you experiment immediately. The quality won’t match a dedicated system, but it’s enough to block out animations and learn the workflow.

For indie game developers and small animation studios producing regular content, an inertial suit in the $750 to $3,500 range hits the sweet spot. The Perception Neuron 3 is the most affordable full-body option. The Rokoko Smartsuit Pro II adds finger tracking with optional gloves and has built-in software tools for cleanup and retargeting that flatten the learning curve.

Professional studios producing cinematic content, visual effects, or clinical biomechanics research invest in optical systems from Vicon or Qualisys at $50,000 and up. These deliver the precision needed for close-up character animation and scientific measurement, but they also require trained technicians, a dedicated capture volume, and significantly more time in setup and post-processing. The quality ceiling is higher, but so is every other cost along the way.