What Is SRTP? Encryption, Key Management & Uses

SRTP stands for Secure Real-time Transport Protocol. It’s a security extension of RTP (Real-time Transport Protocol) that encrypts and protects voice, video, and other media streams as they travel across a network. Defined in RFC 3711 by the Internet Engineering Task Force (IETF), SRTP provides three core protections: confidentiality (encryption so no one can eavesdrop), message authentication (proof that data hasn’t been tampered with), and replay protection (preventing attackers from re-sending captured packets to disrupt a call).

How SRTP Relates to RTP

RTP is the standard protocol that carries audio and video data in real time over the internet. Every time you make a VoIP call, join a video conference, or stream live media, RTP is typically handling the delivery of those media packets. The problem is that plain RTP sends everything unencrypted. Anyone who intercepts those packets on the network can listen to your calls or watch your video feeds.

SRTP is officially defined as a “profile” of RTP, meaning it works on top of the existing RTP framework rather than replacing it. It adds a layer of encryption and authentication to each media packet without fundamentally changing how the packets are structured or delivered. This design keeps SRTP lightweight, which matters for real-time communication where even small delays are noticeable. It also secures RTCP (Real-time Transport Control Protocol), the companion protocol that RTP uses to exchange quality statistics and session control information.

What SRTP Protects Against

Without SRTP, several attacks become straightforward for anyone with access to the network path between callers:

  • Eavesdropping. Unencrypted RTP packets can be captured and reconstructed into listenable audio or viewable video using freely available tools. SRTP encryption makes intercepted packets meaningless without the correct keys.
  • Tampering. An attacker could modify media packets in transit, injecting noise, false audio, or corrupted data. SRTP’s authentication tag on each packet lets the receiver verify that nothing was altered.
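  • Replay attacks. An attacker could record legitimate packets and re-send them later to confuse or disrupt a session. SRTP tracks each packet’s index (the RTP sequence number extended by a rollover counter) against a sliding window of recently received packets, and rejects any packet it has already seen.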
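
The replay check can be illustrated with a small sliding-window sketch. This is a simplified illustration, not a full SRTP implementation: the class name, method names, and the window size of 64 are assumptions chosen for the example, and a real receiver would only record an index after the packet's authentication tag has been verified.

```python
class ReplayWindow:
    """Illustrative sliding-window replay check over SRTP packet indices."""

    def __init__(self, size: int = 64):
        self.size = size      # number of recent indices remembered (assumed value)
        self.highest = -1     # highest index accepted so far
        self.bitmap = 0       # bit k set => index (highest - k) was already received

    def check(self, index: int) -> bool:
        """Return True if the index is new and inside (or ahead of) the window."""
        if index > self.highest:
            return True
        offset = self.highest - index
        if offset >= self.size:
            return False      # too old to judge, so reject
        return not (self.bitmap >> offset) & 1

    def update(self, index: int) -> None:
        """Record an accepted index; call only after authentication succeeds."""
        if index > self.highest:
            shift = index - self.highest
            mask = (1 << self.size) - 1
            self.bitmap = ((self.bitmap << shift) | 1) & mask
            self.highest = index
        else:
            self.bitmap |= 1 << (self.highest - index)


window = ReplayWindow()
for idx in (0, 1, 2, 5):
    if window.check(idx):
        window.update(idx)

print(window.check(2))   # a replayed index is rejected
print(window.check(3))   # a late but unseen index is still accepted
```

Packets that arrive out of order are still accepted as long as they fall inside the window and have not been seen before, which suits UDP delivery where reordering is normal.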

How the Encryption Works

SRTP uses the Advanced Encryption Standard (AES) as its default cipher. The specification defines two modes of running AES. The primary mode, and the one used by default, is AES in Segmented Integer Counter Mode (AES-CM). In this mode, AES generates a keystream that is XORed with the media payload, effectively scrambling it. The counter used to generate this keystream is built from a combination of the session’s salting key, the source identifier (SSRC) of the sender, and the packet index, ensuring every single packet produces a unique encrypted output even if the underlying audio or video data repeats.
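
As a rough sketch of how that counter is assembled, the snippet below follows the layout given in RFC 3711 for AES-CM: the 112-bit session salt, the 32-bit SSRC, and the 48-bit packet index are shifted into position and XORed into one 128-bit initial counter block. The function names are assumptions made for this example, and the AES encryption of each counter block is left out because Python's standard library has no AES primitive.

```python
def packet_index(roc: int, seq: int) -> int:
    """48-bit packet index: 32-bit rollover counter followed by the 16-bit sequence number."""
    return (roc << 16) | seq


def aes_cm_iv(session_salt: bytes, ssrc: int, index: int) -> bytes:
    """Build the 128-bit initial counter block for one packet (AES-CM layout)."""
    if len(session_salt) != 14:
        raise ValueError("session salt must be 112 bits (14 bytes)")
    k_s = int.from_bytes(session_salt, "big")
    iv = (k_s << 16) ^ (ssrc << 64) ^ (index << 16)
    return iv.to_bytes(16, "big")


def counter_blocks(iv: bytes, count: int) -> list[bytes]:
    """Counter values IV, IV+1, ... (mod 2^128); each would be AES-encrypted to form keystream."""
    base = int.from_bytes(iv, "big")
    return [((base + j) % (1 << 128)).to_bytes(16, "big") for j in range(count)]


salt = bytes.fromhex("f0f1f2f3f4f5f6f7f8f9fafbfcfd")  # example salt value
iv_a = aes_cm_iv(salt, ssrc=0x01020304, index=packet_index(roc=0, seq=1))
iv_b = aes_cm_iv(salt, ssrc=0x01020304, index=packet_index(roc=0, seq=2))
print(iv_a.hex())
print(iv_a != iv_b)  # consecutive packets start from different counters
```

The low 16 bits of the block are left at zero so the counter can advance within a packet without colliding with the next packet's starting value.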

The second mode defined in the original specification is called f8-mode, a variant of Output Feedback Mode with a more complex initialization step. It was designed for compatibility with certain mobile telephony standards. In practice, AES Counter Mode is far more widely deployed. More recent extensions to the protocol have also introduced AES-GCM (Galois/Counter Mode), which combines encryption and authentication into a single efficient operation.

For message authentication, SRTP appends a short authentication tag to each packet, computed with HMAC-SHA1 by default. The receiver recalculates this tag using the shared authentication key and compares it to the one attached to the packet. If they don’t match, the packet is discarded. This check happens on every single packet, so tampered or forged data gets caught immediately.
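
The tag check can be sketched with Python's standard hmac module. This assumes the default transform from RFC 3711, where HMAC-SHA1 is computed over the authenticated portion of the packet followed by the 32-bit rollover counter and then truncated (to 10 bytes here). The function names and the sample key and packet bytes are placeholders for illustration only.

```python
import hashlib
import hmac


def srtp_auth_tag(auth_key: bytes, authenticated_portion: bytes,
                  roc: int, tag_len: int = 10) -> bytes:
    """HMAC-SHA1 over (header + encrypted payload) || ROC, truncated to tag_len bytes."""
    message = authenticated_portion + roc.to_bytes(4, "big")
    return hmac.new(auth_key, message, hashlib.sha1).digest()[:tag_len]


def verify_tag(auth_key: bytes, authenticated_portion: bytes,
               roc: int, received_tag: bytes) -> bool:
    """Recompute the tag and compare in constant time; False means drop the packet."""
    expected = srtp_auth_tag(auth_key, authenticated_portion, roc, len(received_tag))
    return hmac.compare_digest(expected, received_tag)


key = bytes(range(20))                              # placeholder 160-bit authentication key
packet_body = b"\x80\x00\x00\x01" + b"\x00" * 8 + b"encrypted-payload"
tag = srtp_auth_tag(key, packet_body, roc=0)

print(verify_tag(key, packet_body, 0, tag))             # untouched packet passes
print(verify_tag(key, packet_body + b"\x00", 0, tag))   # modified packet fails
```

Using hmac.compare_digest rather than == avoids leaking, through timing, how many leading bytes of a forged tag happened to be correct.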

Key Management

SRTP itself handles encryption and authentication, but it deliberately does not define how the encryption keys are exchanged between participants. That job falls to a separate key management protocol. The most common approaches include:

  • DTLS-SRTP. Used in WebRTC, this method performs a cryptographic handshake directly between the two endpoints before media flows, establishing shared keys without relying on a trusted intermediary.
  • SDES. The encryption keys are exchanged within the signaling messages (like SIP) that set up the call. This is simpler but depends entirely on the signaling channel being secure, typically via TLS.
  • ZRTP. A peer-to-peer key agreement protocol that doesn’t require any pre-shared secrets or public key infrastructure. It uses a short verbal confirmation between callers to verify the connection hasn’t been intercepted.
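
To make the SDES approach concrete, the sketch below parses the kind of a=crypto line that SDES places in the SDP body of a signaling message. It assumes the AES_CM_128_HMAC_SHA1_80 suite, where the inline value is the base64 encoding of a 16-byte master key followed by a 14-byte master salt. The function name is hypothetical, the key material is dummy data generated for the example, and optional lifetime, MKI, and session parameters are ignored.

```python
import base64


def parse_sdes_crypto(line: str) -> dict:
    """Parse 'a=crypto:<tag> <suite> inline:<base64 key||salt>[|...]' (simplified)."""
    prefix = "a=crypto:"
    if not line.startswith(prefix):
        raise ValueError("not a crypto attribute")
    tag, suite, key_params = line[len(prefix):].split()[:3]
    method, _, key_info = key_params.partition(":")
    if method != "inline":
        raise ValueError("unsupported key method")
    raw = base64.b64decode(key_info.split("|")[0])   # drop optional lifetime/MKI parts
    if len(raw) != 30:
        raise ValueError("expected 16-byte key plus 14-byte salt")
    return {
        "tag": int(tag),
        "suite": suite,
        "master_key": raw[:16],
        "master_salt": raw[16:],
    }


# Dummy key material for illustration only; never reuse a fixed key like this.
dummy = base64.b64encode(bytes(range(30))).decode("ascii")
offer_line = f"a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:{dummy}"

params = parse_sdes_crypto(offer_line)
print(params["suite"], len(params["master_key"]), len(params["master_salt"]))
```

Because the master key sits in the signaling message in recoverable form, anyone who can read that message can decrypt the media, which is why SDES is only as strong as the protection on the signaling channel.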

SRTP supports a concept called the Master Key Identifier (MKI), an optional field in each packet that tells the receiver which master key was used. This allows keys to be rotated during a session without interrupting the media flow. From a single master key, SRTP derives separate session keys for encryption and authentication, so compromising one function doesn’t automatically compromise the other.
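
As an illustration of how a receiver might use the MKI, the sketch below splits the trailing MKI and authentication tag off an SRTP packet and looks up the matching master key. The MKI and tag lengths are negotiated out of band and are not signaled in the packet itself, so the 4-byte MKI and 10-byte tag here are assumed values. The function name and the key table are hypothetical.

```python
def split_srtp_trailer(packet: bytes, mki_len: int = 4, tag_len: int = 10):
    """Return (authenticated_portion, mki, auth_tag) for a packet laid out as
    RTP header | encrypted payload | MKI | authentication tag."""
    trailer = mki_len + tag_len
    if len(packet) < 12 + trailer:          # 12 bytes is the minimum RTP header
        raise ValueError("packet too short")
    body_end = len(packet) - trailer
    mki = packet[body_end:body_end + mki_len]
    tag = packet[body_end + mki_len:]
    return packet[:body_end], mki, tag


# Hypothetical table of master keys, indexed by MKI value (dummy key bytes).
master_keys = {
    b"\x00\x00\x00\x01": b"A" * 16,
    b"\x00\x00\x00\x02": b"B" * 16,   # key rolled in mid-session
}

rtp_header = b"\x80\x00\x00\x07" + b"\x00" * 8
sample = rtp_header + b"ciphertext" + b"\x00\x00\x00\x02" + b"T" * 10

body, mki, tag = split_srtp_trailer(sample)
print(master_keys[mki] == b"B" * 16)   # the packet selects the second master key
```

Since every packet names its own master key, the sender can switch to a new key at any packet boundary and the receiver follows without a pause in the media.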

Where SRTP Is Used Today

SRTP is effectively the standard for securing real-time media on the internet. WebRTC, the technology built into every major browser for voice and video calls, mandates SRTP for all media streams. When you use Google Meet, Facebook Messenger video calls, or any browser-based communication tool, SRTP is encrypting your audio and video.

VoIP phone systems, both enterprise and consumer, widely support SRTP. Platforms like Zoom, Microsoft Teams, Cisco Webex, and most SIP-based business phone systems use SRTP to protect calls. Many regulatory environments, including healthcare (HIPAA) and finance, effectively require media encryption for voice communications, making SRTP a compliance necessity rather than an optional feature.

Video surveillance systems that stream over IP networks also use SRTP when security of the video feed matters. Any application that sends real-time media using RTP can layer SRTP on top for protection.

SRTP vs. Other Encryption Methods

You might wonder why real-time media needs its own encryption protocol instead of just using something like TLS, which secures web traffic. The answer comes down to how real-time media behaves. Audio and video are sent as a continuous stream of small UDP packets, and losing a few packets is acceptable (your ear won’t notice a single missing audio frame), but delay is not. TLS operates over TCP, which guarantees delivery by retransmitting lost packets. That retransmission adds latency that would make a voice call sound choppy or a video feed lag.

SRTP is designed specifically for this environment. It encrypts each packet independently, so a lost packet doesn’t affect the decryption of subsequent ones. It adds minimal overhead to each packet: with the default HMAC-SHA1 transform, a 4-byte or 10-byte authentication tag plus the optional MKI field. And it operates over UDP, preserving the low-latency characteristics that real-time communication requires.

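A VPN tunnel can also encrypt media traffic, but it encrypts everything at the network level without understanding what’s inside, and the protection ends where the tunnel ends. SRTP provides encryption at the application level, meaning it protects the media between the SRTP endpoints regardless of the network path, and it can be verified and managed independently of other traffic. Note that a media server or gateway that terminates SRTP is itself an endpoint and can see the decrypted media.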