Interactive Live Streaming

Interactive Live Streaming Overview:

Interactive live streaming is the real-time transmission of audio and video over the internet in a way that lets content creators engage directly with their audience. One way to add interactivity is to integrate a real-time messaging SDK, such as SuperChat, which increases user engagement and supports use cases such as stream monetization. This section, however, focuses on interactivity within the audio and video streams themselves rather than on text-based chat.

Let's first go over a few terms related to media operations.

Audio Encoding:

  • Analog signals, representing continuous waveforms, are converted into digital signals consisting of discrete 0s and 1s.

  • This conversion is typically performed by audio codecs such as MP3, AAC, Opus, and Lyra.

  • The goal is to represent the audio in a digital format suitable for storage, transmission, and processing by digital devices. Encoded audio is usually described by its bitrate rather than by frame dimensions; typical encoding bitrates range from 96 to 320 kbps.
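To make the bitrate figures concrete, here is a minimal sketch of the arithmetic that links bitrate to encoded size. It assumes constant-bitrate audio and ignores container overhead; the function name and the one-minute duration are illustrative choices, not part of any SDK.

```python
def encoded_audio_size_bytes(bitrate_kbps: float, duration_seconds: float) -> float:
    """Approximate payload size of constant-bitrate audio (container overhead ignored)."""
    bits = bitrate_kbps * 1000 * duration_seconds  # kbps counts 1000 bits per kilobit
    return bits / 8


# One minute of audio at the low and high ends of the typical range.
for kbps in (96, 320):
    size_mb = encoded_audio_size_bytes(kbps, 60) / 1_000_000
    print(f"{kbps} kbps for 60 s is about {size_mb:.2f} MB")
```

Variable-bitrate encoders will deviate from this estimate, so treat it as a rough bound rather than an exact size.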

Video Encoding:

  • Video encoding involves compressing video frames into a more compact form to facilitate efficient transmission or storage.

  • Common video codecs include H.264, H.265 (also known as HEVC), VP9, and AV1.

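  • The compressed data can be transmitted or stored efficiently, but it must be decoded before the frames can be displayed. Because these codecs are usually run in lossy modes, the decoded frames closely approximate the originals rather than matching them exactly.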


Remuxing:

  • Remuxing is the process of changing the container format of a multimedia file without altering the actual media content.

  • This is different from transcoding, which involves changing the encoding format itself.

  • Remuxing is often done to switch between different container formats without re-encoding the audio or video data.
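As an illustration of remuxing, the sketch below builds an ffmpeg command that copies the existing streams into a new container without re-encoding. It assumes the ffmpeg binary is installed and that the codecs in the source file are supported by the target container; the file names are hypothetical.

```python
import subprocess
from typing import List


def build_remux_command(source: str, target: str) -> List[str]:
    # "-c copy" tells ffmpeg to copy every stream as-is instead of re-encoding;
    # the output container is inferred from the target file extension.
    return ["ffmpeg", "-i", source, "-c", "copy", target]


def remux(source: str, target: str) -> None:
    # Requires the ffmpeg binary on PATH; raises CalledProcessError on failure.
    subprocess.run(build_remux_command(source, target), check=True)


# Hypothetical file names, shown only to display the resulting command line.
print(" ".join(build_remux_command("input.flv", "output.mp4")))
```

Transcoding would instead name a target codec in place of "copy", which is slower because every frame is decoded and encoded again.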


Frame:

  • In the context of video, a frame is a 2D matrix of pixels that represents a single image in a video sequence.

  • Video codecs compress and decompress these frames to reduce file sizes and facilitate efficient transmission and storage.
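A rough calculation shows why frames need compression. The sketch below assumes uncompressed 24-bit RGB pixels (3 bytes each) and a frame rate of 30 frames per second; both values are assumptions chosen for illustration.

```python
def raw_frame_bytes(width: int, height: int, bytes_per_pixel: int = 3) -> int:
    """Size of one uncompressed frame, treating it as a width x height pixel matrix."""
    return width * height * bytes_per_pixel


def raw_video_bitrate_mbps(width: int, height: int, fps: float,
                           bytes_per_pixel: int = 3) -> float:
    """Bitrate needed to send uncompressed frames at the given frame rate."""
    return raw_frame_bytes(width, height, bytes_per_pixel) * 8 * fps / 1_000_000


frame_size = raw_frame_bytes(240, 180)
bitrate = raw_video_bitrate_mbps(240, 180, fps=30)
print(f"One 240 x 180 frame: {frame_size} bytes")
print(f"Uncompressed at 30 fps: {bitrate:.3f} Mbps")
```

Even at this small resolution, the uncompressed stream is far larger than what a video codec would typically produce for the same content.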

Technical Implementation:

Multiple streams, from both live sources and local files, are published over WebRTC. GStreamer, a multimedia framework, stitches these streams together into a single output stream by processing and merging the video and audio sources in real time.

Sariska Implementation (via Conferencing SDK):

We start a conference that multiple participants join, along with the recorder. Participants send their audio and video streams to GStreamer through a videobridge. GStreamer, acting as an interceptor and transcoder, re-encodes these streams: multiple video streams are superimposed within a larger frame, producing a single output stream.

Because the server (or recorder) joins as a conference member, it can contribute its own audio and video tracks. This enables dynamic server-side ad insertion and personalization within the live stream.

Publishing occurs over WebRTC, with latencies typically between 100 and 300 milliseconds, so host interaction is nearly instantaneous. This low latency gives the host a responsive, real-time experience in interactive scenarios.

Let's understand this with a basic example:

Consider a simple conference call with two users. Each user publishes a stream with a resolution of 240 x 180 pixels. The streams are decoded and then re-encoded into a larger frame: the two video streams are stacked vertically into a single video stream with dimensions of 240 x 360 pixels, and the audio streams are combined into a single audio stream.
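The layout arithmetic in this example can be sketched as follows. The code computes the canvas size and tile offsets for a vertical stack and then builds a gst-launch-1.0 style description that uses GStreamer's compositor element with test sources. This is an illustrative assumption, not Sariska's actual pipeline; the helper names are hypothetical and audio mixing is left out.

```python
from typing import List, Tuple


def stack_vertically(tile_width: int, tile_height: int,
                     tile_count: int) -> Tuple[Tuple[int, int], List[Tuple[int, int]]]:
    """Return the canvas size and the top-left (x, y) offset of each tile."""
    offsets = [(0, index * tile_height) for index in range(tile_count)]
    return (tile_width, tile_height * tile_count), offsets


def compositor_description(tile_width: int, tile_height: int, tile_count: int) -> str:
    """Build a gst-launch-1.0 description that stacks test sources in one frame."""
    (canvas_w, canvas_h), offsets = stack_vertically(tile_width, tile_height, tile_count)
    # Each compositor sink pad is positioned through its xpos/ypos pad properties.
    pads = " ".join(
        f"sink_{index}::xpos={x} sink_{index}::ypos={y}"
        for index, (x, y) in enumerate(offsets)
    )
    # videotestsrc stands in for the decoded participant streams.
    sources = " ".join(
        f"videotestsrc ! video/x-raw,width={tile_width},height={tile_height} ! mix."
        for _ in range(tile_count)
    )
    return (
        f"compositor name=mix {pads} "
        f"! video/x-raw,width={canvas_w},height={canvas_h} "
        f"! videoconvert ! autovideosink {sources}"
    )


canvas, offsets = stack_vertically(240, 180, 2)
print("canvas:", canvas, "offsets:", offsets)
print("gst-launch-1.0", compositor_description(240, 180, 2))
```

In a real deployment the test sources would be replaced by the decoded participant streams and the sink by an encoder and muxer.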

The next phase encapsulates the audio and video streams in a digital container format for transmission and storage. This step is distinct from the preceding encoding step: each audio and video stream has already been encoded before it is placed in the container. Common container formats for this purpose include TS (Transport Stream) and FLV (Flash Video). RTMP (Real-Time Messaging Protocol) uses FLV as its container format, while HLS (HTTP Live Streaming) typically uses TS. Both protocols are designed for streaming media over the internet. RTMP delivers content in a continuous flow, while HLS divides the stream into smaller segments for transmission via HTTP.
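To show what "dividing the stream into segments" looks like in practice, here is a minimal sketch that writes an HLS media playlist listing TS segments. The segment file names and durations are hypothetical, and a production packager would also handle details such as sliding windows and discontinuities.

```python
import math
from typing import List


def media_playlist(segment_durations: List[float], first_sequence: int = 0) -> str:
    """Build a live HLS media playlist for consecutive TS segments."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        # The target duration must be at least as long as the longest segment.
        f"#EXT-X-TARGETDURATION:{math.ceil(max(segment_durations))}",
        f"#EXT-X-MEDIA-SEQUENCE:{first_sequence}",
    ]
    for index, duration in enumerate(segment_durations, start=first_sequence):
        lines.append(f"#EXTINF:{duration:.3f},")
        lines.append(f"segment{index}.ts")  # hypothetical segment naming scheme
    # No #EXT-X-ENDLIST tag is written, so players treat the stream as live.
    return "\n".join(lines) + "\n"


print(media_playlist([6.0, 6.0, 4.5]))
```

A player fetches this playlist over HTTP, downloads the listed segments, and polls the playlist again for new entries, which is why HLS latency is usually higher than that of a continuous RTMP or WebRTC flow.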

For more details, please refer to the Guide to interactive live streaming section.
