Sunday, June 10, 2012

RTP - Lip Sync Part 2, The Overview

In part 1, I discussed the RTCP SR. In this part, I am going to give an overview of lip synchronization on the receiver side. A streaming RTP server usually sends you the following:

1. RTCP SR packets
2. RTP video packets
3. RTP audio packets

The server sends this data in a synchronized way, meaning that the video and audio packets are each sent at a constant inter-packet interval. In a perfect situation, the receiver could simply receive these evenly spaced packets and play them out directly. In practice, this is impossible. There are many factors that can delay these evenly spaced packets. Some examples are:

1. Network jitter - this causes the packets to arrive with variable timing
2. Packet errors - error correction/missing packets
3. System delay - decompression/queuing/system latency

In general, you need to consider these when calculating the playout timing. Before I talk more about playout timing, let's list the operations needed when RTP packets are received at the network interface:

1. The NIC receives the packets
2. The raw packet data is read into the program
3. RTP packet processing, such as depacketizing and grouping the data into video/audio frames
4. Calculate the playout time for each frame
5. Insert the frames into a playout buffer, usually a linked list sorted by playout time
6. Calculate the playout delay, taking into account factors such as system jitter, decoding, etc.
7. Play out the frame

For the first 3 steps, if you already have an RTP client system, you shouldn't have any issues. The only thing you need to know is how to group RTP payloads into frames. Usually, an audio frame is self-contained and fits in a single packet. A video frame is usually fragmented across multiple packets. Packets that belong to the same frame have the same RTP timestamp, and the last packet has the marker bit set. So all you need to do is check the timestamp and group together all packets that have the same RTP timestamp. Make sure their sequence numbers are in ascending order and that the last packet has the marker bit set.
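The grouping described above can be sketched roughly as follows. This is a minimal illustration, not a complete depacketizer: `RtpPacket` and `FrameAssembler` are hypothetical names, the packets are assumed to be already parsed, and the marker packet is assumed to arrive after the other packets of its frame. A frame with a gap in its sequence numbers is simply dropped here. Note that a lost first packet cannot be detected this way without payload-specific knowledge.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RtpPacket:
    """Hypothetical, already-parsed view of an RTP packet."""
    seq: int          # 16-bit sequence number
    timestamp: int    # 32-bit RTP timestamp
    marker: bool      # marker bit from the RTP header
    payload: bytes


class FrameAssembler:
    """Groups packets by RTP timestamp and emits a frame on the marker bit."""

    def __init__(self) -> None:
        self._pending: Dict[int, List[RtpPacket]] = {}

    def push(self, pkt: RtpPacket) -> Optional[bytes]:
        group = self._pending.setdefault(pkt.timestamp, [])
        group.append(pkt)
        if not pkt.marker:
            return None  # frame not finished yet

        del self._pending[pkt.timestamp]

        def offset(p: RtpPacket) -> int:
            # Signed 16-bit distance from the marker packet, so that
            # sequence number wraparound (65535 -> 0) sorts correctly.
            return ((p.seq - pkt.seq + 0x8000) & 0xFFFF) - 0x8000

        group.sort(key=offset)
        offsets = [offset(p) for p in group]
        # Sequence numbers must be consecutive and end at the marker packet.
        if offsets != list(range(-(len(group) - 1), 1)):
            return None  # gap or duplicate: drop the frame in this sketch
        return b"".join(p.payload for p in group)
```

For audio, where a frame is usually a single packet, a real client may choose to treat every packet as a complete frame instead of waiting for the marker bit.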

For point 4, you need to use the RTCP SR to calculate the playout time of the frames. As a generic example, in a 30 FPS video, the playout times of consecutive frames should be a constant 33 ms apart. How to calculate the playout time will be shown in a later part.
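As a rough preview, and not necessarily the exact method used in the later part, the usual idea is that an SR pairs an NTP wallclock time with an RTP timestamp, so any RTP timestamp in the same stream can be mapped onto the sender's wallclock. The function names below are hypothetical, and the clock rate (for example 90000 Hz for most video payloads) is assumed to be known from the session description.

```python
RTP_TS_MODULUS = 1 << 32  # RTP timestamps are 32-bit and wrap around


def ntp_to_seconds(ntp_msw: int, ntp_lsw: int) -> float:
    """Convert the two 32-bit NTP words from an SR into seconds."""
    return ntp_msw + ntp_lsw / 2**32


def rtp_to_wallclock(rtp_ts: int, sr_rtp_ts: int,
                     sr_ntp_seconds: float, clock_rate: int) -> float:
    """Map an RTP timestamp onto the sender's wallclock using the last SR."""
    # Signed 32-bit difference: handles wraparound and frames that are
    # slightly older than the SR.
    diff = (rtp_ts - sr_rtp_ts) % RTP_TS_MODULUS
    if diff >= RTP_TS_MODULUS // 2:
        diff -= RTP_TS_MODULUS
    return sr_ntp_seconds + diff / clock_rate
```

With a 90000 Hz clock, two video frames that are 3000 ticks apart map to wallclock times 1/30 s apart, which matches the 33 ms interval above. Because audio and video each have their own SR, mapping both onto the same sender wallclock is what makes their frames comparable.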

For point 5, you should have a playout buffer to hold the audio and video frames. A playout buffer is usually a linked list sorted by playout time in ascending order.
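A minimal sketch of such a buffer is shown below. Note that it swaps the linked list for a plain Python list kept sorted with `bisect.insort`, which is simpler to write in Python and behaves the same from the outside. `PlayoutBuffer` and its methods are hypothetical names, and playout times are assumed to be in seconds.

```python
import bisect
import itertools
from typing import Any, List, Optional, Tuple


class PlayoutBuffer:
    """Holds frames sorted by playout time in ascending order."""

    def __init__(self) -> None:
        self._entries: List[Tuple[float, int, Any]] = []
        # Tie-breaker so frames with equal playout times keep arrival
        # order and the frame objects themselves are never compared.
        self._counter = itertools.count()

    def insert(self, playout_time: float, frame: Any) -> None:
        entry = (playout_time, next(self._counter), frame)
        bisect.insort(self._entries, entry)

    def pop_due(self, now: float) -> Optional[Any]:
        """Return the earliest frame if its playout time has been reached."""
        if self._entries and self._entries[0][0] <= now:
            return self._entries.pop(0)[2]
        return None

    def __len__(self) -> int:
        return len(self._entries)
```

Frames can arrive in any order; `pop_due` always hands back the one with the earliest playout time first.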

For point 6, although you have a playout time, that only tells you which frame should be played out at what specific time. You also need to know the presentation timing of each frame relative to the previous one. For example, if you have frame A and frame B and frame A has been served, you want to know how long to wait before serving frame B. That wait is the playout delay. To determine the playout delay, you have to take decoding, system jitter, etc. into consideration.
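The simplest form of this calculation can be sketched as below. It assumes the decoding time is measured by the application, and that system jitter is covered by a fixed safety margin; both the margin and the function name are illustrative assumptions, and a real player would likely measure and adapt them.

```python
def playout_delay(current_playout: float, next_playout: float,
                  decode_time: float, jitter_margin: float = 0.0) -> float:
    """Seconds to wait after serving the current frame before the next one.

    All arguments are in seconds. jitter_margin is an assumed allowance
    for scheduling inaccuracy (for example, a sleep that overshoots).
    """
    interval = next_playout - current_playout
    # Never return a negative wait: if decoding took longer than the
    # frame interval, the next frame is already late and is served at once.
    return max(0.0, interval - decode_time - jitter_margin)
```

For a 30 FPS stream with an 8 ms decode, this gives a wait of about 25 ms between frames; if decoding takes longer than the frame interval, the wait drops to zero.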

For point 7, play out the frame. This may sound simple, but sometimes you need to take graphics rendering, audio system buffering, etc. into consideration. Those variables may affect how accurately you play out the frame.


  1. Thanks man for the detailed explanation of the feature. I got the overall scenario, but I think the playout delay calculation in point 6 is a bit of a tedious task; there are many factors that can affect this timing, and there might not be a single way to handle it.

  2. Yup, you are right. Many external factors will affect it. That is why you see a lot of systems using a few seconds of buffering before playout.

  3. Yeah. Is this lip sync feature handled by popular open source media streamers like GStreamer or live555? Any idea where I can find a reference on how these external factors are managed to get proper lip sync in a real-time streaming application?

  4. As far as I know, live555 calculates the presentation time for you.

    I did a search and found this FAQ

    From that, you can use the presentation time as the playback time. Then, in your application, you must at least measure the time taken for decoding. The frame presentation interval minus the decoding time is the pause time before the next frame is served.

    That is the minimum.

    Also, take a look at openRTSP; it will give you more hints on how to handle synchronization.

  5. Thanks Thompson for the in-depth explanation.

