The OpenPhone Project—Internet Telephony for Everyone!

Call your friends and family from your computer—a look at the future or the present? With Linux, the future is now.

Internet Telephony Audio

An Internet phone must carry the audio signal between the parties talking—this is the fundamental requirement. We all know that the Internet can carry high-fidelity audio, since many of us have listened to streaming audio music or radio programs using our Internet connection and PC. Internet telephony is a bit more complicated, though, since it has to do full duplex (two-way) audio in real time, and the human ear is quite sensitive to latency. Too much delay, and the call sounds like it's traveling across a satellite instead of across the Internet!

Bandwidth is also an issue, especially over dial-up lines to the Internet. In the normal PSTN world, the audio on the local loop is digitized into a 64Kbps digital data stream and presented to the phone company CO equipment, which then compresses the data for transport across the phone system backbone. Simple Internet telephony packages use a similar digitization, yielding a bandwidth requirement of 64Kbps in each direction for full-quality voice. That requires 128Kbps of bandwidth, plus extra for signaling and overhead, so this does not work well across a normal dial-up Internet connection that is perhaps 56Kbps.
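The arithmetic behind those figures can be sketched in a few lines of Python. The 8,000 samples per second and 8 bits per sample used here are the usual telephone-quality PCM parameters; the article does not state them, so treat them as assumptions that happen to reproduce the 64Kbps figure.

```python
# Assumed telephone-quality PCM parameters (not stated in the article).
SAMPLE_RATE_HZ = 8000    # samples per second
BITS_PER_SAMPLE = 8      # bits per sample
DIALUP_BPS = 56_000      # nominal dial-up rate mentioned above


def pcm_bitrate_bps(sample_rate_hz=SAMPLE_RATE_HZ, bits_per_sample=BITS_PER_SAMPLE):
    """Bit rate of one direction of uncompressed audio, in bits per second."""
    return sample_rate_hz * bits_per_sample


one_way_bps = pcm_bitrate_bps()      # one direction of the call
full_duplex_bps = 2 * one_way_bps    # both directions, counted as in the text

print("one way:     ", one_way_bps, "bps")
print("full duplex: ", full_duplex_bps, "bps")
print("fits dial-up:", full_duplex_bps <= DIALUP_BPS)
```

Signaling and packet overhead are left out on purpose, so the real requirement is higher still.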

Luckily, voice compression technology has come a long way and works quite well. It is commonplace to obtain 8:1 (or better) compression ratios using today's voice coders (codecs). Modern Internet telephony interface boards provide these codecs as standard features.
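To see what an 8:1 codec buys on the wire, here is a rough estimate that also counts packet headers. It assumes one 20ms audio frame per packet and the standard header sizes for RTP (12 bytes), UDP (8 bytes) and IPv4 (20 bytes); the frame size is an illustrative choice, and link-layer framing and header compression are ignored.

```python
# Standard header sizes in bytes; the 20 ms frame below is an assumption.
RTP_HEADER = 12    # fixed RTP header, no CSRC entries
UDP_HEADER = 8
IPV4_HEADER = 20   # no IP options


def stream_bitrate_bps(codec_bps, frame_ms=20):
    """One-way bit rate including RTP/UDP/IPv4 headers, one frame per packet."""
    payload_bytes = codec_bps * frame_ms / 1000 / 8
    packet_bytes = payload_bytes + RTP_HEADER + UDP_HEADER + IPV4_HEADER
    packets_per_second = 1000 / frame_ms
    return packet_bytes * 8 * packets_per_second


uncompressed_bps = stream_bitrate_bps(64_000)      # 64Kbps voice
compressed_bps = stream_bitrate_bps(64_000 / 8)    # same voice after 8:1 compression

print("uncompressed, one way:", uncompressed_bps, "bps")
print("compressed, one way:  ", compressed_bps, "bps")
```

Under these assumptions the headers outweigh the compressed audio itself, which is why the on-the-wire saving is smaller than the 8:1 codec ratio suggests.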

The normal phone system sets up virtual (or sometimes real) dedicated circuits for the voice traffic to flow across. The Internet is much more chaotic. On the Internet, packets can take different routes from second to second, and may arrive at their destination out of order, late, staggered in time or not at all. Late packets might as well be lost: if a packet is not there on time and ready to play, there will be a gap in the audio. Out-of-order packets are rarely useful either, since a packet that arrives out of order is usually late as well. Even the packets that do arrive rarely do so in a neat and orderly way. Some take a bit longer than average and some a bit less, so the stream is not uniform. This variation in arrival time is called jitter. Two techniques are commonly used to deal with these problems: the Real-time Transport Protocol (RTP) and jitter buffers.
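Jitter can be put into a number. The sketch below keeps a running estimate from the send and arrival times of successive packets, using the 1/16 smoothing factor that the RTP specification uses for its interarrival jitter statistic. The function names and the millisecond units are illustrative choices, not taken from any particular library.

```python
def update_jitter(jitter, prev_transit, transit):
    """One smoothing step: move the estimate 1/16 of the way toward the
    latest change in transit time."""
    d = abs(transit - prev_transit)
    return jitter + (d - jitter) / 16.0


def estimate_jitter(send_times_ms, arrival_times_ms):
    """Running jitter estimate over a sequence of packets, in milliseconds."""
    jitter = 0.0
    prev_transit = None
    for sent, arrived in zip(send_times_ms, arrival_times_ms):
        transit = arrived - sent          # includes any fixed clock offset
        if prev_transit is not None:
            jitter = update_jitter(jitter, prev_transit, transit)
        prev_transit = transit
    return jitter


# Packets sent every 20 ms; the third one is delayed by an extra 16 ms.
print(estimate_jitter([0, 20, 40], [100, 120, 156]))
```

Only differences between transit times enter the estimate, so the sender and receiver clocks do not need to be synchronized.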

The Real-time Transport Protocol

Audio packets need to be played back on time and in the correct order. RTP is a user-level protocol that encapsulates data in packets carrying a sequence number and a timestamp, which give the receiver enough information to play the audio back properly. A companion control protocol, RTCP, lets the end points stay informed about the quality of service they are receiving. The complete protocol is described in RFC 1889. Several implementations of RTP are available under various open-source licenses. See Resources for more information and pointers to those libraries.
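To make the encapsulation concrete, here is a minimal sketch of the fixed 12-byte RTP header built with Python's struct module. It covers only the basic case (no padding, no header extension, no contributing sources), and the helper names are made up for illustration; a real application would normally use one of the existing RTP libraries.

```python
import struct

RTP_VERSION = 2
# Network byte order: two flag bytes, 16-bit sequence number,
# 32-bit timestamp, 32-bit synchronization source (SSRC) identifier.
_HEADER = struct.Struct("!BBHII")


def pack_rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Build a fixed RTP header with P=0, X=0 and CC=0."""
    byte0 = RTP_VERSION << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return _HEADER.pack(byte0, byte1, seq & 0xFFFF,
                        timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def unpack_rtp_header(packet):
    """Read the fixed header fields from the start of a packet."""
    byte0, byte1, seq, timestamp, ssrc = _HEADER.unpack_from(packet)
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Payload type 0 is 8 kHz mu-law PCM in the standard audio/video profile;
# the payload here is 160 dummy bytes (20 ms of audio at that rate).
packet = pack_rtp_header(payload_type=0, seq=1, timestamp=160, ssrc=0x1234) + b"\x00" * 160
print(unpack_rtp_header(packet))
```

The sequence number lets the receiver detect loss and reordering, while the timestamp tells it when each frame should be played.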

Jitter Buffers

The Internet is not predictable, and packets sent at a steady rate do not always arrive at the rate they were sent. They can arrive slightly sooner or later than the average latency would predict. If a packet arrives slightly late, the audio device, which is ready to play the next frame of audio, has nothing to play. This causes a discontinuity that degrades the audio quality. In simple applications, the result is a short silent period that makes the voice sound choppy. In more advanced applications, comfort noise or some form of audio blending is used to mask the gap. This can make the voice sound warbled or as if it's under water. These effects can be heard on cell phones as well as Internet phones, by the way: packet loss and latency are general problems for all digital audio applications. RTP provides a means for the application software to know if packets are out of order, missing or running early or late. The application can then make the appropriate corrections for missing or misordered packets.
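As an illustration of how an application might use RTP sequence numbers to spot missing or misordered packets, here is a small sketch. The class and label names are hypothetical. The only protocol fact it relies on is that RTP sequence numbers are 16 bits wide and wrap around to zero.

```python
def seq_delta(seq, expected):
    """Signed distance from expected to seq on the 16-bit sequence circle."""
    return ((seq - expected + 0x8000) & 0xFFFF) - 0x8000


class SequenceTracker:
    """Labels each arriving packet by comparing it with the next expected number."""

    def __init__(self):
        self.expected = None
        self.missing = 0    # provisional count; late arrivals are not subtracted

    def classify(self, seq):
        if self.expected is None or seq_delta(seq, self.expected) == 0:
            self.expected = (seq + 1) & 0xFFFF
            return "in_order"
        delta = seq_delta(seq, self.expected)
        if delta > 0:
            # One or more packets were skipped: lost, or still on their way.
            self.missing += delta
            self.expected = (seq + 1) & 0xFFFF
            return "gap"
        # Older than expected: a late, reordered or duplicated packet.
        return "late_or_duplicate"


tracker = SequenceTracker()
for seq in [65534, 65535, 0, 2, 1, 3]:
    print(seq, tracker.classify(seq))
```

What to do with each label is up to the application. A phone would typically conceal a gap and discard a packet that shows up after its playout time has passed.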

The best solution for dealing with jitter (short of a perfect transport path, which would eliminate it entirely) is to make sure the audio device never “runs dry”. This requires jitter buffers. These buffers store a small amount of audio at the beginning and then stay a bit ahead of the flow so that there is always an audio frame to play. Several one-way real-time audio/video programs will buffer up many seconds of data before playing any sound, thus ensuring they will always have plenty of data on hand to play. However, every frame buffered adds latency, which is especially relevant to voice calls. If you buffer 90 milliseconds (ms) of data, you add 90ms to the delay between the time the words are spoken and when they are heard. When added to the latency of the Internet itself, this can rapidly become unacceptable. Some people believe 200ms of latency is a good upper limit for what the human ear can tolerate. Given that many Internet locations are 100ms or more (one way) apart, adding 90ms of jitter buffer latency accounts for a significant fraction of the acceptable delay. A delicate balance lies between the need to jitter buffer and the need to reduce latency.
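The trade-off can be made concrete with a deliberately simple fixed-depth jitter buffer. This is a sketch under stated assumptions, not an adaptive design: the 30ms frame duration is an illustrative choice (three frames then give the 90ms used in the example above), frames are identified by a plain increasing frame number, and sequence wraparound is ignored.

```python
FRAME_MS = 30   # assumed duration of one audio frame


class JitterBuffer:
    """Fixed-depth playout buffer keyed by frame number (illustrative only)."""

    def __init__(self, depth_frames=3):
        self.depth_frames = depth_frames
        self.frames = {}
        self.next_to_play = None
        self.primed = False

    def added_latency_ms(self):
        """Delay introduced by waiting for the buffer to fill."""
        return self.depth_frames * FRAME_MS

    def push(self, frame_no, audio):
        """Store an arriving frame; frames that missed their slot are dropped."""
        if self.next_to_play is None:
            self.next_to_play = frame_no
        if frame_no >= self.next_to_play:
            self.frames[frame_no] = audio
        if len(self.frames) >= self.depth_frames:
            self.primed = True

    def pop(self):
        """Called once per frame interval by the audio device.

        Returns None while the buffer is still filling, or when the frame
        that is due has not arrived (a gap the application must conceal).
        """
        if not self.primed:
            return None
        audio = self.frames.pop(self.next_to_play, None)
        self.next_to_play += 1
        return audio


buf = JitterBuffer(depth_frames=3)
print("added latency:", buf.added_latency_ms(), "ms")
```

A deeper buffer rides out more jitter but adds delay in direct proportion. An adaptive buffer would adjust depth_frames at run time based on the jitter it measures.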

Jitter buffer techniques are one of the areas where I think the Open Source community can contribute significantly. There is much thought and experimentation going on now to find algorithms and techniques to use adaptive jitter buffers in two-way real-time audio streams. I suspect the cumulative efforts of the Open Source community will find some excellent solutions to this in the next year. I hope the OpenPhone Project can be a catalyst for making it happen sooner.