The OpenPhone Project—Internet Telephony for Everyone!

Call your friends and family from your computer—a look at the future or the present? With Linux, the future is now.
Audio Codecs

The standard PSTN digitization technique is described by the ITU G.711 specification. This document describes an 8KHz sampling rate using 8 bits per sample, for an effective 64Kbps data rate. There are two subsets of this technique: A-Law and Mu-Law. These are scaling factors that take into account the sensitivity of the human ear and allow for a more efficient utilization of the 8-bit encoding space for recording the human voice. Both are in wide use and are considered standard codecs.

However, 64Kbps is excessive for Internet telephony. Two-way audio encoded in this manner yields 128Kbps of audio alone (not counting signaling or overhead), making it impractical for use across a dial-up link, and wasteful of bandwidth even on a LAN or WAN link. Voice compression is the answer, and there are several options from which to choose.

Audio codecs can either be implemented in software or provided by the telephony interface card using an on-board DSP. On a lightly loaded PC, the extra processing load incurred by the encoding is not particularly burdensome, although it can be on more heavily utilized machines. Software-only solutions can add some latency, since the encoding usually happens in a user-space program and is thus at the whim of system load and the scheduler. Internet telephony cards use on-board DSPs to perform the encoding, virtually eliminating the load on the host CPU and the associated latencies.

Perhaps the most widely used codec in Internet telephony is G.723.1. This codec provides either a 6.3 or 5.3Kbps data stream of packets that hold 30ms of audio each. The encoding requires 7.5 milliseconds of lookahead, which, when combined with the 30-millisecond frame, totals 37.5ms of coding delay. This is the minimum theoretical latency using this codec. Of course, real-world latencies will include the time on the Internet and application-level processing at both ends (and the jitter buffer—mustn't forget that).

G.723.1 is patented technology. The patent holders charge royalties for its use, making it impossible to use in open-source software. However, most telephony boards (including the hardware used by the OpenPhone Project) include this codec as part of the board price. Aside from other technical advantages of telephony interface boards over sound cards, the inclusion of licensed compression codecs on the card is perhaps the single best reason we need these cards. With the codec in hardware, we can write open-source software and not violate any licenses. I'd like to point out that the G.723.1 codec is very different from the G.723 codec (a much higher bandwidth codec in which the source code is widely available, but not often used due to high bandwidth requirements).

Rapidly gaining in popularity are the G.729 and G.729A codecs. These codecs use an 8KHz sampling rate, a 10ms audio frame and a 5ms lookahead buffer, for a total of 15ms processing latency. Because the frames are only 10ms long, this codec suffers less from packet loss—the software has less to recover, given the loss of any one packet. G.729A is a version of the codec that uses fewer DSP resources, making it easier to implement on lower-performance (cheaper) DSP chips. The audio quality these codecs can deliver is astounding. I recently experienced a LAN-based call using the G.729A codec and stood next to the person calling me. I could detect no perceptible delay between the direct path (his mouth to my ear) and the link across the network. If I had not known otherwise, I would have sworn it was a normal PSTN call. This codec is the future of Internet Telephony.

In-Band Signaling

In-band signaling includes all the tones you normally hear in the course of a call, including ringing, busy-signal, fast-busy and the all-important dual-tone modulation frequency digits (DTMF). These tones are a normal expected part of using a telephone, but unfortunately, many do not compress well with the algorithms discussed above. One alternative, the one chosen by the OpenPhone project, is to pass these signals as a separate data type within the real-time audio stream. This has the advantage of ensuring that the signals are reproduced accurately at the other end of the connection regardless of the audio codec needed, and it reduces the bandwidth needed to convey the data.

This technique is described in an Internet Draft (see Resources). It basically specifies a new RTP data payload type; the application layer simply sends the data in the stream along with the normal audio. One very tricky aspect needing more work is the code required to remove the actual audio signals from the audio data stream prior to compression. Should these tones and signals get compressed, the codecs will distort these signals at the far end, reducing the chances of properly detecting the signal or tone. It's far better to filter these signals out prior to compression and to pass the signal across the network directly, allowing the other end to reproduce the signal with perfect accuracy.