Echo and Soft VoIP PBX Systems
Those of you who have watched old black-and-white movies depicting long-distance conversations may remember the callers shouting into the mouthpieces in order for the other party to repeat what was said. The reason the callers had to shout was low receiver volume. The attenuated volume was the way echo was dealt with before powerful digital processing was available. The signal heard by a listener was attenuated considerably by the equipment. The echo passed through the attenuator twice—once on the way out and once on the way back—and this provided a measure of echo reduction. The use of attenuation to eliminate echo was not a satisfactory solution, and this method was abandoned when digital echo cancellation became available. However, the technique still is valuable in the soft PBX world as a mechanism for getting rid of the echo that remains after the somewhat limited software echo cancellers have done their job.
Digital echo cancellation is based on subtracting from the received signal a correction based on the response of the system to a short spike of sound, called the finite impulse response (FIR). The FIR is simply the echo you would hear from a short ping.
Figure 2 shows 128 digital sound samples or taps taken at a rate of 8,000 times per second, covering 128/8 = 16 milliseconds. The impulse occurred at time zero. The dots represent the individual sample values that have been normalized to an impulse size of 1.
The first thing to notice is the echo does not appear to be very strong. The impulse had a value of 1, and the highest peak in the response is less than 0.25, falling rapidly to tiny values. But because of the sensitivity of the ear, the echo produced by this system sounds almost as loud as the spoken voice, resulting in a completely intolerable echo on a VoIP system.
The echo from the impulse has an effect that lasts about 10ms (80 taps). To cancel out the echo properly, the input from all the nonzero taps needs to be taken into account. This is why the number of taps in an echo canceller is important. The number of taps is always a power of 2: 32, 64, 128, 256 and so on. Naturally, the higher the number of taps, the higher the computing load and memory requirement.
This echo starts at tap 7, or about 1ms after the impulse. The delay is due to switching and transmission delays on the digital and analog lines. You can see why it is important that echo cancellation takes place close to the echo source. If this echo were being cancelled at the far end of a transatlantic call, there would be many more leading idle taps, so the true echo would be shifted back, perhaps right out of the tap sample. When echo is heard on a system with good echo cancellation, it usually is because an unexpectedly complex system has switching and transmission delays that have shifted the FIR backwards out of the tap sample.
For this call, beyond about 70 taps, the echo tail is small. In practice, this echo canceller would be about as effective at 64 taps, particularly if the leading 8 taps were eliminated by better buffering. That would cut the echo cancellation computation load by half.
The FIR is used to calculate a series of correction factors that represent the echo component of the received signal. Mathematically, the echo to be subtracted for each voice sample is given by the dot product of two vectors of dimension equal to the number of taps. On a 128-tap echo canceller, for example, it would look like this:
Echo = (128 values of FIR) ⋅ (128 previous tap samples of transmission)
By subtracting this “echo” from the signal as received, a substantially echo-free receive signal is obtained. However, because of rounding errors and non-linearities, some of the echo remains. The nonlinear processor cuts out the remaining received signal if the signal is small enough. In higher-performance echo cancellers, the nonlinear processor then substitutes “comfort noise”, background noise so the line does not sound dead.
Obtaining the FIR is an iterative training process based on measuring the residual signal after the calculated echo has been subtracted and changing the FIR estimate. This process requires silence on the other end of the line—there is no doubletalk. The doubletalk detector detects when both parties are speaking at the same time and disables the FIR optimization process until the doubletalk condition has ceased. The iterative FIR optimization converges quite slowly, but as the calculations are done 8,000 times per second, within a second or two of the start of a call, a good echo canceller will be fully trained.