Opening Pandora’s Box?

The “L” word - latency and digital audio systems
By Al Keltz


With more and more digital transport and processing appearing in live audio systems, the subject of latency has become quite the hot topic. In fact, at Whirlwind, the number one question we receive from potential users of digital snakes is, “What's the latency?” This is well ahead of questions about frequency response, signal-to-noise ratio and dynamic range.

Latency is delay that occurs in audio systems due to the time it takes for sound to travel from place to place, and/or due to the time it takes for digital components to perform calculations. However, anecdotal effects of latency have recently reached almost mystical proportions.

We've been told about a certain “psycho-acoustic” phenomenon that causes singers to become disoriented, even with extremely small amounts of latency. Then there was a tale of a drummer who was being “driven crazy” because 4 milliseconds (ms) of latency in his monitor made it sound like the drums were bouncing off the back wall of the room. He just couldn't deal with the echo.

Some folks are also expressing concern about “comb filtering” with in-ear monitoring (IEM) systems, where the sound conducted through the head gets mixed with delayed monitor sound, causing extreme frequency dropouts - so bad as to make it “a severe issue” and “unusable.”

Some of this sounds plausible, some of it sounds a little far-fetched. Just about everyone agrees that latency is bad, but if you must have it, less is always better. Right?

In light of all of these questions and discussions, it was time to learn first hand about the effects of latency as it pertains to digital audio, digital snakes, live audio and monitors - in particular, in-ear monitors. Unfortunately, I found that there's very little data published about it.

So, after some research, and numerous conversations with audio professionals and manufacturers, what follows is an attempt to remove some of the mystery and present a rational look at latency, its real effects, and how much can be tolerated by performers and audiences.

Latency issues have been around forever. Any time audio is reproduced with a microphone and a loudspeaker, there will be delay. Some delay occurs because of the distances that sound travels to microphones and from loudspeakers, as well as from reflections off walls and ceilings.

When digital components are used in a sound system, additional delay is added due to the conversion of analog audio to digital, then transporting that digital data over a network, and then the conversion back to analog. Further, digital effects processing can add even more delay due to the time it takes to perform the digital calculations.

Sound is not faster than a speeding bullet. In fact, sound is relatively slow compared to bullets, fighter jets and space shuttles, traveling in dry air at 32 degrees Fahrenheit at a rate of about 742 miles per hour, or 1,087 feet per second (ft/sec). 1 For easy figuring, let's call it 1,000 ft/sec, so that sound in air is always delayed from the source to our ears by approximately 1 millisecond per foot (1 ms/ft).

Now, give this a try. Plug a microphone into a standard analog sound system and speak while standing 5 feet from the loudspeaker. You're now experiencing about 5 ms of latency. Step back another 5 feet, and you're experiencing about 10 ms of latency. Move the microphone a foot away from your mouth, and it adds another 1 ms.
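The rule of thumb above is easy to sanity-check. A minimal sketch in Python, using the 1,087 ft/sec figure cited earlier (the function name is mine, chosen for illustration):

```python
# Acoustic latency from source-to-listener distance, using the
# speed of sound in dry air at 32 degrees F (~1,087 ft/sec) cited above.
SPEED_OF_SOUND_FT_PER_SEC = 1087.0

def acoustic_latency_ms(distance_ft: float) -> float:
    """Travel delay in milliseconds for sound covering distance_ft."""
    return distance_ft / SPEED_OF_SOUND_FT_PER_SEC * 1000.0

for d_ft in (1, 5, 10, 30, 75):
    print(f"{d_ft:3d} ft -> {acoustic_latency_ms(d_ft):5.1f} ms")
```

The exact figures come out slightly under 1 ms/ft (5 ft is about 4.6 ms, 75 ft about 69 ms), which is why 1 ms/ft works fine as a round-number approximation.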

Delays within an ensemble of musicians can be, and often are, relatively long. Think of a symphony where performers are located across a 40-foot stage. The conductor waves a baton to keep time. The percussion section might be 30 feet (and 30 ms) away, while the second violins are 5 feet (and 5 ms) away.

Does the conductor hear all of the notes attacking at different times? The harps might be 40 feet (and 40 ms) away from the timpani. Do they think they sound out of time with each other? How do the musicians stay in synch with each other?

Actually, research sponsored by the National Science Foundation, through the Stanford University Department of Music, has shown that performers in an ensemble have no problem synchronizing with each other while experiencing latencies as high as 40 ms and even greater. In fact, latencies in the 10 ms to 20 ms range actually have a stabilizing effect on tempo and are thought to be preferred over zero latency. 2

The one thing for sure is that you can't make latency go away - it just has to be dealt with.

There are two main effects of latency: echo and phase cancellation, the latter also referred to as comb filtering. How much latency a performer can experience with in-ear monitors before it becomes noticeable as echo seems to be largely a matter of conjecture. Time for something more concrete, so some Whirlwind engineers and I decided to do a few experiments of our own.

The Haas Effect (or Precedence Effect) is a principle first set forth in 1949 by Helmut Haas, which established that we humans localize sounds by identifying the difference in arrival time between our two ears. The same sound arriving within 25 ms to 35 ms of itself will be suppressed and not be heard as an echo.

Only if sounds are more than about 35 ms apart will the brain recognize them as separate sounds or echoes. We tested this by connecting a microphone through a 20 ms delay and monitoring with Shure E3 ear buds.

The principle here: when you monitor your own voice, the delayed sound mixes in the ear with non-delayed sound that is conducted through your head via bones, cartilage and Eustachian tubes. This effect should be exaggerated when using headphones or in-ear monitors, because the listener is not hearing all of the other room reflections with various delays and volumes.

Several people were asked to read aloud from a magazine. The subjects included experienced sound professionals and musicians, amateur players and non-technical people with no monitoring experience. Everyone heard the initial 20 ms delay as a very short echo or “doubled” sound.

Then, the delay was gradually reduced from 20 ms. The subjects were told to stop us when the delay seemed to disappear. Then this was repeated while the person spoke short, sharp syllables like “check, check!”

Every person tested seemed to think the echo disappeared somewhere between 10 ms and 15 ms. I personally found it to be a rather dramatic change too - as if someone had suddenly bypassed the delay unit.

Next, we wanted to evaluate the situation where a guitarist is playing direct to the PA, or a drummer is using electronic drums. I played guitar into the delay and monitored through headphones while only listening to the delay.

At long delays, playing in time is impossible. You have to “wait” to hear the note, and your playing gets slooowwwer and sloooowwwwwer.

It's a weird feeling, almost making me feel like I was going to fall over. Now I know what singers go through when they're trying to sing the national anthem in a stadium where the echo comes back at them with a long delay!

On the other hand, a delay of a few milliseconds was imperceptible. In my guitar experiment, the delay wasn't noticeable at all up to about 10 ms (again). It became slightly noticeable between 10 ms and 15 ms, almost like it's not really an echo - just “something's there” - but I could still play in time. The delay started to get difficult to contend with somewhere around 15 ms to 20 ms, and above 20 ms I really struggled with timing.

Now, this is an admittedly small sample. However, after these tests, it appears that even when subjects were told it was there, they couldn't detect latency as echoes below about 10 ms to 15 ms of latency. 3

Still, there is the other issue that arises when mixing non-delayed sound with delayed sound - phase cancellation or “comb filtering”.

Any time the same sound arrives at a listener at different times, it interacts with itself, altering the overall frequency response. The same thing happens when a single sound reaches two microphones at slightly different distances at about the same level.

When a sound or electrical wave is mixed with a delayed version of itself, peaks and valleys appear in the frequency response due to the out-of-phase interaction between the two waves. It's easy to see how the “comb filter” got its name. (Figure 1)


 Figure 1: Graph of 20 Hz to 20 kHz frequency response of a signal mixed with the same signal delayed by 1 ms.

This frequency interaction is the basis for the long-standing 3:1 principle of microphone placement. 4 If two microphones are placed at distances from a sound source such that the sound arrives at different times but at nearly the same level, the resulting signal mix will produce a comb filter effect. (Figure 2) However, if the microphones are placed closer to their sources and/or farther apart from each other, the unwanted signal in each microphone will be weaker and the effect is reduced. When the distance between microphones is 3 times (3x) the distance from each microphone to its source, the unwanted sound at each microphone will be approximately 9 dB below the direct signal. This reduces the maximum comb effect to about 1 dB, essentially inaudible.

Figure 2: These microphones will produce phase cancellation because they are closer to each other than 3x the distance from each mic to the source.
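The roughly 9 dB figure behind the 3:1 principle follows from the inverse-square law. A minimal sketch, assuming a point source in a free field (the function name is mine):

```python
import math

def level_drop_db(distance_ratio: float) -> float:
    """Level drop (dB) under the inverse-square law: doubling the distance
    costs about 6 dB, so a distance ratio of r costs 20 * log10(r)."""
    return 20.0 * math.log10(distance_ratio)

# With the mics spaced 3x as far from each other as from their sources,
# the leakage into each mic arrives roughly this far below the direct sound:
print(f"{level_drop_db(3):.1f} dB")  # -> 9.5 dB
```

Real rooms add reflections that soften this ideal figure, which is one reason the 3:1 ratio is a guideline rather than a law.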

What about combing and in-ear monitors? Even though latencies less than 10 ms to 15 ms are not heard as echoes, latency can still cause audible changes in tone when the monitor audio mixes with the sound heard through the head.

Figure 3 shows an oscilloscope screenshot of an electrical signal called a “sweep.” It is a burst of pure sine waves starting out at 20 Hz and extending on up to 20 kHz. The height or intensity of each portion of the wave is the same.

Figure 3: Sweep signal, 20 Hz to 20 kHz

Figure 4, meanwhile, offers the same sweep signal that has been mixed with itself at the same level but with a latency of 10 ms. The oscilloscope's persistence has been turned on to fill in the waveform to better illustrate the comb effect. Notice the peaks and valleys in the waveform where the waves have added and subtracted from each other.


Figure 4: Sweep signal, 20 Hz to 20 kHz mixed at 10 ms latency.

The “peaks” will occur at specific frequencies depending on the amount of delay. The fundamental frequency can be calculated with the formula 1/delay, and harmonics of this frequency will repeat throughout the bandwidth. For the 10 ms example, 1/0.010 sec = 100 Hz, so the peaks occur at 100 Hz, 200 Hz, 300 Hz, and so on, up to 20 kHz. How much the wave is affected depends on the relative strength of the two signals, and the effect is most pronounced when they are of equal strength.
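The 1/delay relationship is easy to tabulate. A short sketch (function name mine) listing the reinforcement peaks for the delays discussed here:

```python
def comb_peak_frequencies(delay_ms: float, max_hz: float = 20_000.0) -> list[float]:
    """Frequencies (Hz) where a signal mixed with a copy of itself delayed
    by delay_ms reinforces: the fundamental 1/delay and its harmonics."""
    fundamental_hz = 1000.0 / delay_ms  # 1/delay, with delay expressed in seconds
    count = int(max_hz // fundamental_hz)
    return [fundamental_hz * k for k in range(1, count + 1)]

print(comb_peak_frequencies(10)[:3])  # 10 ms -> peaks at 100, 200, 300 Hz ...
print(comb_peak_frequencies(5)[:3])   #  5 ms -> peaks at 200, 400, 600 Hz ...
print(comb_peak_frequencies(1)[:3])   #  1 ms -> peaks at 1000, 2000, 3000 Hz ...
```

Note that 1 ms produces only 20 peaks below 20 kHz, while 10 ms produces 200 of them: shorter delays mean fewer, wider peaks, not fewer artifacts.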

The top screen in Figure 5 depicts the same swept wave, but mixed with a latency of 5 ms. The peaks have changed width and have moved up in the frequency band to 200 Hz, 400 Hz, 600 Hz and so on, again up to 20 kHz. The bottom wave in Figure 5 shows the swept wave mixed with 1 ms latency.


Figure 5: Above, a swept wave mixed with latency of 5 ms at the top, and the same wave mixed with latency of 1 ms below.

Now the peaks are located at 1 kHz, 2 kHz, 3 kHz, and so on. I'd venture that without this evidence, the general consensus would have been that less latency always produces less comb effect and is always preferable. But a comparison of the waveforms from this section shows that the peaks are fewer and wider at 1 ms, and occur at frequencies that affect the vocal range just as the latencies of 5 ms or even 10 ms do.

In fact, if the comb filtering at 1 ms were producing unacceptable tone quality, a possible solution might be to actually add a few milliseconds of latency to shift the affected frequencies and move the peaks closer together.

Reversing the polarity of the in-ear monitor signal can have more of an effect on how things sound than the amount of latency alone. The upper section of Figure 6 shows the result of mixing a signal with a 1 ms delayed copy of opposite polarity, compared with the non-inverted mix. Note that the peaks and valleys have swapped positions.


Figure 6: Same mix of signals at 1 ms latency, but with the polarity of the delayed portion reversed in the upper section.

Figure 7 provides a comparison of 1 ms, 5 ms, and 10 ms latencies with the inverted result in the upper section. It's also very important to note that not all headphone manufacturers construct their headphones with the same polarity. If equalization for monitoring is set up with a particular brand of headphone and then, instead, a different brand that is opposite in polarity is used, the frequency response can change significantly.

Figure 7: Comparison of 1 ms, 5 ms and 10 ms latencies, with the inverted result in the upper section.

An occlusion effect occurs when some object (like an un-vented ear mold) completely fills the outer portion of the ear canal. What this does is trap the bone-conducted sound vibrations of a person's own voice in the space between the ear mold and the eardrum.

Ordinarily, when people talk (or chew), these vibrations escape through an open ear canal. But when the ear canal is blocked by an ear mold, the vibrations are reflected back toward the eardrum and increase the loudness perception of the person's own voice, especially at lower frequencies. Compared to a completely open ear canal, the occlusion effect may boost the low frequency (usually below 500 Hz) sound pressure in the ear canal by 20 dB or more.

This effect has long been an issue with hearing aids and results in people complaining that their voice sounds “funny,” “hollow” or “in a barrel.” 5 It's important not to confuse the occlusion effect with issues of comb filtering caused by latency.

But what does it sound like? Well, try it out! Talk through a mic and a good quality digital delay set at 10 ms, and listen through headphones. Then adjust the delay from 10 ms down to 1 ms and listen to the change in tone.

When I did this, I thought things sounded a bit “nasally” at the lowest latencies, almost like a “smiley” graphic EQ. But at no time was the effect disastrous or unusable. Important: reverse the polarity of the mic at various latencies and listen to the difference. Any frequency that sounded “scooped” will now sound like there's a hump to it, again as illustrated in Figure 7.

So latency can cause some comb filtering with in-ear monitors. But because the monitor console is usually located at the stage, most analog monitor rigs connect from an analog split there, without the audio being digitally processed through the rest of the system - it's not cost effective to run a digital split only 50 feet or so. However, using a digital monitor console, or any digital processing in the monitor chain, will still produce some latency.

Even if you have to run your monitors from front of house, and this is being done with a digital snake and digital console, remember that any total latency under 10 ms to 15 ms primarily becomes an issue of changes in frequency response, not echoes.

So far, we've been dealing primarily with monitor wedges, headphones and in-ear monitors. However, added latency is an issue we all should be aware of within an entire sound system.

Suppose you run microphones into a digital snake, to a digital console, to an outboard loudspeaker processor and to power amplifiers. Each section of digital processing adds its own level of latency. The table in Figure 8 shows how much extra latency is produced by components in a digital system versus a total analog system. Other digital components such as digital wireless can add their own amounts of latency.
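Summing the ranges in Figure 8 is simple bookkeeping. A sketch using the article's approximate per-device figures (the dictionary and function names are mine):

```python
# Approximate extra latency per digital device, from Figure 8: (min_ms, max_ms).
DEVICE_LATENCY_MS = {
    "digital snake": (3, 7),
    "digital console": (1, 3),
    "drive signal processor": (5, 10),
}

def chain_latency_ms(devices: list[str]) -> tuple[int, int]:
    """Total (min, max) latency for a signal chain: each stage simply adds."""
    lo = sum(DEVICE_LATENCY_MS[d][0] for d in devices)
    hi = sum(DEVICE_LATENCY_MS[d][1] for d in devices)
    return lo, hi

full_chain = ["digital snake", "digital console", "drive signal processor"]
print(chain_latency_ms(full_chain))  # -> (9, 20), matching Figure 8's total
```

Adding another digital stage (say, digital wireless) means adding another entry and another term to the sum; latencies in series never cancel.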

How does this extra latency affect system operation?

Let's assume 15 ms extra latency for a digital signal chain from the stage to front of house to loudspeakers. If the house mix position is located 75 feet from the stage, the sound person is already experiencing about 75 ms of latency. Adding 15 ms is not much of an issue for that person or the audience in general.

If the performers are hearing a reflection off the back wall of the room that is, say, 200 feet deep, then that reflected delay would increase from about 400 ms to about 415 ms.

If the main loudspeakers are located 20 feet in front of the stage backline, and there is 20 ms of delay being applied on the mains to compensate the system for that distance, that delay can be reduced to 5 ms. Similarly, when calculating delay to towers at an outdoor festival, one would add the 15 ms of extra latency produced by the digital chain.
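The mains example above reduces to a single subtraction. A sketch using the article's ~1 ms/ft rule of thumb and its 15 ms digital-chain figure (the function name is mine):

```python
MS_PER_FT = 1.0  # the article's rule of thumb: ~1 ms of acoustic delay per foot

def mains_delay_ms(offset_ft: float, digital_latency_ms: float) -> float:
    """Delay to dial into mains placed offset_ft ahead of the backline,
    after crediting the latency the digital chain has already added."""
    return max(0.0, offset_ft * MS_PER_FT - digital_latency_ms)

print(mains_delay_ms(20, 15))  # -> 5.0 ms, as in the example above
print(mains_delay_ms(20, 0))   # -> 20.0 ms in an all-analog chain
```

The point is that a constant digital latency just shifts the numbers you dial in; it doesn't add a new kind of problem.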

 So, what do we know?

• Latency is present in all analog and digital audio systems.

• Latencies less than approximately 10 ms to 15 ms are not perceived as echoes with in-ear monitors.

• It takes time to perform digital calculations. Any device that performs analog-to-digital and digital-to-analog conversion, or processes audio digitally, will add some latency.

• Lowering latency doesn't necessarily reduce the effect of comb filtering. All but the very shortest latencies produce comb effects, and the frequencies affected vary with the amount of delay. It's just as important to pay attention to the polarity of the monitor audio signal.

• Performers, audiences and sound mixers already tolerate long latencies - much longer than those produced by the digital components of a system.

• The total latency in a system is the sum of all the individual analog and digital latencies produced by each section.

• As long as the digital latency component is constant, it can be accounted for and dealt with when adjusting various components for delay.

Unit                      Approx. Latency (ms)
Digital Snake             3-7
Digital Console           1-3
Drive Signal Processor    5-10
Total                     9-20

Figure 8: A table of extra latency produced by digital components.

The author would like to thank the following for their assistance and contributions to this article:

Carl Cornell, Jim Kelsey, Bob Schwartz - Whirlwind Engineering
Marty Garcia - Future Sonics
Dave Kaspersin - DRC Recording
Greg Lukens - Washington Professional Systems
Lee Minich - Lab-X Technologies


  1 It's interesting to note that the speed of sound in helium is 3,190 ft/sec, about 3x faster than the 1,087 ft/sec in air. This raises the resonant frequencies of the throat and vocal tract and explains why a person's voice sounds high pitched after inhaling helium.


    3 When latency was 10 ms to 15 ms, although some people detected the presence of “something,” they felt they could probably live with it. Others made faces. When the latency was 15 ms to 20 ms, more people heard an effect and felt that it was becoming distracting. Exactly when it becomes intolerable will vary from performer to performer and with the material being performed.

  4 This is usually referred to as the 3:1 Rule, but violating rules sometimes sounds pretty good! My engineering friend Greg Lukens says, “It's a mistake to mix with your eyes.”