Waveguide Synthesis is one of the most effective approaches to generating sounds with physically realistic traits. If you're not convinced of this, perhaps the polished waveguide models of Chet Singer will change your mind. When researching the topic, I've found most material on waveguide synthesis to be heavy on the theory and yet lacking when it comes to application and implementation. This is why I decided to assemble my notes on the implementation and applied exploration of waveguides, using the occasion to play with interactive in-browser synthesis. This text assumes some knowledge of audio DSP.
We begin with a minimal, special case of a waveguide more commonly known as a feedback comb filter. Take a source signal (the excitation), and delay it by a short duration (the delay length). Then take the delayed signal, attenuate it by a certain factor (the feedback), and feed it back into the delay along with the source signal. As the delay receives its own output in a loop, some frequencies will begin to emerge, creating a perceivable pitch, which in simple cases is the inverse of the delay length.
If you play with the model below, you'll notice the pitch seems lower by an octave when feedback is negative. This naturally arises from the fact that a number is repeated after its sign is inverted twice, effectively doubling the delay length. Negative feedback also changes the timbre, because all even harmonics cancel themselves out. This is useful in wind instrument waveguides, where closed bores (such as pan flutes) are often modeled with negative feedback.
Another way of representing this type of model is with a difference equation, which is useful as it can be directly translated to an algorithm. In this case the difference equation is:
$$ y(n) = x(n-d) + ay(n-d) $$
where:
Note how the previous model has very sustained high frequencies, giving it an unrealistic timbre. In reality, a physical medium will absorb higher frequencies more heavily than lower frequencies. This can be roughly modeled with a filter inside the loop. It's also important to feel the difference between filtering inside the loop vs. filtering only the excitation. The following model has butterworth lowpass filters in both positions:
Note how the previous model, when filtered, produces only short notes, even with maximum feedback. Can this be remedied? Raising the feedback above 1 would cause the amplitude to keep increasing theoretically forever (this would be instability). We've been simply multiplying the signal $x$ by the feedback value $a$, in other words, applying a linear function $\mathrm{L}(x) = ax$. In real acoustic systems, louder sounds are more heavily absorbed by the medium, and the simplest way of modeling this is to use a sigmoid non-linear function such as $\mathrm{NL}(x) = \tanh(ax)$ that gives us lower feedback for high values. This will also allow sustained notes without instability.
The first model's pitch was simply determined by the delay length. You may have noticed that's not the case anymore: change the inner filter's frequency, and the pitch also changes. This is because digital filters are based on delay, so by introducing a filter, we're changing the total length of the cumulative delay line. In fact, the filters we're using have a phase delay that's different for every frequency. This makes it more difficult to find an exact solution to compensate for this delay, as the filter introduces slight inharmonicity, and it affects the pitch in a way that depends on the pitch. Thankfully we don't actually need to use our brains: we can use regression instead! Just measure how the filter affects the pitch and fit an equation onto it. For simple lowpass filters, we obtain something in the form:
$$ K = \frac{c_2}{f + \frac{c_1 f^2}{f_c}} $$
where:
Notice that the model above has staccato notes. Why? Because I was too lazy to implement proper note transitions. If we attempt to change the length of the delay line while a note is playing, the output does not remotely resemble a legato sound (at best, we can change it slowly and obtain a slide whistle sound). In the legato transition of a real flute, a tone hole is opened or closed, and the bore effectively has a Y-junction during the transition. This can be better modeled by cross-fading between two fixed-length delay lines in the loop. In fact, two delay lines will suffice for any sequence of notes: one of them can always sound while the other secretly changes length.
When exploring waveguide synthesis, it's useful to have mental models and simplifications to work with. One such insight is to imagine that the looped delay line provides an infinite number of harmonics of its base frequency, some having more weight than others. Without any filters, the weights of harmonics are based on their harmonic distance: the simplest fractions, such as the fundamental, are the heaviest. This weight determines which harmonics are more likely to resonate, with the lighter ones generally being softer or absent (this is probabilistic when the exciter is based on noise). By adding filters into the loop, we shape the spectrum of those weights. When crossfading multiple delays in the loop, we intersect the spectra of their weighted harmonics.
However, complex filtering in the loop means the final pitch of the model will be much harder to control. One practical solution is to play along with it, since these pitch fluctuations can be satisfyingly organic, especially when the context is not musical. This is, I believe, how pitch is handled in some xoxos instruments such as Fauna and Elder Thing. In the following example, the same approach is used to create organic animal calls. There are two parallel bandpass filters in the loop, one of which has three times the frequency of the other.
The following example simulates the higher registers of a flute by means of a bandpass crossfaded with a lowpass in the loop, making the pitch harder to predict. Despite this, an attempt is made to keep the final pitch in tune with the original intended pitch. We use the same formula introduced in the above section Controlling the Pitch, with one modification: $f_c$ is now the sum of the frequencies of both filters.
Form A above is commonly found in educational material, because it's the one that directly matches the physical model. However, it should generally not be used in practice. Instead, we've been using the simplified form B, which is obtained by changing the point at which the delay loop is sampled, then combining delays inside the loop. More generally, the input and output points as well as the order of elements inside the delay loop can often be modified with no audible difference, even if the model is not exactly equivalent. Form C is particularly useful when low latency is required.
Digital delay lines are made of discrete samples, but we need to read them at arbitrary time points between samples. There are a few possible solutions to this:
When experimenting with looped delay lines, the output will sometimes seem to paradoxically vanish at high feedback values. This is usually due to DC offset, that is, any slight positive or negative average offset gets amplified inside the loop until the signal gets squashed. This arises when using non-linearities with too much slope at the origin, that is $|\mathrm{NL}'(0)| > 1$, assuming $\mathrm{NL}$ includes the feedback. The simplest solution is to use a high-pass filter in the loop. In the above examples, a one-pole 30Hz high pass filter is used.
Waveguide synthesis related:
Tools that were used to create this interactive notebook: