Opus Silk Codec
This paper was accepted for publication at the 135th AES Convention. This version of the paper is from the authors
and not from the AES.
ABSTRACT
In this paper, we describe the voice mode of the Opus speech and audio codec. As only the decoder is
standardized, the details in this paper will help anyone who wants to modify the encoder or gain a better
understanding of the codec. We go through the main components that constitute the voice part of the codec,
provide an overview, give insights, and discuss the design decisions made during the development. Tests have
shown that Opus quality is comparable to or better than several state-of-the-art voice codecs, while covering
a much broader application area than competing codecs.
In each band the background noise level is estimated by smoothing the inverse energy over time frames. Multiplying this smoothed inverse energy with the subband energy gives the SNR.
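As an illustration of this estimate, the following sketch smooths the inverse band energy over frames and multiplies it by the current band energy. The band count, smoothing constant, and interface are our own assumptions, not the reference implementation:

```c
/* Sketch: per-band SNR from a smoothed inverse-energy noise estimate.
 * energies[b] holds the current frame's energy in band b; the
 * smoothing coefficient 0.05 is illustrative only. */
#define NUM_BANDS 4

typedef struct {
    float inv_energy_smoothed[NUM_BANDS]; /* running noise-level estimate */
} vad_state;

void update_band_snr(vad_state *st, const float *energies, float *snr)
{
    for (int b = 0; b < NUM_BANDS; b++) {
        float inv = 1.0f / (energies[b] + 1e-9f);  /* guard against /0 */
        /* Smooth the inverse energy over frames; low-energy (noise-only)
         * frames contribute the largest inverse energies and thus
         * dominate the background noise estimate. */
        st->inv_energy_smoothed[b] += 0.05f * (inv - st->inv_energy_smoothed[b]);
        /* SNR = subband energy times smoothed inverse noise energy. */
        snr[b] = energies[b] * st->inv_energy_smoothed[b];
    }
}
```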
3.2. HP Filter
A high-pass (HP) filter with a variable cutoff frequency between 60 and 100 Hz removes low-frequency background and breathing noise. The cutoff frequency depends on the SNR in the lowest frequency band of the VAD, and on the smoothed pitch frequencies found in the pitch analysis, so that high-pitched voices will have a higher cutoff frequency.

3.3. Pitch Analysis
As shown in Figure 2, the pitch analysis begins by pre-whitening the input signal, with a filter of order between 6 and 16 depending on the complexity mode. The whitening makes the pitch analysis equally sensitive to all parts of the audio spectrum, thus reducing the influence of a strong individual harmonic. It also improves the accuracy of the correlation measure used later to classify the signal as voiced or unvoiced.

The whitened signal is then downsampled in two steps to 8 and 4 kHz, to reduce the complexity of computing correlations. A first analysis step finds peaks in the autocorrelation of the most downsampled signal to obtain a small number of coarse pitch lag candidates. These are input to a finer analysis step running at 8 kHz, searching only around the preliminary estimates. After applying a small bias towards shorter lags to avoid pitch doubling, a single candidate pitch lag with the highest correlation is found.

The candidate’s correlation value is compared to a threshold that depends on a weighted combination of:

• Signal type of the previous frame.
• Speech activity level.
• The slope of the SNR found in the VAD with respect to frequency.

If the correlation is below the threshold, the signal is classified as unvoiced and the pitch analysis is aborted without returning a pitch lag estimate. The final analysis step operates on the input sample frequency (8, 12 or 16 kHz), and searches for integer-sample pitch lags around the previous stage’s estimate, limited to a range of 55.6 to 500 Hz. For each lag being evaluated, a set of pitch contours from a codebook is tested. These pitch contours define a deviation from the average pitch lag per 5 ms subframe, thus allowing the pitch to vary within a frame. Between 3 and 34 pitch contour vectors are available, depending on the sampling rate and frame size. The pitch lag and contour index resulting in the highest correlation value are encoded and transmitted to the decoder.
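As an illustration of the coarse stage described above, the sketch below scores each lag of the 4 kHz signal by its normalized autocorrelation, with a small penalty on longer lags. The bias slope and the interface are our own assumptions, and the codec keeps several top candidates for the finer 8 kHz stage rather than the single best lag shown here:

```c
/* Sketch: coarse pitch search on the downsampled, whitened signal.
 * Returns the lag (in samples) with the highest bias-adjusted
 * normalized correlation. The 0.01 bias slope is illustrative. */
#include <math.h>

int coarse_pitch_search(const float *x, int n, int min_lag, int max_lag)
{
    int best_lag = min_lag;
    float best_score = -1e30f;
    for (int lag = min_lag; lag <= max_lag; lag++) {
        float corr = 0.0f, e1 = 1e-9f, e2 = 1e-9f;
        for (int i = lag; i < n; i++) {
            corr += x[i] * x[i - lag];
            e1   += x[i] * x[i];
            e2   += x[i - lag] * x[i - lag];
        }
        float norm_corr = corr / sqrtf(e1 * e2);
        /* Small bias towards shorter lags to avoid picking a pitch
         * period that is a multiple of the true one (pitch doubling). */
        float score = norm_corr - 0.01f * (float)lag / (float)max_lag;
        if (score > best_score) {
            best_score = score;
            best_lag = lag;
        }
    }
    return best_lag;
}
```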
…carded and Burg’s method is used directly on the input signal.

The LPC coefficients (for either voiced or unvoiced speech) are converted to Line Spectral Frequencies (LSFs), quantized, and used to re-calculate the LPC residual, taking into account the LSF quantization effects. Section 3.7 describes the LSF quantization.

3.5. Noise Shaping
Quantization noise shaping is used to exploit the properties of the human auditory system.

A typical state-of-the-art speech encoder determines the excitation signal by minimizing the perceptually-weighted reconstruction error. The decoder then uses a postfilter on the reconstructed signal to suppress spectral regions where the quantization noise is expected to be high relative to the signal. Opus combines these two functions in the encoder’s quantizer by applying different weighting filters to the input and reconstructed signals in the noise shaping configuration of Figure 3. Integrating the two operations on the encoder side not only simplifies the decoder, it also lets the encoder use arbitrarily simple or sophisticated perceptual models to simultaneously and independently shape the quantization noise and boost/suppress spectral regions. Such different models can be used without spending bits on side information or changing the bitstream format. As an example of this, Opus uses warped noise shaping filters at higher complexity settings, as the frequency-dependent resolution of these filters better matches human hearing [8]. Separating the noise shaping from the linear prediction also lets us select prediction coefficients that minimize the bitrate without regard for perceptual considerations.

A diagram of the Noise Shaping Quantization (NSQ) is shown in Figure 3. Unlike typical noise shaping quantizers, where the noise shaping sits directly around the quantizer and feeds back to the input, in Opus the noise shaping compares the input and output speech signals and feeds to the input of the quantizer. This was first proposed in Figure 3 of [9]. More details of the NSQ module are described in Section 3.5.2.

3.5.1. Noise Shaping Analysis
The Noise Shaping Analysis (NSA) function finds gains and filter coefficients used by the NSQ to shape the signal spectrum with the following purposes:

• Spectral shaping of the quantization noise similarly to the speech spectrum to make it less audible.
• Suppressing the spectral valleys in between formant and harmonic peaks to make the signal less noisy and more predictable.

For each subframe, a quantization gain (or step size) is chosen and sent to the decoder. This quantization gain determines the tradeoff between quantization noise and bitrate.

Furthermore, a compensation gain and a spectral tilt are found to match the decoded speech level and tilt to those of the input signal.

The filtering of the input signal is done using the filter

$$H(z) = G \cdot \left(1 - c_\mathrm{tilt}\, z^{-1}\right) \cdot \frac{W_\mathrm{ana}(z)}{W_\mathrm{syn}(z)}, \qquad (3)$$

where G is the compensation gain, and c_tilt is the tilt coefficient in a first-order tilt adjustment filter. The analysis filter is, for voiced speech, given by

$$W_\mathrm{ana}(z) = \left(1 - \sum_{k=1}^{N_\mathrm{LPC}} a_\mathrm{ana}(k)\, z^{-k}\right) \cdot \left(1 - z^{-L} \sum_{k=-2}^{2} b_\mathrm{ana}(k)\, z^{-k}\right), \qquad (4),(5)$$

and similarly for the synthesis filter W_syn(z). N_LPC is the LPC order and L is the pitch lag in samples. For unvoiced speech, the second factor, labeled (5), is omitted to disable harmonic noise shaping.
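As a sketch of how Eq. (3) can be applied sample by sample, consider the following, which includes only the short-term shaping (the harmonic factor of Eqs. (4)–(5) is left out for brevity). The state handling and names are our own assumptions, not the reference implementation:

```c
/* Sketch: apply H(z) = G * (1 - c_tilt z^-1) * W_ana(z) / W_syn(z),
 * short-term terms only. W_ana is an FIR stage on past (tilted)
 * inputs; 1/W_syn is the recursive stage on past outputs.
 * Caller zero-initializes the state; order <= MAX_ORDER. */
#define MAX_ORDER 16

typedef struct {
    float x_hist[MAX_ORDER];  /* past inputs,  for W_ana  */
    float y_hist[MAX_ORDER];  /* past outputs, for 1/W_syn */
    float tilt_state;         /* previous input, for the tilt filter */
} shape_state;

void shape_signal(shape_state *st, const float *in, float *out, int n,
                  const float *a_ana, const float *a_syn, int order,
                  float G, float c_tilt)
{
    for (int i = 0; i < n; i++) {
        /* First-order tilt adjustment (1 - c_tilt * z^-1). */
        float x = in[i] - c_tilt * st->tilt_state;
        st->tilt_state = in[i];

        /* FIR numerator W_ana(z): subtract the short-term prediction
         * formed from past inputs. */
        float num = x;
        for (int k = 0; k < order; k++)
            num -= a_ana[k] * st->x_hist[k];

        /* IIR denominator 1/W_syn(z): add back the prediction formed
         * from past outputs. */
        float y = G * num;
        for (int k = 0; k < order; k++)
            y += a_syn[k] * st->y_hist[k];

        /* Shift histories (a circular buffer would avoid the O(order)
         * shift in practice). */
        for (int k = order - 1; k > 0; k--) {
            st->x_hist[k] = st->x_hist[k - 1];
            st->y_hist[k] = st->y_hist[k - 1];
        }
        st->x_hist[0] = x;
        st->y_hist[0] = y;
        out[i] = y;
    }
}
```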
without regard for perceptual considerations. The short-term noise shaping coefficients aana (k)
A diagram of the Noise Shaping Quantization (NSQ) and asyn (k) are calculated from the LPC of the input
is shown in Figure 3. Unlike typical noise shap- signal a(k) by applying different amounts of band-
ing quantizers where the noise shaping sits directly width expansion, i.e.,
around the quantizer and feeds back to the input, aana (k) = k
a(k) · gana , and (6)
in Opus the noise shaping compares the input and k
asyn (k) = a(k) · gsyn . (7)
output speech signals and feeds to the input of the
quantizer. This was first proposed in Figure 3 of The bandwidth expansion moves the roots of the
[9]. More details of the NSQ module are described LPC polynomial towards the origin, and thereby
in Section 3.5.2. flattens the spectral envelope described by a(k).
3.5.1. Noise Shaping Analysis The bandwidth expansion factors are given by
The Noise Shaping Analysis (NSA) function finds
gains and filter coefficients used by the NSQ to shape gana = 0.95 − 0.01 · C, and (8)
the signal spectrum with the following purposes: gsyn = 0.95 + 0.01 · C, (9)
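A direct implementation of Eqs. (6)–(9) is small; the following sketch assumes only that the LPC coefficients a(k), k = 1..order, are available as an array:

```c
/* Sketch: bandwidth-expanded analysis/synthesis shaping coefficients,
 * following Eqs. (6)-(9). a[k-1] holds a(k). */
void shaping_coefs(const float *a, int order, float C,
                   float *a_ana, float *a_syn)
{
    float g_ana = 0.95f - 0.01f * C;   /* Eq. (8) */
    float g_syn = 0.95f + 0.01f * C;   /* Eq. (9) */
    float p_ana = 1.0f, p_syn = 1.0f;  /* running powers g^k */
    for (int k = 0; k < order; k++) {
        p_ana *= g_ana;
        p_syn *= g_syn;
        a_ana[k] = a[k] * p_ana;       /* Eq. (6) */
        a_syn[k] = a[k] * p_syn;       /* Eq. (7) */
    }
}
```

Note that g_ana < g_syn for any C > 0, which is exactly what de-emphasizes the spectral valleys.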
The harmonic noise shaping applied to voiced frames has three filter taps,

$$b_\mathrm{ana} = F_\mathrm{ana} \cdot [0.25, 0.5, 0.25], \qquad (10)$$
$$b_\mathrm{syn} = F_\mathrm{syn} \cdot [0.25, 0.5, 0.25], \qquad (11)$$

where the multipliers F_ana and F_syn ∈ [0, 1] are calculated from:

• The coding quality control parameter. This makes the decoded signal more harmonic, and thus easier to encode, at low bitrates.
• The pitch correlation. Highly periodic input signals are given more harmonic noise shaping to avoid audible noise between harmonics.
• The estimated input SNR below 1 kHz. This filters out background noise for a noisy input signal by applying more harmonic emphasis.

Similar to the short-term shaping, having F_ana < F_syn emphasizes pitch harmonics and suppresses the signal in between the harmonics.

The tilt coefficient c_tilt is calculated as

$$c_\mathrm{tilt} = 0.25 + 0.2625 \cdot V, \qquad (12)$$

where V ∈ [0, 1] is a voice activity level which, in this context, is forced to 0 for unvoiced speech.

Finally, the compensation gain G is calculated as the ratio of the prediction gains of the short-term prediction filters a_ana and a_syn.
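Eqs. (10)–(12) map almost directly to code; in the following sketch only the function boundary and names are our own:

```c
/* Sketch: harmonic shaping taps and tilt coefficient, Eqs. (10)-(12). */
void harmonic_shaping(float F_ana, float F_syn, float V, int voiced,
                      float b_ana[3], float b_syn[3], float *c_tilt)
{
    static const float base[3] = { 0.25f, 0.5f, 0.25f };
    for (int k = 0; k < 3; k++) {
        b_ana[k] = F_ana * base[k];     /* Eq. (10) */
        b_syn[k] = F_syn * base[k];     /* Eq. (11) */
    }
    if (!voiced)
        V = 0.0f;                       /* V forced to 0 for unvoiced */
    *c_tilt = 0.25f + 0.2625f * V;      /* Eq. (12) */
}
```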
An example of short-term noise shaping of a speech spectrum is shown in Figure 4. The weighted input and quantization noise combine to produce an output with a spectral envelope similar to the input signal.

3.5.2. Noise Shaping Quantization
The NSQ module quantizes the residual signal and thereby generates the excitation signal.

A simplified block diagram of the NSQ is shown in Figure 5. In this figure, P(z) is the predictor con…
…varying the number N, we get a means for adjusting the trade-off between a low rate-distortion (R-D) error and a high computational complexity. The same principle is used in the NSQ, see Section 3.5.3.

3.7.2. Error Sensitivity
Whereas input vectors to the first stage are unweighted, the residual input to the second stage is scaled by the square roots of the Inverse Harmonic Mean Weights (IHMWs) proposed by Laroia et al. in [10]. The IHMWs are calculated from the coarsely-quantized reconstruction found in the first stage, so that encoder and decoder can use the exact same weights. The application of the weights partially normalizes the error sensitivity for the second-stage input vector, and it enables a uniform quantizer with fixed step size to be used without too much loss in quality.

3.7.3. Scalar Quantization
The second stage uses predictive delayed-decision scalar quantization. The predictor multiplies the previous quantized residual value by a prediction coefficient that depends on the vector index from the first-stage codebook as well as on the index of the current scalar in the residual vector. The predicted value is subtracted from the second-stage input value before quantization and is added back afterwards. This makes the current decision depend on the previous quantization decision, which is exploited in a Viterbi-like delayed-decision algorithm to choose the sequence of quantization indices yielding the lowest R-D cost.
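The recursion itself is compact. The following greedy sketch shows one pass of the predictive quantization; the actual codec keeps several candidate index sequences alive and prunes them in the Viterbi-like search described above, and the step size, rounding, and names here are illustrative assumptions:

```c
/* Sketch: predictive scalar quantization of a residual vector (greedy
 * version; the delayed-decision search would track multiple candidate
 * sequences and pick the lowest R-D one). pred_coef[i] is the
 * prediction coefficient for scalar i, looked up from the first-stage
 * codebook index. */
#include <math.h>

void predictive_sq(const float *in, int n, const float *pred_coef,
                   float step, int *indices, float *out)
{
    float prev_q = 0.0f;   /* previous quantized residual value */
    for (int i = 0; i < n; i++) {
        float pred = pred_coef[i] * prev_q;
        /* Subtract the prediction, quantize with a uniform step size,
         * then add the prediction back. */
        float resid = in[i] - pred;
        indices[i]  = (int)lrintf(resid / step);
        prev_q      = step * (float)indices[i];
        out[i]      = pred + prev_q;
    }
}
```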
3.7.4. GMM Interpretation
The LSF quantizer has similarities with a Gaussian mixture model (GMM) based quantizer [15], where the first stage encodes the mean and the second stage uses the Cholesky decomposition of a tridiagonal approximation of the correlation matrix. What is different is the scaling of the residual vector by the IHMWs, and the fact that the quantized residuals are entropy coded with an entropy table that is trained rather than Gaussian.

3.8. Adaptive Inter-Frame Dependency
The presence of long-term prediction, or an adaptive codebook, is known to cause challenges when packet losses occur. The problem with LTP prediction is that the impulse response of the filter can be much longer than the packet itself. An often used technique is to reduce the LTP coefficients, see e.g. [11], which effectively shortens the impulse response of the LTP filter.

We have solved the problem in a different way: in Opus the LTP filter state is downscaled at the beginning of a packet while the LTP coefficients are kept unchanged. Downscaling the LTP state reduces the LTP prediction gain only in the first pitch period of the packet, so extra bits are only needed for encoding the higher residual energy during that first pitch period. Because of Jensen’s inequality, it is better to spend these extra bits up front. The scaling factor is quantized to one of three values and is thus transmitted with very few bits. Compared to scaling the LTP coefficients, downscaling the LTP state gives a more efficient trade-off between the increased bit rate caused by the lower LTP prediction gain and the encoder/decoder resynchronization speed, as illustrated in Figure 6.
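A sketch of this mechanism follows. Only the idea of scaling the synthesis history while leaving the LTP coefficients untouched is taken from the text above; the three-level scale table shown is an invented illustration, not the codec’s actual codebook:

```c
/* Sketch: downscale the LTP (long-term synthesis) state at a packet
 * boundary. `hist` holds the last max_lag samples of the synthesis
 * signal; the LTP coefficients themselves are left unchanged. */
static const float ltp_scales[3] = { 1.0f, 0.5f, 0.25f };  /* illustrative */

void downscale_ltp_state(float *hist, int max_lag, int scale_index)
{
    /* scale_index is one of three values, so it costs very few bits. */
    float s = ltp_scales[scale_index];
    for (int i = 0; i < max_lag; i++)
        hist[i] *= s;
}
```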
3.9. Entropy Coding
The quantized parameters and the excitation signal are all entropy coded using range coding, see [17].

3.10. Stereo Prediction
In stereo mode, Opus uses predictive stereo encoding [16], where it first encodes a mid channel as the average of the left and right speech signals. Next it computes the side channel as the difference between left and right, and both mid and side channels are split into low- and high-frequency bands. Each side channel band is then predicted from the corresponding mid band using a scalar predictor. The prediction-residual bands are combined to form the side residual signal S, which is coded independently from the mid channel M. The full approach is illustrated in Figure 7. The decoder goes through these same steps in reverse order.
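The encoder-side decomposition can be sketched as follows, here full-band for brevity (the codec applies the prediction per low/high band, and the predictor quantization is omitted; names and interface are our own):

```c
/* Sketch: predictive mid/side stereo encoding, full-band for brevity.
 * w is the scalar predictor from mid to side. */
void stereo_ms_predict(const float *left, const float *right, int n,
                       float w, float *mid, float *side_residual)
{
    for (int i = 0; i < n; i++) {
        mid[i]  = 0.5f * (left[i] + right[i]);  /* mid = average        */
        float s = left[i] - right[i];           /* side = difference    */
        side_residual[i] = s - w * mid[i];      /* predict side from mid */
    }
}
```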
4. DECODING
The predictive filtering consists of LTP and LPC. As shown in Figure 8, it is implemented in the decoder through the steps of parameter decoding, constructing the excitation, and long-term and short-term synthesis filtering. It has been a central design criterion to keep the decoder as simple as possible and to keep its computational complexity low.

5. LISTENING RESULTS
Subjective listening tests by Google [18] and Nokia [19] show that Opus outperforms most existing speech codecs at all but the lowest bitrates.
In [18], MUSHRA-type tests were used, and the following conclusions were made for WB and FB:

• Opus at 32 kbps is better than G.719 at 32 kbps.
• Opus at 20 kbps is better than Speex and G.722.1 at 24 kbps.
• Opus at 11 kbps is better than Speex at 11 kbps.

In [19], it is stated that:

• Hybrid mode provides excellent voice quality at bitrates from 20 to 40 kbit/s.

6. CONCLUSION
We have in this paper described the voice mode in Opus. The paper is intended to complement the paper about the music mode [3], for a complete description of the codec. The format of the paper makes it easier to approach than the more comprehensive RFC 6716 [2].

7. REFERENCES
[1] Opus Interactive Audio Codec, http://www.opus-codec.org/.
[2] J.-M. Valin, K. Vos, and T. B. Terriberry, “Definition of the Opus Audio Codec”, RFC 6716, http://www.ietf.org/rfc/rfc6716.txt, Amsterdam, The Netherlands, September 2012.
[3] J.-M. Valin, G. Maxwell, T. B. Terriberry, and K. Vos, “High-Quality, Low-Delay Music Coding in the Opus Codec”, Accepted at the AES 135th Convention, 2013.
[4] K. Vos, S. Jensen, and K. Sørensen, “SILK Speech Codec”, IETF Internet-Draft, http://tools.ietf.org/html/draft-vos-silk-02.
[5] J. Burg, “Maximum Entropy Spectral Analysis”, Proceedings of the 37th Annual International SEG Meeting, Vol. 6, 1975.
[6] K. Vos, “A Fast Implementation of Burg’s Method”, www.arxiv.org, 2013.
[7] P. Kabal and R. P. Ramachandran, “Joint Solutions for Formant and Pitch Predictors in Speech Processing”, Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (New York, NY), pp. 315-318, April 1988.
[8] H. W. Strube, “Linear Prediction on a Warped Frequency Scale”, Journal of the Acoustical So…