
The Opus Codec


To be presented at the 135th AES Convention
2013 October 17–20 New York, USA

This paper was accepted for publication at the 135th AES Convention. This version of the paper is from the authors
and not from the AES.

Voice Coding with Opus


Koen Vos, Karsten Vandborg Sørensen¹, Søren Skak Jensen², and Jean-Marc Valin³

¹ Microsoft, Applications and Services Group, Audio DSP Team, Stockholm, Sweden
² GN Netcom A/S, Ballerup, Denmark
³ Mozilla Corporation, Mountain View, CA, USA

Correspondence should be addressed to Koen Vos ([email protected])

ABSTRACT
In this paper, we describe the voice mode of the Opus speech and audio codec. As only the decoder is
standardized, the details in this paper will help anyone who wants to modify the encoder or gain a better
understanding of the codec. We go through the main components that constitute the voice part of the codec,
provide an overview, give insights, and discuss the design decisions made during the development. Tests have
shown that Opus quality is comparable to or better than several state-of-the-art voice codecs, while covering
a much broader application area than competing codecs.

1. INTRODUCTION

The Opus speech and audio codec [1] was standardized by the IETF as RFC 6716 in 2012 [2]. A companion paper [3] gives a high-level overview of the codec and explains its music mode. In this paper we discuss the voice part of Opus, and when we refer to Opus we refer to Opus in the voice mode only, unless explicitly specified otherwise.

Opus is a highly flexible codec, and in the following we outline the modes of operation. We only list what is supported in voice mode.

• Supported sample rates are shown in Table 1.

• Target bitrates down to 6 kbps are supported. Recommended bitrates for different sample rates are shown in Table 2.

• The frame duration can be 10 and 20 ms, and for NB, MB, and WB there is also support for 40 and 60 ms, where 40 and 60 ms are concatenations of 20 ms frames with some of the coding of the concatenated frames being conditional.

• Complexity mode can be set from 0 to 10, with 10 being the most complex mode.

  Sample Frequency | Name           | Acronym
  48 kHz           | Fullband       | FB
  24 kHz           | Super-wideband | SWB
  16 kHz           | Wideband       | WB
  12 kHz           | Mediumband     | MB
  8 kHz            | Narrowband     | NB

Table 1: Supported sample frequencies.

  Input Type | Mono       | Stereo
  FB         | 28-40 kbps | 48-72 kbps
  SWB        | 20-28 kbps | 36-48 kbps
  WB         | 16-20 kbps | 28-36 kbps
  MB         | 12-16 kbps | 20-28 kbps
  NB         | 8-12 kbps  | 14-20 kbps

Table 2: Recommended bitrate ranges.

Opus has several control options specifically for voice applications:

• Discontinuous Transmission (DTX). This reduces the packet rate when the input signal is classified as silent, letting the decoder's Packet-Loss Concealment (PLC) fill in comfort noise during the non-transmitted frames.

• Forward Error Correction (FEC). To aid packet-loss robustness, this adds a coarser description of a packet to the next packet. The decoder can use the coarser description if the earlier packet with the main description was lost, provided the jitter buffer latency is sufficient.

• Variable inter-frame dependency. This adjusts the dependency of the Long-Term Predictor (LTP) on previous packets by dynamically down-scaling the LTP state at frame boundaries. More down-scaling gives faster convergence to the ideal output after a lost packet, at the cost of lower coding efficiency.

The remainder of the paper is organized as follows: In Section 2 we start by introducing the coding models. Then, in Section 3, we go through the main functions in the encoder, and in Section 4 we briefly go through the decoder. We then discuss listening results in Section 5, and finally we provide conclusions in Section 6.

2. CODING MODELS

The Opus standard defines models based on the Modified Discrete Cosine Transform (MDCT) and on Linear-Predictive Coding (LPC). For voice signals, the LPC model is used for the lower part of the spectrum, with the MDCT coding taking over above 8 kHz. The LPC-based model is based on the SILK codec, see [4]. Only frequency bands between 8 and (up to) 20 kHz¹ are coded with MDCT. For details on the MDCT-based model, we refer to [3]. As evident from Table 3, there are no frequency ranges for which both models are in use.

  Sample Frequency | LPC     | MDCT
  48 kHz           | 0-8 kHz | 8-20 kHz¹
  24 kHz           | 0-8 kHz | 8-12 kHz
  16 kHz           | 0-8 kHz | -
  12 kHz           | 0-6 kHz | -
  8 kHz            | 0-4 kHz | -

Table 3: Model uses at different sample frequencies, for voice signals.

¹ Opus never codes audio above 20 kHz, as that is the upper limit of human hearing.

The advantage of using a hybrid of these two models is that for speech, linear prediction techniques, such as Code-Excited Linear Prediction (CELP), code low frequencies more efficiently than transform (e.g., MDCT) domain techniques, while for high speech frequencies this advantage diminishes and transform coding has better numerical and complexity characteristics. A codec that combines the two models can achieve better quality at a wider range of sample frequencies than by using either one alone.

3. ENCODER

The Opus encoder operates on frames of either 10 or 20 ms, which are divided into 5 ms subframes. The following paragraphs describe the main components of the encoder. We refer to Figure 1 for an overview of how the individual functions interact.


Fig. 1: Encoder block diagram.

3.1. VAD
The Voice Activity Detector (VAD) generates a measure of speech activity by combining the signal-to-noise ratios (SNRs) from 4 separate frequency bands. In each band the background noise level is estimated by smoothing the inverse energy over time frames. Multiplying this smoothed inverse energy with the subband energy gives the SNR.
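To make the band-wise computation concrete, here is a rough numpy sketch (illustrative only: the number of bands is taken from the text, but the smoothing constant, variable names, and update-rule details are assumptions rather than the bit-exact SILK code):

```python
import numpy as np

def band_snrs(band_energies, smoothed_inv_energy, alpha=0.95):
    """Update the smoothed inverse energy per band and return band SNRs.

    band_energies:       energy of the current frame in each of 4 bands
    smoothed_inv_energy: running inverse-energy estimate, updated in place
    alpha:               smoothing constant (placeholder value)
    """
    # First-order smoothing of the inverse energy over time frames;
    # quiet frames dominate, so this tracks the background noise level.
    smoothed_inv_energy[:] = alpha * smoothed_inv_energy + \
        (1.0 - alpha) / np.maximum(band_energies, 1e-9)
    # SNR per band: subband energy times smoothed inverse noise energy.
    return band_energies * smoothed_inv_energy
```

The speech activity measure is then a combination of these four per-band SNRs.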
3.2. HP Filter
A high-pass (HP) filter with a variable cutoff frequency between 60 and 100 Hz removes low-frequency background and breathing noise. The cutoff frequency depends on the SNR in the lowest frequency band of the VAD, and on the smoothed pitch frequencies found in the pitch analysis, so that high-pitched voices will have a higher cutoff frequency.
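For illustration, such a variable-cutoff filter could be realized as follows (a sketch using scipy's Butterworth design; the filter order and the use of scipy are assumptions, not the reference encoder's own filter):

```python
import numpy as np
from scipy.signal import butter, lfilter

def variable_highpass(x, fs, cutoff_hz):
    """Second-order Butterworth high-pass with a variable cutoff.

    cutoff_hz would be driven by the VAD's low-band SNR and the
    smoothed pitch frequency, clamped to the 60-100 Hz range.
    """
    cutoff_hz = float(np.clip(cutoff_hz, 60.0, 100.0))
    b, a = butter(2, cutoff_hz, btype="highpass", fs=fs)
    return lfilter(b, a, x)
```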
3.3. Pitch Analysis
As shown in Figure 2, the pitch analysis begins by pre-whitening the input signal, with a filter of order between 6 and 16 depending on the complexity mode. The whitening makes the pitch analysis equally sensitive to all parts of the audio spectrum, thus reducing the influence of a strong individual harmonic. It also improves the accuracy of the correlation measure used later to classify the signal as voiced or unvoiced.

The whitened signal is then downsampled in two steps to 8 and 4 kHz, to reduce the complexity of computing correlations. A first analysis step finds peaks in the autocorrelation of the most downsampled signal to obtain a small number of coarse pitch lag candidates. These are input to a finer analysis step running at 8 kHz, searching only around the preliminary estimates. After applying a small bias towards shorter lags to avoid pitch doubling, a single candidate pitch lag with highest correlation is found.

The candidate's correlation value is compared to a threshold that depends on a weighted combination of:

• Signal type of the previous frame.

• Speech activity level.

• The slope of the SNR found in the VAD with respect to frequency.

If the correlation is below the threshold, the signal is classified as unvoiced and the pitch analysis is aborted without returning a pitch lag estimate.

The final analysis step operates on the input sample frequency (8, 12, or 16 kHz), and searches for integer-sample pitch lags around the previous stage's estimate, limited to a range of 55.6 to 500 Hz. For each lag being evaluated, a set of pitch contours from a codebook is tested. These pitch contours define a deviation from the average pitch lag per 5 ms subframe, thus allowing the pitch to vary within a frame. Between 3 and 34 pitch contour vectors are available, depending on the sampling rate and frame size. The pitch lag and contour index resulting in the highest correlation value are encoded and transmitted to the decoder.


Fig. 2: Block diagram of the pitch analysis.

3.3.1. Correlation Measure
Most correlation-based pitch estimators normalize the correlation with the geometric mean of the energies of the vectors being correlated:

    C = \frac{x^T y}{\sqrt{x^T x \cdot y^T y}},    (1)

whereas Opus normalizes with the arithmetic mean:

    C_{Opus} = \frac{x^T y}{\frac{1}{2}(x^T x + y^T y)}.    (2)

This correlation measures similarity not just in shape, but also in scale. Two vectors with very different energies will have a lower correlation, similar to frequency-domain pitch estimators.
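The difference is easy to verify numerically (a self-contained numpy check):

```python
import numpy as np

def corr_geometric(x, y):
    # Eq. (1): classic normalization, insensitive to a gain difference.
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

def corr_opus(x, y):
    # Eq. (2): penalizes an energy mismatch between the two vectors,
    # so similarity in scale matters as well as similarity in shape.
    return (x @ y) / (0.5 * ((x @ x) + (y @ y)))

x = np.sin(np.linspace(0.0, 20.0 * np.pi, 400))
print(corr_geometric(x, 0.1 * x))  # 1.0: same shape, gain ignored
print(corr_opus(x, 0.1 * x))       # ~0.198: scale mismatch lowers score
```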
3.4. Prediction Analysis
As described in Section 3.3, the input signal is pre-whitened as part of the pitch analysis. The pre-whitened signal is passed to the prediction analysis in addition to the input signal. The signal at this point is classified as being either voiced or unvoiced. We describe these two cases in Sections 3.4.1 and 3.4.2.

3.4.1. Voiced Speech
The long-term prediction (LTP) of voiced signals is implemented with a fifth-order filter. The LTP coefficients are estimated from the pre-whitened input signal with the covariance method for every 5 ms subframe. The coefficients are quantized and used to filter the input signal (without pre-whitening) to find an LTP residual. This signal is input to the LPC analysis, where Burg's method [5] is used to find short-term prediction coefficients. Burg's method provides higher prediction gain than the autocorrelation method and, unlike the covariance method, it produces stable filter coefficients. The LPC order is N_LPC = 16 for FB, SWB, and WB, and N_LPC = 10 for MB and NB. A novel implementation of Burg's method reduces its complexity to near that of the autocorrelation method [6]. Also, the signal in each subframe is scaled by the inverse of the quantization step size in that subframe before applying Burg's method. This is done to find the coefficients that minimize the number of bits necessary to encode the residual signal of the frame, rather than minimizing the energy of the residual signal.

Computing LPC coefficients based on the LTP residual rather than on the input signal approximates a joint optimization of these two sets of coefficients [7]. This increases the prediction gain, thus reducing the bitrate. Moreover, because the LTP prediction is typically most effective at low frequencies, it reduces the dynamic range of the AR spectrum defined by the LPC coefficients. This helps with the numerical properties of the LPC analysis and filtering, and avoids the need for any pre-emphasis filtering found in other codecs.

3.4.2. Unvoiced Speech
For unvoiced signals, the pre-whitened signal is discarded and Burg's method is used directly on the input signal.
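For reference, the standard Burg recursion can be written compactly as follows (a plain-numpy textbook sketch, not the reduced-complexity implementation of [6]):

```python
import numpy as np

def burg(x, order):
    """Burg's method: returns a(1..order) for the prediction filter
    1 - sum_k a(k) z^-k (the sign convention of Eq. (4))."""
    x = np.asarray(x, dtype=float)
    f, b = x[1:].copy(), x[:-1].copy()   # forward/backward errors
    A = np.array([1.0])                  # A(z) = 1 + sum a'(k) z^-k
    for _ in range(order):
        # Reflection coefficient minimizing forward + backward error power.
        k = -2.0 * (f @ b) / ((f @ f) + (b @ b) + 1e-12)
        # Levinson-style polynomial update; |k| <= 1 guarantees stability.
        A = np.concatenate([A, [0.0]]) + k * np.concatenate([[0.0], A[::-1]])
        # Update the error sequences and re-align them for the next stage.
        f, b = (f + k * b)[1:], (b + k * f)[:-1]
    return -A[1:]
```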


The LPC coefficients (for either voiced or unvoiced speech) are converted to Line Spectral Frequencies (LSFs), quantized, and used to re-calculate the LPC residual, taking into account the LSF quantization effects. Section 3.7 describes the LSF quantization.
3.5. Noise Shaping
Quantization noise shaping is used to exploit the properties of the human auditory system.

A typical state-of-the-art speech encoder determines the excitation signal by minimizing the perceptually-weighted reconstruction error. The decoder then uses a postfilter on the reconstructed signal to suppress spectral regions where the quantization noise is expected to be high relative to the signal. Opus combines these two functions in the encoder's quantizer by applying different weighting filters to the input and reconstructed signals in the noise shaping configuration of Figure 3. Integrating the two operations on the encoder side not only simplifies the decoder, it also lets the encoder use arbitrarily simple or sophisticated perceptual models to simultaneously and independently shape the quantization noise and boost/suppress spectral regions. Such different models can be used without spending bits on side information or changing the bitstream format. As an example of this, Opus uses warped noise shaping filters at higher complexity settings, as the frequency-dependent resolution of these filters better matches human hearing [8]. Separating the noise shaping from the linear prediction also lets us select prediction coefficients that minimize the bitrate without regard for perceptual considerations.

A diagram of the Noise Shaping Quantization (NSQ) is shown in Figure 3. Unlike typical noise shaping quantizers, where the noise shaping sits directly around the quantizer and feeds back to the input, in Opus the noise shaping compares the input and output speech signals and feeds to the input of the quantizer. This was first proposed in Figure 3 of [9]. More details of the NSQ module are described in Section 3.5.2.

3.5.1. Noise Shaping Analysis
The Noise Shaping Analysis (NSA) function finds gains and filter coefficients used by the NSQ to shape the signal spectrum, with the following purposes:

• Spectral shaping of the quantization noise similarly to the speech spectrum to make it less audible.

• Suppressing the spectral valleys in between formant and harmonic peaks to make the signal less noisy and more predictable.

For each subframe, a quantization gain (or step size) is chosen and sent to the decoder. This quantization gain determines the tradeoff between quantization noise and bitrate.

Furthermore, a compensation gain and a spectral tilt are found to match the decoded speech level and tilt to those of the input signal.

The filtering of the input signal is done using the filter

    H(z) = G \cdot (1 - c_{tilt} \cdot z^{-1}) \cdot \frac{W_{ana}(z)}{W_{syn}(z)},    (3)

where G is the compensation gain, and c_tilt is the tilt coefficient in a first-order tilt adjustment filter. The analysis filter is, for voiced speech, given by

    W_{ana}(z) = \left( 1 - \sum_{k=1}^{N_{LPC}} a_{ana}(k) \cdot z^{-k} \right)    (4)
                 \cdot \left( 1 - z^{-L} \sum_{k=-2}^{2} b_{ana}(k) \cdot z^{-k} \right),    (5)

and similarly for the synthesis filter W_syn(z). N_LPC is the LPC order and L is the pitch lag in samples. For unvoiced speech, the last term (5) is omitted to disable harmonic noise shaping.

The short-term noise shaping coefficients a_ana(k) and a_syn(k) are calculated from the LPC of the input signal a(k) by applying different amounts of bandwidth expansion, i.e.,

    a_{ana}(k) = a(k) \cdot g_{ana}^{k}, and    (6)
    a_{syn}(k) = a(k) \cdot g_{syn}^{k}.    (7)

The bandwidth expansion moves the roots of the LPC polynomial towards the origin, and thereby flattens the spectral envelope described by a(k). The bandwidth expansion factors are given by

    g_{ana} = 0.95 - 0.01 \cdot C, and    (8)
    g_{syn} = 0.95 + 0.01 \cdot C,    (9)


Fig. 3: Predictive Noise Shaping Quantizer.

where C ∈ [0, 1] is a coding quality control parameter. By applying more bandwidth expansion to the analysis part than to the synthesis part, we de-emphasize the spectral valleys.

The harmonic noise shaping applied to voiced frames has three filter taps,

    b_{ana} = F_{ana} \cdot [0.25, 0.5, 0.25], and    (10)
    b_{syn} = F_{syn} \cdot [0.25, 0.5, 0.25],    (11)

where the multipliers F_ana and F_syn ∈ [0, 1] are calculated from:

• The coding quality control parameter. This makes the decoded signal more harmonic, and thus easier to encode, at low bitrates.

• Pitch correlation. Highly periodic input signals are given more harmonic noise shaping to avoid audible noise between harmonics.

• The estimated input SNR below 1 kHz. This filters out background noise for a noisy input signal by applying more harmonic emphasis.

Similar to the short-term shaping, having F_ana < F_syn emphasizes pitch harmonics and suppresses the signal in between the harmonics.

The tilt coefficient c_tilt is calculated as

    c_{tilt} = 0.25 + 0.2625 \cdot V,    (12)

where V ∈ [0, 1] is a voice activity level which, in this context, is forced to 0 for unvoiced speech.

Finally, the compensation gain G is calculated as the ratio of the prediction gains of the short-term prediction filters a_ana and a_syn.

An example of short-term noise shaping of a speech spectrum is shown in Figure 4. The weighted input and quantization noise combine to produce an output with a spectral envelope similar to the input signal.

Fig. 4: Example of how the noise shaping operates on a speech spectrum. The frame is classified as unvoiced for illustrative purposes, showing only short-term noise shaping.
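Pulling Eqs. (6)-(12) together, the coefficient derivation in the NSA can be sketched like this (illustrative numpy; the computation of F_ana and F_syn from the factors listed above is omitted and they are taken as inputs):

```python
import numpy as np

def noise_shaping_coeffs(a, C, V, F_ana, F_syn):
    """Derive the NSQ shaping coefficients from the input-signal LPC a(k).

    a: LPC coefficients a(1..N_LPC); C: coding quality in [0, 1];
    V: voice activity in [0, 1] (0 for unvoiced frames);
    F_ana, F_syn: harmonic shaping multipliers in [0, 1], assumed given.
    """
    k = np.arange(1, len(a) + 1)
    g_ana, g_syn = 0.95 - 0.01 * C, 0.95 + 0.01 * C   # Eqs. (8), (9)
    a_ana, a_syn = a * g_ana**k, a * g_syn**k         # Eqs. (6), (7)
    taps = np.array([0.25, 0.5, 0.25])
    b_ana, b_syn = F_ana * taps, F_syn * taps         # Eqs. (10), (11)
    c_tilt = 0.25 + 0.2625 * V                        # Eq. (12)
    return a_ana, a_syn, b_ana, b_syn, c_tilt
```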


3.5.2. Noise Shaping Quantization
The NSQ module quantizes the residual signal and thereby generates the excitation signal.

A simplified block diagram of the NSQ is shown in Figure 5. In this figure, P(z) is the predictor containing both the LPC and LTP filters. F_ana(z) and F_syn(z) are the analysis and synthesis noise shaping filters, and for voiced speech they each consist of both long-term and short-term filters. The quantized excitation indices are denoted i(n). The LTP coefficients, gains, and noise shaping coefficients are updated for every subframe, whereas the LPC coefficients are updated every frame.

Fig. 5: Noise Shaping Quantization block diagram.

Substituting the quantizer Q with the addition of a quantization noise signal q(n), the output of the NSQ is given by

    Y(z) = G \cdot \frac{1 - F_{ana}(z)}{1 - F_{syn}(z)} \cdot X(z) + \frac{1}{1 - F_{syn}(z)} \cdot Q(z).    (13)

The first part of the equation is the input signal shaping part, and the second part is the quantization noise shaping part.

3.5.3. Trellis Quantizer
The quantizer Q in the NSQ block diagram is a trellis quantizer, implemented as a uniform scalar quantizer with a variable offset. This offset depends on the output of a pseudorandom generator, implemented with linear congruential recursions on previous quantization decisions within the same frame [12]. Since the quantization error for each residual sample now depends on previous quantization decisions, both because of the trellis nature of the quantizer and through the shaping and prediction filters, improved R-D performance is achieved by implementing a Viterbi delayed-decision mechanism [13]. The number of different Viterbi states to track, N ∈ [2, 4], and the number of samples delay, D ∈ [16, 32], are functions of the complexity setting. At the lowest complexity levels, each sample is simply coded independently.

3.6. Pulse Coding
The integer-valued excitation signal, which is the output from the NSQ, is entropy coded in blocks of 16 samples. First the signal is split into its absolute values, called pulses, and signs. Then the total sum of pulses per block is coded. Next we repeatedly split each block in two equal parts, each time encoding the allocation of pulses to each half, until sub-blocks either have length one or contain zero pulses. Finally, the signs for non-zero samples are encoded separately. The range coding tables for the splits are optimized for a large training database.
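The recursive splitting can be sketched as follows (a toy encoder that merely lists the decisions a range coder would encode; the trained probability tables are left out):

```python
def split_symbols(pulses):
    """List (block_total, left_half_total) decisions for one block.

    pulses: the absolute values of one excitation block. Returns the
    sequence of allocations that would be range coded, each conditioned
    on its block's total pulse count.
    """
    out = []
    def split(block, total):
        if total == 0 or len(block) == 1:
            return                       # nothing further to describe
        half = len(block) // 2
        left = sum(block[:half])
        out.append((total, left))        # encode: 'left' of 'total' pulses
        split(block[:half], left)
        split(block[half:], total - left)
    split(pulses, sum(pulses))
    return out

# Example with an 8-sample block (real Opus blocks are 16 samples):
print(split_symbols([0, 2, 0, 1, 0, 0, 3, 0]))
# [(6, 3), (3, 2), (2, 0), (1, 0), (3, 0), (3, 3)]
```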
3.7. LSF Quantization
The LSF quantizer consists of a VQ stage with 32 codebook vectors, followed by a scalar quantization stage with inter-LSF prediction. All quantization indices are entropy coded, and the entropy coding tables selected for the second stage depend on the quantization index from the first. Consequently, the LSF quantizer uses a variable bitrate, which lowers the average R-D error and reduces the impact of outliers.

3.7.1. Tree Search
As proposed in [14], the error signals from the N best quantization candidates from the first stage are all used as input for the next stage. After the second stage, the best combined path is chosen. By varying the number N, we get a means for adjusting the trade-off between a low rate-distortion (R-D) error and a high computational complexity. The same principle is used in the NSQ, see Section 3.5.3.
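In outline, the N-best search looks like this (hypothetical stage interfaces; a real R-D search would also include rate terms and the error-sensitivity weighting of Section 3.7.2):

```python
import numpy as np

def two_stage_search(x, stage1_codebook, quantize_stage2, n_best=4):
    """Refine the N best first-stage candidates in the second stage and
    return the combination with the lowest total distortion."""
    # Stage 1: rank the codebook vectors by squared error against x.
    errs = ((stage1_codebook - x) ** 2).sum(axis=1)
    best = None
    for idx in np.argsort(errs)[:n_best]:
        residual = x - stage1_codebook[idx]
        idx2, dist2 = quantize_stage2(residual)  # second-stage quantizer
        if best is None or dist2 < best[0]:
            best = (dist2, idx, idx2)
    return best  # (distortion, first-stage index, second-stage indices)
```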

3.7.2. Error Sensitivity
Whereas input vectors to the first stage are unweighted, the residual input to the second stage is scaled by the square roots of the Inverse Harmonic Mean Weights (IHMWs) proposed by Laroia et al. in [10]. The IHMWs are calculated from the coarsely-quantized reconstruction found in the first stage, so that encoder and decoder can use the exact same weights. The application of the weights partially normalizes the error sensitivity for the second-stage input vector, and it enables a uniform quantizer with fixed step size to be used without too much loss in quality.
3.7.3. Scalar Quantization
The second stage uses predictive delayed-decision scalar quantization. The predictor multiplies the previous quantized residual value by a prediction coefficient that depends on the vector index from the first-stage codebook as well as the index for the current scalar in the residual vector. The predicted value is subtracted from the second-stage input value before quantization and is added back afterwards. This creates a dependency of the current decision on the previous quantization decision, which again is exploited in a Viterbi-like delayed-decision algorithm to choose the sequence of quantization indices yielding the lowest R-D error.
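Per scalar, the predict/quantize/feed-back loop might be sketched as follows (a greedy version without the Viterbi-like delayed decision; names are placeholders):

```python
def quantize_residual(residual, pred_coefs, step):
    """Second-stage scalar quantization with one-tap prediction.

    residual:   second-stage input vector (IHMW-weighted LSF residual)
    pred_coefs: pred_coefs[i] is the prediction coefficient for scalar i,
                selected by the first-stage vector index
    step:       uniform quantizer step size
    """
    indices, prev_q = [], 0.0
    for i, r in enumerate(residual):
        pred = pred_coefs[i] * prev_q    # predict from the last decision
        idx = round((r - pred) / step)   # uniform scalar quantizer
        indices.append(idx)
        prev_q = pred + idx * step       # reconstruction fed back
    return indices
```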
3.7.4. GMM Interpretation
The LSF quantizer has similarities with a Gaussian mixture model (GMM) based quantizer [15], where the first stage encodes the mean and the second stage uses the Cholesky decomposition of a tridiagonal approximation of the correlation matrix. What is different is the scaling of the residual vector by the IHMWs, and the fact that the quantized residuals are entropy coded with an entropy table that is trained rather than Gaussian.
3.8. Adaptive Inter-Frame Dependency
The presence of long-term prediction, or an adaptive codebook, is known to give challenges when packet losses occur. The problem with LTP prediction is due to the impulse response of the filter, which can be much longer than the packet itself. An often-used technique is to reduce the LTP coefficients, see e.g. [11], which effectively shortens the impulse response of the LTP filter.

We have solved the problem in a different way: in Opus, the LTP filter state is downscaled at the beginning of a packet while the LTP coefficients are kept unchanged. Downscaling the LTP state reduces the LTP prediction gain only in the first pitch period of the packet, and therefore extra bits are only needed for encoding the higher residual energy during that first pitch period. Because of Jensen's inequality, it is better to fork out the bits upfront and be done with it. The scaling factor is quantized to one of three values and is thus transmitted with very few bits. Compared to scaling the LTP coefficients, downscaling the LTP state gives a more efficient trade-off between the increased bit rate caused by lower LTP prediction gain and the encoder/decoder resynchronization speed, as illustrated in Figure 6.

Fig. 6: Illustration of convergence speed after a packet loss, measured as the SNR of the zero-state LTP filter response. The traditional solution means standard LTP. Constrained is the method in [11], where the LTP prediction gain is constrained, which adds 1/4 bit per sample. Reduced ACB is the Opus method. The experiment is made with a pitch lag of 1/4 packet length, meaning that the Opus method can add 1 bit per sample in the first pitch period in order to balance the extra rate for constrained LTP. The unconstrained LTP prediction gain is set to 12 dB, and high-rate quantization theory is assumed (1 bit/sample ↔ 6 dB SNR). After 5 packets, the Opus method outperforms the alternative methods by more than 2 dB, and the standard by 4 dB.
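The mechanism itself is simple (an illustrative sketch; the three scaling factors shown are placeholders, not the quantization table used by Opus):

```python
import numpy as np

def begin_packet(ltp_state, scale_idx, codebook=(1.0, 0.875, 0.75)):
    """Downscale the LTP state (past excitation) at a packet boundary.

    The LTP coefficients stay unchanged; only predictions whose lag
    reaches back into the scaled region, i.e. those in the first pitch
    period of the packet, lose prediction gain. scale_idx selects one
    of three quantized factors and is transmitted with very few bits.
    """
    return codebook[scale_idx] * np.asarray(ltp_state, dtype=float)
```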
3.9. Entropy Coding
The quantized parameters and the excitation signal are all entropy coded using range coding, see [17].

3.10. Stereo Prediction
In stereo mode, Opus uses predictive stereo encoding [16], where it first encodes a mid channel as the average of the left and right speech signals. Next it computes the side channel as the difference between left and right, and both mid and side channels are split into low- and high-frequency bands. Each side channel band is then predicted from the corresponding mid band using a scalar predictor. The prediction-residual bands are combined to form the side residual signal S, which is coded independently from the mid channel M. The full approach is illustrated in Figure 7. The decoder goes through these same steps in reverse order.
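A minimal sketch of the analysis for one band (numpy; the band splitting and the estimation/quantization of the predictor are omitted, and predictor_gain is assumed given):

```python
import numpy as np

def stereo_analysis(left, right, predictor_gain):
    """Form the mid channel and the predicted side residual for one band."""
    mid = 0.5 * (left + right)
    side = left - right
    residual = side - predictor_gain * mid   # coded independently of mid
    return mid, residual
```

The decoder reverses the steps: side = residual + predictor_gain * mid, then left = mid + side / 2 and right = mid - side / 2.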


Fig. 7: Stereo prediction block diagram.


4. DECODING
The predictive filtering consists of LTP and LPC. As shown in Figure 8, it is implemented in the decoder through the steps of parameter decoding, constructing the excitation, followed by long-term and short-term synthesis filtering. It has been a central design criterion to keep the decoder as simple as possible and to keep its computational complexity low.

Fig. 8: Decoder side linear prediction block diagram.
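Schematically, the decoder-side synthesis filtering could look like this (an illustrative pseudo-implementation; history handling is simplified, parameters are per-subframe in reality, and unvoiced frames skip the LTP step):

```python
import numpy as np

def synthesize(excitation, ltp_coefs, ltp_lag, lpc, res_hist, out_hist):
    """Long-term, then short-term synthesis filtering of the excitation.

    res_hist / out_hist hold enough past residual / output samples
    (at least ltp_lag + 2 and len(lpc), respectively).
    """
    exc = np.asarray(excitation, dtype=float)
    res = np.concatenate([res_hist, np.zeros_like(exc)])
    n0 = len(res_hist)
    for i in range(len(exc)):
        n = n0 + i
        # LTP synthesis: add a 5-tap filtered copy of the residual
        # centered one pitch lag in the past.
        res[n] = exc[i] + res[n - ltp_lag - 2 : n - ltp_lag + 3] @ ltp_coefs
    out = np.concatenate([out_hist, np.zeros_like(exc)])
    m0 = len(out_hist)
    for i in range(len(exc)):
        n = m0 + i
        # LPC synthesis: out[n] = res[n] + sum_k lpc[k] * out[n - 1 - k].
        out[n] = res[n0 + i] + out[n - len(lpc):n][::-1] @ lpc
    return out[m0:]
```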

5. LISTENING RESULTS
Subjective listening tests by Google [18] and Nokia [19] show that Opus outperforms most existing speech codecs at all but the lowest bitrates.

In [18], MUSHRA-type tests were used, and the following conclusions were made for WB and FB:

• Opus at 32 kbps is better than G.719 at 32 kbps.

• Opus at 20 kbps is better than Speex and G.722.1 at 24 kbps.

• Opus at 11 kbps is better than Speex at 11 kbps.

In [19], it is stated that:

• Hybrid mode provides excellent voice quality at bitrates from 20 to 40 kbit/s.

6. CONCLUSION
We have in this paper described the voice mode in Opus. The paper is intended to complement the paper about the music mode [3], for a complete description of the codec. The format of the paper makes it easier to approach than the more comprehensive RFC 6716 [2].

7. REFERENCES

[1] Opus Interactive Audio Codec, http://www.opus-codec.org/.

[2] J.-M. Valin, K. Vos, and T. B. Terriberry, "Definition of the Opus Audio Codec", RFC 6716, http://www.ietf.org/rfc/rfc6716.txt, Amsterdam, The Netherlands, September 2012.

[3] J.-M. Valin, G. Maxwell, T. B. Terriberry, and K. Vos, "High-Quality, Low-Delay Music Coding in the Opus Codec", Accepted at the AES 135th Convention, 2013.

[4] K. Vos, S. Jensen, and K. Sørensen, "SILK Speech Codec", IETF Internet-Draft, http://tools.ietf.org/html/draft-vos-silk-02.

[5] J. Burg, "Maximum Entropy Spectral Analysis", Proceedings of the 37th Annual International SEG Meeting, Vol. 6, 1975.

[6] K. Vos, "A Fast Implementation of Burg's Method", www.arxiv.org, 2013.

[7] P. Kabal and R. P. Ramachandran, "Joint Solutions for Formant and Pitch Predictors in Speech Processing", Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (New York, NY), pp. 315-318, April 1988.

[8] H. W. Strube, "Linear Prediction on a Warped Frequency Scale", Journal of the Acoustical Society of America, vol. 68, no. 4, pp. 1071-1076, Oct. 1980.


[9] B. Atal and M. Schroeder, "Predictive Coding of Speech Signals and Subjective Error Criteria", IEEE Trans. on Acoustics, Speech, and Signal Processing, pp. 247-254, July 1979.

[10] R. Laroia, N. Phamdo, and N. Farvardin, "Robust and Efficient Quantization of Speech LSP Parameters Using Structured Vector Quantization", ICASSP-1991, Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 641-644, October 1991.

[11] M. Chibani, P. Gournay, and R. Lefebvre, "Increasing the Robustness of CELP-Based Coders by Constrained Optimization", Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, March 2005.

[12] J. B. Anderson, T. Eriksson, and M. Novak, "Trellis Source Codes Based on Linear Congruential Recursions", Proc. IEEE International Symposium on Information Theory, 2003.

[13] E. Ayanoglu and R. M. Gray, "The Design of Predictive Trellis Waveform Coders Using the Generalized Lloyd Algorithm", IEEE Trans. on Communications, Vol. 34, pp. 1073-1080, November 1986.

[14] J. B. Bodie, "Multi-path Tree-Encoding for Analog Data Sources", Commun. Res. Lab., Fac. Eng., McMaster Univ., Hamilton, Ont., Canada, CRL Int. Rep., Series CRL-20, 1974.

[15] P. Hedelin and J. Skoglund, "Vector Quantization Based on Gaussian Mixture Models", IEEE Trans. Speech and Audio Processing, vol. 8, no. 4, pp. 385-401, Jul. 2000.

[16] H. Krüger and P. Vary, "A New Approach for Low-Delay Joint-Stereo Coding", ITG-Fachtagung Sprachkommunikation, VDE Verlag GmbH, Oct. 2008.

[17] G. N. N. Martin, "Range Encoding: An Algorithm for Removing Redundancy from a Digitized Message", Video & Data Recording Conference, Southampton, UK, July 24-27, 1979.

[18] J. Skoglund, "Listening Tests of Opus at Google", IETF, 2011.

[19] A. Rämö and H. Toukomaa, "Voice Quality Characterization of IETF Opus Codec", Interspeech, 2011.
