Neural Network Decoders for Surface Codes
Chains of errors that span both boundaries of the same type can be regarded as a logical operator, with n being the number of data qubits that are included in the logical operator. Typically, the logical operator with the smallest n is selected. For instance, in Figure 1, a logical X̄ can be performed by applying these three bit-flip operations: X0 X3 X6.

An important feature of the surface code is the code distance. The code distance, referred to as d, describes the degree of protection against errors. More precisely, it is the minimum number of physical operations required to change the logical state [1][18]. In the surface code, the degree of errors (d.o.e.) that can be successfully corrected is calculated according to the following equation:

d.o.e. = (d − 1) / 2    (1)

Therefore, for a d=3 code, only single X- and Z-type errors (degree = 1) can be guaranteed to be corrected successfully.

One of the smallest surface codes, which is currently being experimentally implemented, is the rotated surface code presented in Figure 1. It consists of 9 data qubits placed at the corners of the square tiles and 8 ancilla qubits placed inside the square and semi-circle tiles. Each ancilla qubit can interact with its neighboring 4 (square tile) or 2 (semi-circle tile) data qubits.

FIG. 1. Rotated surface code with code distance 3. Data qubits are enumerated from 0 to 8 (D0-D8). X-type ancilla are in the center of the white tiles and Z-type ancilla are in the center of the grey tiles.

As mentioned, ancilla qubits are used to detect errors in the data qubits. Although quantum errors are continuous, the measurement outcome of each ancilla is discretized and then forwarded to the decoding algorithm. Quantum errors are discretized into bit-flip (X) and phase-flip (Z) errors, which can be detected by Z-type ancilla and X-type ancilla, respectively.

FIG. 2. Syndrome extraction circuit for individual Z-type (left) and X-type (right) ancilla, with the ancilla placed at the bottom. The ancilla qubit resides at the center of each grey or white tile, respectively.

The circuit that is used to collect the ancilla measurements for the surface code is known as the syndrome extraction circuit. It is presented in Figure 2 and it signifies one round of error correction. It includes the preparation of the ancilla in the appropriate state, followed by 4 (2) CNOT gates that entangle the ancilla qubit with its 4 (2) neighboring data qubits, and then the measurement of the ancilla qubit in the appropriate basis. The measurement result of the ancilla is a parity-check, which is a value calculated as the parity between the states of the data qubits connected to it. Each ancilla performs a parity-check of the form X⊗4/Z⊗4 (square tile) or X⊗2/Z⊗2 (semi-circle tile), as presented in Figure 2. When the state of the data qubits involved in a parity-check has not changed, the parity-check will return the same value as in the previous error correction cycle. In the case where the state of an odd number of data qubits involved in a parity-check has changed compared to the previous error correction cycle, the parity-check will return a different value than the one of the previous cycle (0 ↔ 1). The change of a parity-check in consecutive error correction cycles is known as a detection event.

Note that the parity-checks are used to identify errors in the data qubits without having to measure the data qubits explicitly and collapse their state. The state of the ancilla qubit at the end of every parity-check is collapsed through the ancilla measurement, but it is initialized once more at the beginning of the next error correction cycle [19].

The parity-checks must conform to the following rules: i) they must commute with each other, ii) they must anti-commute with errors and iii) they must commute with the logical operators. An easier way to describe these parity-checks is to view them as a matrix, as presented in Figure 1, where a matrix containing the 4 X-type parity-checks and a matrix containing the 4 Z-type parity-checks for the d=3 rotated surface code are shown. The notation Di refers to the ith data qubit used in a given parity-check. A 1 in the matrix represents that a data qubit is involved in the parity-check.

Gathering all measurement outcomes forms the error syndrome. Surface codes can be decoded by collecting the ancilla measurements out of one or multiple rounds of error correction and providing them to a decoding algorithm that identifies the errors and outputs data qubit corrections.
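As a concrete illustration of how an error syndrome is obtained from the parity-checks, the sketch below represents the X-type checks of a d=3 rotated surface code as a binary matrix and computes the syndrome of an error vector as a matrix-vector product modulo 2. The specific assignment of data qubits to checks (H_X) is an assumption chosen to be consistent with the worked example later in the text, not necessarily the exact labeling of Figure 1.

```python
import numpy as np

# Illustrative X-type parity-check matrix for a distance-3 rotated surface
# code with data qubits D0-D8 on a 3x3 grid (an assumed labeling, not
# necessarily the exact one of Figure 1). Row i lists the data qubits
# involved in ancilla AXi's parity-check; a 1 means "involved".
H_X = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0, 0],   # AX0: weight-2 boundary check
    [0, 1, 1, 0, 1, 1, 0, 0, 0],   # AX1: weight-4 bulk check
    [0, 0, 0, 1, 1, 0, 1, 1, 0],   # AX2: weight-4 bulk check
    [0, 0, 0, 0, 0, 0, 0, 1, 1],   # AX3: weight-2 boundary check
], dtype=np.uint8)

def syndrome(H, error):
    """Parity-checks of a binary error vector: each ancilla reports the
    parity of the data qubits it touches (computed mod 2)."""
    return (H @ error) % 2

# A single Z error on data qubit 4 flips the two X-checks that contain it.
z_error = np.zeros(9, dtype=np.uint8)
z_error[4] = 1
print(syndrome(H_X, z_error))   # -> [0 1 1 0], detection events on AX1 and AX2
```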
We provide a simple example of decoding for the d=3 rotated surface code presented in Figure 3 on the left side. The measurement of each parity-check operator returns a binary value indicating the absence or presence of a nearby data qubit error. Assume that a Z-type error has occurred on data qubit 4 and the initial parity-check has a value of 0. Ancillas AX1 and AX2 will return a value of 1, which indicates that a data qubit error has occurred in their proximity, and ancillas AX0 and AX3 will return a value of 0, according to the parity-checks provided in Figure 1. However, due to the degenerate nature of the surface code, two different but complementary sets of errors can produce the same error syndrome. Regardless of which of the two potential sets of errors has occurred, the decoder is going to provide the same corrections every time the same error syndrome is observed. Therefore, an assumption is made when the decoder is designed about which corrections are going to be selected when each error syndrome is observed.
For example, when the decoder observes ancillas AX1 and AX2 returning a value of 1, there are two sets of errors that might have occurred: a Z error at data qubit 4, or a Z error at data qubit 2 and at data qubit 6. If the decoder is programmed to output a Z-type correction at data qubit 4, then in one case the error is going to be erased, but in the other case a logical error will be created. Based on that fact, there is no decoder that can always provide the right set of corrections, since there is a chance of misinterpretation of the errors that have occurred.
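The degeneracy described above can be checked directly with the parity-check matrix sketched earlier: the single error {Z4} and the pair {Z2, Z6} trigger exactly the same X-type detection events, so no decoder can distinguish them from the syndrome alone. The check matrix below is the same illustrative assumption as before, not the exact layout of Figure 1.

```python
import numpy as np

# Same illustrative X-type check matrix as in the previous sketch (assumed labeling).
H_X = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=np.uint8)

def z_error(qubits):
    e = np.zeros(9, dtype=np.uint8)
    e[list(qubits)] = 1
    return e

s_single = (H_X @ z_error({4})) % 2      # Z error on data qubit 4
s_pair   = (H_X @ z_error({2, 6})) % 2   # Z errors on data qubits 2 and 6

# Both error sets produce the identical syndrome [0 1 1 0]. If the decoder
# always answers "Z on qubit 4", the second case is left with a residual
# Z2 Z4 Z6 operator, which corresponds to a logical error in this layout.
assert np.array_equal(s_single, s_pair)
print(s_single, s_pair)
```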
A single error on a data qubit will be signified by a pair of neighboring parity-checks changing value from the previous error correction cycle. In the case where an error occurs at the sides of the lattice, only one parity-check will provide information about the error, because there is only one parity-check available due to the structure of the rotated surface code. Multiple data qubit errors that occur near each other form one-dimensional chains of errors, which create only two detection events, located at the endpoints of the chains (see Figure 3 on the left side and the red line on the right side). On the other hand, a measurement error, which is an error during the measurement itself, is described as a chain between the same parity-check over multiple error correction cycles (see the blue line in Figure 3 on the right side). This blue line represents an alternating pattern of the measurement values (0-1-0 or 1-0-1) coming from the same parity-check for consecutive error correction cycles. If such a pattern is identified and is not correlated with a data qubit error, then it is considered a measurement error, so no corrections should be applied.

In general, the error syndrome is therefore collected over multiple error correction cycles, and corrections for data qubit errors are more confidently proposed by observing the accumulation of errors throughout the error correction cycles.

There exist classical algorithms that can decode the surface code efficiently; however, optimal decoding is an NP-hard problem. For example, maximum likelihood decoding (MLD) searches for the most probable error that produced the error syndrome, whereas minimum weight perfect matching (MWPM) searches for the least amount of errors that produced the error syndrome [5, 17]. MLD has a running time of O(nχ³) according to [20], where χ is a parameter that controls the approximation precision, and the optimized version of the Blossom algorithm shows linear scaling with the number of qubits. Also, a parallel version of the Blossom algorithm is described in [6] that claims constant execution time regardless of the size of the system.

Furthermore, current qubit technology offers a small time budget for decoding, making most decoders unusable for near-term experiments. For example, an error correction cycle of a d=3 rotated surface code with superconducting qubits takes ~700 nsec [2], which provides the upper limit of the allowed decoding time in this scenario. If noisy error syndrome measurements are also assumed, then d error correction cycles are required to provide the necessary information to the decoder, so in this scenario ~2.1 µsec will be the upper limit for the time budget of decoding.

The way that the Blossom algorithm performs the decoding is through a MWPM in a graph that includes the detection events that have occurred during the error correction cycles taken into account for decoding. If the number of qubits is increased, then the graph will be bigger and the decoding time will increase, assuming no parallelization.

An alternative decoding approach is to use neural networks to assist or perform the decoding procedure, since neural networks provide fast and constant execution time, while maintaining high application performance. In this paper, we are going to analyze decoders that include neural networks and compare them to each other and to an un-optimized version of the Blossom algorithm as described in [21].
Feed-forward neural networks are adequate for many applications; however, RNNs are able to produce better results for more complex problems.

In feed-forward neural networks, input signals xi are passed to the nodes of the hidden layer hi, and the output of each node in the hidden layer acts as an input to the nodes of the following hidden layer, until the output of the nodes of the last hidden layer is passed to the nodes of the output layer yi. The weighted connections between nodes of different layers are denoted as Wi and b is the bias of each layer. σ denotes the activation function that is selected, with popular options being the sigmoid, the hyperbolic tangent (tanh) and the rectified linear unit (ReLU). The output of the FFNN presented in Figure 4 is calculated as follows:

y⃗ = σ(Ŵo σ(Ŵh x⃗ + b⃗h) + b⃗o)    (2)
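As a minimal illustration of equation (2), the sketch below evaluates a two-layer feed-forward network in NumPy; the layer sizes and the choice of ReLU as the activation are assumptions made only for the example.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def ffnn_forward(x, W_h, b_h, W_o, b_o, activation=relu):
    """Two-layer feed-forward pass: y = sigma(W_o * sigma(W_h * x + b_h) + b_o)."""
    hidden = activation(W_h @ x + b_h)
    return activation(W_o @ hidden + b_o)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 8, 16, 4          # arbitrary example sizes
W_h, b_h = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
W_o, b_o = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)

y = ffnn_forward(rng.normal(size=n_in), W_h, b_h, W_o, b_o)
print(y.shape)   # (4,)
```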
In recurrent neural networks there is feedback that takes into account output from previous time steps yt−1, ht−1. RNNs have a feedback loop at every node, which allows information to move in both directions. Due to this feedback nature (see Figure 5), recurrent neural networks can identify temporal patterns of widely separated events in noisy input streams.

FIG. 5. A conceptual visualization of the recurrent nature of an RNN.

In this work, Long Short-Term Memory (LSTM) cells are used as the nodes of recurrent neural networks (see Figure 6). In an LSTM cell there are extra gates, namely the input, forget and output gate, that are used in order to decide which signals are going to be forwarded to another node. W is the recurrent connection between the previous hidden layer and the current hidden layer. U is the weight matrix that connects the inputs to the hidden layer. C̃ is a candidate hidden state that is computed based on the current input and the previous hidden state. C is the internal memory of the unit, which is a combination of the previous memory, multiplied by the forget gate, and the newly computed hidden state, multiplied by the input gate [22]. The equations that describe the behavior of all gates in the LSTM cell are described in Figure 6.
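Since Figure 6 is not reproduced here, the sketch below spells out one step of a standard LSTM cell in NumPy, following the description above: the U matrices act on the current input, the W matrices act on the previous hidden state, C̃ is the candidate state, and the new memory C combines the previous memory (scaled by the forget gate) with C̃ (scaled by the input gate). This is a generic textbook formulation, not necessarily identical to the exact variant used by the authors.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, U, W, b):
    """One LSTM time step. U, W, b are dicts holding the parameters of the
    input (i), forget (f), output (o) and candidate (c) transformations."""
    i = sigmoid(U["i"] @ x + W["i"] @ h_prev + b["i"])        # input gate
    f = sigmoid(U["f"] @ x + W["f"] @ h_prev + b["f"])        # forget gate
    o = sigmoid(U["o"] @ x + W["o"] @ h_prev + b["o"])        # output gate
    c_tilde = np.tanh(U["c"] @ x + W["c"] @ h_prev + b["c"])  # candidate state
    c = f * c_prev + i * c_tilde       # internal memory: old memory + new candidate
    h = o * np.tanh(c)                 # hidden state exposed to the next layer/step
    return h, c

# Example dimensions (assumptions for the sketch only).
n_in, n_hid = 8, 16
rng = np.random.default_rng(1)
U = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in "ifoc"}
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in "ifoc"}
b = {k: np.zeros(n_hid) for k in "ifoc"}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):   # a short input sequence
    h, c = lstm_step(x_t, h, c, U, W, b)
print(h.shape, c.shape)
```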
The way that neural networks solve problems is not by explicit programming, but rather by "learning" the solution based on given examples. There exist many ways to "teach" the neural network how to provide the right answer; however, in this work we are focusing on supervised learning. Learning is a procedure which involves the creation of a map between an input and a corresponding output, and in supervised learning the (input, output) pair is provided to the neural network. During training, the neural network adjusts its weights in order to provide the correct output based on the given input. Theoretically, at the end of training, the neural network should be able to infer the right output even for inputs that were not provided during training, which is known as generalization.

Training is stopped when the neural network can sufficiently predict the right output for each training input. However, a metric of the closeness between the desired value and the predicted value needs to be defined. This metric is known as the cost/loss function and guides the neural network towards the desired outcome by estimating the closeness between the predicted and the desired value. The cost function is calculated at the end of every training iteration after the weights have been updated. The cost function that we used is known as the mean squared error, which tries to minimize the average squared error between the desired output and the predicted output, given by

cost = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²    (3)

where n is the number of data samples, Yᵢ is the target value and Ŷᵢ is the predicted value.

The procedure in which the weights are updated during training in order to minimize the cost function is known as backpropagation. Backpropagation calculates the gradient of the cost function with respect to the weights, and the weights are then updated using these gradients through stochastic gradient descent. In order to be able to use neural networks to find solutions to a variety of applications (linear and non-linear), it is required to have a non-linear activation function at the processing step of every node. This function defines the contribution of this node to the subsequent nodes that it is connected to. The activation function that was used in this work was the Rectified Linear Unit.
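To make the training loop described above concrete, the following sketch performs mini-batch stochastic gradient descent on the mean squared error of equation (3) for a single linear layer; the gradient is written out by hand for this simplified model, whereas a real decoder network would obtain it via backpropagation through all layers. The batch size, learning rate and data are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 8))            # placeholder training inputs
Y = rng.normal(size=(1000, 4))            # placeholder training targets
W = np.zeros((8, 4))                      # weights of a single linear layer

learning_rate, batch_size = 0.01, 100
for epoch in range(50):
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], Y[start:start + batch_size]
        pred = xb @ W
        # Mean squared error cost of Eq. (3), averaged over the mini-batch.
        cost = np.mean(np.sum((yb - pred) ** 2, axis=1))
        # Gradient of the cost with respect to W for this linear model.
        grad = -2.0 * xb.T @ (yb - pred) / len(xb)
        W -= learning_rate * grad          # stochastic gradient descent update
    if epoch % 10 == 0:
        print(epoch, round(float(cost), 4))
```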
The classical module of the latter design will only receive the error syndrome out of the last error correction cycle and predict a set of corrections. In our previous experimentation [7], this classical module was called the simple decoder. The corrections proposed by the simple decoder do not need to exactly match the errors that occurred, as long as the corrections correspond to the observed error syndrome (valid corrections).

The other module, which in both cases is a neural network, should be trained to receive the error syndromes out of all error correction cycles and predict whether the corrections that are going to be proposed by the simple decoder are going to lead to a logical error or not. In that case, the neural network outputs extra corrections, which are the appropriate logical operator that erases the logical error. The output of both modules is combined, and any logical error created by the corrections of the simple decoder will be canceled due to the added corrections of the neural network (see Figure 8).

Furthermore, the simple decoder is purposely designed in the simplest way in order to remain fast, regardless of the quality of the proposed corrections. By adding the simple decoder alongside the neural network, the corrections can be given at one step and the execution time of the decoder remains small, since both modules are fast and operate in parallel.

In Figure 8, the decoding procedure of the high level decoder is described with an example. In Figure 8a, we present an observed error syndrome, shown as red dots, and the bit-flip errors on physical data qubits (shown with X on top of them) that created that syndrome. In Figure 8b, we present the decoding of the classical module known as the simple decoder. The simple decoder receives the last error syndrome of the decoding procedure and proposes corrections on physical qubits by creating chains between each detection event and the nearest boundary of the same type as the parity-check type of the detection event. In Figure 8b, the corrections on the physical qubits are shown with X on top of them, indicating the way that the simple decoder functions. The simple decoder corrections are always deemed valid, due to the fact that the predicted and observed error syndrome always match based on the construction of the simple decoder. In the case of Figure 8a-b, the proposed corrections of the simple decoder are going to lead to an X̄ logical error, therefore we use the neural network to identify this case and propose the application of the X̄ logical operator as additional corrections to the simple decoder corrections, as presented in Figure 8c.

FIG. 8. Description of the decoding process of the high level decoder for a d=5 rotated surface code. (a) Observed error syndrome shown in red dots and bit-flip errors on physical data qubits shown with X on top of them. (b) Corrections proposed by the simple decoder for the observed error syndrome. (c) Additional corrections in the form of the X̄ logical operator to cancel the logical error generated from the proposed corrections of the simple decoder.
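The two-module flow described above can be summarized in a few lines of Python. The function names and the binary representation of corrections are assumptions made for the sketch; simple_decode stands for the classical module that matches every detection event to its nearest boundary, and predicts_logical_error stands for the trained neural network of the high level decoder.

```python
import numpy as np

def high_level_decode(syndromes, simple_decode, predicts_logical_error,
                      logical_x, logical_z):
    """High level decoder sketch: combine the fast classical module with the
    neural network that flags (and cancels) logical errors.

    syndromes               -- error syndromes of all error correction cycles
    simple_decode(s)        -- corrections (binary X/Z vectors) for the last syndrome s
    predicts_logical_error  -- neural network: syndromes -> (logical X error?, logical Z error?)
    logical_x / logical_z   -- binary vectors for the X-bar / Z-bar logical operators
    """
    corr_x, corr_z = simple_decode(syndromes[-1])        # always "valid" corrections
    flag_x, flag_z = predicts_logical_error(syndromes)   # will the simple decoder fail logically?
    if flag_x:
        corr_x = (corr_x + logical_x) % 2   # append X-bar to cancel the logical X error
    if flag_z:
        corr_z = (corr_z + logical_z) % 2   # append Z-bar to cancel the logical Z error
    return corr_x, corr_z

# Toy usage with stand-in modules for a d=3 code (9 data qubits, assumed labeling):
demo = high_level_decode(
    syndromes=[np.zeros(8, dtype=int)],
    simple_decode=lambda s: (np.zeros(9, dtype=int), np.zeros(9, dtype=int)),
    predicts_logical_error=lambda ss: (True, False),
    logical_x=np.array([1, 0, 0, 1, 0, 0, 1, 0, 0]),   # X0 X3 X6
    logical_z=np.zeros(9, dtype=int),
)
print(demo[0])
```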
V. Implementation parameters

In this section, we implement and compare both types of neural network based decoders as discussed in the previous section and argue about the better strategy to create such a decoder. The chosen strategy will be determined by investigating how different implementation parameters affect the i) decoding performance, ii) training time and iii) execution time.

• The decoding performance indicates the accuracy of the algorithm during the decoding process. The typical way that decoding performance is evaluated is through lifetime simulations. In lifetime simulations, multiple error correction cycles are run and decoding is applied in frequent windows. Depending on the error model, a single error correction cycle might be enough to successfully decode, as in the case of perfect error syndrome measurements (window = 1 cycle), or multiple error correction cycles might be required, as in the case of imperfect error syndrome measurements (window > 1 cycle). When the lifetime simulations are stopped, the decoding performance is evaluated as the ratio of the number of logical errors found over the number of windows run until the simulations are stopped (a sketch of such a simulation loop is given after this list).

• The training time is the time required by the neural network to adjust its weights in a way that the training inputs provide the corresponding outputs as provided by the training dataset and adequate generalization can be achieved.

• The execution time is the time that the decoder needs to perform the decoding after being trained. It is calculated as the difference between the time when the decoder receives the first error syndrome of the decoding window and the time when it provides the output.
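As referenced in the decoding performance item above, a lifetime simulation can be organized as in the sketch below: errors are injected for one window of cycles, the decoder is applied, and the logical error rate is estimated as logical errors per window. The helper functions (sample_window, decode, causes_logical_error) are placeholders for the error model, the decoder under test and the logical-error check.

```python
def lifetime_logical_error_rate(n_windows, cycles_per_window,
                                sample_window, decode, causes_logical_error):
    """Estimate the logical error rate as (# logical errors) / (# windows).

    sample_window(cycles)      -- placeholder: inject errors and return the syndromes
                                  of one decoding window plus the true residual error
    decode(syndromes)          -- placeholder: decoder under test, returns corrections
    causes_logical_error(e, c) -- placeholder: True if error e combined with
                                  corrections c amounts to a logical operator
    """
    logical_errors = 0
    for _ in range(n_windows):
        syndromes, true_error = sample_window(cycles_per_window)
        corrections = decode(syndromes)
        if causes_logical_error(true_error, corrections):
            logical_errors += 1
    return logical_errors / n_windows

# With perfect syndrome measurements the window is a single cycle, with
# imperfect measurements it is typically d cycles, e.g.:
# p_logical = lifetime_logical_error_rate(10**5, 1, sample_window, decode, causes_logical_error)
```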
A. Error model

These decoders were tested for two error models: the depolarizing error model and the circuit noise model.

The depolarizing model assigns X, Z and Y errors with equal probability p/3, known as depolarizing noise, only on the data qubits. No errors are inserted on the ancilla qubits and perfect parity-check measurements are used. Therefore, only a single cycle of error correction is required to find all errors.

The circuit noise model assigns depolarizing noise on the data qubits and the ancilla qubits. Furthermore, each single-qubit gate is assumed perfect but is followed by depolarizing noise with probability p/3, and each two-qubit gate is assumed perfect but is followed by a two-qubit depolarizing map where each two-qubit Pauli has probability p/15, except the error-free case, which has a probability of 1 − p. Depolarizing noise is also used at the preparation of a state and at the measurement operation with probability p, resulting in a wrongly prepared state or a measurement error, respectively. An important assumption is that the error probability of a data qubit error is equal to the probability of a measurement error, therefore d cycles of error correction are deemed enough to decode properly.
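A minimal sampler for the two noise channels just described might look as follows; it only illustrates how the probabilities p/3 (single-qubit depolarizing) and p/15 (two-qubit depolarizing) are used, and is not the authors' simulation code.

```python
import random

PAULIS_1Q = ["X", "Y", "Z"]
PAULIS_2Q = [a + b for a in "IXYZ" for b in "IXYZ" if a + b != "II"]  # 15 non-trivial pairs

def depolarize_1q(p):
    """Single-qubit depolarizing channel: each non-trivial Pauli with probability p/3."""
    if random.random() < p:
        return random.choice(PAULIS_1Q)
    return "I"

def depolarize_2q(p):
    """Two-qubit depolarizing channel applied after a two-qubit gate:
    each of the 15 non-trivial two-qubit Paulis with probability p/15."""
    if random.random() < p:
        return random.choice(PAULIS_2Q)
    return "II"

# Depolarizing error model: noise only on the data qubits, perfect measurements.
p = 0.05
data_errors = [depolarize_1q(p) for _ in range(9)]   # d=3 rotated code: 9 data qubits

# The circuit noise model additionally applies depolarize_2q after every CNOT of the
# syndrome extraction circuit and flips preparation/measurement outcomes with probability p.
measurement_error = random.random() < p
```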
B. Choosing the best dataset

The best dataset for a neural network based decoder is the dataset that produces the best decoding performance. Naively, one could suggest that including all possible error syndromes would lead to the best decoding performance; however, as the size of the quantum system increases, including all error syndromes becomes impossible. Therefore, we need to include as few but as diverse error syndromes as possible, which will provide the maximum amount of generalization, and thus the best decoding performance, after training.

In our previous experimentation [7], we showed that sampling at a single physical error rate that always produces the fewest amount of corrections is enough to decode small distance rotated surface codes with a decent level of decoding performance. This concept of always choosing the fewest amount of corrections is similar to the Minimum Weight Perfect Matching that the Blossom algorithm uses.

After sampling and training the neural network at a single physical probability, the decoder is tested against a large variety of physical error rates and its decoding performance is observed. We call this approach the single probability dataset approach, because we create only one dataset based on a single physical error rate and test it against many. Using the single probability dataset approach to decode various physical error probabilities is not optimal, because when sampling at low physical error rates, less diverse samples are collected; therefore the dataset is not diverse enough to correctly generalize to unknown training inputs.

The single probability approach is valid for a real experiment, since in an experiment there is a single physical error probability at which the quantum system operates, and at that probability the sampling, training and testing of the decoder will occur. However, this is not a good strategy for testing the decoding performance over a wide range of error probabilities. This is attributed to the degenerate nature of the surface code, since different sets of errors generate the same error syndrome. One set of errors is more probable when the physical error rate is small and another when it is high. Based on the design principles of the decoder, only one of these sets of errors, and always the same, is going to be selected when a given syndrome is observed, regardless of the physical error rate. Therefore, training a neural network based decoder at one physical error rate and testing its decoding performance at a widely different physical error rate is not beneficial. The main benefit of this approach lies in the fact that only a single neural network has to be trained and used to evaluate the decoding performance for all the physical error rates that were tested. In the single probability dataset approach, the set with the fewer errors was always selected, because this set is more probable for the range of physical error rates that we are interested in.

To avoid such violations, we created different datasets that were obtained by sampling at various physical error rates and trained a different neural network at each physical error rate taken into account. We call this approach the multiple probabilities datasets approach. Each dedicated training dataset that was created at a specific physical error probability is used to test the decoding performance at that same physical error probability and the probabilities close to it, but not at all physical probabilities tested. Moreover, by sampling, training and testing the performance at the same physical error rate, the decoder has the most relevant information to perform the task of decoding.

The first step when designing a neural network based decoder is gathering the data that will be used as the training dataset. However, as the code distance increases, the size of the space including all potential error syndromes gets exponentially large. Therefore, we need to decide at which point the sampling process is going to be terminated.

Based on the sampling probability (physical error rate), different error syndromes will be more frequent than others. We chose to include the most frequent error syndromes in the training dataset. In order to find the best possible dataset, we increase the dataset size until it stops yielding better results in terms of decoding performance. For each training dataset size, we train a neural network and evaluate the decoding performance.

It is not straightforward to claim that the optimal size of a training dataset has been found, because there is no way to ensure that we found the minimum number of training samples that provide the best weights for the neural network, and therefore generalization, after being perfectly trained. Thus, we rely heavily on the decoding performance that each training dataset achieves and typically use more training samples than the least amount required.
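The sampling procedure described above (collect syndromes at a given physical error rate, keep the most frequent ones, and grow the dataset until the decoding performance saturates) could be sketched as follows; sample_error, compute_syndrome and label_for are placeholders for the error model, the parity-check computation and the training target used by the chosen decoder design.

```python
from collections import Counter

def build_training_dataset(p, n_samples, max_size,
                           sample_error, compute_syndrome, label_for):
    """Collect the most frequent error syndromes at physical error rate p.

    sample_error(p)      -- placeholder: draw one error configuration
    compute_syndrome(e)  -- placeholder: parity-checks of that error
    label_for(syndrome)  -- placeholder: training target (e.g. logical-error flag)
    """
    counts = Counter()
    for _ in range(n_samples):
        syndrome = compute_syndrome(sample_error(p))
        counts[tuple(syndrome)] += 1
    # Keep only the max_size most frequently observed syndromes.
    most_frequent = [s for s, _ in counts.most_common(max_size)]
    return [(s, label_for(s)) for s in most_frequent]

# The dataset size would then be increased (e.g. doubled) until the decoding
# performance of the network trained on it stops improving.
```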
C. Structure of the neural network

While investigating the optimal size of a dataset, some preliminary investigation of the structure has been done; however, only after the dataset is defined is the structure in terms of layers and nodes explored in depth (see Figure 9).

FIG. 9. Different configurations of layers and nodes for the d=3 rotated surface code for the depolarizing error model. The nodes of the tested hidden layers are presented in the legend. Training stops at 500 training epochs for all configurations, since a good indication of the training accuracy achieved is evident by that point. Then, the one that reached the highest training accuracy continues training until the training accuracy cannot increase any more.

A variety of different configurations of layers and nodes needs to be tested, so that the configuration with the highest training accuracy in the least amount of training time can be adopted. The main factors that affect the structure of the neural network are the size of the training dataset, the similarity between the training samples and the type of neural network.

We found in our investigation that the number of layers selected for training is affected more by the samples, e.g. the similarity of the input samples, and less by the size of the training dataset. The number of nodes of the last hidden layer is selected to be equal to the number of output nodes. The rest of the hidden layers were selected to have a decreasing number of nodes going from the first to the last layer, but we do not claim that this is necessarily the optimal strategy.

We implemented both decoder designs with feed-forward and recurrent neural networks. The more sophisticated recurrent neural network seems to outperform the feed-forward neural network in both the depolarizing and the circuit noise model. In Figure 10, it is evident that even for the small decoding problem of the d=3 rotated surface code for the depolarizing error model, the RNN outperforms the FFNN in decoding performance.
This is even more obvious at larger code distances and for the circuit noise model, where the recurrent neural network naturally fits better due to its nature. Moreover, training of the FFNN becomes much harder compared to the RNN as the size of the dataset increases, making the experimentation with FFNNs even more difficult.

The metric that we use to compare the different designs is the pseudo-threshold. The pseudo-threshold is defined as the highest physical error rate at which the quantum device should operate in order for error correction to be beneficial. Operating at probabilities higher than the pseudo-threshold will cause worse decoding performance compared to an unencoded qubit. The protection provided by error correction increases as the physical error rate becomes smaller than the pseudo-threshold value, therefore a higher pseudo-threshold for a code distance signifies higher decoding performance.

The pseudo-threshold metric is used when a single code distance is being tested. When a variety of code distances is investigated, we use the threshold metric. The threshold is a metric that represents the protection against noise for a family of error correcting codes, like the surface code. For the surface code, each code distance has a different pseudo-threshold value, but the threshold value of the code is only one.

FIG. 10. Left: Comparison of decoding performance between the Blossom algorithm, the low level decoder and the high level decoder for the d=3 rotated surface code for the depolarizing error model. Right: Zoomed in at the region defined by the square.

The pseudo-threshold values for all decoders investigated in Figure 10 can be found as the points of intersection between the decoder curve and the black dashed line, which represents the points where the physical error probability is equal to the logical error probability (y = x). The pseudo-thresholds acquired from Figure 10 are presented in Table I.

The threshold value is defined as the point of intersection of all the curves of multiple code distances, therefore it cannot be seen from Figure 10, since all curves involve d=3 decoders, but it can be found in Figures 13 and 14 for the depolarizing and circuit noise model, respectively.
circuit noise model, respectively. dient descent, whereas smaller learning rates near the end of
Another observation from Figure 10 and Table I is that the training can increase the training accuracy. Therefore, we
9
3. Generalization

The training process should not only be focused on the correct prediction of known inputs, but also on the correct prediction of inputs unknown to training, which is known as generalization. Without generalization, the neural network acts as a Look-Up Table (LUT), which will lead to sub-optimal behavior as the code distance increases. In order to achieve a high level of generalization, we continue training until the closeness between the desired and predicted value up to the 3rd decimal digit is higher than 95% over all training samples.
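One way to read the stopping criterion above is as follows: training continues until, for more than 95% of the training samples, the prediction agrees with the target when both are rounded to three decimal digits. The sketch below encodes that interpretation, which is an assumption about the exact rule rather than the authors' code.

```python
import numpy as np

def training_accuracy_3_decimals(targets, predictions):
    """Fraction of training samples whose prediction matches the target
    up to the 3rd decimal digit (one possible reading of the criterion).
    targets / predictions: arrays of shape (n_samples, n_outputs)."""
    agree = np.all(np.round(predictions, 3) == np.round(targets, 3), axis=-1)
    return np.mean(agree)

def should_stop(targets, predictions, required=0.95):
    return training_accuracy_3_decimals(targets, predictions) > required
```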
4. Training and execution time

Timing is a crucial aspect of decoding, and in the case of neural network decoders we need to minimize both the execution time and the training time as much as possible.

The training time is proportional to the size of the training dataset and the number of qubits. The number of qubits increases in a quadratic fashion, 2d² − 1, and the selected size of the training dataset in our experimentation increases in an exponential way, 2^(d²−1). Therefore, the training time should increase exponentially while scaling up.

However, the platform on which the training occurs affects the training time immensely, since training on one/multiple CPU(s), one/multiple GPU(s) or a dedicated chip in hardware will result in widely different training times. The neural networks that were used to obtain the results in this work required between half a day and 3 days, depending on the number of weights and the inputs/outputs of the neural network, on a CPU with 28 hyper-threaded cores at 2GHz with 384GB of memory.

In our simulations on a CPU, we observed the constant time behavior that was anticipated for the execution time; however, a realistic estimation taking into account all the details of a hardware implementation on which such a decoder might run has not been performed by this or any other group yet. The time budget for decoding is different for different qubit technologies; however, due to the inadequate fidelity of the quantum operations, it is extremely small for the time being, for example ~700 nsec for a surface code cycle with superconducting qubits [19].

In Figure 11, we present the constant and non-constant execution time for the d=3 rotated surface code for the depolarizing noise model with perfect error syndrome measurements for the high level decoder and the low level decoder, respectively.

FIG. 11. Execution time for the high level decoder (hld) and the low level decoder (lld) for Feed-forward (FFNN) and Recurrent neural networks (RNN) for the d=3 rotated surface code for the depolarizing error model.

The low level decoder has to repeat its predictions before it predicts a valid set of corrections, which makes its execution time non-constant. With careful design of the repetition step, the average number of predictions can decrease; however, the execution time will remain non-constant. Based on the non-constant execution time and the inferior decoding performance compared to the high level decoder, the low level decoder was rejected.

Moreover, the recurrent neural network typically uses more weights compared to the feed-forward neural network, which translates to a higher execution time. However, the decoding performance and the training accuracy achieved with recurrent neural networks are higher. Thus, we decided to create high level decoders based on recurrent neural networks, taking into account all the parameters mentioned above.

The execution time for high level decoders appears to increase linearly with the number of qubits. This is justified by the fact that, as the code distance increases, the operation of the simple decoder does not require more time, since all detection events are matched in parallel and independently of each other, and the size of the neural network increases in such a way that only a linear increase in the execution time is observed. In Table II, we provide the calculated average time of decoding a surface code cycle under depolarizing noise for all distances tested with the high level decoder with recurrent neural networks.

TABLE II. Average time for a surface code cycle under the depolarizing error model
  Code distance    Avg. time / cycle
  d=3              4.14 msec
  d=5              11.19 msec
  d=7              28.53 msec
  d=9              31.34 msec

There are factors such as the number of qubits, the type of neural network being used and the number of inputs/outputs of the neural network that influence the execution time. The main advantage against classical algorithms is that the execution time of such neural network based decoders is independent of the physical error probability.

In the following section we present the results in terms of the decoding performance for different code distances.
VI. Results

As we previously mentioned, the way that decoding performance is tested is by running simulations that sweep a large range of physical error rates and calculate the corresponding logical error rate for each of them. This type of simulation is frequently referred to as a lifetime simulation, and the logical error rate is calculated as the ratio of logical errors found over the error correction cycles performed to accumulate these logical errors.

The design of the neural network based decoder that was used to obtain the results is described in Figure 12 for the depolarizing and the circuit noise model. For the case of the depolarizing error model, neural network 1 is not used, so the input is forwarded directly to the simple decoder, since perfect syndrome measurements are assumed. The decoding process is similar to the one presented in Figure 8.

The decoding algorithm for the circuit noise model consists of a simple decoder and 2 neural networks. Both neural networks receive the error syndrome as input. Neural network 1 predicts which detection events of the error syndrome belong to data qubit errors and which belong to measurement errors. Then, it outputs to the simple decoder the error syndrome relieved of the detection events that belong to measurement errors. The simple decoder provides a set of corrections based on the received error syndrome. Neural network 2 receives the initial error syndrome and predicts whether the simple decoder will make a logical error, and outputs a set of corrections which are combined with the simple decoder corrections at the output.

FIG. 12. The design for the high level decoder that was used for the depolarizing and the circuit noise model.

A. Depolarizing error model

For the depolarizing error model, we used 5 training datasets that were sampled at these physical error rates: 0.2, 0.15, 0.1, 0.08, 0.05. Perfect error syndrome measurements are assumed, so the logical error rate can be calculated per error correction cycle.

In Table III, we present the pseudo-thresholds achieved by the investigated decoders for the depolarizing error model with perfect error syndrome measurements for different distances. As expected, when the distance increases, the pseudo-threshold also increases. Furthermore, the neural network based decoder with the multiple probabilities datasets exhibits higher pseudo-threshold values, which is expected since it has more relevant information in its dataset.

TABLE III. Pseudo-threshold values for the depolarizing error model
  Decoder                 d=3       d=5       d=7       d=9
  Blossom                 0.08234   0.10343   0.11366   0.11932
  Single prob. dataset    0.09708   0.10963   0.12447   N/A
  Multiple prob. dataset  0.09815   0.12191   0.12721   0.12447

As can be seen from Figure 13, the multiple probabilities datasets approach provides better decoding performance for all code distances simulated. The fact that the high level decoder is trained to identify the most frequently encountered error syndromes based on a given physical error rate results in more accurate decoding information. Another reason for the improvement against the Blossom algorithm is the ability to identify correlated errors (−iY = XZ). For the depolarizing noise model with perfect error syndrome measurements, the Blossom algorithm is proven to be near-optimal, so we are not able to observe a large improvement in the decoding performance. Furthermore, the comparison is against the un-optimized version of the Blossom algorithm [21], therefore it is mainly performed to get a frame of reference rather than an explicit numerical comparison.

FIG. 13. Decoding performance comparison between the high level decoder trained on a single probability dataset, the high level decoder trained on multiple probabilities datasets and the Blossom algorithm for the depolarizing error model with perfect error syndrome measurements. Each point has a confidence interval of 99.9%.

We observe that for the range of physical error rates that we are interested in, which are below the pseudo-threshold, the improvement against the Blossom algorithm reaches up to 18.7%, 58.8% and 53.9% for code distances 3, 5 and 7, respectively, for the smallest physical error probabilities tested.

The threshold of the rotated surface code for the depolarizing model has improved from 0.14 for the single probability dataset approach to 0.146 for the multiple probabilities datasets approach. The threshold of Blossom is calculated to be 0.142.

B. Circuit noise model

For the circuit noise model, we used 5 training datasets that were sampled at these physical error rates: 4.5x10⁻³, 1.5x10⁻³, 8.0x10⁻³, 4.5x10⁻⁴, 2.5x10⁻⁴. Since imperfect error syndrome measurements are assumed, the logical error rate is calculated per window of d error correction cycles.

In Table IV, we present the pseudo-thresholds achieved for the circuit noise model with imperfect error syndrome measurements.
Again, the neural network based decoder with multiple probabilities datasets performs better than the one with the single probability dataset. We were not able to use the Blossom algorithm with imperfect measurements for code distances higher than 3, therefore we decided not to include it. However, we note that the results that were obtained are similar to the results in the literature corresponding to the circuit noise model [23, 24].

TABLE IV. Pseudo-threshold values for the circuit noise model
  Decoder                 d=3         d=5         d=7
  Single prob. dataset    3.99x10⁻⁴   9.23x10⁻⁴   N/A
  Multiple prob. dataset  4.44x10⁻⁴   1.12x10⁻³   1.66x10⁻³

FIG. 14. Decoding performance comparison between the high level decoder trained on a single probability dataset and the high level decoder trained on multiple probabilities datasets for the circuit noise model with imperfect error syndrome measurements. Each point has a confidence interval of 99.9%.

We observe from Figure 14 that the results with the multiple probabilities datasets for the circuit noise model are significantly better, especially as the code distance is increased. The case of d=3 is small and simple enough to be solved equally well by both approaches. The increased decoding performance achieved with the multiple probabilities datasets approach is based on the more accurate information for the physical error probability that is being tested.

The threshold of the rotated surface code for the circuit noise model has improved from 1.77x10⁻³ for the single probability dataset approach to 3.2x10⁻³ for the multiple probabilities datasets approach, which signifies that the use of dedicated datasets when decoding a given physical error rate is highly advantageous.

As mentioned, the single probability dataset is collected at a low physical error rate, for example around the pseudo-threshold value. Therefore, the size of the training dataset is similar for both the single and the multiple probabilities datasets at the low physical error rates. For higher physical error rates, we gather larger training datasets for the multiple probabilities datasets approach, which are also more relevant. The space that needs to be sampled gets exponentially larger, to the point that it is infeasible to gather enough samples to perform good decoding beyond d=7.

The reason for this exponential growth is the way that we provide the data to the neural network. Currently, we gather all error syndromes out of all the error correction cycles and create lists out of them. Then, we provide these lists to the recurrent neural network all together. Since the recurrent neural network can identify patterns both in space and time, we also provide the error correction cycle that produced each error syndrome (a time stamp for each error syndrome). Then, the recurrent neural network is able to differentiate between consecutive error correction cycles and find patterns of errors in them.

In order to obtain efficient decoding regardless of the exponentially large state space, we restrict the space that we sample to the one containing the most frequent error syndromes occurring at the specified sampling (physical) error probability. However, even by employing such a technique, it seems impossible to continue beyond d=7 for the circuit noise model with the decoding approach that we used in this work. For the circuit noise model at d=7, for example, we gather error syndromes out of 10 error correction cycles and each error syndrome contains 48 ancilla qubits. Therefore, the full space that needs to be explored is 2^(10·48), which is infeasible.

A different approach that minimizes the space that the neural network needs to search would be extremely valuable. A promising idea would be to provide the error syndromes of each error correction cycle one at a time, instead of giving them all together, and keep an intermediate state of the logical qubit.

VII. Conclusions

This work focused on researching various design strategies for neural network based decoders. Such decoders are currently being investigated due to their good decoding performance and constant execution time. They seem to have an upper limit at around 160 qubits; however, by designing smarter approaches in the future, we can have neural network based decoders for larger quantum systems.

We focused mainly on the design aspects and the parameters that affect the performance of the neural networks and devised a detailed plan on how to approach them. We showed that we can have high decoding performance for quantum systems of about 100 qubits for both the depolarizing and the circuit noise model. We showed that a neural network based decoder that uses the neural network as an auxiliary module to a classical decoder leads to higher decoding performance.

Furthermore, we presented the constant execution time of such a decoder and showed that it increases linearly with the code distance in our simulations. We compared different types of neural networks in terms of decoding performance and execution time, concluding that recurrent neural networks can be more powerful than feed-forward neural networks for such applications.

Finally, we showed that having a dedicated dataset for the physical error rate at which the quantum system operates can increase the decoding performance.