Wandera Etal Deep Learning
Correspondence
[email protected]
(R.B.),
[email protected] (C.L.B.)
In brief
Wandera et al. applied deep learning to
predict anti-CRISPR protein candidates
associated with diverse subtypes of
CRISPR-Cas immune systems. The
algorithm identified one protein then
shown to inhibit the Cas13b nuclease
from type VI-B CRISPR-Cas systems,
expanding the known range of phage-
encoded inhibitors as part of the bacteria-
phage arms race.
Highlights
• Deep learning predicts Acr candidates
Technology
Anti-CRISPR prediction using deep learning
reveals an inhibitor of Cas13b nucleases
Katharina G. Wandera,1,6 Omer S. Alkhnbashi,2,6 Harris v.I. Bassett,1 Alexander Mitrofanov,3 Sven Hauns,3
Anzhela Migur,1 Rolf Backofen,3,4,* and Chase L. Beisel1,5,7,*
1Helmholtz Institute for RNA-Based Infection Research (HIRI), Helmholtz Centre for Infection Research (HZI), 97080 Würzburg, Germany
2Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia
3Universität Freiburg, 79098 Freiburg, Germany
4Signalling Research Centres BIOSS and CIBSS, University of Freiburg, 79098 Freiburg, Germany
5Medical Faculty, University of Würzburg, 97080 Würzburg, Germany
6These authors contributed equally
7Lead contact
SUMMARY
As part of the ongoing bacterial-phage arms race, CRISPR-Cas systems in bacteria clear invading phages
whereas anti-CRISPR proteins (Acrs) in phages inhibit CRISPR defenses. Known Acrs have proven extremely
diverse, complicating their identification. Here, we report a deep learning algorithm for Acr identification that
revealed an Acr against type VI-B CRISPR-Cas systems. The algorithm predicted numerous putative Acrs
spanning almost all CRISPR-Cas types and subtypes, including over 7,000 putative type IV and VI Acrs
not predicted by other algorithms. By performing a cell-free screen for Acr hits against type VI-B systems,
we identified a potent inhibitor of Cas13b nucleases we named AcrVIB1. AcrVIB1 blocks Cas13b-mediated
defense against a targeted plasmid and lytic phage, and its inhibitory function principally occurs upstream of
ribonucleoprotein complex formation. Overall, our work helps expand the known Acr universe, aiding our un-
derstanding of the bacteria-phage arms race and the use of Acrs to control CRISPR technologies.
Molecular Cell 82, 2714–2726, July 21, 2022 © 2022 Elsevier Inc.
shown to regulate Acr expression (Birkholz et al., 2019; Stanley et al., 2019). Beyond these Acr-associated (Aca) proteins, Acrs were often present in genomes with CRISPR-Cas systems encoding self-targeting spacers, where the Acr was responsible for preventing lethal self-targeting (Rauch et al., 2017; Watters et al., 2018). These insights led to a guilt-by-association approach used to identify Acr candidates in prophage regions. Eventually, machine learning was applied that leveraged guilt-by-association as well as a few features of known Acrs, such as length, hydrophobicity, amino acid composition, and any present motifs (Gussow et al., 2020; Wang et al., 2020; Huang et al., 2021). Although these approaches have expanded the known set of Acrs, they have only led to the validation of Acrs associated with CRISPR-Cas subtypes in which Acrs were already known.
Here, we report a deep-learning algorithm called DeepAcr that predicts Acr candidates based purely on protein sequence information. The predictions unique to DeepAcr were mainly associated with subtypes lacking any established Acrs (e.g., IV-A and V-B), whereas the systematic screening of the highest-scoring candidates against type VI-B CRISPR-Cas systems led to the discovery of an Acr that principally inhibited the Cas13b nuclease prior to complexing with its crRNA. Our algorithm and the use of deep learning are expected to further expand the known universe of Acrs.
DESIGN
A deep-learning approach combines compositional and sequence-related features without relying on genomic associations
Given the growing list of validated Acrs and the remaining number of CRISPR-Cas subtypes unassociated with Acrs (Bondy-Denomy et al., 2018), we sought to develop a distinct approach for Acr prediction. Unlike most of the prior approaches that relied on genomic associations (i.e., guilt-by-association), we focused on a large set of features derived only from the assessed protein. We also utilized deep learning instead of traditional machine learning to apply multiple learning structures for Acr prediction from sequence-related features (Eitzinger et al., 2020; Gussow et al., 2020). The resulting deep-learning algorithm, DeepAcr, functions through a series of defined steps inspired by a prior algorithm (Guo et al., 2019) (Figures 1A and S1). Initially, the assessed protein sequence is converted into a feature matrix using one-hot encoding, which converts amino acid sequences into numerical values. Additionally, a set of twelve features capturing properties of the entire protein (e.g., protein length and instability index) is extracted (Table S1). The one-hot encoding is then fed into a bidirectional recurrent cell incorporating three neural networks as the learning structures: long short-term memory (LSTM), linear, and gated recurrent unit (GRU). The output of these networks is concatenated and combined with the protein features. A multilayer perceptron (MLP) then converts these inputs into a confidence score (0 and 1 for lowest and highest confidence, respectively) reflecting the certainty that the input protein is an Acr. After predicting putative Acrs using DeepAcr, each candidate is paired with a CRISPR-Cas subtype by identifying a CRISPR-Cas system in the Acr-encoding genome or closely related genomes using existing tools (Figure 1B) (Padilha et al., 2020, 2021; Alkhnbashi et al., 2021). Acrs with no CRISPR-Cas subtype or multiple associated subtypes are labeled “unassigned.”
The resulting model was trained by employing 420 known Acrs derived from multiple CRISPR-Cas subtypes available during initial model construction (i.e., I-D, I-E, I-F, II-A, and II-C) (Bondy-Denomy et al., 2013; Pawluk et al., 2014, 2016; Rauch et al., 2017; He et al., 2018; Marino et al., 2018) as well as 420 non-Acrs represented by diverse small accessory proteins associated with type III CRISPR-Cas systems (Figure 1A; Table S2) (Shah et al., 2019). Of these Acrs and non-Acrs, 15% (126 proteins) were used to validate the performance of the model (Table S2). DeepAcr achieved a performance accuracy of 96% using the LSTM network, 94% using the linear network, and 95% using the GRU network when testing a withheld dataset (Figure S2).
RESULTS
DeepAcr uniquely predicts candidates outside of subtypes with known Acrs
Feeding DeepAcr protein sequences from 80,009 draft and complete bacterial and phage genomes, the algorithm identified 1,089,152 Acr candidates with medium (≥ 0.65) or high (≥ 0.8) confidence scores (Figure 2A; Tables S3 and S4). These candidates were principally associated with type I-E CRISPR-Cas systems, likely reflecting the large fraction of known Acrs against this subtype (Figure 2B). However, candidates were associated with many more subtypes than I-E, including subtypes such as VI-B or IV-A with no established Acrs. The number of Acr candidates within these subtypes became even more pronounced when normalizing to the number of CRISPR-Cas systems within each subtype identified in the genomes used for Acr prediction (Figure S3).
To compare our candidates with those from other available prediction algorithms, we applied the same set of protein inputs to AcrDB (Huang et al., 2021) and a previously reported machine learning algorithm we will call the Gussow method (Gussow et al., 2020). The candidates extensively overlapped with the predictions from DeepAcr, a remarkable result given that DeepAcr relies only on the protein sequence and no genomic features. Within this overlap, all 37,289 candidates from AcrDB were predicted by DeepAcr and the Gussow method, whereas 1,044,013 candidates were shared between the Gussow method and DeepAcr. Therefore, our deep-learning algorithm could closely recapitulate predictions by existing algorithms despite excluding commonly used genomic features such as self-targeting spacers or a flanking and properly oriented Aca.
Despite the extensive overlap, there were numerous candidates uniquely predicted by DeepAcr (7,850) and the Gussow method (1,019) (Figure 2A). The Gussow method’s candidates were associated with the I-E and I-F subtypes already possessing a large cohort of Acrs (Figure 2C). However, the candidates unique to DeepAcr were heavily enriched in subtypes with no established Acrs. DeepAcr further predicted putative Acrs within the VI-A subtype in which seven Acrs have been reported (Lin et al., 2020; Meeske et al., 2020), although none of these were
[Figure 1 graphic. (A) Data selection: 420 positive proteins (Acrs from anti-CRISPRdb) and 420 negative proteins (accessory proteins), balanced by data sampling and split 70% / 15% / 15% for training, testing, and validation. Feature learning: one-hot encoding of the sequence plus protein properties (protein length, molecular weight, instability index, isoelectric point, charged residues, extinction coefficients, average of hydropathy, fractions of AC ...). Ensemble models: LSTM, GRU, and linear, with the best performing model selected by accuracy and passed to a multilayer perceptron. Output: high confidence, score ≥ 0.8; medium confidence, score ≥ 0.65; low confidence, score < 0.65. (B) Decision tree: if a CRISPR-Cas system is present in the genome, the present subtype is assigned to the Acr candidate; if not, and a closely related genome contains a single CRISPR-Cas system, that subtype is assigned to the Acr candidate; otherwise the Acr candidate remains unassigned.]
Figure 1. DeepAcr applies deep learning to predict Acrs from input protein sequences
(A) Using DeepAcr to predict an Acr confidence score from an input protein sequence. Model training using a set of known Acrs and non-Acrs. The Acrs were
derived from anti-CRISPRdb (Dong et al., 2018) and are listed in Table S2. The non-Acrs were derived from accessory proteins from type III CRISPR-Cas systems
that are similar in size to known Acrs. LSTM, long short-term memory; GRU, gated recurrent unit.
(B) Assignment of a CRISPR-Cas subtype to a predicted Acr.
See also Figure S1; Table S1; Methods S1.
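Two of the steps summarized in Figure 1 lend themselves to a short illustration: the one-hot encoding of the input sequence (Figure 1A) and the subtype assignment rule (Figure 1B). The following is a minimal sketch, not the DeepAcr implementation. The function names, the residue ordering, the all-zero row for non-standard residues, and the handling of a genome that carries several subtypes are assumptions made here for illustration.

```python
# Minimal sketch of two steps from Figure 1. Names and details are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; ordering is an assumption
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a len(sequence) x 20 matrix of 0/1 values.

    Residues outside the standard alphabet become all-zero rows (assumption).
    """
    matrix = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        matrix.append(row)
    return matrix


def assign_subtype(own_genome_subtypes, related_genome_subtypes):
    """Apply the Figure 1B decision rule to lists of subtype labels.

    A candidate gets a subtype only when exactly one subtype is available,
    first from its own genome and otherwise from closely related genomes.
    No subtype or multiple subtypes gives "unassigned".
    """
    own = set(own_genome_subtypes)
    if len(own) == 1:
        return next(iter(own))
    if len(own) > 1:
        return "unassigned"  # multiple associated subtypes (interpretation)
    related = set(related_genome_subtypes)
    if len(related) == 1:
        return next(iter(related))
    return "unassigned"


if __name__ == "__main__":
    encoded = one_hot_encode("MKV")
    print(len(encoded), len(encoded[0]))      # rows x columns of the feature matrix
    print(assign_subtype([], ["VI-B"]))       # subtype taken from a related genome
```

The twelve whole-protein features and the recurrent networks are not reproduced here; the sketch only shows the shape of the sequence input and the logic of the assignment step.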
DeepAcr.
See also Figures S2 and S3; Tables S2, S3, and S4; Methods S1.
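The overlap counts quoted in the text for the three prediction methods can be checked against the DeepAcr total. The text does not state whether the 1,044,013 candidates shared between DeepAcr and the Gussow method include the 37,289 AcrDB candidates; the sketch below assumes they do not, a reading under which the parts sum exactly to the reported 1,089,152. The Gussow total it prints is derived under that same assumption and is not a figure from the paper.

```python
# Consistency check on the prediction counts quoted in the text.
deepacr_total = 1_089_152            # medium- or high-confidence DeepAcr candidates
shared_all_three = 37_289            # AcrDB candidates, all also found by DeepAcr and Gussow
shared_deepacr_gussow = 1_044_013    # assumed to exclude the AcrDB candidates
unique_deepacr = 7_850
unique_gussow = 1_019

deepacr_from_parts = shared_all_three + shared_deepacr_gussow + unique_deepacr
gussow_from_parts = shared_all_three + shared_deepacr_gussow + unique_gussow

print(deepacr_from_parts == deepacr_total)   # True under the stated assumption
print(gussow_from_parts)                     # implied Gussow total (derived, not reported)
print(100 * unique_deepacr / deepacr_total)  # percentage of DeepAcr candidates that are unique
```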
part of our training set due to the timing of their publication. The candidates predicted by DeepAcr also exhibited a distinct length distribution (Figure 2D), with 74% matching the distribution of known Acrs (50–150 aa) and 24% exhibiting longer lengths (151–300 aa). This large set of candidates offers an opportunity to identify novel Acrs, particularly those associated with subtypes in which none are known.
[Figure 2 plot residue: bar chart with CRISPR-Cas subtype on the x axis.]
known Acrs against Cas13a nucleases
Given the large number of predictions, we needed a rapid means to identify candidates that can inhibit defense by a CRISPR-Cas system’s effector nuclease. We turned to cell-free transcription-translation (TXTL) systems that recapitulate transcription and translation in a specially prepared E. coli cell lysate (Shin and Noireaux, 2012; Garamella et al., 2016). As part of a reaction, we added DNA constructs encoding a Cas nuclease, a targeting or non-targeting guide RNA (gRNA) and a targeted reporter plasmid encoding a GFP variant (deGFP) that efficiently expresses in TXTL (Garamella et al., 2016) to the lysate (Figure 3A). The lysate then expresses the nuclease and gRNA, forming a ribonucleoprotein (RNP) complex that cleaves the target and silences deGFP expression. Measuring changes in deGFP fluores-
been successfully used to screen for novel Acrs and assess inhibitory activity against different Cas nucleases from type II and V CRISPR-Cas systems (Marshall et al., 2018; Watters et al., 2018; Wandera et al., 2020).
[Figure 2 plot residue: (C) number of predicted Acrs per CRISPR-Cas subtype for Gussow-unique and DeepAcr-unique candidates; (D) number of predicted Acrs versus protein length (aa).]
We specifically focused on type VI CRISPR-Cas systems and their Cas13 nucleases, as many of the predicted Acrs fell within multiple type VI subtypes without any reported Acrs (Figure 2C). We initially devised a TXTL-based assay to measure the inhibition of Cas13 activity following our prior work (Marshall et al., 2018; Wandera et al., 2020). In this case, we can measure on-target and collateral RNA cleavage by targeting the deGFP transcript. Building on recent reports of eight distinct Acrs that inhibit Cas13a (Lin et al., 2020; Meeske et al., 2020), we employed our
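[Figure 3 graphic. (A) TXTL assay schematic: constructs encoding Cas13a, a targeting or non-targeting gRNA, and the deGFP target RNA with its PFS, combined with a pre-expressed Acr (~16 hours each); fluorescence is plotted over time for non-targeting, targeting, non-targeting + Acr, and targeting + Acr reactions. (B) Heatmap with one column per reported Cas13a Acr (AcrVIA1 and AcrVIA1′ through AcrVIA7′); Lwa is the one recoverable row label. (C) Table with columns Protein, AcRanker, PaCRISPR, AcrHub, and DeepAcr; the one recoverable row reads AcrVIA1′: Yes, Yes, Yes, 0.74 (medium).]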
Figure 3. Reported Cas13a Acrs characterized using TXTL align with scoring by DeepAcr
(A) Overview of the TXTL assay. As part of the assay, an Acr pre-expressed in one TXTL reaction is combined in a fresh reaction with constructs encoding a
Cas13a nuclease, a targeting or non-targeting gRNA, and a targeted deGFP reporter. deGFP fluorescence is then measured over time. Nuclease activation and
collateral RNA cleavage would lead to lower fluorescence, whereas the inhibition of nuclease expression or activity would restore fluorescence. PFS, protospacer
flanking sequence.
(B) Heatmap of inhibitory strength by reported Cas13a Acrs against LwaCas13a and LshCas13a in the TXTL assay. AcrVIA1′ through AcrVIA7′ come from Lin et al. (2020). AcrVIA1 comes from Meeske et al. (2020). The boxes with a white X represent non-specific inhibition of deGFP expression by the Acr. Values represent the average of four independent experiments. See Figure S4 for representative time courses.
(C) Acr predictions for three existing machine learning methods as well as for DeepAcr. Values represent the confidence scores output by DeepAcr.
See also Figure S4; Table S7.
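The confidence labels reported for the Cas13a Acrs follow from the score thresholds given for DeepAcr (high at ≥ 0.8, medium at ≥ 0.65, low otherwise). A minimal sketch of that mapping is shown below; the function name is hypothetical, and the example scores are the ones quoted in the text for AcrVIA1, AcrVIA1′, and the range reported for the remaining six Acrs.

```python
def confidence_tier(score):
    """Map a confidence score in [0, 1] to the tier labels used in Figure 1."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score >= 0.8:
        return "high"
    if score >= 0.65:
        return "medium"
    return "low"


if __name__ == "__main__":
    # Scores quoted in the text; 0.28 and 0.49 bound the range for the other six Acrs.
    for name, score in [("AcrVIA1", 0.99), ("AcrVIA1'", 0.74),
                        ("lowest of remaining six", 0.28),
                        ("highest of remaining six", 0.49)]:
        print(name, score, confidence_tier(score))
```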
TXTL assay to test each Acr against the Cas13a nuclease from Leptotrichia wadei (LwaCas13a) (Figure 3B). AcrVIA3′ from one study (Lin et al., 2020) yielded inconclusive results due to non-specific inhibition of deGFP expression, as we have observed with other Acrs (Marshall et al., 2018). The six other Acrs from the same study failed to inhibit RNA cleavage by LwaCas13a, even though the same nuclease was reported to be robustly inhibited by these Acrs in the original study (Lin et al., 2020). Our results parallel recent work that also failed to observe inhibition by the seven Acrs in cell-based assays (Meeske et al., 2021). No inhibition of LwaCas13a was observed in our TXTL-based assay with the Acr from the separate study (AcrVIA1), although this Cas13a:Acr combination had not been tested (Meeske et al., 2020). Therefore, we tested a separate nuclease from Leptotrichia shahii (LshCas13a) as part of the same TXTL-based assay (Figure 3B). AcrVIA1 strongly inhibited RNA cleavage by LshCas13a (85%). Interestingly, RNA cleavage was also inhibited by AcrVIA1′ (84%), AcrVIA4′ (77%), and AcrVIA5′ (60%). These results suggest that the seven reported AcrVIA proteins (AcrVIA1′–AcrVIA7′) may exhibit inhibitory activity, although the major conclusion is that the specific inhibitory activities originally reported for these Acrs could not be replicated (Lin et al., 2020). We also conclude that TXTL can be used to assess the inhibitory activity of putative Acrs against Cas13 nucleases.
The confirmed validity of AcrVIA1 and the uncertain validity of the other seven Acrs raise the question: what do DeepAcr and the other Acr prediction algorithms predict for these Acrs? Feeding each sequence into DeepAcr, the algorithm assigned a high confidence score to AcrVIA1 (0.99) and a moderate confidence score to AcrVIA1′ (0.74), the two Acrs exhibiting the strongest inhibition of LshCas13a. DeepAcr assigned low confidence scores (0.28–0.49) to the remaining six Acrs (Figure 3C). Interestingly, of the available Acr prediction tools, AcRanker predicted all but AcrVIA1 as Acrs, whereas PaCRISPR and AcrHub considered all as Acrs. Therefore, our TXTL results and those assessing the validity of AcrVIA1′–AcrVIA7′ suggest that DeepAcr can predict Acrs with enhanced accuracy over existing prediction tools.
TXTL-based screening of Acr candidates against Cas13b reveals a potent inhibitor
With the TXTL-based assay established, we were positioned to begin screening Acr candidates. We focused on candidates within the subtypes of type VI CRISPR-Cas systems lacking any reported Acrs. The VI-B subtype and its Cas13b nuclease were particularly attractive, in part because a number of these nucleases have been experimentally characterized and used as technologies for gene silencing and RNA editing (Cox et al., 2017; Kellner et al., 2019). From the highest-scoring candidates associated with VI-B systems from DeepAcr, we chose 77 to assess in our TXTL-based assay (Figure 4A; Table S5). These candidates were variably predicted as Acrs by AcRanker, PaCRISPR, and AcrHub, and only a few were flanked by a predicted aca gene (Table S5). Given that many known Acrs can exhibit a narrow inhibitory spectrum (Shin et al., 2017; Watters et al., 2018; Pinilla-Redondo et al., 2020), we incorporated three phylogenetically distinct Cas13b nucleases from Porphyromonas gingivalis (PgiCas13b), Prevotella buccae (PbuCas13b), and Bergeyella zoohelcum (BzoCas13b). Of the tested Acr:Cas13b combinations, one candidate (AcrVIB_5) exhibited virtually complete inhibition (96%) against PbuCas13b (Figures 4A and S5).
To initially validate the screening hit, we repeated the TXTL assay using different dilutions of pre-expressed AcrVIB_5 (Figure 4B). We also introduced a second reporter plasmid encoding mCherry lacking the gRNA target, which would be silenced through collateral RNA cleavage triggered by targeting the deGFP transcript. Adding pre-expressed Acr inhibited the silencing of both deGFP and mCherry in a dose-dependent manner (Figure 4C), paralleling the results from the large-scale screen. We also subjected two other Cas13b homologs (PguCas13b from Porphyromonas gulae and RanCas13b from Riemerella anatipestifer) as well as two Cas13a nucleases (LshCas13a and LwaCas13a) to the TXTL-based assay with AcrVIB_5 (Figure S6). We found that only one additional nuclease, the Cas13b from Porphyromonas gulae (PguCas13b), was partially inhibited by AcrVIB_5 in a dose-dependent manner. As PguCas13b shares 52% identity with PbuCas13b within the set, we conclude that AcrVIB_5 exhibits a narrow inhibitory spectrum. Interestingly, the strain encoding AcrVIB_5 encodes a type VI-B CRISPR-Cas system, although the associated nuclease (RanCas13b) was not inhibited by AcrVIB_5 in TXTL (Figure S6). Given that AcrVIB_5 and RanCas13b are present in the same genome and RanCas13b and PbuCas13b bear little similarity (43.4%), we speculate that the co-occurrence of the Acr and Cas13b is coincidental rather than representative of direct inhibition of the endogenous VI-B CRISPR-Cas system.
To validate the inhibitory activity of AcrVIB_5 outside of TXTL, we performed two cell-based assays based on plasmid interference and phage defense. In the first assay, E. coli expressing PbuCas13b, a gRNA, and AcrVIB_5 were transformed with a plasmid constitutively expressing the gRNA target (Figure 5A). Under targeting conditions and in the absence of nuclease inhibition, widespread collateral RNA cleavage by activated Cas13b induces cellular dormancy (Abudayyeh et al., 2016; Meeske et al., 2019), leading to a drop in the number of colonies. As expected, targeting in the absence of AcrVIB_5 led to more than a 100-fold reduction in colonies compared with a non-targeting control, whereas expressing AcrVIB_5 resulted in similar colony counts under targeting and non-targeting conditions (Figure 5B). In the second assay, E. coli cells expressing the same components were infected with the lytic RNA phage MS2 followed by measuring plaque formation by the infecting phage (Figure 5C). Paralleling the plasmid interference assay, targeting in the absence of AcrVIB_5 eliminated all discernible plaques, whereas expressing AcrVIB_5 resulted in restored plaque formation. When expressing AcrVIB_5, the plaques were more opaque under targeting conditions, indicative of residual immune activity (Figure 5D). Therefore, AcrVIB_5 can inhibit the activity of Cas13b in vivo, including under conditions in which the Acr promotes phage infection. We now adopt the name AcrVIB1 following the naming convention established for Acrs (Bondy-Denomy et al., 2018).
AcrVIB1 is a 115-residue protein consistently encoded downstream of a conserved HTH-containing gene
With AcrVIB1 validated as an Acr against type VI-B CRISPR-Cas systems, we explored the properties of this protein and its homologs as well as the genomic contexts in which they are found. AcrVIB1 is 115 amino acids in length and contains no known motifs (Figure 6A). PSI-BLAST revealed only four non-identical homologs sharing between 90% and 95% amino acid sequence identity with AcrVIB1, where the next closest search hit (A5Z863) shared only 12% identity (Altschul et al., 1990). Across the related homologs associated with a genomic sequence, all were found in Riemerella anatipestifer genomes and fell within predicted prophage regions. The homologs were also flanked upstream by a conserved gene encoding an HTH domain (Figure 6B). Although this domain is a standard feature of aca genes, aca genes normally sit downstream of putative acr genes, an orientation heavily weighted by the Gussow method for Acr prediction (Gussow et al., 2020). We conclude that AcrVIB1 exhibits compositional and genomic hallmarks of other Acrs.
AcrVIB1 principally inhibits upstream of Cas13b binding the crRNA
Our TXTL and cell-based assays reported a reduction in RNA cleavage by AcrVIB1, although any of the biomolecular steps leading to RNA cleavage (from nuclease and gRNA expression and RNP complex formation to target recognition and HEPN activation) could be mechanistic targets. Therefore, we took steps to explore which of these steps was inhibited by AcrVIB1. Paralleling our prior work evaluating the timing of Cas9:sgRNA complex formation (Marshall et al., 2018), we assessed different mechanistic steps by changing the set of constructs added to a given TXTL reaction as well as the timing of when each construct is added. We began by pre-expressing
[Figure 4 graphic. (A) Heatmap of inhibition of nuclease activity (%; scale 0 to 100) for each VI-B Acr candidate against Pgi, Pbu, and Bzo, with non-specific GFP inhibition marked. (B) Schematic of the assay: Cas13b, gRNA, and dilutions of the Acr combined with the deGFP and mCherry reporters (~16 hours each step). (C) Inhibition of nuclease activity (%) and fluorescence (RFU) for deGFP and mCherry with no Acr and with 1:1, 1:2, and 1:5 dilutions of pre-expressed AcrVIB_5; NT and T reactions are shown.]
Figure 4. The screening of Acr candidates associated with VI-B CRISPR-Cas systems reveals a potent inhibitor of PbuCas13b
(A) Heatmap of 77 screened Acr candidates with medium or high confidence scores. See Table S5 for the list of the candidates along with their associated scores.
Boxes with a white X represent the non-specific inhibition of deGFP expression by the Acr. Values represent the average of four independent experiments. See
Table S6 for the individual values. Pgi: PgiCas13b. Pbu: PbuCas13b. Bzo: BzoCas13b. AcrVIB_5 was renamed AcrVIB1 following the current nomenclature
(Bondy-Denomy et al., 2018).
(B) Overview of TXTL assay to assess inhibition of on-target and collateral RNA cleavage by AcrVIB_5. The gRNA targets the deGFP transcript but not the
mCherry transcript.
(C) Results from the TXTL-based assay to assess inhibition of RNA cleavage activity by PbuCas13b with AcrVIB_5. Top: measured inhibitory activity based on
deGFP fluorescence (left) and mCherry fluorescence (right). Bottom: individual fluorescence end-point measurements. T, targeting gRNA; NT, non-targeting
gRNA. Error bars on the top reflect the mean and standard deviation of the inhibition values calculated from the fluorescence end-point measurements on the
bottom. Measurements are based on triplicate independent experiments.
See also Figures S4–S6; Tables S5, S6, and S7.
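The inhibition percentages in Figure 4 are calculated from fluorescence end-point measurements of targeting (T) and non-targeting (NT) reactions, but the formula is not given in this section. The sketch below shows one plausible normalization, stated here purely as an assumption: fluorescence under targeting is expressed relative to the matching non-targeting control, and the recovery seen with the Acr is scaled between the no-Acr level (0% inhibition) and full recovery (100% inhibition). The function name and the example values are hypothetical.

```python
def percent_inhibition(t_acr, nt_acr, t_no_acr, nt_no_acr):
    """Estimate inhibition of Cas13-mediated silencing from end-point fluorescence.

    Assumed normalization (not taken from the source):
        silenced = T / NT without Acr
        with_acr = T / NT with Acr
        inhibition = 100 * (with_acr - silenced) / (1 - silenced)
    """
    if nt_acr <= 0 or nt_no_acr <= 0:
        raise ValueError("non-targeting fluorescence must be positive")
    silenced = t_no_acr / nt_no_acr
    with_acr = t_acr / nt_acr
    if silenced >= 1:
        raise ValueError("no silencing observed without Acr; inhibition undefined")
    return 100.0 * (with_acr - silenced) / (1.0 - silenced)


if __name__ == "__main__":
    # Hypothetical RFU values: targeting silences deGFP to 10% of the NT level.
    print(percent_inhibition(t_acr=2.2, nt_acr=4.0, t_no_acr=0.4, nt_no_acr=4.0))
```

Dividing by the non-targeting control with the same Acr dose is intended to separate specific inhibition of the nuclease from non-specific effects of the Acr on reporter expression, which the screen flags separately.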
AcrVIB1 and adding it while PbuCas13b, the gRNA, and the targeted deGFP transcript were being produced (Figure 7A). Inhibition of deGFP silencing was complete and then greatly decreased with each hour delaying AcrVIB1 addition. deGFP production also did not rebound after adding AcrVIB1 (Figure S7). Therefore, AcrVIB1 was not inhibiting Cas13b-mediated RNA cleavage and instead was inhibiting some upstream step.
To evaluate whether inhibition was occurring before or after RNP complex formation, we pre-expressed PbuCas13b and the gRNA together to form an RNP complex and then combined the complex with the deGFP reporter. Pre-expressed AcrVIB1 was added immediately before the deGFP reporter or at later time points (Figure 7B). Under this setup, AcrVIB1 only inhibited deGFP silencing by 23% followed by a similar decrease in inhibition with each hour delaying AcrVIB1 addition. Incubating the pre-formed RNP complex with AcrVIB1 for different lengths of time before adding the deGFP reporter did not restore inhibition (Figure 7C). The partial inhibition could reflect AcrVIB1 interfering
[Figure 5 graphic. (A, C) Plasmid schematics showing acr, cas13b, gRNA, and deGFP constructs with kanR, cmR, and ampR markers. (B) Colonies from bacteria dilutions of 1:1 to 1:3,125, with or without the deGFP guide and AcrVIB1. (D) Plaques from MS2 phage added at 1.3 × 10^9 to 1.3 × 10^5 PFU/ml, with or without the MS2 guide and AcrVIB1.]
Figure 5. AcrVIB1 inhibits plasmid interference and phage defense by PbuCas13b in E. coli
(A) Experimental setup to evaluate the inhibitory activity of AcrVIB_5 in an in vivo killing assay. Bacteria harboring PbuCas13b, a gRNA, and AcrVIB_5 are transformed with a plasmid encoding the deGFP target.
(B) Colony formation following the transformation of the targeted plasmid. Results are representative of triplicate independent experiments.
(C) Overview of the experimental setup of the MS2 phage infection assay. Bacteria harboring PbuCas13b, a gRNA, and AcrVIB_5 are challenged with MS2
phages.
(D) Plaque formation following infection with the lytic MS2 phage. Results are representative of triplicate independent experiments.
See also Table S7.
with target recognition, although the inhibitory effect mostly occurred upstream of RNP complex formation. These insights suggest that AcrVIB1 principally exerts its inhibitory effect early in the process of CRISPR-based immunity rather than merely blocking the nuclease activity of the RNP complex.
DISCUSSION
Through this work, we developed and applied a deep-learning model called DeepAcr to predict novel Acrs. Unlike all but one of the prior machine learning algorithms, DeepAcr operates purely based on information from an input protein sequence. The one exception, AcRanker, does not use a neural network architecture and is currently used in combination with a guilt-by-association approach in AcrDB (Eitzinger et al., 2020). Focusing on the protein sequence does ignore genomic features such as a flanking HTH-containing gene, an encompassing prophage region, and the presence of a CRISPR-Cas system with self-targeting spacers that are also indicators of Acrs. Despite not using this additional information and only relying on protein-related features, our algorithm predicted the vast majority of Acr candidates predicted by the Gussow method (Gussow et al., 2020), a recently reported machine learning model that relies on genomic features. Genomic features could also be incorporated into DeepAcr, although there is the potential that these additional features lead to overfitting to existing Acrs and thus result in otherwise strong candidates being discarded. The overlap between the predictions for DeepAcr and the Gussow method was remarkable given the different features used between the two algorithms. At the same time, DeepAcr uniquely predicted Acr candidates associated with CRISPR-Cas subtypes that were not part of the training set, with some currently lacking any reported Acrs. The Acrs from these subtypes (e.g., IV-A in which little is known about the biology of these systems) provide a large set of candidates that can be screened using experimental approaches, such as TXTL. In turn, new Acrs could be revealed that expand the known Acr universe to almost every subtype of the CRISPR-Cas system and reveal new inhibitory mechanisms and the outcomes of the bacteria-phage arms race.
By screening Acr candidates associated with type VI-B CRISPR-Cas systems, we identified one Acr exhibiting strong inhibition of Cas13b. The Acr, which we call AcrVIB1, was
[Figure 6 graphic. (A) Alignment rows by protein name: WP_004917816.1, WP_064968248.1, WP_153937162.1, WP_079206532, and the uncharacterized protein A5Z863. (B) Gene neighborhoods from position -5 to 5 around AcrVIB1, with the HTH gene marked. For each genome, the columns "Type VI-B CRISPR-Cas system?" and "in prophage region?" read: NC_014738 (WP_004917816), Yes, Yes; LUDK01000005 (WP_064968248), No, Yes; NZ_QXHV01000004 (WP_153937162), No, Yes; CP011859 (WP_079206532), No, Yes.]
Figure 6. AcrVIB1 and its homologs are 115-amino acid proteins that appear downstream of an HTH domain-encoding gene in prophage
regions of Riemerella anatipestifer genomes
(A) Protein sequence alignment of AcrVIB1 and its homologs. Homologs were identified by PSI-BLAST. A5Z863 is shown as the next PSI-BLAST hit after the
homologs. Identities to AcrVIB1 (NCBI: WP_004917816) are shown.
(B) Genetic synteny between AcrVIB1 and three homologs. The genes in gray are unrelated to any other displayed genes. The genes in non-gray colors share
>85% with the same-color genes.
See also Table S7.
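The identities reported for the AcrVIB1 homologs are percentages over aligned sequences. As a rough illustration of how such a value is obtained, the sketch below counts identical columns in a pairwise alignment. It is not the PSI-BLAST calculation; the function name, the choice of denominator (all columns that are not gap-gap), and the toy sequences are assumptions made here.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two rows of a pairwise alignment.

    Columns where both rows hold a gap are ignored; a residue aligned to a
    gap counts as a mismatch. Other denominators are also in common use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == gap and b == gap)]
    if not columns:
        raise ValueError("alignment has no comparable columns")
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Toy aligned fragments (hypothetical, not AcrVIB1 sequences).
    print(percent_identity("MKT-LV", "MKSALV"))
```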
identified in the genomes of Riemerella anatipestifer. AcrVIB1 and its homologs were consistently encoded immediately downstream of an HTH-encoding gene within prophage regions, similar to many other characterized Acrs. Based on the characterization of HTH-encoding proteins (Birkholz et al., 2019; Stanley et al., 2019), these genes would be expected to regulate the expression of AcrVIB1. Using TXTL, we were able to interrogate which step of immune defense AcrVIB1 is inhibiting. Our data indicated that AcrVIB1 was principally inhibiting a step upstream of RNP complex formation. These data already indicate that AcrVIB1 functions differently than most known Acrs, which act by directly binding the RNP complex to block target recognition or nuclease activity (Dong et al., 2017; Shin et al., 2017; Watters et al., 2018; Zhang et al., 2019). Instead, AcrVIB1 may inhibit RNP complex formation, or it may affect the expression or stability of the unbound gRNA or the Cas13b holoenzyme or the ability of the two components to interact. In-depth in vitro approaches can be pursued next to elucidate the exact mechanism of action. Once determined, the Acr could be adapted for controlling Cas13b in the different applications in which these nucleases are employed (Cox et al., 2017; Abudayyeh et al., 2019).
Limitations of the study
Although the TXTL-based screen led to the identification of AcrVIB1, none of the other 76 screened candidates exhibited robust inhibitory activity despite all possessing high confidence scores. Even though this low hit rate could reflect a need for further improvements in the model, there are other explanations independent of deep learning or CRISPR-Cas assignment. For instance, many Acrs exhibit narrow inhibitory spectra (Shin et al., 2017; Marshall et al., 2018; Watters et al., 2018; Uribe et al., 2019; Pinilla-Redondo et al., 2020). As a result, our set of screened candidates could contain Acrs with inhibitory spectra that do not encompass the three Cas13b nucleases. Separately, the Acrs could affect other aspects of adaptive immunity by VI-B CRISPR-Cas systems. These systems can harbor Csx27 or Csx28, accessory proteins that, respectively, regulate nuclease activity or augment immune defense and could be targets of
A Acr added
B Acr added C Acr and RNP
before RNP formed after RNP formed pre-incubated
gRNA
delayed
deGFP deGFP
Inhibition of
Inhibition of
50 50 50
0 0 0
Acrs (VanderWal et al., 2016; Smargon et al., 2017). The Acrs could tegrated into Acr identification, potentially revealing new mecha-
also affect the natural expression of the systems as well as spacer nisms of action in which phages counter CRISPR-Cas defenses.
acquisition. Although these modes of inhibition would involve
proteins beyond Cas nucleases, the sequence and mechanistic STAR+METHODS
diversity of established Acrs and Cas nucleases lend to the iden-
tification of other Acrs using deep learning. By taking these other Detailed methods are provided in the online version of this paper
mechanisms into account, new screens could be devised and in- and include the following:
Hynes, A.P., Rousseau, G.M., Agudelo, D., Goulet, A., Amigues, B., Loehr, J., Romero, D.A., Fremaux, C., Horvath, P., Doyon, Y., et al. (2018). Widespread anti-CRISPR proteins in virulent bacteriophages inhibit a range of Cas9 proteins. Nat. Commun. 9, 2919.
Hynes, A.P., Rousseau, G.M., Lemay, M.L., Horvath, P., Romero, D.A., Fremaux, C., and Moineau, S. (2017). An anti-CRISPR from a virulent streptococcal phage inhibits Streptococcus pyogenes Cas9. Nat. Microbiol. 2, 1374–1380.
Jackson, S.A., McKenzie, R.E., Fagerlund, R.D., Kieper, S.N., Fineran, P.C., and Brouns, S.J. (2017). CRISPR-Cas: adapting to change. Science 356, eaal5056.
Kellner, M.J., Koob, J.G., Gootenberg, J.S., Abudayyeh, O.O., and Zhang, F. (2019). Sherlock: nucleic acid detection with CRISPR nucleases. Nat. Protoc. 14, 2986–3012.
Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. ArXiv. http://arxiv.org/abs/1412.6980.
Landsberger, M., Gandon, S., Meaden, S., Rollie, C., Chevallereau, A., Chabas, H., Buckling, A., Westra, E.R., and van Houte, S. (2018). Anti-CRISPR phages cooperate to overcome CRISPR-Cas immunity. Cell 174, 908–916.e12.
Lee, J., Mir, A., Edraki, A., Garcia, B., Amrani, N., Lou, H.E., Gainetdinov, I., Pawluk, A., Ibraheim, R., Gao, X.D., et al. (2018). Potent Cas9 inhibition in bacterial and human cells by AcrIIC4 and AcrIIC5 anti-CRISPR proteins. mBio 9, e02321–18.
Lin, P., Qin, S., Pu, Q., Wang, Z., Wu, Q., Gao, P., Schettler, J., Guo, K., Li, R., Li, G., et al. (2020). CRISPR-Cas13 inhibitors block RNA editing in bacteria and mammalian cells. Mol. Cell 78, 850–861.e5.
Loshchilov, I., and Hutter, F. (2016). SGDR: stochastic gradient descent with warm restarts. ArXiv. http://arxiv.org/abs/1608.03983.
Makarova, K.S., Wolf, Y.I., Iranzo, J., Shmakov, S.A., Alkhnbashi, O.S., Brouns, S.J.J., Charpentier, E., Cheng, D., Haft, D.H., Horvath, P., et al. (2020). Evolutionary classification of CRISPR-Cas systems: a burst of class 2 and derived variants. Nat. Rev. Microbiol. 18, 67–83.
Marino, N.D., Pinilla-Redondo, R., Csörgő, B., and Bondy-Denomy, J. (2020). Anti-CRISPR protein applications: natural brakes for CRISPR-Cas technologies. Nat. Methods 17, 471–479.
Marino, N.D., Zhang, J.Y., Borges, A.L., Sousa, A.A., Leon, L.M., Rauch, B.J., Walton, R.T., Berry, J.D., Joung, J.K., Kleinstiver, B.P., and Denomy, J.B.
Padilha, V.A., Alkhnbashi, O.S., Shah, S.A., de Carvalho, A.C.P.L.F., and Backofen, R. (2020). CRISPRcasIdentifier: machine learning for accurate identification and classification of CRISPR-Cas systems. GigaScience 9, giaa062.
Padilha, V.A., Alkhnbashi, O.S., Tran, V.D., Shah, S.A., Carvalho, A.C.P.L.F., and Backofen, R. (2021). Casboundary: automated definition of integral Cas cassettes. Bioinformatics 37, 1352–1359.
Pawluk, A., Bondy-Denomy, J., Cheung, V.H., Maxwell, K.L., and Davidson, A.R. (2014). A new group of phage anti-CRISPR genes inhibits the type I-E CRISPR-Cas system of Pseudomonas aeruginosa. mBio 5, e00896.
Pawluk, A., Davidson, A.R., and Maxwell, K.L. (2018). Anti-CRISPR: discovery, mechanism and function. Nat. Rev. Microbiol. 16, 12–17.
Pawluk, A., Staals, R.H., Taylor, C., Watson, B.N., Saha, S., Fineran, P.C., Maxwell, K.L., and Davidson, A.R. (2016). Inactivation of CRISPR-Cas systems by anti-CRISPR proteins in diverse bacterial species. Nat. Microbiol. 1, 16085.
Pinilla-Redondo, R., Shehreen, S., Marino, N.D., Fagerlund, R.D., Brown, C.M., Sørensen, S.J., Fineran, P.C., and Bondy-Denomy, J. (2020). Discovery of multiple anti-CRISPRs highlights anti-defense gene clustering in mobile genetic elements. Nat. Commun. 11, 5652.
Rauch, B.J., Silvis, M.R., Hultquist, J.F., Waters, C.S., McGregor, M.J., Krogan, N.J., and Bondy-Denomy, J. (2017). Inhibition of CRISPR-Cas9 with bacteriophage proteins. Cell 168, 150–158.e10.
Shah, S.A., Alkhnbashi, O.S., Behler, J., Han, W., She, Q., Hess, W.R., Garrett, R.A., and Backofen, R. (2019). Comprehensive search for accessory proteins encoded with archaeal and bacterial type III CRISPR-cas gene cassettes reveals 39 new cas gene families. RNA Biol 16, 530–542.
Shin, J., Jiang, F., Liu, J.J., Bray, N.L., Rauch, B.J., Baik, S.H., Nogales, E., Bondy-Denomy, J., Corn, J.E., and Doudna, J.A. (2017). Disabling Cas9 by an anti-CRISPR DNA mimic. Sci. Adv. 3, e1701620.
Shin, J., and Noireaux, V. (2012). An E. coli cell-free expression toolbox: application to synthetic gene circuits and artificial cells. ACS Synth. Biol. 1, 29–41.
Smargon, A.A., Cox, D.B.T., Pyzocha, N.K., Zheng, K., Slaymaker, I.M., Gootenberg, J.S., Abudayyeh, O.A., Essletzbichler, P., Shmakov, S., Makarova, K.S., et al. (2017). Cas13b Is a type VI-B CRISPR-associated RNA-guided RNase differentially regulated by accessory proteins Csx27 and Csx28. Mol. Cell 65, 618–630.e7.
STAR+METHODS
RESOURCE AVAILABILITY
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact Prof. Chase
L. Beisel ([email protected]).
Materials availability
The following plasmids generated in this study have been deposited to Addgene: pPbuCas13b-gRNA-NT (#184568), pPgiCas13b
(#184833), pPgiCas13b-gRNA-1 (#184834), pPgiCas13b-gRNA-NT (#184835), pAcrVIB1-cloDF-kan (#184836), pcloDF-kan
(#184837), pPlacPbuCas13b-gRNA-NT (#184838), pPlacPbuCas13b-gRNA-T-MS2 (#184839), pPlacPbuCas13b-gRNA-T-
deGFP (#184840), P70a-deGFP-sc101-kan (#184841), pPJ23116-MS2-rep (#184842), pAcrVIB1-cloDF-amp (#184843), pcloDF-
amp (#184844). All other plasmids can be obtained upon reasonable request to Prof. Chase L. Beisel (chase.beisel@helmholtz-
hiri.de).
All bacterial strains used in this study are listed below. With the exception of Escherichia coli KL740 cI857+, all strains were cultured in
LB medium with incubation at 37 °C, 220 rpm. Escherichia coli KL740 cI857+ was cultured at 29 °C. All strains were stored as glycerol
stocks at -80 °C.
METHOD DETAILS
Model architecture
The proposed method performs binary classification of input protein sequences (Acr versus non-Acr). The first layer of the architecture is a single
convolutional layer that downsamples the protein input and creates an embedding to be processed by a recurrent layer. We,
therefore, use a relatively large stride (usually around 10) and kernel size (usually around 15). The use of convolution here is motivated
by the idea that amino acids that are closely clustered in the protein are more likely to form a feature. Additionally, creating an
embedding through downsampling makes the computation in the recurrent layer easier. We use the size of the one-hot
encoding of the amino acids as the number of input channels and reduce the number of channels down to 9 in the output.
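To illustrate how strongly this convolution shortens the sequence handed to the recurrent layer, the following minimal sketch applies the standard output-length formula for a 1D convolution. The kernel size of 15 and stride of 10 are the approximate values stated above; the absence of padding and dilation and the example protein lengths are assumptions made for illustration only.

```python
def conv1d_output_length(seq_len: int, kernel_size: int = 15,
                         stride: int = 10, padding: int = 0) -> int:
    """Number of positions produced by a 1D convolution without dilation.

    kernel_size and stride follow the approximate values given in the text;
    padding=0 is an assumption, not a reported setting.
    """
    padded = seq_len + 2 * padding
    if padded < kernel_size:
        return 0  # sequence too short for a single kernel placement
    return (padded - kernel_size) // stride + 1


# Hypothetical protein lengths (in amino acids), chosen only for illustration.
for n in (100, 250, 1000):
    print(n, "->", conv1d_output_length(n))
```

Under these assumptions, the recurrent layer sees roughly one embedding vector per ten residues rather than one per residue.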
The recurrent layer consists of a bidirectional recurrent cell (i.e., it includes a forward and a backward pass) based on either LSTM,
linear, or GRU cells, with up to two layers (standard architecture). LSTM cells implement long short-term memory by using input, forget, and
output gates. GRU cells are a smaller version of an LSTM cell that does not use an output gate. Within these cells, we apply a dropout
of 0.5. The additional information is preprocessed using (usually just a single) linear layer. The outcome of this preprocessing is then
concatenated with the output of the recurrent layer. This concatenated output is processed by linear layers and then yields the model output.
We apply dropout with a dropout rate of up to 0.5 for these last linear layers. ReLU functions, f(x) = max(0, x),
are used for non-linear transformation. The architecture was inspired by Guo et al. (2019) and optimized for our problem.
The optimization process is described under hyperparameter optimization.
One-hot encoding
We created a one-hot encoding of the amino acids in our input proteins to process our data using neural networks. For each amino acid
belonging to a protein, the one-hot encoding consists of a 20-dimensional zero vector v = (0, ..., x_i, ..., 0) with a single 1 in position x_i
denoting the class of the amino acid. The sum of the one-hot encoding for each amino acid is, therefore, sum(v) = 1. In cases where
graph convolution is applied, the primary form is represented as a one-dimensional graph with connections between neighboring
amino acids and self-referential relations. The idea behind the use of one-hot encoding is that it does not suggest a ranking or order
between the categories.
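A minimal sketch of this encoding is shown below. The alphabetical ordering of the 20 standard amino acids and the function name are assumptions made for illustration; the handling of non-standard residues is not described in the text and is left out, so such residues raise an error here.

```python
# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence: str) -> list[list[int]]:
    """Encode a protein as a list of 20-dimensional one-hot vectors."""
    encoded = []
    for aa in sequence:
        v = [0] * len(AMINO_ACIDS)
        v[INDEX[aa]] = 1  # KeyError for residues outside the assumed alphabet
        encoded.append(v)
    return encoded


# Hypothetical three-residue peptide.
vectors = one_hot("MKT")
print(len(vectors), [sum(v) for v in vectors])
```

Each vector sums to 1, and no residue class is numerically closer to any other, which is the property motivating the encoding.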
Model training
We evaluate the model using a 50-fold cross-validation procedure. In each fold, 15% of the remaining training set is held out as an additional validation set
for early stopping during training, which limits overfitting to the training data. We use cross-entropy loss to evaluate the fit to the data and
weight the data according to its distribution. A weight decay of 0.001 and a learning rate of 0.01 are applied with an Adam optimizer (Kingma
and Ba, 2014), which computes an adaptive learning rate for every parameter. We use mini-batches containing 30 samples
each for training. Cosine annealing (Loshchilov and Hutter, 2016) is used as a scheduler to adapt the learning rate, providing a
warm restart at 25 epochs. Test and validation sets are sampled in a stratified fashion so that the class distribution is equal across sets.
We repeat this training process three times to create an ensemble of neural networks, with each network trained for 50 epochs. Each
network in the ensemble is initialized randomly using a Xavier normal function to capture different modes of the underlying solution
space. The training uses weight decay, dropout, ensembles, and early stopping to prevent overfitting to the training data.
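The learning-rate schedule can be sketched as follows, using the cosine annealing formula from Loshchilov and Hutter (2016) with the base learning rate of 0.01 and the restart at 25 epochs given above. The minimum learning rate of 0, the fixed restart period, and the per-epoch (rather than per-batch) update are assumptions, since the text does not state them.

```python
import math


def cosine_annealing_lr(epoch: int, base_lr: float = 0.01,
                        period: int = 25, min_lr: float = 0.0) -> float:
    """Learning rate at a given epoch under cosine annealing with warm restarts.

    base_lr and period come from the text; min_lr=0.0 and a constant
    period are assumptions made for this sketch.
    """
    t_cur = epoch % period  # epochs elapsed since the last restart
    cosine = math.cos(math.pi * t_cur / period)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# The rate decays over epochs 0-24, then jumps back to base_lr at epoch 25.
for epoch in (0, 12, 24, 25, 49):
    print(epoch, cosine_annealing_lr(epoch))
```

With 50 training epochs, this assumed schedule gives each network two full annealing cycles.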
Model testing
We evaluate the validation set from the cross-validation procedure in an ensemble setting. During prediction, we calculate the mean
prediction of all three networks (output1, output2, output3) and then apply a sigmoid function. The final prediction is the class with the
highest output: argmax(out). The idea here is that by using multiple networks we can better model uncertainty and increase our ac-
curacy. We give the accuracy of the ensemble prediction. The final result is the mean of all cross-validation runs.
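The ensemble prediction step described above can be sketched as follows: the raw outputs of the three networks are averaged per class, a sigmoid is applied, and the class with the highest value is returned. The two-class output layout, the example output values, and the convention that index 1 denotes the Acr class are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ensemble_predict(outputs: list[list[float]]) -> tuple[int, list[float]]:
    """Average per-class outputs across networks, apply a sigmoid, take the argmax."""
    n_classes = len(outputs[0])
    mean = [sum(o[c] for o in outputs) / len(outputs) for c in range(n_classes)]
    scores = [sigmoid(m) for m in mean]
    label = max(range(n_classes), key=scores.__getitem__)
    return label, scores


# Hypothetical raw outputs (output1, output2, output3) for a single protein.
# Index 0 = non-Acr, index 1 = Acr (assumed convention).
label, scores = ensemble_predict([[0.2, 1.1], [-0.4, 0.9], [0.5, 0.1]])
print(label, scores)
```

Because the sigmoid is monotonic, the predicted class is determined by the averaged outputs; the sigmoid only rescales them to the interval (0, 1).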
Hyperparameter optimization
To optimize our architecture, we ran a regularized Evolutionary Algorithm (EA) (Tan et al., 2002) for 120 iterations. At first, 25 initial
architectures were created randomly. After evaluating all architectures, we create the next architectures to be evaluated. To do so,
we select three architectures by tournament: in each tournament, three architectures are chosen at random,
and only the best one is returned. The quality of an architecture is determined by its performance (single-objective). The chosen
architectures are then either mutated (with a probability of 90%) or recombined with another architecture (with a probability of 10%). During
mutation, a third of the hyperparameters are chosen at random and mutated using a Gaussian mutation procedure by sampling around
the previous hyperparameter value with a standard deviation of 1/3 of the range of the hyperparameter. The values are then clipped if they exceed the
range of the hyperparameter. During recombination, a third of the hyperparameters of an architecture are chosen at random and
replaced with the values of a second architecture. The oldest three of all architectures are then discarded. Finally, we return
the best-performing architectures. Regularized EA has previously been used for similar tasks of neural architecture search (Tan et al., 2002),
and that work concludes that evolution provides a simple optimization method that can yield good results, especially
with limited computational resources.
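The tournament selection and Gaussian mutation steps can be sketched as below. The hyperparameter names and ranges are hypothetical placeholders, not the search space used in the study, and the sketch omits recombination, aging, and rounding of integer-valued hyperparameters.

```python
import random

# Hypothetical search space: name -> (lower bound, upper bound).
RANGES = {"stride": (1.0, 20.0), "kernel_size": (3.0, 30.0), "dropout": (0.0, 0.5)}


def tournament(population, fitness, k=3, rng=random):
    """Draw k architectures at random and return the best one."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)


def mutate(arch, ranges=RANGES, rng=random):
    """Mutate a third of the hyperparameters with Gaussian noise, then clip."""
    child = dict(arch)
    n_mut = max(1, len(arch) // 3)
    for name in rng.sample(sorted(arch), n_mut):
        low, high = ranges[name]
        # Standard deviation is one third of the hyperparameter's range.
        value = rng.gauss(arch[name], (high - low) / 3.0)
        child[name] = min(high, max(low, value))  # clip to the allowed range
    return child


rng = random.Random(0)
parent = {"stride": 10.0, "kernel_size": 15.0, "dropout": 0.5}
print(mutate(parent, rng=rng))
```

In a full loop under the description above, a selected architecture would be passed to `mutate` with probability 0.9 and to a recombination step otherwise.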
Prophage detection
Prophage locations were identified using the PHASTER web-server (Arndt et al., 2016).
the process of cloning. The Acr and the no-Acr control were cloned via Gibson Assembly using a backbone encoding a cloDF origin
and a kanamycin resistance marker. The constructed plasmids were prepared using the ZymoPURE II Plasmid Midiprep Kit (Zymo
Research), and the sequences were confirmed via Sanger sequencing. E. coli BL21(DE3) cells carrying both the plasmid encoding the
nuclease and a single-spacer CRISPR array and the plasmid encoding either AcrVIB1 or the no-Acr control were transformed with 50 ng of the plasmid
encoding a target deGFP gene. After transformation, the cells were recovered in SOC for 1 h at 37 °C while shaking at 220 rpm.
The cells were plated on LB agar plates with triple antibiotic (Cm, Kana, Amp) in 5-fold serial dilutions. After 16 h of growth, the colony
numbers were recorded for further analysis. In addition to counting the colonies, photos of the plates were taken using the
ImageQuant 800 (Amersham).
As part of the analysis of TXTL fluorescence data, the background fluorescence was subtracted from all samples. Background
fluorescence was measured using samples that only contained myTXTL mix and water. Grubbs' test was performed using the values
after 16 h to identify outliers between replicates (α = 0.1) when the data are presented in the form of a bar chart. If no outliers were
identified, one of the four replicates was discarded randomly. The time course graph shows the average deGFP fluorescence over time
together with the standard deviation. The percent inhibition of nuclease activity by each Acr candidate was calculated using the
fluorescence values after 16 h in the following formula:
\[
\%\ \text{Inhibition of nuclease activity} = 100\% \times \left( \frac{\dfrac{\mathrm{GFP}_{t,\mathrm{Acr}}}{\mathrm{GFP}_{nt,\mathrm{Acr}}} - \dfrac{\mathrm{GFP}_{t,-}}{\mathrm{GFP}_{nt,-}}}{1 - \dfrac{\mathrm{GFP}_{t,-}}{\mathrm{GFP}_{nt,-}}} \right),
\]
where GFP_{t,Acr} is the GFP fluorescence in presence of a targeting gRNA and an Acr candidate, GFP_{nt,Acr} is the GFP fluorescence in
presence of a non-targeting gRNA and an Acr candidate, GFP_{t,-} is the GFP fluorescence in presence of a targeting gRNA and no Acr,
and GFP_{nt,-} is the GFP fluorescence in presence of a non-targeting gRNA and no Acr.
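As a worked illustration of this normalization, the sketch below computes the percent inhibition from four background-subtracted fluorescence values. The function name, argument names, and example values are hypothetical and are not measurements from the study.

```python
def percent_inhibition(gfp_t_acr: float, gfp_nt_acr: float,
                       gfp_t_ctrl: float, gfp_nt_ctrl: float) -> float:
    """Percent inhibition of nuclease activity from endpoint fluorescence.

    gfp_t_acr / gfp_nt_acr: targeting / non-targeting gRNA with an Acr candidate.
    gfp_t_ctrl / gfp_nt_ctrl: targeting / non-targeting gRNA without an Acr.
    """
    ratio_acr = gfp_t_acr / gfp_nt_acr
    ratio_ctrl = gfp_t_ctrl / gfp_nt_ctrl
    return 100.0 * (ratio_acr - ratio_ctrl) / (1.0 - ratio_ctrl)


# Hypothetical values (arbitrary fluorescence units).
print(percent_inhibition(200, 1000, 200, 1000))   # Acr has no effect
print(percent_inhibition(1000, 1000, 200, 1000))  # Acr fully restores deGFP
```

A candidate that leaves the targeting-to-non-targeting ratio unchanged scores 0%, and one that restores fluorescence to the non-targeting level scores 100%.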