Ankh ☥: Optimized Protein Language Model Unlocks General-Purpose Modelling

Ankh is the first general-purpose protein language model trained on Google's TPU-v4, surpassing state-of-the-art performance with dramatically fewer parameters and promoting accessibility to research innovation via attainable resources.

This repository will be updated regularly with new pre-trained models for proteins, as part of supporting the biotech community in revolutionizing protein engineering using AI.

Datasets Availability

Dataset	Huggingface
Remote Homology	`load_dataset("proteinea/remote_homology")`
CASP12	`load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CASP12.csv']})`
CASP14	`load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CASP14.csv']})`
CB513	`load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CB513.csv']})`
TS115	`load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['TS115.csv']})`
DeepLoc	`load_dataset("proteinea/deeploc")`
Fluorescence	`load_dataset("proteinea/fluorescence")`
Solubility	`load_dataset("proteinea/solubility")`
Nearest Neighbor Search	`load_dataset("proteinea/nearest_neighbor_search")`
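The four secondary-structure test sets in the table above differ only in the CSV file name, so the arguments can be built by a small helper. The sketch below is illustrative only: the helper names `secondary_structure_test_args` and `load_test_set` are hypothetical, and it assumes the `datasets` package is installed and the Hugging Face Hub is reachable when `load_test_set` is called.

```python
def secondary_structure_test_args(benchmark):
    """Build load_dataset arguments for one secondary-structure test set."""
    known = {"CASP12", "CASP14", "CB513", "TS115"}
    if benchmark not in known:
        raise ValueError(f"unknown benchmark {benchmark!r}; expected one of {sorted(known)}")
    return {
        "path": "proteinea/secondary_structure_prediction",
        "data_files": {"test": [f"{benchmark}.csv"]},
    }


def load_test_set(benchmark):
    """Download one test set (needs network access and the `datasets` package)."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(**secondary_structure_test_args(benchmark))


print(secondary_structure_test_args("CASP12"))
```

Calling `load_test_set("CB513")` should then be equivalent to the CB513 row of the table.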

Usage

Loading pre-trained models:

  import ankh

  # Load Ankh base.
  model, tokenizer = ankh.load_ankh_base()
  model.eval()

  # Load Ankh large.
  model, tokenizer = ankh.load_ankh_large()
  model.eval()

  # Load Ankh3 Large
  model, tokenizer = ankh.load_ankh3_large()
  model.eval()

  # Load Ankh3 XL
  model, tokenizer = ankh.load_ankh3_xl()
  model.eval()

Feature extraction using Ankh large example:

  import ankh
  import torch

  model, tokenizer = ankh.load_ankh_large()
  model.eval()

  protein_sequences = [
    'MKALCLLLLPVLGLLVSSKTLCSMEEAINERIQEVAGSLIFRAISSIGLECQSVTSRGDLATCPRGFAVTGCTCGSACGSWDVRAETTCHCQCAGMDWTGARCCRVQPLEHHHHHH',
    'GSHMSLFDFFKNKGSAATATDRLKLILAKERTLNLPYMEEMRKEIIAVIQKYTKSSDIHFKTLDSNQSVETIEVEIILPR',
  ]

  # Split each sequence into single residues, as required by is_split_into_words=True.
  protein_sequences = [list(seq) for seq in protein_sequences]

  outputs = tokenizer(
    protein_sequences,
    add_special_tokens=True,
    padding=True,
    is_split_into_words=True,
    return_tensors="pt",
  )
  # Inference only, so gradient tracking is switched off.
  with torch.no_grad():
    embeddings = model(input_ids=outputs['input_ids'], attention_mask=outputs['attention_mask'])
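Because the batch is padded, the per-residue embeddings of shorter sequences contain rows for padding tokens. The sketch below shows one way to trim them using the attention mask. It rests on assumptions not stated in this README: that padding is on the right, that the tokenizer appends exactly one end-of-sequence token, and that the per-token embeddings are available as `embeddings.last_hidden_state`. The helper `strip_padding` is hypothetical and is demonstrated on plain nested lists so it runs without a model.

```python
def strip_padding(hidden_states, attention_mask, drop_eos=True):
    """Return one block of per-residue embeddings per sequence, without padding.

    Works on nested lists or on tensors that support iteration and slicing.
    Assumes right-side padding and, if drop_eos is True, a single
    end-of-sequence token directly after the last residue.
    """
    per_sequence = []
    for states, mask in zip(hidden_states, attention_mask):
        n_tokens = sum(int(m) for m in mask)  # number of non-padding tokens
        n_residues = n_tokens - 1 if drop_eos else n_tokens
        per_sequence.append(states[:n_residues])
    return per_sequence


# Toy batch: 2 sequences padded to 4 tokens, embedding size 2.
toy_states = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]],  # 2 residues + EOS + 1 pad
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5], [1.6, 1.7]],  # 3 residues + EOS
]
toy_mask = [[1, 1, 1, 0], [1, 1, 1, 1]]

residue_embeddings = strip_padding(toy_states, toy_mask)
print([len(block) for block in residue_embeddings])  # [2, 3]
```

Under the assumptions above, the same call with the real outputs would be `strip_padding(embeddings.last_hidden_state, outputs['attention_mask'])`.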

Loading downstream models example:

  # To use downstream model for binary classification:
  binary_classification_model = ankh.ConvBERTForBinaryClassification(
    input_dim=768, 
    nhead=4, 
    hidden_dim=384, 
    num_hidden_layers=1, 
    num_layers=1, 
    kernel_size=7, 
    dropout=0.2, 
    pooling='max',
  )

  # To use downstream model for multiclass classification:
  multiclass_classification_model = ankh.ConvBERTForMultiClassClassification(
    num_tokens=2, 
    input_dim=768, 
    nhead=4, 
    hidden_dim=384, 
    num_hidden_layers=1, 
    num_layers=1, 
    kernel_size=7, 
    dropout=0.2,
  )

  # To use downstream model for regression:
  # training_labels_mean is an optional parameter used to initialize the output layer's bias,
  # which is useful for faster convergence.
  regression_model = ankh.ConvBERTForRegression(
    input_dim=768, 
    nhead=4, 
    hidden_dim=384, 
    num_hidden_layers=1, 
    num_layers=1, 
    kernel_size=7, 
    dropout=0, 
    pooling='max', 
    training_labels_mean=0.38145,
  )
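The reason initializing the output bias with the label mean can speed up convergence is easiest to see on a toy case. The sketch below assumes the regression head's weights start close to zero, so its first predictions are roughly equal to the bias. The labels are made up for illustration and are not taken from any Ankh dataset.

```python
def mse(constant_prediction, labels):
    """Mean squared error of predicting the same value for every label."""
    return sum((constant_prediction - y) ** 2 for y in labels) / len(labels)


toy_labels = [0.2, 0.4, 0.3, 0.6]  # made-up regression targets
labels_mean = sum(toy_labels) / len(toy_labels)

loss_zero_bias = mse(0.0, toy_labels)          # bias initialized to 0
loss_mean_bias = mse(labels_mean, toy_labels)  # bias initialized to the label mean

print(loss_zero_bias, loss_mean_bias)
```

The mean is the constant that minimizes the squared error, so training starts from a lower loss and the optimizer does not have to spend steps learning the offset.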

Calculating Likelihood

  import ankh

  sequence = "MDDADPEERNYDNMLKMLSDLNKDLEKLLEEMEKISVQATWMAYDMVVMRTNPTLAESMRRLEDAFVNCKEEMEKNWQELLHETKQRL"
  likelihood = ankh.compute_pseudo_likelihood(
    "ankh_base",
    sequence,
    device="cpu",
    shard_input=True,
    shard_batch_size=32,
    verbose=True,
  )
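For readers unfamiliar with the term, a pseudo-likelihood is commonly computed by masking one position at a time and summing the log-probability the model assigns to the true residue. The sketch below illustrates that general idea only. It is not a description of what `ankh.compute_pseudo_likelihood` does internally, and `uniform_log_prob` is a hypothetical stand-in for a real model.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def uniform_log_prob(masked_tokens, position, residue):
    """Toy stand-in for a model: every amino acid is equally likely."""
    return math.log(1.0 / len(AMINO_ACIDS))


def pseudo_log_likelihood(sequence, log_prob_fn, mask_token="<mask>"):
    """Sum of log p(residue_i | all other residues), masking one position at a time."""
    tokens = list(sequence)
    total = 0.0
    for i, residue in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        total += log_prob_fn(masked, i, residue)
    return total


toy_score = pseudo_log_likelihood("MDDADPEERN", uniform_log_prob)
print(round(toy_score, 3))
```

With a real model, `log_prob_fn` would run the masked sequence through the network and read off the log-probability of `residue` at `position`.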

Original downstream Predictions

Secondary Structure Prediction (Q3):

Model	CASP12	CASP14	TS115	CB513
Ankh3 XLarge (NLU)	84.40%	82.19%	-	-
Ankh3 XLarge (S2S)	83.76%	82.30%	-	-
Ankh3 Large (NLU)	78.03%	79.28%	-	-
Ankh3 Large (S2S)	75.49%	77.96%	-	-
Ankh 2 Large	84.18%	76.82%	88.59%	88.78%
Ankh Large	83.59%	77.48%	88.22%	88.48%
Ankh Base	80.81%	76.67%	86.92%	86.94%
ProtT5-XL-UniRef50	83.34%	75.09%	86.82%	86.64%
ESM2-15B	83.16%	76.56%	87.50%	87.35%
ESM2-3B	83.14%	76.75%	87.50%	87.44%
ESM2-650M	82.43%	76.97%	87.22%	87.18%
ESM-1b	79.45%	75.39%	85.02%	84.31%

Secondary Structure Prediction (Q8):

Model	CASP12	CASP14	TS115	CB513
Ankh3 XLarge (NLU)	72.53%	69.85%	-	-
Ankh3 XLarge (S2S)	72.25%	69.51%	-	-
Ankh3 Large (NLU)	65.29%	65.50%	-	-
Ankh3 Large (S2S)	62.74%	65.88%	-	-
Ankh 2 Large	72.90%	62.84%	79.88%	79.01%
Ankh Large	71.69%	63.17%	79.10%	78.45%
Ankh Base	68.85%	62.33%	77.08%	75.83%
ProtT5-XL-UniRef50	70.47%	59.71%	76.91%	74.81%
ESM2-15B	71.17%	61.81%	77.67%	75.88%
ESM2-3B	71.69%	61.52%	77.62%	75.95%
ESM2-650M	70.50%	62.10%	77.68%	75.89%
ESM-1b	66.02%	60.34%	73.82%	71.55%
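Q8 and Q3 in the two tables above are per-residue accuracies over eight and three secondary-structure classes. The sketch below shows how such scores can be computed for one protein. The 8-to-3 state mapping is a widely used reduction of DSSP classes and is an assumption here, since the README does not state which mapping was used. The label strings are made up.

```python
# Common DSSP 8-state to 3-state reduction (an assumption; conventions vary slightly).
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # strands
    "T": "C", "S": "C", "C": "C",  # everything else is treated as coil
}


def reduce_to_q3(labels_q8):
    """Map a string of 8-state labels to 3-state labels."""
    return "".join(DSSP8_TO_3[state] for state in labels_q8)


def per_residue_accuracy(predicted, target):
    """Fraction of positions where the predicted label equals the target label."""
    if len(predicted) != len(target) or not target:
        raise ValueError("predicted and target must be non-empty and equally long")
    correct = sum(p == t for p, t in zip(predicted, target))
    return correct / len(target)


target_q8 = "CHHHGTEEBS"     # made-up ground truth
predicted_q8 = "CHHHHTEEEC"  # made-up prediction

q8 = per_residue_accuracy(predicted_q8, target_q8)
q3 = per_residue_accuracy(reduce_to_q3(predicted_q8), reduce_to_q3(target_q8))
print(q8, q3)  # 0.7 1.0
```

The toy example also shows why Q3 is always at least as high as Q8 for the same prediction: confusions within one coarse class stop counting as errors.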

Contact Prediction Long-Range Precision Using Embeddings:

Model	ProteinNet (L/1)	ProteinNet (L/5)	CASP14 (L/1)	CASP14 (L/5)
Ankh 2 Large	In Progress	In Progress	In Progress	In Progress
Ankh Large	48.93%	73.49%	16.01%	29.91%
Ankh Base	43.21%	66.63%	13.50%	28.65%
ProtT5-XL-UniRef50	44.74%	68.95%	11.95%	24.45%
ESM2-15B	31.62%	52.97%	14.44%	26.61%
ESM2-3B	30.24%	51.34%	12.20%	21.91%
ESM2-650M	29.36%	50.74%	13.71%	22.25%
ESM-1b	29.25%	50.69%	10.18%	18.08%

Contact Prediction Long-Range Precision Using Attention Scores:

Model	ProteinNet (L/1)	ProteinNet (L/5)	CASP14 (L/1)	CASP14 (L/5)
Ankh 2 Large	In Progress	In Progress	In Progress	In Progress
Ankh Large	31.44%	55.58%	11.05%	20.74%
Ankh Base	25.93%	46.28%	9.32%	19.51%
ProtT5-XL-UniRef50	30.85%	51.90%	8.60%	16.09%
ESM2-15B	33.32%	57.44%	12.25%	24.60%
ESM2-3B	33.92%	56.63%	12.17%	21.36%
ESM2-650M	31.87%	54.63%	10.66%	21.01%
ESM-1b	25.30%	42.03%	7.77%	15.77%
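The L/1 and L/5 columns in the two contact tables report precision over the top L/k highest-scoring residue pairs, where L is the sequence length. The sketch below computes that metric on a toy score matrix. The minimum sequence separation of 24 residues for "long-range" pairs is a common convention and an assumption here, as the README does not state the threshold. The toy example uses a smaller separation so it fits in six residues.

```python
def long_range_precision(scores, contacts, k=1, min_separation=24):
    """Precision of the top L/k predicted contacts among long-range pairs.

    scores:   L x L nested list of predicted contact scores
    contacts: set of (i, j) pairs with i < j that are true contacts
    """
    length = len(scores)
    candidates = [
        (scores[i][j], i, j)
        for i in range(length)
        for j in range(i + min_separation, length)
    ]
    if not candidates:
        raise ValueError("sequence too short for the chosen min_separation")
    candidates.sort(key=lambda c: c[0], reverse=True)
    top = candidates[: max(1, length // k)]
    hits = sum((i, j) in contacts for _, i, j in top)
    return hits / len(top)


# Toy protein of length 6 with made-up scores and contacts.
toy_scores = [[0.0] * 6 for _ in range(6)]
toy_pairs = {(0, 5): 0.9, (1, 4): 0.8, (0, 3): 0.7, (2, 5): 0.6, (0, 4): 0.2, (1, 5): 0.1}
for (i, j), score in toy_pairs.items():
    toy_scores[i][j] = toy_scores[j][i] = score
toy_contacts = {(0, 5), (1, 4), (2, 5)}

p_l1 = long_range_precision(toy_scores, toy_contacts, k=1, min_separation=3)
p_l5 = long_range_precision(toy_scores, toy_contacts, k=5, min_separation=3)
print(p_l1, p_l5)  # 0.5 1.0
```

L/5 looks only at the most confident predictions, which is why it is higher than L/1 in every row of the tables.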

Localization (Q10):

Model	DeepLoc Dataset
Ankh 2 Large	82.57%
Ankh Large	83.01%
Ankh Base	81.38%
ProtT5-XL-UniRef50	82.95%
ESM2-15B	81.22%
ESM2-3B	81.22%
ESM2-650M	82.08%
ESM-1b	80.51%

Remote Homology:

Model	SCOPe (Fold)
Ankh 2 Large	62.09%
Ankh Large	61.01%
Ankh Base	61.14%
ProtT5-XL-UniRef50	59.38%
ESM2-15B	54.48%
ESM2-3B	59.24%
ESM2-650M	51.36%
ESM-1b	56.93%

Solubility:

Model	Solubility
Ankh 2 Large	75.86%
Ankh Large	76.41%
Ankh Base	76.36%
ProtT5-XL-UniRef50	76.26%
ESM2-15B	60.52%
ESM2-3B	74.91%
ESM2-650M	74.56%
ESM-1b	74.91%

Fluorescence (Spearman Correlation):

Model	Fluorescence
Ankh3 XLarge (NLU)	0.64
Ankh3 XLarge (S2S)	0.65
Ankh3 Large (NLU)	0.65
Ankh3 Large (S2S)	0.65
Ankh 2 Large	0.62
Ankh Large	0.62
Ankh Base	0.62
ProtT5-XL-UniRef50	0.61
ESM2-15B	0.56
ESM-1b	0.48
ESM2-650M	0.48
ESM2-3B	0.46
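Spearman correlation, the metric in the table above, is the Pearson correlation of the ranks of the predicted and measured values, so it rewards getting the ordering of variants right rather than the exact values. A minimal pure-Python sketch follows; the measurements are made up, and in practice `scipy.stats.spearmanr` would normally be used instead.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


measured = [3.7, 1.2, 3.1, 0.4, 2.5]   # made-up fluorescence values
predicted = [3.0, 1.5, 3.4, 0.1, 2.0]  # made-up model outputs
rho = spearman(predicted, measured)
print(round(rho, 2))  # 0.9
```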

Nearest Neighbor Search using Global Pooling:

Model	Lookup69K (C)	Lookup69K (A)	Lookup69K (T)	Lookup69K (H)
Ankh 2 Large	In Progress	In Progress	In Progress	In Progress
Ankh Large	0.83	0.72	0.60	0.70
Ankh Base	0.85	0.77	0.63	0.72
ProtT5-XL-UniRef50	0.83	0.69	0.57	0.73
ESM2-15B	0.78	0.63	0.52	0.67
ESM2-3B	0.79	0.65	0.53	0.64
ESM2-650M	0.72	0.56	0.40	0.53
ESM-1b	0.78	0.65	0.51	0.63
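Nearest-neighbor search with global pooling reduces each protein to a single vector and then transfers the label of the closest lookup protein to the query. The sketch below uses mean pooling and cosine similarity; both are assumptions, since the README says only "Global Pooling" and names neither the pooling operator nor the distance. The labels and vectors are made up.

```python
import math


def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-size protein vector."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(row[d] for row in residue_embeddings) / n for d in range(dim)]


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_label(query, lookup):
    """Return the label of the lookup entry most similar to the query vector."""
    best_label, _ = max(lookup, key=lambda item: cosine(query, item[1]))
    return best_label


# Made-up lookup set: (label, pooled embedding) pairs.
lookup = [
    ("alpha", mean_pool([[1.0, 0.0], [0.8, 0.2]])),
    ("beta", mean_pool([[0.0, 1.0], [0.2, 0.8]])),
]
query = mean_pool([[0.9, 0.1], [0.7, 0.1]])
print(nearest_label(query, lookup))  # alpha
```

For a lookup set of tens of thousands of proteins, the same logic would usually be run as a single matrix multiplication rather than a Python loop.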

Team

Technical University of Munich:

Ahmed Elnaggar	Burkhard Rost

Proteinea:

Hazem Essam	Wafaa Ashraf	Walid Moustafa	Mohamed Elkerdawy

Columbia University:

Charlotte Rochereau

License

Ankh pretrained models are released under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

Community and Contributions

The Ankh project is an open-source project supported by various partner companies and research institutions. We are committed to sharing all our pre-trained models and knowledge. We would be more than happy if you could help us by sharing new pre-trained models, fixing bugs, proposing new features, improving our documentation, spreading the word, or supporting our project.

Have a question?

We are happy to hear your questions on the Ankh issues page! If you have a private question or want to cooperate with us, you can always reach out to us directly via Hello.

Found a bug?

Feel free to file a new issue with a descriptive title and description on the Ankh repository. If you have already found a solution to your problem, we would love to review your pull request!

✏️ Citation

If you use this code or our pretrained models for your publication, please cite the original paper:

@article{elnaggar2023ankh,
  title={Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling},
  author={Elnaggar, Ahmed and Essam, Hazem and Salah-Eldin, Wafaa and Moustafa, Walid and Elkerdawy, Mohamed and Rochereau, Charlotte and Rost, Burkhard},
  journal={arXiv preprint arXiv:2301.06568},
  year={2023}
}

Models Availability

Model	Ankh	Huggingface
Ankh Large	`ankh.load_large_model()`	Ankh Large
Ankh Base	`ankh.load_base_model()`	Ankh Base
Ankh3 Large	`ankh.load_ankh3_large()`	Ankh3 Large
Ankh3 XL	`ankh.load_ankh3_xl()`	Ankh3 XL
