Dna2vec is an open-source library to train distributed representations of variable-length k-mers.
For more information, please refer to the paper: dna2vec: Consistent vector representations of variable-length k-mers
Note that this implementation has only been tested on Python 3.5.3, but we welcome any contributions or bug reports to make it more accessible.
- Clone the dna2vec repository:
  git clone https://github.com/pnpnpn/dna2vec
- Install Python dependencies:
  pip3 install -r requirements.txt
- Test the installation:
  python3 ./scripts/train_dna2vec.py -c configs/small_example.yml
- Download hg38 from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz. This will take a while as it's 938MB.
- Untar with tar -zxvf hg38.chromFa.tar.gz. You should see FASTA files for chromosome 1 to 22: chr1.fa, chr2.fa, ..., chr22.fa.
- Move the 22 FASTA files to folder inputs/hg38/
- Start the training with:
  python3 ./scripts/train_dna2vec.py -c configs/hg38-20161219-0153.yml
- Wait for a couple of days ...
- Once the training is done, there should be a dna2vec-<ID>.w2v and a corresponding dna2vec-<ID>.txt file in your results/ directory.
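Because training runs for days, it can be worth confirming that all 22 FASTA files are in place before starting. The following is a minimal sketch of such a check; the helper names `expected_chromosome_files` and `missing_fasta_files` are hypothetical and not part of dna2vec, and the sketch assumes it is run from the repository root so that `inputs/hg38` resolves correctly.

```python
import os


def expected_chromosome_files(first=1, last=22):
    """Return the FASTA file names chr<first>.fa ... chr<last>.fa."""
    return ["chr{}.fa".format(i) for i in range(first, last + 1)]


def missing_fasta_files(directory):
    """Return the expected FASTA files that are not present in `directory`.

    A directory that does not exist is treated as empty, so every
    expected file is reported as missing.
    """
    present = set(os.listdir(directory)) if os.path.isdir(directory) else set()
    return [name for name in expected_chromosome_files() if name not in present]


if __name__ == "__main__":
    # Hypothetical pre-flight check; the path matches the folder named above.
    missing = missing_fasta_files("inputs/hg38")
    if missing:
        print("Missing FASTA files:", ", ".join(missing))
    else:
        print("All 22 FASTA files are in place.")
```

The check only looks at file names; it does not validate the FASTA contents.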
You can read the pretrained dna2vec vectors in pretrained/dna2vec-*.w2v using
the class MultiKModel in dna2vec/multi_k_model.py. For example:
from dna2vec.multi_k_model import MultiKModel
filepath = 'pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)
You can fetch the vector representation of AAA with:
>>> mk_model.vector('AAA')
array([ 0.023137 , 0.156295 , ...
Compute the cosine distance between two k-mers via dna2vec:
>>> mk_model.cosine_distance('AAA', 'GCT')
0.14546435594464155
>>> mk_model.cosine_distance('AAA', 'AAAA')
0.89000147450211231
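For readers who want to see what a cosine comparison between two k-mer vectors involves, here is a minimal standard-library sketch. It is an illustration, not the code inside MultiKModel: the function names `cosine_similarity` and `cosine_dissimilarity` are hypothetical, and this README does not say whether `mk_model.cosine_distance` reports the similarity itself or one minus the similarity, so check dna2vec/multi_k_model.py before comparing numbers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    u = list(u)
    v = list(v)
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_dissimilarity(u, v):
    """One minus the cosine similarity (0 for vectors pointing the same way)."""
    return 1.0 - cosine_similarity(u, v)
```

With a loaded model, the vectors could be passed in directly, for example cosine_similarity(mk_model.vector('AAA'), mk_model.vector('GCT')). Both vectors must have the same dimensionality.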
The pretrained data should cover all k-mers for 3 ≤ k ≤ 8:
>>> [len(mk_model.model(k).vocab) for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
>>> [4**k for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
I would love for you to fork this project and send me a pull request. Please contribute.
This software is licensed under the MIT license.