@dhimmel dhimmel commented Oct 5, 2016

Closes #23

This downloads the latest Entrez Gene information from the NCBI FTP site (updated daily). Obsoleted genes have missing values for the columns that come from Entrez Gene. It is unclear how we want to proceed with making backing/django-genes and cancer-data use the same gene data.

dhimmel commented Oct 5, 2016

Here's the head of genes.tsv, which would become part of the Cognoma data release:

| entrez_gene_id | symbol | description | chromosome | gene_type | synonyms | aliases | n_mutations | mutation_frequency | mean_expression | mutation | expression |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A1BG | alpha-1-B glycoprotein | 19 | protein-coding | A1B\|ABG\|GAB\|HYST2477 | alpha-1B-glycoprotein\|HEL-S-163pA\|epididymis secretory sperm binding protein Li 163pA | | | | | |
| 2 | A2M | alpha-2-macroglobulin | 12 | protein-coding | A2MD\|CPAMD5\|FWP007\|S863-7 | alpha-2-macroglobulin\|C3 and PZP-like alpha-2-macroglobulin domain-containing protein 5\|alpha-2-M | | | | | |
| 3 | A2MP1 | alpha-2-macroglobulin pseudogene 1 | 12 | pseudo | A2MP | pregnancy-zone protein pseudogene | 4 | 0.0005475 | | 1 | 0 |
| 9 | NAT1 | N-acetyltransferase 1 | 8 | protein-coding | AAC1\|MNAT\|NAT-1\|NATI | arylamine N-acetyltransferase 1\|N-acetyltransferase 1 (arylamine N-acetyltransferase)\|N-acetyltransferase type 1 | | | | | |
| 10 | NAT2 | N-acetyltransferase 2 | 8 | protein-coding | AAC2\|NAT-2\|PNAT | arylamine N-acetyltransferase 2\|N-acetyltransferase 2 (arylamine N-acetyltransferase)\|N-acetyltransferase type 2\|arylamide acetylase 2 | | | | | |

entrez_gene_id was a float due to odd behavior by df.merge in pandas. This
resulted in the `float_format='%.4g'` argument of to_csv applying exponent
formatting to entrez_gene_id, irreversibly corrupting the IDs.
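
The float problem described in this commit message can be reproduced on a toy example. The sketch below is not the notebook code from this pull request. It assumes the float dtype came from missing values introduced by a left merge, which is one common way pandas upcasts an integer column. The table contents, gene symbols, and the large ID are hypothetical, and the nullable `Int64` dtype used for the fix is just one option (it requires a pandas version newer than those available in 2016).

```python
import pandas as pd

# Hypothetical toy tables; symbols and the large ID are made up for illustration.
info = pd.DataFrame({
    "symbol": ["GENE_A", "GENE_B"],
    "entrez_gene_id": [1, 123456789],
})
stats = pd.DataFrame({
    "symbol": ["GENE_A", "GENE_B", "GENE_OLD"],
    "n_mutations": [7, 4, 2],
})

# GENE_OLD has no match, so its ID is NaN and the whole column is upcast to float64.
merged = stats.merge(info, on="symbol", how="left")
float_dtype = str(merged["entrez_gene_id"].dtype)

# With '%.4g', the float ID 123456789.0 keeps only four significant digits.
corrupted_tsv = merged.to_csv(sep="\t", index=False, float_format="%.4g")

# One possible fix: a nullable integer dtype, which float_format does not touch.
merged["entrez_gene_id"] = merged["entrez_gene_id"].astype("Int64")
fixed_tsv = merged.to_csv(sep="\t", index=False, float_format="%.4g")

print(corrupted_tsv)
print(fixed_tsv)
```

In the first output the large ID is written as `1.235e+08`, which cannot be mapped back to the original integer. In the second it is written in full, and the unmatched row gets an empty field.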

dhimmel commented Oct 7, 2016

Do not review yet --- will update in the wake of cognoma/genes#1.

dhimmel added a commit to dhimmel/cancer-data that referenced this pull request Oct 7, 2016
`0.genes-download.ipynb` is a notebook to download datasets from
`cognoma/genes`. Update `2.TCGA-process.ipynb` to use the gene mapping
guidelines in cognoma/genes#1. Remove `mapping/PANCAN-mutation/` since this
mapping is now done in `2.TCGA-process.ipynb`.

Closes cognoma#23. Closes cognoma#30 by exporting gene info files in `2.TCGA-process.ipynb`
@dhimmel dhimmel closed this in #32 Oct 10, 2016
dhimmel added a commit that referenced this pull request Oct 10, 2016
* Outsource Entrez Gene logic to cognoma/genes

`0.genes-download.ipynb` is a notebook to download datasets from
`cognoma/genes`. Update `2.TCGA-process.ipynb` to use the gene mapping
guidelines in cognoma/genes#1. Remove `mapping/PANCAN-mutation/` since this
mapping is now done in `2.TCGA-process.ipynb`.

Closes #23. Closes #30 by exporting gene info files in `2.TCGA-process.ipynb`

* Average expression values for the same gene

* Update cognoma/genes download location
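
The "Average expression values for the same gene" step above can be sketched as follows. This is a minimal illustration, not the code from `2.TCGA-process.ipynb`. It assumes an expression matrix with samples as rows and Entrez Gene IDs as columns, where remapping has left two columns with the same ID. The sample names, IDs, and values are hypothetical.

```python
import pandas as pd

# Hypothetical expression matrix: two columns map to the same gene ID (10).
expr_df = pd.DataFrame(
    [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]],
    index=["sample_1", "sample_2"],
    columns=[10, 10, 20],
)

# Transpose so genes are rows, average rows sharing an ID, then transpose back.
averaged = expr_df.T.groupby(level=0).mean().T

print(averaged)
```

After this step each gene ID appears in exactly one column, so the matrix can be joined against a per-gene table without ambiguity.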