CancerMine

CancerMine is a literature-mined database of drivers, oncogenes and tumor suppressors in cancer. It is a resource for cancer researchers and clinicians who want to understand the genetic underpinnings of different cancer types. The largest bottleneck in precision oncology is interpreting the many mutations found in individual patient tumors, so knowledge of the role of the affected genes in cancer is essential. This resource provides information on genes that are drivers (frequently harbor cancer-promoting mutations), oncogenes (cancer-promoting) and tumor suppressors (protective against cancer) in a large number of different cancers. Context is key, as some genes (e.g. NOTCH1) are oncogenes in one cancer and tumor suppressive in another. All data has been text-mined from articles, and the source sentences and links are provided.

Download: The complete dataset is stored in the Zenodo repository in perpetuity and will be updated regularly. Alternatively, browse the dataset using the tabs above and use the Download buttons to download the data you need.

The charts below show the most frequently cited genes and cancer types. Click on one of the genes or cancers in the bar chart or use the tabs above to navigate the dataset. Use the Help tab to see more detailed information.

This knowledge base is automatically extracted using text mining tools. We use a supervised learning approach in which a set of experts annotated a large set of sentences to be used as examples for a machine learning system. That system is Kindred, a Python package for relation extraction. It is the successor to our VERSE system, which won part of the BioNLP'16 Shared Task for relation extraction. The trained system is then applied to all abstracts in PubMed and all full-text papers in the PubMed Central Open Access subset.

The main researcher is Jake Lever, formerly a PhD student in the Jones group at Canada's Michael Smith Genome Sciences Centre in Vancouver, Canada and now a lecturer at the University of Glasgow. The full CancerMine team is Jake Lever, Eric Zhao, Jasleen Grewal, Martin Jones and Steven Jones.

Disclaimer: All information has been text-mined automatically. Therefore, you should evaluate the original papers that CancerMine cites before making any interpretation.

To the extent possible under law, Jake Lever has waived all copyright and related or neighboring rights to the CancerMine dataset.

Gene


Citation Table:
Select row in table above to see associated citations and sentences

Last updated on 2023-03-01


Cancer


Citation Table:
Select row in table above to see associated citations and sentences



Use this to quickly check a list of genes against the cancer associations in CancerMine. Type or paste a list of HUGO gene names into the box below (one per line). The table below will update to show an ordered list of matching genes in CancerMine, with the most frequently cited at the top. The colored bars show the number of citations supporting each gene as a driver, oncogene or tumor suppressor. Click Download to get this data or click on the gene names to investigate further.
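The same check can be approximated offline against a downloaded copy of the data. The following is a minimal sketch, not the viewer's actual code: the row layout (gene, role, cancer, citation count), the role labels and every count in `ROWS` are hypothetical and for illustration only.

```python
from collections import defaultdict

# Hypothetical rows in the shape (gene, role, cancer, citation_count).
# The values are illustrative only, not real CancerMine counts.
ROWS = [
    ("ERBB2", "Oncogene", "breast cancer", 12),
    ("ERBB2", "Driver", "breast cancer", 3),
    ("TP53", "Tumor_Suppressor", "colorectal cancer", 20),
    ("TP53", "Driver", "lung cancer", 5),
    ("NOTCH1", "Oncogene", "leukemia", 7),
]


def check_genes(gene_text, rows):
    """Return [(gene, {role: citations})] for the listed genes, most cited first."""
    # One gene name per line; blank lines are ignored and case is normalized.
    wanted = {line.strip().upper() for line in gene_text.splitlines() if line.strip()}
    per_gene = defaultdict(lambda: defaultdict(int))
    for gene, role, _cancer, count in rows:
        if gene in wanted:
            per_gene[gene][role] += count
    # Order genes by their total citation count across all roles.
    ranked = sorted(per_gene.items(), key=lambda kv: sum(kv[1].values()), reverse=True)
    return [(gene, dict(roles)) for gene, roles in ranked]


result = check_genes("erbb2\nTP53\nNOTAGENE\n", ROWS)
```

Names that are not present in the data (here `NOTAGENE`) are simply absent from the result, which mirrors how the table only lists matching genes.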



CancerMine Profiles

This plot shows the importance of strongly associated gene roles for the selected cancer types. The brighter the square, the more important that gene role is for that cancer type. It clusters similar cancer types together. You can change which cancer types to cluster by changing the choices below. For a summary of how the gene importance is calculated, see the FAQs on the help page.

Cancers to Cluster

Usage

Some basic tips on how to use CancerMine

You can search by gene or cancer type using the tabs at the top and the dropdown box on the left. Type in part of the gene or cancer type and it will give auto-completion suggestions.
You can then click on a row in the table or a segment in the bar chart to bring up the associated citations in the table at the bottom. This table includes the PubMed ID with a link, journal information, the section within the publication (title/abstract/article) and the actual sentence.
The gene names are from HUGO and the cancer types are from the Disease Ontology.
The system tries to normalize gene/cancer names to those ontologies (e.g. HER2 -> ERBB2 and ESCC -> esophagus squamous cell carcinoma).
The system tries to understand the meaning of the sentence. We've tuned it for high precision so it should produce as few false positives as possible, but there will still be some mistakes.
The dataset is updated once a month (roughly on the 1st) with the latest publications.
You can download the entire dataset or subsets using the Download buttons. You can also download the underlying data at Zenodo.
The underlying code is available on GitHub with explanations of the associated data.
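
The name normalization mentioned in the tips above can be pictured as a synonym lookup. This is only a sketch of the idea, not how CancerMine implements it: the `normalize` helper and both tables are hypothetical, and only the two example mappings (HER2 and ESCC) come from this page.

```python
# Hypothetical synonym tables. Only the HER2 and ESCC mappings are taken from
# the tips above; a real system would load full ontology synonym lists.
GENE_SYNONYMS = {"her2": "ERBB2", "erbb2": "ERBB2"}
CANCER_SYNONYMS = {
    "escc": "esophagus squamous cell carcinoma",
    "esophagus squamous cell carcinoma": "esophagus squamous cell carcinoma",
}


def normalize(term, synonyms):
    """Return the standard name for term, or None if it is not recognised."""
    return synonyms.get(term.strip().lower())


gene = normalize("HER2", GENE_SYNONYMS)
cancer = normalize("ESCC", CANCER_SYNONYMS)
```

A term that is missing from the table returns `None`, which corresponds to a name that would not autocomplete in the search box.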

FAQs

Q: How can I contact the researchers?

A: Please email Jake Lever at jake.lever@glasgow.ac.uk.

Q: Why the focus on drivers, oncogenes and tumor suppressors?

A: When interpreting precision medicine data for an individual patient, it is incredibly important to understand the role of different genes in which somatic mutations occur.

Q: When I type in a gene or cancer, it isn't autocompleted. Why is that?

A: If a gene or cancer doesn't come up, it means that it isn't in CancerMine. Check that the gene/cancer name is the standard name (e.g. ERBB2 not HER2) in the associated ontology (HUGO and the Disease Ontology).

Q: What text is being mined?

A: We are processing the entirety of PubMed (~22 million abstracts) and the PubMed Central Open Access subset (~1 million full-text articles).

Q: Are you just using co-occurrences?

A: No. Co-occurrences count how many times a gene and a cancer-type appear in the same sentence. We use the Kindred relation extraction package to understand the context of the sentence and only extract sentences that discuss a driver, oncogene or tumor suppressor with high likelihood.

Q: How regularly is this updated?

A: CancerMine is updated monthly (around the 1st of the month). It uses the PubRunner package to manage the updates and to minimise the additional computation needed for new abstracts and papers in PubMed and the PubMed Central Open Access subset.

Q: I found a sentence that doesn't say what CancerMine describes. Why?

A: CancerMine is completely automated with no human curation, so there will be some mistakes. We've tuned the system for precision to make as few false positive mistakes as possible (with a tradeoff of likely more false negatives). All information offered by CancerMine should be interpreted accordingly.

Q: Could this approach be used to extract other types of biomedical knowledge?

A: It probably could. If you'd like to talk, please get in contact with the main researcher, Jake Lever, by creating a GitHub issue. Also check out the code in the GitHub repository to find out more about the project.

Q: Where can I get the raw data to use?

A: The data can be downloaded by clicking on the Download buttons in the app. The complete set is accessible through Zenodo. It is in TSV format with headers and is licensed under CC0. If you use this data, we'd appreciate it if you would cite the paper.
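
Because the files are TSV with a header row, they can be read with Python's standard `csv` module. The snippet below is a minimal sketch: the column names and rows in `SAMPLE_TSV` are assumptions made for illustration, so check the header line of the file you download before relying on them.

```python
import csv
import io

# Stand-in for a downloaded file. The column names and values are assumed
# for illustration; the real file's header may differ.
SAMPLE_TSV = (
    "gene\trole\tcancer\tcitation_count\n"
    "ERBB2\tOncogene\tbreast cancer\t12\n"
    "TP53\tTumor_Suppressor\tcolorectal cancer\t20\n"
)


def read_tsv(handle):
    """Parse a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_tsv(io.StringIO(SAMPLE_TSV))
suppressors = [row["gene"] for row in rows if row["role"] == "Tumor_Suppressor"]
```

For a real download, pass an open file instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`. All values are read as strings, so convert counts with `int()` before sorting or summing.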

Q: Where can I find the code used to construct this knowledge base?

A: All the code is available in our GitHub repository. The project uses various Python libraries for the text mining (e.g. Kindred and PubRunner). The viewer is built using Shiny, and its code is also in the repository.

Q: How are the importance measures calculated for the CancerMine profiles used in the clustering heatmap?

A: The importance of each gene role is measured by the number of papers in which it is discussed with the corresponding cancer type. Hence, if RUNX3 is mentioned in 80 papers as a tumor suppressor in stomach cancer, the base importance value for this gene role in stomach cancer is 80. The base value is then log10-transformed and divided by the log10-transformed value of the most important role for that cancer type (in order to normalize for the number of papers on that cancer). The most important gene role will then have an importance score of 1.0 and all others will have a score below that.
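
The calculation above can be sketched in a few lines. This is our reading of the description rather than the exact viewer code: we assume the division is by the log10 value of the top role (which is what gives it a score of 1.0), the `importance_scores` name is hypothetical, and only the RUNX3 count of 80 comes from the text. `GENE_A` and `GENE_B` are made-up placeholders.

```python
import math


def importance_scores(paper_counts):
    """Map each gene role to log10(count) / log10(max count) for one cancer type."""
    logs = {role: math.log10(n) for role, n in paper_counts.items() if n > 0}
    if not logs:
        return {}
    top = max(logs.values())
    if top == 0:  # every role has exactly one paper, so they all tie
        return {role: 1.0 for role in logs}
    return {role: value / top for role, value in logs.items()}


# Only the RUNX3 count of 80 comes from the text; the other rows are made up.
stomach = {
    ("RUNX3", "Tumor_Suppressor"): 80,
    ("GENE_A", "Oncogene"): 20,
    ("GENE_B", "Driver"): 5,
}
scores = importance_scores(stomach)
```

With these inputs the RUNX3 role scores 1.0 and the other two roles score below it, in proportion to the log of their paper counts.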

Q: How can I cite this work?

A: Please cite the Nature Methods paper. The preprint is still available on bioRxiv.

Q: Where can I report a bug?

A: Please email Jake Lever or create an issue on the CancerMine GitHub repo.
