Text-mined cancer biomarkers for curation into the CIViC database.

Disclaimer: All information has been text-mined automatically. Therefore, you should evaluate the original papers that CIViCmine cites before making any interpretation.


CIViCmine updated on 2023-03-01. Comparing with CIViC updated on 2023-01-09.


Some basic tips on how to use CIViCmine
  • You can filter the results using the panel on the left side of the Browse page. This panel allows you to filter by evidence type, gene, cancer type, drug name, variant type and whether the biomarker already exists in the CIViC database.
  • To deselect a gene, cancer type or drug, click the dropdown box, press Backspace and click away or press Escape. Do not press Enter. Unfortunately you cannot select an empty option. This is a reported issue with Shiny.
  • You can then click on a row in the table to bring up the associated citations in the table at the bottom. This table includes the PubMed ID with link, journal information, section within the publication (title/abstract/article) and the actual sentence.
  • You can also click on a gene, cancer type or drug shown in the pie chart to jump straight to biomarkers involving your selection. Your choice will be shown in the dropdown boxes on the left.
  • The matching with CIViC only takes into account the evidence type, gene, cancer and drug (if applicable).
  • The gene names are from HUGO, the cancer types are from the Disease Ontology and the drug names are from WikiData.
  • The system tries to normalize gene, cancer and drug names to those ontologies (e.g. HER2 -> ERBB2, ESCC -> esophagus squamous cell carcinoma and AZD9291 -> osimertinib).
  • Sometimes it can't resolve a name to one specific entry, so it lists all the possibilities separated by a semicolon. For example, p75 gets mapped to "CUX1;HCLS1;PSIP1;SIGLEC7;TNFRSF1B" as it is very ambiguous.
  • It also tries to detect possible fusions or combinations. These are separated by the pipe '|', so you may see 'BCR|ABL1'.
  • The system tries to understand the meaning of the sentence. We've dialed up the precision so it should make as few false positives as possible, but there will be a few mistakes.
  • The table of citations includes a link to the PubMed citation, the journal and publication year, the section of the paper (title, abstract or article) and the associated sentence.
  • CIViCmine is updated once a month (roughly on the 1st) with the latest publications.
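
The separator conventions and the CIViC matching rule described in the tips above can be sketched in a few lines of Python. This is only an illustration and not CIViCmine's own code: the function names are hypothetical, the nesting of '|' (outer) and ';' (inner) is an assumption, and so is the case-insensitive comparison used for the matching key.

```python
from typing import List, Tuple


def parse_normalized_name(value: str) -> List[List[str]]:
    """Split a normalized name into components and candidate names.

    Assumption: '|' is the outer separator (parts of a fusion or
    combination) and ';' is the inner separator (ambiguous candidates
    for a single part).
    """
    components = []
    for part in value.split("|"):
        candidates = [name.strip() for name in part.split(";") if name.strip()]
        if candidates:
            components.append(candidates)
    return components


def is_ambiguous(value: str) -> bool:
    """Return True if any component has more than one candidate name."""
    return any(len(candidates) > 1 for candidates in parse_normalized_name(value))


def match_key(evidence_type: str, gene: str, cancer: str, drug: str = "") -> Tuple[str, ...]:
    """Build a comparison key from the four fields used for CIViC matching.

    Assumption: fields are compared case-insensitively after trimming.
    An empty drug stands for "not applicable".
    """
    return tuple(field.strip().lower() for field in (evidence_type, gene, cancer, drug))


print(parse_normalized_name("BCR|ABL1"))
print(parse_normalized_name("CUX1;HCLS1;PSIP1;SIGLEC7;TNFRSF1B"))
print(is_ambiguous("CUX1;HCLS1;PSIP1;SIGLEC7;TNFRSF1B"))
```

With this sketch, 'BCR|ABL1' yields two components with one candidate each, while the p75 mapping yields a single component with five candidates and is flagged as ambiguous.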


Q: When I type in a gene, cancer or drug, it isn't autocompleted. Why is that?

A: If a gene, cancer or drug doesn't come up, it means that it isn't in CIViCmine. Check that you are using the standard name (e.g. ERBB2 not HER2) from the associated ontology: HUGO for genes, the Disease Ontology for cancer types and WikiData for drug names.

Q: What text is being mined?

A: We are processing the entirety of PubMed (~22 million abstracts) and the PubMed Central Open Access Subset (~1 million full text articles).

Q: Are you just using co-occurrences?

A: No. Co-occurrences count how many times a gene and a cancer-type appear in the same sentence. We use the Kindred relation extraction package to understand the context of the sentence and only extract sentences that discuss a cancer biomarker with high likelihood.
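
For contrast, a plain co-occurrence count looks roughly like the sketch below. It is not part of CIViCmine, and the sentences, term lists and function names are made up for illustration. Note that the second example sentence is counted even though it reports no association, which is the kind of case that relation extraction is meant to filter out.

```python
import re
from collections import Counter
from typing import Iterable, List


def mentions(term: str, sentence: str) -> bool:
    """Case-insensitive whole-word (or whole-phrase) match of term in sentence."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


def count_cooccurrences(sentences: Iterable[str], genes: List[str], cancers: List[str]) -> Counter:
    """Count sentences in which a gene and a cancer type both appear."""
    counts = Counter()
    for sentence in sentences:
        for gene in genes:
            if not mentions(gene, sentence):
                continue
            for cancer in cancers:
                if mentions(cancer, sentence):
                    counts[(gene, cancer)] += 1
    return counts


# Illustrative sentences only; they are not taken from CIViCmine.
example_sentences = [
    "EGFR mutations predict response to gefitinib in lung cancer.",
    "EGFR expression was not associated with outcome in lung cancer.",
    "BRAF V600E is frequently observed in melanoma.",
]

cooccurrence_counts = count_cooccurrences(
    example_sentences, genes=["EGFR", "BRAF"], cancers=["lung cancer", "melanoma"]
)
for pair, count in sorted(cooccurrence_counts.items()):
    print(pair, count)
```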

Q: How regularly is this updated?

A: CIViCmine is updated monthly (around the 1st of the month). It uses the PubRunner package to simplify updates and minimise the additional computation needed for new abstracts and papers in PubMed and the PubMed Central Open Access Subset.

Q: I found a sentence that doesn't say what CIViCmine describes. Why?

A: CIViCmine is completely automated with no human curation, so there will be some mistakes. We've tuned the system for high precision so that it makes as few false positive mistakes as possible, with the tradeoff of likely more false negatives. All information offered by CIViCmine should be interpreted accordingly.
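
The tradeoff between false positives and false negatives can be illustrated with a small sketch. It assumes the classifier gives each candidate sentence a confidence score and that only sentences at or above a threshold are kept. That mechanism, the scores and the labels below are all assumptions chosen for illustration, not CIViCmine's actual settings or results.

```python
from typing import Sequence, Tuple


def precision_recall(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Tuple[float, float]:
    """Compute precision and recall when keeping items with score >= threshold.

    labels use 1 for a true biomarker sentence and 0 otherwise.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    # By convention here, precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up scores and labels for six candidate sentences.
toy_scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30]
toy_labels = [1, 1, 0, 1, 0, 1]

for threshold in (0.5, 0.85):
    precision, recall = precision_recall(toy_scores, toy_labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.85 removes both false positives but also drops one true sentence, so precision rises while recall falls.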

Q: Could this approach be used to extract other types of biomedical knowledge?

A: It probably could. If you'd like to talk, please get in contact with Jake Lever.

Q: Can I bulk download the data?

A: You can bulk download it from our Zenodo repository or use the download buttons on this website to get a subset of the data. All data is Creative Commons Zero licensed.

Q: Where can I read more about the methods and full results?

A: This work has been published in Genome Medicine. Please cite this paper if you make use of the data.


This resource is built as an aid to curation of the CIViC database based at the McDonnell Genome Institute at Washington University in St Louis. All data in this resource has been automatically extracted using text mining tools.

We use a supervised learning approach in which a set of experts annotated a large set of sentences to be used as examples for a machine learning system. This machine learning system is Kindred, a Python package for relation extraction. It is the successor to our VERSE system that won part of the BioNLP'16 Shared Task for relation extraction. The trained system is then applied to all abstracts in PubMed and all full text papers in the PubMed Central Open Access Subset.

The main researcher is Jake Lever, previously a PhD student in the Jones group at Canada's Michael Smith Genome Sciences Centre in Vancouver, Canada. He is now a lecturer at the University of Glasgow.