Disclaimer: All information has been text-mined automatically. Therefore, you should evaluate the original papers that CIViCmine cites before making any interpretation.
Q: When I type in a gene, cancer or drug, it isn't autocompleted. Why is that?
A: If a gene, cancer or drug doesn't come up, it isn't in CIViCmine. Check that you are using the standard name from the associated resource (e.g. ERBB2, not HER2): HUGO for genes, the Disease Ontology for cancers and WikiData for drugs.
Q: What text is being mined?
A: We process all of PubMed (~22 million abstracts) and the PubMed Central Open Access Subset (~1 million full-text articles).
Q: Are you just using co-occurrences?
A: No. Co-occurrences count how many times a gene and a cancer type appear in the same sentence. We use the Kindred relation extraction package to understand the context of the sentence and only extract sentences that discuss a cancer biomarker with high likelihood.
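To illustrate why plain co-occurrence counting is not enough, here is a minimal Python sketch of that baseline. It is not part of CIViCmine or Kindred; the function name `count_cooccurrences` and the two example sentences are invented for illustration.

```python
import re
from collections import Counter

def count_cooccurrences(sentences, genes, cancers):
    """Count sentences in which a gene and a cancer type both appear."""
    counts = Counter()
    for sentence in sentences:
        # Crude tokenisation; gene symbols are matched case-sensitively
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        lowered = sentence.lower()
        for gene in genes:
            if gene not in tokens:
                continue
            for cancer in cancers:
                # Cancer names are matched as lower-case substrings
                if cancer in lowered:
                    counts[(gene, cancer)] += 1
    return counts

# Toy sentences, invented for illustration only
sentences = [
    "ERBB2 amplification predicts response to trastuzumab in breast cancer.",
    "We did not study ERBB2 in breast cancer in this work.",
]
counts = count_cooccurrences(sentences, ["ERBB2"], ["breast cancer"])
print(counts)
```

Both sentences are counted for the ERBB2 and breast cancer pair, even though only the first one describes a biomarker. A relation extraction model is meant to tell these two cases apart.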
Q: How regularly is this updated?
A: CIViCmine is updated monthly (around the 1st of the month). It uses the PubRunner package to simplify updates and to minimise the additional computation needed for new abstracts and papers in PubMed and the PubMed Central Open Access Subset.
Q: I found a sentence that doesn't say what CIViCmine describes. Why?
A: CIViCmine is completely automated with no human curation, so there will be some mistakes. We have tuned the system for high precision so that it makes as few false positive errors as possible, at the cost of likely missing more true biomarker mentions (false negatives). All information offered by CIViCmine should be interpreted accordingly.
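The trade-off can be sketched in a few lines of Python. This is only an illustration: the scores, labels and thresholds below are made up, and the actual threshold and classifier scores used by CIViCmine are not stated here.

```python
def precision_recall(scored, threshold):
    """Compute precision and recall when keeping items with score >= threshold.

    scored: list of (classifier_score, is_true_biomarker) pairs.
    """
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical scored sentences, invented for illustration only
scored = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.70, True), (0.60, False), (0.40, True),
]

low = precision_recall(scored, 0.5)    # permissive threshold
high = precision_recall(scored, 0.85)  # strict threshold
print(low, high)
```

With these toy values, raising the threshold from 0.5 to 0.85 removes both false positives (precision rises from 0.6 to 1.0) but also drops a true biomarker sentence (recall falls from 0.75 to 0.5).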
Q: Could this approach be used to extract other types of biomedical knowledge?
A: It probably could. If you'd like to talk, please get in contact with Jake Lever.
Q: Can I bulk download the data?
A: You can bulk download it from our Zenodo repository or use the download buttons on this website to get a subset of the data. All data is Creative Commons Zero licensed.
Q: Where can I read more about the methods and full results?
A: This work has been published in Genome Medicine. Please cite this paper if you make use of the data.
This resource is built as an aid to curation of the CIViC database, based at the McDonnell Genome Institute at Washington University in St Louis. All data in this resource has been automatically extracted using text mining tools.
We use a supervised learning approach in which a set of experts annotated a large set of sentences to serve as training examples for a machine learning system. That system is Kindred, a Python package for relation extraction. It is the successor to our VERSE system, which won part of the BioNLP'16 Shared Task for relation extraction. The trained system is then applied to all abstracts in PubMed and all full-text papers in the PubMed Central Open Access Subset.
The main researcher is Jake Lever, previously a PhD student in the Jones group at Canada's Michael Smith Genome Sciences Centre in Vancouver, Canada. He is now a lecturer at the University of Glasgow.