PheneBank: Processed Medline Abstracts and PMC full articles + Phenotype-Disease Associations
- 1. University of Cambridge
- 2. Queen Mary University
Description
The PheneBank project:
Free-text scientific literature has the potential to be an incredibly valuable source of data for uncovering the often hidden relationships between genes, diseases and phenotypes. Phenotypic descriptions cover abnormalities in anatomical structures, processes and behaviours, for example 'growth delay' and 'body weight loss'. Such descriptions form the basis for determining the existence and treatment of a disease but, because of their inherent complexity, have previously received less attention from the text mining community. In recent years, significant effort has been spent by a small number of expert curators to create coding systems for phenotypes (called "ontologies"), such as the Human Phenotype Ontology (HP) and the Mammalian Phenotype Ontology (MP). The PheneBank project proposes to support and speed up curation using terms discovered directly from the literature and to automatically integrate them with such standard ontologies.
The project seeks to harness texts for extracting statistically significant associations between phenotypes, diseases and genes. Earlier approaches have suffered from not providing deep semantic representations of the phenotypes they tried to target. Our deep learning-based approach is an attempt to overcome this issue by reducing the uncertainty between textual and ontological forms of phenotypes. Specifically, the model treats each multitoken named entity as a single token, which allows more reliable handling of multiword expressions. The approach builds on groundbreaking research at the European Bioinformatics Institute by the PI (Nigel Collier) and the Co-investigator (Damian Smedley, Queen Mary University London), including terminology alignment of phenotypes using pairwise scoring of the conceptual elements that make up the phenotype.
http://www.phenebank.org
The dataset:
As an output of the PheneBank project, we release the set of 24 million MEDLINE abstracts as well as 3.8M open-access PMC full articles annotated with 9 classes of entity: Phenotype, Disease, Anatomy, Cell, Cell_line, GPR, Gene_variant, Molecule, and Pathway. The entities have been mapped to five major ontologies: SNOMED, HPO, MeSH, PRO, and FMA.
In addition, we release the phenotype-disease associations that are automatically extracted based on co-occurrence statistics in Medline abstracts. Among the different statistical measures we evaluated, the Fisher test corresponded best to the known tuples in the curated associations available from the Monarch Initiative (https://monarchinitiative.org).
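To make the co-occurrence measures concrete, the following is a minimal sketch of how Dice, normalized PMI and a Fisher exact test could be computed from document counts. It is an illustration under stated assumptions, not the project's actual scoring code: the function names are hypothetical, the Fisher test is assumed to be one-sided (over-representation), natural logarithms are assumed, and the total number of abstracts `n` is a parameter that the description does not specify.

```python
import math


def dice(co, d, p):
    """Dice coefficient from co-occurrence count and the two marginal counts."""
    return 2.0 * co / (d + p)


def npmi(co, d, p, n):
    """Normalized pointwise mutual information, in the range [-1, 1]."""
    if co == 0:
        return -1.0  # conventional limit when the pair never co-occurs
    p_xy = co / n
    if p_xy == 1.0:
        return 1.0  # both terms occur in every document
    pmi = math.log(p_xy / ((d / n) * (p / n)))
    return pmi / -math.log(p_xy)


def _log_comb(a, b):
    """Natural log of the binomial coefficient C(a, b)."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)


def log_fisher_right_tail(co, d, p, n):
    """Log p-value of a one-sided Fisher exact test (assumed variant).

    Sums the hypergeometric probabilities P(X = k) for k >= co, where d of
    the n documents mention the disease and p mention the phenotype.
    """
    upper = min(d, p)
    if co > upper:
        return float("-inf")
    log_terms = [
        _log_comb(d, k) + _log_comb(n - d, p - k) - _log_comb(n, p)
        for k in range(co, upper + 1)
    ]
    peak = max(log_terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))
```

Working in log space keeps the Fisher computation stable for corpus-scale counts, where the raw p-values would underflow.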
Processing:
The NER tagging has been done using a BiLSTM-CRF neural model (https://github.com/pilehvar/phenebank) trained on expert-annotated data (to be released for research). The grounding to ontologies relies on semantic embedding of concepts and entities in a unified semantic space.
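As a rough illustration of grounding in a shared embedding space, the sketch below ranks candidate concepts by their similarity to an entity mention. Everything here is an assumption made for illustration: the use of cosine similarity, the function names, the toy two-dimensional vectors and the placeholder concept IDs do not come from the PheneBank model, which is only described as embedding concepts and entities in a unified semantic space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_concepts(mention_vec, concept_vecs):
    """Return (concept_id, score) pairs sorted from best to worst match."""
    scored = [(cid, cosine(mention_vec, vec)) for cid, vec in concept_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors and placeholder IDs, made up for this example only.
concepts = {
    "CONCEPT_A": [1.0, 0.0],
    "CONCEPT_B": [0.0, 1.0],
    "CONCEPT_C": [1.0, 1.0],
}
ranked = rank_concepts([0.9, 0.1], concepts)
```

A ranked list of this shape mirrors the released annotations, in which each entity carries several candidate concept IDs ordered by confidence.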
Data format:
PheneBank_Processed_PubMed.part[x].tar.gz contains 24,359,010 .txt files that are organized into 812 directories. Each .txt file is named with a PubMed article ID and contains the corresponding article's abstract and its annotations. The dataset is split into four (unequal) parts based on PubMed's structure:
part1: medline16n00* medline16n01* medline16n02* [299 directories, 2.8GB]
part2: medline16n03* medline16n04* [200 directories, 4.7GB]
part3: medline16n05* medline16n06* [200 directories, 5.3GB]
part4: medline16n07* medline16n08* [113 directories, 3.1GB]
The PheneBank_Processed_PMC.tar.gz file has 6,180 directories, which are named after the journal titles from which the articles have been drawn. There are three files per article (i.e., 3 .txt files for each of the 3,751,770 distinct articles), containing text from different parts of the article: .title.txt, .abstract.txt, and .body.txt.
In every file, each line starts with a word; for words that are identified as entities, the entity type and mapping information follow on the same line (tab separated), in the following format:
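The three-files-per-article layout can be regrouped after extraction with a few lines of Python. This is a sketch under assumptions: it supposes that the part of the file name before the section suffix identifies the article and that the parent directory is the journal directory. The function name `group_pmc_files` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Section suffixes as listed in the data format description.
SECTION_SUFFIXES = (".title.txt", ".abstract.txt", ".body.txt")


def group_pmc_files(paths):
    """Map (journal_dir, article_id) -> {section_name: path}.

    Files whose names do not end in one of the known suffixes are ignored.
    """
    articles = defaultdict(dict)
    for path in map(Path, paths):
        for suffix in SECTION_SUFFIXES:
            if path.name.endswith(suffix):
                article_id = path.name[: -len(suffix)]
                section = suffix.split(".")[1]  # "title", "abstract" or "body"
                articles[(path.parent.name, article_id)][section] = path
                break
    return dict(articles)
```

In practice the input could come from something like `Path(extracted_root).rglob("*.txt")`, where `extracted_root` is wherever the archive was unpacked.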
word <TAB> ::: <TAB> entity_type <TAB> entity_concept_ID_1##confidence_score_1 entity_concept_ID_2##confidence_score_2 ...
Note that the concepts are sorted according to their mapping confidence scores.
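The line format above can be read with a small parser. The following is a minimal sketch rather than an official reader: it assumes that candidate concepts within the last field are separated by whitespace, that concept IDs themselves contain no whitespace, and that confidence scores are numeric. The function name `parse_annotation_line` is hypothetical.

```python
def parse_annotation_line(line):
    """Parse one line of an annotated file.

    Returns a dict with the word, its entity type (None for plain words)
    and a list of (concept_id, confidence) pairs in file order.
    """
    fields = line.rstrip("\n").split("\t")
    result = {"word": fields[0], "entity_type": None, "concepts": []}
    if len(fields) < 3 or fields[1] != ":::":
        return result  # an ordinary word without annotation
    result["entity_type"] = fields[2]
    if len(fields) > 3:
        for item in fields[3].split():
            concept_id, sep, score = item.rpartition("##")
            if not sep:
                continue  # skip anything that is not ID##score
            try:
                result["concepts"].append((concept_id, float(score)))
            except ValueError:
                result["concepts"].append((concept_id, None))
    return result
```

Because the concepts are already sorted by confidence in the files, the first pair in `concepts` is the top-ranked mapping.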
As for the PheneBank_Associations.tsv file, there are ten columns that correspond to the following (left to right):
- Disease Name
- Disease (MONDO) ID
- Phenotype Name
- Phenotype (HPO) ID
- Co-occurrence Frequency
- Disease Frequency
- Phenotype Frequency
- Fisher (log)
- Dice
- Normalized PMI
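The ten columns can be loaded with Python's standard csv module, as in the sketch below. Several details are assumptions: the snake_case column keys are names chosen here for convenience, the three frequency columns are assumed to be integers and the three measures floats, and whether the file starts with a header row is not stated, so it is exposed as the `has_header` parameter.

```python
import csv

# Keys chosen for this example; the order follows the column list above.
COLUMNS = [
    "disease_name", "disease_id", "phenotype_name", "phenotype_id",
    "cooccurrence_freq", "disease_freq", "phenotype_freq",
    "fisher_log", "dice", "normalized_pmi",
]
INT_COLUMNS = COLUMNS[4:7]
FLOAT_COLUMNS = COLUMNS[7:]


def read_associations(handle, has_header=False):
    """Yield one dict per row from an open PheneBank_Associations.tsv handle."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, row))
        for key in INT_COLUMNS:
            record[key] = int(record[key])
        for key in FLOAT_COLUMNS:
            record[key] = float(record[key])
        yield record
```

A typical call would be `with open("PheneBank_Associations.tsv", newline="", encoding="utf-8") as fh: rows = list(read_associations(fh))`, after which the rows can be filtered or sorted on any of the measures.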
Files (36.4 GB)
MD5 checksum | Size |
---|---|
md5:29c299d26301402e774bd4e28258abb0 | 64.1 MB |
md5:d7dc0fc7be318cdca2f6d8065f05b022 | 19.5 GB |
md5:9662ce48e3320abf5d3c0df46ff7a548 | 2.9 GB |
md5:8bbaba023dcd1d5b8cd8b1eb355acbb8 | 5.0 GB |
md5:898b9d2ff43f7fbe02a7484f6bbb7447 | 5.6 GB |
md5:2e7ebc5e48e6d280236189bb8a7358c2 | 3.3 GB |
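The MD5 checksums listed for the files can be used to verify a download. The sketch below uses Python's hashlib and reads the file in chunks, so that multi-gigabyte archives do not need to fit in memory. The function names and the chunk size are choices made for this example.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a listed value such as 'md5:29c2...'."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing(local_path, "md5:29c299d26301402e774bd4e28258abb0")` should return True for an intact copy of the file that carries that checksum, where `local_path` is wherever the file was saved.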