Introduction
Concept Profile Analysis (CPA) has proven a powerful tool for interpreting and prioritizing the results of bioinformatics analyses, and for linking data sets based on the best “educated guess” when precise links are not available. The technology uses the vector space model to relate concepts (such as genes and biological processes) mined from the literature to each other. We call the vectors that represent these concepts “concept profiles”. Concept profiles can be compared efficiently and transparently [1], and the comparison yields a measure of the strength of the relationship between concepts. CPA algorithms have, for example, been applied successfully to compare microarray studies [2], to predict proteins putatively associated with muscular dystrophy pathways [3], and to associate chemical structures with gene expression data [4].
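The comparison of two concept profiles can be illustrated with a minimal sketch. It assumes that a profile is a sparse mapping from concept ID to weight and that similarity is measured by the cosine of the angle between the vectors. The text above does not state which measure CPA uses, so the cosine is only one common choice in the vector space model, and all concept IDs and weights below are invented for illustration.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two sparse concept profiles.

    Each profile maps a concept ID to a non-negative weight.
    """
    shared = profile_a.keys() & profile_b.keys()
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile is treated as unrelated to everything
    return dot / (norm_a * norm_b)


# Hypothetical profiles; IDs and weights are not taken from the CPA database.
gene_profile = {"C001": 0.8, "C002": 0.1, "C003": 0.4}
process_profile = {"C001": 0.6, "C003": 0.5, "C004": 0.2}
unrelated_profile = {"C005": 0.9}

print(cosine_similarity(gene_profile, process_profile))    # profiles share C001 and C003
print(cosine_similarity(gene_profile, unrelated_profile))  # no shared concepts
```

Because only shared concepts contribute to the numerator, the score can be traced back to individual concepts, which is what makes this kind of comparison transparent.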
The standalone application Anni [10] supports a number of standard CPA operations. For example, to perform pathway analysis for a gene expression experiment, a user first provides a list of gene database identifiers for the most significantly expressed genes. Anni uses these identifiers to query the concept profile database for the corresponding concept profiles, and subsequently constructs a “concept set” of these profiles. To match the list of genes with pathways, the user performs the operation “match concept sets” for the gene concept set with a predefined concept set of the category “Gene Ontology (GO) biological process”. Note that we refer here to GO concept profiles. Anni calculates the concept profile matching scores between the two concept sets, resulting in a ranked list of GO biological processes for the gene list. Finally, Anni can retrieve literature evidence from a supporting documents database: either documents in which the gene and the biological process are co-mentioned, or documents that provide enough statistical evidence to support the gene-biological process association without mentioning the two together in an abstract.
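The “match concept sets” step can be sketched as follows. This is an illustration under stated assumptions, not Anni's implementation: profiles are sparse mappings from concept ID to weight, the matching score is an inner product, and the scores of the individual query profiles are averaged per target concept. The function names, gene and process names, concept IDs and weights are all hypothetical.

```python
def inner_product(profile_a, profile_b):
    """Sum of weight products over the concepts that two sparse profiles share."""
    return sum(w * profile_b[c] for c, w in profile_a.items() if c in profile_b)


def match_concept_sets(query_set, target_set):
    """Rank target concepts by their mean matching score against a query set.

    Both arguments map a concept name to its profile (concept ID -> weight).
    Averaging over the query set is an assumption made for this sketch.
    """
    if not query_set:
        return []
    ranked = []
    for target_name, target_profile in target_set.items():
        scores = [inner_product(q, target_profile) for q in query_set.values()]
        ranked.append((target_name, sum(scores) / len(scores)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical gene concept set and predefined target concept set.
genes = {
    "GENE_A": {"C1": 0.5, "C2": 0.5},
    "GENE_B": {"C1": 1.0},
}
processes = {
    "process_1": {"C1": 0.8},
    "process_2": {"C3": 0.9},
    "process_3": {"C2": 0.2},
}

for name, score in match_concept_sets(genes, processes):
    print(f"{name}\t{score:.3f}")
```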
Here we present a new suite of Web Service (WS) operations that allows bioinformaticians to design and execute their own CPA workflow outside the Anni Web tool, possibly as part of a larger bioinformatics analysis. The WS was designed according to the outcome of an Anni usage analysis, where the common user and machine operations were identified.
Technical specifications
We implemented the CPA WS in Java, using the Spring Model-View-Controller (MVC) framework and Apache Tomcat, and following the Java API for XML Web Services (JAX-WS) specification. The Anni Java code for the different operations was compiled into separate libraries, for which we wrote Java wrappers. Spring MVC serves as the WS interface to remote applications. Following the JAX-WS standard enables an auto-generated WSDL specification and the use of Java annotations to specify operations. The WS is deployed on Apache Tomcat and uses a database of indexed PubMed records. The thesaurus behind the Anni Web application was converted to the Simple Knowledge Organization System (SKOS) format, and the SKOS concept IDs were implemented as resolvable Uniform Resource Identifiers (URIs) leading to a Virtuoso Universal Server triple store.
User and machine operations as Taverna workflows
As an example of how to work with the CPA WS, we implemented several workflows in the workflow management system Taverna workbench v 2.4 [5], following the best practices for workflow design [6]. The whole suite of CPA workflows consists of 11 workflows collected in a myExperiment pack [http://www.myexperiment.org/packs/368]. These workflows are of two types: 1) nine workflows that each call one WS operation, and 2) two pipelines of nested workflows that call more than one WS operation. The workflows of type 1 are the building blocks for the pipelines of type 2, and were implemented with re-usability in mind.
Here we describe the workflow “Match concept profiles with predefined set” (Figure 1) in order to illustrate the design and use of the WS and workflows. The workflow invokes the WS operation “getSimilarConceptProfilesPredefined”. The operation takes three input parameters, which can be accessed using the XML splitter function in Taverna. The user specifies the concept(s) to be matched (“Query concept IDs”), the concept set to match against (“Match concept set”), and a cutoff number of matched concepts to return (“Cutoff”).
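Outside Taverna, the same operation could be invoked by sending a SOAP message. The sketch below only builds such a message and does not send it. The operation name comes from the text; the service namespace, the child element names (`queryConceptId`, `matchConceptSet`, `cutoff`) and the example values are hypothetical placeholders, and the real ones would have to be taken from the auto-generated WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/cpa"  # hypothetical; use the namespace from the WSDL


def build_request(query_concept_ids, match_concept_set, cutoff):
    """Build a SOAP envelope for getSimilarConceptProfilesPredefined.

    The three arguments mirror the three workflow inputs; the element
    names used for them are assumptions made for this sketch.
    """
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    operation = ET.SubElement(
        body, f"{{{SERVICE_NS}}}getSimilarConceptProfilesPredefined"
    )
    for concept_id in query_concept_ids:
        ET.SubElement(operation, "queryConceptId").text = str(concept_id)
    ET.SubElement(operation, "matchConceptSet").text = match_concept_set
    ET.SubElement(operation, "cutoff").text = str(cutoff)
    return ET.tostring(envelope, encoding="unicode")


# Invented concept IDs and concept set label, for illustration only.
request_xml = build_request(["123", "456"], "GO biological process", 10)
print(request_xml)
```

In Taverna, the XML splitters play the role of `build_request`: they expose the fields of the request message as separate input ports.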
Figure 1. Taverna workflow for matching concept(s) with a predefined set of concept profiles.
Blue boxes represent the workflow inputs and outputs, the green box the WS invocation, and purple boxes the XML splitters for the inputs and outputs of the WS operation. The workflow is available at http://www.myexperiment.org/workflows/3396.
Opening the “Run workflow” window in Taverna shows the structured annotations for the whole workflow and its input parameters, together with example values (Figure 2). WS functional annotations can be accessed via the “Details” tab in Taverna (Figure 3). When the workflow is run, it produces a ranked list of concepts associated with the query concept(s), together with their similarity scores.
Figure 2. Taverna run window.
Detailed, structured descriptions for the whole workflow and its input parameters, with example values are shown in the window.
Figure 3. Taverna details window.
A detailed description of the function of the WS operation is shown in the window.
The workflow described above executes the core functionality of concept profile matching. The other WS operations implement functionality such as explaining the association found (by listing the common concepts that contribute most to the score) and showing the literature evidence (by retrieving the links to the abstracts in PubMed). Workflows implementing these WS operations can be coupled to the “Match concept profiles with predefined set” workflow to form a pipeline of nested workflows. Examples of such pipelines are the “GWAS to biomedical concept” nested workflow, which performs Single Nucleotide Polymorphism (SNP) annotation, and the “Annotate gene list with top ranking concepts” nested workflow for gene annotation (Figure 4).
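The idea of explaining an association by its most contributing common concepts can be sketched as follows. The sketch assumes an inner-product matching score, so that each shared concept contributes the product of its two weights. The function name, concept IDs and weights are hypothetical and do not reflect the actual WS operation or its output format.

```python
def explain_match(profile_a, profile_b, top_n=3):
    """Return the shared concepts that contribute most to an inner-product score.

    Each result is a (concept ID, fraction of the total score) pair,
    ordered from the largest contribution to the smallest.
    """
    shared = profile_a.keys() & profile_b.keys()
    contributions = {c: profile_a[c] * profile_b[c] for c in shared}
    total = sum(contributions.values())
    if total == 0:
        return []  # no overlap, so there is nothing to explain
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return [(concept, value / total) for concept, value in ranked[:top_n]]


# Hypothetical profiles for a gene and a biological process.
gene_profile = {"C1": 0.6, "C2": 0.3, "C3": 0.1}
process_profile = {"C1": 0.5, "C2": 0.1, "C4": 0.4}

for concept, fraction in explain_match(gene_profile, process_profile):
    print(f"{concept}: {fraction:.1%} of the matching score")
```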
Figure 4. Taverna nested workflow for gene annotation.
Blue boxes represent input and output parameters, purple boxes the local Taverna worker services, yellow boxes the XPath services for fast XML parsing, and grey boxes the constant values. The workflow is available at http://www.myexperiment.org/workflows/3921.
Discussion
Compared to Anni, the CPA WS and workflows raise the level of reproducibility of bioinformatics experiments that make use of CPA, and the CPA WS can more easily be used together with other tools. For example, CPA-based SNP annotation can be performed with the CPA WS by coupling it to an external WS that maps the SNP identifiers to Entrez gene identifiers [7]. With Anni, the mapping from SNP to Entrez gene identifiers would have to be performed separately, which reduces reproducibility.
Some of the functionalities in Anni have not been migrated to the WS. For example, Anni provides a function for hierarchical clustering of the results. Clustering is not a CPA function by itself, but we are considering implementing workflows that perform this function. We are also working on a workflow implementation of the process that creates the data underlying the Anni WS, possibly using the recently developed text-mining workbench Argo [8], which would allow for more flexibility in performing CPA [9]. Specialization of the underlying resources for use in specific research domains, such as plant breeding or metabolomics, is a topic for future work.
Conclusions
By creating a WS that builds upon the Anni interactive tool, we have made the CPA technology available in a way that lets users more easily integrate it with other software and save their procedures, results and related provenance.
Software availability
Software license
Apache 2.0
Author contributions
KH designed the workflows, performed the usage analysis for the WS, and wrote the manuscript. RS implemented the WS. EM helped in the design of the workflows, the usage analysis and testing of the WS. EH converted the thesaurus into SKOS format and set up the concept triple store. MT and RK helped in implementing the WS and setting up the concept triple store. BM, EvM, and JK helped in the usage analysis and testing of the WS. MR conceived the study and helped design the workflows. All authors approved the final version of the manuscript.
Competing interests
The authors declare that they have no competing interests.
Grant information
The work in this paper was funded by the Seventh Framework Programme of the European Commission (Digital Libraries and Digital Preservation area ICT-2009.4.1 project reference 270192) (Wf4Ever), and grant agreement No. 305444 (RD-Connect).
Acknowledgements
We would like to thank Peter-Bram ’t Hoen for his comments about the WS design and functionality.
References
1. Jelier R, Schuemie MJ, Roes PJ, et al.: Literature-based concept profiles for gene annotation: the issue of weighting. Int J Med Inform. 2008; 77(5): 354–362.
2. Jelier R, ’t Hoen PA, Sterrenburg E, et al.: Literature-aided meta-analysis of microarray data: a compendium study on muscle development and disease. BMC Bioinformatics. 2008; 9: 291.
3. van Haagen HH, ’t Hoen PA, de Morrée A, et al.: In silico discovery and experimental validation of new protein-protein interactions. Proteomics. 2011; 11(5): 843–853.
4. Hettne KM, Boorsma A, van Dartel DA, et al.: Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data. BMC Med Genomics. 2013; 6(1): 2.
5. Wolstencroft K, Haines R, Fellows D, et al.: The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res. 2013; 41(Web Server issue): W557–W561.
6. Hettne KM, Wolstencroft K, Belhajjame K, et al.: Best Practices for Workflow Design: How to Prevent Workflow Decay. In Proceedings of SWAT4LS 2012, 2012.
7. Hettne KM, Dharuri H, van Schouwen R, et al.: Explaining genome-wide association study results using concept profile analysis and the Kyoto Encyclopedia of Genes and Genomes pathway database. In Proceedings of BioLINK SIG 2013, 2013; page 60.
8. Rak R, Batista-Navarro RT, Carter J, et al.: Processing biological literature with customizable web services supporting interoperable formats. Database (Oxford). 2014; 2014: bau064.
9. van der Horst E, Roos M, Hettne K: Workflows and services for concept profile generation. F1000Posters. 2014; 5(33).
10. Jelier R, Schuemie MJ, Veldhoven A, et al.: Anni 2.0: a multipurpose text-mining tool for the life sciences. Genome Biol. 2008; 9(6): R96.
11. Hettne KM, van Schouwen R, Mina E, et al.: New suite of Concept Profile Analysis Web Services. ZENODO. 2014.