Hidetoshi Itaya edited this page Jun 28, 2013 · 15 revisions

Members

Universal Biological Sequence ID (UBSID)

The first step in merging tables is relating their primary IDs. In other words, any two tables can be joined if they contain the same ID. Therefore, all databases can be joined as fully connected Linked Data if we can assign a universal identifier across the databases.

Molecular biology has so far developed mainly around the Central Dogma, and the majority of research has some relation to genomic sequence. Genes, parts of genes such as SNPs, and protein-binding motifs can all be related to DNA sequences, and the same holds for the transcriptome and proteome. Therefore, we could presumably connect the majority of biological information if we could use nucleotide sequences as IDs. However, using sequences per se has several problems:

  1. The sequence can be VERY long (e.g., human chromosome 1) and therefore unsuitable as an ID
  2. The sequence can be VERY short (e.g., a C->T SNP in a gene) and therefore not identifiable
  3. Multiple sequences can be highly similar or identical, as in multi-copy paralogs
  4. DNA has two strands, each with its own sequence

In order to overcome these problems, a universal sequence-based ID should have

  1. position information
  2. reference sequence information
  3. actual sequence (when there are mutations from the reference)
  4. strand information

and ALL OF THE ABOVE must be expressed as a short, human-comprehensible ID.

We tried to solve this problem using reference-based compression of DNA sequences. With reference-based compression built on offsets and run-length encoding, a sequence can be expressed using only the positions where it differs from the reference.

[Figure: reference-based compression]

[Figure: UBSID]
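To make the idea of offset plus run-length encoding concrete, here is a minimal sketch in Python. It is not the UBSID format itself: the `start:=N` notation, the function names `encode` and `decode`, and the restriction to substitutions (no insertions or deletions) are all assumptions made for illustration. In this toy version the decoder looks the matching runs up in the reference.

```python
import re


def encode(reference: str, start: int, query: str) -> str:
    """Encode `query`, aligned without gaps at reference[start:], as
    '<start>:' followed by '=N' for N matching bases and a literal base
    for each mismatch. Hypothetical notation, not the real UBSID syntax."""
    ops = []
    run = 0  # length of the current run of matching bases
    for offset, base in enumerate(query):
        if reference[start + offset] == base:
            run += 1
        else:
            if run:
                ops.append(f"={run}")
                run = 0
            ops.append(base)
    if run:
        ops.append(f"={run}")
    return f"{start}:" + "".join(ops)


def decode(reference: str, encoded: str) -> str:
    """Rebuild the query sequence from the toy encoding and the reference."""
    start_text, ops = encoded.split(":", 1)
    pos = int(start_text)
    out = []
    for token in re.findall(r"=\d+|[ACGTN]", ops):
        if token.startswith("="):
            n = int(token[1:])
            out.append(reference[pos:pos + n])  # copy a matching run
            pos += n
        else:
            out.append(token)  # a single substituted base
            pos += 1
    return "".join(out)


reference = "ACGTACGTACGT"
query = "GTACCTAC"  # differs from reference[2:10] at one position
encoded = encode(reference, 2, query)
print(encoded)
print(decode(reference, encoded) == query)
```

The encoded string grows with the number of mismatches rather than with the sequence length, which is why a long region that mostly matches the reference can still fit in a short ID.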

Strengths of UBSID include:

  • Reversible compression
    • Can be converted back to sequence without de-referencing.
    • IDs can be directly compared by similarity
  • Related IDs can be compared just by sorting.
    • hg19-chr19:045409882+A42:=43-1092=193-580=718:
    • hg19-chr19:045412079+1:T:rs7412:C>T:Arg>Cys
    • Can further deduce sequence relationships (part-of, antisense, upstream...)
  • Human readable
  • You can add a comment!
    • Example (BLAST results):
    • hg19-chr20:033104523+3C:=60>27:eval=2e-25:bits=119
    • hg19-chr20:033128400+1D:<58=29:eval=7e-07:bits=58.0

Examples and tools

Statistical assessment of semantic similarity

When we have sufficient linked data, we can assess the semantic similarity of “terms” by calculating how often a pair of terms co-occurs, without any a priori semantic knowledge of the terms.

For example, when the terms “GO:0006281” and “DNA repair” are frequently co-annotated to the same gene, we can infer a semantic relation between the terms.
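As a rough illustration of the co-occurrence idea, the sketch below scores two terms by the Jaccard index of the gene sets they annotate. The Jaccard index is a stand-in chosen for simplicity and is not necessarily the measure defined on this page; the function name `cooccurrence_similarity`, the gene names, and the annotation data are all made up.

```python
from typing import Dict, Set


def cooccurrence_similarity(annotations: Dict[str, Set[str]],
                            term_x: str, term_y: str) -> float:
    """Jaccard index of the gene sets annotated with term_x and term_y.

    Stand-in measure for illustration: shared genes divided by all genes
    carrying either term. No knowledge of what the terms mean is used."""
    genes_x = {gene for gene, terms in annotations.items() if term_x in terms}
    genes_y = {gene for gene, terms in annotations.items() if term_y in terms}
    union = genes_x | genes_y
    if not union:
        return 0.0
    return len(genes_x & genes_y) / len(union)


# Made-up annotation table: gene -> set of terms attached to it.
annotations = {
    "geneA": {"GO:0006281", "DNA repair"},
    "geneB": {"GO:0006281", "DNA repair"},
    "geneC": {"GO:0006281"},
    "geneD": {"kinase"},
}

print(cooccurrence_similarity(annotations, "GO:0006281", "DNA repair"))
print(cooccurrence_similarity(annotations, "GO:0006281", "kinase"))
```

In this toy table the first pair shares two of the three genes that carry either term, while the second pair shares none, so the score separates related from unrelated terms using annotation counts alone.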

We previously showed the feasibility of such statistical assessment of semantic similarity at BioHackathon 2011 (Kyoto): https://github.com/dbcls/bh11/wiki/G-language

  • Similarity of terms X and Y can be calculated as

Eq.1

  • Similarity of data series S1(x) and S2(x) can be calculated as

Eq.2

SS.1
