G language project

Members

Kazuharu Arakawa ([email protected])
Hidetoshi Itaya ([email protected])

Universal Biological Sequence ID (UBSID)

The first step in merging tables is relating their primary IDs. In other words, any tables can be joined if they contain the same ID. Therefore, all databases can be joined into fully connected Linked Data if we can assign a universal identifier to the entries in each database.

Molecular biology has thus far developed mainly around the Central Dogma, and the majority of research has some relation to the genomic sequence. Genes, parts of genes such as SNPs, and protein-binding motifs can all be related to DNA sequences, and the same holds for transcriptome and proteome data. Therefore, we could presumably connect the majority of biological information if we could use nucleotide sequences as IDs. However, using the sequences themselves as IDs has several problems:

The sequence can be VERY long (human chromosome 1) and not suitable as an ID
The sequence can be VERY short (C->T SNP in a gene) and not identifiable
There could be multiple sequences that are highly similar or identical as in multi-copy paralogs
DNA has two strands of sequences

In order to overcome these problems, a universal sequence-based ID should have

position information
reference sequence information
actual sequence (when there are mutations from the reference)
strand information

and ALL OF THE ABOVE must be expressed as a short, human-comprehensible ID.

We tried to solve this problem using reference-based compression of DNA sequences. By encoding a sequence against a reference with offsets and run-length encoding, the sequence can be expressed by its mismatching positions alone.
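
The idea of expressing a sequence only by its mismatches can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the UBSID encoding itself: the `=N` notation for a run of N matching bases, the function names `encode` and `decode`, and the restriction to substitutions (no insertions or deletions) are all hypothetical choices made for this example.

```python
import re


def encode(ref: str, seq: str) -> str:
    """Encode seq against an equal-length ref as match runs and mismatches.

    Hypothetical toy format (NOT the UBSID syntax): '=N' is a run of N
    matching bases, and a bare letter is a substituted base.
    """
    if len(ref) != len(seq):
        raise ValueError("this sketch handles substitutions only")
    parts = []
    run = 0  # length of the current run of matching bases
    for r, s in zip(ref, seq):
        if r == s:
            run += 1
        else:
            if run:
                parts.append(f"={run}")
                run = 0
            parts.append(s)
    if run:
        parts.append(f"={run}")
    return "".join(parts)


def decode(ref: str, code: str) -> str:
    """Rebuild the sequence from the reference and the toy encoding."""
    out = []
    pos = 0  # current position in the reference
    for run, base in re.findall(r"=(\d+)|([A-Za-z])", code):
        if run:
            n = int(run)
            out.append(ref[pos:pos + n])  # copy matching bases from ref
            pos += n
        else:
            out.append(base)  # substituted base replaces one ref base
            pos += 1
    return "".join(out)


ref = "ACGTACGTAC"
seq = "ACGTTCGTAC"
code = encode(ref, seq)
print(code)               # =4T=5
print(decode(ref, code))  # ACGTTCGTAC
```

The longer the matching runs, the shorter the encoded string, which is why a sequence that is nearly identical to its reference compresses to a handful of characters.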

[Figure: reference-based compression]

[Figure: UBSID]

Strengths of UBSID include:

Reversible compression
- Can be converted back to sequence without de-referencing.
- IDs can be directly compared by similarity
Comparisons of related IDs by just sorting.
- hg19-chr19:045409882+A42:=43-1092=193-580=718:
- hg19-chr19:045412079+1:T:rs7412:C>T:Arg>Cys
- Can further deduce sequence relationships (part-of, antisense, upstream...)
Human readable
You can add a comment!
- Example: (BLAST results)
- hg19-chr20:033104523+3C:=60>27:eval=2e-25:bits=119
- hg19-chr20:033128400+1D:<58=29:eval=7e-07:bits=58.0
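
As a rough illustration of the "human readable" and "compare by sorting" points, the leading part of the example IDs can be split with a regular expression. The field layout used here (reference, chromosome, zero-padded position, strand sign, then everything else) is inferred from the examples above and is an assumption, not a specification; `parse_ubsid` and `ParsedUBSID` are hypothetical names, and the remainder of the ID is deliberately left unparsed.

```python
import re
from typing import NamedTuple

# Pattern inferred from the example IDs above; an assumption, not a spec.
UBSID_PREFIX = re.compile(
    r"^(?P<reference>[^-]+)-(?P<chrom>[^:]+):"
    r"(?P<position>\d+)(?P<strand>[+-])(?P<rest>.*)$"
)


class ParsedUBSID(NamedTuple):
    reference: str
    chrom: str
    position: int
    strand: str
    rest: str  # encoded sequence and any comment fields, left unparsed


def parse_ubsid(ubsid: str) -> ParsedUBSID:
    """Split the leading fields of an ID shaped like the examples."""
    m = UBSID_PREFIX.match(ubsid)
    if m is None:
        raise ValueError(f"not shaped like the example IDs: {ubsid!r}")
    return ParsedUBSID(
        m["reference"], m["chrom"], int(m["position"]), m["strand"], m["rest"]
    )


ids = [
    "hg19-chr19:045412079+1:T:rs7412:C>T:Arg>Cys",
    "hg19-chr19:045409882+A42:=43-1092=193-580=718:",
]

# Positions in the examples are zero-padded to a fixed width, so a plain
# string sort places IDs on the same chromosome in positional order.
ordered = sorted(ids)
print(ordered[0])
print(parse_ubsid(ids[0]))
```

Chromosome names sort lexically in this sketch (for example `chr19` before `chr2`), so only IDs on the same reference and chromosome end up in genomic order.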

Examples and tools

UBSID2Seq
- Syntax: http://rest.g-language.org/ubsid/ubsid2seq/[UBSID]
- Example:
  - http://rest.g-language.org/ubsid/ubsid2seq/hg19-chr20:033104523+5D62:=159-9391=76-8283=168-5785=44
  - http://rest.g-language.org/ubsid/ubsid2seq/hg19-chr19:045409882+A42:=43-1092=193-580=718:
Seq2UBSID
- Syntax(GET): http://rest.g-language.org/ubsid/seq2ubsid/[Sequence]
- Syntax(POST): http://rest.g-language.org/ubsid/seq2ubsid/ (POST sequence to this URL)
- Example:
  - http://rest.g-language.org/ubsid/seq2ubsid/atgggcgaggggtgggcggcggccctgcagcctagagttttggggccttggtgcgcgatgattgtgattcagaatccaaccgaataa
BLAST2UBSID
- Syntax(GET): http://rest.g-language.org/ubsid/blast2ubsid/[Sequence]
- Syntax(POST): http://rest.g-language.org/ubsid/blast2ubsid/ (POST sequence to this URL)
- Example:
  - http://rest.g-language.org/ubsid/blast2ubsid/atgggcgaggggtgggcggcggccctgcagcctagagttttggggccttggtgcgcgatgattgtgattcagaatccaaccgaataa
GFF2UBSID
- Not available as a service for the moment.
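
The GET and POST syntax lines listed above can be assembled programmatically. The following is a minimal sketch: the base URL is copied from those syntax lines as written, while the helper names `service_url` and `post_request`, the set of characters left unescaped, and the assumption that the POST body is the raw sequence are all hypothetical and not confirmed by the service. Nothing is sent over the network here; the request object would still need to be passed to `urllib.request.urlopen`.

```python
from urllib.parse import quote
from urllib.request import Request

# Base copied from the syntax lines above; adjust if the service has moved.
BASE = "http://rest.g-language.org/ubsid"
TOOLS = {"ubsid2seq", "seq2ubsid", "blast2ubsid"}


def service_url(tool: str, payload: str = "", base: str = BASE) -> str:
    """Build a GET-style URL of the form [base]/[tool]/[payload]."""
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    # Leave the punctuation seen in the example IDs unescaped; whether the
    # server expects other characters escaped is an assumption.
    return f"{base}/{tool}/{quote(payload, safe=':+=-')}"


def post_request(tool: str, sequence: str, base: str = BASE) -> Request:
    """Prepare (but do not send) a POST request carrying the sequence.

    Sending the raw sequence as the request body is an assumption.
    """
    return Request(
        service_url(tool, base=base),
        data=sequence.encode("ascii"),
        method="POST",
    )


ubsid = "hg19-chr20:033104523+5D62:=159-9391=76-8283=168-5785=44"
print(service_url("ubsid2seq", ubsid))

req = post_request("seq2ubsid", "atgggcgaggggtgggcggcggccctgcagcc")
print(req.get_method(), req.full_url)
```

POST is the safer choice for long sequences, since a GET request places the whole sequence in the URL path.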

Statistical assessment of semantic similarity

When we have sufficient linked data, we can assess the semantic similarity of “terms” by calculating the co-occurrence of a pair of terms, without any a priori semantic knowledge of the terms.

For example, when the terms “GO:0006281” and “DNA repair” are frequently co-annotated to the same gene, we can infer a semantic relation between the terms.
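
To make the co-occurrence idea concrete, here is a small sketch that scores a pair of terms by how often they annotate the same genes. It uses the Jaccard index as a stand-in measure; this is one common co-occurrence score chosen for illustration and is not claimed to be the formula the project uses. The gene names and annotations are made-up toy data, and `term_to_genes` and `jaccard` are hypothetical helper names.

```python
from collections import defaultdict


def term_to_genes(annotations):
    """Invert a gene -> terms mapping into a term -> genes index."""
    index = defaultdict(set)
    for gene, terms in annotations.items():
        for term in terms:
            index[term].add(gene)
    return index


def jaccard(index, x, y):
    """Shared genes divided by all genes annotated with either term."""
    gx = index.get(x, set())
    gy = index.get(y, set())
    union = gx | gy
    return len(gx & gy) / len(union) if union else 0.0


# Made-up toy annotations (gene names are placeholders, not real data).
annotations = {
    "geneA": {"GO:0006281", "DNA repair"},
    "geneB": {"GO:0006281", "DNA repair"},
    "geneC": {"GO:0006281"},
    "geneD": {"kinase"},
}

index = term_to_genes(annotations)
print(jaccard(index, "GO:0006281", "DNA repair"))  # 2 shared of 3 genes
print(jaccard(index, "GO:0006281", "kinase"))      # no shared genes
```

No meaning of either term is consulted: the score comes purely from which genes carry which labels, which is the point of a statistical rather than knowledge-based assessment.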

We have previously shown the feasibility of such statistical assessment of semantic similarity in BioHackathon 2011 (Kyoto). https://github.com/dbcls/bh11/wiki/G-language

Similarity of terms X and Y can be calculated as

Eq.1

Similarity of data series S1(x) and S2(x) can be calculated as

Eq.2

Examples
- http://ws.g-language.org/toys/bh11/
- http://ws.g-language.org/ubsid/result.php?db=Pfam
- http://ws.g-language.org/ubsid/result-id.php?id=PF00069
  - http://smart.embl.de/smart/do_annotation.pl?DOMAIN=SM00133
  - http://prosite.expasy.org/PS00108
- http://ws.g-language.org/ubsid/result-id.php?id=GO:0005667
- http://ws.g-language.org/ubsid/search.cgi?id=[TERM]
  - Searches pre-formatted data for similar terms, ranked by similarity score.

SS.1
