Sindice.com: a document-oriented lookup index for open linked data

Published: 01 November 2008

Abstract

Data discovery on the Semantic Web requires crawling and indexing of statements, in addition to the 'linked-data' approach of de-referencing resource URIs. Existing Semantic Web search engines focus on database-like functionality, at the cost of index size, query performance and live updates. We present Sindice, a lookup index over Semantic Web resources. Our index allows applications to automatically locate documents containing information about a given resource. In addition, we allow resource retrieval through inverse-functional properties, offer full-text search, and index SPARQL endpoints. Finally, we extend the sitemap protocol to index large datasets efficiently, with minimal impact on data providers.
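The lookup idea in the abstract can be illustrated with a small sketch. It is not Sindice's implementation: the class `LookupIndex`, its methods, the `is_uri` heuristic and the example.org documents are all hypothetical names chosen for illustration. The sketch assumes that each document is already parsed into (subject, predicate, object) string triples, and it maps every resource URI, and every (inverse-functional property, value) pair, to the set of documents that mention it.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


def is_uri(term: str) -> bool:
    # Assumption: only http(s) terms count as resource URIs; everything
    # else (literals, mailto: values) is ignored by the resource index.
    return term.startswith(("http://", "https://"))


class LookupIndex:
    """Hypothetical in-memory index from resources to source documents."""

    def __init__(self, inverse_functional: Iterable[str] = ()):
        self.inverse_functional = set(inverse_functional)
        self.by_resource = defaultdict(set)  # resource URI -> document URLs
        self.by_ifp = defaultdict(set)       # (property, value) -> document URLs

    def add_document(self, doc_url: str, triples: Iterable[Triple]) -> None:
        for s, p, o in triples:
            # Record the document under every resource it mentions.
            for term in (s, o):
                if is_uri(term):
                    self.by_resource[term].add(doc_url)
            # Record inverse-functional property values separately.
            if p in self.inverse_functional:
                self.by_ifp[(p, o)].add(doc_url)

    def lookup(self, uri: str) -> List[str]:
        return sorted(self.by_resource.get(uri, ()))

    def lookup_ifp(self, prop: str, value: str) -> List[str]:
        return sorted(self.by_ifp.get((prop, value), ()))


FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

index = LookupIndex(inverse_functional=[FOAF_MBOX])
index.add_document("http://example.org/alice.rdf", [
    ("http://example.org/alice#me", FOAF_MBOX, "mailto:alice@example.org"),
])
index.add_document("http://example.org/bob.rdf", [
    ("http://example.org/bob#me", FOAF_KNOWS, "http://example.org/alice#me"),
    ("http://example.org/bob#friend", FOAF_MBOX, "mailto:alice@example.org"),
])

# Documents that mention the resource by URI.
print(index.lookup("http://example.org/alice#me"))
# Documents that identify a resource through the same mailbox value.
print(index.lookup_ifp(FOAF_MBOX, "mailto:alice@example.org"))
```

Both queries return the two example documents. The second one finds bob.rdf even though that document uses a different URI for the same mailbox, which is the kind of indirect retrieval that inverse-functional properties make possible.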

Published In

International Journal of Metadata, Semantics and Ontologies  Volume 3, Issue 1
January 2008
91 pages
ISSN:1744-2621
EISSN:1744-263X

Publisher

Inderscience Publishers

Geneva 15, Switzerland

Author Tags

  1. data discovery
  2. document location
  3. full-text searching
  4. indexing
  5. information retrieval
  6. large datasets
  7. lookup index
  8. open linked data
  9. resource retrieval
  10. scalability
  11. semantic web
  12. sitemap protocol

Qualifiers

  • Article

Cited By

  • (2023) Linked Data - The Story So Far. Linking the World’s Information, pp. 115-143. DOI: 10.1145/3591366.3591378. Online publication date: 5-Sep-2023
  • (2023) Linking the World’s Information. Online publication date: 5-Sep-2023
  • (2022) Maximizing Bigdata Retrieval: Block as a Value for NoSQL over SQL. Proceedings of the 2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 556-563. DOI: 10.1109/ASONAM55673.2022.10068692. Online publication date: 10-Nov-2022
  • (2022) LOD search engine: A semantic search over linked data. Journal of Intelligent Information Systems 59:1, pp. 71-91. DOI: 10.1007/s10844-021-00687-0. Online publication date: 1-Aug-2022
  • (2021) The impact of semantic annotation techniques on content-based video lecture recommendation. Journal of Information Science 47:6, pp. 740-752. DOI: 10.1177/0165551520931732. Online publication date: 1-Dec-2021
  • (2020) The Semantic Web. Semantic Web 11:1, pp. 169-185. DOI: 10.3233/SW-190387. Online publication date: 1-Jan-2020
  • (2020) A more decentralized vision for Linked Data. Semantic Web 11:1, pp. 101-113. DOI: 10.3233/SW-190380. Online publication date: 1-Jan-2020
  • (2020) Content-based Union and Complement Metrics for Dataset Search over RDF Knowledge Graphs. Journal of Data and Information Quality 12:2, pp. 1-31. DOI: 10.1145/3372750. Online publication date: 24-Apr-2020
  • (2019) Large-scale Semantic Integration of Linked Data. ACM Computing Surveys 52:5, pp. 1-40. DOI: 10.1145/3345551. Online publication date: 13-Sep-2019
  • (2019) Novel Node Importance Measures to Improve Keyword Search over RDF Graphs. Database and Expert Systems Applications, pp. 143-158. DOI: 10.1007/978-3-030-27618-8_11. Online publication date: 26-Aug-2019
