Research Article

An overview of semantic search evaluation initiatives

Published: 01 January 2015

Abstract

Recent work on searching the Semantic Web has yielded a wide range of approaches with respect to the underlying search mechanisms, the management and presentation of results, and the style of input. Each approach affects both the quality of the information retrieved and the user's experience of the search process. However, despite the wealth of experience accumulated from evaluating Information Retrieval (IR) systems, the evaluation of Semantic Web search systems has largely developed in isolation from mainstream IR evaluation, with a far less unified approach to the design of evaluation activities. This has led to slow progress and low interest compared with other established evaluation series, such as TREC for IR or OAEI for Ontology Matching. In this paper, we review existing approaches to IR evaluation and analyse evaluation activities for Semantic Web search systems. Through this discussion, we identify their weaknesses and highlight the need for a more comprehensive evaluation framework that addresses current limitations.



Published In

Web Semantics: Science, Services and Agents on the World Wide Web, Volume 30, Issue C, January 2015, 106 pages

Publisher

Elsevier Science Publishers B. V., Netherlands

    Author Tags

    1. Benchmarking
    2. Evaluation
    3. Information retrieval
    4. Performance
    5. Semantic search
    6. Usability
