Research Article (Open Access)
DOI: 10.1145/2556325.2566238

Scaling short-answer grading by combining peer assessment with algorithmic scoring

Published: 04 March 2014

Abstract

Peer assessment helps students reflect and exposes them to different ideas. It scales assessment and allows large online classes to use open-ended assignments. However, it requires students to spend significant time grading. How can we lower this grading burden while maintaining quality? This paper integrates peer and machine grading to preserve the robustness of peer assessment and lower grading burden. In the identify-verify pattern, a grading algorithm first predicts a student grade and estimates confidence, which is used to estimate the number of peer raters required. Peers then identify key features of the answer using a rubric. Finally, other peers verify whether these feature labels were accurately applied. This pattern adjusts the number of peers that evaluate an answer based on algorithmic confidence and peer agreement. We evaluated this pattern with 1370 students in a large, online design class. With only 54% of the student grading time, the identify-verify pattern yields 80-90% of the accuracy obtained by taking the median of three peer scores, and provides more detailed feedback. A second experiment found that verification dramatically improves accuracy with more raters, with a 20% gain over the peer-median with four raters. However, verification also leads to lower initial trust in the grading system. The identify-verify pattern provides an example of how peer work and machine learning can combine to improve the learning experience.
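
The abstract describes the identify-verify workflow only at a high level. As a concrete illustration, the sketch below shows one plausible way the pieces could fit together: an algorithmic confidence estimate decides how many peers grade an answer, identifying peers mark rubric features, and a verifying peer confirms or rejects those labels before points are awarded. The cutoffs, the majority-vote aggregation, and all names here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the identify-verify pattern summarized in the abstract.
# Cutoffs, aggregation rules, and rubric contents are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class RubricItem:
    description: str
    points: float


def raters_needed(machine_confidence: float) -> int:
    """Request more identifying peers when the algorithmic grader is unsure
    (assumed cutoffs, not the paper's)."""
    if machine_confidence >= 0.9:
        return 1
    if machine_confidence >= 0.6:
        return 2
    return 3


def peer_score(rubric, identify_labels, verify_labels):
    """Combine IDENTIFY and VERIFY judgments into a rubric score.

    identify_labels: one boolean list per identifying peer, marking which
                     rubric features that peer saw in the answer.
    verify_labels:   one boolean list from a verifying peer, confirming or
                     rejecting each consensus label.
    """
    n = len(identify_labels)
    # Strict majority vote per rubric item across identifiers (assumed rule).
    consensus = [
        sum(labels[i] for labels in identify_labels) * 2 > n
        for i in range(len(rubric))
    ]
    # Only labels the verifier confirms earn points; disputed items earn none.
    verified = [c and v for c, v in zip(consensus, verify_labels)]
    return sum(item.points for item, ok in zip(rubric, verified) if ok)


if __name__ == "__main__":
    rubric = [
        RubricItem("States a clear design goal", 2.0),
        RubricItem("Cites evidence from user testing", 3.0),
    ]
    confidence = 0.7                                  # from the scoring algorithm
    print(raters_needed(confidence))                  # -> 2 identifying peers
    identify_labels = [[True, True], [True, False]]   # labels from those 2 peers
    verify_labels = [True, True]                      # a third peer verifies
    print(peer_score(rubric, identify_labels, verify_labels))  # -> 2.0
```

For rough scale, the paper reports that this kind of confidence-driven allocation needed only 54% of the grading time of the three-peer-median baseline while retaining 80-90% of its accuracy; the thresholds in `raters_needed` above are placeholders, not the values used in the deployed system.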



      Published In

      L@S '14: Proceedings of the first ACM conference on Learning @ scale conference
      March 2014
      234 pages
      ISBN:9781450326698
      DOI:10.1145/2556325
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. assessment
      2. automated assessment
      3. online learning
      4. peer learning


      Conference

      L@S 2014: First ACM Conference on Learning @ Scale
      March 4-5, 2014
      Atlanta, Georgia, USA

      Acceptance Rates

      L@S '14 Paper Acceptance Rate: 14 of 38 submissions, 37%
      Overall Acceptance Rate: 117 of 440 submissions, 27%

