Artificial undergraduate

ChatGPT outperforms undergrads in intro-level courses, falls short later

Software that promises to detect AI-produced text fails to deliver.

Jacek Krywko

“Since the rise of large language models like ChatGPT, there have been lots of anecdotal reports about students submitting AI-generated work as their exam assignments and getting good grades. So, we stress-tested our university’s examination system against AI cheating in a controlled experiment,” says Peter Scarfe, a researcher at the School of Psychology and Clinical Language Sciences at the University of Reading.

His team created over 30 fake psychology student accounts and used them to submit ChatGPT-4-produced answers to examination questions. The anecdotal reports were true—the AI use went largely undetected, and, on average, ChatGPT scored better than human students.

Rules of engagement

Scarfe’s team submitted AI-generated work in five undergraduate modules, covering classes taken across all three years of study for a bachelor’s degree in psychology. The assignments were either 200-word answers to short questions or more elaborate essays, roughly 1,500 words long. “The markers of the exams didn’t know about the experiment. In a way, participants in the study didn’t know they were participating in the study, but we got the necessary permissions to go ahead with that,” says Scarfe.

Shorter submissions were prepared simply by copy-pasting the examination questions into ChatGPT-4 along with a prompt to keep the answer under 160 words. The essays were solicited the same way, but the requested word count was raised to 2,000. Offsetting the requested lengths from the actual limits compensated for ChatGPT-4’s tendency to miss word-count targets, producing content close enough to the required length. “The idea was to submit those answers without any editing at all, apart from the essays, where we applied minimal formatting,” says Scarfe.
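The team worked through the ChatGPT interface, but the prompting pattern is simple enough to sketch in code. Below is a minimal illustration using OpenAI’s Python client; the model name, prompt wording, and sample questions are assumptions for illustration, not the study’s actual materials.

```python
# Sketch of the prompting pattern described above, via OpenAI's Python client.
# The study used the ChatGPT web interface; the model name, prompt wording,
# and sample questions here are illustrative assumptions, not study materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def draft_answer(question: str, word_target: int) -> str:
    """Ask the model for an exam answer aimed at a given word count."""
    prompt = f"Answer the following question in {word_target} words:\n\n{question}"
    response = client.chat.completions.create(
        model="gpt-4",  # stand-in for the ChatGPT-4 version available in 2023
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Short-answer questions: ask for 160 words to stay under the 200-word limit.
short_answer = draft_answer("Describe the main models of working memory.", 160)

# Essays: ask for 2,000 words, since large targets tend to come back short.
essay = draft_answer("Critically evaluate attachment theory.", 2000)
```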

Overall, Scarfe and his colleagues slipped 63 AI-generated submissions into the examination system. Even with no editing or efforts to hide the AI usage, 94 percent of those went undetected, and nearly 84 percent got better grades (roughly half a grade better) than a randomly selected group of students who took the same exam.

“We did a series of debriefing meetings with people marking those exams and they were quite surprised,” says Scarfe. Part of the surprise was the reason the few AI submissions that were detected got flagged: not because they were too repetitive or robotic, but because they were too good.

Which raises a question: What do we do about it?

AI-hunting software

“During this study we did a lot of research into techniques of detecting AI-generated content,” Scarfe says. One such tool was OpenAI’s own AI text classifier; others include AI writing detection systems like the one made by Turnitin, a company specializing in tools for detecting plagiarism.

“The issue with such tools is that they usually perform well in a lab, but their performance drops significantly in the real world,” Scarfe explained. OpenAI reported that its classifier flagged AI-generated text as “likely” AI-written only 26 percent of the time, with a rather worrisome 9 percent false positive rate. Turnitin’s system, on the other hand, was advertised as detecting 97 percent of ChatGPT- and GPT-3-authored writing in a lab, with only one false positive in 100 attempts. But according to Scarfe’s team, the released beta version of this system performed significantly worse.
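Those headline numbers are worth translating into practice. Here is a back-of-the-envelope sketch (the 10 percent AI-use prevalence is an assumed figure for illustration, not one from the study) of what a 26 percent detection rate and a 9 percent false positive rate mean for whoever reviews the flags:

```python
# What a 26% detection rate and 9% false positive rate mean in practice.
# The 10% AI-use prevalence is an assumption for illustration only.
def flagged_breakdown(prevalence: float, tpr: float, fpr: float) -> tuple[float, float]:
    """Return (share of all submissions flagged, share of flags that are correct)."""
    true_flags = prevalence * tpr          # AI-written and caught
    false_flags = (1 - prevalence) * fpr   # human-written but flagged anyway
    total = true_flags + false_flags
    return total, true_flags / total

total, precision = flagged_breakdown(prevalence=0.10, tpr=0.26, fpr=0.09)
print(f"{total:.1%} of submissions flagged; {precision:.1%} of flags are real AI use")
# Prints: 10.7% of submissions flagged; 24.3% of flags are real AI use
```

Under those assumptions, roughly three of every four flagged submissions would actually be human work, which is why a 9 percent false positive rate is so hard to act on.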

“And remember that large language models are constantly improving. We did our experiment back in the summer of 2023, and GPT-4 has had, like, three new versions since then. Who knows what our results would be if we did this again today. It all starts to look like a race between the AIs generating content and the AIs designed to detect AI-generated content,” says Scarfe. So far, the detection systems are badly losing this race. And to make things even worse for them, a third participant has already joined the track, also working against them.

“There are AI systems made to humanize writing done by other AIs to evade AI detection tools, which adds another level to the problem. Today we don’t have a reliable way of telling if a submission was written by AI or not. I don’t think this will be possible,” says Scarfe. But not all is lost yet.

Spellcheckers and calculators

Of the five modules where Scarfe’s team submitted AI work, there was only one where it did not earn better grades than human students: the final module, taken just before students left the university. “Large language models can emulate human critical thinking, analysis, and integration of knowledge drawn from different sources only to a limited extent. In their last year at the university, students are expected to provide deeper insights and use more elaborate analytical skills. The AI isn’t very good at that, which is why students fared better,” Scarfe explained. All the good grades ChatGPT-4 earned were in the first- and second-year exams, where the questions were easier.

“But the AI is constantly improving, so it’s likely going to score better on those advanced assignments in the future. And since AI is becoming part of our lives and we don’t really have the means to detect AI cheating, at some point we are going to have to integrate it into our education system,” argues Scarfe. He said the role of a modern university is to prepare students for their professional careers, and the reality is that they are going to use various AI tools after graduation. So they’d be better off knowing how to do that properly.

“I’m a programmer, and I once saw a YouTube video where a guy was asking ChatGPT to write elaborate, advanced code in Python. The AI-written code didn’t work, and he solved that by reviewing the code and prompting the AI to correct it where necessary until the thing started working. You can’t do that if you don’t know anything about programming and just rely on the AI to do everything for you,” says Scarfe.
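The loop Scarfe describes is easy to picture in code. What follows is a hypothetical sketch of that review-and-reprompt workflow; ask_model stands in for any LLM call and is an assumption, not an API from the video or the study:

```python
# Hypothetical sketch of the fix-it loop Scarfe describes: run the AI's code,
# feed the failure back, and ask for a correction. ask_model is a stand-in
# for any LLM call; it is an assumption, not a real API.
from typing import Callable

def generate_working_code(task: str, ask_model: Callable[[str], str],
                          max_rounds: int = 5) -> str:
    """Iteratively request code and feed errors back until it runs."""
    code = ask_model(f"Write Python code to {task}.")
    for _ in range(max_rounds):
        try:
            exec(code, {})  # crude smoke test; a real review needs a human reader
            return code
        except Exception as err:
            # A programmer reads the failure and steers the next prompt; that
            # judgment is what Scarfe says can't be outsourced to the AI.
            code = ask_model(f"This code fails with {err!r}. Fix it:\n{code}")
    raise RuntimeError("the model never produced runnable code")
```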

He suspects that, sooner or later, AI tools will be unbanned at universities, just as other once-controversial tools were in the past. “We allowed the use of spellcheckers, and the world did not end. It’s going to be the same with AI, although the effect of using AI is going to be way more profound than that of a spellchecker or a calculator. So how exactly could we integrate AI into education? I would be a very rich man if I knew that,” Scarfe concluded.

PLOS ONE, 2024.  DOI: 10.1371/journal.pone.0305354

Listing image: Caiaimage/Chris Ryan

Jacek Krywko, Associate Writer
Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.