Researchers test ChatGPT, other AI models against real-world students

William Hersh, M.D., who has taught generations of medical and clinical informatics students at Oregon Health & Science University, found himself curious about the growing influence of artificial intelligence. He wondered how AI would perform in his own class.

So, he decided to try an experiment.

He tested six generative, large-language AI models, including ChatGPT, in an online version of his popular introductory course in biomedical and health informatics to see how they performed compared with living, thinking students. A study published in the journal npj Digital Medicine revealed the answer: better than as many as three-quarters of his human students.

“This does raise concern about cheating, but there is a larger issue here,” Hersh said. “How do we know that our students are actually learning and mastering the knowledge and skills they need for their future professional work?”

As a professor of medical informatics and clinical epidemiology in the OHSU School of Medicine, Hersh is especially attuned to new technologies. The role of technology in education is nothing new, Hersh said, recalling his own experience as a high school student in the 1970s during the transition from slide rules to calculators.

Yet, the shift to generative AI represents an exponential leap forward.

“Clearly, everyone should have some kind of foundation of knowledge in their field,” Hersh said. “What is the foundation of knowledge you expect people to have to be able to think critically?”

Large-language models

Hersh and co-author Kate Fultz Hollis, an OHSU informatician, pulled the knowledge assessment scores of 139 students who took the introductory course in biomedical and health informatics in 2023. They prompted six generative AI large-language models with student assessment materials from the course. Depending on the model, the AI scored between the 50th and 75th percentile on the multiple-choice questions used in quizzes and on a final exam that required short written responses.

“The results of this study raise significant questions for the future of student assessment in most, if not all, academic disciplines,” the authors write.

The study is the first to compare large-language models with students across a full academic course in the biomedical field. Hersh and Fultz Hollis noted that a knowledge-based course such as this one may be especially well suited to generative, large-language models, in contrast to more participatory academic courses that help students develop more complex skills and abilities.

Hersh remembers his experience in medical school.

“When I was a medical student, one of my attending physicians told me I needed to have all the knowledge in my head,” he said. “Even in the 1980s, that was a stretch. The knowledge base of medicine has long surpassed the capacity of the human brain to memorize it all.”

Maintaining the human touch

Yet, he believes there’s a fine line between making sensible use of technical resources to advance learning and over-reliance to the point that it inhibits learning. Ultimately, the goal of an academic health center like OHSU is to educate health care professionals capable of caring for patients and optimizing the use of data and information about them in the real world.

In that sense, he said, medicine will always require the human touch.

“There are a lot of things that health care professionals do that are pretty straightforward, but there are those instances where it gets more complicated and you have to make judgment calls,” he said. “That’s when it helps to have that broader perspective, without necessarily needing to have every last fact in your brain.”

With fall classes starting soon, Hersh said he’s not worried about cheating.

“I update the course each year,” he said. “In any scientific field, there are new advancements all the time and large-language models aren’t necessarily up to date on all of it. This just means we’ll have to look at newer or more nuanced tests where you won’t get the answer out of ChatGPT.”
