
AI models flunk language test that takes grammar out of the equation
The Hindu
Generative AI systems excel at a wide range of tasks but struggle to understand meaning the way humans do, raising questions about their capabilities.
Generative AI systems like large language models and text-to-image generators can pass rigorous exams that are required of anyone seeking to become a doctor or a lawyer. They can outperform most people on Mathematical Olympiad problems. They can write halfway decent poetry, generate aesthetically pleasing paintings and compose original music.
These remarkable capabilities may make it seem like generative artificial intelligence systems are poised to take over human jobs and have a major impact on almost all aspects of society. Yet while the quality of their output sometimes rivals work done by humans, they are also prone to confidently churning out factually incorrect information. Sceptics have also called into question their ability to reason.
Large language models have been built to mimic human language and thinking, but they are far from human. From infancy, human beings learn through countless sensory experiences and interactions with the world around them. Large language models do not learn as humans do – they are instead trained on vast troves of data, most of which is drawn from the internet.
The capabilities of these models are very impressive, and there are already AI agents that can attend meetings for you, shop for you or handle insurance claims. But before handing the keys to a large language model for any important task, it is worth assessing how its understanding of the world compares with that of humans.
I’m a researcher who studies language and meaning. My research group developed a novel benchmark that can help people understand the limitations of large language models in understanding meaning.
So what “makes sense” to large language models? Our test involves judging the meaningfulness of two-word noun-noun phrases. For most people who speak fluent English, noun-noun word pairs like “beach ball” and “apple cake” are meaningful, but “ball beach” and “cake apple” have no commonly understood meaning. The reasons for this have nothing to do with grammar. These are phrases that people have come to learn and commonly accept as meaningful, by speaking and interacting with one another over time.
We wanted to see if a large language model had the same sense of the meaning of word combinations, so we built a test that measured this ability using noun-noun pairs, for which grammar rules are useless in deciding whether a phrase has a recognisable meaning. By contrast, with an adjective-noun pair such as “red ball,” grammar alone tells you the phrase is well formed, while the reversal, “ball red,” is not.
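To make the shape of such a probe concrete, here is a minimal sketch of how one might ask a model to judge two-word noun-noun phrases. It is not the benchmark itself: the OpenAI Python client, the model name, the prompt wording and the word pairs are illustrative assumptions, not details of our study.

```python
# Minimal sketch of a meaningfulness-judgment probe (illustrative only).
# Assumes the OpenAI Python client and an API key in the environment;
# the model name and prompt wording are placeholders, not the study's own.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Conventional compounds paired with their reversed, non-conventional counterparts.
NOUN_NOUN_PAIRS = [
    ("beach ball", "ball beach"),
    ("apple cake", "cake apple"),
]


def judge_meaningfulness(phrase: str) -> str:
    """Ask the model whether a two-word phrase has a commonly understood
    meaning in English, answering only 'yes' or 'no'."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "user",
                "content": (
                    f'Does the two-word phrase "{phrase}" have a commonly '
                    "understood meaning in English? Answer only yes or no."
                ),
            }
        ],
    )
    return response.choices[0].message.content.strip().lower()


if __name__ == "__main__":
    for meaningful, reversed_phrase in NOUN_NOUN_PAIRS:
        print(meaningful, "->", judge_meaningfulness(meaningful))
        print(reversed_phrase, "->", judge_meaningfulness(reversed_phrase))
```

A fluent English speaker would answer “yes” for the first phrase in each pair and “no” for the second; comparing a model’s answers against such human judgments is the basic idea behind this kind of test.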