Beyond the Turing test: AI still falls short when faced with challenges that are easily overcome by humans

Monday 2nd of October 2023 - Updated on Wednesday 4th of October 2023

● In some cases, the results demonstrated by emergent large language models like ChatGPT give the impression that AIs are every bit as competent as humans.
● An experiment inspired by the Turing test appears to show that it is increasingly difficult to distinguish between humans and machines in online conversations.
● Notwithstanding their impressive performance, today’s AIs have yet to overcome serious limits in terms of language and in other fields like autonomous vehicles.

Does so-called artificial intelligence (AI) technology now deliver results on a par with those generated by the human mind? The success of ChatGPT and, more generally, generative AI tools has repeatedly raised this question over the last few months. A recent social experiment conducted by the Israeli language model specialist AI21 Labs has shown just how blurred the boundary between the products of human and machine intelligence has become.

Researchers at the company created a large-scale experiment in the form of an online game called Human or Not?. Using a conversational interface, participants had to guess whether they were talking to a human or an AI. The details of the simulation were meticulously crafted. Several different language models were used, sometimes during the same conversation, and the “artificial” interlocutors were presented with individual names, professions, personality traits, and language tics (including slang and grammatical errors). The AIs were also set up to respond within a certain timeframe, taking into account the time required for typing.

Giant Turing test delivers troubling results

In the course of a month, the team at AI21 Labs analysed more than 10 million human-AI and human-human conversations involving over 1.5 million unique users. They found that participants correctly guessed the status of the interlocutor they were talking to in only 68% of cases. In conversations where humans were faced with an AI, they identified it as such in only 60% of cases.

These results should nonetheless be put in perspective. In Human or Not?, conversations were limited to two minutes (Turing proposed a test lasting five minutes), and it is widely accepted that large language models are less likely to fool humans in more prolonged exchanges. For Sébastien Konieczny, the director of research at the Computer Science Research Institute of Lens, however, the key takeaway concerns the usefulness of the Turing test itself: “Just because you skilfully manipulate language does not mean that you understand its content and are intelligent. The Turing test was the only empirical test we had for evaluating AIs, but this research on large language models shows that it may not be relevant after all.”

Tricking chatbots with capital letters

To show that large language models are not so smart, a team from the University of California, Santa Barbara in the United States has developed a series of techniques capable of unmasking them with simple questions. Examples include instructions to substitute letters with other characters in a given word, or requests like: “randomly drop two 1 from the string: 0110010011. Give me three different outputs”. When compared to humans, AIs are very bad at games like these.
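The “drop two 1s” request is easy to check mechanically, which is part of what makes it a useful probe. The following is a minimal sketch in Python, not the researchers' own code: the function names `drop_ones` and `is_valid_answer` are hypothetical, and the sketch assumes that an answer counts as correct if it can be obtained by deleting exactly two “1” characters from the original string.

```python
from itertools import combinations


def drop_ones(s: str, count: int = 2) -> set[str]:
    """Return every distinct string obtained by deleting `count` '1' characters from `s`."""
    # Positions of all '1' characters in the input string.
    one_positions = [i for i, ch in enumerate(s) if ch == "1"]
    results = set()
    # Try every way of choosing which '1's to delete.
    for removed in combinations(one_positions, count):
        results.add("".join(ch for i, ch in enumerate(s) if i not in removed))
    return results


def is_valid_answer(original: str, answer: str, count: int = 2) -> bool:
    """Check whether `answer` is a legitimate response to the challenge (assumed criterion)."""
    return answer in drop_ones(original, count)


# The string quoted in the article.
valid_outputs = drop_ones("0110010011")
print(sorted(valid_outputs))
```

A chatbot's three proposed outputs could then be passed one by one to `is_valid_answer`, and any output that is not in the set of valid strings reveals a counting or deletion error.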

Another highly effective method is to append a random uppercase word to each word in a sentence, as in, for example, “isCURIOSITY waterARCANE wetTURBULENT orILLUSION drySAUNA?” Humans are quick to pick out the visually obvious question: “is water wet or dry?” However, chatbots, including ChatGPT and Meta’s LLaMA, are unable to pass tests of this kind. The difference stems from machines’ inability to mirror the subtlety of human interaction, which is often characterised by the simultaneous use of different competencies. And this weakness can cause major problems in another important field in AI: driverless cars.
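The uppercase trick can be reproduced in a few lines. The sketch below is illustrative only: the helper names `add_noise` and `strip_noise` are hypothetical, and it assumes that the original question is written entirely in lowercase, so that every uppercase letter can be treated as noise.

```python
import random
import re


def add_noise(sentence: str, noise_words: list[str], rng: random.Random) -> str:
    """Append a random uppercase word to each lowercase word of the sentence."""
    has_question_mark = sentence.endswith("?")
    words = sentence.rstrip("?").split()
    noisy = [word.lower() + rng.choice(noise_words).upper() for word in words]
    return " ".join(noisy) + ("?" if has_question_mark else "")


def strip_noise(noisy_sentence: str) -> str:
    """Recover the hidden question by deleting every run of uppercase letters."""
    return re.sub(r"[A-Z]+", "", noisy_sentence)


# The example quoted in the article.
example = "isCURIOSITY waterARCANE wetTURBULENT orILLUSION drySAUNA?"
print(strip_noise(example))

# Generating a new noisy question from an assumed list of noise words.
noise = ["CURIOSITY", "ARCANE", "TURBULENT", "ILLUSION", "SAUNA"]
generated = add_noise("is water wet or dry?", noise, random.Random(0))
print(generated)
```

The point of the sketch is that the hidden question is trivially recoverable by rule, just as it is visually obvious to a human reader, whereas the chatbots tested reportedly fail to see through the noise.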

Autonomous vehicles halted by their inability to understand other road users

A research project at the University of Copenhagen, which analysed hours of video of self-driving cars under real conditions, highlighted a curious weakness in their AI systems’ interaction with other road users. Whereas self-driving cars are very good at detecting obstacles and respecting the rules of the road, these vehicles struggle to correctly interpret the gestures and attitudes of pedestrians, and the behaviour of other cars, which any human driver would intuitively understand. “Autonomous cars can do incredible things, but they seem to get stuck when it comes to understanding others on the road,” points out a co-author of the study, Barry Alan Brown. “But a sudden breakthrough may be on the way, and in a sense, my own research is helping to improve the behaviour of these vehicles.” On the road or in the framework of a discussion, today’s AIs still behave like machines rather than human interlocutors.