Another thanks to Aeryk, who got me to review the Turing Test…
The Turing test is a test for program intelligence, an updated version of which goes something like this:
A judge engages in a chat session with a respondent. After a while, the judge must decide if the respondent is a human or a program. If the judge mistakes the program for a human, then the program is intelligent.
This test has captured the imagination of many, and some people use it as a gold standard for Artificial Intelligence. However, this test is full of problems.
First, imagine the judge types “What’s 327653 * -2.34872^2” and the respondent immediately answers correctly. The judge could then conclude the respondent is a program since no human could calculate that quickly. Yet the absurdity of this failure should be obvious; the program failed because it was “too intelligent”!
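To make the point concrete: the giveaway is speed, not correctness. A one-line program (a sketch of mine, taking “-2.34872^2” to mean the negative number squared) answers before the judge finishes reaching for a calculator:

```python
# The judge's arithmetic question is a single expression for a program,
# answered in microseconds -- which is exactly what gives it away.
# (Interpreting "-2.34872^2" as (-2.34872) squared.)
answer = 327653 * (-2.34872) ** 2
print(f"{answer:.6f}")
```

A program that wanted to pass would have to deliberately delay, or err, neither of which has anything to do with intelligence.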
This problem actually happened in reverse. Years ago, at a (widely criticized) version of the Turing Test called the Loebner Prize, one of the respondents was a human Shakespeare expert. Given an obscure Shakespeare question, she answered correctly, and was classified as a computer because the judge believed no human could know that information!
Here’s another problem. Imagine the following conversation between a judge and a respondent:
JUDGE: What’s your favorite food?
RESPONDENT: Cheese.
JUDGE: What kind of cheese?
RESPONDENT: Gouda.
JUDGE: My pet moose choked on a piece of Gouda and its heart stopped, but fortunately we were able to resuscitate it with a live eel.
RESPONDENT: That was close!
The response is relevant, but something is off: it doesn’t comment on how bizarre the judge’s story is. Yet recognizing the story as bizarre requires a large body of facts, including which animals are common pets, the diet of a moose, and so on. But shouldn’t reasoning be independent of stored information?
Douglas Lenat thought not, and embarked on the years-long CYC project to build exactly such a body of stored information. The project involved entering a huge number of facts (from encyclopedic knowledge to current events to obvious common-sense tidbits) into a system equipped with an inference engine, which could then derive new facts from those previously entered. Its success is debated, but to my knowledge no Turing Test winner resulted from the project (not that I’m claiming this was Lenat’s goal).
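The CYC recipe (facts plus an inference engine) can be sketched in a few lines. This is my own toy illustration of forward chaining, not CYC’s actual representation or engine: rules fire whenever their premises are among the known facts, until nothing new can be derived.

```python
# Toy knowledge base: facts are tuples, rules are (premises, conclusion).
facts = {("pet", "moose"), ("eats", "moose", "plants")}

rules = [
    ({("pet", "moose")}, ("unusual_pet", "moose")),
    ({("eats", "moose", "plants")}, ("herbivore", "moose")),
    ({("herbivore", "moose")}, ("odd_to_feed_cheese", "moose")),
]

# Forward chaining: apply every applicable rule until a fixed point.
changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

# "odd_to_feed_cheese" was never entered; it was inferred in two steps.
print(("odd_to_feed_cheese", "moose") in facts)  # True
```

The point of the moose story is that judging it bizarre takes exactly this kind of chain, and a vast number of mundane facts to chain over.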
Yet is a vast body of facts required for intelligence? For instance, imagine a young child who knows nothing, but is a fast learner, able to draw deep conclusions from anything learned. Would anyone claim the child was unintelligent because s/he didn’t have a vast body of facts on hand?
Let’s move on to another problem with the Turing test. It’s related to the example above, and is illustrated by the following exchange:
JUDGE: Have you ever LeBronned anyone?
RESPONDENT: No.
JUDGE: Not even your mother?
RESPONDENT: No.
JUDGE: Why not? I mean, she is your mother after all!
RESPONDENT: I guess I never had the chance.
The absurdity of this response becomes clear when you consider that “LeBronning” is when someone falls violently to the ground after making slight contact with another person. It got its name from basketball player LeBron James, for whom this is a signature move. Here are some hilarious slow motion videos of LeBron doing his thing. Had the respondent known what “LeBronning” was, its response would have been puzzlement, like the case of the pet moose.
This reveals another aspect of intelligence: social interaction. “LeBronning” shows that concepts are dynamic, so a system can’t simply be coded and forgotten; it must stay sensitive to current events. Between the time a program is completed and the time it’s tested, social pressures may introduce new terms, concepts or behaviors into conversation. Maybe we measure intelligent behavior by social interaction, and what is conversation but social interaction? A conversation’s language and topics are determined by current events, and successful conversation requires anticipating the other participant’s viewpoint and learning (adjusting behavior based on feedback).
In fact, maybe the only reason we’re able to use the ambiguous medium of human language successfully is that we anticipate the other participant’s viewpoint. We use it to fill in the blanks, disambiguate, and even correct the other’s utterances. What’s more, we take it for granted that the other party does the same, as witnessed by statements like “Stop playing dumb; you know what I meant!”. This means conversation can’t be regarded as simply language comprehension; one needs to build a mental model of the other person that goes beyond language.
But back to knowledge of social trends, and specifically “LeBronning”. There are people who haven’t heard of “LeBronning” (or other social trends), yet they are intelligent. True, but they behave (or should behave) differently when confronted with terms they don’t know. For instance, here’s a dialog that can take place in the absence of this knowledge yet still exhibits all the earmarks of intelligence:
JUDGE: Have you ever LeBronned anyone?
RESPONDENT: What’s that?
JUDGE: It’s when someone barely touches you and you fall down dramatically.
RESPONDENT: No.
JUDGE: Not even your mother?
RESPONDENT: Of course not!
The main thing about this response is that the respondent learned what “LeBronning” was and incorporated the new knowledge into the conversation. The learning is important: another way of trying to smoke out a program would be to subtly refer back to earlier elements of the dialog.
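The behavior in that dialog can be sketched as a tiny state machine. This is my own minimal illustration, not a real chatbot: on hitting an unknown term the agent asks for a definition, stores it, and answers from the updated lexicon for the rest of the conversation.

```python
class Learner:
    """Agent that asks about unknown terms instead of bluffing."""

    def __init__(self):
        self.lexicon = {}                    # terms learned mid-conversation

    def respond(self, term):
        if term not in self.lexicon:
            return "What's that?"            # admit ignorance and ask
        return "Of course not!"              # answer using the learned term

    def learn(self, term, definition):
        self.lexicon[term] = definition      # incorporate the judge's answer


agent = Learner()
print(agent.respond("LeBronning"))   # before learning: What's that?
agent.learn("LeBronning", "falling dramatically after slight contact")
print(agent.respond("LeBronning"))   # after learning: Of course not!
```

The canned replies are a placeholder, of course; the point is only that the lexicon changes during the conversation, which is exactly what a coded-and-forgotten system cannot do.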
So what would be an adequate measure of intelligence, and what would a workable Turing-like test be?
One measure of intelligence may be an ability to work with facts — when they are given. For instance, take a machine learning program. Given a body of data and fields of interest, it can analyze the relationships in the data to produce a predictive model for those fields. Then when given new data, it can fill in those fields from what it learned. In this respect, it’s similar to the bright child of the example above. We supply the world, the program proves its intelligence by how it processes the information we give it.
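As a sketch of that “we supply the world” idea, here is a deliberately tiny predictive model: a 1-nearest-neighbor classifier in pure Python (my own illustrative choice, standing in for any machine learning program). Given labeled rows, it fills in the field of interest for new rows.

```python
from math import dist  # Euclidean distance (Python 3.8+)

# The "world" we supply: (features, field of interest).
training = [
    ((1.0, 1.0), "small"),
    ((1.2, 0.9), "small"),
    ((8.0, 9.0), "large"),
    ((9.1, 8.7), "large"),
]

def predict(row):
    """Fill in the missing field by copying it from the closest known row."""
    _, label = min(training, key=lambda example: dist(example[0], row))
    return label

print(predict((1.1, 1.0)))   # small
print(predict((8.5, 9.2)))   # large
```

Everything the program “knows” was handed to it; its contribution is purely in how it processes that information, which is the analogy to the bright child.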
Perhaps this can be an inspiration for a better Turing test? Instead of random chit-chat, imagine a much more controlled setting. A short story is entered in simple language and judges then ask questions about the story — questions whose answers are all implicit within the story but may require a significant amount of inference to draw out. Answers are then scored based on accuracy and relevance. At this point, intelligence could be determined either by absolute performance or by a comparison with the human respondents. Programs that perform favorably would be regarded as intelligent.
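One way the scoring could work is sketched below. The text above specifies only that answers are scored for accuracy and relevance and then compared against human respondents; the exact-match answer key, the question texts, and the median-human threshold here are all my own assumptions for illustration.

```python
def score(answers, key, human_scores):
    """Return (accuracy, passes): accuracy against an answer key,
    and whether it matches or beats the median human score."""
    correct = sum(1 for q, a in key.items() if answers.get(q) == a)
    accuracy = correct / len(key)
    median = sorted(human_scores)[len(human_scores) // 2]
    return accuracy, accuracy >= median

# Hypothetical story-comprehension key and one program's answers.
key = {"Who choked?": "the moose", "What revived it?": "a live eel"}
program_answers = {"Who choked?": "the moose", "What revived it?": "a live eel"}

acc, passes = score(program_answers, key, [0.6, 0.8, 0.9])
print(acc, passes)   # 1.0 True
```

A real version would need graded rather than exact-match scoring (the “relevance” part), but even this skeleton shows how the judgment becomes a measurement instead of an opinion.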
This test would eliminate the need for things that have nothing to do with intelligence (like emotional mimicry, intentional errors, delays, etc.), would put scoring on a much stricter footing (rather than resting on a judge’s opinion of the respondent’s identity), and would measure intelligent behavior rather than the breadth of a database of relevant knowledge.