Come up with better Turing Tests
post by Stuart_Armstrong · 2014-06-10T10:47:23.878Z
So the Turing test has been "passed", and the general consensus is that this was achieved in a very unimpressive way - the 13-year-old Ukrainian persona was a cheat, the judges were incompetent, and so on. These are all fair criticisms, though the event did satisfy Turing's original criteria - and there are far more people willing to be dismissive of those criteria in retrospect than there were in advance. It happened about 14 years later than Turing had anticipated, which makes it quite a good prediction for 1950 (in my personal view, Turing made two mistakes that compensated for each other: the "average interrogator" was a much lower bar than he thought, but progress on the subject was much slower than he expected).
But anyway, the main goal now, as suggested by Toby Ord and others, is to design a better Turing test: something that gives AI designers a target to aim at, and that would be a meaningful test of abilities. The aim is to ensure that if a program passes these new tests, we won't be dismissive of how it was achieved.
Here are a few suggestions I've heard about or thought about recently; can people suggest more and better ideas?
1. Use proper control groups. 30% of judges thinking that a program is human is meaningless unless the judges also compare it with actual humans. Pair up a human subject with a program, and make the judge's job to establish which of the two subjects is the human and which is the program.
2. Toss out the persona tricks - no 13-year-olds, nobody with poor English skills. It was informative about human psychology that these tricks work, but we shouldn't allow them in future tests. All human subjects will have adequate English and typing skills.
3. On that subject, make sure the judges and subjects are properly motivated (financial rewards, prizes, prestige...) to detect, or to appear, human. We should also brief them that our usual conversational approach - establishing which kind of human we are dealing with - is not useful for determining whether we are dealing with a human at all.
4. Use only elite judges. For instance, if Scott Aaronson can't figure it out, the program must have some competence.
5. Make a collection of generally applicable approaches (such as Winograd schemas) available to the judges, while emphasising that they will have to come up with their own exact sentences, since anything already online could have been used to optimise the program (an example schema is sketched just after this list).
6. My favourite approach is to test the program on a task it was not optimised for. A cheap and easy way of doing that would be to test it on novel ASCII art.
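To make point 5 concrete, here is a commonly cited example of a Winograd schema - the trophy/suitcase pair is a standard illustration from the literature, not something from this post - expressed as a minimal Python sketch:

```python
# A commonly cited Winograd-schema pair (illustrative; not taken from the post).
# Swapping a single word flips the referent of "it", so word-statistics tricks
# tend to fail where humans answer instantly.
schema = {
    "sentence": "The trophy doesn't fit into the brown suitcase because it is too {}.",
    "question": "What is too {}?",
    "answers": {
        "large": "the trophy",    # the thing that won't fit is too large
        "small": "the suitcase",  # the container is too small
    },
}

for word, answer in schema["answers"].items():
    print(schema["sentence"].format(word))
    print(schema["question"].format(word), "->", answer)
```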
My current method would be the lazy one of simply typing this, then waiting, arms folded:
"If you want to prove you're human, simply do nothing for 4 minutes, then re-type this sentence I've just written here, skipping one word out of 2".
44 comments
Comments sorted by top scores.
comment by VAuroch · 2014-06-10T19:12:03.407Z · LW(p) · GW(p)
There already is a better Turing test, which is the Turing test as originally described.
To run the test as originally described, you need an active control: a human conversing with the judges at the same time in the same manner, where their decision is "Which is the human?", not "Is this a human?" If the incompetent judges had also been talking simultaneously with a real 13-year-old from Ukraine, I have no doubt that Eugene Goostman would have bombed horribly.
Replies from: Punoxysm↑ comment by Punoxysm · 2014-06-13T17:22:14.335Z · LW(p) · GW(p)
This is not that much better. The article that The Most Human Human is based on talks about the difficulty of communication in a 5-minute window, and the lack of knowledge lay judges have about what AI involves. The author consistently got named a human by better applying the tactics of some of the most successful bots: controlling conversation flow and using humor.
It's an improvement, but a winner would still win by "gaming" judges' psychology.
Replies from: VAuroch↑ comment by VAuroch · 2014-06-13T19:30:27.460Z · LW(p) · GW(p)
Where's that article? On the surface of it, that doesn't seem like a problem, necessarily. And a good active control doesn't have to be untrained; they could suggest questions to ask the computer, etc.
"Here's something I'll bet you the AI can't do: Ask it to tell you a story about it's favorite elementary-school teacher"
or whatever.
comment by redlizard · 2014-06-10T16:52:29.597Z · LW(p) · GW(p)
As a point of interest, I want to note that behaving like an illiterate immature moron is a common tactic for (usually banned) video game automation bots when faced with a moderator who is onto you, for exactly the same reason used here -- if you act like someone who just can't communicate effectively, it's really hard for others to reliably distinguish between you and a genuine foreign 13-year-old who barely speaks English.
comment by Gavin · 2014-06-10T15:30:45.580Z · LW(p) · GW(p)
Is the Turing Test really all that useful or important? I can easily imagine an AI powerful beyond any human intelligence that would still completely fail a few minutes of conversation with an expert.
There is so much about the human experience which is very particular to humans. Is creating an AI with a deep understanding of what certain subjective feelings are like, or of the niceties of social interaction, really necessary? Yes, an FAI eventually needs to have complete knowledge of those, but the intermediate steps may be quite alien and mechanical, even if intelligent.
Spending a lot of time trying to fool humans into thinking that a machine can empathize with them seems almost counterproductive. I'd rather the AIs honestly relate what they are experiencing, rather than try to pretend to be human.
Replies from: HungryHobo, Stuart_Armstrong↑ comment by HungryHobo · 2014-06-11T18:22:57.782Z · LW(p) · GW(p)
The test is a response to the Problem Of Other Minds.
Simply put, no other test will be accepted by people as showing that [insert something non-human here] is genuinely intelligent.
The reasoning goes: strictly speaking, the problem of other minds applies to other humans as well, but we politely assume that the humans we're talking to are genuinely intelligent, or at least conscious, on little more than the basis that we're talking to them and they're talking back like conscious human beings.
The longer and more involved the test, the harder it is to use tricks to fake genuine intelligence.
↑ comment by Stuart_Armstrong · 2014-06-10T15:35:43.890Z · LW(p) · GW(p)
Is the Turing Test really all that useful or important?
It did seem like a useful tool for measuring (some types of) intelligence. Since it doesn't work, it would be useful to have a substitute...
Replies from: RobinZ
comment by wobster109 · 2014-06-10T17:36:16.144Z · LW(p) · GW(p)
It's a little funny that in our quest for a believably human conversation bot, we've ended up with conversations that are very much unlike human ones.
In no conversation would I meet someone and say, "oh hey, how many legs on a millipede?" They'd say to me "haha that's funny, so are you from around here?" and I'd reply with "how many legs on an ant in Chernobyl?" And if they said to me, "sit here with your arms folded for 4 minutes then repeat this sentence back to me," I wouldn't do it. I'd say "why?" and fail right there.
Replies from: RobinZ, Jiro, bbleeker↑ comment by RobinZ · 2014-06-10T18:18:02.560Z · LW(p) · GW(p)
Hmm ... that, plus shminux's xkcd link, gives me an idea for a test protocol: instead of having the judges interrogate subjects, the judges give each pair of subjects a discussion topic a la Omegle's "spy" mode:
Spy mode gives you and a stranger a random question to discuss. The question is submitted by a third stranger who can watch the conversation, but can't join in.
...and the subjects have a set period of time they are permitted to talk about it. At the end of that time, the judge rates the interestingness of each subject's contribution, and each subject rates their partner. The ratings of confirmed-human subjects would be a basis for evaluating the judges, I presume (although you would probably want a trusted panel of experts to confirm this by inspection of live results), and any subjects who get high ratings out of the unconfirmed pool would be selected for further consideration.
↑ comment by Jiro · 2014-06-16T22:19:23.981Z · LW(p) · GW(p)
For the same reason that the test shouldn't try to simulate a human with poor English skills, it also shouldn't try to simulate a human who isn't willing to cooperate with a questioner. A random human off the street wouldn't answer the millipede question, but a random human recruited to take part in an experiment and told to answer reasonable questions probably would.
↑ comment by Sabiola (bbleeker) · 2014-06-11T09:44:39.814Z · LW(p) · GW(p)
comment by RobinZ · 2014-06-10T14:29:17.342Z · LW(p) · GW(p)
Similar to your lazy suggestion, challenging the subject to a novel (probably abstract-strategy) game seems like a possibly-fruitful approach.
On a similar note: Zendo-variations. I played a bit on a webcomic forum using natural numbers as koans, for example; this would be easy to execute over a chat interface, and a good test of both recall and problem-solving.
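A minimal sketch of what such a number-koan game could look like over a chat interface - the secret rule below is my own arbitrary example, not one from the forum games mentioned:

```python
import random

def secret_rule(n):
    # Arbitrary example rule, known only to the judge:
    # "the koan has the Buddha-nature" iff n is divisible by 3.
    return n % 3 == 0

# The judge seeds the game with a few labelled koans...
koans = {n: secret_rule(n) for n in (3, 7, 12)}

# ...the subject then proposes new koans to be labelled, and must
# eventually state the rule in plain language to win.
for guess in random.sample(range(1, 100), 5):
    koans[guess] = secret_rule(guess)
    print(f"koan {guess}: {'has' if koans[guess] else 'lacks'} the Buddha-nature")
```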
Replies from: cousin_it, Punoxysm↑ comment by Punoxysm · 2014-06-13T17:18:41.909Z · LW(p) · GW(p)
Nope; general game-playing is a well-studied area of AI. The AIs aren't great at it, but if you aren't playing them for a long time, they can certainly pass as a bad human. Zendo-like "analogy-finding" has also been studied.
By only demanding very structured action types, instead of a more free-flowing, natural-language based interaction, you are handicapping yourself as a judge immensely.
comment by Jan_Rzymkowski · 2014-06-10T13:27:22.078Z · LW(p) · GW(p)
Ad 4. "Elite judges" is quite arbitrary. I'd rather iterate the test, each time keeping only those judges who recognized the program correctly, or some variant of that (e.g. the top 50% with the most correct guesses). This way we select those who go beyond simply carrying on a conversation and actually look for differences between program and human. (And as seen from the transcripts, most people just try to have a conversation rather than looking for flaws.) The drawback is that, if the program has a set personality, judges could just stick to identifying that personality rather than human characteristics.
Another approach: the same program-human pair is examined by 10 judges consecutively, each spending 5 minutes with both. The twist is that judges can leave instructions for the next judges. So if the program fails to perform "If you want to prove you're human, simply do nothing for 4 minutes, then re-type this sentence I've just written here, skipping one word out of 2", then every judge after the one who found that flaw can use it and make the right guess.
My favourite method would be to give the bot a simple physics textbook and then ask it to solve a few physics test problems. Even if it weren't actual AI, it would still prove helluva powerful. Just toss it summarized knowledge on quantum physics and ask it to solve for a GUT. Sadly, most humans wouldn't pass such a high-school physics test.
1. is actually the original Turing Test.
EDIT: 6. is bad. It would exclude equally many actual AIs and blind people as well. It is actually a more general problem with the Turing Test: it helps test programs that mimic humans, but not AI in general. For a text-based AI, senses are alien. You could develop a real intelligence which would fail when asked "How do you like the smell of glass?". Sure, it can be taught that glass doesn't smell, but that actually requires superhuman abilities. So while a superintelligence can perfectly mimic a human, a human-level AI wouldn't pass the Turing Test when asked about sensory stuff, just as humans would fail when asked about the nuances of geometry in four dimensions.
↑ comment by Stuart_Armstrong · 2014-06-10T15:27:19.159Z · LW(p) · GW(p)
EDIT: 6. is bad. It would exclude equally many actual AIs and blind people as well.
So we wouldn't use blind people in the human control group. We'd want to get rid of any disabilities that the AI could use as an excuse (like the whole 13-year-old foreign boy persona).
As for excluding AIs... the Turing test was conceived as a sufficient, but not necessary, measure of intelligence. If an AI passes, then it's intelligent - not the converse (which is much harder).
↑ comment by NancyLebovitz · 2014-06-10T17:39:38.433Z · LW(p) · GW(p)
Physics problems are an interesting test-- you could check for typical human mistakes.
You could throw in an unsolvable problem and see whether you get plausibly human reactions.
Replies from: Jan_Rzymkowski↑ comment by Jan_Rzymkowski · 2014-06-10T19:53:25.515Z · LW(p) · GW(p)
Stuart, it's not about control groups, but that such a test would actually test negative for blind people, who are intelligent. A blind AI would also test negative, so how is that useful?
Actually, the physics test is not about getting closer to humans, but about creating something useful. If we can teach a program to do physics, we can teach it to do other stuff. And we'd be getting somewhere between narrow and real AI.
↑ comment by RobinZ · 2014-06-10T13:41:34.719Z · LW(p) · GW(p)
Speaking of the original Turing Test, the Wikipedia page has an interesting discussion of the tests proposed in Turing's original paper. One of the possible readings of that paper includes another possible variation on the test: play Turing's male-female imitation game, but with the female player replaced by a computer. (If this were the proposed test, I believe many human players would want a bit of advance notice to research makeup techniques, of course.) (Also, I'd want to have 'all' four conditions represented: male & female human players, male human & computer, computer & female human, and computer & computer.)
comment by DanArmak · 2014-06-10T19:19:38.198Z · LW(p) · GW(p)
But anyway, the main goal now, as suggested by Toby Ord and others, is to design a better Turing test: something that gives AI designers a target to aim at, and that would be a meaningful test of abilities
We want a test to tell us when an AI is intelligent and powerful. But we'll know a powerful AI even without a test, because we'll see it using its power and achieving things that really matter, and not just things that matter when it's an AI that does them.
I fear a new test would be a red herring. The Turing test has inspired people to come up with narrow AIs that are useless for anything other than passing the test. A new test might repeat the same story. Or it might turn out to be too hard and only be achieved long after many other AI capabilities that would greatly change the world.
Either one would be a poor target for AI designers to aim at. It would be better for them to aim at real world problems for the AI to solve.
Replies from: Nornagest↑ comment by Nornagest · 2014-06-10T20:22:15.812Z · LW(p) · GW(p)
We want a test to tell us when an AI is intelligent and powerful. But we'll know a powerful AI even without a test, because we'll see it using its power and achieving things that really matter [...] narrow AIs [can be made that are] useless for anything other than passing the test. A new test might repeat the same story. Or it might turn out to be too hard and only be achieved long after many other AI capabilities that would greatly change the world.
I think we'll see (arguably have already seen) AI changing the world before we see a general AI passing the Turing test. But I don't think that makes the Turing test useless, or a red herring.
Narrow AI is plenty powerful. It drives cars, flies military drones, runs short-term trading systems, and plays chess - and does (or will shortly do) all of these better than the best humans in their domains. Right now that hasn't dramatically changed the world, but I don't think it's too much of a stretch to imagine a world that has been transformed by narrow AI applications.
But there are still things the Turing test or a successor would be useful for. For one thing, as AI techniques advance, I expect the line between narrow and general AI to blur. I can't rule out purpose-built AGI before this becomes significant, but if that doesn't make the problem completely irrelevant, then the Turing test serves as a pretty good marker of generalizability: if your trading system (that scrapes Reuters for headlines and does some sophisticated NLP and concept-mapping stuff that you're pretty proud of) starts asking you hilariously bizarre questions about business ethics, you're probably well on your way to dealing with something that can no longer be described as narrow AI. If it starts asking you good questions about business ethics... well, you're probably very lucky.
Less significantly from an AGI perspective, but still interestingly, there's a bunch of semi-narrow AI applications that focus tightly on interaction with humans. Siri, Google Now, and Cortana are probably the most salient examples right now, along with all those godawful customer-service phone systems; we could also imagine things like automated hotel concierges or caretakers for the elderly. The Turing test is an excellent benchmark for their performance; I no longer think we can take a pass as evidence of strong general intelligence, but humanlike responses are so useful in these roles that I still think it's a good thing to shoot for. A successor test in this role gives us a less gameable objective.
Replies from: DanArmak↑ comment by DanArmak · 2014-06-11T08:34:33.792Z · LW(p) · GW(p)
the Turing test serves as a pretty good marker of generalizability
That argues that any sufficiently general system could pass the Turing test. But maybe it's really impossible to pass the test without investing a lot of 'narrow' resources in that specific goal. Even if an AGI could self-modify to pass for human, it would not bother unless that were an instrumental goal (i.e. to trick humans), at which point it's probably too late for you from an FAI viewpoint.
We should be able to recognize a powerful, smart, general intelligence without requiring that it be good at pretending to be a completely different kind of powerful, smart, general intelligence that has a lot of social quirks and cues.
The Turing test is an excellent benchmark for their performance; I no longer think we can take a pass as evidence of strong general intelligence, but humanlike responses are so useful in these roles that I still think it's a good thing to shoot for.
Again, I don't think the Turing test is necessary in this example. Siri can fulfill every objective of its designers without being able to trick humans who really want to know if it's an AI or not. A robotic hotel concierge wants to make guests comfortable and serve their needs; there is no reason that should involve tricking them.
comment by NancyLebovitz · 2014-06-10T17:43:46.260Z · LW(p) · GW(p)
Random thought: Could a computer program pass for human while commenting at slatestarcodex?
Replies from: Nornagest, shminux, Viliam_Bur↑ comment by Shmi (shminux) · 2014-06-10T18:36:52.019Z · LW(p) · GW(p)
Certainly a few commenters there easily pass for computers.
↑ comment by Viliam_Bur · 2014-06-10T18:35:53.415Z · LW(p) · GW(p)
Why not? (1) You are not required to respond to other people's comments. (2) People generally don't suspect you of being a chatbot on a blog, so they will not test you explicitly.
So the chatbot could be designed to play it safe, and reply only in situations it believes it understands.
comment by tgb · 2014-06-10T14:38:56.566Z · LW(p) · GW(p)
So long as the bots are easy to distinguish from humans, it'll be easy for competitions to produce false positives: all it takes is for the judges to want to see the bot win, at least kind of. If you want a real challenge, you'd better reward the judges significantly for correctly distinguishing human from AI.
Replies from: Stuart_Armstrong↑ comment by Stuart_Armstrong · 2014-06-10T14:50:26.257Z · LW(p) · GW(p)
Point 3, "properly motivated".
Replies from: tgb
comment by Luke_A_Somers · 2014-06-10T14:12:25.984Z · LW(p) · GW(p)
"If you want to prove you're human, simply do nothing for 4 minutes, then re-type this sentence I've just written here, skipping one word out of 2".
If they screw it up somehow, they're human?
ETA: yes, not any old failure will do.
Replies from: Stuart_Armstrong, RobinZ↑ comment by Stuart_Armstrong · 2014-06-10T15:32:44.023Z · LW(p) · GW(p)
No. It's just that it's something a chatterbot is spectacularly ill-equipped to respond to, unless it's been specifically programmed for this sort of thing. It's a meta-instruction, using the properties of the test that are not derived from vocabulary parsing.
↑ comment by RobinZ · 2014-06-10T15:32:48.939Z · LW(p) · GW(p)
The manner in which they fail or succeed is relevant. When I ran Stuart_Armstrong's sentence on this Web version of ELIZA, for example, it failed by immediately replying:
Perhaps you would like to be human, simply do nothing for 4 minutes, then re-type this sentence you've just written here, skipping one word out of 2?
That said, I agree that passing the test is not much of a feat.
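That failure mode is easy to understand: ELIZA-style chatterbots work by keyword patterns plus pronoun "reflection", with no model of what the words mean. A toy sketch of the mechanism - my own illustration, not the ruleset of the actual program linked above:

```python
import re

# Word-by-word pronoun reflection, applied to whatever followed the matched keyword.
REFLECTIONS = {"i've": "you've", "i": "you", "my": "your", "you're": "I'm"}

def eliza_reply(text):
    # Toy rule: on "... you want to X", echo back "Perhaps you would like to X?"
    match = re.search(r"you want to (.*)", text, re.IGNORECASE)
    if match:
        rest = " ".join(REFLECTIONS.get(w.lower(), w) for w in match.group(1).split())
        return f"Perhaps you would like to {rest}?"
    return "Please go on."

print(eliza_reply("If you want to prove you're human, simply do nothing for 4 minutes, "
                  "then re-type this sentence I've just written here, "
                  "skipping one word out of 2"))
```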
comment by Shmi (shminux) · 2014-06-10T15:29:46.648Z · LW(p) · GW(p)
Here is a relevant xkcd classic.
comment by [deleted] · 2014-06-11T04:36:16.979Z · LW(p) · GW(p)
You could simply ask it: "Implement a plan to maximize the number of paperclips produced."
If the answer involves consuming all resources in the Universe, then we can assume it is an AI.
If the answer is reasonable and balanced, then it is either a person or a Friendly AI, in which case it doesn't matter.
Replies from: Stuart_Armstrong↑ comment by Stuart_Armstrong · 2014-06-11T12:15:58.387Z · LW(p) · GW(p)
Though we may have to consider the hideous but unlikely possibility... that it might... lie.
comment by RobinZ · 2014-06-10T13:29:12.760Z · LW(p) · GW(p)
[EDIT: Jan_Rzymkowski's complaint about 6 applies to a great extent to this as well - this approach tests aspects of intelligence which are human-specific more than not, and that's not really a desirable trait.]
Suggestion: ask questions which are easy to execute for persons with evolved physical-world intuitions, but hard[er] to calculate otherwise. For example:
Suppose I have a yardstick which was blank on one side and marked in inches on the other. First, I take an unopened 12-oz beverage can and lay it lengthwise on one end of the yardstick so that half the height of the can is touching the yardstick and half is not, and duct-tape it to the yardstick in that position. Second, I take a one-liter plastic water bottle, filled with water, and duct-tape it to the other end in a similar sort of position. If I lay a deck of playing cards in the middle of the open floor and place the yardstick so that the 18-inch mark is centered on top of the deck of cards, when I let go, what will happen?
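For contrast, here is roughly what the "calculate it" path looks like - a back-of-envelope moment check, with masses that are my own assumptions rather than part of the question:

```python
# Assumed masses (not given in the question): a full 12-oz can ~0.39 kg,
# a full one-liter plastic bottle ~1.03 kg.
can_kg, bottle_kg = 0.39, 1.03

# Both sit near the ends of a 36-inch yardstick balanced on the card deck at
# the 18-inch mark, so the lever arms are roughly equal (~16 inches each).
arm_in = 16
net = (bottle_kg - can_kg) * arm_in  # net moment, in kg*inches, toward the bottle

print(f"net moment ~ {net:.0f} kg*in toward the bottle end")
# So the bottle end swings down until it touches the floor; the deck of cards
# is well under an inch tall, so the tilt is slight and the can end lifts a little.
```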
(By the way, as a human being, I'm pretty sure that I would react to your lazy test with eloquent, discursive indignation while you sat back and watched. The fun of the game from the possibly-a-computer side of the table is watching the approaches people take to test your capabilities.)
Replies from: army1987, Stuart_Armstrong↑ comment by A1987dM (army1987) · 2014-06-10T16:11:36.861Z · LW(p) · GW(p)
Suggestion: ask questions which are easy to execute for persons with evolved physical-world intuitions, but hard[er] to calculate otherwise. For example:
Suppose I have a yardstick which was blank on one side and marked in inches on the other. First, I take an unopened 12-oz beverage can and lay it lengthwise on one end of the yardstick so that half the height of the can is touching the yardstick and half is not, and duct-tape it to the yardstick in that position. Second, I take a one-liter plastic water bottle, filled with water, and duct-tape it to the other end in a similar sort of position. If I lay a deck of playing cards in the middle of the open floor and place the yardstick so that the 18-inch mark is centered on top of the deck of cards, when I let go, what will happen?
Familiarity with imperial units is hardly something I would call an evolved physical-world intuition...
Replies from: RobinZ↑ comment by RobinZ · 2014-06-10T17:57:04.324Z · LW(p) · GW(p)
Were I using that test case, I would be prepared with statements like "A fluid ounce is just under 30 cubic centimeters" and "A yardstick is three feet long, and each foot is twelve inches" if necessary. Likewise "A liter is slightly more than one quarter of a gallon".
But Stuart_Armstrong was right - it's much too complicated an example.
↑ comment by Stuart_Armstrong · 2014-06-10T15:30:59.045Z · LW(p) · GW(p)
Your test seems overly complicated; what about simple estimates? Like "how long would it take to fly from Paris, France, to Paris, USA" or similar? Add in some Fermi estimates, get them to show your work, etc...
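For instance, a judge might ask for something like this - the figures below are my own assumed round numbers, reading "Paris, USA" as Paris, Texas:

```python
# Fermi estimate: flying from Paris, France to Paris, Texas (my reading of "Paris, USA").
distance_km = 8_000   # rough great-circle distance, rounded (assumption)
cruise_kmh = 900      # typical long-haul airliner cruise speed
layover_h = 3         # no direct flight, so allow one connection

flight_h = distance_km / cruise_kmh + layover_h
print(f"roughly {flight_h:.0f} hours, give or take a few")
```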
By the way, as a human being, I'm pretty sure that I would react to your lazy test with eloquent, discursive indignation while you sat back and watched
If the human subject is properly motivated to want to appear human, they'd relax and follow the instructions. Indignation is another arena in which non-comprehending programs can hide their lack of comprehension.
Replies from: army1987, RobinZ↑ comment by A1987dM (army1987) · 2014-06-10T16:13:50.264Z · LW(p) · GW(p)
"how long would it take to fly from Paris, France, to Paris, USA"
Ahem...
Replies from: army1987↑ comment by A1987dM (army1987) · 2014-06-11T09:23:08.509Z · LW(p) · GW(p)
This is weird. Yesterday it worked fine, today (in the same browser on the same computer) it says “Wolfram|Alpha doesn't understand your query; Showing instead result for query: long”
Replies from: Stuart_Armstrong↑ comment by Stuart_Armstrong · 2014-06-11T11:11:03.253Z · LW(p) · GW(p)
Still a useful reminder that we can't take things for granted when being a judge in such tests.
↑ comment by RobinZ · 2014-06-10T15:43:42.535Z · LW(p) · GW(p)
Your test seems overly complicated; what about simple estimates? Like "how long would it take to fly from Paris, France, to Paris, USA" or similar? Add in some Fermi estimates, get them to show your work, etc...
That is much better - I wasn't thinking very carefully when I invented my question.
If the human subject is properly motivated to want to appear human, they'd relax and follow the instructions. Indignation is another arena in which non-comprehending programs can hide their lack of comprehension.
I realize this, but as someone who wants to appear human, I want to make it as difficult as possible for any kind of computer algorithm to simulate my abilities. My mental model of sub-sapient artificial intelligence is such that I believe many such might pass your test, and therefore - were I motivated properly - I would want to make it abundantly clear that I had done more than correctly parse the instructions "[(do nothing) for (4 minutes)] then {re-type [(this sentence I've just written here,) skipping (one word out of 2.)]}" That is a task that is not qualitatively different from the parsing tasks handled by the best text adventure game engines - games which are very far from intelligent AI.
I wouldn't merely sputter noisily at your failure to provide responses to my posts, I'd demonstrate language comprehension, context awareness, knowledge of natural-language processing, and argumentative skills that are not tested by your wait-four-minutes proposal, both because I believe that you will get better results if you bear these factors in mind and because - in light of the fact that I will get better results if you bear them in mind - I want you to correctly identify me as a human subject.