3 Levels of Rationality Verification

post by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-15T17:19:14.736Z · LW · GW · Legacy · 244 comments

I strongly suspect that there is a possible art of rationality (attaining the map that reflects the territory, choosing so as to direct reality into regions high in your preference ordering) which goes beyond the skills that are standard, and beyond what any single practitioner singly knows.  I have a sense that more is possible.

The degree to which a group of people can do anything useful about this will depend overwhelmingly on what methods we can devise to verify our many amazing good ideas.

I suggest stratifying verification methods into 3 levels of usefulness:

  • Reputational
  • Experimental
  • Organizational

If your martial arts master occasionally fights realistic duels (ideally, real duels) against the masters of other schools, and wins or at least doesn't lose too often, then you know that the master's reputation is grounded in reality; you know that your master is not a complete poseur.  The same would go if your school regularly competed against other schools.  You'd be keepin' it real.

Some martial arts fail to compete realistically enough, and their students go down in seconds against real streetfighters.  Other martial arts schools fail to compete at all—except based on charisma and good stories—and their masters decide they have chi powers.  In this latter class we can also place the splintered schools of psychoanalysis.

So even just the basic step of trying to ground reputations in some realistic trial other than charisma and good stories has tremendous positive effects on a whole field of endeavor.

But that doesn't yet get you a science.  A science requires that you be able to test 100 applications of method A against 100 applications of method B and run statistics on the results.  Experiments have to be replicable and replicated.  This requires standard measurements that can be run on students who've been taught using randomly-assigned alternative methods, not just realistic duels fought between masters using all of their accumulated techniques and strength.

The field of happiness studies was created, more or less, by realizing that asking people "On a scale of 1 to 10, how good do you feel right now?" was a measure that statistically validated well against other ideas for measuring happiness.  And this, despite all skepticism, looks like it's actually a pretty useful measure of some things, if you ask 100 people and average the results.

But suppose you wanted to put happier people in positions of power—pay happy people to train other people to be happier, or employ the happiest at a hedge fund?  Then you're going to need some test that's harder to game than just asking someone "How happy are you?"

This question of verification methods good enough to build organizations, is a huge problem at all levels of modern human society.  If you're going to use the SAT to control admissions to elite colleges, then can the SAT be defeated by studying just for the SAT in a way that ends up not correlating to other scholastic potential?  If you give colleges the power to grant degrees, then do they have an incentive not to fail people?  (I consider it drop-dead obvious that the task of verifying acquired skills and hence the power to grant degrees should be separated from the institutions that do the teaching, but let's not go into that.)  If a hedge fund posts 20% returns, are they really that much better than the indices, or are they selling puts that will blow up in a down market?

If you have a verification method that can be gamed, the whole field adapts to game it, and loses its purpose.  Colleges turn into tests of whether you can endure the classes.  High schools do nothing but teach to statewide tests.  Hedge funds sell puts to boost their returns.

On the other hand—we still manage to teach engineers, even though our organizational verification methods aren't perfect.  So what perfect or imperfect methods could you use for verifying rationality skills, that would be at least a little resistant to gaming?

(Added:  Measurements with high noise can still be used experimentally, if you randomly assign enough subjects to have an expectation of washing out the variance.  But for the organizational purpose of verifying particular individuals, you need low-noise measurements.)

So I now put to you the question—how do you verify rationality skills?  At any of the three levels?  Brainstorm, I beg you; even a difficult and expensive measurement can become a gold standard to verify other metrics.  Feel free to email me at sentience@pobox.com to suggest any measurements that are better off not being publicly known (though this is of course a major disadvantage of that method).  Stupid ideas can suggest good ideas, so if you can't come up with a good idea, come up with a stupid one.

Reputational, experimental, organizational:

Finding good solutions at each level determines what a whole field of study can be useful for—how much it can hope to accomplish.  This is one of the Big Important Foundational Questions, so—

Think!

(PS:  And ponder on your own before you look at the other comments; we need breadth of coverage here.)

244 comments

Comments sorted by top scores.

comment by swestrup · 2009-03-15T20:15:03.391Z · LW(p) · GW(p)

Well, you asked for DUMB ideas, so here's mine. It has the advantage that I'm sure no one else will suggest it. This is based on an accidental discovery (so far as I know, unpublished) that one can compare two arbitrary documents for similarity (even if they are in different word-processor formats) by running them both through a recognizer built out of a random state machine and comparing bit masks of all the states traversed. The more the documents have in common, the more states will be traversed in both.

So, let's assume we have a panel of highly rational individuals who are our control group. We generate a random multiple-choice questionnaire consisting of nonsensical questions and answers. Things like:

1) How Green is the Smell of Bacon?

a) 7.5

b) Neon

c) Introspection

d) Larger

You then do a correlation over how your panel of experts chose their answers and see if there is a common pattern. You then score students who take the test based on how similar to the common pattern they are.

Assuming this idea works at all, the advantage of this is that it would be extremely difficult to game. The disadvantage would be that it would penalize those who are significantly more rational than the 'norm'. It would probably also require the panel to be similar to each other in cognition. There is also the general problem of not knowing if you're really testing for what you think you're testing.
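
A minimal sketch of the document-similarity trick in Python (the state count, byte-level transitions, and Jaccard comparison are illustrative assumptions, not the original unpublished method; the same "distance from a reference pattern" step would then be applied to the panel's answer vectors rather than to raw text):

```python
# Illustrative sketch only: a random deterministic automaton over bytes.
# The set of visited states plays the role of the "bit mask of states traversed".
import random

N_STATES = 4096  # assumed size; the original discovery's parameters are unknown

def make_random_machine(seed=0):
    """Random transition table: machine[state][byte] -> next state."""
    rng = random.Random(seed)
    return [[rng.randrange(N_STATES) for _ in range(256)] for _ in range(N_STATES)]

def visited_states(machine, text):
    """Run the text through the machine, collecting every state traversed."""
    state, seen = 0, {0}
    for byte in text.encode("utf-8"):
        state = machine[state][byte]
        seen.add(state)
    return seen

def similarity(machine, doc_a, doc_b):
    """Overlap (Jaccard index) of the two visited-state sets."""
    a, b = visited_states(machine, doc_a), visited_states(machine, doc_b)
    return len(a & b) / len(a | b)

machine = make_random_machine()
print(similarity(machine, "How green is the smell of bacon?",
                 "How green is the smell of toast?"))
```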

Frankly, I don't know if I'd be more happy if this was tested and shown to be workable, or if it turned out to be a really stupid idea.

Replies from: Eliezer_Yudkowsky, MichaelVassar, thomblake, Cameron_Taylor
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-15T20:18:26.171Z · LW(p) · GW(p)

NOT CRAZY ENOUGH! We need EVEN STUPIDER ideas!

(Voted up for being the best try so far, though.)

comment by MichaelVassar · 2009-03-16T05:25:31.386Z · LW(p) · GW(p)

I think that this resembles the MMPI methodology. http://en.wikipedia.org/wiki/Minnesota_Multiphasic_Personality_Inventory

Replies from: Cameron_Taylor, Cameron_Taylor
comment by Cameron_Taylor · 2009-03-18T04:54:45.386Z · LW(p) · GW(p)

What is the MMPI supposed to test?

comment by Cameron_Taylor · 2009-03-18T04:16:49.976Z · LW(p) · GW(p)

There are similarities.

What I observed when doing an MMPI was that it seemed altogether gameable. I believe I have more than enough knowledge about psychology, including the type of metrics that MMPI uses, to more or less choose whatever result I desired.

comment by thomblake · 2009-03-16T20:05:19.423Z · LW(p) · GW(p)

I've actually proposed something like this to test for personality type. The main reason it never got implemented is there isn't really a good, workable theory of persistent personality.

comment by Cameron_Taylor · 2009-03-18T04:14:14.011Z · LW(p) · GW(p)

That scares me!

It sounds altogether too much like the famous beauty pageant, with a bit of "guess the teacher's answer" and randomly generated poetry thrown in for good measure.

Frankly, I don't know if I'd be more happy if this was tested and shown to be workable, or if it turned out to be a really stupid idea.

I know I'd be far happier if it was shown to be a really stupid idea. I have a hunch, however, that a correlation of the kind you hypothesize would exist. The part that scares me is that there could well be more than one style of thinking of equal merit, with one being far more common than the other. Naturally the suspicion that I'd end up in the minority and downgraded for it is troublesome. There is more than enough of that sort of bias in schools already!

Upvoted for being the right kind of idea, and incidentally my answer to the example question is a) 7.5. The other three make absolutely no sense, while I acknowledge that there is a possibility (though it is improbable) that the way the brain functions could make a quantisation of said greenness at least have some meaning.

Replies from: swestrup
comment by swestrup · 2009-03-21T16:41:23.515Z · LW(p) · GW(p)

When I look at my question there, the only answer that seems appropriate is 'Introspection' as that's at least a step towards an answer.

comment by talisman · 2009-03-15T22:33:37.089Z · LW(p) · GW(p)

Occasionally, well-respected community members could say things that are intentionally false, but persuasive and subtle, a la http://www.overcomingbias.com/2008/02/my-favorite-lia.html.

You get points for catching these mistakes. Perhaps you submit your busts privately to some arbiter so others have the same challenge.

Later, the error is revealed and discussed.

This would also have the benefit of causing everyone to read the most-respected members' writings ultra-critically, rather than sitting back and being spoon-fed.

One key thing this idea has is short term feedback. Frequent, rapid feedback is essential for getting good at this kind of thing. (IMO that's why economics is still so useless relative to the other sciences: the experiments take fifty years to run.)

Replies from: Jiro, MBlume
comment by Jiro · 2014-04-08T05:02:20.378Z · LW(p) · GW(p)

This doesn't work, because people here say controversial things. By definition, controversial means that many people think such statements are wrong, while the people making them do not. Anyone who finds a mistake might have found one of the intentional mistakes, or might simply disagree on a controversial issue and believe the community member made a mistake where the community member thinks otherwise.

Unless you think that community members are perfectly correct 100% of the time on controversial issues, or at least always recognize their own mistakes when pointed out to them (and no human being is like that), the idea will become unworkable. Everyone will have to think "is this an intentional mistake, or is it an unintentional mistake that the community member won't recognize as such, earning me demerits for pointing it out?"

Replies from: None
comment by [deleted] · 2016-05-27T00:41:33.589Z · LW(p) · GW(p)

There are objective ways of finding out some classes of mistakes. Fallacies are well-defined and most of them can be easily diagnosed. I often do this at Facebook to blow off steam.

Even better: the website can accommodate this. It's as easy as adding a "report logical fallacy" button next to each comment. Moderators can award points to all who noticed the correct fallacy. A leaderboard can be put up. It can be made a sport.

Another benefit is that those who make mistakes receive detailed feedback.

Edit: I'd like to learn why this was downvoted. How might I be wrong?

Replies from: DPiepgrass
comment by DPiepgrass · 2020-06-26T18:49:04.412Z · LW(p) · GW(p)

Nothing makes me want to upvote someone like a downvote-without-comment on a post that seems vaguely reasonable.

comment by MBlume · 2009-03-15T22:46:11.708Z · LW(p) · GW(p)

I can see the need for anonymity to avoid spoilers, but I think doing the thing publicly has benefits too -- that way there's the risk on the other side of having publicly denounced the Great Teacher when he was speaking truthfully.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-15T23:19:30.078Z · LW(p) · GW(p)

You could have private points subtracted off and that gives you the same incentive not to make uncertain accusations. Attach confidence levels and take Bayes-score.

Replies from: JGWeissman
comment by JGWeissman · 2009-04-01T05:21:06.980Z · LW(p) · GW(p)

With the Bayes-score always being negative, I don't see what incentive one would have to submit a mistake report. I think it would be better to test for better than, for example, 90% confidence, by awarding 1 point for a correct report and deducting 9 points for an incorrect report. This achieves the goal of detecting the ability to detect bad arguments. Measuring calibration would have to be a separate test.
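
A small illustrative calculation (the numbers are only an example): the log "Bayes" score of a report can never be positive, while the proposed +1/-9 rule has positive expected value only above 90% confidence.

```python
# Illustrative numbers only.
import math

def bayes_score(p_assigned_to_what_happened):
    """Log score: zero for certainty in the right answer, negative otherwise."""
    return math.log(p_assigned_to_what_happened)

def threshold_score(correct, reward=1, penalty=9):
    """+1 for a correct mistake-report, -9 for an incorrect one.
    Break-even confidence is penalty / (reward + penalty) = 0.9."""
    return reward if correct else -penalty

for p in (0.80, 0.90, 0.95):
    expected = p * threshold_score(True) + (1 - p) * threshold_score(False)
    print(f"confidence {p:.2f}: expected +1/-9 score {expected:+.2f}, "
          f"log score if right {bayes_score(p):+.3f}")
```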

Replies from: jyasskin
comment by jyasskin · 2010-12-29T17:41:46.966Z · LW(p) · GW(p)

Treat not submitting a mistake report as the "I have no idea" claim: that you've assigned a probability of "mistakes/total emails" to this particular email being a mistake.

comment by CarlShulman · 2009-03-15T18:13:44.983Z · LW(p) · GW(p)

For 'hot' political and religious biases, create materials in which apparent advocates of different ideologies or parties are arguing for some particular empirical prediction, e.g. about the relationship between different tax rate changes and economic growth, with some predictions being right and some wrong. The subject then needs to make his or her own prediction about some easily-verifiable but obscure empirical fact related to the argument, e.g. whether a graph of GDP and tax rates matches Norway or Iceland.

Scoring would reflect the degree to which the ideological affiliation in the prompt biased the results. If it were being gamed, you might need to add in scoring for accuracy. Challenges would be producing a large enough inventory of test items, keeping them secret, and tailoring tests to locally popular ideologies or ideologies of interest.

More surveys that study the relationship between knowledge about verifiable facts and values. What sorts of information do those with different values tend to have, and what are the values of those whose knowledge covers the pet facts of all camps? There is a fair amount of this literature in political science aimed at the electorate and its political knowledge, but it would be good to extend it to other topics, e.g. scientific ones.

Announced probability distributions (not just predictions, so as to enable better scoring) for the results of upcoming experiments. For instance, we know that in the next 2-3 years we are going to get a huge amount of genomic data that will answer a lot of questions about the genetic architecture of human diseases. Making public quantitative predictions about things like that could be quite informative.

Replies from: Roko, Cameron_Taylor
comment by Roko · 2009-03-15T22:08:16.799Z · LW(p) · GW(p)

Hot political/religious issues seem like a great way to tempt people into saying/believing irrational things. This is a good idea.

comment by Cameron_Taylor · 2009-03-18T05:01:42.786Z · LW(p) · GW(p)

Very solid example of how to test for that bias.

comment by MichaelHoward · 2009-03-15T18:33:37.195Z · LW(p) · GW(p)

People tend to compartmentalize. We need to bear in mind that anything we come up with that involves testing someone when they know they're being tested can only check how rational they can be if they put their mind to it, not how rational they are when they're not being tested.

Replies from: Roko, swestrup, Cameron_Taylor
comment by Roko · 2009-03-15T22:09:45.852Z · LW(p) · GW(p)

It is possible to test people for one thing, and claim that you are testing them for another thing. E.g. Asch's experiments wouldn't have worked if he had told people the truth about what he was testing for. As long as the person doesn't know they're being tested for rationality, it should be OK. You could test people for ability to make money, ability to get some task done, etc.

http://www.overcomingbias.com/2007/12/aschs-conformit.html

comment by swestrup · 2009-03-15T21:14:37.755Z · LW(p) · GW(p)

I agree. The only solutions to this that I can see are either to not let students know when they are being tested, or to have a system of continual testing.

Replies from: Matt_Simpson
comment by Matt_Simpson · 2009-03-15T22:59:28.814Z · LW(p) · GW(p)

The key is probably to test someone without letting them know you are testing them. If I ran a martial arts dojo and wanted to make sure my students were really super badass ninjas, I would give them a convincing looking "test" that included things you would expect to see: strength, speed, form, technique, success in actual matches, etc.

This would have very little weighting in the actual grade, however. The real test would be some sort of surprise fight or fights where the student has no idea that the fight is actually one of the tests. Perhaps he (or she) is followed by the assailant until an opportunity to pick a fight arises.

The main advantage of the surprise test is that it is much harder to game. Imperfect metrics are much more likely to say something meaningful about the student in this surprise situation than if the student knows the test is coming.

When it comes to the rationality dojo, there are numerous normally easy-to-game heuristics that could be used, for example:

  • how susceptible the student is to group-think
  • what they do in some sort of strenuous situation (e.g., do they blow up the Huygens?) The situation must seem real to them.
  • are they willing to bet their beliefs even when no one important will notice?
  • What others can you guys think of?

I doubt that it would be practical to analyze all of the information and get a single number as a measure of the student's rationality. At the top of all of these tests would have to be someone whose judgment on matters of rationality can be trusted. This may be the most difficult part.

Also note that this form of testing would probably be expensive.

Replies from: Cameron_Taylor
comment by Cameron_Taylor · 2009-03-18T04:59:23.717Z · LW(p) · GW(p)

See Artemis Fowl and the Butler training.

comment by Cameron_Taylor · 2009-03-18T04:58:45.271Z · LW(p) · GW(p)

An insurmountable problem?

comment by Perplexed · 2011-02-21T17:43:16.600Z · LW(p) · GW(p)

I think that the most important skill a rationalist can have is the ability to assess the quality of other rationalists, and to participate effectively in team projects. A measurement of individual rationality has to include how well a randomly selected team including that individual performs on team rationality tests.

So, I think that a rationalist 'decathlon' would consist of a variety of competitions between individuals and small teams including math/logic problems, general knowledge tests, cooperative and non-cooperative game theory games, prediction markets, and engineering challenges (egg drops, programming robots to compete in some arena, etc.)

But then there would be a second level, in which individuals and teams would compete in a prediction market in which they observe (by video recording) the deliberations of other teams on first-level problems and bet on their relative performance.

And even a third level, in which individuals observe the deliberations of second-level teams and bet on their performance in that second-level prediction market.

There are a variety of other things that might be interesting to measure - for example, what team sizes perform best, whether individual rationalism and team-participant rationalism are different skills, and whether team performance is best predicted by strongest member, average member, or weakest member.

Replies from: TheOtherDave
comment by TheOtherDave · 2011-02-21T17:47:48.438Z · LW(p) · GW(p)

This is a brilliant idea.

comment by Psy-Kosh · 2009-03-15T19:31:21.465Z · LW(p) · GW(p)

Hrm... Well, one initial notion I have is along the lines of this: Rationality training should improve how good one can become at other stuff, or at least improve ability to gain skills/etc in other fields.

So, maybe tests could be something along the lines of find various subjects/fields a student is unfamiliar with and basically assign them to "get some knowledge and skill in this field."

How efficiently students can basically bootstrap up into something they're unfamiliar with should vary with their rationality, right? So something like this may be a starting point.

(Yes, I can see a bunch of details that would need to be worked out, but seems to be that this notion may at least be somewhere to start for developing rationality tests.)

Replies from: MichaelVassar, Cameron_Taylor
comment by MichaelVassar · 2009-03-16T05:31:22.402Z · LW(p) · GW(p)

I think Tim Ferris was going to display this ability as the theme of a TV show.

comment by Cameron_Taylor · 2009-03-18T04:59:48.059Z · LW(p) · GW(p)

This biases towards fast learners. A different problem.

comment by Emile · 2009-03-15T22:32:39.858Z · LW(p) · GW(p)

Organize large games/contests where a lot of candidates are locked up in an area, and have a finite time to reach a certain point / find a certain object.

The exact rules would be specially designed each time for that year's challenge, by a group of rationalists and game designers. So the details would vary, but some common themes would be:

  • physical prowess does not come into play (beyond maybe moving around faster, not getting tired as easily etc.)
  • some people would be liars / saboteurs, and not real candidates

For example, the candidates are blindfolded and brought into a large underground circular room, whose only unlocked exits are twenty slides along on the edge (so, one-way exit only). The goal is to take the exit that's due north.

Or, the players are dropped in a maze, and each player is given twenty balls with his name written on them. In the maze are tall glass tubes in which the player can drop their balls. The players know that at the end of the game everyone gets points for the balls with his name that are in "good" tubes (from 10 to 1 points, depending on whether his ball is at the bottom or top - only ten balls fit in a tube), and loses points for balls in "bad" tubes (whatever its position). There are also neutral tubes. On the tubes are various signs and portents, and on the walls are statements about the meanings of the signs ("about 10% of good tubes have red triangles", "two squares of the same color cancel out", "a blue triangle means that there's a bad tube close to this one"). The players have 30 minutes to place their balls.

Additional twists:

  • there are in fact several simultaneous games taking place, in the same place, but the rules are such that it's very difficult to tell who's part of which game (for example, if some players' goal is to unmask/identify other players)
  • the goal may not be reachable at all (no candidates accepted this year). The "global" rules of the contest might include that there must be a certain probability each year (10% ?) that the contest is impossible.
  • candidates are not alone but in teams

... well, there is plenty of inspiration to take from board games and TV shows. And many factors of those can be controlled by careful design (importance of luck or of trivia knowledge, how much "herd behaviour" can come into play, etc.). The games should be more complicated than what's said above, and contain many red herrings. The designers should try to introduce as many sources of bias and irrationality as possible.

Replies from: Nebu, charles-paul, Cameron_Taylor
comment by Nebu · 2009-03-16T17:31:37.147Z · LW(p) · GW(p)

Voted up if only because this reads like a description for the first reality TV show I would actually want to watch.

Replies from: MichaelHoward
comment by MichaelHoward · 2009-03-16T22:43:52.852Z · LW(p) · GW(p)

Here you go :) (and here's the kids' version)

comment by Charles Paul (charles-paul) · 2021-07-01T17:29:37.166Z · LW(p) · GW(p)

Love this idea, here is another game:

Two teams, red and blue. The blue team plays as computer scientists who are trying to build an AI to help them do something about an asteroid heading towards Earth (or some other existential threat that would justify building an AGI without knowing whether it's friendly), but they build it so fast they have no idea if it's friendly. They win if they save humanity.

The red team plays as the AI, and gets a point for each paperclip in its future light cone.

You would have to have rules like: the AI is contained in a box, the AI must execute all orders given to it by the blue team, etc.

comment by Cameron_Taylor · 2009-03-18T04:57:42.560Z · LW(p) · GW(p)

Fascinating concept.

comment by lessdazed · 2011-03-04T16:39:10.476Z · LW(p) · GW(p)

I'm not sure why "teaching to the test" is so disparaged for its effects on the learning process. Obviously that is a different use for tests than evaluation of ability, as is the main goal here.

Studying for the LSAT taught me to feel genuine physical unease when I read a bad argument, then to calm it by the next problem. It's very hard to turn that off when reading the newspaper.

The third stage of my growth as a rationalist was discovering this site. I no longer go through the day thinking of things I read and hear: "Wrong (fallacy), wrong (incorrect premise), wrong (fallacy), true (but irrelevant)." Now it's more like: "Wrong (fallacy), not even wrong (internally inconsistent), wrong (map/territory confusion), wrong (fallacy), not even wrong (argument from definition)."

I propose thinking of ways to hijack the human mental machinery as an alternative to overcoming it, akin to what evolution does.

comment by Johnicholas · 2009-03-15T23:26:47.479Z · LW(p) · GW(p)

Frank Mager, in various books, including "Preparing Instructional Objectives", suggests working backward from evidence that would make you conclude that someone is, e.g. a Bayesian Master Rationalist, to the tests (and instructional objectives) for a course of instruction intended to turn someone into a Bayesian Master Rationalist (or whatever you want to turn them into).

Replies from: pjeby, Eliezer_Yudkowsky
comment by pjeby · 2009-03-20T18:11:35.852Z · LW(p) · GW(p)

After skimming some of his stuff on Amazon, I bought the whole "Mager Six-Pack" and am eagerly devouring it. I can already tell it's going to make a huge difference in the way I teach mind-hacking.

One of the first ones I read, Goal Analysis, is particularly relevant to LW discussions: how to turn "fuzzies" (abstract qualities, adjectives, and adverbs) into concrete, measurable specifications of behavior. One minor catch: goal analysis can't make people magically agree on the True Meaning of a term, it can only expose the things they do or don't agree on...

...which probably makes it an incredibly valuable Rationality Tool in its own right.

Anyway, thanks for mentioning Mager's books -- I'd never heard of them before your comment.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-15T23:41:19.137Z · LW(p) · GW(p)

Example?

Replies from: Johnicholas
comment by Johnicholas · 2009-03-16T11:09:40.994Z · LW(p) · GW(p)

Telephone operators were supposed to have good "tone of service". So then the education people asked "What does good tone of service mean? What evidence would help you conclude whether an operator has good tone of service?"

And drilling down, they found that there was an entire list of behaviors implicit in the phrase "tone of service", like inflection as the operator reads the standardized phrases, such as "I'm sorry". One of the behaviors amused me - no banging - that is, hitting the telephone handset against something, presumably in anger at a frustrating customer.

So you can test for "good tone of service" by testing the observable behaviors.

If your concept of a Master Rationalist includes an "aura of competence", then probably we can break that down into concrete evidence that would cause you to conclude that someone has an "aura of competence". The concrete items become instructional objectives. If evidence that someone failed a bias or calibration test would cause you to conclude that they're NOT a Master Rationalist, then passing the bias or calibration test can be one of the instructional objectives.

Replies from: MichaelHoward
comment by MichaelHoward · 2009-03-16T20:51:08.031Z · LW(p) · GW(p)

Bearing in mind the human tendency to favor authority over quality given a choice between the two, I think it's important when testing to distinguish between "aura of competence" and ability to achieve useful results, and after testing to connect the former to the latter.

Replies from: Johnicholas
comment by Johnicholas · 2009-03-17T16:35:34.627Z · LW(p) · GW(p)

Right. EY has mentioned a couple of times that he expects graduates of the hypothetical Rationality Dojo to exude their abilities, like Taking a Level in Badass, or his hedge-fund elites.

I want to clarify that I do not agree with this notion, and I suspect that individuals who exude preternatural skills are primarily good at exuding, not at performing. The example was just an example.

comment by rwallace · 2009-03-15T19:50:27.058Z · LW(p) · GW(p)

Compile a large enough database of historical events that nobody could memorize more than a fraction of it. For the test, choose a few events at random, describe the initial conditions and ask the candidate to predict the outcomes.

Replies from: conjectures, Cameron_Taylor
comment by pcm50 (conjectures) · 2019-03-14T09:17:05.292Z · LW(p) · GW(p)

This is a good idea.

Though I think that the condition that 'nobody could memorize more than a fraction of it' is actually quite hard to meet. E.g. legal training seems analogous, and lawyers seem to be able to remember a lot of examples.

If the corpus could be kept secret or ever changing that might help.

When I was thinking of something similar, I had a concern about the task length. E.g. will this result only in relatively short or simple tasks?

comment by Cameron_Taylor · 2009-03-18T05:03:46.077Z · LW(p) · GW(p)

That would actually work.

comment by steven0461 · 2009-03-17T14:54:27.955Z · LW(p) · GW(p)

Carry around a notepad, form probabilistic opinions on lots of little questions that you can find out the answer to soon after, record all the probabilities assigned to correct answers, where applicable add tags like "politics", "project completion", "my social status", "trivia", put into a spreadsheet or something and see if you're miscalibrated globally and for different tags.
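
A minimal sketch of the bookkeeping in Python (the record format and the sample entries are illustrative assumptions, not real data):

```python
# Each record: (stated probability, tags, did it come true?). All entries invented.
from collections import defaultdict

records = [
    (0.9, ["trivia"], True),
    (0.7, ["project completion"], False),
    (0.6, ["politics"], True),
    (0.8, ["trivia", "my social status"], False),
]

def calibration_report(records):
    """Average stated confidence vs. actual hit rate, overall and per tag."""
    buckets = defaultdict(list)
    for prob, tags, came_true in records:
        for tag in tags + ["(all)"]:
            buckets[tag].append((prob, came_true))
    for tag, entries in sorted(buckets.items()):
        stated = sum(p for p, _ in entries) / len(entries)
        actual = sum(1 for _, c in entries if c) / len(entries)
        print(f"{tag}: stated {stated:.2f}, actual {actual:.2f}, "
              f"gap {stated - actual:+.2f}")  # positive gap suggests overconfidence

calibration_report(records)
```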

Replies from: Fhyve, Cameron_Taylor
comment by Fhyve · 2010-12-18T08:41:56.187Z · LW(p) · GW(p)

This can get gamed pretty easily though, by selecting things that you have more previous knowledge of, or know the actual probabilities of, over things that you know are more likely to be wrong.

Except that that could be exactly the point: the ability to identify what you know you are likely to assign accurate probabilities for, and to identify when you aren't as likely. However, there still is the problem of just not reporting certain things to boost your scores. There could be something that takes into account or measures the ability to identify when you are likely to be wrong.

Replies from: ejstheman, datadataeverywhere
comment by ejstheman · 2010-12-18T09:12:17.375Z · LW(p) · GW(p)

If you break the habit of claiming confidence you don't really have, to improve your score, then it seems the exercise has had the intended effect, no?

comment by datadataeverywhere · 2010-12-18T09:35:12.874Z · LW(p) · GW(p)

Or: guess confidence intervals. 95% might not be as useful as 50%; test yourself not only on how often you are under or over, but make sure that 50% (or 5%) of the time it is outside the range you guessed.

If you try to guess things that you're really sure about, this forces you to quantify how sure you are about that, and makes those guesses no more or less useful than those that you are much less sure about.

comment by Cameron_Taylor · 2009-03-18T04:55:19.128Z · LW(p) · GW(p)

How do I tell?

comment by Johnicholas · 2009-03-15T21:33:38.764Z · LW(p) · GW(p)

Here's a stupid idea: Evaluate people by auditing their domiciles. I've read (and from personal experience, I believe it) that you get really solid insight into someone's personal qualities by inspecting their home, as good as interviewing them and all of their friends and family. (I googled a bit, but I can't find the source.)

Anyway, it can probably be gamed.

Replies from: None, David_Gerard, Cameron_Taylor
comment by [deleted] · 2009-03-15T22:30:09.829Z · LW(p) · GW(p)

deleted

comment by David_Gerard · 2011-02-21T16:03:37.618Z · LW(p) · GW(p)

Heh. I have recently applied this to our house, which is remarkably better after just a few months, and visitors remark upon it. Doing so is the origin of this rant, which is made of hard-won anecdotal experience.

comment by Cameron_Taylor · 2009-03-18T05:03:18.200Z · LW(p) · GW(p)

That's a test women do. I game it.

comment by MBlume · 2009-03-15T19:49:13.178Z · LW(p) · GW(p)

Here's an immoral one: crack a rationalist

Most, if not all, human minds are vulnerable to hacking, eg by cults, religions, pseudoscience, etc. The minds of rationalists should be harder to hack than others.

Make a copy of a (would-be) rationalist, subject the copy to significant emotional stress, and then send missionaries his way.

The myths carried by the missionaries should be invented for the challenge so everyone can agree that they are false, but should, of course, be significantly more plausible than today's religions.

Replies from: MichaelHoward, JGWeissman, Cameron_Taylor, Roko
comment by MichaelHoward · 2009-03-15T19:57:24.622Z · LW(p) · GW(p)

Make a copy of a (would-be) rationalist, subject the copy to significant emotional stress, and then send missionaries his way.

Moral qualms aside, we should probably have a back-up plan just in case we don't solve human uploading before we want to start testing.

comment by JGWeissman · 2009-04-01T06:03:38.286Z · LW(p) · GW(p)

"crack a rationalist" made me think of the AI-Box Experiment ("http://yudkowsky.net/singularity/aibox") Maybe a rationality test could be something like how long the subject lasts as the gatekeeper before letting the AI out.

Replies from: gwern, ciphergoth
comment by gwern · 2009-04-01T15:59:31.858Z · LW(p) · GW(p)

What ciphergoth said. Also, we can't derive an 'ought' from an 'is' - we don't actually know whether letting the AI out is the right thing to do (unless the contest had a stipulation that the AI was evil and the box keeper knew it, which I don't remember being the case). Perhaps the rational thing is to let the AI out!

Further, this could also just be a test of stubbornness or patience, neither of which is rationality. But good try anyway.

Replies from: JGWeissman
comment by JGWeissman · 2009-04-02T00:26:58.676Z · LW(p) · GW(p)

For the first objection, that the AI Box experiment has too many unknowns, let us instead construct an argument based on psychological tricks for any bad conclusion to try on the subject.

For the second objection, that this tests stubbornness rather than rationality, use a sequence of tests, some using tricks to argue for false conclusions, and some using Bayesian evidence for a good conclusion. The score should reward being convinced when, and only when, the subject should be convinced. Stubbornness can only meet half this requirement.

The task of compiling arguments of both types, which would not be readily available to the subject ahead of time, remains.

comment by Paul Crowley (ciphergoth) · 2009-04-01T08:23:16.601Z · LW(p) · GW(p)

The means by which EY persuades people to let the AI out of the box are secret. We shouldn't draw any conclusions from that experiment except that it is plausible to think a boxed AI could talk its way out of the box.

comment by Cameron_Taylor · 2009-03-18T05:05:09.190Z · LW(p) · GW(p)

Brilliant.

comment by Roko · 2009-03-15T21:55:37.888Z · LW(p) · GW(p)

The same complaint applies to this comment as to the wife-cheating test. It may actually (under certain really bad circumstances) be rational (in the "winning" sense) to believe in religion.

Replies from: MBlume
comment by MBlume · 2009-03-15T22:19:03.819Z · LW(p) · GW(p)

I'll be honest -- my life has taken a sharp downturn since I deconverted. My theist girlfriend, with whom I was very much in love, couldn't deal with this change in me, and after six months of painful vacillation, she left me for a co-worker. That was another six months ago, and I have been heartbroken, miserable, unfocused, and extremely ineffective since.

Perhaps this is an example of the valley of bad rationality of which PhilGoetz spoke, but I still hold my current situation higher in my preference ranking than happiness with false beliefs.

Replies from: Eliezer_Yudkowsky, PaulWright
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-15T23:17:04.378Z · LW(p) · GW(p)

You have my sympathy and my praise.

If anyone's unusually good at deconversions, there might be a market for deconversion attempts aimed at the friends and family of atheists.

Replies from: MBlume, MartinB
comment by MBlume · 2009-03-16T09:40:54.451Z · LW(p) · GW(p)

Thank you. You taught me (a large chunk of) everything I know, so that means a lot.

Honestly, thinking back, I suspect the best opportunity I ever had to deconvert her was when I myself did not yet identify as atheist -- when the crisis of faith was still in full swing. I'd have been perceived as sharing my doubts, rather than as "attacking" her with arguments.

Of course, back then I feared atheism -- I saw it as something terrible happening to me, that I should avoid doing to her. If I'd done a better job of leaving a line of retreat, I might have made better choices -- I might have shared each doubt as it occurred to me, instead of winding up 30 inferential steps removed from the woman I loved.

(And no, explaining that there is an inferential distance between you greater than is likely to be encountered in the ancestral environment really does not help in a fight)

I've been thinking lately of trying to write something addressed specifically to those beginning to question their religions. Life doesn't come with save points, but standing at the spot you went wrong, calling out advice to passers-by seems like the next best thing.

Replies from: DSimon, Cameron_Taylor
comment by DSimon · 2013-02-14T14:58:46.241Z · LW(p) · GW(p)

That last sentence is just ludicrously dense with both important advice and good tips for game design. It's excellent, is what I'm saying, and thanks for writing it. :-)

Replies from: MBlume
comment by MBlume · 2013-02-15T02:00:08.222Z · LW(p) · GW(p)

Thanks XD

comment by Cameron_Taylor · 2009-03-18T05:06:27.163Z · LW(p) · GW(p)

Please do.

comment by MartinB · 2010-09-21T01:43:00.254Z · LW(p) · GW(p)

Isn't there already one to get people out of not widely accepted cults? The market might explode once public perception changes.

comment by PaulWright · 2009-03-15T23:54:56.788Z · LW(p) · GW(p)

My empathies: that happened to me about 6 years ago (though thankfully without as much visible vacillation).

My sister, who had some Cognitive Behaviour Therapy training, reminded me that relationships are forming and breaking all the time, and given I wasn't unattractive and hadn't retreated into monastic seclusion, it wasn't rational to think I'd be alone for the rest of my life (she turned out to be right). That was helpful at the times when my feelings hadn't completely got the better of me. I suppose we can be haunted by stuff that is real.

Replies from: MBlume
comment by MBlume · 2009-03-21T04:40:34.300Z · LW(p) · GW(p)

Thank you. I've been struggling with that haunting myself. I think part of the problem is that when you're in a relationship long enough, you wind up with a term in your utility function for that person. And even if you know you could wind up with someone objectively better, better suited, the outcome doesn't seem like good news to your mind. A job for self-modification, I suppose, even if it's the slow, manual kind.

Very glad to hear she was right =)

comment by jimrandomh · 2009-03-16T04:29:04.513Z · LW(p) · GW(p)

There are two problems with measuring rationality, one of which is difficult but manageable, the other of which might be insurmountable. The first problem is that most conceivable tests of rationality require using information from other fields (such as finance, physics, or psychology), such that you can gain a considerable advantage on the test by studying things from that field which don't actually make you more rational. This can be solved with sufficient cleverness.

The second problem is that how rational someone is depends on how well they maintain it under stress. Pressure, fatigue, emotionally charged situations, alcohol, and/or deliberate manipulation, can make the best rationalists act completely insane. (About a year ago, I went on a reality television show, which was in a way like a series of rationality tests. I didn't do all that well, rationality-wise, but some people who should have known better did dramatically worse.)

Replies from: patrissimo, handoflixue
comment by patrissimo · 2009-03-21T22:10:56.703Z · LW(p) · GW(p)

Yes, the maintaining under stress aspect is key. This is a large part of why poker is hard - it has many characteristics which maximize stress by triggering bad primal instincts.

comment by handoflixue · 2011-07-15T19:44:18.132Z · LW(p) · GW(p)

About a year ago, I went on a reality television show

This suggests a very easy way of inducing conditions appropriate to a more thorough testing of rationality. Any student who insists on leaving (which I think you'd be ethically obliged to allow for) would receive a failing grade. See how well the rest manage to be rational despite the circumstances.

alcohol

This one is probably also eminently doable, especially in a casual setting. I'm sure enough people would object to "Binge drinking night" that you couldn't make it a course requirement in modern-day US, alas. (There are possibly also more ideal drugs than alcohol for these purposes - at a minimum, given individual reactions and tolerances vary, using a variety of pharmaceuticals would probably reduce noise some)

Replies from: beoShaffer
comment by beoShaffer · 2011-07-15T21:37:45.107Z · LW(p) · GW(p)

I'm not sure how well this would carry over to mental stuff, but I know that some martial arts schools and many police and military organizations use physical exercise to create fatigue and/or adrenaline highs during training.

comment by MichaelHoward · 2009-03-15T21:06:07.458Z · LW(p) · GW(p)

Give the students sodium pentothal and ask if they're one of the top 50% of rationalists in their school. However many out of 200 say 'no', that's the school's percentage score. Schools scoring over 100% are thrown out for cheating.

Replies from: JGWeissman, Cameron_Taylor
comment by JGWeissman · 2009-04-01T05:42:02.816Z · LW(p) · GW(p)

A school that reports to each student their class ranking easily games this test. The test could even favor schools that don't teach students enough to question an arbitrary class rank.

Also, this doesn't consider the possibility that students can be good rationalists, but don't interact with enough of the other students to make a good assessment of their relative strengths.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-04-01T06:02:32.874Z · LW(p) · GW(p)

Also, this doesn't consider the possibility that students can be good rationalists, but don't interact with enough of the other students to make a good assessment of their relative strengths.

Good rationalists, taken as a group, shouldn't be systematically optimistic.

Replies from: pjeby, JGWeissman
comment by pjeby · 2009-04-01T14:54:58.237Z · LW(p) · GW(p)

Good rationalists, taken as a group, shouldn't be systematically optimistic.

They should be if they want to win in practice, as opposed to just getting theoretically-correct answers. See, e.g., the studies referenced in Seligman's "Learned Optimism", that show optimists consistently out-perform pessimists (i.e., realists) in a wide variety of fields and endeavors.

(Of course, Seligman's definition of optimism may be different from yours.)

comment by JGWeissman · 2009-04-01T06:36:37.190Z · LW(p) · GW(p)

Perhaps we can still test for this systematic optimism, while filtering out the noise I objected to, by asking for the probability that the student is in the top 50% instead of a "yes" or "no" question. Treat the sum of these probabilities as the count of "yes" answers in the original version. Then a rational student should be able to account for his ignorance of other students in his answer.

Replies from: jschulter
comment by jschulter · 2011-01-21T06:12:01.562Z · LW(p) · GW(p)

This is even easier to game: assuming the school has any merit, any individual you ask should have good incentive to simply say "50%" guaranteeing a perfect score. The very first time you used the test it might be okay, but only if nobody knew that the school's reputation was at stake.

comment by Cameron_Taylor · 2009-03-18T05:07:37.679Z · LW(p) · GW(p)

haha

comment by MBlume · 2009-03-15T20:43:54.170Z · LW(p) · GW(p)

Ask a thousand married rationalists of a given school to estimate the probability that their spouses have cheated on them. Confidentially ask their spouses if they have. Measure group calibration.

ETA: This applies to any potentially painful, but verifiable question. Ask them to draw a probability distribution over their date of death, or the longevity of their marriages. Estimate the probability of various kinds of cancer appearing over the next (5,10,15) years, etc. etc.

Replies from: Roko, swestrup
comment by Roko · 2009-03-15T21:50:55.172Z · LW(p) · GW(p)

I've thought of a problem with this: if rationality is about /Winning/, then it may be rational to not consider the hypothesis that your wife cheats on you. You may better serve your preferences if you remain in blissful ignorance. Also, human relationships have a very Newcomb-like feel to them, because other humans are very good at ascertaining your true beliefs. If you are entertaining the hypothesis seriously, your wife will probably detect it.

So in this case winning and having a map that accurately reflects the territory may be anti-aligned.

Replies from: MBlume
comment by MBlume · 2009-03-15T21:59:18.084Z · LW(p) · GW(p)

You may better serve your preferences if you remain in blissful ignorance.

There is a difference between wanting not to be a cuckold and wanting not to believe that you are a cuckold. I want the former.

Also, human relationships have a very Newcomb-like feel to them, because other humans are very good at ascertaining your true beliefs. If you are entertaining the hypothesis seriously, your wife will probably detect it.

Presumably, if you are entertaining the hypothesis -- at least beyond a societal average, or some such -- there is a root problem already in play.

But yes, this does have some self-fulfilling aspects which make it rather hard to model well.

Replies from: Roko
comment by Roko · 2009-03-15T22:15:56.537Z · LW(p) · GW(p)

For me the biggest problem is that many people's preferences will be:

(a) wanting to not be cheated on

AND

(b) wanting to trust the other person so much that the possibility doesn't even arise.

i.e. your preferences in this area are a function of your own mind-state.

Replies from: MBlume
comment by MBlume · 2009-03-15T22:29:35.639Z · LW(p) · GW(p)

On introspection, this does agree with my preferences, yes.

That does complicate things -- I'm not sure how to resolve this one.

I think we are using the world "rationalist" to cover too many meanings. One highly socially useful meaning for the word would be "person who can be reliably expected to speak the truth". Whatever you choose to call those, it'd certainly be useful to have some around for any society you'd like to build. We would want to have some tests to identify them.

comment by swestrup · 2009-03-15T21:21:38.607Z · LW(p) · GW(p)

You'd have to define 'cheated on'. A fair number of the most rational folks I know live in non-traditional marriage arrangements.

Replies from: MBlume, Cameron_Taylor
comment by MBlume · 2009-03-15T21:33:35.884Z · LW(p) · GW(p)

This is entirely true. We're going for emotional effect, so on that test, I'd keep it to the self-identified monogamists.

comment by Cameron_Taylor · 2009-03-18T05:08:54.439Z · LW(p) · GW(p)

Perhaps because they realise the real probability of cheating.

comment by Adele Lopez (adele-lopez-1) · 2022-08-23T03:17:12.516Z · LW(p) · GW(p)

(I consider it drop-dead obvious that the task of verifying acquired skills and hence the power to grant degrees should be separated from the institutions that do the teaching, but let's not go into that.)

Were/are there any organizations just dedicated to verifying rationality skills? CFAR tried to do both IIRC. Seems pretty bad if there haven't been any attempts at this even.

Replies from: elityre
comment by Eli Tyre (elityre) · 2023-04-19T22:27:03.300Z · LW(p) · GW(p)

CFAR tried to do both IIRC.

According to me (who worked at CFAR for 5 years), CFAR did approximately zero rationality verification whatsoever.

Indeed, while that would be crucial to the kind of experimental rationality development that's described in the Craft and the Community, it isn't and wasn't a natural component of CFAR's functional strategy, which was something more like rationality community-building and culture-building.

[I hope to write more about what CFAR did and why, and how it differed from the sort of thing outlined in the Craft and the Community, sometime.]

Replies from: JohnSteidley
comment by John Steidley (JohnSteidley) · 2023-04-20T00:32:47.955Z · LW(p) · GW(p)

I'm currently one of the four members of the core team at CFAR (though the newest addition by far). I also co-ran the Prague Workshop Series in the fall of 2022. I've been significantly involved with CFAR since its most recent instructor training program in 2019.

I second what Eli Tyre says here. The closest thing to "rationality verification" that CFAR did in my experience was the 2019 instructor training program, which was careful to point out it wasn't verifying rationality broadly, just certifying the ability to teach one specific class.

comment by Sebastian_Hagen · 2009-03-15T22:20:37.173Z · LW(p) · GW(p)

Use small-scale, limited-term betting markets with play money.

Put the group of people you want to rank relative to each other into a room - without internet access. Everyone starts with 0 points. People are ranked on how many points they have at the end of the test.

Participants make bets (for points) with each other. There's a time limit for settling those debts; all bets made have to be specified in a way that clearly determines the winner within a fixed period after the end of the test. Of course, bets that can be settled immediately (e.g. on current trivia, history or fiction) are also permissible.

Aside from that, there are no limits: Any time two participants agree they want to bet against each other, on whatever they specify for however many points they choose, they can register that bet.

For instance, Alice and Bob bet on the temperature as reported at 6:00 local time on the Monday after the test:

  • Bob will pay Alice 5 points if the temperature is at most 20 degree Celsius
  • Otherwise, Alice will pay Bob 20 points.

After enough time has passed for all bets to be settled, have a trusted third party determine the winner for each, tally up the points and rank participants by final score.
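
A minimal sketch of the point bookkeeping in Python (the bet mirrors the Alice/Bob example above; in practice the winner of each bet would be decided by the trusted third party):

```python
# Illustrative bookkeeping only.
from collections import defaultdict

# Each bet: (player_a, points_a_risks, player_b, points_b_risks, terms).
bets = [
    ("Alice", 20, "Bob", 5, "temperature at most 20 degrees Celsius at 6:00 Monday"),
]

def tally(bets, winners):
    """winners maps bet index -> name of the winning player."""
    scores = defaultdict(int)            # everyone starts with 0 points
    for i, (a, stake_a, b, stake_b, _) in enumerate(bets):
        if winners[i] == a:              # b pays a
            scores[a] += stake_b
            scores[b] -= stake_b
        else:                            # a pays b
            scores[b] += stake_a
            scores[a] -= stake_a
    return dict(scores)

print(tally(bets, {0: "Alice"}))         # zero-sum: {'Alice': 5, 'Bob': -5}
```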

This game is absolute zero-sum: the only way to earn points is by taking them from another participant. Test runs and outcomes can be published without obviously weakening the idea: If there's something to be learned from previous rounds, all participants have a chance to learn it.

Studying certain subjects obsessively may help you, but only to the point that other participants don't know you've done it: If everyone knows that you are a major Higurashi no Naku Koro ni fan, they're unlikely to bet against you on that subject - or if they do, they won't bet very much.

Edit: Thinking about this some more, this kind of test has a failure mode: There's a strong incentive not to bet against people who are better at tests like this than you, so with sufficient information about the players the entire game may freeze up: For every possible bet, there's somebody who expects to end up worse off, no bets get made and everyone always walks out with 0 points.

Possible solution: Keep participants anonymous to each other during each test. If nobody knows who they're playing against, there's a higher chance they'll be willing to make some bets.

Replies from: steven0461
comment by steven0461 · 2009-03-17T15:07:08.860Z · LW(p) · GW(p)

Good idea. It could work online if there's enough trust between participants.

Replies from: Sebastian_Hagen
comment by Sebastian_Hagen · 2009-03-18T09:06:39.601Z · LW(p) · GW(p)

As an addendum, I think the whole thing could still work pretty well even if everyone is explicitly allowed to use the web (or any other data store) for research.

Bets that can be settled with immediately available information won't be very useful in that context, of course; but you could still bet on near future events. Speed research would be a valuable skill in this variant. Nevertheless, if you have any significant domain specific knowledge useful for making a short-term prediction, that should give you an advantage over someone speed-researching the topic before deciding if they want to make a specific bet on it against you.

The real problem is that access to the internet (or any nontrivial subset) also allows you to do realtime communication with other humans, so you might convince/hire a master rationalist to offer you advice during the test, which would be an extremely effective way to cheat.

Replies from: rysade
comment by rysade · 2010-10-23T23:45:47.893Z · LW(p) · GW(p)

A fairly simple Windows application could nearly eliminate the problem of research during the test - if it were timed. Each round being timed would allow little time to bypass the lockdowns that can be imposed through the Windows API. Each time the test is given, a new version of the test software would be released. Even the fastest hacker would be locked into taking the test!

comment by swestrup · 2009-03-15T21:24:16.796Z · LW(p) · GW(p)

Well, there's always the idea of using fMRI scans to determine if someone is thinking in 'rational' patterns. You stick them under the machine and give them a test. You ignore the results of the test, but score the student on what parts of their brains light up.

comment by Roko · 2009-03-15T17:37:28.811Z · LW(p) · GW(p)

Clearly real life achievement correlates well with rationality, by definition. So an impractical but "gold standard guaranteed" test of rationality would be to wait until the person in question got to the age of, say, 50, and check to see whether they had made lots of money, or achieved other obvious life goals (fame, for example).

A more specific good test of rationality is the world of startups. Other than the OB/LW community, the entrepreneurial world is the closest to perfect rationality I have found. You could test someone in a month or so by asking them to enter a startup competition, like this:

http://www.cue.org.uk/5k-Challenge

Again, probably not so practical.

The kind of challenges you find on Alan Sugar's "The Apprentice" are fairly rationality oriented, and administrable in a few days.

Potential rationalists could be tested by putting them in the position of a venture capitalist/angel investor, and having a combination of real businesses and fakes come to pitch to them. This test could be made harder by supplying some of the fakes with convincing cover stories and the real businesses with poor presentation skills, forcing the rationalist to concentrate on the merit of the underlying idea rather than superficial clues. It could be made a better test by allowing participants to do their own research beforehand, and giving them the opportunity to hire and consult "experts", some of whom would again be fakes.

In general, rationality tests have to take a long time IMO, because genuinely creative thinking takes a long time at a serial speed of ~10Hz. Any "test" that takes a few hours (like an exam) is just going to be regurgitation of previously memorized material, or activation of previously trained-up hardwired circuits, a la "cached thoughts". It seems to me that a day long test is the absolute minimum.

Of course you could also administer a classic academic style exam on the OB material, cognitive biases, etc. But someone could do very well on that without really understanding it. Still, it would provide some indication of real life rationality performance.

Replies from: Cameron_Taylor
comment by Cameron_Taylor · 2009-03-18T05:04:14.423Z · LW(p) · GW(p)

Clearly real life achievement correlates well with rationality, by definition. So an impractical but "gold standard guaranteed" test of rationality would be to wait until the person in question got to the age of, say, 50, and check to see whether they had made lots of money, or achieved other obvious life goals (fame, for example).

Not by definition.

comment by jyasskin · 2010-12-29T18:15:59.637Z · LW(p) · GW(p)

I don't see what I thought were the obvious answers, so here they are. The foundations are elsewhere on the site, but they seemed missing from this list.

Reputational: Expect Bayesian masters to participate in other scientific fields. People who make more discoveries in other fields get more street cred among rationalists, especially when they can explain how rationalism helped them make the discoveries. Obviously, this is a long-term process that doesn't lend itself to improving the art quickly.

Experimental: This one's a two-step process. First, ask a large collection of university professors to insert one lie into each of their lectures, à la http://www.overcomingbias.com/2008/02/my-favorite-lia.html (mentioned in another comment). Have them note which students discover each lie, but don't have that count for any sort of grade (to prevent gaming). Second, sort students randomly into the experimental rationality classes, and/or have the classes "fill up" (with a lottery for seats) to provide a control. Look for whether there's a difference in lie-detection rates between the differently-taught groups (a rough comparison is sketched below).
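For concreteness, here is a minimal sketch of how the lie-detection rates of two groups might be compared; the counts, group sizes, and the use of a plain two-proportion z-test are illustrative assumptions, not part of the original proposal.

```python
import math

def two_proportion_ztest(hits_a, n_a, hits_b, n_b):
    """Large-sample test for a difference between two lie-detection rates."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    p_pool = (hits_a + hits_b) / (n_a + n_b)            # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))           # two-sided p-value
    return p_a, p_b, z, p_value

# Hypothetical counts: lies detected out of lies planted, per group.
print(two_proportion_ztest(hits_a=130, n_a=400, hits_b=95, n_b=400))
```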

Experimental #2, much longer term: Track the career outcomes of the students who took each different rationality class. See whether there's a difference in winning between the groups.

Replies from: None
comment by [deleted] · 2015-04-18T19:22:41.148Z · LW(p) · GW(p)

Note that for some of them, leaving the career track altogether might be the rational choice.

comment by Emile · 2009-03-16T21:03:35.671Z · LW(p) · GW(p)

"Piggyback" on other tests: ask people taking part in various tests (standardized exams, sport competitions, driving lessons, programming contests, art exhibitions - whatever) their chances of success (or their probability distribution over the range of results).

The other tests should themselves be important enough in their own right; this would fit well within a university curriculum, where it could be "automated" for a lot of things. The way of asking for predictions should be designed so as to maximize bad predictions: for example, the students are asked to give estimates in front of their peers (if that's shown to make them overestimate), but afterwards are not reminded of the prediction they gave nor of whether it came true (so that they don't deliberately try to make it come true).

It could also be extended to other events like "when I'll turn in my thesis" or even "whether I'll be single in a year" or "how much I'll weigh in six months".

The more subjects they have to estimate on, the better. At the end, measure the Bayes-score.
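A minimal sketch of the Bayes-score bookkeeping (log of the probability assigned to what actually happened); the student's predictions here are hypothetical.

```python
import math

def bayes_score(predictions):
    """Sum of log-probabilities assigned to the actual outcomes.

    `predictions` is a list of (p, outcome) pairs: p is the stated probability
    that the event happens, outcome is True/False. Less negative is better.
    """
    return sum(math.log(p if outcome else 1 - p) for p, outcome in predictions)

# Hypothetical student: three predictions and what actually happened.
student = [(0.9, True), (0.6, False), (0.8, True)]
print(bayes_score(student))   # roughly -1.25 natural-log units
```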

This could be combined to some more "dramatic" and explicit rationality tests (see the other comments) to constitute the scoring method of a university rationality course. The explicit rationality tests would also help take a bit of attention away from the day-to-day probability estimates on exams and stuff, to diminish the "only rational when deliberately thinking about it" phenomenon.

Oh, also - ask the students for an estimate before the exam and after the exam (but before they have a chance of talking to someone else). Maybe even a week before and a week after too.

comment by haig · 2009-03-16T20:33:34.719Z · LW(p) · GW(p)

There is a recent trend of 'serious games' which use video games to teach and train people in various capacities, including military, health care, management, as well as the traditional schooling. I see no reason why this couldn't be applied to rationality training.

I always liked adventure style games as a kid, such as King's Quest or Myst, and wondered why they aren't around any more. They seemed to be testing rationality in that you would need to guide the character through many interconnected puzzles while figuring out the model of the world and how best to achieve the goals of the protagonist. It seems like the perfect video game genre for both developing and testing rationality skills.

Specifically, I've thought of a microcosm of the real world, taking place in a different setting yet similar enough to our real world that there would be analogues to religion, science, politics, etc. As you progress through the game, say from child to adult, you learn about the world and see how different beliefs and strategies affect the game. Players would encounter challenges similar to those of the real world, but be disconnected enough not to put up a defense mechanism, yet involved enough to care about the outcome. Add MMO et al. features to taste.

Replies from: steven0461, rysade, Cameron_Taylor
comment by steven0461 · 2009-03-17T14:40:00.626Z · LW(p) · GW(p)

I always liked adventure style games as a kid, such as King's Quest or Myst, and wondered why they aren't around any more.

Google "interactive fiction".

comment by rysade · 2010-10-23T23:09:37.973Z · LW(p) · GW(p)

I just finished playing a side-scrolling game called Closure (http://www.closuregame.com) that has some qualities of Myst, et al. I think that you've got a good idea here, but a problem could arise from the 'death penalty' that most games impose. Typically, you just restart the 'mission.' Games that operate like that don't provide quite enough incentive to pull out your whole intellect. If the player knew ahead of time that a single failure meant permanent loss, they would be more apt to give the game effort enough to have their rationality tested accurately.

Replies from: handoflixue
comment by handoflixue · 2011-07-15T19:37:52.626Z · LW(p) · GW(p)

If the player knew ahead of time that a single failure meant permanent loss

That would be the RogueLike genre, of which NetHack is a pretty good example of "painful trial and error to learn how the world works". Most successful players just go online and read the spoilers, and I'd argue that this is the more rational approach - it's irrational to go out and pay the price of failure when someone else has already done that for you, and you can learn from them.

Besides, most people don't find that sort of trial and error game play fun, which I think is a fairly important consideration if you're trying to teach people.

comment by Cameron_Taylor · 2009-03-18T04:55:53.766Z · LW(p) · GW(p)

Good idea. What details would you be able to convey?

comment by bentarm · 2009-03-16T03:47:23.731Z · LW(p) · GW(p)

I'm not sure if this has already been said, but does the "biases" literature not already contain a lot of perfectly good (although probably overly game-able) rationality tests? Just pick an experiment at random from Tversky and Kahneman and see how well the people in the school do.

Of course, there is a problem of people learning how to do some of these tests, but I'm pretty sure there are some that could be reworked so that they're pretty damned hard to pass even if you're well-acquainted with the literature. I'm thinking particularly those where half of the subjects are asked a different question to the other half, and the results compared - e.g., tests for the Lake Wobegon effect, for Social Attribution Bias, etc.

Replies from: zaph
comment by zaph · 2009-03-16T20:03:57.338Z · LW(p) · GW(p)

Shouldn't the rationality school suggested by Eliezer, though, be able to train someone to do well on these tests, essentially by becoming very familiar with the literature? Just devil's advocating against your devil's advocation; it seems like this would actually be pretty ideal, as you have scientifically benchmarked tests that show what, let's say, "naive" individuals think when encountering these problems, from which you could then see progress in the "trained" rationalists. The problem with gaming this system would be with people who are studying rationality but plan to subvert it at some point; the rationalist community would need frequent re-certifications so that rationalists don't rest on their laurels and rely on status to convey an inferred rationality to their decisions.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-16T20:31:36.993Z · LW(p) · GW(p)

The problem is if they do well on written questions in classes but no better than average at applying the same knowledge to real life.

Replies from: bogdanb, thomblake, Cameron_Taylor
comment by bogdanb · 2009-03-28T21:47:10.792Z · LW(p) · GW(p)

This is a problem with “class tests” of anything, of course. I've thought (more than five minutes) on your post, but I didn't come up with much specifically about rationality testing. (Except for “automatically build arbitrary but coherent «worlds», let students model them, and then check how well their models fit «reality» afterwards”, which is an obvious application of the definition, and has been suggested already several times.)

I've come up with a few thoughts on testing in general:

1) As you say, cheap-but-game-able tests are often useful; we do have useful universities despite the problem of universities awarding diplomas to their own students. I think this is more than just “works well enough”; in some cases it's actually useful: (a) Having good tests (e.g., by a third party) requires defining well in advance exactly what you're testing. But in many cases it can be useful if a school experiments with what it teaches (and even why), and the only test needed is internal. (b) In many (most?) cases, you can't really test some ability until you really try using it. There are plausible cases where a quick-and-dirty (but cheap) test (e.g. university diplomas) is needed only to pre-select people (i.e., weed out most incompetents), before the real testing of actual work (e.g., hiring interviews and tests, then a probation period). If you make the initial test «better» (e.g., harder to game) but more expensive, you may actually be losing if it's not «better» in the sense of being accurate for whatever you need people to be good at.

OK, now I'm getting to what you're saying about doing well in class but badly in real life. It seems an obvious solution that you should actually be doing the testing in real life: first weed out the bad as well as you can with an approximate test (how well you do on it tests your map against reality), then “hire” (whatever that means in the context) people who look promising, make them do real work, and evaluate them there.

You don't have to evaluate everything they do, as long as you do it randomly (i.e., nobody knows when they're evaluated). The fact that random testing is done can be safely made public: if you don't know when it's done, the only way to “game” this is to actually be as good as you can be all the time.

The random testing can be passive (e.g. audits) or active (e.g. penetration testing). The only trick is that you have to do it often enough to give significant information, and that the tested can't tell when they're being tested. For instance, testing for biases can be very useful even in a context where everybody is extensively familiar with their existence, as long as you do it often enough to have a decent chance of catching people unawares. (This is hard to do, which is why such tests are difficult. Which is why university exams are still useful.)
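A back-of-the-envelope illustration of "often enough": if each occasion is independently audited with probability p, the chance of catching someone who misbehaves on every occasion is 1 - (1 - p)^n. The audit rates and numbers of occasions below are made up.

```python
# Probability of catching, at least once, someone who misbehaves on every
# occasion, when each occasion is independently audited with probability p.
def p_caught(p_audit, occasions):
    return 1 - (1 - p_audit) ** occasions

for p in (0.01, 0.05, 0.10):
    print(p, [round(p_caught(p, n), 3) for n in (10, 50, 200)])
```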

Note that you don't have to make all tests undetectable; having some tests detected (especially if it's not obvious that they are detectable on purpose) both reminds testees of them, and allows detecting people who react differently when tested than in real life. (This can then allow you to notice when people detect tests you're trying to keep secret, assuming there's enough testing going on.)

Replies from: bogdanb, rysade
comment by bogdanb · 2009-03-28T21:50:50.785Z · LW(p) · GW(p)

Oh, and another thing that seems obvious: change tests often enough that they can't be gamed. This is of course hard and expensive, which is why it isn't done very often.

comment by rysade · 2010-10-23T23:24:08.655Z · LW(p) · GW(p)

I had a similar idea, but I'm still not sure about it. Succeeding in Real Life does seem like a good measure, to a point. How could one gauge one's success in real life, though? Through yearly income, or net worth? What about happiness or satisfaction?

comment by thomblake · 2009-03-16T20:37:19.012Z · LW(p) · GW(p)

You have to admit that's an empirical question, though. It could be that getting the competence to do well on rationality tests requires the same skill as applying the same knowledge to real life. There are some areas where 'fake it till you make it' works, and there are some things you can't pretend to do without actually succeeding in doing the thing.

comment by Cameron_Taylor · 2009-03-18T04:56:58.265Z · LW(p) · GW(p)

Test for real life? Ouch.

comment by orthonormal · 2009-03-22T16:23:23.122Z · LW(p) · GW(p)

(haven't looked through comments, so this may have been suggested many times over)

In a college-level rationality course, it would be most appropriate for a portion of the grade to be determined by an artificial economy. That is, set up a currency and a (relatively even) starting distribution, add (probabilistic) opportunities for investment (perhaps linked to other important parts of the course) and, most importantly, make defection possible, anonymous and easy. Make it, as much as possible, like a vast array of one-shot (or known number of iterations) Prisoner's Dilemmas.

Then allow students to organize into institutions with rules. Well-taught rationalists should be able to construct a very strong economy along these lines; poorly-taught ones will be only rational enough not to cooperate out of an irrational sense of honor. A student's final grade on that component will be the logarithm of their final wealth, curved as little as possible.

It would take a well-designed setup, of course, to ensure that we're truly measuring rationality and not (say) merely group camaraderie; but I think it could be worked out in a satisfactory way.

The main upshot of this as regards rationality verification: if two different rationality curricula run the same economy setup, a consistently better growth rate of one class economy is evidence of the second kind that more complete rationality is being taught. The students have a much bigger incentive towards their own grade than towards the reputation of the class, so it should be a pretty decent test.
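A toy sketch of the setup: anonymous one-shot Prisoner's Dilemmas with random pairings, graded by log of final wealth. The payoff matrix, starting wealth, and strategies are all made-up stand-ins for whatever a real course would use.

```python
import math
import random

PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def play_economy(strategies, rounds=200, seed=0):
    """Anonymous one-shot PD economy; returns log-of-final-wealth 'grades'."""
    rng = random.Random(seed)
    wealth = {name: 10.0 for name in strategies}
    names = list(strategies)
    for _ in range(rounds):
        a, b = rng.sample(names, 2)                # anonymous random pairing
        pay_a, pay_b = PAYOFF[(strategies[a](), strategies[b]())]
        wealth[a] += pay_a
        wealth[b] += pay_b
    return {name: math.log(w) for name, w in wealth.items()}

students = {
    'always_cooperate': lambda: 'C',
    'always_defect':    lambda: 'D',
    'coin_flip':        lambda: random.choice('CD'),
}
print(play_economy(students))
```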

Replies from: Will_Newsome
comment by Will_Newsome · 2010-09-20T09:36:31.812Z · LW(p) · GW(p)

What's the starting rationality level of the students? Traditional rationality level or post-Sequences level?

Replies from: orthonormal
comment by orthonormal · 2010-09-22T22:43:19.434Z · LW(p) · GW(p)

I'm assuming an introductory type of class, for students with some scientific background but no rationality training. (Where on earth would you find a college class full of post-Sequences people?)

comment by patrissimo · 2009-03-21T22:09:06.405Z · LW(p) · GW(p)

I'm tempted to say "have them play poker", except it uses lots of domain-specific knowledge as well as general rationality. Perhaps if you could generate random games from a large enough space that people don't build up game-specific skills, and the games just end up testing general rationality? While poker-like games don't test all aspects of rationality, there are some things like "ability to keep making good decisions when frustrated / bored / angry" that these games test very well.

I think people would develop skill at the whole class of games...but at the same time, they would be improving their rationality.

comment by Marshall · 2009-03-15T20:37:14.467Z · LW(p) · GW(p)

Maybe there is a simple thing, which rational people can't do - always get wrong.

Some not very good examples could be:

Skipping with closed eyes.

Telling a lie to a stranger without it being discovered

Saying - "Ooops, I' m wrong," quickly enough

Going to church and sitting thru' a whole sermon without getting very very upset

Multi-tasking

Irony

Understanding metaphors metaphorically.......

Replies from: Rings_of_Saturn, infotropism, Cameron_Taylor
comment by Rings_of_Saturn · 2009-03-16T02:12:23.626Z · LW(p) · GW(p)

Yeah... I can't think of any good actual examples either, but maybe we should be trying to falsify rationality, rather than verify it.

comment by infotropism · 2009-03-15T22:14:53.369Z · LW(p) · GW(p)

I don't know if any of those particular suggestions would work, but the general idea is interesting; no one else has suggested testing a negative correlate of rationality, I think.

comment by Cameron_Taylor · 2009-03-18T05:00:55.532Z · LW(p) · GW(p)

Huh? Those are mostly independent of rationality.

comment by Roko · 2009-03-15T17:58:15.918Z · LW(p) · GW(p)

Another key feature of [edit] group rationality is the ability to not be swayed by what the social group thinks.

There are simple experiments (though I cannot think of the relevant keywords) where a test subject is put in a room full of confederates, all of whom estimate one line segment to be longer than another when the two lines are in fact the same length.

EDIT: Conforming to the group opinion (on average) increases the probability that you are right, thus improving individual truth-tracking. But adding more conformers to the LW community just screws it over. In the limit of infinitely many perfect conformers, the community would display irreversible belief hysteresis; as soon as >50% of the group believed X, all the conformers would switch to believing X, and they would stay that way.

EDIT: The Asch experiments found that only 25% of people would consistently report the truth their own eyes were telling them. Thus most people don't make good group rationalists.

This comment thread makes clear the need to distinguish between creating good individual rationalists and good group rationalists. Optimizing for group rationality and individual rationality are different tasks.

A gold-plated ignorance prior will be awarded to the first commenter who finds the link for me...

Replies from: MichaelBishop, MichaelHoward, CarlShulman
comment by Mike Bishop (MichaelBishop) · 2009-03-15T18:29:36.085Z · LW(p) · GW(p)

You're thinking of Asch's experiments. Apparently, they are widely misrepresented: http://webpage.pace.edu/yrafferty/Yvonne/AschConformityStudy.pdf See also: http://www.hss.caltech.edu/~jkg/Conformity.pdf (I don't remember where I found these... possibly through OB)

Replies from: Roko
comment by Roko · 2009-03-15T18:42:28.249Z · LW(p) · GW(p)

You are the proud recipient of a gold-plated uniform distribution on a finite set. Congrats.

Since my comment has been downvoted to 0, I assume that the LW community likes people who go along with the group opinion even when they know it is wrong? Perhaps people are unsatisfied with this as a rationality test because they think that the test should focus on getting as close to the truth as possible (in which case conforming is good in most cases for most people) rather than adding value to a rationalist community (in which case conforming just because everyone else does is actively hurting the community).

Also, having skimmed the pace.edu link, I am unconvinced that Asch's results are being misinterpreted, at least by me. Asch found that, in the situation of overwhelming evidence, only 25% of subjects could be trusted to consistently call things the way they really were, i.e. 25% of the subjects pass what I would call the absolute minimum standard of rationality over social conformity.

Note that Carl's link to the OB article gives us a more nuanced version of this debate, which I recommend.

"Paul Crowley reminds me to note that when subjects can respond in a way that will not be seen by the group, conformity also drops, which also argues against an Aumann interpretation."

Replies from: MichaelHoward
comment by MichaelHoward · 2009-03-15T18:52:22.539Z · LW(p) · GW(p)

Since my comment has been downvoted to 0, I assume that the LW community likes...

Hasty generalization/Belief in the law of small numbers

Replies from: Roko
comment by Roko · 2009-03-15T19:07:00.618Z · LW(p) · GW(p)

yeah, OK, it's only 1 person's opinion, I'll wait and see what happens when more time passes and more people get the chance to vote.

In defense of my interpretation... few comments get downvoted to zero, so even a small amount of time at zero is fairly significant evidence that people don't like what you're saying.

comment by MichaelHoward · 2009-03-15T18:17:39.876Z · LW(p) · GW(p)

...and here's the video (the one in the OB link is dead).

comment by Eli Tyre (elityre) · 2019-07-13T05:01:56.213Z · LW(p) · GW(p)

Let's see...

  • Prediction contests are an obvious one.
  • Also, perhaps, having people compete at newly designed games, so that everyone has the same amount of time to learn the rules and how to win, given the rules.
  • Perhaps we could design puzzles that intentionally have places where one would make a mistake, error, or wrong choice, and such errors are visible (to an observer who knows the puzzle) when made.
comment by [deleted] · 2009-03-15T22:25:24.385Z · LW(p) · GW(p)

deleted

Replies from: MichaelVassar, John_Maxwell_IV, beriukay
comment by MichaelVassar · 2009-03-16T05:16:56.012Z · LW(p) · GW(p)

I rate fairly poorly by these metrics. That makes me suspect that people like me also do. I see that this comment has been poorly rated and hope that people haven't rated it poorly for being unflattering. If you have done this, please rate it back up, OK.

comment by John_Maxwell (John_Maxwell_IV) · 2009-03-17T04:38:55.093Z · LW(p) · GW(p)

Degree of equality between the percentage of income spent on books and the percentage of income spent on club memberships.

I'm pretty sure Rational Man never buys a book he can borrow for free from the local library.

Replies from: MBlume
comment by MBlume · 2009-03-17T06:20:25.945Z · LW(p) · GW(p)

I certainly don't mean to refer to myself as a candidate for Rational Man, but I do like owning books. Especially textbooks, I would not want to go down to the library every time I wanted to go through my copy of Sakurai. But even old favorite novels, it's good to have them on the shelf, ready to throw in a saddlebag at a moment's notice before a long train ride.

comment by beriukay · 2010-07-26T13:20:47.303Z · LW(p) · GW(p)

I know of some other stupid tests for rationality, borrowed happily from Invader Zim.

  1. Absorbency
  2. Electrical Conductivity
  3. Something involving a beaver and a toy taxi.

On a less stupid note: Reputationally, I have an explicit agreement with one of my friends that we fact check each other. This was actually a one-way fact checking until fairly recently when he asked me why I didn't call him on something he later realized was total bullcrap. Note, this works best if you actually have a good memory and aren't pickling your brain with alcohol. It also seems to help check the mindkilling effects of disagreement.

A long time ago, I was reading about critical thinking, and was presented a relatively short list of questions to try and use to stimulate critical thought. Questions of this nature could be used in some form of standardized test; or could be used to build a portfolio of rationale behind opinions on all manner of things, which could be graded by peers or instructors (preferably ones who also aspire to rationality, and disagree). I suppose the portfolio would be more organizational than experimental, and almost as easy to game as cheating on essays. But those were my main thoughts before reading the cool ideas other people came up with.

In case you're interested, this was the list as I transcribed it:

  • What do you mean by _ ?
  • How did you come to that conclusion?
  • What is the source of your information?
  • What is the source of their (opponents') source of information?
  • What assumptions led you to that conclusion?
  • Suppose you are wrong. What implications are there?
  • Why did you make your inference? Is there another inference more consistent with the data?
  • Why is this issue significant?
  • How do I know what you say is true?
  • What is an alternate explanation for this phenomenon?

Oh, and after reading the Logic of Failure, maybe running simulations like they did with the Sim City-like vibe, or the optimizing bug population or the refrigeration tests could be instructive. Even after learning about them, (especially the city planning and the African tribe) they may be sufficiently complicated to be of experimental or organizational value. On the other hand, they may turn out to be just as useless as chess for testing rationality if success strategies are posted and shared. Maybe some of the sims could have randomly assigned (Kirk resistant) Kobayashi Maru modes, but then I don't see how a predetermined loss would be very instructive unless the player didn't know it was rigged---and even then, only to illustrate Eliezer's point that even if you do everything right, you can still fail.

comment by infotropism · 2009-03-15T21:52:10.208Z · LW(p) · GW(p)

Give them a motivation that is stronger than the drive to game the test. I'm an immortalist. I don't want to die. I could deceive myself and others in many ways about my skills, purposes, and beliefs, but in the end I can't do that at the expense of my chances of not dying. Find a similarly important purpose: something that might even be gamed, but for which gaming means you lose. Some real-life test.

Maybe: measure someone's capability to win. I have often wondered if being rational correlates with being successful in society. I can't be sure, though it seems to me it should; if it doesn't, then I suppose it either means there's a problem with a rationality that would leave you worse off, or more likely, that you aren't being rational enough, or do not have enough mental resources to use that rationality to make a difference. Bounded rationality, always an issue.

Capability to win could be measured in many ways: economic success, for instance, or any other existing societal position of power or prestige. Of course any one of those may be gamed, but it's OK to cheat; if cheating brings you closer to what you want, then it is rational to game. However, the goal that you hold, and for which you are vying, may not be very interesting. Empty fame, etc.

It would be best to have a personal goal set, and known, and measure how a person fares as to that goal; a goal difficult enough to require the proper use of rationality to win in society, that would require applying rationality to a very large and diverse range of situations, a goal that you'd want to preserve.

Can't help much in determining what that would be; I have my own thing to protect, as I said, and I'm not sure what it might be for other people. It doesn't work all the time either. Sometimes short-term goals are vying for dominance over my actions, and I'll give in to them, even if it means getting farther from my own personal long-term goal. That's a lack of willpower, not a lack of rationality at work there, I think.

comment by Roko · 2009-03-15T21:19:42.857Z · LW(p) · GW(p)

Send rationalists to do consulting work where real money is involved, for example techdirt:

http://www.techdirt.com/

The Techdirt group blog uses a proven economic framework to analyze and offer insight into news stories about changes in government policy, technology and legal issues that affect companies’ ability to innovate and grow.

Here you basically get paid for good insights. A "team" of rationalists could be sent in to dominate this particular arena, thereby validating the technique. Basically any online arena where real money can be made is fair game. Trading in Second Life, for example.

Replies from: Johnicholas
comment by Johnicholas · 2009-03-16T01:26:18.093Z · LW(p) · GW(p)

The feature of "profitable in the real world" is very valuable. Keeps the test calibrated to what we're interested in measuring.

Real-money, real-world prediction markets also have this feature; I wonder what other examples exist.

comment by swestrup · 2009-03-15T21:11:29.129Z · LW(p) · GW(p)

A friend of mine, the most consistently rational person I know of, once told me that his major criteria for whether a piece of information is useful is if it can allow him to forget multiple other pieces of information, because they are now derivable from his corpus of information, given this new fact.

I have a vague feeling that there should be a useful test of rationality based on this. Some sort of information modeling test whereby one is given a complex set of interrelated but random data, and a randomly-generated data-expression language. Scoring is based on how close to optimal one gets in writing a generator for the given data in the given language.

Unfortunately, I think this is something one could explicitly train for, and someone with knowledge of data compression theory would probably be at an advantage.
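A crude sketch of the scoring idea, using a generic compressor as a stand-in for the hypothetical randomly-generated data-expression language; the data set and the ratio-based score are illustrative assumptions only.

```python
import json
import random
import zlib

# Generate interrelated data, then score how much of its structure a
# "generator" exploits. zlib-compressed size over raw size is a crude proxy
# for how close to optimal a student's generator is.
random.seed(1)
base = [random.randint(0, 9) for _ in range(50)]
data = {'base': base,
        'doubled': [2 * x for x in base],                    # derivable from base
        'sums': [sum(base[:i]) for i in range(len(base))]}   # also derivable

raw = json.dumps(data).encode()
packed = zlib.compress(raw)
print(len(raw), len(packed), round(len(packed) / len(raw), 2))  # lower is better
```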

Replies from: Roko
comment by Roko · 2009-03-15T21:27:49.193Z · LW(p) · GW(p)

compression != rationality, methinks

Replies from: Vladimir_Golovin, swestrup
comment by Vladimir_Golovin · 2009-03-15T21:39:55.270Z · LW(p) · GW(p)

Yes, "not equals", but compression is necessary for reality-mapping, which is one of the key components of rationality as defined at the beginning of this post. There's a great quote on this:

“We can take this huge universe, and put it inside a very tiny head -- you fold it.”

Replies from: Roko
comment by Roko · 2009-03-15T21:44:20.533Z · LW(p) · GW(p)

The thing is, humans do the compression thing very naturally. The heuristics and biases researchers' innovation was that we suffer from specific mental illnesses such as overconfidence, confirmation bias, tribal politics, rationalizing after we've written the bottom line, etc.

EDITED

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-15T22:11:59.759Z · LW(p) · GW(p)

That's, um, hardly my own innovation...

comment by swestrup · 2009-03-31T18:22:24.243Z · LW(p) · GW(p)

I'm only now replying to this, since I've only just figured out what it was that I was groping for in the above.

The important thing is not compression, but integration of new knowledge so that it affects future cognition and future behaviour. The ability to change one's methodologies and approaches based on new knowledge would seem to be key to rationality. The more subtle the influence (i.e., a new bit of math changes how you approach buying meat at the supermarket), the better the evidence for deep integration of new knowledge.

comment by Roko · 2009-03-15T20:40:00.584Z · LW(p) · GW(p)

An interesting idea would be to feed people the scientific data that ancient or medieval scientists had and see whether they reproduced all the incorrect but (given the limited knowledge) plausible theories that were invented.

This would work especially well on the vast numbers of people in our society who don't know any science anyway.

In fact just finding some sufficiently obscure area of current science would suffice. There's so much of it... How much of contemporary paleontology or inorganic chemistry could I re-invent?

I once succeeded in deriving the solution to the cubic and quartic equations invented by Tartaglia in the 1500's. It took me all day and almost the whole night, plus some tidying up the next day. I think it made a good exercise in rationality, and was harder than I thought it would be.

Related to this, many elite universities face the same kind of problem that we have here: they have to put a lot of effort into sorting the good rote learners from the creative thinkers in their crop of prospective students. To do this they have various entrance exams, plus interviews.

One way to test the effectiveness of rationality arts would be to offer rationality training to 17/18 year olds who were soon to be interviewed by these top universities. If our group did significantly better than one would have predicted based on their grades, then we would have a quantifiable and objective signal of effectiveness.

Replies from: simpleton, infotropism, Cameron_Taylor
comment by simpleton · 2009-03-16T23:37:22.155Z · LW(p) · GW(p)

I strongly second the idea of using real science as a test. Jeffreyssai wouldn't be satisfied with feeding his students -- even the beginners -- artificial puzzles all day. Artificial puzzles are shallow.

It wouldn't even have to be historical science. Science is still young enough that there's a lot of low-hanging fruit. I don't think we have a shortage of scientific questions which are genuinely unanswered, but can be recognized as answerable in a moderate amount of time by a beginner or intermediate student.

comment by infotropism · 2009-03-15T22:06:30.909Z · LW(p) · GW(p)

Just to mention in passing, when I read your particular example, my immediate thought was "right, I'd fail right away". Someone who sucks at math would probably find it very difficult to derive those solutions. Yet, I don't think that means they couldn't be rational. You'd have to take into account their personal skills and affinity in the scientific domain you're testing, and adjust for that.

comment by Cameron_Taylor · 2009-03-18T05:02:48.354Z · LW(p) · GW(p)

All the plausible but incorrect theories? Why? Guessing other people's inferior answers doesn't demonstrate rationality. It demonstrates empathy.

comment by Kaj_Sotala · 2009-03-15T18:00:26.606Z · LW(p) · GW(p)

Hmm. Some off the top of my head:

  • Look for studies that have recognized a certain bias, then use that information to come up with reasoning problems where the participants have to reach the correct answer without falling prey to the biases. Somewhat vulnerable to people studying to beat the test, though can potentially be defeated by creatively combining several different biases and applying them into new situations. Downside: coming up with lots of different scenarios where one may fall victim to biases is a lot of work. Perhaps come up with suitable computer games where success depends on avoiding biased behavior, and the scenarios can be automatically generated?
  • Calibration tests. These could be auto-generated, drawing on a far wider field of information than the current ones.
  • As the above two, but subjects are forced to write down their reasoning. This may be more helpful in making them reflect more on their reasoning, than for actual verification - somebody's train of thought can be very hard to interpret, since they'll never write down everything that influenced their decision.
Replies from: ciphergoth, Roko
comment by Paul Crowley (ciphergoth) · 2009-03-15T22:02:19.182Z · LW(p) · GW(p)

Somewhat vulnerable to people studying to beat the test

If the test is, say, a battery of experiments already performed that demonstrate the existence of various well-known cognitive biases, most people could not study to beat the test without improving their rationality to a significant extent if they tried.

comment by Roko · 2009-03-15T18:02:05.063Z · LW(p) · GW(p)

"Calibration tests."

  • ftw ignorance prior will game this
Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2009-03-15T18:07:54.851Z · LW(p) · GW(p)

I'm not sure I understand this comment.

Replies from: Roko
comment by Roko · 2009-03-15T18:29:26.829Z · LW(p) · GW(p)

Sorry, should have been clearer. I will make a note to devote less effort to humor and more to clarity in my comments in future...

A calibration test would consist, I presume, of questions of the form "estimate the value of parameter X, and then give upper and lower bounds U and L such that the probability of parameter X lying in [U,L] is 90%". You are "well calibrated" if the actual value of X is in [U,L] roughly 90% of the time.

But you can do very well on such a test by picking a good ignorance prior over the parameter space for X - for example the uniform distribution (if the set of values of X is a finite set) - sampling randomly from that distribution, and then randomly choosing U and L such that 90% of the probability mass is contained in [U,L] and your random guess is contained in [U,L]. On average, you will come out as well calibrated (if there is a statistics expert here, then please correct me if I'm wrong...), even though this procedure is really totally mechanical and doesn't involve any real thought. Someone who actually thought hard about what they thought the value of X was would inevitably be (at least slightly) overconfident and would do worse. See:

http://www.overcomingbias.com/2008/10/expected-creati.html
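A quick simulation of the gaming strategy, under the assumption that the quiz answers really are drawn from the same distribution as the ignorance prior (uniform on [0, 100] here): mechanically reporting the central 90% of the prior comes out well calibrated with no thought at all.

```python
import random

random.seed(0)
trials = 100_000
# Report the fixed interval [5, 95], i.e. the central 90% of the uniform prior.
hits = sum(5.0 <= random.uniform(0, 100) <= 95.0 for _ in range(trials))
print(hits / trials)   # close to 0.90, so the mechanical strategy looks calibrated
```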

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-15T19:15:34.819Z · LW(p) · GW(p)

Use Bayes-score (log of final joint probability) as primary outcome, measure calibration only secondarily.

Replies from: Roko
comment by Roko · 2009-03-15T19:25:23.817Z · LW(p) · GW(p)

... In which case you're measuring knowledge about the questions asked, not calibration. My little sister could beat you on such a test if it was about pop idol.

You could design a good test where the score was some combination of calibration and knowledge, such that someone with less knowledge but better calibration could outscore someone with better knowledge but poorer calibration.

Something like (calibration) * (Bayes Score), perhaps?

Nick suggested something like this:

http://www.overcomingbias.com/2007/01/a_game_for_self.html

He solves the problem that (e.g. Bayes Score) will test for narrow knowledge by suggesting that the questions be very general.

Replies from: MBlume, Eliezer_Yudkowsky
comment by MBlume · 2009-03-15T19:40:52.274Z · LW(p) · GW(p)

you maximize Bayes Score iff you use all your knowledge as well as possible. This seems to indicate that any perturbation will introduce an incentive not to do so.

Ask completely ridiculous things. Estimate the probability that the yearly rainfall in Ghana exceeds that of Switzerland. Ask questions like that, and you will learn something about how much true general knowledge a person has gained (and why not -- a rationalist should absorb more true general knowledge in X years on earth than a non-rationalist), but much more about the subject's ability to honestly estimate their own ignorance.

Replies from: Roko
comment by Roko · 2009-03-15T19:50:03.877Z · LW(p) · GW(p)

"you maximize Bayes Score iff you use all your knowledge as well as possible. "

  • yes, but in a test where you have no knowledge (e.g. Eliezer is a great rationalist but knows nothing about pokemon) this is unhelpful... This test would work well on ranking rationalists iff you had a set of general knowledge questions that you were confident everyone had roughly the same amount of knowledge about.
Replies from: Eliezer_Yudkowsky, MBlume
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-15T20:21:33.857Z · LW(p) · GW(p)

The test would also work statistically to measure the effect of an intervention, if you had more subjects than variance. A test with too much variance can't be organizational, but it can be experimental.

comment by MBlume · 2009-03-15T20:01:06.407Z · LW(p) · GW(p)

If you are asked about pokemon, AI design, 13th century chinese history, martian geology, german literature, Yankees batting averages, lyrics to popular songs from the 1820s, etc. you would be forced to get maximal mileage out of whatever knowledge you can bring to bear on each question, which would in most cases be slim to none.

If the questions are chosen randomly and eclectically enough, there should be no way to game the system, and scores should average out for people knowledgeable in different areas.

If you dependably know more than I do across a broad spectrum of subject areas, then I would assume that you have learned more than I have during your life so far, which seems to me to be symptomatic of good rationality.

Replies from: Roko
comment by Roko · 2009-03-15T20:22:44.581Z · LW(p) · GW(p)

"across a broad spectrum of subject areas ... questions are chosen randomly"

  • but this is the real weasel in there. Defining a good prior on "subject areas" is problematic. A very rational nerd would get wiped out if there are too many trivia questions... which is what happened to me just now on Tom's rationality test:

http://www.acceleratingfuture.com/tom/calibrate.php

Though my calibration on this test was very good, my Bayes Score was rubbish. Most of the questions were about America (cultural bias), and most were about people (subject-area bias). I like my idea of (calibration) * (Bayes Score).

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-15T19:26:01.755Z · LW(p) · GW(p)

Then use more obscure questions.

Replies from: daedalus2u
comment by daedalus2u · 2010-07-26T16:08:21.621Z · LW(p) · GW(p)

Test for data, factual knowledge and counterfactual knowledge. True rationalists will have less counterfactual knowledge than non-rationalists because they will have filtered it out. Non-rationalists will have more false data because their counterfactual knowledge will feed back and cause them to believe things that are false are actually true. For example, that Iraq or Iran was involved in 9/11.

What you really want to measure is the relative proportion of factual and counterfactual knowledge someone has, and in what particular areas. Then including areas like religion, medicine, alternative medicine, and politics in the testing space is advantageous because then you can see where the idea space is that the individuals are most non-rational in.

This can be tricky because many individuals are extremely invested in their counterfactual knowledge and will object to it being identified as counterfactual. A lot of fad-driven science is based on counterfactual knowledge, but the faddists don't want to acknowledge that.

A way to test this would be to see how well people can differentiate correct facts (data) from factual knowledge (based on and consistent with only data) from counterfactual knowledge (based on false facts and not consistent with correct facts) from opinion consistent with facts or opinion consistent with false facts.

An example: in the neurodegenerative disease of Alzheimer's, there is the association of the accumulation of amyloid with dementia. It remains unestablished whether amyloid is a cause, or an effect, or is merely associated with dementia. However, there have been studies where amyloid has been removed, via vaccination against amyloid and clearing of amyloid by the immune system, with no improvement.

I imagine a list of a very large number of statements to be labeled as:

  1. true (>99% likelihood)
  2. false (>99% likelihood to be false) [edited to improve definition of false]
  3. opinion based on true facts
  4. opinion based on false ideas
  5. no one knows
  6. I don't know

A list of some examples:

  • Iraq caused 9/11: 2
  • WMD were found in Iraq: 2
  • Amyloid is found in Alzheimer's: 1
  • Amyloid causes Alzheimer's: 2 (this happens to be a field I am working in, so I have non-public knowledge as to the real cause)
  • Greenhouse gases are causing GW: 1
  • Vaccines cause autism: 2
  • Acupuncture is a placebo: 1
  • There is life on Mars: 5

You don't want to test for obscure things; you want to test for common things that are believed but which are wrong. I think you also want to explicitly tell people that you are testing them for rationality, so they can put themselves into “rational-mode” (a state that is not always socially acceptable).

The table-like lists look fine in the edit box but not fine once I post. :(

Replies from: arundelo
comment by arundelo · 2010-07-26T22:28:37.582Z · LW(p) · GW(p)

http://daringfireball.net/projects/markdown/syntax

I'm not sure what effect you're     !
going for, but indenting by four    !
spaces allows you to do things like !
this.                               !
Replies from: daedalus2u
comment by daedalus2u · 2010-07-26T23:15:44.578Z · LW(p) · GW(p)

Thanks, I was trying to make a list; maybe I will figure it out. I just joined and am trying to focus on getting up to speed on the ideas; the syntax of formatting things is more difficult for me and less rewarding.

Replies from: arundelo
comment by arundelo · 2010-07-26T23:42:38.717Z · LW(p) · GW(p)

There's also a help link under the comment box.

* Bullet lists look like this.

1. Ordered lists look like this.
Replies from: daedalus2u
comment by daedalus2u · 2010-07-26T23:48:42.498Z · LW(p) · GW(p)

Yes, thank you. Just one problem:

  • too obvious

and

  • too easy
comment by Emiya (andrea-mulazzani) · 2020-12-12T17:30:34.099Z · LW(p) · GW(p)

Something the masters (and students) of each school can do to keep it real:

The Winning Tournament: Organise a yearly or so event. A group of clever, evil people selects and creates a number of "games" or tests, if you'd rather. Wannabe masters of rationality can compete against each other for the title, pride and glory.

The types of games and tests should be kept varied. Some could be contests where participants randomly compete against each other; others might be battle royales where people can form alliances and all-around try as hard as they can to win. Others are just games where there is a correct solution to be figured out from clues: the faster you get there, the more points you gain, but if you get to the wrong solution, well, that's really worse than having been slow.

Physical abilities and expertise in fields should be kept out of these games if possible, since they are powerful "noise" sources. 

The tests should be really hard, so a committee of clever, evil people who approach the task of creating the games with a certain degree of evil glee is recommended. They can get inspiration from real-life problems that have caused disasters and that routinely mess up experts. Games that are similar to problems people can meet in real life are recommended (for example, having to mediate an agreement between other groups of participants, or having to work from nothing more than the description of a situation).

A good example was the Darwin Game here on LessWrong; though it seriously advantaged programmers, it had the effect of sparking a lot of interesting plans.

For organising it... if you keep things "fun" enough, it should be doable to find interested people. If you keep things "amazing" enough, rationality might even start to get a bit of an "awesomeness" reputation from it. It seems possible to even organise it online. You could charge a small fee to participate, to cover expenses (something proportionate to the "fun" you'd get from participating), and perhaps use a part of it as prize money for the winner.

A group of intelligent people acts as "graders" of the game, to create an individual score that would hopefully be more or less consistent between tournaments, so you'd know how much you've improved since the last one.

 

Clearly, this is just as good as the games and tests that are put into it. The crucial questions would be how much victory in a game correlates with rationality skill rather than with more specific skills, and how much it correlates with being able to "win" at common real-life problems. But I think the committee could have fun in creating said games, and I'd expect people skilled in thinking and rationality to be advantaged. Raw native intelligence would be a major source of noise, but that can't really be helped, I think.

 

 

Something you could do to test a hundred students

A number of "games" or "situations" where a mathematical solution is calculated is devised, and such games are handed in a yearly test people can do online. 

The betting game with the lights in "A Technical Explanation of a Technical Explanation" is a good example. People are asked to bet each round, with evidence coming from a mathematical simulation of the game being provided bit by bit. More than one scoring function can be used for such betting games, so players also have to work out what the best winning strategy is depending on the circumstances (a toy version is sketched below).
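A toy version of such a sequential betting game, scored with the log rule; the hidden coin-bias setup and the two players are illustrative assumptions, not the game from the original essay.

```python
import math
import random

def run_game(player, rounds=200, seed=0):
    """Player states P(next flip is heads) each round; scored by the log rule."""
    rng = random.Random(seed)
    bias = rng.choice([0.25, 0.75])           # hidden rule, unknown to the player
    score, heads, tails = 0.0, 0, 0
    for _ in range(rounds):
        p = player(heads, tails)               # bet, based on evidence so far
        flip = rng.random() < bias
        score += math.log(p if flip else 1 - p)
        heads, tails = heads + flip, tails + (not flip)
    return score

def bayesian_player(heads, tails):
    # Posterior probability that the bias is 0.75, starting from a 50/50 prior.
    like_hi = 0.75 ** heads * 0.25 ** tails
    like_lo = 0.25 ** heads * 0.75 ** tails
    post_hi = like_hi / (like_hi + like_lo)
    return post_hi * 0.75 + (1 - post_hi) * 0.25

print(run_game(bayesian_player))               # typically well above...
print(run_game(lambda h, t: 0.5))              # ...the flat fifty-fifty bettor
```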

Deduction games can also be played, where you have to use the clues to reach the answer (again, we are supposing types of games where we can calculate the exact or most likely solution given the evidence provided).

For such games, elaborated rules can be devised, not necessarily going with simple random distributions. Perhaps some scoring system for correctly stating the rule could be used? Occam's Razor might apply to how likely you are to have a complex rule vs. a simple rule.

 

Other possible games can be devised in decision theory, with the task of maximising a given utility function; predicting the moves of other entities can be incorporated as well. For extra fun, some of these games could be connected: a certain number of players are asked to choose the move of entity A in a scenario, and another number of players are asked to choose the move of entity B in the same scenario; the distribution of the players' decisions is then used to determine A's and B's moves, and each player's final score in that scenario depends on their individual decision.

Arguments with fallacies, biases or other errors of cognition can be provided (not every test will contain one, and participants won't know how many there are). Participants would choose the correct option from a long list of options, and would also have to mark the number of the line where the fallacy appears, so that the scoring system can be automated.

We can also put in some hard logic exercises for good measure.

Participants receive a global score and more detailed ones, which are determined using time and precision.

 

Note: I think this kind of test measures more the "theoretical skill" of the participant, and not whether they can apply it in real life.

 

Something you could use as a test even if people have an incentive to game it

Class project: a group of rationalists choose an ongoing problem in a particular field. A group of experts in the field is selected to act as judges. 

A number of rationalists offer to collectively tackle the issue. When they sign up for the game they specify the number of hours they can put into this (a minimum number is required for participation). Applicants are required to specify why they think they could help solve that particular problem; those who aren't judged suited to the task might be refused by the judges.

Ways to communicate with each other are provided. The participants have to organise the work between themselves, and be able to propose a solution, analysis, or something else that would constitute progress in said field. Each participant or group of participants will be assigned a task based on the number of hours they offered to put in; if they can't complete their tasks, this negatively affects their score (which also depends on the number of hours). Of course, groups are expected to be able to take care of such problems by themselves.

When the final work is produced (which has to meet the standard requirements of its field), it is graded by the judges, to give a "quick score".

More importantly, attempts are made to put the work under the scrutiny of the field it belongs to (for example, by publishing it in a scientific journal). If this scrutiny goes well, it goes into the "real score". Progress made in solving said problem is monitored. If the work produced by the rationalist group ends up being right, it goes into the "real score". If the project leads to solving the problem, it super goes into the "real score" and badges or something are handed out to everyone who worked on it and received an adequate score.

comment by Sunny from QAD (Evan Rysdam) · 2020-05-05T06:43:46.099Z · LW(p) · GW(p)

Stupid idea: Have a handful of students from each school volunteer to be assigned extremely difficult, real-world tasks, such as "become an officer at Microsoft within the next five years". These people would be putting any other of their life plans on hold, so you'd need to incentivize them with some kind of reward and/or sense of honor/loyalty to their school.

comment by Anomal3 · 2019-12-16T17:39:32.775Z · LW(p) · GW(p)

I doubt a few minutes of pondering will provoke any significantly insightful thoughts, but on the off chance that they do here's what I've got:

A major pitfall of most tests is that they can end up examining a wide variety of confounding variables. For example if the test for rationality is based on a written prompt then it selects against those with dyslexia in spite of their rationality. If it's based on a spoken prompt then it selects for those with similar accents to the test-giver, or against those who had it read to them in a strange way. Ideally since the thing that we're selecting for is (I assume) practical reasoning skills, we would want the test to have some similarities to real life.

Thus the thought that comes to mind is an escape room which can be set up and run essentially identically for each participant, whose puzzle elements require you to make Bayesian updates on multiple propositions whose prior likelihoods you were given at the start. In order to avoid biasing the test in favor of those with more general knowledge, the propositions would ideally be totally fictitious. It occurs to me that the elements of real-world pressure and communication would bias the test against those prone to anxiety, but given that that's a common problem when you're called on to apply your rationality skills in reality, I think that may be an acceptable flaw, if no other options are obviously superior.

comment by Klao · 2011-09-13T19:52:48.320Z · LW(p) · GW(p)

Two ideas I got after 5 minutes (by the clock :)) thinking.

If the tests are stressful and mentally (and possibly physically) exhausting, then even if it is still possible to prepare just for the test, it will not be as far from preparing for the "real thing". So, something like Initiation Ceremony could be done periodically and not just for initiation.

Give the students "stories" and see if they can make heads or tails of them. (How accurately can they guess the omitted details? Can they predict how the story continues? Etc.) But where can you get real stories? An authored story is very limited in usefulness for this.
The idea: we have court cases. A lot of them, in all kinds of domains, dating back centuries. And they are very real; even if something is distorted (fake evidence, false testimony), it's done by someone for some concrete reason, which can be analyzed rationally. This might require learning some law, but even without formal training many non-domain-specific cases can be understood with moderate work. And law is one of the oldest applications of human rationality.

Both of the ideas are mostly applicable to the second use-case: measuring a bunch of students in a school, but not good for comparing schools or designing a standardized "rationality test".

comment by beoShaffer · 2011-07-11T03:31:15.032Z · LW(p) · GW(p)

I should note that, per EY's request, I haven't read the other comments before posting, so sorry if I duplicate anything.

The ability to make predictions in advance seems like one of the most important, and (assuming you have enough time) one of the easiest-to-test, measures of rationality. For the experimental and potentially the organizational level, success on the prediction markets seems like an obvious choice, which also has the benefit of showing how good the person is at avoiding certain money-related biases. There would of course need to be some controls in terms of equal access to capital and access to information, but I think we can work that out.
At the reputational level, set up something like the David Brin prediction wiki (http://earthbydavidbrin.pbworks.com/w/page/15607657/Predictions) for EY, though in this case we would be focusing on predictions he explicitly makes rather than stuff culled from a work of fiction.

comment by zaph · 2009-03-16T19:33:42.986Z · LW(p) · GW(p)

Maybe something that tests "certainty faking"? I really don't know how to construct it, per se; maybe use a FACS test to see how much a person is trying to convey that they're very certain of something when they aren't. That would just catch conscious faking, of course; you'd still need something to assess when someone's expressed feeling of certainty outruns the data. Maybe something like Texas Hold 'Em, except with bets being placed on how accurate the stated probabilities are (e.g. randomized variations of situations like the cancer scenario at EY's Bayes page)?

Sorry if I'm not articulating this well, hopefully it's good enough to live up to the stupid idea criteria, if not the good idea. Oh, and I didn't read any of the comments, so I don't know if this has been suggested.
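One way the betting on probabilities could be settled is with a proper scoring rule rather than poker-style pots; a sketch, assuming (my assumption) a logarithmic score so that stating your honest probability maximizes expected payoff:

import math

def payoff(stated_probability, event_happened):
    # Logarithmic score: reporting your true belief maximizes expected payoff.
    p = stated_probability if event_happened else 1 - stated_probability
    return math.log(p)

print(payoff(0.85, True))    # mild loss relative to having been certain
print(payoff(0.85, False))   # heavy penalty for a confident miss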

Replies from: Cameron_Taylor
comment by Cameron_Taylor · 2009-03-18T04:56:38.536Z · LW(p) · GW(p)

Texas Hold 'em is suitable.

comment by JulianMorrison · 2009-03-16T03:41:53.852Z · LW(p) · GW(p)

I'm reminded of your own introduction to Bayes. Even a really good test won't do a darn bit of good if rationalists are vanishingly rare.

comment by Kaj_Sotala · 2009-03-15T22:23:14.626Z · LW(p) · GW(p)

There are lots of proposals which basically say, let somebody predict the development of a situation they're previously unfamiliar with. But that'll probably be very heavily a test of IQ, and while rationality would certainly help your performance in such scenarios, it seems to me that IQ will regardless be a bigger factor. Same with using real-life performance as a factor.

I'm not opposed to using such scenarios, and I proposed something like that myself, but I do think that the scenarios have to be specifically designed so that they're likely to trigger known biases (even if in a subtle way). You can't just use totally random historical events or police cases.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2009-03-15T22:35:22.783Z · LW(p) · GW(p)

If the situation contains enough biasing factors, you'd need to be able to use the craft in order to correct for that, not just comprehend the situation. The situation should be simple enough for most people to notice the important details, if they know where to look.

comment by Mario · 2009-03-15T17:58:55.416Z · LW(p) · GW(p)

I get the feeling that the real problem here is repeatability. It's one thing to design a test for rationality, it's another to design a test that could not be gamed once the particulars are known. Since it probably isn't possible to control the flow of information in that way, the next-best option might be to design a test so that the testing criteria would not be understood except by those who pass.

I'm thinking of a test I heard about years ago. The teacher passes out the test, stressing to the students to read the instructions before beginning. The instructions specify that the answer to every question is C. The actual questions on the test don't matter, of course, but it's a great test of reading comprehension and the ability to follow instructions. Plus, the test is completely repeatable. All of the test questions could leak out, and still only those who deserve to pass would do so. If you are willing to assume that people who pass would not be willing to cheat (unlikely in this test, possible in a rationality test), then you would have an ungameable test.

A rationality test in this model might be one where an impossible task is given, and the correct response would be to not play.

Replies from: HA2, MBlume, handoflixue, MichaelHoward
comment by HA2 · 2009-03-15T21:01:00.217Z · LW(p) · GW(p)

I don't think that it's reasonable to expect that secret criteria would stay secret once such a test would actually be used for anything. Sure, it could be kept a secret if there were a dozen people taking the test, of which the four who passed would get admitted to an exclusive club.

If there were ten thousand people taking the test, a thousand of whom passed, I'd bet there'd be at least one who accidentally leaks it on the internet, from where it would immediately become public knowledge. (And at least a dozen who would willingly give up the answer if offered money for it, as would happen if there were anything at stake in this test.) It might work if such a test is obscure enough or not widely used, but not if it was used for anything that mattered to the test-takers and was open to many.

Replies from: Mario
comment by Mario · 2009-03-15T22:03:17.342Z · LW(p) · GW(p)

True, but I think that would be a problem with any test. I'm just trying to find a way around it since I think that as you add ways to avoid gaming, you both complicate and weaken the test. Perhaps a solution would be to test people without their knowledge, and reveal whether they succeeded or not at a later date.

comment by MBlume · 2009-03-15T19:35:33.163Z · LW(p) · GW(p)

A rationality test in this model might be one where an impossible task is given, and the correct response would be to not play.

Kobayashi Maru?

Replies from: MichaelHoward
comment by MichaelHoward · 2009-03-15T19:48:30.316Z · LW(p) · GW(p)

Global Thermonuclear War?

Replies from: MBlume
comment by MBlume · 2009-03-21T04:44:09.512Z · LW(p) · GW(p)

Well, only because the computer's search tree didn't include the "teleport giant psychic squid" action ;)

(spoilers behind link)

Replies from: handoflixue
comment by handoflixue · 2011-07-15T23:29:31.457Z · LW(p) · GW(p)

Thank you for making my day :)

Replies from: MBlume
comment by MBlume · 2011-07-16T19:38:04.571Z · LW(p) · GW(p)

^_^

comment by handoflixue · 2011-07-15T23:28:13.498Z · LW(p) · GW(p)

"Psssst, when Mrs. P says to read the instructions, it's because it's a fake test! If you just follow the directions you can get an A without even trying!"

And that is that test ruined for all subsequent classes. People may not read instructions, but they will generally listen to peers highlighting that there is something unusual, or some easy way of cheating. Heck, it might become a weird group wisdom to always answer "C" because the answer scanning machine is broken or something. I've seen weirder in actual workplaces.

comment by MichaelHoward · 2009-03-15T18:20:15.309Z · LW(p) · GW(p)

The instructions specify that the answer to every question is C.

Isn't that more a test of attention to detail and willingness to follow instructions rather than rationality per se?

Replies from: Mario
comment by Mario · 2009-03-15T18:39:21.855Z · LW(p) · GW(p)

Yes. I wasn't offering that particular formulation as a rationality test, just the idea that you should hide from the testee the nature of the test.

comment by biochem06921 · 2009-03-15T17:41:27.256Z · LW(p) · GW(p)

As R.A.W. has said, "The more you see yourself acting like a cosmic schmuck, the less of a cosmic schmuck you will become." I think it is very important that the environment stresses awareness of moment-to-moment actions and thoughts. If not, I think decent application of the knowledge of rationality will be very hard indeed.

If this is an important aspect of your 'school', then I think it would be hard to game the system without actually learning what is supposed to be learned. This would especially be true when it is part of the reputation hierarchy. Sure, some could mimic to gain status, but others with actual awareness would see through them easily.

comment by gilch · 2016-04-22T02:26:25.662Z · LW(p) · GW(p)

I seem to be years late to this party, but I've heard the LW culture isn't opposed to commenting on old posts. In the interest of "breadth" I'll answer anyway after at least five minutes of thought, without looking at the other answers first (though I've probably seen subsequent posts that have been influenced by this one by now).

So there are three categories of tests here. In order of strictness: those for masters, those for students, and those for employees?

There are many skills under the "rationality" umbrella. Enumerate them and test separately. Maybe there are some we don't know yet. How do we test for those? There's also a difference between epistemic and instrumental rationality. Epistemic seems easier to test and is probably required for instrumental. But instrumental is what we really want. Some of my test suggestions will only test a part of "rationality".

Schools and science have a lot of experience measuring things like this. Can we learn from them?

Every test I've come up with seems to be in one of two categories: toy problems, or real-life problems. The real-life problems are better for the masters, perhaps; and the toy problems for the students. The toy problems are less real, but more replicable. I thought we're supposed to hold off on proposing solutions to avoid attractors like categorization prematurely limiting our scope. But we've been asked to brainstorm. Can we break out of these categories?

Some Ideas:

  • Give the students a sum to invest in a small business, and a time limit, then see how much they make. Require strict record keeping to prevent cheating. Noisy.

  • Give them a sum to invest in a prediction market, then see how much they make.

  • Use more direct calibration tests. Make students give probabilities for things. See how often they're right.

  • A student must catch specific examples of cognitive errors/fallacies in a video. (Arguably the important part is to catch one's own errors, and the ability to find others' errors doesn't prove that.)

  • Make a student write an essay before the term. The instructor will find examples of cognitive errors in it, but keep it secret. Then after the term, the student must review his essay and find as many errors in his former thinking as possible. This will measure personal improvement, but might not help measure relative to peers, since they're all taking different "tests".

  • SAT-style multiple-choice exam. This can test knowledge of the material, and synthesis too (or so the test writers claim) to a limited extent.

  • Like the three integers test, the master can play the role of nature while the students play the role of a scientist, trying to figure out a simple rule by "experiment". Grading can be on the number of questions asked, the time taken, the difficulty of the rule, or the number of these questions answered correctly. The instructor must be strictly forbidden from giving hints that could ruin the results. This is actually very similar to debugging software. Maybe this kind of test could be computerized, with "nature" as an opaque program and students writing code that interacts with it as their "experiments" (a minimal sketch follows this list). They then may have to write code to emulate the rule. If it passes the unit tests, a human instructor can confirm whether it implements the same rule. This can also give students a feel for what it's like to do science correctly.

  • Competitions where students program AIs to win at game-theory-inspired challenges. See how the AIs compare to well-known strategies. May be hard to keep challenges secret. Could the payoff grid be randomized?

  • Life outcome survey over years. Are they "winning" more versus control group? May be hard to define. Slow. We should do this, but we shouldn't wait for it before developing the program.

  • Masters can actually try to accomplish something. Maybe improve life outcomes in a third-world country or something. To be meaningful, it would have a control group, competitors, a time limit and a budget.
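A minimal sketch of the computerized "nature as an opaque program" test from the list above; the hidden rule, the query budget, and the grading scheme are all my own placeholders:

import random

def hidden_rule(x, y, z):
    # The secret regularity the student must discover, e.g. "strictly increasing".
    return x < y < z

# The student interacts with "nature" only through queries like these, each one counted.
queries = [(2, 4, 6), (6, 4, 2), (1, 2, 3), (3, 3, 3)]
print({q: hidden_rule(*q) for q in queries})

# After experimenting, the student submits their guessed rule as code; it is graded on
# agreement with the hidden rule over random probes, plus the number of queries spent.
def student_rule(x, y, z):
    return x < y < z

probes = [tuple(random.randint(-10, 10) for _ in range(3)) for _ in range(1000)]
agreement = sum(student_rule(*p) == hidden_rule(*p) for p in probes) / len(probes)
print(f"agreement={agreement:.3f}, queries used={len(queries)}")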

comment by Regex · 2015-10-11T03:49:10.699Z · LW(p) · GW(p)

Generate a fantasy world with certain rules of magic. The goal is to figure out precisely what those rules are, all the while working towards some end goal. Perhaps this could be run by a handful of game masters who know exactly what the rules are supposed to be, or the magics could be input into a computer program, so no one knows for sure. One would promise to keep the rules secret once figured out. This would encourage proper hypothesis testing and thoughtful use of evidence, especially if resources are limited. I suspect this wouldn't just be a one-off, but a repeatable exercise if one had multiple worlds or the ability to arbitrarily generate the system. Perhaps one could engage in duels using the uncovered magics, which would encourage creativity by applying them in different ways. I'd imagine one could use systems much like those in role-playing games, but qualitatively based, perhaps?

There was a card game based around hypotheses in a class I took once which I've improved upon somewhat here: https://lordregexrationalist.wordpress.com/2015/09/29/rationalist-belief-card-game/

comment by Dues · 2014-06-09T03:51:04.724Z · LW(p) · GW(p)

If rhetoric is the dark arts, then rationalists need a defense against the dark arts.

I've always seen debates as a missed opportunity for rationality training/testing. Not for the debaters, but for the audience.

When you have two people cleverly arguing for an answer, that is an opportunity for the audience to see if they can avoid being suckered in. To keep things interesting, you could randomize the debate so that one, both, or neither debater is telling the truth. (Of course, in the toughest debates, the debaters are both partially right and the audience needs to figure out what the real answer is.) And if we want to keep the students from compartmentalizing what they have learned, we probably need to make the debates a mix of real-world and abstract debates. We might also have easy, medium, and hard difficulty debates, but not tell the audience beforehand.

I think that this would be a useful thing, because lots of places already have debate clubs and public debates. All we would need is an audience game running in the background.

I think that the most helpful part of the lesson would come after the debate, once the audience has been scored on their confidence intervals: if we can get the debaters to explain the rhetorical tricks they used, the audience can recognize them in the future and hopefully not fall for them a second time.

Replies from: ChristianKl
comment by ChristianKl · 2015-10-11T11:56:29.501Z · LW(p) · GW(p)

I don't think your model of the nature of debate is good. Most rhetorical strategies aren't tricks in the sense that they have no basis at all.

Replies from: Dues
comment by Dues · 2015-10-30T01:22:55.757Z · LW(p) · GW(p)

I suspect you are right. But still, lying and tricking people is a skill, and I know where I can learn to practice it (debate clubs). Are there courses for the skill of detecting lies and tricks? All I can think of offhand is those FBI courses on microexpressions and maybe playing lots of poker. It feels like there's a currently unfilled market for defensive techniques.

Replies from: ChristianKl
comment by ChristianKl · 2015-10-30T09:22:10.145Z · LW(p) · GW(p)

I remember one left-wing person who was in favor of Barack Obama before the presidential debates but switched to being against him after seeing body language of Obama's that indicated lying. In my experience, the number of people whose body-language-reading skills are developed to the degree that they make actual decisions like this is quite small.

In that case it's not only an extraordinary skill at reading body language; it's also the skill of not getting mindkilled at all. It took me till Barack Obama announced Rahm Emanuel as his chief of staff to understand that Obama wasn't planning on creating real change. That happened to be the week after the presidential election.

I know from talking to people with political experience that personnel decisions are very important, and thus one could use that signal to detect the tricking in which Barack Obama engaged.

In both of those cases, the special knowledge needed to interpret a signal that most people wouldn't perceive, together with the confidence to trust that signal, allowed the conclusion that Barack Obama wasn't the person he presented himself as.

Apart from the knowledge, not getting mindkilled and emotionally entangled is hard. If you care strongly about the outcome being a certain way, you are less likely to spot errors.

I remember (but unfortunately have no source) that the US Secret Service is best at detecting lies. A Secret Service person who guards an important figure has to assume that most of the people he comes into contact with are no threat but he still has to check them for being a possible threat. That's better for learning lie detection than the setting of a policeman who interrogates a person he believes to be lying.

But still, lying and tricking people is a skill, and I know where I can learn to practice it. (Debate clubs)

Debating doesn't focus on the skill of changing other people's minds but on training the skill of saying something that a judge judges to be correct. That's a different skill. Trained debaters talk fast. Trained hypnotists talk slowly, to allow for emotional processing and changes in beliefs to happen. Oprah talks slowly and then repeats a statement that has an emotional effect in order to further that effect. From a debating perspective, that's wasting time.

Replies from: Jiro
comment by Jiro · 2015-10-30T15:04:44.537Z · LW(p) · GW(p)

I thought the Secret Service was pretty notorious for considering everyone who reports a threat to be a threat, with the advice being that you should never inform the Secret Service of anything. If anything, their lie detection is askew.

Replies from: Vaniver
comment by Vaniver · 2015-10-30T18:49:20.038Z · LW(p) · GW(p)

This is from Ekman's work on lie detection. He thinks that this comes from dealing with crowds--the SS spends much more time looking at different faces trying to detect emotions / intent to harm, and thus actually has practice at distinguishing faces, rather than considering one person for extended periods of time (like normal interrogations). It isn't a commentary on how they respond to reports.

comment by handoflixue · 2011-07-15T23:50:46.453Z · LW(p) · GW(p)

http://lesswrong.com/lw/3h/why_our_kind_cant_cooperate/

I can't help but think the focus on competition is a fairly bad idea. If a student can raise the entire group's score by 10%, that is far more commendable than raising their own individual score by 20%. We don't want high-scoring individuals, we want to win. That's something which is quite often done as part of a group, in the real world.

comment by rysade · 2010-10-23T22:50:23.859Z · LW(p) · GW(p)

When I started thinking about this I realized that testing for rationality is pretty complicated! The hardest part about it is determining the 'most rational person' in a group. If the 'most rational person' is a member of the group being tested, how can the testers determine who they are if the testers are less rational than them? Does a tester's ability to recognize the best of the test group depend on whether the tester is biased, and how they are biased? And who would test the testers, then?

Regardless, here's an idea or two.

A multilevel test: biases may be fairly easy to test for, in general. They are relatively well defined. Someone who is known to be fairly unbiased in one respect or another could run tests for that bias.

An experimental test: learning ability could be tested. A list of 100 skills, facts and methods could be presented to individuals in a test group. Some of the items on the list would be false, untenable or in some other way illogical. Members of the group would have to learn the useful ones, thus demonstrating their ability to overcome any biases they may have against learning this-or-that concept, trait, skill or whatever. They would also have to not learn the bogus ones, demonstrating their ability to recognize bad ideas, harmful methods, false facts, etc. Groups that did well in the experiment could be held up as examples of how one ought to approach learning.

comment by RobinHanson · 2009-04-01T01:14:53.156Z · LW(p) · GW(p)

I just suggested a relevant rationality test here: http://www.overcomingbias.com/2009/03/how-spend-rationality-test.html

comment by CarlShulman · 2009-03-15T21:42:16.014Z · LW(p) · GW(p)

Experimental methods for measuring rationality can be converted into organizational tools through the measurement of biological traits that are minimally malleable. For instance, you could map genomic and brain structure information to experimental tests of particular biases or bias-promoting traits, and then use those biological markers as ungameable indicators. Unfortunately, while this could help organizations get more rational employees (possibly deriving economies of scale), it would be much less useful for measuring improvement.

comment by [deleted] · 2009-03-15T20:36:00.683Z · LW(p) · GW(p)

deleted

Replies from: patrissimo
comment by patrissimo · 2009-03-21T22:16:37.650Z · LW(p) · GW(p)

I like the idea of using games, but I worry that people would learn to get good at the specific games or game-space, especially if there are few of them. Specializing in a certain logic puzzle != being rational. Also there is the issue others mentioned that performance under stress is a big part of rationality.

comment by MichaelHoward · 2009-03-15T20:12:32.647Z · LW(p) · GW(p)

Vladimir Gritsenko mentioned Rational Debating on an old post. It looks like it would be a useful addition to the list.

Replies from: steven0461
comment by steven0461 · 2009-03-17T15:04:35.591Z · LW(p) · GW(p)

As the post mentions, RD participants have an incentive to argue dishonestly. They also have little incentive to say anything informative at all. To solve this, I'd propose Paranoid Debating: everyone is scored on the correctness of a team estimate, except for one participant who's secretly designated an Advocate and one participant who's secretly designated a Naysayer. The Advocate gets more points for higher team estimates and the Naysayer gets more points for lower team estimates. Variants: give points for figuring out who the A and N are, or let it be known publicly.
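A possible scoring sketch for that variant (the exact point formulas below are my own illustration, not part of the proposal):

def score_round(team_estimate, true_value, role):
    if role == "advocate":
        return team_estimate          # rewarded for pushing the team estimate up
    if role == "naysayer":
        return -team_estimate         # rewarded for pushing the team estimate down
    return -abs(team_estimate - true_value)   # everyone else wants the estimate to be accurate

# Example: the quantity being estimated is "population of Liechtenstein, in thousands" (true ~39).
for role in ("honest", "advocate", "naysayer"):
    print(role, score_round(team_estimate=45.0, true_value=39.0, role=role))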

comment by Marshall · 2009-03-15T20:10:36.488Z · LW(p) · GW(p)

Make a very detailed audit of the habits, hobbies, books, music, shoes, watch, cell-phone etc. etc. etc. of the top/average/bottom contributors to LW. Are there correlations? Match to new candidates.

Replies from: MBlume, MichaelHoward
comment by MBlume · 2009-03-15T20:15:03.867Z · LW(p) · GW(p)

you're assuming LW karma is itself a good test of rationality...

Replies from: gjm
comment by gjm · 2009-03-15T20:39:03.097Z · LW(p) · GW(p)

Which Marshall is on record as not believing, so I guess he's poking fun here.

comment by MichaelHoward · 2009-03-15T20:14:18.037Z · LW(p) · GW(p)

I expect the causation would be in mostly the wrong direction.

comment by Scott Alexander (Yvain) · 2009-03-15T19:42:07.486Z · LW(p) · GW(p)

Here is a stupid one: Detective stories. Like Encyclopedia Brown, but subtler. And with false leads. I don't think normal mass-market detective stories would work, because they may try to deliberately choose an irrational answer to surprise you. But special ones written by rationalists for rationalists could be a fun distraction if nothing else.

Replies from: rwallace
comment by rwallace · 2009-03-15T20:25:49.976Z · LW(p) · GW(p)

That still has the problem that it doesn't test for lack of bias, but for having bias that matches that of the people who wrote the stories. I suggest instead using real cases - and not taken from the media, because that means selection bias, but taking all the cases from the files of a particular police department during a particular span of time.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-15T20:27:11.445Z · LW(p) · GW(p)

What true stories can we use besides police cases? (Also, note that in this case you're only testing for being as smart as the police or making the same judgments as the jury - even using cases with a confession may get you false confessions.)

Replies from: CarlShulman, rwallace
comment by CarlShulman · 2009-03-15T21:10:58.527Z · LW(p) · GW(p)

You can take cases with enough evidence to overdetermine the result, and then subtract pieces.

comment by rwallace · 2009-03-15T20:39:27.827Z · LW(p) · GW(p)

Point. Still, we've been recording lots of different kinds of events for a long time. Off the top of my head, other kinds of historical data that could be useful here:

Medical cases, minor scientific controversies, engineering projects, battles, the stock market, markets in general, expeditions.

comment by CannibalSmith · 2009-03-15T19:13:45.717Z · LW(p) · GW(p)

Role play. Build a corpus of fictional scenarios too big to memorize and present a random subset in the test.

Also, standard tests on rationality lore and mathematics would work to a degree because they're correlated with actual rationality.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-15T19:17:13.685Z · LW(p) · GW(p)

If they're fictional scenarios, then you're matching the taste of the students (in fictional answers) against the taste of the teachers; that may work to propagate a school, but how do you keep the tastes real?

Replies from: Yvain, CannibalSmith
comment by Scott Alexander (Yvain) · 2009-03-15T22:02:57.768Z · LW(p) · GW(p)

Then use scenarios that actually happened. From history, business, people's personal lives, et cetera. For example: "Here is a brief description of the Byzantine Empire in 1200. The Emperor decided to change the tax policy in the following way. Predict what happened." Gives an unfair advantage to anyone who knows a lot of history (or in this case economics), but if you vary the cases enough and use little-known enough examples you might be able to control for that.

Another example: "Here's a psych profile of my friend John, and a psych profile of his girlfriend Sally. They started dating ten years ago. Predict what happened."

comment by CannibalSmith · 2009-03-15T21:07:35.203Z · LW(p) · GW(p)

I'm sorry, I don't understand your question.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-03-15T21:15:01.049Z · LW(p) · GW(p)

The "right answer" in the fictional scenarios is determined by the teacher. So you're testing the degree to which the student matches the teacher, not the degree to which the student matches reality.

Replies from: Lightwave
comment by Lightwave · 2009-03-16T14:27:34.421Z · LW(p) · GW(p)

How can you be sure that in the historical scenario, the Byzantine Emperor actually did the "right thing", i.e. he wouldn't have done better by doing something else? It's the teachers who have to decide that. Also, what if the Emperor got the "right answer" for the wrong reasons, and the student also got the "right answer" for the wrong reasons? It's up to the teacher to decide that as well. The best thing you can do is have several groups of rationalists selecting the scenarios and verifying the students' answers, but ultimately, when using either real life or fictional scenarios, you're comparing the teachers to the students.

Same thing with measuring "success" of people in real life. They could've arrived at the correct answer for the wrong reasons, it's up to the teachers to decide whether the reasons were right or wrong, i.e. whether they were actually rational or just lucky.

In order to assess the rationality of the students you need to use the sort of things/tests that convinced you that the teachers are rational in the first place. The same things that make the teacher's tastes real can be matched against the student's tastes.

comment by vizikahn · 2009-03-15T18:57:19.704Z · LW(p) · GW(p)

What we need is a rationality equivalent of a katana or a machine gun. One for each student, some basic training and even ninja masters go down pretty quickly (unless they really can dodge bullets). Occupatio "weapon of mass rationality".

Replies from: Johnicholas, Tom_Talbot, swestrup
comment by Johnicholas · 2009-03-16T01:39:11.609Z · LW(p) · GW(p)

Software tools for rationality, decision support systems, might very well be more valuable than extensive personal training in rationality.

comment by Tom_Talbot · 2009-03-17T17:05:59.172Z · LW(p) · GW(p)

Perhaps the notion of an 'art of rationality' is completely misguided. Why are we relying on the skills of individual people who evolved to be irrational when systems can be built for the purpose of giving rational answers? Why walk to the answer when you can drive?

Replies from: MichaelHoward
comment by MichaelHoward · 2009-03-17T19:36:19.104Z · LW(p) · GW(p)

Following this analogy, you still need people to get good at driving, at choosing between vehicles, building vehicles, knowing when a particular vehicle is or isn't appropriate, and not driving the school bus when drunk.

Some spectacular crashes have been caused by driving systems built for the purpose of giving rational answers without due care and attention.

comment by swestrup · 2009-03-15T21:38:40.878Z · LW(p) · GW(p)

This has been voted into the negatives, but I'm not sure it's such a bad idea. If we can set up a system where the students, teachers, and any other staff are all in continuous rationality competitions with each other, then this would quickly cause people to hone their skills.

For example, maybe the teacher of a class is chosen from within a class and has to fight (metaphorically) to maintain that position. Maybe the choice of whether you are teacher, student, principal, cafeteria cook, or janitor depends on the outcomes of numerous rationality contests between members.

And note that I don't necessarily mean that cafeteria cook or janitor would be positions that go to the losers...

comment by ProofBySonnet (redheadbros) · 2023-07-29T01:33:10.981Z · LW(p) · GW(p)

I ended up going in a completely different direction with this: I intend to test my OWN rationality, and I figure that if rationality is about WINNING, about being EFFECTIVE, then I ought to find direct measures of the things I want, and test myself in 6 months or so (timeframe dependent on the toughness/length of the task). This will, in other words, be a test of my ability to understand the territory insofar as that understanding makes me more effective at a given task.

The things in particular, a few subgoals of my personal life-optimization:

  • artistic endeavors and life enjoyment: engaging in things like art or gamedev or other mediums while aiming for MAXIMUM FUN.
  • being RESTED, having ENERGY to do things (I have a tendency to burn out due to overworking myself)
  • studying AI alignment (and the general objective of "actually have a positive effect on whether we're all going to die or not")

the tests in question:

  • for each artistic project (i.e. a drawing, a game, et cetera), I post it to a public forum and, with said post, I add a simple strawpoll asking "would you judge this work as 'complete'?" I would not be allowed to argue with the results, and I'd be judged by the number of projects I complete, with "complete" defined by the poll being greater than 80%. It's hard to directly measure fun, but "completing quality projects and showing them to others" seems like a good enough way to achieve it, for me in particular.
  • I've already got a method for measuring this: each day, if I notice that I'm tired during my free time, I force myself to rest until I stop being tired, and I record the time in a spreadsheet. Then, I sum up all the rest time accumulated over a period of a month. The lower it is, the better my sleep patterns hypothetically are. I predict that, if I had myself do this for as long as a month, I'd avoid pushing myself to my productive limit - that would induce burnout, which would eventually FORCE me to rest during free time, and a lot.
  • A good measure of "you are making quality posts" would be either something like summed upvotes, or maximum upvotes on a post in a given period, or maximum links from other posts made by other users. That last one seems difficult, but also points me in the right direction of "look into what posts got referenced the most, and try to make those kinds of insights."

I'm a little bit nervous about taking on the 1st or 3rd test - I'm not sure if I could pull them off - but I suppose that's the right feeling to have, if they're hard but accurately so.

comment by Elias (Eliyoole) · 2022-02-23T23:30:19.665Z · LW(p) · GW(p)

An idea that might be both unsustainable and potentially dangerous, but also potentially useful, is to have someone teach as a final test. Less an exam and more a project (with oversight?). Of course, these trainees could be authentic or disguised testers.

Problems with this idea (non-exhaustive):
  • Rationality doesn't necessarily make you good at teaching,
  • Teaching the basics badly is likely to have negative effects on the trainee,
  • This could potentially be gamed by reformulated regurgitation.

So... What behaves differently in the presence of Rationality. I like Brennan's idea of time pressure, though he himself demonstrates that you don't need to have finished training for it, and it doesn't really hit the mark.

Or: What requires Rationality? Given Hidden Knowledge (may only require facts that are known, but not to them), one could present new true facts that need to be distinguished from new well-crafted falsehoods (QM anyone?^^). This still only indicates, but it may be part of the process. If they game this by studying everything, and thinking for themselves, and coming to correct conclusions, I think that counts as passing the test. Maybe I am currently not creative enough though. This test could also be performed in isolation, and since time would probably be a relevant component, it would likely not require huge amounts of resources to provide this isolation. Repeat tests could incorporate this (or seemingly incorporate it) too.

If you wanted to invest more effort, you could also specifically not isolate them, but put them in a pressured situation (again, I am being influenced by memories of a certain ceremony. But it is simply really good.) This doesn't have to be societal pressure, but this kind at least makes rash decisions less likely to be costly.

I can't really formulate the idea concretely, but: A test inspired by some of ye olden psychology experiments might provide double yield by both testing the rationality of the person in question and also disabuse them of their trust. Though I can see a lot of ways this idea could go awry.


An issue that most if not all of my tests run into is that they limit what could be taught, since the content is still part of the test. This is a problem that should be solved, not just because it irritates me, but because it also means that random chance could more easily change the results.

This is, I think, because so far all my tests check for the correct answer, which may in itself be the wrong approach, since we are trying to test techniques which have an impact on the whole person, not "just" their problem solving. I would, for example, hope that a crisis situation would on average benefit from the people involved being trained in rationality, not just in regard to "the problem solving itself", but also in their emotional response, their ability to see the larger picture, prioritization, initial reaction speed, and so on.

(Maybe having them devise a test is a good test...^^ Productive, too, on the whole.)

(I can think of at least one problem of yours that I still haven't solved, though I therefore can't say whether or not my not-solving-it is actually showing a lack of rationality[though it's likely], or rather depends on something else. Not sure if I should mention it, but since you (thankfully) protect the answer, I don't think that I need to. This, still, is asking for a correct answer though.)


That's all I can think of for now. Though I am not really satisfied... Do I need to be "at a higher level" to be able to evaluate this, since I don't fully grasp what it is that should be tested yet? Seems like either an option or a stop sign..

comment by Toby Anderson (toby-anderson) · 2021-01-12T16:02:42.979Z · LW(p) · GW(p)

One large theme I've seen in biases is the tendency to affirm positions you already hold, by treating evidence and arguments with imbalance.

So my idea, is to purposefully select arguments from both sides of highly controversial issues such as gun control, abortion, or whatever is polarized at the time period. Then riddle the arguments with mistakes, and challenge the student to find errors in both sides of the issues.

Possibly having a bank of possible rational missteps that they must dole out to different arguments, or a free form analysis that has to be well justified and is subjectively judged by a group of rationalists.

comment by Pascal Morimacil (pascal-morimacil) · 2020-07-26T17:01:18.286Z · LW(p) · GW(p)

Take any cognitive bias that is supported by previous experimental data. Replicate to confirm.

Subject students to various training regimens, with control group.

Test again for presence of cognitive bias, note any improvements.

Repeat, repeat again for other known cognitive biases.

Not perfect, but it should be enough to make some headway.
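A minimal sketch of the comparison step, with made-up numbers standing in for measured bias effect sizes and a permutation test standing in for whatever statistics one actually prefers:

import random

control = [0.42, 0.38, 0.55, 0.47, 0.51, 0.40, 0.44, 0.49]   # bias effect size, no training
trained = [0.31, 0.36, 0.28, 0.41, 0.33, 0.30, 0.39, 0.27]   # after the training regimen

observed = sum(control) / len(control) - sum(trained) / len(trained)

# Permutation test: how often does randomly relabelling the groups produce a gap this large?
random.seed(0)
pooled = control + trained
extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    a, b = pooled[:len(control)], pooled[len(control):]
    if sum(a) / len(a) - sum(b) / len(b) >= observed:
        extreme += 1
print(observed, extreme / trials)   # improvement in the trained group, and an approximate p-value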

Also, one could just subject a student to a battery of tests, ideally creative stuff potentially involving real-life scenarios rather than just written tests, to look for all sorts of cognitive biases.

Should the student try to game the system by learning, well great!

comment by DPiepgrass · 2020-06-26T08:48:50.832Z · LW(p) · GW(p)

How about a test that causes people to build and use mental models and formulas? People are asked to estimate primarily numeric facts based on other facts. In each question, give people a set of "measured facts"* and ask people to estimate more relevant facts/consequences via back-of-envelope calculations (or a computer program, for more precision). But unlike a normal math word problem, set up the test so that, say, 2/3 of the questions cannot be accurately estimated with only the information given. Among that 2/3, half can be accurately estimated by adding some common-sense info (e.g. that most people work about 40 hours a week, that life expectancy is about 80 years, that almost half of American voters vote Republican, etc.), and the other half require more esoteric information that test-takers will rarely have. For all the questions, test-takers need to build a simple mental model or formula that would allow them to do the calculation, state any information they need that is missing, and try briefly to look up the info online in order to compute a reasonable estimate. If they can't do this, they need to express the answer in terms of unknown variables and then guess what the values of the variables are. They must also state relevant assumptions.

This is a means both to improve rationality as well as test it.

Example question set:
Background: in some types of accidents at some types of nuclear plants, radioactive substances can be released into the atmosphere (radioactive substances emit ionizing radiation). It is medically plausible that there is no perfectly safe dose of ionizing radiation in human tissue, and that radiation damage to DNA is cumulative, because cells repair some DNA damage very slowly, or never, and this damage can lead to cancer years after radiation exposure. This is known as the linear no-threshold hypothesis: that the health risk is proportional to exposure and there is no safe dose. If residents are promptly evacuated during an accident, the primary risk to their health upon returning will be from long-term exposure to radioactive cesium, which mainly causes a type of cancer called non-CLL leukemia.**
• A metastudy reports that the excess relative risk (ERR) of non-CLL leukemia from 100 mGy of radiation is about 19% (this means that people get non-CLL leukemia 19% more often than normal).**
• The normal rate of leukemia in the U.S. is about 14 diagnoses per 100,000 people per year. About 1.5% of people are diagnosed with leukemia at some point in their lifetime
• The normal death rate of leukemia is about 6.4 per 100,000 people per year in the U.S.
• One third of leukemia cases are CLL leukemia cases.
• Another study estimates that in the U.S. there are about 16,000 excess deaths annually due to electricity generation emissions, which is a low rate compared to some developing countries. The researchers estimate that 91% of these deaths were the result of emissions from coal-fired power plants.
• There are 328 million people in the U.S. and 7.5 billion in the world
• About 65% of all electricity worldwide is produced by burning fossil fuels. About 10% of electricity is from nuclear plants and 38.3% is from coal.
• Assume two-thirds of cancer cases and deaths from a nuclear accident occur outside the city where the accident occurred***

Scenario: suppose that another nuclear accident were to happen, one somewhat more serious than Fukushima, inside a city of one million people, in a developed country. Suppose that all evacuated persons return to their homes after one month and, as a result, are exposed to 100 mGy of radiation on average, mostly from cesium. Assume that half of this radiation dose occurs in the first 10 years and that most of it has occurred within 40 years***.

Questions:
1. Estimate the chance that the radiation will cause non-CLL leukemia in a particular, random person in the city at some point in their lives.
2. Estimate the chance that the radiation will kill a particular, random person in the city after they move back.
3. Estimate the total number of non-CLL leukemia cases caused by the radiation (over 40+ years).
4. Estimate the total number of people that will die as a result of the radiation (over 40+ years).
5. Assume that all nuclear accidents worldwide, combined, cause this number of deaths once every 20 years (e.g. in a 20-year period there might be two accidents, each half as serious as this one). What is the expected number of deaths per year in a randomly selected city of about one million people?
6. Estimate the number of excess deaths caused by power generation in that same city (i) per year, and (ii) over a 40-year period, if all its electricity came from fossil fuels instead of the nuclear plant.
7. Brainstorm additional factors that might change your estimates above.
8. Brainstorm other considerations that would be relevant to evaluating safety of nuclear power compared to alternatives.

Example answers:
1. Assumptions: All people have lives of average length (80 years). Age distribution in the city is uniform from 0 to 80. Leukemia risk is elevated uniformly after exposure for the rest of the person's life. All developed countries have similar leukemia rates. Leukemia is diagnosed very soon after it develops. Leukemia risk does not vary by age (this is not true, but on the other hand, I question whether it was appropriate for the metastudy to use ERR instead of excess absolute risk (EAR)). Radiation exposure probably drops off mostly according to cesium's half-life, but to simplify the calculation, assume 50% of the 100 mGy dose is delivered linearly in the first 10 years and the other 50% linearly over the following 30 years.
• Normal non-CLL leukemia risk is 14*2/3 = 9.333 per 100,000 per year
• A random person has on average 40 years of life left (50% of an 80-year lifetime)
• Excess risk of non-CLL leukemia is 19%, so 9.333*0.19 = 1.773 per 100,000 once the full dose happens.
• But there's a long delay before reaching the full dose... integrating over my approximate exposure function, average excess incidence should average 1.773/2/2 per 100,000 in the first 10 years and 1.773*0.75 over the next 30. Neglecting young and old people to simplify the calculation, the lifetime risk is about 1.773*0.25*10 + 1.773*0.75*30 = 44.3 per 100,000 over 40 years, so the lifetime risk is about 0.0443%, or 1 in 2260.

Fun fact 1: Before writing this LessWrong post, I did a calculation like this to learn about the risks of radiation, because I couldn't find any research estimating what I wanted to know. Radiation risks seem to be among the world's best-kept secrets. I'd rather see a peer-reviewed paper answer "how likely is X amount of radiation to kill me" than rely on my "napkin" model, but I haven't found any such research.
Fun fact 2: the answer increases if your starting point is "1.5% of people are diagnosed with leukemia at some point in their lifetime" since "14 per 100,000 people per year" only adds up to 1.12% per 80-year lifetime. I don't know why these numbers don't match up.
Fun fact 3: I should really be using a simple (Monte Carlo) computer model for this with exponential decay of radiation exposure... no idea if it would raise or lower my estimate.

2. (Further) Assumptions: Non-CLL leukemia is the only cause of death driven by radiation. Years of life left after the first cell turns cancerous is negligible. Probably both assumptions are significantly wrong, but the first assumption underestimates deaths and the second overestimates them so it seems like a wash.
• 6.4/14 = 45.7% of cases are fatal so the risk is 0.0443%*0.457 = 0.0202% or 1 in 4939.

3. Assumption: cancer screenings do not increase as a result of the accident (I'm sure this is wrong). There will be about 0.000443*1,000,000 = 443 excess cases in the city and about 443*3 = 1329 excess cases total
4. There will be about 1329*6.4/14 = about 607 excess deaths total

5. There will be 607/20 = 30.3 deaths worldwide per year from all nuclear accidents. Given a world population of 7.5 billion, that's about 0.004 deaths in a city of one million. The risk increases somewhat in cities that contain their own nuclear plant, if the plant is one of the more hazardous (read: old) models.

6. In a random U.S. city, the expected deaths per million in the U.S. from fossil fuels is 16,000/328 = 48.8 per year. (i) Assuming air pollution's effects are mainly local and 100% of power generation comes from fossil fuels, the expectation for a U.S. city is 16,000/328/0.65 = 75 deaths per year due to fossil fuels. (ii) which is 3001 deaths over a 40-year period (4.5x higher than the nuclear meltdown scenario).

7.
• Increased screening due to concern about the risk will increase the rate of cancer diagnoses, but not rates of cancer, and cancer death rates may be reduced by early detection.
• Radiation could cause other types of cancer deaths (I heard, for example, that short-lived iodine isotopes can cause thyroid cancer, but that this can be mitigated with iodine pills).
• Etc.: I'm getting tired but you get the idea

8.
• Regulations passed after Three Mile Island probably increase safety a lot in newer reactors (but make new plants cost-prohibitive to certify and build)
• Nuclear waste increases long-term risk (but less than most people think, I would add)
• It has been suggested that terrorists could steal nuclear fuel and build a bomb with it. (I don't know if this is remotely plausible, but I do know that reactor-grade uranium is not directly usable in a bomb.)

• Deaths during plant construction and related mining should be similar between nuclear and fossil fuel plants; solar plant construction seems like it should be safer than nuclear, oil/coal, and wind.

• Though deaths from fossil fuels are more numerous, each death is expected to be less bad because it should happen near the end of a person's life due to many years of lung damage, whereas in the nuclear case, some young people will be affected. It's strange to me that fossil fuel deaths are not measured as "years of life lost" instead.

* The "facts" can be real or based on back-of-envelope calculations, but the test-taker is to assume the information is factual. If it is not factual, and concerns the real world, it mustn't be excessively off-the-mark because humans can't simply erase misinformation from our minds so it's best not to intentionally mess with us.
** This is roughly correct AFAIK but I'm not an expert. Also, the metastudy strangely neglects to model time, e.g. it does not say that the risk is elevated for the rest of people's lives, or that it is elevated for X years, or anything time-related like that. I don't see why risk would be elevated for life—if damage will cause a cell to turn cancerous, why would it wait 20 years to do so?—but conservatively this is my mental model anyway. I've seen a study that indicates 100 mGy is more than the average dose avoided by relocating residents of Fukushima; note also that mGy and mSv are the same SI units, so I don't understand the difference.
*** This datum is made-up as I haven't found information about it.

After going through this exercise I think the formulas need to be more explicit... really we should write a program for nontrivial models, e.g....

# Deterministic version; a rough Monte Carlo extension follows below.
# Assumptions added to make this runnable: 40 years of remaining life on average (as in the
# hand calculation), and an initial yearly dose chosen per the original TODO so that the
# lifetime total tends toward 100 mGy or a bit less.
years_of_life_left = 40
base_non_cll_risk = (14.0 * 2 / 3) / 100_000           # normal non-CLL leukemia diagnoses per person-year
cesium_half_life_years = 30.17
yearly_decay_factor = 0.5 ** (1 / cesium_half_life_years)
err_per_mgy = 0.19 / 100                                # excess relative risk per mGy
initial_yearly_dose = 100 * (1 - yearly_decay_factor)   # mGy in year one; the geometric sum tends to 100 mGy

dose = 0.0
excess_lifetime_chance_of_cancer = 0.0
for year in range(1, years_of_life_left + 1):
    dose += initial_yearly_dose
    initial_yearly_dose *= yearly_decay_factor
    # excess absolute risk this year = base rate * ERR per mGy * cumulative dose so far
    excess_lifetime_chance_of_cancer += base_non_cll_risk * err_per_mgy * dose

print(dose)                               # cumulative dose after 40 years (~60 mGy; tends toward 100 over a lifetime)
print(excess_lifetime_chance_of_cancer)
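And a rough Monte Carlo extension of the sketch above (my own choice of what to randomize: each resident's remaining lifespan is sampled instead of being fixed at 40 years, keeping the other assumptions):

import random

def excess_risk(years_left):
    base_non_cll_risk = (14.0 * 2 / 3) / 100_000
    err_per_mgy = 0.19 / 100
    decay = 0.5 ** (1 / 30.17)            # yearly decay factor for cesium
    yearly_dose = 100 * (1 - decay)        # so the lifetime total tends toward 100 mGy
    dose = risk = 0.0
    for _ in range(years_left):
        dose += yearly_dose
        yearly_dose *= decay
        risk += base_non_cll_risk * err_per_mgy * dose
    return risk

random.seed(0)
# Uniform age distribution from 0 to 80, as in the assumptions above.
samples = [excess_risk(random.randint(0, 80)) for _ in range(100_000)]
print(sum(samples) / len(samples))   # mean excess lifetime non-CLL leukemia risk per resident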

And there would also be a need for numerous easier exercises than this one.

Replies from: DPiepgrass
comment by DPiepgrass · 2020-06-26T16:43:08.612Z · LW(p) · GW(p)

To make things more interesting, measure the pre-existing biases of the test-taker and then... give bonus points for assumptions and issues mentioned by the test-taker that are contrary to their own bias? e.g. if they are predisposed to be against nuclear power then a comment like "Regulations passed after Three Mile Island probably increase safety a lot in newer reactors" would count in their favor, whereas if they are predisposed to be in favor of nuclear power, mentioning risks of nuclear waste would count in their favor. Also, correctly including factors in their model that are contrary to their bias (e.g. +1 if their preconception is against nuclear but they correctly identify the rate of non-CLL leukemia (14*2/3 or 1.5%*2/3) and use that number to estimate the risk, rather than mixing up non-CLL with total leukemia). A special case, common outside LessWrong: failure to identify any factors contrary to their bias is a red flag. Another red flag: isolated demands for rigor / questioning studies only when the conclusion is disliked.

A problem with my style here, especially re: the final two questions, is the difficulty of automated testing. It's tempting to convert to a multiple-choice test, yet we want participants to generate their own ideas. A compromise for sake of automation: gather hundreds of reasonable ideas from initial test-takers, and identify searchable keywords that will, when typed, find those ideas. Then test-takers can type (complete) keywords to find and add pre-existing ideas as their answers.

comment by Дмитрий Зеленский (dmitrii-zelenskii) · 2019-08-19T15:51:23.835Z · LW(p) · GW(p)

Erm... let me be Brennan and go with the "obvious". Find problems whose solutions are known in some field but not widely, provide the initial data and results of additional experiments on request (with "too expensive to perform" being a possible result). Then have two measures:

1) Someone who is _also not an expert_ checks solutions for, well, everything you discuss here. Biases, effort, mysterious answers - you name it. (For effort, you might need to register when every thought was written, not just what it was.)

2) An expert checks the dataset used - which of the actually conducted experiments the students failed to request, and which of them were actually useful.

comment by [deleted] · 2015-04-18T18:07:52.983Z · LW(p) · GW(p)

The 'test even if gamed' reminds me of a labyrinth. Suppose there are several ways of reaching the end, and the participants can't know which way they are set upon, because it is chosen randomly. They are asked questions from outside of their domain of knowledge (it would need a big database to pick from), constructed in such a way that it is impossible to pick the right answer without knowing about various cognitive biases (e.g., the conjunction fallacy etc.) The questions can be independently rated for apparent difficulty, and masters will be given the hardest ones. (I don't know what makes some questions seem simpler even if the answer is still wrong. I asked people to rate questions in my post 'Before the seed. I. Guesswork', and somehow people chose exactly one to be 'the easiest to formulate hypotheses about', but I don't know how they did it. Plus, few people answered at all.)

comment by Marshall · 2009-03-15T19:18:45.822Z · LW(p) · GW(p)

How about asking people:

i) What is rationality for you?

ii) How rational are you?

iii) How will you prove it?

The askee can then ask the asker: Do you agree?

And then we have a conversation. Both parties have to agree on the final score.

comment by Angela · 2014-04-08T00:57:36.505Z · LW(p) · GW(p)

Basic true/false test; reverse stupidity is not intelligence, but rationalists tend to have fewer false beliefs. Taking the test upon entering the school would prevent the school from teaching to the test, and the test could be scored on multiple areas, of which one is a cunningly disguised synonym for rationality and the others are red herrings, so that irrationalists have no incentive to lie on the test.

comment by [deleted] · 2012-12-16T12:08:43.602Z · LW(p) · GW(p)

It seems like rationality overlaps so many different fields that it does not seem very plausible to be able to test rationality specifically. Political and ethical debates though seem to contain a lot of elements dealing with rationality.

comment by abramdemski · 2012-09-14T18:32:19.518Z · LW(p) · GW(p)

Although this post is old now, I'll still enter my ideas (good or bad) before reading the other comments...

  1. Video games. Expertise in one video game is not good enough; ideally, speed rationality of 100 people could be tested on a new game none of them had seen before.

  2. Along similar lines, ask the 100 people to cooperate in a large artificial project which requires that number of people, such as the manufacture of a complicated item invented for the day. It should be complex enough that cooperation is needed; IE, involve several complex skills such as a specific (ideally new, or new to the group) origami folding. Each individual is scored on how many of the item they have in their possession by the end of the day. It would be possible to work alone, but more efficient to work in groups and split profits. (Ideally the item is complex enough to encourage very large groups, tending towards everyone needing to work together.) This has the additional benefit of being a good learning experience.

comment by beoShaffer · 2011-07-12T05:45:29.466Z · LW(p) · GW(p)

Try to simulate the apparently supernatural, or create other hoaxes, and see who can debunk them. There is enough domain-specific knowledge involved that it wouldn't work too well with individuals, particularly if they have a motivation to game the system. Still, if a school doesn't generally increase its students' ability to deal with the apparently supernatural and with false information, that's almost certainly a bad sign.

comment by DBreneman · 2011-04-26T11:12:17.237Z · LW(p) · GW(p)

Experimental and Organizational tests seem to be the most important test types here; if the students and methods are able to show they're capable, and are measurably better than the students of another craft, then their school is obviously doing something better than other schools anyway, no Reputational test needed. So I'll concentrate on those.

What do we need for an experimental test? We need a way of comparing the strengths of students and ideas, to see which are stronger. The problem here is that there's not really a standard unit of rationality. If you want to measure something's volume, you can put it in a water bath and measure how many mL it displaces. If you want to measure someone's rationality... you're a bit out of luck.

I'm not well versed enough yet in cognitive sciences to propose a unit of raw intelligence/rationality measurement, and a way of at least estimating it. Until such a metric is apparant, I think we can make do with comparative testing. Take two students and have them perform some test of rationality that returns less rational, more rational, or equally rational as a rough comparison of the two. Perform it on an entire school, and you can rank each student. Perform it between similarly ranked students in two schools, and you can determine which school is better. Roughly. A test like this could also potentially serve as an organizational test.
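One way to aggregate such pairwise "more rational / less rational / equal" comparisons into a school-wide ranking is an Elo-style rating; a sketch, with the starting rating, K-factor, and names being arbitrary choices of mine:

from collections import defaultdict

ratings = defaultdict(lambda: 1000.0)
K = 32

def record_comparison(winner, loser, draw=False):
    expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    actual = 0.5 if draw else 1.0
    ratings[winner] += K * (actual - expected)
    ratings[loser] += K * ((1 - actual) - (1 - expected))

record_comparison("alice", "bob")
record_comparison("bob", "carol")
record_comparison("alice", "carol", draw=True)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))   # rough ranking of the students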

What tests would I propose as an experiment? How about something like having the students competitively build a weirdtopia? (http://lesswrong.com/lw/xm/building_weirdtopia/) You could have a panel of randomly selected scifi fans read one of the two weirdtopias (don't compare them side by side, we're trying to get their honest opinion about one of the stories, not their comparison of the two) and rate 1-10 how much they'd like to live in that weirdtopia. The student with a higher voted paper is more rational, and if the stories are about equally weighted, we have two roughly equal weirdtopias.

That... doesn't test every facet of rationality I know. However, using tests as a way of comparing two students is something that a lot more tests could be adapted to, without necessarily having to make a measurable yardstick of rationality. Just need to figure out which aspect of rationality you want to test, look at papers and stories that display this aspect, have the two students write a similar paper using their own skills, and compare the two.

comment by VAuroch · 2013-12-09T08:43:43.310Z · LW(p) · GW(p)

I might use something similar to The Book from Neal Stephenson's Anathem, but less deliberately harmful and more confusingly-related-to-reality. Something where, in order to succeed, you must Change Your Mind, at least partially. If possible, include a real scenario where you must apply the knowledge in a charged context, where people are most prone to irrationality.

comment by Tenoke · 2012-11-08T23:05:06.661Z · LW(p) · GW(p)

A debate-like environment seems like an obvious example for a martial arts-like competition.

comment by [deleted] · 2012-08-13T02:38:04.875Z · LW(p) · GW(p)

Hmm... To me, a master of rationality might seem to be able to debate fairly well with the heads of other powerful schools, such as philosophy and physics. I myself can pose some interesting questions to physics-knowledgeable people, and refute offhand philosophical stupidity in stride.

To test students for rationality, I guess the easiest thing is to test for debiasing, by running classical bias experiments?

I need to mull this over with my fellow Bayesian conspirators.