Beware Experiments Without Evaluation

davis_kingsley

post by Davis_Kingsley · 2020-11-15T08:18:33.830Z · LW · GW · 13 comments

Sometimes, people propose "experiments" with new norms, policies, etc. that don't have any real means of evaluating whether the policy actually succeeded.

This should be viewed with deep skepticism -- it often seems to me that such an "experiment" isn't really an experiment at all, but rather a means of sneaking a policy in by implying that it will be rolled back if it doesn't work, while also making no real provision for evaluating whether it will be successful or not.

In the worst cases, the process of running the experiment can involve taking measures that prevent the experiment from actually being implemented!

Here are some examples of the sorts of thing I mean:

Management at a company decides that it's going to "experiment with" an open floor plan at a new office. The office layout and space chosen make it so that even if the open floor plan proves detrimental, it will be very difficult to switch back to a standard office configuration.
The administration of an online forum decides that it is going to "experiment with" a new set of rules in the hopes of improving the quality of discourse, but doesn't set any clear criteria or timeline for evaluating the "experiment" or what measures might actually indicate "improved discourse quality".
A small group that gathers for weekly chats decides to "experiment with" adding a few new people to the group, but doesn't have any probationary period, method for evaluating whether someone's a good fit or removing them if they aren't, etc.

Now, I'm not saying that one should have to register a formal plan for evaluation with timelines, metrics, etc. for any new change being made or program you want to try out -- but you should have at least some idea of what it would look like for the experiment to succeed and what it would look like for it to fail, and for things that are enough of a shakeup, more formal or established metrics might well be justified.
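To make "some idea of what success and failure would look like" concrete, here is a minimal sketch of writing the criteria down before the change is made. Everything in it is a hypothetical illustration rather than something from the post: the `ExperimentPlan` fields, the `decide` function, the metric, the thresholds, and the dates are all assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExperimentPlan:
    """Hypothetical record written down BEFORE the change is made."""
    change: str
    metric: str
    keep_if_at_least: float   # result that would count as success
    revert_if_at_most: float  # result that would count as failure
    review_on: date           # when the decision actually gets made
    rollback_steps: str       # how the change would be undone


def decide(plan: ExperimentPlan, observed: float, today: date) -> str:
    """Return 'too early', 'keep', 'revert', or 'inconclusive'."""
    if today < plan.review_on:
        return "too early"
    if observed >= plan.keep_if_at_least:
        return "keep"
    if observed <= plan.revert_if_at_most:
        return "revert"
    return "inconclusive"


# Made-up numbers, purely for illustration.
plan = ExperimentPlan(
    change="new forum rules",
    metric="share of surveyed members rating discussion quality as improved",
    keep_if_at_least=0.60,
    revert_if_at_most=0.40,
    review_on=date(2021, 1, 15),
    rollback_steps="restore the previous rules page and announce the reversal",
)
print(decide(plan, observed=0.35, today=date(2021, 2, 1)))  # prints "revert"
```

The point of the sketch is only that the thresholds, the review date, and the rollback steps exist before any results come in. An informal note covering the same fields would serve the same purpose.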

13 comments

Comments sorted by top scores.

comment by Dagon · 2020-11-15T16:55:50.944Z · LW(p) · GW(p)

Very useful - note that in many cases, this is intentional (or semi-intentional). The proposer of the change is labeling it an experiment, but doesn't actually have any intention to learn anything or to roll back the change under foreseeable circumstances.

Your point is valid, but it's a mistake to think it can be fixed by trying to formalize the "experiment". In many cases, when "experiment" is just a euphemism for "implement without discussion", it'll be more effective to just disagree with the policy (if you do) than to object to the "experimental setup".

Replies from: Viliam, Daniel V

↑ comment by Viliam · 2020-11-15T21:34:24.802Z · LW(p) · GW(p)

Management at a company decides that it's going to "experiment with"...

Yeah, in this case you already lost, because the managers will be the ones doing the evaluation, so you can bet that if they want to do X, they will declare X to be a success regardless of anything that happened in reality.

To make it more legit, they may even give everyone a questionnaire like "Did open space increase your productivity?" Note that the only options are "increased" and "not increased", there is no "decreased", so you can only be neutral or positive about the whole thing. Furthermore, if all managers choose "increased" that alone is enough to conclude that "50% of employees answered that their productivity increased".

↑ comment by Daniel V · 2020-11-15T17:42:27.594Z · LW(p) · GW(p)

Indeed, these aren't controlled experiments at all, but sometimes they are also not policy-sneaking. Sometimes they are just using the phrase "experimenting with" in place of "trying out" to frame policy-implementation. At that point, the decision has already been made to try (not necessarily to assess whether trying is a good idea, it's already been endorsed as such), and presumably the conditions for going back to the original version are: 1) It leads to obviously-bad results on the criteria "management" was looking at to motivate the change in the first place or 2) It leads to complaints among the underlings.

The degree of skepticism, then, really just depends on your prior for whether the change will be effective, just like anything else. Whether there should have been more robust discussion depends either on the polarity of those priors (imagine a boardroom where someone raises the change and no one really objects vs. one where another person suggests forming an exploratory committee to discuss it further), or on whether you believe more people should have been included in the discussion ("you changed the bull pen without asking any of the bulls?!"). It has little to do with the fact that it was labeled an experiment, since again, it's likely being used as business-speak rather than as a premeditated ploy. I would love to have data on that though: do people who specifically refer to experimentation when they could just use a simpler word tend to use it innocuously or in a sneaky way?

Replies from: Dagon, qyng

↑ comment by Dagon · 2020-11-15T18:10:14.557Z · LW(p) · GW(p)

"trying out" also has the implication of reversibility, and should get (some of) the same criticism as "experiment" as a policy weasel-word. The degree of skepticism depends on your prior for the effect of the change, and also for the motivation of the proposer, and also of whether there is actually a path to measurement (decision to reverse) and reversal (implementation of reversal).

In the examples given, they seem very likely to be "policy-sneaking".

↑ comment by qyng · 2020-11-17T13:26:29.710Z · LW(p) · GW(p)

Similar thoughts. How one organisation defines "experiment" may differ from how another does, or from how the employees themselves interpret it (business speak vs weasel word). There are also the factors of company values and culture, which provide the guardrails for what "experimentation", along with other hefty words such as "productivity" (depth of work vs breadth vs quality vs so on), means to them specifically. Assuming the employees have bought into those values for the most part (and hopefully that is why they became employees in the first instance), maybe there's an implied, unique understanding of these terms.

Such broad concepts are often used to paint employee satisfaction surveys, and might appear to policy-sneak, but it's easy to miss seemingly unimportant definitional particulars from multiple angles. Not to say that sneaking doesn't occur, and values can definitely be lost in translation, especially if management only takes a top-down approach and stakes the goalposts but doesn't elicit, receive or adequately respond to feedback. The ability to metricise arises from wrangling with the devil in the details, and not every company takes the time to.

Replies from: Davis_Kingsley

↑ comment by Davis_Kingsley · 2020-11-17T21:57:18.408Z · LW(p) · GW(p)

Yeah, I should point out that not all cases of experiments without evaluation are "sneaking" by any means -- sometimes one might have a well-intentioned idea for a change and just not go about testing it very systematically. However, in some ways the negative consequences can be similar.

comment by Richard_Ngo (ricraz) · 2020-11-15T21:50:27.852Z · LW(p) · GW(p)

This seems true in some cases, but very false in others. In particular:

Humans tend to experiment too little, and so we should often encourage more experimentation.
Qualitative observations are often much richer than quantitative observations. So even if you don't know what you're going to measure, making changes can help you understand what you were previously missing.

Your observation seems most true when there's institutional inertia, so that calling something an experiment can be a useful pretext.

comment by arunto · 2020-11-15T14:48:23.669Z · LW(p) · GW(p)

I think you are making an important point.

A relevant follow-up question could be: What makes it more (or less) likely that a group or an organisation does make plans to evaluate the results of an experiment?

Some ideas:

Culture. It would be helpful to have (or create) a culture that doesn't see a "failed" experiment as a failure but as an important learning opportunity.

Intellectual humility. Running a true experiment (and not just calling one's plan "experiment") requires accepting that one has less certainty about how the world works.

comment by FraserOrr · 2020-11-15T16:57:25.471Z · LW(p) · GW(p)

A few thoughts on this:

The person advocating such experiments is usually an advocate of the change rather than a curious seeker of truth. Consequently, the lack of evaluation is a feature not a bug.
It is very hard to get good data out of any experiment that has a sample size of one.
It is often very hard to measure the actual thing you desire, and so as a consequence, insofar as there is measurement it is measurement of something that can be measured rather than something that is a useful measurement.
Even insofar as it is possible to measure something useful, it is often impossible to get all experimenters to agree on which of the useful metrics to use.
To some degree these things can be dealt with honestly by taking an A/B approach. So, for example:
1. You have an open office plan on the first floor, and keep it as is on the second floor.
2. You introduce new forum rules for the month of December, and revert in January (and see if the forum members demand the return of the new rules).
3. You spin off a new forum, my-forum-expanded, that allows in the new users, and see if the old users naturally migrate to the broader discussion. If it fails, the new forum will die on its own.
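As a rough sketch of how the first example could be evaluated, the snippet below compares two small groups with an exact permutation test on the difference in means. The function names, the choice of test, and the numbers (weekly tasks completed per team) are all made up for illustration. It also assumes each floor has several comparable teams, which sidesteps the sample-size-of-one problem noted above rather than solving it.

```python
from itertools import combinations
from statistics import mean


def mean_difference(treated, control):
    """Difference in means: treated minus control."""
    return mean(treated) - mean(control)


def exact_permutation_p_value(treated, control):
    """Two-sided exact permutation test on the difference in means.

    Enumerates every relabelling of the pooled observations, so it is only
    practical for small groups.
    """
    pooled = list(treated) + list(control)
    observed = abs(mean_difference(treated, control))
    as_extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(treated)):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so floating-point noise does not drop exact ties.
        if abs(mean_difference(group_a, group_b)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total


# Hypothetical data: weekly tasks completed by five teams on each floor.
open_plan_floor = [11, 12, 10, 13, 11]
unchanged_floor = [14, 15, 13, 16, 15]

print("difference in means:", mean_difference(open_plan_floor, unchanged_floor))
print("p-value:", exact_permutation_p_value(open_plan_floor, unchanged_floor))
```

Even with a clean comparison like this, the floors were not randomly assigned, so a difference could reflect which teams happen to sit where rather than the layout itself.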

However, as you say in your comment, oftentimes such things are introduced more as an implementation of a belief system about what is "right" than as a curious investigation into which is more effective. It often requires utterly disastrous results before reverting the change is likely.

comment by adamShimi · 2020-11-16T20:56:55.507Z · LW(p) · GW(p)

This seems obvious to me, at least in the specific context that you mention (a community-level change). But the post is clear, short, and to the point, so it will surely help some readers.

comment by cozy · 2020-11-16T13:03:07.716Z · LW(p) · GW(p)

Military strategy is probably more careful than science in how many contingencies it expects, yet how few are truly accurate. Legitimately, science is more careful to actually try and do things it thinks have merit, while wars are fought off of conjecture and often just error. I think you can find a great number of amazing examples of both careful planning and completely spurious decision making, often on purpose to discredit another leader, within war. von Falkenhayn's memoir of the Great War is a wonderful example.

He also brings up an absolutely excellent point, that is often under-considered.

"Whether the solution proved itself in actual life depended, to be sure, as in all things in this imperfect world, primarily upon the men who had to put these principles in practice."

offtopic: why is the editor so hard to use right now, I don't remember quote blocks being this fiddly.

Replies from: habryka4

↑ comment by habryka (habryka4) · 2020-11-16T18:52:44.558Z · LW(p) · GW(p)

It appears you only used soft line breaks for every paragraph, making your comment de-facto one big paragraph. That will make blockquotes a lot harder. As in most text-editors, you get paragraph breaks by pressing enter, and soft line-breaks by pressing shift-enter. Sadly, some places on the internet (like IM apps) have conditioned me to often use shift-enter in places where it doesn't make sense, because enter also submits the message, so my guess is you were caught by those habits and were using shift-enter in the above?

Replies from: cozy

↑ comment by cozy · 2020-11-17T12:07:06.330Z · LW(p) · GW(p)

Yes.

test

thank you very much.
