Good Samaritans in experiments
post by Bucky · 2018-10-30T23:34:27.153Z · LW · GW · 14 comments
Consider two people. Both are seminary students taking part in an experiment ostensibly about different types of religiosity. One is asked to prepare a short talk on the Good Samaritan, the other on potential future careers for seminary graduates.
They are both told to go to another room to record their talk. The one who is to give the talk on the Good Samaritan is told that he is late and needs to hurry. The other participant is told that he has time to spare.
If they, separately, come across someone who appears to be in respiratory distress, which do you think is more likely to stop and help?
Does being in a hurry determine whether someone helps?
Does reading the Good Samaritan?
Which is a bigger effect?
I was recently told about an experiment which showed that seminary students who had just prepared to give a talk about the Good Samaritan were no more likely to help someone in need than those who had been preparing a talk about an unrelated topic.
This seemed unexpected to me – people who had just been reading and thinking about a story which was told specifically by the leader of their faith to instruct them to help other people were no more likely to help than the control? I know humanity is crazy but this seemed like a new level of crazy which I wouldn’t have predicted.
So I thought I’d check out the study and – Aaaaaaaaaaaaaaaaaah!!!
I know getting overly upset about bad experiments (especially those from before the replication crisis) is probably bad for my health but still – Aaaaaaaaaaaaaaaaaaaaaaaah!!!
I don’t want to be too harsh on the authors as this probably isn’t the worst culprit you’ll see but – Aaaaaaaaaaaaaaah!!!
The paper has 1811 citations listed on google scholar – Aaaaaaaaaaaaaaah!!!
I’m tempted to pretend that this post has some purpose other than just as a release of my frustration but that would be dishonest. Please consider this post a form of therapy for me. The working title for this post was “Screaming into the void” - consider yourself warned.
(If you want a more coherent discussion of common misuse of statistics in research papers I highly recommend putanumonit’s defense against the dark arts series)
The Experiment
Ok, so the basic premise of the experiment seems to be sound. We want to know what inputs cause people to be more or less likely to help others:
1. Planning a talk on the Good Samaritan (GS)
2. Being in a hurry
3. Type of religiosity (Religion as quest, means or end)
The setup is to give people a questionnaire to determine their type of religiosity. Then give them some time to plan a short talk (3-5 mins) on GS or an unrelated topic. They are then asked to go to another room to give the talk (with 3 degrees of urgency – low, medium and high).
Contrary to the example given in the introduction, the level of hurriedness doesn’t depend on which topic the individual has prepared – there are 6 conditions people are put in: GS low, medium and high urgency and control low, medium and high urgency.
On the way to the other room, you arrange for them to come across someone slumped in a doorway, with an apparent respiratory condition.
You monitor the subjects’ responses and analyse the results.
My first question was whether they would adjust their p-value requirement for the 5 variables they were testing but no, it turns out that p<0.05 was deemed adequate for significance. Ok, could be worse I guess. More on this later.
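For reference, here's roughly what an adjusted threshold would look like if you wanted to keep a 5% family-wise error rate across 5 tests. This is only a sketch (it assumes the tests are independent, which they won't quite be):

```python
# Per-test thresholds needed to keep the family-wise error rate at 5%
# across 5 tests (assuming independence - a rough guide only).
n_tests = 5
alpha_family = 0.05

bonferroni = alpha_family / n_tests                  # 0.01
sidak = 1 - (1 - alpha_family) ** (1 / n_tests)      # ~0.0102

print(f"Bonferroni threshold: {bonferroni:.4f}")
print(f"Sidak threshold:      {sidak:.4f}")
```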
The second place where doubts started to creep in was the ranking of responses:
0 = failed to notice the victim as possibly in need at all;
1 = perceived the victim as possibly in need but did not offer aid;
2 = did not stop but helped indirectly (e.g., by telling Steiner's assistant about the victim);
3 = stopped and asked if victim needed help;
4 = after stopping, insisted on taking the victim inside and then left him.
5 = after stopping, refused to leave the victim (after 3-5 minutes) and/or insisted on taking him somewhere outside experimental context (e.g., for coffee or to the infirmary).
It seems to me that there are two different things being measured here:
1. How likely are they to notice someone in distress
2. If they notice, how likely are they to help
However, these two factors have been put on the same scale. Surely it would be far more informative to separate them – does being in a hurry make you less likely to help someone you see who is in need or does it just make you less observant?
But this is relatively minor, and the victim doesn’t really care why you’re not helping, just whether you help or not – there could still be some useful results here. From the conclusion, it looks like failing to notice the victim at all was rare, so this failure mode was maybe not too big a deal.
Drawing the wrong conclusions
So, on to the conclusions from the report:
1. People who had been preparing GS talks were no more likely to help
2. People who were in a hurry were less likely to help
3. Religiosity type had an effect on the type of help given
It was the first which I was most interested in so I looked at the actual results.
53% of the people who had been preparing GS talks offered some kind of help (10/19). 29% of the people preparing non-GS talks offered some kind of help (6/21).
Wait, surely that means people who prepared a GS talk were 1.8x more likely to help than those with an alternative topic? Oh no, says the report. The difference was not significant at the p<0.05 level. Therefore, there is no effect. This isn’t specifically stated in that way but “lack of significant effect” is immediately followed by “lack of effect”:
Although the degree to which a person was in a hurry had a clearly significant effect on his likelihood of offering the victim help, whether he was going to give a sermon on the parable or on possible vocational roles of ministers did not. This lack of effect of sermon topic raises certain difficulties for an explanation of helping behavior involving helping norms and their salience.
The paper goes some way to redeeming itself by stating:
The results were in the direction suggested by the norm salience hypothesis, but they were not significant. The most accurate conclusion seems to be that salience of helping norms is a less strong determinant of helping behavior in the present situation than many, including the present authors, would expect.
It then undoes the good work in the next sentence:
Thinking about the Good Samaritan did not increase helping behaviour
Part of me wants to be happy that they at least included a fairly accurate description of the evidence but the repeated stating of the incorrect conclusion throughout the report can only lead readers to the wrong conclusion.
At one point, the paper seems to go even further and claims that the fact that we can’t reject the null hypothesis is confirmation of the null hypothesis:
The prediction involved in the first hypothesis concerning the message content was based on the parable. The parable itself seemed to suggest that thinking pious thoughts would not increase helping. Another and conflicting prediction might be produced by a norm salience theory. Thinking about the parable should make norms for helping salient and therefore produce more helping. The data, as hypothesized, are more congruent with the prediction drawn from the parable. A person going to speak on the parable of the Good Samaritan is not significantly more likely to stop to help a person by the side of the road than is a person going to talk about possible occupations for seminary graduates.
Since both situational hypotheses are confirmed…
Aaaaaaaaaaaaaaaaaaaaaaaaaaaah!!!
Somehow, the paper manages to make the “pious thoughts are ineffective” hypothesis into the null hypothesis and the “norm salience” hypothesis into the alternative hypothesis. Then, when the results are not significant enough to reject the null hypothesis, this is treated as confirmation that the null hypothesis is true. This is the equivalent of accepting p<0.95 as evidence for the “pious thoughts are ineffective” hypothesis.
(Aside: I’m no theologian but I’m not really sure that “pious thoughts are ineffective” is really what the parable implies. Jesus often used the religious leaders as the bad guys in his parables so he may just be repeating that point)
Effect Size, Significance and Experimental Power
I think the issue here is confusion between effect size and significance.
The effect size is actually pretty good (80% increase in helping). In fact, in the condition where the GS participants weren’t rushing, they averaged an impressive score of 3.8 (compared to 1.667 for the equivalent non-GS participants).
The fact that this doesn’t rise to significance has little to do with effect size and everything to do with experimental power.
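As a sanity check on the headline comparison, here’s a quick sketch using the 10/19 vs 6/21 counts quoted above. I’m using Fisher’s exact test because it’s a reasonable default for a 2x2 table this small; the paper doesn’t say exactly which test it ran, so this is only an approximation of their analysis:

```python
from scipy.stats import fisher_exact

# Helped vs didn't help, GS-talk group vs control-talk group
# (counts as quoted above from the paper)
table = [[10, 9],    # GS talk: 10 helped, 9 didn't
         [6, 15]]    # control talk: 6 helped, 15 didn't

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
# The point estimate clearly favours the GS group, but with only 40
# subjects the test doesn't clear p < 0.05.
```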
The sample size was 40. There were 6 categories relating to the first 2 hypotheses (3 hurry conditions x 2 message conditions). If for each of the 3 religiosity type conditions a participant was just rated as “high” or “low” then this is 8 categories. That makes a total of 48 possible categorisations for each subject to cover the 3 hypotheses. We’ve managed to get more potential categorisations of each subject than we have subjects.
Aaaaaaaaaaaaaaah!!!
(Actually, this may not be irretrievable in and of itself – it just threw up a big red flag for me. If all the other parts of the experiment were on the money this could just be efficiently testing as many different hypotheses as possible given the limited data points available. The real problem is that if we adjust for multiple variable testing then the required p-value for significance goes down and power goes down with it.)
In addition to sample size, experimental power depends on variation in the dependent variable due to other sources (I’m happy to accept that they had low measurement error). My best guess is that there is significant variation due to other sources, although I don’t have the data to show this. A number of personality traits had been investigated previously (Machiavellianism, authoritarianism, social desirability, alienation, and social responsibility) and found not to correlate significantly with helping behaviour, so my expectation would be that finding a true effect is difficult and unexplained variation in helping is large.
If experimental power is low, in order to find significant results, the effect size must be large.
As the effect size of reading GS was below the effect size required, the result was not statistically significant.
If an effect size of increasing helping by 80% is not significant, you really should have known before the experiment that you didn’t have enough power.
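To make that concrete, here’s a rough power calculation. It’s a sketch only: I’m assuming the 29% vs 53% helping rates quoted earlier, roughly 20 subjects per message condition, and a simple two-sample comparison of proportions, which is not exactly the analysis the paper ran:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Cohen's h for a 29% vs 53% difference in helping rates
effect = proportion_effectsize(0.53, 0.29)

# Power to detect that difference at p < 0.05 with ~20 subjects per group
power = NormalIndPower().power(effect_size=effect, nobs1=20, alpha=0.05,
                               ratio=1.0, alternative="two-sided")
print(f"Cohen's h = {effect:.2f}, power = {power:.2f}")

# Subjects per group needed for the conventional 80% power
n_needed = NormalIndPower().solve_power(effect_size=effect, power=0.8,
                                        alpha=0.05, alternative="two-sided")
print(f"Subjects per group for 80% power: {n_needed:.0f}")
```

If the power comes out well below the usual 80% benchmark, that’s the “you should have known before the experiment” point in numbers.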
Further reducing experimental power
If you thought that N=40 was questionable, wait until you see what comes next. The paper goes on to see if the input variables correlate with the amount of help given when help was given. Only 16 people gave any help so suddenly N=16.
Aaaaaaaaaaaaaaaaaaaaaaah!!!
This seems like bad news for finding significance but suddenly we do have a significant effect. It turns out that scoring higher on seeing religion as a quest makes you likely to offer less help than if you score lower on this metric. This is contrary to the experimenters’ expectations.
After performing some extra calculations, the experimenters conclude that this is because those who scored lower on this metric were likely to offer over-the-top assistance and score a 5 which skewed the results.
Allow me to offer an alternative explanation.
The paper has so far calculated 18 different p-values (3 from ANOVA of message x hurry, 10 from linear regression of full data (5 x help vs no help, 5 x scoring system) and 5 from linear regression of only helpful participants). There were actually another 10 p-values calculated in their stepwise multiple regression analysis but these seem to have been ignored so I’ll gloss over that.
Now, for each p-value you calculate, you have a 5% chance of finding a spurious result (when there is no true effect). I’ll take off the 3 p-value calculations which yielded true effects and say 15 opportunities to get a spurious p-value.
0.95 ^ 15 = 0.46
At this point, you are more likely to have achieved a spurious p-value than not from all the calculated p-values. Some of the p-values calculated are related so that may change the exact value but the probability of a spurious result is uncomfortably high.
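The same calculation as a small reusable helper (again assuming the tests are independent, which they aren’t quite, so treat the output as a rough guide):

```python
def p_any_spurious(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one 'significant' result among n_tests tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

print(f"{p_any_spurious(15):.2f}")   # ~0.54 for 15 tests at p < 0.05
```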
Remember that an increase in helping of 80% didn’t achieve significance when N=40. The effect size must be truly huge in order to achieve significance with N=16 (the actual effect size isn’t given in the report).
Because their prior for this effect being real should be fairly low (it’s huge and in the opposite direction to expectation), it would have been reasonable for the report to say that the p-value is probably spurious, with a note that this might be worth investigating further in the future.
Instead, the report ends up with a weird conclusion that low religion-as-a-quest scoring people are more likely to offer over-the-top help. The fact that they achieve an additional significant p-value when they introduce a new categorisation system (over-the-top help vs reasonable help) doesn’t add much to the likelihood of their conclusion – it just shows that they are able to look at their data and see a pattern.
Introducing new variables
At this point, another input variable is introduced. The original 3 types of religiosity were made up of scores from 6 different scales which were weighted to create the 3 types. Suddenly one of the 6 original scales is grabbed out (doctrinal orthodoxy) and this correlates even more strongly with giving over-the-top help (p<0.01).
Aaaaaaaaaaaaaaaaaaaah!!!
Introducing a new categorisation (over-the-top help) and a new variable (doctrinal orthodoxy) to try to explain a (probably) spurious p-value from multiple hypothesis testing is NOT a good idea.
We now have 4 different potential categorisations and 11 variables (the original 5 plus the 6 newly introduced scales). This makes 44 different potential p-values to calculate even before we consider the different types of tests that the authors might try (simple linear regression, ANOVA, stepwise multiple linear regression). I don’t think they calculated all of these 44+ p-values but rather looked at the data and decided which ones looked promising.
0.99 ^ 44 = 0.64
So now, even in the best case, a p<0.01 would happen in more than a third of similar experiments just by coincidence.
I don’t think that the effect described is impossible but I think the failure to adjust for multiple variables is a much more likely explanation.
Conclusion
So in conclusion, against all expectation, reading and preparing a talk on a parable given by the leader of your religion on how we should help people who are in need does, in fact, increase the likelihood that you will, in the next 5 minutes, help someone who is in need.
The fact that being in a hurry has a larger effect is the truly interesting finding here, but I think not a huge surprise.
This is why I asked the question the way I did in the introduction – I didn’t get the chance to guess this blind and I’m not sure which way I would have voted if I had.
I’m confident that I wouldn’t have predicted quite such a big drop of help between GS low hurry and GS high hurry so I’ll have to update accordingly (average score 3.8 down to score 1).
One final thing:
Aaaaaaaaaaaaaaaaaaaaaaaaah!!!
14 comments
Comments sorted by top scores.
comment by habryka (habryka4) · 2018-12-05T19:13:22.507Z · LW(p) · GW(p)
Promoted to curated: It's been a month since this was written, but I still think it's an excellent post, and I generally think that methodological analyses of this kind are quite valuable.
I can imagine some people finding the tone and expressiveness a bit off-putting, but I enjoyed it, and generally think most methodological analyses like this are too dry and don't expose the author's beliefs enough.
I do think there are some improvements to structure you could make. Some basic headings I think would have helped me a lot in not feeling a bit lost around the latter third of the post, but overall I found it enjoyable to read and the structure wasn't too big of a problem.
Replies from: Bucky, habryka4
↑ comment by Bucky · 2018-12-05T21:49:27.012Z · LW(p) · GW(p)
Thanks for the comments, I've added section headings so hopefully it reads a bit easier now.
To be honest I really didn't expect this to be as interesting to people as it was - glad to be proven wrong!
Replies from: SaidAchmiz
↑ comment by Said Achmiz (SaidAchmiz) · 2018-12-05T22:52:11.130Z · LW(p) · GW(p)
Would you mind making the headings into actual headings (instead of just bold text)? That would make the post easier to skim; also, that way, when the post is viewed on GreaterWrong, there will be an auto-generated table of contents!
Replies from: Raemon, Bucky
↑ comment by Raemon · 2018-12-06T00:02:39.448Z · LW(p) · GW(p)
FYI, LessWrong is about to deploy Table of Contents, and will support elements that are entirely bold being treated as a (small) heading. (This roughly matches what Google Docs does.) It might be good to have the two sites match if possible.
Replies from: ioannes_shade
↑ comment by ioannes (ioannes_shade) · 2018-12-08T16:25:07.109Z · LW(p) · GW(p)
The ToC feature is dope :-)
↑ comment by habryka (habryka4) · 2018-12-06T21:46:53.428Z · LW(p) · GW(p)
Oops, forgot to press the curate button. Now actually curated.
comment by ESRogs · 2018-10-31T07:35:00.062Z · LW(p) · GW(p)
53% of the people who had been preparing GS talks offered some kind of help (10/19). 29% of the people preparing non-GS talks stopped to help (6/21).
Wait, surely that means people who prepared a GS talk were 1.8x more likely to help than those with an alternative topic? Oh no, says the report. The difference was not significant at the p<0.05 level.
Isn't the first category larger than the second? ("some kind of help" vs "stopped to help")
How many of the 10 GS people who "offered some kind of help" did the "help indirectly" thing (score of 2 on the 0-5 scale)? How many of the 15 non-GS people who did not stop to help did help indirectly?
Replies from: Bucky↑ comment by Bucky · 2018-10-31T08:51:05.552Z · LW(p) · GW(p)
Sorry, that's my poor phrasing - the 53% and 29% are directly comparable. 10/19 GS people scored 2-5, compared to 6/21 non-GS.
From the report:
The percentages of subjects who offered aid by situational variable were, for low hurry, 63% offered help, intermediate hurry 45%, and high hurry 10%; for helping-relevant message 53%, task-relevant message 29%.
I've changed the phrasing in the OP.
Replies from: ESRogs
comment by Rob Hodde (rob-hodde) · 2018-12-10T18:14:34.265Z · LW(p) · GW(p)
A nice easy way to determine the viability of the experiment is to look at the "Adjusted R-Squared Value."
This tells you how much of the variation in experiment results is explained by the model.
Anything below .5 means the model is worse than flipping a coin. 1 means the model perfectly explains the outcome.
comment by philip_b (crabman) · 2018-12-08T02:26:11.176Z · LW(p) · GW(p)
How did you conclude that people who prepared GS are actually more likely to help than other people? Just from eyeballing 10/19 and 6/21 I can't conclude that this is enough evidence, only that this is suggestive.
Replies from: Bucky↑ comment by Bucky · 2018-12-08T09:13:12.322Z · LW(p) · GW(p)
From a frequentist perspective you're right.
From a Bayesian perspective, my prior would be that the GS condition would make people more likely to help. The likelihood calculation reinforces this belief. In terms of bits, my prior would have been (conservatively) 2-3 bits in favour of an effect and the experiment adds another 3-4(?) so I end up on about 6 bits. 64:1 is pretty good.