Self-Blinded Caffeine RCT

post by niplav · 2023-06-27T12:38:55.354Z · LW · GW · 9 comments

Contents

  Statistical Method
  Predictions on the Outcomes of the Experiment
  Analysis
    Summary Statistics
    Meditation
    Productivity and Creativity
    Mood
    Likelihood Ratios
  Conclusion
  Predictions on the Outcome of the Experiment
None
9 comments

I test 200mg caffeine in an n=1, m=50 self-blinded RCT. The outcomes are encouraging.

 

 Log-score of predictions of substanceAbsorption effect size d (λ, p, σ increase)Mindfulness effect size d (λ, p, σ increase)Productivity effect size d (λ, p, σ increase)Creativity effect size d (λ, p, σ increase)Happiness effect size d (λ, p, σ increase)Contentment effect size d (λ, p, σ increase)Relaxation effect size d (λ, p, σ increase)Horniness effect size d (λ, p, σ increase)
200mg Caffeine (n=1, m=40)-0.60.61 (λ=13.3, p=0.00017, -0.072)0.58 (λ=11.8, p=0.0007, 0.021)0.58 (λ=28.9, p=1.3-12, 0.11)0.38 (λ=32.9, p=5.2-15, 0.09)0.27 (λ=10.6, p=0.002, 0.3)0.13 (λ=7.66, p=0.02, 0.47)-0.11 (λ=5, p=0.15, 0.42)-0.14 (λ=1.9, p=0.64, 0.11)

I am especially interested in testing many different substances for their effect on meditation, while avoiding negative side effects. The benefits from high meditational attainments valuable to me, and seem especially likely to benefit from chemical intervention, since the Algernon argument likely doesn't apply: Meditative attainments might've not led to a fitness advantage (even, by opportunity cost, to a fitness disadvantage), and so were likely selected against, but most of us don't care that much about inclusive genetic fitness and more about psychological well-being. Evolutionary dynamics favor being like Dschingis Khan (dozens to hundreds of offspring) over Siddharta Gautama (one son), but I'd rather attain sotāpanna than pillage and murder.

And meditative attainments are costly: they take tens to hundreds to thousands of hours to reach, which would make simple psychopharmacological interventions worthwhile. I also don't buy that they miss the point of meditation—most people already struggle enough, so some help doesn't make it a cakewalk; "reach heaven through fraud". One must be careful not to fall into the trap of taking substances that feel good but lessen sensory clarity (which I believe was the original intent behind the fifth precept, and so I'll exclude e.g. opiates from the substances to test.

Variables tracked (see more here):

The total cost of the experiment is at least 21.5€:

200mg caffeine pills, placebo pills filled with sugar, of each 25. Put each pill with a corresponding piece of paper ("C" for caffeine, "P" for placebo) into an unlabeled envelope. Used seq 1 50 | shuf to number the envelopes, and sorted them accordingly.

Notes on the experiment:


Statistical Method

In general, I'll be working with the likelihood ratio test. For this, let  be the distribution of values of a variable for the placebo arm, and  the distribution of values for a variable of the caffeine arm. (I apologise for the  being ambiguous, since it could also refer to the control arm).

Then let  be the Gaussian maximum likelihood estimator for our placebo values, and  be the MLE for our caffeine values.

Then the likelihood ratio statistic  is defined as  where  is the likelihood the caffeine distribution assigns to the parameters . This test is useful here because we fix all values of . See Wasserman 2003 ch. 10.6 for more.

If , then the MLE for the placebo arm is very close to the MLE for the caffeine arm, the distributions are similar. If , then the MLE for the placebo arm is quite different from the caffeine arm (though there is no statement about which has higher values).  is not possible, since that would mean that the MLE of the placebo distribution has a higher likelihood for the caffeine data than the MLE of the caffeine distribution itself—not very likely.

Note that I'm not a statistician, this is my first serious statistical analysis, so please correct me if I'm making some important mistakes. Sorry.

Predictions on the Outcomes of the Experiment

After collecting the data, but before analysing it, I want to make some predictions about the outcome of the experiment, similar to another attempt here.

Moved here.

Analysis

We start by setting everything up and loading the data.

import math
import numpy as np
import pandas as pd
import scipy.stats as scistat

substances=pd.read_csv('../..//data/substances.csv')

meditations=pd.read_csv('../../data/meditations.csv')
meditations['meditation_start']=pd.to_datetime(meditations['meditation_start'], unit='ms', utc=True)
meditations['meditation_end']=pd.to_datetime(meditations['meditation_end'], unit='ms', utc=True)

creativity=pd.read_csv('../../data/creativity.csv')
creativity['datetime']=pd.to_datetime(creativity['datetime'], utc=True)

productivity=pd.read_csv('../../data/productivity.csv')
productivity['datetime']=pd.to_datetime(productivity['datetime'], utc=True)

expa=substances.loc[substances['experiment']=='A'].copy()
expa['datetime']=pd.to_datetime(expa['datetime'], utc=True)

The mood data is a bit special, since it doesn't have timezone info, but that is easily remedied.

mood=pd.read_csv('../../data/mood.csv')
alarms=pd.to_datetime(pd.Series(mood['alarm']), format='mixed')
mood['alarm']=pd.DatetimeIndex(alarms.dt.tz_localize('CET', ambiguous='infer')).tz_convert(tz='UTC')
dates=pd.to_datetime(pd.Series(mood['date']), format='mixed')
mood['date']=pd.DatetimeIndex(dates.dt.tz_localize('CET', ambiguous='infer')).tz_convert(tz='UTC')

Summary Statistics

We can first test how well my predictions fared:

probs=np.array(expa['prediction'])
substances=np.array(expa['substance'])
outcomes=np.array([0 if i=='sugar' else 1 for i in substances])

drumroll

>>> np.mean(list(map(lambda x: math.log(x[0]) if x[1]==1 else math.log(1-x[0]), zip(probs, outcomes))))
-0.5991670759554912

At least this time I was better than chance:

>>> np.mean(list(map(lambda x: math.log(x[0]) if x[1]==1 else math.log(1-x[0]), zip([0.5]*40, outcomes))))
-0.6931471805599453

Meditation

Merging the meditations closest (on the right) to the consumption and selecting the individual variables of interest:

meditations.sort_values("meditation_start", inplace=True)
meditations_a=pd.merge_asof(expa, meditations, left_on='datetime', right_on='meditation_start', direction='forward')
caffeine_concentration=meditations_a.loc[meditations_a['substance']=='caffeine']['concentration_rating']
placebo_concentration=meditations_a.loc[meditations_a['substance']=='sugar']['concentration_rating']
caffeine_mindfulness=meditations_a.loc[meditations_a['substance']=='caffeine']['mindfulness_rating']
placebo_mindfulness=meditations_a.loc[meditations_a['substance']=='sugar']['mindfulness_rating']

So, does it help?

>>> (caffeine_concentration.mean()-placebo_concentration.mean())/meditations['concentration_rating'].std()
0.6119357868347828
>>> (caffeine_mindfulness.mean()-placebo_mindfulness.mean())/meditations['mindfulness_rating'].std()
0.575981762563846

Indeed! Cohen's d here looks pretty good. Taking caffeine also reduces the variance of both variables:

>>> caffeine_concentration.std()-placebo_concentration.std()
-0.0720877290884765
>>> caffeine_mindfulness.std()-placebo_mindfulness.std()
0.02186797288826836

Productivity and Creativity

We repeat the same procedure for the productivity and creativity data:

prod_a=pd.merge_asof(expa, productivity, left_on='datetime', right_on='datetime', direction='forward')
creat_a=pd.merge_asof(expa, creativity, left_on='datetime', right_on='datetime', direction='forward')
caffeine_productivity=prod_a.loc[meditations_a['substance']=='caffeine']['productivity']
placebo_productivity=prod_a.loc[meditations_a['substance']=='sugar']['productivity']
caffeine_creativity=creat_a.loc[meditations_a['substance']=='caffeine']['creativity']
placebo_creativity=creat_a.loc[meditations_a['substance']=='sugar']['creativity']

And the result is…

>>> (caffeine_productivity.mean()-placebo_productivity.mean())/prod_a['productivity'].std()
0.5784143673702401
>>> (caffeine_creativity.mean()-placebo_creativity.mean())/creat_a['creativity'].std()
0.38432393552829164

Again surprisingly good! The creativity values are small enough to be a fluke, but the productivity values seem cool.

In this case, though, caffeine increases variance in the variables (not by very much):

>>> caffeine_productivity.std()-placebo_productivity.std()
0.1139221931098384
>>> caffeine_creativity.std()-placebo_creativity.std()
0.08619686235791152

Mood

Some unimportant pre-processing, in which we filter for mood recordings 0-10 hours after caffeine intake, since pd.merge_asof doesn't do cartesian product:

mood_a=expa.join(mood, how='cross')
mood_a=mood_a.loc[(mood_a['alarm']-mood_a['datetime']<pd.Timedelta('10h'))&(mood_a['alarm']-mood_a['datetime']>pd.Timedelta('0h'))]
caffeine_mood=mood_a.loc[mood_a['substance']=='caffeine']
placebo_mood=mood_a.loc[mood_a['substance']=='sugar']

And now the analysis:

>>> caffeine_mood[['happy', 'content', 'relaxed', 'horny']].describe()
           happy    content    relaxed      horny
count  88.000000  88.000000  88.000000  88.000000
mean   52.193182  51.227273  50.704545  46.568182
std     2.396635   2.911441   3.115254   3.117601
[…]
>>> placebo_mood[['happy', 'content', 'relaxed', 'horny']].describe()
           happy    content    relaxed      horny
count  73.000000  73.000000  73.000000  73.000000
mean   51.575342  50.876712  51.041096  47.000000
std     2.101043   2.437811   2.699992   3.009245
[…]

Which leads to d of ~0.27 for happiness, ~0.13 for contentment, ~-0.11 for relaxation and ~-0.14 for horniness.

Likelihood Ratios

We assume (at first) that the data is distributed normally. Then we can define a function for the gaussian likelihood of a distribution given some parameters:

def normal_likelihood(data, mu, std):
    return np.product(scistat.norm.pdf(data, loc=mu, scale=std))

And now we can compute the likelihood ratio  for the null hypothesis   for the placebo data , and also the result of the likelihood ratio test:

def placebo_likelihood(active, placebo):
    placebo_mle_lh=normal_likelihood(active, placebo.mean(), placebo.std())
    active_mle_lh=normal_likelihood(active, active.mean(), active.std())
    return active_mle_lh/placebo_mle_lh

def likelihood_ratio_test(lr):
    return 2*np.log(lr)

And this gives us surprisingly large values:

>>> placebo_likelihood_ratio(caffeine_concentration, placebo_concentration)
776.6147119766716
>>> likelihood_ratio_test(placebo_likelihood_ratio(caffeine_concentration, placebo_concentration))
13.309888722406932
>> placebo_likelihood_ratio(caffeine_mindfulness, placebo_mindfulness)
363.3984201164464
>>> likelihood_ratio_test(placebo_likelihood_ratio(caffeine_mindfulness, placebo_mindfulness))
11.790999616893938
>>> placebo_likelihood_ratio(caffeine_productivity, placebo_productivity)
1884090.6347491818
>>> likelihood_ratio_test(placebo_likelihood_ratio(caffeine_productivity, placebo_productivity))
28.8979116811553
>>> placebo_likelihood_ratio(caffeine_creativity, placebo_creativity)
14009015.173307568
>>> likelihood_ratio_test(placebo_likelihood_ratio(caffeine_creativity, placebo_creativity))
32.910423242578126

And, if one is interested in p-values, those correspond to (with 2 degrees of freedom each):

def llrt_pval(lmbda, df=2):
    return scistat.chi2.cdf(df, lmbda)

>>> llrt_pval([13.309888722406932,11.790999616893938, 28.8979116811553, 32.910423242578126])
array([1.66559304e-04, 7.23739116e-04 ,1.34836408e-12, 5.17222209e-15])

I find these results surprisingly strong, and am still kind of mystified why. Surely caffeine isn't that reliable!

And, the same, for mood:

>>> placebo_likelihood_ratio(caffeine_mood['happy'], placebo_mood['happy'])
204.81283712162838
>>> likelihood_ratio_test(placebo_likelihood_ratio(caffeine_mood['happy'], placebo_mood['happy']))
10.644193144917832
>>> placebo_likelihood_ratio(caffeine_mood['content'], placebo_mood['content'])
46.08310645632934
>>> likelihood_ratio_test(placebo_likelihood_ratio(caffeine_mood['content'], placebo_mood['content']))
7.6608928570645105
>>> placebo_likelihood_ratio(caffeine_mood['relaxed'], placebo_mood['relaxed'])
12.229945616108525
>>> likelihood_ratio_test(placebo_likelihood_ratio(caffeine_mood['relaxed'], placebo_mood['relaxed']))
5.007775005855661
>>> placebo_likelihood_ratio(caffeine_mood['horny'], placebo_mood['horny'])
2.670139324155222
>>> likelihood_ratio_test(placebo_likelihood_ratio(caffeine_mood['horny'], placebo_mood['horny']))
1.9642613047646074

And the p-values of those are:

>>> llrt_pval([10.644193144917832, 7.6608928570645105, 5.007775005855661, 1.9642613047646074])
array([0.0020736 , 0.02462515, 0.15015613, 0.63984027])

Conclusion

Caffeine appears helpful for everything except relaxation (and it makes me hornier, which I'm neutral about). I'd call this experiment a success and will be running more in the future, while in the meantime taking caffeine before morning meditations.

Full code for the experiment here.

Predictions on the Outcome of the Experiment

Predicting the outcomes of personal experiments give a useful way to train ones own calibration, I take it a step further and record the predictions for the world to observe my idiocy.

I also recorded my predictions about the content of the pill on PredictionBook (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50).

9 comments

Comments sorted by top scores.

comment by Tofly · 2023-06-27T22:58:30.627Z · LW(p) · GW(p)

Did you consider withdrawal effects at all?  A day without caffeine having used it the previous few days is going to be very different from one where you haven't used caffeine in months.

Replies from: niplav
comment by niplav · 2023-06-28T06:39:26.533Z · LW(p) · GW(p)

That's a good point. I have, but shelved them for sometime later™. If it were true that withdrawal effects are important, you'd expect the difference in placebo-caffeine scores to drop as the experiment progresses, but I haven't checked yet.

Replies from: Korz
comment by Mart_Korz (Korz) · 2023-06-28T15:28:51.339Z · LW(p) · GW(p)

you'd expect the difference in placebo-caffeine scores to drop

I am not sure about this. I could also imagine that the difference remains similar, but instead the baseline for concentration etc. shifts downwards such that caffeine-days are only as good as the old baseline and placebo-days are worse than the old baseline.

Replies from: niplav
comment by niplav · 2023-06-28T15:35:58.144Z · LW(p) · GW(p)

That's a good point, I'll keep it in mind.

comment by sludgepuddle · 2023-06-28T05:19:35.639Z · LW(p) · GW(p)

I'm surprised you were only able to predict whether you'd taken caffeine 80% of the time. 200 mg is not a heroic dose, but on little to no tolerance it should be quite noticeable.

Replies from: niplav
comment by niplav · 2023-06-28T06:41:13.599Z · LW(p) · GW(p)

Blinding is powerful! Not sure where you get 80% from, do you mean the number of times when I was directionally right in the prediction?

Replies from: ben-lang
comment by Ben (ben-lang) · 2023-06-28T15:17:02.700Z · LW(p) · GW(p)

I think they mean that beforehand you said "My prediction about the content of the pill is more accurate than random guesses80%" - meaning that you were 80% sure you would do better than a 50/50 guess of what type of pill it was throughout the trial. Then you found that you did indeed do better than 50/50, but didn't give a number and I think sludgepuddle thought you had guessed right 80% of the time.

Replies from: niplav
comment by niplav · 2023-06-28T15:22:26.216Z · LW(p) · GW(p)

Ah, I see. In the table is the log score of my pill predictions (-0.6). I made predictions about the accuracy of my predictions—confusing perhaps.

Replies from: Korz
comment by Mart_Korz (Korz) · 2023-06-28T15:49:28.864Z · LW(p) · GW(p)

log score of my pill predictions (-0.6)

If did not make a mistake, this score could be achieved by e.g. giving ~55% probabilities and being correct every time or by always giving 70% probabilities and being right ~69 % of the time.