✨ Visual Editor

close

palette Canvas & Background

Gradient:arrow_forward
Text Color:
135°

style Card Style

40px
16px

text_fields Typography

16px
Crémieux
@cremieuxrecueil
What's more convincing?

p = 0.04 in a sample of 10 or p = 0.04 in a sample of 1,000,000?

🧵
Crémieux
@cremieuxrecueil
Pick an answer, then go to the next post.
Crémieux
@cremieuxrecueil
OK, now you've answered, and I hope you answered correctly: p = 0.04 in a sample of 10 will generally be much more convincing than that same p-value in a sample of 1,000,000 people.

The reason has to do with a paradox.
Crémieux
@cremieuxrecueil
This is my favorite illustration of the paradox (from @lakens):

Even when there's no effect, with a large sample, you'll find plenty of significant estimates because minuscule deviations from 'no effect' with a point null will often be significant if you have high power.
Thread image
Crémieux
@cremieuxrecueil
Because of this fact, the same p-value at different levels of power corresponds to very different levels of evidence.

So p = 0.04 in a sample of 1,000,000? That could be better evidence against an effect than for it.

That the essence of Lindley's paradox.
Crémieux
@cremieuxrecueil
Lindley's paradox makes it very clear why p-values are not measures of evidence absent context.

If we want reliable measures of the evidence for something, we can use likelihoods or Bayes factors instead.
Thread image
Crémieux
@cremieuxrecueil
The way Bayes factors work is by dividing the marginal likelihood of the data you observe under one hypothesis to its marginal likelihood under another hypothesis.

Or even simpler, comparing the probability of one model (M0) to another (M1) given an observation.*
Thread image
Crémieux
@cremieuxrecueil
I'm going to get to some results, but first:

The ladder of evidential strength proposed by Sir Harold Jeffreys goes from evidence being worth little more than an anecdote to being extreme.

This goes in reverse when the denominator is larger than the numerator.
Thread image
Crémieux
@cremieuxrecueil
In 2018, Hoekstra et al. evaluated the evidence for null effects in medicine.

Several trials had found nonsignificant results, but it wasn't clear how much evidence in favor of the null those trials provided.

There was only a modest relationship with p-values.
Thread image
Crémieux
@cremieuxrecueil
But p-values are evidently less important than sample size for determining if some effect isn't real.

Studies with larger sample sizes provided much better evidence in favor of no effect.
Thread image
Crémieux
@cremieuxrecueil
Hulme et al. evaluated the efficacy of hydroxychloroquine for treating COVID-19.

They found that reanalyzing a trial showing benefits from HCQ with patients who deteriorated, excluding the untested, or when the excluded were assumed positive, the evidence became fairly weak.
Thread image
Crémieux
@cremieuxrecueil
In another HCQ reanalysis, Wagenmakers and Gronau found that the evidence for a treatment effect among COVID-19 patients with pneumonia was only moderate, suggesting the need for more research.
Thread image
Crémieux
@cremieuxrecueil
In 1959, Festinger and Carlsmith conducted a now-famous study where they documented Festinger's recently-coined concept of "cognitive dissonance".

The study has been cited almost 6,000 times.
Thread image
Crémieux
@cremieuxrecueil
In the study, participants did tedious tasks for an hour and were then not paid, paid $1, or paid $20 to tell people that the tasks were interesting.

The paid groups ended up rating the tasks more interesting after all was said and done, showing 'cognitive dissonance.'
Thread image
Crémieux
@cremieuxrecueil
Reanalysis of this result in 2018 suggested that the evidence for cognitive dissonance was actually not more than anecdotal.

In other words, it was not worth more than a bare mention for demonstrating the phenomenon it was argued to show.
Thread image
Crémieux
@cremieuxrecueil
Bayes factors might be able to help with the abuse of p-values. The way they could do that is by forcing people to put more thought into their analyses.

So if I'm being realistic, I don't expect them to help, but I hope you'll agree they're pretty cool!
Crémieux
@cremieuxrecueil
Crémieux
@cremieuxrecueil
For clarity, the first post is asking about which is more convincing evidence of an effect being present.

The post on Jeffreys' classifications mentions going in reverse, but to be clear, what I mean is fractional Bayes factors being evidence for the denominator hypothesis/model
Crémieux
@cremieuxrecueil
Clarification on what I meant with the apostrophes around 'no effect' in the p-value plot:
Generated by Thread Navigator
100%
view_carousel Carousel Studio NEW
Press + S to quick-export