p-Hacking is a pervasive problem in the sciences. It’s been the downfall of many a scientist and the source of many a pop sci article. Many people have analyzed this phenomenon and produced some pretty familiar graphs, like these:

Look closely and you’ll see results piled up right around the conventional significance thresholds of p = 0.10, 0.05, and 0.01. This region between 0.10 and 0.01 is what psychologist Julia Rohrer has dubbed the “uncanny mountain”. For difference-in-differences and instrumental variables studies, the mode of the distribution sits squarely on the uncanny mountain. This distribution of p-values is, frankly, impossible without cheating.

Assume you have a field with the commonly recommended power level of 80%. If all the effects pursued in that field were “real”, then the z-values would be distributed like the yellow curve below:

In fact, with this knowledge, we can calculate that at 80% power and with all real effects, 12.60% of results should return p-values between 0.05 and 0.01. Under the null hypothesis, only 4% would be in that range. But there are problems with comparing this to the observed distribution of p-values. To name a few:

1. Scientists are not perfect: they do not know that every effect they’ve hypothesized about is real. Because many effects are probably not real, there should be more nonsignificant values. The distribution of z-values must shift to the left.

2. Many studies have more than 80% power for their target effects. These should have a distribution of z-values that’s shifted to the right if their effects are real.

3. Most studies have far less than 80% power, and often less than 50% or even 20%. The real distribution of z-scores should be shifted way to the left. As examples, consider environmental science, where the median power is 6-12%, or neuroscience, where the median power is 8-31%, or economics, where the median power is 10-18%.

4. Publication bias means published p-values will be censored because there’s a preference for the publication of significant results. In rarer cases, there’s a preference for nonsignificant, usually underpowered results.

So with incredibly low power, what proportion of results should be between p = 0.05 and 0.01? The answer is very few unless you’re right below 50% power, when around half of the significant results should be between 0.05 and 0.01. But most of the results at that level should actually not be significant.
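As a rough check on this arithmetic, here is a minimal Python sketch. It assumes a two-sided z-test whose statistic is normal with unit variance and some mean `delta`. That setup and the helper names (`prob_significant`, `delta_for_power`, `share_between`) are illustrative assumptions, not the calculation behind the figures above, so the output need not reproduce the 12.60% quoted earlier.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def critical_z(alpha: float) -> float:
    """Two-sided critical value, e.g. about 1.96 for alpha = 0.05."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2)


def prob_significant(delta: float, alpha: float) -> float:
    """P(two-sided p < alpha) when the z-statistic is N(delta, 1)."""
    z_crit = critical_z(alpha)
    return STD_NORMAL.cdf(delta - z_crit) + STD_NORMAL.cdf(-delta - z_crit)


def delta_for_power(power: float, alpha: float = 0.05) -> float:
    """Mean z-value that yields the requested power, found by bisection."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if prob_significant(mid, alpha) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def share_between(power: float, upper: float = 0.05, lower: float = 0.01) -> float:
    """Share of ALL results with lower < p < upper at the given power."""
    delta = delta_for_power(power, upper)
    return prob_significant(delta, upper) - prob_significant(delta, lower)


if __name__ == "__main__":
    # power = 0.05 corresponds to the null hypothesis (delta = 0).
    for power in (0.05, 0.10, 0.20, 0.50, 0.80):
        share = share_between(power)
        print(
            f"power={power:.2f}  share of all results: {share:.3f}  "
            f"share of significant results: {share / power:.3f}"
        )
```

Dividing the share by the power converts "share of all results" into "share of significant results", which is the version that matters when comparing against published p-values.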

The reason there are so many results between 0.05 and 0.01 is p-hacking: scientists know that these thresholds are important to other scientists, to editors, and to themselves after years of having it dinned into their brains that if the p-value is below 0.05, the result is real.

In the comments of that Twitter post, Hungry Brain author Stephan Guyenet asked what similar analyses of nutrition studies would look like.

Luckily, four relatively recent studies have provided large datasets including 3.62 million p-values from studies in the following fields: Pharmacological and Pharmaceutical Science; Multidisciplinary; Biological Science; Public Health and Health Service; Immunology; Medical and Health Science; Microbiology; Biomedical Engineering; Biochemistry and Cell Biology; Computer Science; Dentistry; Psychology and Sociology; Zoology; Animal, Veterinary and Agricultural Science; Complementary and Alternative Medicine; Education; Ecology, Evolution and Earth Sciences; Neuroscience; Genetics; Physiology; Chemistry and Geology; Plant Biology; Geography, Business and Economics; Informatics, Mathematics and Physics; Nutrition and Dietetics; Economics; Other; and NA.

These studies were Jager & Leek (2013), Brodeur et al. (2016), Head et al. (2015), and Chavalarias et al. (2016). All of their data is publicly available, so I will use it to examine which fields produced the most suspicious results, omitting the entries for “Other” and “NA”.

Let’s start with the abstracts. Scientists often present some of their flashiest results in the abstracts of their papers, where regardless of paywall status, the world can see them. We have data from the abstracts of 25/26 fields; data on abstracts in economics was not gathered. That field will have to wait for the full-text results.

Remember how most fields have low power and a field with 80% power should have 12.6% of its significant results between 0.05 and 0.01? Every field managed to beat that number, and I guarantee every one of these fields has less than 80% power. Since the proportions of each field’s results are overwhelmingly marginally significant like this, there is clearly something going on. But maybe it’s just that the main results are in the abstract and those results should be expected to be significant.
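To make the comparison concrete, here is a minimal sketch of how a per-field "bunching" share could be computed from a list of p-values. The function name `bunching_share`, the handling of values exactly at the thresholds, and the sample p-values are all hypothetical; this is not necessarily how the four datasets were processed.

```python
def bunching_share(p_values, upper=0.05, lower=0.01):
    """Share of significant p-values (p < upper) that are only marginal (p >= lower).

    Boundary handling is an assumption: p == upper counts as nonsignificant
    and p == lower counts as marginal.
    """
    significant = [p for p in p_values if p < upper]
    if not significant:
        return float("nan")  # no significant results, so the share is undefined
    marginal = [p for p in significant if p >= lower]
    return len(marginal) / len(significant)


if __name__ == "__main__":
    # Made-up p-values for one hypothetical field, for illustration only.
    example_field = [0.049, 0.032, 0.041, 0.004, 0.20, 0.0008, 0.012, 0.61]
    print(f"bunching share: {bunching_share(example_field):.2f}")
```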

So on to the full-texts. These are p-values obtained from looking through thousands of full papers. Here’s how they looked:

Without economics, the rank correlation ρ between abstract bunching percentages and full-text bunching percentages is an extremely significant 0.69.
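For readers unfamiliar with rank correlations, here is a minimal sketch of Spearman's ρ: rank each list (ties get the average rank), then take the ordinary Pearson correlation of the ranks. The helper names and the example percentages are hypothetical and are not the per-field values from the datasets.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up bunching percentages for five hypothetical fields.
    abstract_pct = [31.0, 28.5, 40.2, 35.1, 45.0]
    full_text_pct = [22.0, 25.3, 30.1, 27.8, 33.4]
    print(f"rho = {spearman_rho(abstract_pct, full_text_pct):.2f}")
```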

Now that it’s obvious, I should note that the suspicious-looking field that started this whole investigation is actually the best of the pack. It may have an excess of highly suspicious p-values compared to what it would have if researchers were unbiased, but it’s still far better than the next most suspicious, and by god is it better than plant biology, chemistry and geology, or informatics, mathematics and physics.

Let’s take a look at absolute z-value distributions by field.
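For reference, a reported two-sided p-value can be mapped to an absolute z-value with the inverse normal CDF. The sketch below assumes the p-values come from two-sided tests with approximately normal test statistics, which will not hold for every study in the datasets. The function name `abs_z_from_p` is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def abs_z_from_p(p: float) -> float:
    """Absolute z-value implied by a two-sided p-value (assumes a normal test statistic)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    # Use the lower tail (p / 2) so very small p-values do not round to a CDF of 1.
    return -STD_NORMAL.inv_cdf(p / 2)


if __name__ == "__main__":
    for p in (0.10, 0.05, 0.01, 0.001):
        print(f"p = {p:<6} ->  |z| = {abs_z_from_p(p):.3f}")
```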

Starting with the familiar example from economics, things don’t look too bad, and every other field looks worse than this. Some of them are much worse.

Everyone’s favorites, psychology and sociology?

It pains me to go on, but what about zoology?

Public health and health services?

Given the results so far, I am really not surprised that these plots are so much rarer outside economics and medicine. But let’s keep going. Here’s geography, business and economics:

And the place where theories go to wow the uneducated and confound skeptics, neuroscience:

I actually find it hard to believe this one is worse than neuroscience, but let’s see genetics:

Now for computer science:

The big one, medicine:

Biomedical engineering next:

Multidisciplinary work, clearly not being protected by leveraging the benefits of multiple disciplines:

Onto the biological sciences:

Going smaller, microbiology:

Dentistry:

Immunology:

Physiology:

Pharmacology:

Stephan’s request, nutrition:

Biochemistry:

The animal sciences:

At least alternative medicine is worse than normal medicine:

Plant biology:

Chemistry and geology:

And finally, worst of all, informatics, mathematics and physics:

This set of results is very similar to the aggregated one reported by van Zwet & Cator (2020).

They looked at more than a million z-values reported in Medline between 1976 and 2019 and found an extraordinary paucity of values between -2 and 2, bounds that sit very close to the conventional significance threshold of |z| = 1.96. Their result looked like the ones above:

Going through these examples clearly shows why the replication crisis is universal. Every field is beset by p-hacking and a fraudulent bunching of p-values around points that scientists think feel significant. Results that are obtained by cheating or that are selectively reported will obviously have a hard time replicating. But at least the economists are doing better stats than anyone else (p < 0.0000000000000001).