In my previous article, The Problem is Not Plagiarism, but Cargo Cult Science, I claimed that upwards of 90% of non-STEM academic papers should be classified as Cargo Cult Science. This assertion prompted a challenge from a HackerNews commenter who said I should “Pick a field and debunk it if you can.”

To reiterate, Richard Feynman considered the main feature of Cargo Cult Science to be the following:

But there is one feature I notice that is generally missing in Cargo Cult Science. That is the idea that we all hope you have learned in studying science in school—we never explicitly say what this is, but just hope that you catch on by all the examples of scientific investigation. It is interesting, therefore, to bring it out now and speak of it explicitly. It’s a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty—a kind of leaning over backwards. For example, if you’re doing an experiment, you should report everything that you think might make it invalid—not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you’ve eliminated by some other experiment, and how they worked—to make sure the other fellow can tell they have been eliminated.

I’ve selected psychology as my Cargo Cult Science field of choice primarily because it’s one of the most obviously fraught subjects and due to my own brief academic detour into this field after dropping out of engineering (a very questionable decision in hindsight).

In psychology, papers either are p-hacked garbage that doesn’t replicate or, more commonly, have no scientific value due to their methodology or subject of study. These papers aren’t necessarily “false” per se, so much as they produce no valuable knowledge by design, serving mainly to advance the careers of the authors through journal publications and the gathering of press and citations. I consider all of these Cargo Cult Science.

While p-hacking and the replication crisis are widely acknowledged, albeit not solved, problems, I argue that these are minor compared to the four major issues that undermine psychology: reliance on surveys, disregarding indexicality, studying nouns, and using statistics.

This post will delve into the weaknesses in psychological research outside of clinical psychology. In a subsequent post, I plan to explore the problems with “mental illnesses” (especially depression) and their treatment.

One of the most prominent paradigms in psychology, popularized by Daniel Kahneman’s seminal book “Thinking, Fast and Slow,” revolves around the concepts of biases, nudging, and priming. Kahneman, in collaboration with Amos Tversky, laid the groundwork for the field now known as behavioral economics.

This line of research studies how “irrational” people behave in various scenarios and suggests that subtle cues can be used to “nudge” or “prime” individuals toward making more “rational” decisions. It operates on the assumption that human behavior is predominantly governed by unconscious and irrational forces, a concept Kahneman refers to as System 1, which eludes our direct control.

Daniel Kahneman explains this view as follows:

When I describe priming studies to audiences, the reaction is often disbelief. This is not a surprise: System 2 believes that it is in charge and that it knows the reasons for its choices. Questions are probably cropping up in your mind as well: How is it possible for such trivial manipulations of the context to have such large effects? Do these experiments demonstrate that we are completely at the mercy of whatever primes the environment provides at any moment? Of course not. The effects of the primes are robust but not necessarily large. Among a hundred voters, only a few whose initial preferences were uncertain will vote differently about a school issue if their precinct is located in a school rather than in a church—but a few percent could tip an election. 

The idea you should focus on, however, is that disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.

I, like Science Banana in her post Against Automaticity, will argue that disbelief is not only an option but that it is the correct option.

The automaticity frame, where people are seen as helplessly malleable via their unconscious System 1, also infiltrated common beliefs about advertising mechanisms. Kevin Simler addresses this in his essay Ads Don’t Work That Way, where he critiques the prevalent notion that advertisements primarily work by manipulating our emotions. Traditional advertising theory suggests that ads forge positive links between a product and emotions, subtly influencing our purchasing decisions over time.

However, Simler introduces an alternative, more “rational” theory he terms “cultural imprinting.” This theory posits that advertising informs consumers about the social signals they broadcast when purchasing a product:

If I’m going to bring Corona to a party or backyard barbecue, I need to feel confident that the message I intend to send (based on my own understanding of Corona’s cultural image) is the message that will be received. Maybe I’m comfortable associating myself with a beach-vibes beer. But if I’m worried that everyone else has been watching different ads (“Corona: a beer for Christians”), then I’ll be a lot more skittish about my purchase.

This explanation has the advantage not only of making more intuitive sense but also of not treating consumers as agents wholly governed by subconscious emotional associations.

In contrast, the typical emotional inception theory appears almost ridiculous: the idea that mere exposure to appealing imagery associated with a product will inherently foster a positive association and drive purchase decisions. Simler highlights how extraordinary the claim is:

This meme or theory about how ads work — by emotional inception — has become so ingrained, at least in my own model of the world, that it was something I always just took on faith, without ever really thinking about it. But now that I have stopped to think about it, I’m shocked at how irrational it makes us out to be. It suggests that human preferences can be changed with nothing more than a few arbitrary images. Even Pavlov’s dogs weren’t so easily manipulated: they actually received food after the arbitrary stimulus. If ads worked the same way — if a Coke employee approached you on the street offering you a free taste, then gave you a massage or handed you $5 — well then of course you’d learn to associate Coke with happiness.

I reference this example to highlight two critical points. First, it demonstrates that it is feasible to find a more logical and rational explanation for a phenomenon that looks “irrational” at first glance. Second, it shows how, once an alternative explanation is examined, the widely accepted belief can come to seem very unconvincing.

Research within the “Bias, Nudging, and Priming” paradigm typically falls into one of two categories: it either fails to replicate, or the observed effects can be more convincingly explained by rational behavior, which may appear unusual within the confines of an artificial experimental environment. Gerd Gigerenzer’s paper, The Bias Bias in Behavioral Economics, provides an in-depth analysis of the issues surrounding various biases that are said to “replicate”:

Behavioral economics began with the intention of eliminating the psychological blind spot in rational choice theory and ended up portraying psychology as the study of irrationality. In its portrayal, people have systematic cognitive biases that are not only as persistent as visual illusions but also costly in real life—meaning that governmental paternalism is called upon to steer people with the help of “nudges.” These biases have since attained the status of truisms. In contrast, I show that such a view of human nature is tainted by a “bias bias,” the tendency to spot biases even when there are none. This may occur by failing to notice when small sample statistics differ from large sample statistics, mistaking people’s random error for systematic error, or confusing intelligent inferences with logical errors. Unknown to most economists, much of psychological research reveals a different portrayal, where people appear to have largely fine-tuned intuitions about chance, frequency, and framing.

We cannot go through the whole paper, which I highly recommend reading, but we will discuss two illustrative examples:

Framing effects were said to violate “description invariance,” an “essential condition” for rationality (Tversky and Kahneman, 1986, p. S253). People were also said to lack insight: “in their stubborn appeal, framing effects resemble perceptual illusions more than computational errors” (Kahneman and Tversky, 1984, p. 343). Similarly, when a study found that experienced national security intelligence officers exhibited larger “framing biases” than did college students, this finding was interpreted as a surprising form of irrationality among the experienced officers (Reyna et al., 2014).

Consider a classical demonstration of framing. A patient suffering from a serious heart disease considers high-risk surgery and asks a doctor about its prospects. The doctor can frame the answer in two ways:

Positive Frame: Five years after surgery, 90% of patients are alive.

Negative Frame: Five years after surgery, 10% of patients are dead.

Should the patient listen to how the doctor frames the answer? Behavioral economists say no because both frames are logically equivalent (Kahneman, 2011). Nevertheless, people do listen. More are willing to agree to a medical procedure if the doctor uses positive framing (90% alive) than if negative framing is used (10% dead) (Moxey et al., 2003). Framing effects challenge the assumption of stable preferences, leading to preference reversals. Thaler and Sunstein (2008), who presented the above surgery problem, concluded that “framing works because people tend to be somewhat mindless, passive decision makers” (p. 40).

However, as Gigerenzer explains, viewing people as “mindless” is not the only way to interpret the results. The opposite is likely the case:

For the patient, the question is not about checking logical consistency but about making an informed decision. To that end, the relevant question is: Is survival higher with or without surgery? The survival rate without surgery is the reference point against which the surgery option needs to be compared. Neither “90% survival” nor “10% mortality” provides this information.

There are various reasons why information is missing and recommendations are not directly communicated. For instance, U.S. tort law encourages malpractice suits, which fuels a culture of blame in which doctors fear making explicit recommendations (Hoffman and Kanzaria, 2014). However, by framing an option, doctors can convey information about the reference point and make an implicit recommendation that intelligent listeners intuitively understand. Experiments have shown, for instance, that when the reference point was lower, that is, fewer patients survived without surgery, then 80–94% of speakers chose the “survival” frame (McKenzie and Nelson, 2003). When, by contrast, the reference point was higher, that is, more patients survived without surgery, then the “mortality” frame was chosen more frequently. Thus, by selecting a positive or negative frame, physicians can communicate their belief whether surgery has a substantial benefit compared to no surgery and make an implicit recommendation.

The same is true for the framing effect in the “Asian disease problem,” which presents a scenario where the United States faces an outbreak expected to kill 600 people. Two programs to address the disease are proposed, each described in either a positive or negative light.

Positive Frame

If Program A is adopted, 200 people will be saved.

If Program B is adopted, there is a 1/3 probability that 600 people will be saved and a 2/3 probability that no people will be saved.

Negative Frame

If Program A is adopted, 400 people will die.

If Program B is adopted, there is a 1/3 probability that nobody will die and a 2/3 probability that 600 people will die.

Tversky and Kahneman observed that people’s preferences varied with the framing: Program A was favored under positive framing, while Program B was chosen more often under negative framing. This outcome, contradicting the notion of stable preferences, “demonstrates” that people tend to avoid risk when a positive frame is presented (preferring the certain saving of 200 lives) and seek risk in the face of potential losses (opting for the chance to save all 600 lives). Tversky and Kahneman argued that since both scenarios are logically equivalent, framing should not alter people’s preferences.

But if we again assume that people make intelligent inferences about left-out information, this discrepancy is not irrational. Note that the “certain” options are both incomplete while the “risky” options include numbers for both those who will be saved and those who will die. We can include this missing information and rewrite the positive frame as:

If Program A is adopted, 200 people will be saved and 400 people will not be saved.

If Program B is adopted, there is a 1/3 probability that 600 people will be saved and a 2/3 probability that no people will be saved.

Gigerenzer notes the perhaps surprising effect of including this logically irrelevant information:

If any of the proposed explanations given—people’s susceptibility to framing errors, their risk aversion for gains and risk seeking for losses, or the value function of prospect theory—were true, this addition should not matter. But in fact, it changes the entire result. When the information was provided completely, the effect of positive versus negative frames disappeared in one study after the other (e.g. Kühberger, 1995; Kühberger and Tanner, 2010; Mandel, 2001; Tombu and Mandel, 2015). Further studies indicated that many people notice the asymmetry and infer that the incomplete option means that at least 200 people are saved because, unlike in Program B, the information for how many will not be saved is not provided (Mandel, 2014). Under this inference, Program A guarantees that 200 or more people are saved, as opposed to exactly 200. Thus, people’s judgments appear to have nothing to do with unstable preferences or with positive versus negative framing. The incompleteness of Program A alone drives the entire effect. When supplied with incomplete information, people have to make intelligent inferences. Doing so is not a bias.

Gigerenzer’s critique of other biases often follows this pattern. He shows that the standard definition of “rationality” (rational choice theory) doesn’t make sense in the complex, context-dependent reality in which people make decisions. It is therefore no wonder when people’s choices deviate from that theory, but it is not a “bias.” Gigerenzer advocates for a theory he terms “ecological rationality,” which emphasizes the role of context in human decision-making.

To illustrate this point further, we can look at another well-known example. Many experiments show people believe that a sequence like HHTH (where “H” denotes heads and “T” denotes tails) is more likely than HHHH when flipping a coin multiple times. Behavioral economists criticize this belief, arguing that all sequences are equally likely.

However, Gigerenzer shows that this is not necessarily correct. When we look at a finite sample longer than the pattern, the sequence HHTH is indeed more likely to occur than HHHH. Only when we are concerned with an infinite sample (impossible in real life) or a sample exactly the length of the pattern (in this case 4) are the two patterns equally likely. The reason is as follows. For either pattern to occur, we first need two heads in a row. When searching for the HHHH pattern, we then need a third heads. Consider the unlucky case where the coin shows tails: we must start the pattern from the very beginning, looking for our first heads again. Now examine the other pattern. We are looking for HHTH, so we want tails on our third throw. The difference is that the unlucky case of a heads does not throw away all our progress: the last two heads still count as the first two heads of a fresh HHTH attempt, so we remain only a tails and a heads away from completing the pattern. When searching for HHHH, by contrast, the unlucky T helps us not at all, and we must look for our first heads again.

Human intuition is therefore not flawed as suggested by conventional bias research. In real-world scenarios, where infinite samples don’t exist and sample sizes don’t exactly match the pattern length, people’s intuitions about probability patterns are accurate: HHTH occurs more often than HHHH. The “mathematically correct” answer is mostly wrong in practice. Instead of calling this result a “bias” or “irrational”, we should question the artificial nature of the experiment.
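If you would rather not take this argument on faith, the claim is easy to check with a quick Monte Carlo simulation. The sketch below (Python; the function name, flip count, and trial count are my own choices) estimates how often each pattern shows up at least once in a sequence of 20 flips:

```python
import random

def appears(pattern: str, n_flips: int = 20) -> bool:
    """Does `pattern` occur at least once in a random sequence of coin flips?"""
    seq = "".join(random.choice("HT") for _ in range(n_flips))
    return pattern in seq

trials = 100_000
for pattern in ("HHTH", "HHHH"):
    rate = sum(appears(pattern) for _ in range(trials)) / trials
    print(pattern, round(rate, 3))
```

HHTH turns up in noticeably more sequences than HHHH. The two patterns have the same expected number of occurrences, but occurrences of HHHH overlap and “clump” together, so fewer sequences contain one at all.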

For those who are unconvinced, Gigerenzer proposes the law-of-small-numbers bet:

You flip a fair coin 20 times. If this sequence contains at least one HHHH, I pay you $100. If it contains at least one HHHT, you pay me $100. If it contains neither, nobody wins.

Do not take it.
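The bet itself can also be simulated directly. A minimal sketch under the stated terms (20 flips per game, $100 each way; the trial count is my own choice) shows the average payoff per game is comfortably negative:

```python
import random

def flips(n: int = 20) -> str:
    return "".join(random.choice("HT") for _ in range(n))

trials = 100_000
total = 0
for _ in range(trials):
    seq = flips()
    if "HHHH" in seq:
        total += 100  # Gigerenzer pays you
    if "HHHT" in seq:
        total -= 100  # you pay Gigerenzer
print(total / trials)  # average winnings per game: clearly below zero
```

The asymmetry is the same one as before: HHHT does not overlap with itself the way HHHH does, so it appears in more 20-flip sequences.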

In his response to Banana’s critique titled “Here’s Why Automaticity Is Real Actually,” Scott Alexander acknowledges the debunking of several striking findings in the field but maintains that certain “cognitive biases are real, well-replicated, and have significant explanatory value.”

I find his argument less than persuasive, particularly regarding his assertion:

[Cognitive biases] exist. They can be demonstrated in the lab. They tell us useful things about how our brains work. Some of them matter a lot, in the sense that if we weren’t prepared for them, bad things would happen.

Contrary to Scott Alexander, I hold that cognitive biases can only be “demonstrated” in a laboratory setting, that they offer little insight into the actual workings of our brains and have little to no explanatory value, and that a lack of preparedness for these “biases” does not lead to negative consequences in real life.

Alexander cites hyperbolic discounting as an example of a cognitive bias that could lead to adverse outcomes if not recognized:

Hyperbolic discounting is a cognitive bias. But when it affects our everyday life, we call it by names like “impulsivity” or “procrastination”.

Hyperbolic discounting is not a bias in any meaningful sense of the term. To understand his claim, consider the Wikipedia entry for Hyperbolic discounting:

Given two similar rewards, humans show a preference for one that arrives in a more prompt timeframe. Humans are said to discount the value of the later reward, by a factor that increases with the length of the delay. In the financial world, this process is normally modeled in the form of exponential discounting, a time-consistent model of discounting. Many psychological studies have since demonstrated deviations in instinctive preference from the constant discount rate assumed in exponential discounting. Hyperbolic discounting is an alternative mathematical model that agrees more closely with these findings.

What Scott calls a “bias” is the fact that people do not have a time-consistent discount rate. An example taken from Time Discounting and Time Preference: A Critical Review by Frederick, Loewenstein, and O’Donoghue might help:

For example, Richard Thaler (1981) asked subjects to specify the amount of money they would require in [one month/one year/ten years] to make them indifferent to receiving $15 now. The median responses [$20/$50/$100] imply an average (annual) discount rate of 345 percent over a one-month horizon, 120 percent over a one-year horizon, and 19 percent over a ten-year horizon.

The term “bias” denotes the fact that even though the exponential discounting model predicts that the annual discount rate should be the same no matter the time frame, it isn’t.
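To make the inconsistency concrete, one can back out the implied annual rate for each horizon from Thaler’s medians. A minimal sketch, assuming continuous compounding (the convention under which the quoted figures come out exactly):

```python
from math import log

now = 15  # dollars, received today
cases = [("one month", 1 / 12, 20), ("one year", 1, 50), ("ten years", 10, 100)]
for label, years, future in cases:
    r = log(future / now) / years  # implied continuously compounded annual rate
    print(f"{label}: about {r:.0%} per year")
# -> roughly 345%, 120%, and 19%. A single exponential discount rate would
#    have to come out the same for all three horizons; it plainly does not.
```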

A more accurate interpretation of this finding would be to acknowledge it as evidence that a well-known economic model fails to accurately predict human behavior. In other words, when humans don’t conform to your theory, search for the problem in the theory, not in the humans.

This broad pattern of calling people’s actions and choices “biased” when they do not agree with some famous mathematical theory explains much of what Gigerenzer calls “The Bias Bias.” The underlying assumption of such labeling is that deviations from theoretical models indicate “irrationality,” potentially leading to adverse outcomes. This assumption forms the basis for the argument in favor of “nudging” individuals towards certain behaviors.

However, the claim that these so-called biases lead to negative consequences is not supported by any evidence. On the contrary, as discussed, real-life scenarios often demonstrate that people’s intuitive decisions can be more “rational” than those guided by naive application of mathematical models. In many cases, these intuitive heuristics take into account a surprising amount of complexity of the real world that is difficult to model mathematically.

Scott Alexander’s selection of cognitive biases that he argues are well-replicated serves as a further example of this phenomenon. Remember, these are the best he could come up with:

The conjunction fallacy is not only well-replicated, but easy to viscerally notice in your own reasoning when you look at classic examples. I think it’s mostly survived critiques by Gigerenzer et al, with replications showing it happens even among smart people, even when there’s money on the line, etc. But even the Gigerenzer critique was that it’s artificial and has no real-world relevance, not that it doesn’t exist.

Loss aversion has survived many replication attempts and can also be viscerally appreciated. The most intelligent critiques, like Gal & Rucker’s, argue that it’s an epiphenomenon of other cognitive biases, not that it doesn’t exist or doesn’t replicate.

The big prospect theory replication paper concluded that “the empirical foundations for prospect theory replicate beyond any reasonable thresholds.”

The conjunction fallacy is obviously a consequence of the artificial lab setting where subjects make informed guesses that are often correct in the real world but look foolish in the experiment. I doubt even Scott would claim it has any real-world applications. Moreover, as is often the case, the effect vanishes when expressed in frequencies as opposed to probabilities or when participants are forced to slow down:

In one experiment the question of the Linda problem was reformulated as follows:

There are 100 persons who fit the description above (that is, Linda’s). How many of them are:

Bank tellers? __ of 100

Bank tellers and active in the feminist movement? __ of 100

Whereas previously 85% of participants gave the wrong answer (bank teller and active in the feminist movement), in experiments using this formulation none of the participants gave a wrong answer.

The other two points on Scott’s list are almost identical, as prospect theory relies heavily on loss aversion. Prospect theory acknowledges that people’s choices can deviate from traditional expected utility theory – such as valuing losses more heavily than equivalent gains – but it does not offer an underlying reason for these preferences. It merely labels these deviations “biases.”

In the case of loss aversion, I have to side with Nassim Taleb, who wrote in Skin in the Game:

The flaw in psychology papers is to believe that the subject doesn’t take any other tail risks anywhere outside the experiment and will never take tail risks again. The idea of “loss aversion” has not been thought through properly – it is not measurable the way it has been measured (if at all measurable). Say you ask a subject how much he would pay to insure a 1% probability of losing $100. You are trying to figure out how much he is “overpaying” for “risk aversion” or something even more stupid, “loss aversion”. But you cannot possibly ignore all the other present and future financial risks he will be taking. You need to figure out other risks in the real world: if he has a car outside that can be scratched, if he has a financial portfolio that can lose money, if he has a bakery that may risk a fine, if he has a child in college who may cost unexpectedly more, if he can be laid off. All these risks add up and the attitude of the subject reflects them all. Ruin is indivisible and invariant to the source of randomness that may cause it.

More formally, loss aversion may be rational in non-ergodic systems.

As I explained in An Introduction to Ergodicity four years ago, an ergodic system is technically defined as a system that has the same behavior averaged over time as it has averaged over the space of all the system’s states.

More intuitively, it’s the difference between rolling a die and Russian roulette. When rolling a die, it doesn’t matter whether one person rolls it 100 times or whether we let 100 people roll one die each, making it ergodic. In contrast, Russian roulette is non-ergodic. If we let one person play Russian roulette 100 times (assuming the chance of death is, say, 1/6), he will almost certainly die (1 − (5/6)^100 ≈ 0.9999999879), but if we let 100 people play Russian roulette once each, we expect only 1/6 × 100 ≈ 16.7 of them to die.
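The arithmetic is easy to verify, assuming a 1/6 chance of death per round:

```python
p = 1 / 6  # assumed chance of death in one round of Russian roulette

# Time average: one person playing 100 rounds almost certainly dies.
print(1 - (1 - p) ** 100)  # ~0.9999999879

# Ensemble average: 100 people playing one round each.
print(100 * p)  # ~16.7 expected deaths
```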

The concept of ergodicity is crucial to understanding phenomena like Gambler’s Ruin: a gambler with finite wealth who keeps betting goes bankrupt with certainty in a fair or unfavorable game, and even when every bet has positive expected value he faces a substantial probability of eventual ruin. This outcome arises because, in a non-ergodic system, accumulated losses can reach a point of no return. In Gambler’s Ruin, that point is bankruptcy; in Russian roulette, it’s death.

Taleb’s point above is that in a world where the risk of bankruptcy exists (non-ergodic), it is rational for people not to value losses equally with gains. For instance, in experiments where subjects are reluctant to take a fair coin toss with a potential gain of $1200 or a loss of $1000 (an expected value of $100), their hesitance reflects an understanding that multiple losses in similar bets could lead to financial ruin in real life. The “mistake” participants make is failing to separate the hypothetical nature of the experiment (where going bankrupt doesn’t matter) from their real-life financial situation.
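A minimal simulation illustrates the point. Assume (my own arbitrary numbers) a bettor with a $5,000 bankroll who takes the +$1200/-$1000 coin flip over and over, and count him as ruined once he can no longer cover a loss:

```python
import random

def ruined(bankroll: int = 5_000, gain: int = 1_200, loss: int = 1_000,
           max_bets: int = 10_000) -> bool:
    """Repeatedly take a fair coin flip paying +$1200 / -$1000 (EV = +$100)."""
    for _ in range(max_bets):
        if bankroll < loss:
            return True  # can no longer cover the downside: ruin
        bankroll += gain if random.random() < 0.5 else -loss
    return False

trials = 10_000
print(sum(ruined() for _ in range(trials)) / trials)
# With these numbers, roughly 40% of simulated bettors go bust, even though
# every individual bet has a positive expected value of $100.
```

Refusing such a bet is not obviously a “bias”; it is what keeps you in the game.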

From this perspective, behavior that might seem irrational or overly cautious in an isolated laboratory setting (especially with the use of play money) can be entirely rational in the context of real-world uncertainties. In situations where ergodicity does not apply, “loss aversion” is not so much a cognitive bias as it is a practical strategy for not accidentally going bust.

Just like “Hyperbolic discounting”, “loss aversion” is mainly evidence of a flawed mainstream economic model, not a “bias” requiring correction. In these instances, people’s choices often display a level of “rationality” that accounts for complex circumstances, which mathematical models too easily overlook.

Scott Alexander cites other examples, such as the impact of default tipping options in services like Uber, as evidence for the reality of “nudging” and “priming.” It is obviously correct that the default tipping options influence how much customers tip. In fact, it is so obvious that it’s almost laughable, as Scott himself admits:

But you shouldn’t need to hear that scientists have replicated this. If you’ve ever taken a taxi, you should have a visceral sense of how yeah, you mostly just click a tip option somewhere in the middle, or maybe the last option if the driver was really good and you’re feeling generous.

This form of “nudging” is not at all similar to the surprising “How is it possible for such trivial manipulations of the context to have such large effects?” that Kahneman is famous for. I may similarly observe that what’s in my fridge “nudges” my decision of what I will eat later today. Scott seems to believe that this is evidence in favor of automaticity.

When discussing the placebo effect, it’s important to distinguish between two commonly conflated phenomena. The first is the role of placebos in clinical trials. In this context, a placebo serves to filter out noise, the self-healing of the body, regression to the mean, and other effects that shouldn’t be attributed to the medication. The placebo is explicitly chosen because it has no therapeutic effect, simulating “doing nothing.” In this sense, the “placebo effect” is not located in the pill, which does nothing, but is a consequence of spontaneous recovery and other effects that are classified as noise in clinical trials. This version of the placebo effect – regression to the mean and spontaneous recovery – is obviously real and not strange at all.
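Regression to the mean alone can manufacture an impressive-looking “placebo effect.” Here is a toy simulation with entirely made-up numbers: each person’s symptom score is a stable personal baseline plus day-to-day noise, and only people having an unusually bad day enroll in the trial:

```python
import random

random.seed(0)
enrolled = []
for _ in range(100_000):
    baseline = random.gauss(50, 10)                 # typical symptom level
    at_enrollment = baseline + random.gauss(0, 10)  # today's measurement
    if at_enrollment > 70:                          # only the acutely bad enroll
        follow_up = baseline + random.gauss(0, 10)  # later, with NO treatment
        enrolled.append((at_enrollment, follow_up))

before = sum(b for b, _ in enrolled) / len(enrolled)
after = sum(f for _, f in enrolled) / len(enrolled)
print(round(before, 1), round(after, 1))
# Scores "improve" markedly at follow-up without any intervention, purely
# because enrollment selected for unusually bad days.
```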

However, the popular understanding of the placebo effect differs. It refers to the belief that a person can experience tangible health benefits from taking a “placebo” (a sugar pill), simply from believing they are receiving treatment. Moreover, these benefits are said to go beyond doing nothing. The effect, in this case, is located in the pill itself. This interpretation has been extended to suggest that “a placebo can work even when you know it’s a placebo,” a claim that leads one to question the sanity of everyone involved.

Studies comparing the effects of a placebo (a sugar pill) to doing nothing frequently show that the mystical “placebo effect” is non-existent. Even proponents of the mystical placebo effect cautiously note its limitations:

placebos won’t work for every medical situation—for example, they can’t lower cholesterol or cure cancer. But they can work for conditions that are defined by “self-observation” symptoms like pain, nausea, or fatigue.

The mystical placebo effect is “real” for conditions that can only be observed subjectively. A rather curious effect indeed. Even then, the effect probably doesn’t exist.

I posit that for the same category of conditions, thinking “I will get better in 10 min” also “helps.” It’s important to note that helping, in this case, means people answer a survey differently next time, which brings us to our next topic.

Many findings in psychology rely on surveys, often because they are the most accessible, cost-effective, or the only possible method for collecting certain types of data. However, like our friendly Banana, I consider surveys to be bullshit:

It is not the case that knowledge can never be obtained in this manner. But the idea that there exists some survey, and some survey conditions, that might plausibly produce the knowledge claimed, tends to lead to a mental process of filling in the blanks, of giving the benefit of the doubt to surveys in the ordinary case. But, I think, the ordinary survey, in its ordinary conditions, is of no evidentiary value for any important claim. Just because there exist rare conditions where survey responses tightly map to some condition measurable in other ways does not mean that the vast majority of surveys have any value.

Banana points out that there exists no historical instrument similar to the survey. Like statistics, which we will discuss later, surveys are a modern invention. Moreover, from a legal perspective, surveys can be classified as hearsay. Unlike a direct witness, who can be cross-examined to establish the veracity of his claims, a survey-taker cannot be questioned; there is no way to establish the credibility and accuracy of the claims. We don’t have the opportunity to probe deeper, to ask follow-up questions, or even to ensure that the respondent has understood the question in the intended manner. This limitation is significant because, as we saw above in the Asian disease example, people make all sorts of assumptions beyond what the scientists intended. What makes this flaw potentially fatal is that we don’t notice when something goes wrong:

in the case of surveys, even if all assumptions fail, if all the pieces of the machine fail to function, data is still produced. There is no collapse or apparent failure of the machinery. But the data produced are meaningless—perhaps unbeknownst to the audience, or even to the investigators.

Banana identifies four critical components necessary for a survey to yield meaningful data: attention, sincerity, motivation, and comprehension.

She writes about attention:

In order for the survey to be meaningful, the subjects must be paying attention to the items. This is the first indication that the relationship between survey giver and survey taker is an adversarial one. The vast majority of surveys are conducted online, often by respondents who are paid for their time. (Just to give you an idea, I happened to see figures in recent publications of fifty cents for a ten-minute task and $1.65 for fifteen minutes.)

This adversarial relationship is likely part of the reason survey-takers are not always sincere. Banana attributes the “lizardman constant” (the percentage of respondents who agree that the world is run by lizardmen) and the finding that gay teens are much more likely to be pregnant to this lack of sincerity. Respondents may simply want to have fun and give ridiculous answers, especially since they are anonymous, instead of sincerely answering every question.

Even if the survey-takers are attentive and sincere, their motivation is not clear:

A related adversarial motivation is making a point. In the normal course of conversation, in ordinary language use, one forms opinions about why the speaker is saying what she is saying, and prepares a reply based as much on this as on the words actually said. In surveys, survey takers may form an opinion about the hypothesis the instrument is investigating, and conform his answers to what he thinks is the right answer. It’s a bit subtle, but it’s easy to see in the communicative form of twitter polls. When you see a poll, and your true answer doesn’t make your point as well as another answer, do you answer truthfully or try to make a point? What do you think all the other survey respondents are doing? This is not cheating except in a very narrow sense. This is ordinary language use—making guesses about the reasons underlying a communication, and communicating back with that information in mind. It’s the survey form that’s artificial, offered as if it can preclude this kind of communication. And even when a survey manages to hide its true hypothesis, survey takers still may be guessing other hypotheses, and responding based on factors other than their own innocent truths.

Last, we come to the most obvious problem: comprehension. Unfortunately, even when avoiding complex terms, comprehension may often be tricky:

Consider another survey question, from the General Social Survey:

Taken all together, how would you say things are these days—would you say that you are very happy, pretty happy, or not too happy?

No one can accuse this of using big words (although my friend points out that “not too happy,” despite its colloquial meaning, is exactly as happy as you’d want to be). But in its simplicity, it exemplifies the complexity of the phenomenon of comprehension.

Consider what comprehension means here. It presumes first that the authors of the survey have encoded a meaning in the words, a meaning that the words will convey to the survey takers. More importantly, it presumes that this corresponds to “the real meaning” of the words—a meaning shared by the audience of the survey’s claims. What would the “real meaning” be in this very simple case? How are things these days? Are you very happy, pretty happy, or not too happy? What informs your choice? Would you have answered the same a month or a year ago? Fifteen minutes ago? How does your “pretty happy” compare with another person’s “pretty happy”? Happy compared to what? How would you predict that your family members would answer? Do they put a good face on things, or do they enjoy complaining? Would their answers correspond to how happy you think they really are? What about people from cultures you’re not familiar with? This is a three-point scale. Would you be able to notice a quarter of a point difference? What would that mean?

Now consider this graph:

[Graph: “happiness” responses to the GSS question above, plotted over time and broken down by respondents’ level of education.]

Imagine the hypothetical headlines one could base upon this survey such as “The Class Divide in Happiness is Growing!”, “Inequality in Happiness on the Rise!”, or “Higher Education Leads to Happier Lives!”

Now think about what it would take to uncover the fact that such a headline was based on a bullshit survey. One would need to read the article. Then, one would need to find the study the article is based on and read not only the abstract but the whole paper. Finally, one would have to take a look at the instrument they used to measure happiness which may be found in another paper. Nobody besides weird internet autists does that, which is why surveys are so popular.

Surveys are the most popular tool for creating non-indexical knowledge. In short, indexical knowledge is knowledge that is dependent on or embedded in a particular context. The opposite is objective, universal, or global knowledge.

Science Banana again:

Science has celebrated many victories in wresting objective knowledge from indexical observations, but these victories are rare, and are the product of concerted group activity backstage, rather than a natural consequence of scientific ritual. Overconfidence in the global word game, especially in social sciences, threatens the production and appreciation of the genuine kind of indexical knowledge that humans are geniuses at producing and using. The framing of “cognitive bias” privileges the global knowledge game at the expense of a richly situated rationality (what Gigerenzer and others call “ecological rationality”).

The preference for global over indexical knowledge in the social sciences can be understood as part of the famous “physics envy.” The creation of universal knowledge that holds under basically all conditions is something physicists have long been celebrated and admired for. Unfortunately, psychology is not well situated to create this kind of knowledge.

By trying to remove indexicality to create global knowledge, researchers created the flawed “bias” field we discussed above: Choices that are “rational” when considered in context become “irrational” when researchers try to remove that context.

Banana demonstrates this problem using the “behaviorally-specific questions” used in the Centers for Disease Control’s National Intimate Partner and Sexual Violence Survey (2012).

This certainly seems like an improvement from the “happiness” survey. Here, at least, we are working with real behavior, not some vague subjective unmeasurable state. Based on “behaviorally-specific” surveys, the CDC concludes:

Physical violence by an intimate partner was experienced by almost a third of women (32.4%) and more than a quarter of men (28.3%) in their lifetime. State estimates ranged from 25.4% to 42.1% (all states) for women and 17.8% to 36.1% (all states) for men.

But we should by now have learned to be wary of non-indexical knowledge in the social sciences. When Amy Lehrner and Nicole Allen investigated the validity of the behaviorally focused Conflict Tactics Scales (CTS) in “Construct validity of the Conflict Tactics Scales: A mixed-method investigation of women’s intimate partner violence,” the result was quite surprising:

Descriptions of play fighting included behaviors coded as severe assault by the CTS, such as kicking, punching, or resulting in injury to self or partner. An exemplar of this pattern is participant 106 (level 4), who endorsed significant frequency and severity of violence on the CTS, including items such as throwing something, twisting arm or hair, pushing, shoving, grabbing, slapping and two instances of having had an injury for herself and her boyfriend. When asked about times when arguments or fights have ever “gotten physical,” she replied:

P: Honestly, not when we’re fighting have we ever been physical. I mean when we’re joking around and you know, things like that?  We like to just play around and like pretend to beat each other up. But it’s not anything that would really inflict pain on each other. And if it is, it’s very minor and it’s accidental. But…

Q:  Not in the context of like an argument or a conflict?

P:  No. I honestly cannot remember a single time where we were fighting and it had gotten physical. Usually if anything we’re on the phone or we are 5 feet away from each other.

When questioned more directly about her CTS endorsements she responded, “I think I might have been thinking in context of playing sort of thing. . . it probably, I thought it was in the context of just at all?. . . And when we are like at all, you know, physical like that it’s just like all in fun.” When asked for examples she reported:

A:  Well I’ll either just like go up to him – “let’s fight” or something – and then I’ll just like you know lightly punch him or something like that. And then you know he would like pick me up and throw me on the couch, and like start tickling me. Usually that’s what ends up stopping it is that he’ll just keep tickling me until I can’t breathe anymore. But I mean that’s pretty much it.

Q:  And so you’ll get to that point where you’re trying to really overpower him and see if you can do it, and you never can?

A:  Yeah. Usually the only thing I can do is like pinch him or something?  I pull his hair and he’ll stop tickling me. But that’s about it.

Q:  So when are the times where you’d be in the mood to say, “hey let’s fight?”

A:  I don’t know, a lot of times when I’m at home, I get bored and there’s really not a lot to do in my town. So watching TV and movies gets kind of old. So we’re just you know, something to do? Like kind of like a brother/sister type of thing, where you just have nothing else to do so let’s beat up on each other kind of…

It turns out that close to 20% of the participants categorized as violent by the CTS described this type of non-serious “fun-violence.” Needless to say, generally nobody bothers to check whether the answers of survey participants actually mean what researchers think they mean.

Moreover, there is a telephone game in which highly indexical findings become increasingly abstracted each time they are cited, eventually taking the form of global knowledge. Banana demonstrates this in her examination of the widely publicized fact that “Victims leave and return to a relationship an average of seven times before they leave for good.” At the end of her long and impressive investigation, she discovers the likely origin of this fact:

I tracked down what seems to be one of the only existing copies of Hilberman & Munson, and here is The Fact in its context, in a section late in the short paper called “Treatment Considerations”:

The work with these sixty women has been without benefit of protection or shelter. It is slow, frustrating, and intensely anxiety-provoking, with the knowledge that a client’s self-assertion could prove fatal. It is helpful to keep in focus the following two realities. First, the client’s progress may seem small or inconsequential, for example, when a wife asserts her right to attend church. For the middle-aged woman who has been battered and isolated for twenty years however, this may be a monumental step forward. Second, it may take an average of four or five separations before the last hope for changes within the marriage ends and a decisive separation occurs. The clinician can expect a series of stormy trial separations and should not view this as a treatment “failure.” The battered woman is well aware of what she should or ought to do, and if she feels she has disappointed or failed the clinician, she will be reluctant to return to treatment in the future. The clinician’s insight and understanding will make it possible for her to resume the psychotherapeutic work when she is ready to do so.

(Emphasis mine.)

Again, note that this is not presented as statistical information about a population, but in a very qualified way. It “may take” an average of four or five separations – or it may not. The paper gives no indication that this version of The Fact is based on data collected from the study, rather than the author’s impression.

In addition to the odd temporality of the global knowledge game, it presents us with an odd sense of identity. Hilberman & Munson are cited for The Fact in general, without regard to their very unusual study population. They were studying a sort of Silent Hill town, a rural and very poor place. In a twelve month period, they say, “half of all women referred by the medical staff of a rural health clinic for psychiatric evaluation were found to be victims of marital violence.” That is quite a study population! But in the world of the global knowledge game, “people” are “people.” A statistic based on an impression of sixty women living a very particular local nightmare in the 1970s is laundered into a fact about women and domestic violence in general.

The Fact will obviously be cited until the end of time regardless of its validity, mainly because it looks like the type of knowledge we have come to expect.

A more formal version of the indexicality problem was articulated by Tal Yarkoni in his paper The Generalizability Crisis. He shows that psychologists fail to “operationalize verbal hypotheses in a way that respects researchers’ actual generalization intentions.” In short, psychologists generalize in ways, and to an extent, that they are, statistically speaking, not entitled to.

After scientists gather data with the help of a survey, the next step typically involves using statistical methods to extract knowledge from this data (removing any remaining shred of indexicality). Here we run into more problems. First, the proficiency in statistics among psychologists is questionable at best, which is part of the reason for the aforementioned Generalizability Crisis. However, as Adam Mastroianni notes, pointing this out is a far too common critique:

In fact, one of psychologists’ favorite pastimes is to complain about how other psychologists don’t know how to do statistics. “It’s too easy for naughty researchers to juice their stats, so we should lower the threshold of statistical significance from p = .05 to p = .005!” “Actually, we use these new statistics instead!” “In fact, we should ban p-values entirely!”

The core issue with statistical analysis in research, even when applied correctly, is that it’s not a great way to discover valuable knowledge. It is, however, a great way to write a lot of papers. Statistical analysis, like a survey, always produces a result.

Adam notes that the use of statistics is a recent phenomenon:

Modern stats only started popping off in the late 1800s and early 1900s, which is when we got familiar and now-indispensable concepts like correlation, standard deviation, and regression.

The p-value itself didn’t get popular until the 1920s, thanks in part to a paper published pseudonymously by the head brewer at Guinness (yes, the beer company). So the whole ritual of “run study, apply statistical test, report significance” is only about 100 years old, and the people who invented it were probably drunk.

He outlines three primary practices employed by scientists prior to the advent of statistics: (1) They focused on identifying large-scale effects, as their methods were not refined enough to detect subtle ones; (2) They concentrated on exploring phenomena that were visibly apparent and undeniably occurring, yet lacked clear explanations; (3) They formulated bold, testable hypotheses about the functioning of the world.

The introduction of modern statistical techniques changed their approach because it allowed them to study tiny effects that may not actually exist. Researchers can then search for all the scenarios where an effect appears or disappears, creating “a perpetual motion machine powered entirely by an inexhaustible supply of p-values.”
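The perpetual motion machine is easy to build at home. In the sketch below (a hypothetical setup of my own: twenty “contexts” tested on pure noise with a crude two-sample t-statistic), a publishable “effect” pops out of nothing roughly once per batch:

```python
import random
from math import sqrt
from statistics import mean, stdev

def t_stat(n: int = 30) -> float:
    """Compare two groups drawn from the SAME distribution: the true effect is zero."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = sqrt(stdev(a) ** 2 / n + stdev(b) ** 2 / n)
    return (mean(a) - mean(b)) / se

# Probe twenty "contexts": subgroups, framings, alternative outcome measures...
hits = sum(abs(t_stat()) > 2 for _ in range(20))
print(hits)  # about 1 in 20 clears |t| > 2 (roughly p < .05) by chance alone
```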

Adam mentions the famous flag “priming” experiment as an example:

In 2008, researchers showed a random sample of participants a tiny picture of an American flag and found that people “primed” with the flag were more likely to vote for John McCain for president two weeks later, and had less positive attitudes toward Barack Obama’s job performance eight months after that. In 2020, the same researchers published another paper where they can’t find any effect of flag priming anymore, and wonder whether the effect has declined over time or never existed at all.

The reaction was a call for better statistics:

This “flag priming” study became one of the poster children for the replication crisis, and the overwhelming reaction is now, “People should use statistics correctly!” But that misses the point: studies like this only exist because of statistics. In this case—

The effect is not ginormous. In fact, it’s so tiny—if it exists at all—that you can only see it with the aid of stats.

It’s not self-evident. It’s not like people sometimes see an American flag and go “I must go vote for a Republican right away!” and hurry off to their nearest precinct. You would only go looking for a wobbly, artificial phenomenon like this when you have statistical tools for finding it.

It doesn’t have a gutsy, falsifiable hypothesis behind it. If you run another flag priming study and fail to find an effect, you don’t have to eat your hat. You can just say, “Well of course there’s a lot of noise and it’s hard to find the signal. Maybe it’s the wrong context, or the wrong group of people, or maybe flags mean something different now.”

Adam concludes correctly that the real lesson should be to stop doing these kinds of studies. They produce nothing of value because their intent is not to explain anything but to publish a surprising “result” that gathers attention.

This is not a new critique. Paul E. Meehl described a similar problem in 1978:

Theories in “soft” areas of psychology lack the cumulative character of scientific knowledge. They tend neither to be refuted nor corroborated, but instead merely fade away as people lose interest. Even though intrinsic subject matter difficulties (20 listed) contribute to this, the excessive reliance on significance testing is partly responsible, being a poor way of doing science.

More formally, psychology suffers from an epistemological flaw. David Deutsch addresses this fundamental issue with “explanationless” sciences in The Beginning of Infinity:

In explanationless science, one may acknowledge that actual happiness and the proxy one is measuring are not necessarily equal. But one nevertheless calls the proxy ‘happiness’ and moves on. One chooses a large number of people, ostensibly at random (though in real life one is restricted to small minorities such as university students, in a particular country, seeking additional income), and one excludes those who have detectable extrinsic reasons for happiness or unhappiness (such as recent lottery wins or bereavement). So one’s subjects are just ‘typical people’ – though in fact one cannot tell whether they are statistically representative without an explanatory theory. Next, one defines the ‘heritability’ of a trait as its degree of statistical correlation with how genetically related the people are. Again, that is a non-explanatory definition.