- The Basic and Applied Social Psychology (BASP) 2014 Editorial emphasized that the null hypothesis significance testing procedure (NHSTP) is invalid, and thus authors would not be required to perform it (Trafimow, 2014). However, to allow authors a grace period, the Editorial stopped short of actually banning the NHSTP. The purpose of the present Editorial is to announce that the grace period is over. From now on, BASP is banning the NHSTP.
With the banning of the NHSTP from BASP, what are the implications for authors? The following are anticipated questions and their corresponding answers.
Question 1. Will manuscripts with p-values be desk rejected automatically?
Answer to Question 1. No. If manuscripts pass the preliminary inspection, they will be sent out for review. But prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about “significant” differences or lack thereof, and so on).
Question 2. What about other types of inferential statistics such as confidence intervals or Bayesian methods?
Answer to Question 2. Confidence intervals suffer from an inverse inference problem that is not very different from that suffered by the NHSTP. In the NHSTP, the problem is in traversing the distance from the probability of the finding, given the null hypothesis, to the probability of the null hypothesis, given the finding. Regarding confidence intervals, the problem is that, for example, a 95% confidence interval does not indicate that the parameter of interest has a 95% probability of being within the interval. Rather, it means merely that if an infinite number of samples were taken and confidence intervals computed, 95% of the confidence intervals would capture the population parameter. Analogous to how the NHSTP fails to provide the probability of the null hypothesis, which is needed to provide a strong case for rejecting it, confidence intervals do not provide a strong case for concluding that the population parameter of interest is likely to be within the stated interval. Therefore, confidence intervals also are banned from BASP.

Bayesian procedures are more interesting. The usual problem with Bayesian procedures is that they depend on some sort of Laplacian assumption to generate numbers where none exist. The Laplacian assumption is that when in a state of ignorance, the researcher should assign an equal probability to each possibility. The problems are well documented (Chihara, 1994; Fisher, 1973; Glymour, 1980; Popper, 1983; Suppes, 1994; Trafimow, 2003, 2005, 2006). However, there have been Bayesian proposals that at least somewhat circumvent the Laplacian assumption, and there might even be cases where there are strong grounds for assuming that the numbers really are there (see Fisher, 1973, for an example). Consequently, with respect to Bayesian procedures, we reserve the right to make case-by-case judgments, and thus Bayesian procedures are neither required nor banned from BASP.
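To make the coverage point concrete, here is a rough simulation sketch (my own illustration, not part of the editorial), assuming normally distributed data with a known standard deviation: across many repeated samples, roughly 95% of the computed intervals contain the true mean, yet no single interval can be assigned a 95% probability of containing it.

```python
# Sketch of the long-run coverage property of a 95% confidence interval.
# Assumptions: normal data, known sigma, z-based interval (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma, n, reps = 10.0, 2.0, 30, 10_000
z = 1.96  # normal critical value for a 95% interval

covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, sigma, n)
    half_width = z * sigma / np.sqrt(n)
    lo, hi = sample.mean() - half_width, sample.mean() + half_width
    covered += (lo <= true_mean <= hi)

print(f"Empirical coverage: {covered / reps:.3f}")  # ~0.95 in the long run
```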
Question 3. Are any inferential statistical procedures required?
Answer to Question 3. No, because the state of the art remains uncertain. However, BASP will require strong descriptive statistics, including effect sizes. We also encourage the presentation of frequency or distributional data when this is feasible. Finally, we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem. However, we will stop short of requiring particular sample sizes, because it is possible to imagine circumstances where more typical sample sizes might be justifiable.

We conclude with one last thought. Some might view the NHSTP ban as indicating that it will be easier to publish in BASP, or that less rigorous manuscripts will be acceptable. This is not so. On the contrary, we believe that the p < .05 bar is too easy to pass and sometimes serves as an excuse for lower quality research. We hope and anticipate that banning the NHSTP will have the effect of increasing the quality of submitted manuscripts by liberating authors from the stultified structure of NHSTP thinking, thereby eliminating an important obstacle to creative thinking. The NHSTP has dominated psychology for decades; we hope that by instituting the first NHSTP ban, we demonstrate that psychology does not need the crutch of the NHSTP, and that other journals follow suit.
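As a hedged illustration of the kind of reporting the editorial asks for (the group names and numbers below are made up), here is a minimal sketch that reports group descriptives and a standardized effect size, Cohen's d, with no significance test:

```python
# Descriptive reporting sketch: group summaries plus Cohen's d.
# The "treatment"/"control" data are simulated, purely for illustration.
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
treatment = rng.normal(0.5, 1.0, 100)  # hypothetical scores
control = rng.normal(0.0, 1.0, 100)

for name, g in [("treatment", treatment), ("control", control)]:
    print(f"{name}: n={len(g)}, mean={g.mean():.2f}, sd={g.std(ddof=1):.2f}")
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```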
The answer to Question 3 is the best possible result, imo. Describe the data. What could be simpler? And for god's sake, please put the highlights of the description in the abstract. We're all educated enough to decide whether your data are convincing without being told that they are just because they passed a basic smell test.
Just as Dendrophobe said, I only know NHST. thundara suggested increasing n. What about when someone sees a difference after repeating the test 3 times but doesn't get a significant result with any test, then decides to repeat the experiment two more times and, surprise surprise, the p-value is under 0.05? Is that considered p-value hacking? Edit: I just checked the reddit frontpage and found this: a randomized double-blind placebo-controlled study on celiac sensitivity. The p-value was p = 0.034 and the money shot is this figure. Now the reddit hive-mind is turning against the results. Funny that a community usually considered mainly skeptical also believes what it wants to believe.
It's annoying how anxious the reddit community can be to know answers. That study you linked had 61 participants, which may be enough to drive further inquiry, but hardly enough to draw conclusions about an entire population or sub-groups within it. Science is generally patient about finding answers, but redditors are quick to jump on any preliminary results before they have had a chance to be replicated by other groups and in other populations.
This would fall under the umbrella of multiple hypothesis testing. In your example, the person runs the experiment 5 times and sees a p-value below 0.05 for only two of them... purely statistically, the chance of that happening if the null hypothesis is true is the chance of two or more tests coming out significant, which is: nCr(5, 2) * 0.05^2 * 0.95^3 + nCr(5, 3) * 0.05^3 * 0.95^2 + nCr(5, 4) * 0.05^4 * 0.95^1 + nCr(5, 5) * 0.05^5 * 0.95^0 ~= 0.02, which would be significant (p < 0.05). However, if they did not define the number of experiments to run in advance, then the result is meaningless and yeah, it would be p-value hacking. Unfortunately, this crops up in all areas of research because there are a million reasons an experiment could have failed, and it's easier to bury the runs where things failed and submit the results where p < 0.05 (see also: publication bias). This is also a huge issue in medical research, where clinical trials are frequently not registered beforehand, or results are not submitted after being registered. So you end up with a bias toward studies showing a drug works, resulting in patients being exposed to undue risk (see also: Bad Pharma). Edit: Fixed bad math
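For anyone who wants to check that arithmetic, a short scipy snippet reproduces the ~0.02 figure as a binomial tail probability (under the same assumptions as above: five independent tests, each with a 5% false-positive rate when the null is true):

```python
# P(at least 2 of 5 independent tests come out "significant" under the null).
from scipy.stats import binom

p_two_or_more = binom.sf(1, n=5, p=0.05)  # P(X >= 2) = 1 - P(X <= 1)
print(round(p_two_or_more, 4))  # ~0.0226, matching the ~0.02 figure above
```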
I've sort-of got a psychology degree (part of a double-major), and null hypothesis significance testing is all I know. I've known there were problems with it for a while, but I have no idea what the alternatives are (sure, Bayesian inference, but is that really a solution? I don't have the background to know, but that journal doesn't think so). What are the alternatives?
Null hypotheses aren't inherently bad if n is large and the researchers apply the correct test. Part of the trouble comes in when they:

- Don't define their tests in advance ("We looked at the data and then decided to apply a one-tailed t-test between these groups")
- Test multiple hypotheses (Is subgroup A significantly different from subgroup B? What about A and C? B and C?) (see the sketch below)
- Use the wrong type of test (e.g., a t-test on a population that isn't normally distributed)

Before mk's post, I'd have said (1) fix the issues above, (2) report the confidence interval instead, or (3) look at the data for "obvious" changes. Apparently (2) is problematic, too, so others may have a better answer >_>
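To put a rough number on the multiple-hypotheses pitfall above, here is a simulation sketch (the group labels and sizes are made up): when four groups are all drawn from the same distribution and every pair is compared with a t-test, the chance of at least one p < .05 somewhere in the family is far above the nominal 5%.

```python
# Uncorrected pairwise t-tests across 4 identical groups: how often does
# at least one comparison come out "significant" purely by chance?
from itertools import combinations
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
reps, n_per_group = 2000, 30
false_positive_runs = 0

for _ in range(reps):
    data = {g: rng.normal(0, 1, n_per_group) for g in "ABCD"}  # no true differences
    pvals = [ttest_ind(data[a], data[b]).pvalue for a, b in combinations("ABCD", 2)]
    false_positive_runs += any(p < 0.05 for p in pvals)

# With 6 pairwise tests, expect well above 5% (roughly 20% or more).
print(f"Family-wise false-positive rate: {false_positive_runs / reps:.2f}")
```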
First I'd heard of the confidence interval criticism, but it makes sense. I'm a fan of the squint-test, where if you have to squint to see a difference, it probably doesn't exist (assuming small n). I have the suspicion that this is just the swing of the pendulum, and the journal will return to some numerical metric because scientists like parameterized results.
b_b and I lament it all the time. If you think about it, it's not a very scientific way of making up your mind about something. What is the difference between p = 0.05 and p = 0.07? What was the experimental design? How are your data distributed, and what is your confidence in that? What size and type of effect are you talking about? Simply passing or failing a Student's t-test or ANOVA doesn't have nearly as much meaning as the weight typically ascribed to it. IMO it's all well and good to provide p-values, but not to talk about "statistical significance" unless you have clearly defined the parameters of your definition and made a strong case for why they can be applied to your experiment (perhaps it was used as an exclusion criterion during the course of experimentation). It's not meant to be a final analysis.
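As a small illustration of why the p = 0.05 vs. p = 0.07 distinction is so fragile (my own sketch, with an assumed medium effect and a modest sample size): re-running the exact same two-group design on fresh data makes the p-value swing widely from replication to replication.

```python
# Repeat an identical two-group experiment many times and watch the
# p-value bounce around, even though the true effect never changes.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
n, true_effect, reps = 30, 0.5, 20  # modest n per group, medium effect

pvals = sorted(
    ttest_ind(rng.normal(true_effect, 1, n), rng.normal(0, 1, n)).pvalue
    for _ in range(reps)
)
# Typically spans from well below .01 to well above .10 across replications.
print([round(p, 3) for p in pvals])
```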