HN Gopher Feed (2017-09-07) - page 1 of 10
Redefine statistical significance
177 points by arstin
https://www.nature.com/articles/s41562-017-0189-z.epdf
iovrthoughtthis - 5 hours ago
When can we have scientific papers formatted for the web? Reading
PDFs with many tiny columns spread across each page puts me off
reading so much.
folli - 5 hours ago
Most journals have an HTML and a PDF version (as does Nature):
https://www.nature.com/articles/s41562-017-0189-z
I prefer the PDF version for printouts.
mjpuser - 5 hours ago
This is an interesting point considering the World Wide Web was
born from the need to share scientific info.
jimmar - 5 hours ago
Not a huge fan of this idea. For example, people who analyze
Twitter data can get very small p-values because they analyze
millions of tweets even though the effects they find are very
small. See
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1336700
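A quick simulation sketch of the point (made-up numbers, assuming numpy
and scipy are available): the true effect stays trivially small, but the
p-value collapses once n gets large enough.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  true_shift = 0.01  # a trivially small effect: 0.01 standard deviations
  for n in (1_000, 100_000, 1_000_000):
      a = rng.normal(0.0, 1.0, n)
      b = rng.normal(true_shift, 1.0, n)
      _, p = stats.ttest_ind(a, b)
      print(n, p)  # p shrinks toward 0 as n grows; the effect never changes

The effect size is the same in every run; only the sample size differs.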
mhermher - 4 hours ago
This is why I think better multiple-test-correction methods are
more important and would do more to produce a desirable outcome
than just lowering alpha.
amelius - 4 hours ago
I don't see the problem as long as you clearly separate
significance and effect-size.
seanwilson - 1 hours ago
This bothers me a lot about media reporting. The headline will be
something like "X is good/bad for you" with a tiny effect size,
but the way it's reported makes you assume it's a large effect
size. Usually the effect size won't be discussed in any depth,
or at all; they just want to sum it up in black and white. If
the effect size is tiny, it's probably just noise that would
disappear in a higher-quality future study.
moultano - 4 hours ago
I'd rather hear about small things that are true than large
things that are false.
cropsieboss - 4 hours ago
The thing is that these small things might just be noise from
some confounding factors.
http://jaoa.org/article.aspx?articleid=2517494
For example, here the sample size is huge: the USA population shows
a significantly increased risk, while the EU population does not.
Mixing the two together would result in a smaller but still
significant increased risk. Given the size, it's quite clear
that the USA population has many other confounding factors that
cannot be eliminated by mathematics alone (there is no control).
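To make the pooling arithmetic concrete, a rough sketch with made-up
counts (not the paper's numbers): one stratum with an elevated risk and
one without still pool into an "elevated" overall estimate.

  # Hypothetical counts only.
  us_exp_cases, us_exp_n = 1_300, 100_000  # 1.3% risk in exposed
  us_ctl_cases, us_ctl_n = 1_000, 100_000  # 1.0% in controls -> RR 1.3
  eu_exp_cases, eu_exp_n = 1_000, 100_000  # 1.0%
  eu_ctl_cases, eu_ctl_n = 1_000, 100_000  # 1.0%              -> RR 1.0

  pooled_rr = ((us_exp_cases + eu_exp_cases) / (us_exp_n + eu_exp_n)) \
            / ((us_ctl_cases + eu_ctl_cases) / (us_ctl_n + eu_ctl_n))
  print(pooled_rr)  # ~1.15: smaller than 1.3, but still above 1

With counts this large the pooled ratio would still test as
"significant", even though the elevation comes entirely from one stratum.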
SubiculumCode - 5 hours ago
This is fine, but without other simultaneous changes it will do harm
to young scientists. We need credit for publishing null results, or
to stop judging people by publication counts. It would lead to
larger, better-powered studies (good), but that tends to mean
acquiring multiple measures which can be inappropriately data-mined,
and it leads to large grants for established investigators but
fewer grants for new investigators.
jerrytsai - 4 hours ago
Definitely. The main problem is that in the current system no one
is being rewarded for good science, but for showing something
interesting, bolstered by a declaration of (statistical)
significance. The incentives are not aligned with societal
objectives. Good science requires a tension between hypothesis
generation and skepticism. Perhaps if we rewarded the _debunking_
of findings as much as we do the discovery of findings, things
would change.
imh - 6 hours ago
> For a wide range of common statistical tests, transitioning from a
P value threshold of α = 0.05 to α = 0.005 while maintaining 80%
power would require an increase in sample sizes of about 70%.
This proposal is a great pragmatic step forward. Like they say in the
paper, it doesn't solve all problems, but it would be an
improvement with reasonable cost and tremendous benefits.
> Such an increase means that fewer studies can be conducted using
current experimental designs and budgets. But Fig. 2 shows the
benefit: false positive rates would typically fall by factors greater
than two. Hence, considerable resources would be saved by not
performing future studies based on false premises.
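The ~70% figure is easy to sanity-check with a normal-approximation
power calculation (a sketch, not the paper's exact derivation; assumes
scipy).

  from scipy.stats import norm

  power = 0.80
  def n_factor(alpha):
      # required n per group scales with (z_{1-alpha/2} + z_{power})^2
      return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2

  print(n_factor(0.005) / n_factor(0.05))  # ~1.70, i.e. about 70% more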
stdbrouw - 5 hours ago
In some fields like psychology, power is more likely to already
be 10% or 20% for the majority of studies, and in fact P-hacking
and low standards for evidence would be far less harmful if power
were higher, because low power leads to inflated effect size
estimates. Additionally, power calculations are always just a
guess and easy to fudge, so it's pretty much a given that current
statistical power would not be maintained with more stringent
critical values. See
http://andrewgelman.com/2014/11/17/power-06-looks-like-get-u...
So this proposal is really the opposite of pragmatic. Pragmatic
would be requiring effect size estimates and confidence intervals in
all published papers. It is surprising how many papers will talk
about highly significant effects without actually discussing how
large the estimated effect is thought to be, which gives authors a
lot of leeway to exaggerate the importance of their findings.
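The inflation effect is easy to see in a small simulation sketch
(made-up numbers, assuming numpy and scipy): with low power, only the
studies that happen to overshoot the true effect clear the threshold,
so the "publishable" estimates are biased upward.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(1)
  true_effect, n = 0.2, 25  # small true effect + small sample = low power
  kept = []
  for _ in range(5_000):
      a = rng.normal(0.0, 1.0, n)
      b = rng.normal(true_effect, 1.0, n)
      t, p = stats.ttest_ind(b, a)
      if p < 0.05 and t > 0:  # keep only the "significant" studies
          kept.append(b.mean() - a.mean())

  print(np.mean(kept))  # well above the true 0.2: the survivors exaggerate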
reilly3000 - 4 hours ago
The cost of increasing sample size is significant; this is a
trade-off that allows smaller projects to still conduct
valuable research.
Klockan - 4 hours ago
Studies with low significance are no better than anecdotes.
csours - 1 hours ago
https://xkcd.com/552/
Alt-text: "Correlation doesn't imply causation, but it does waggle
its eyebrows suggestively and gesture furtively while mouthing 'look
over there'."
I think you could replace correlation with "small studies" or
"anecdotes". These are all things that suggest there may be
some effect, and there may be merit in further study.
stdbrouw - 3 hours ago
This is not true at all. Small studies are simply less
accurate. Would you prefer to conduct further research on a
cure for cancer that's been shown to lead to recovery in
0-50% of cases (the confidence interval includes 0%, so not
significant) or in .5-1% of cases (significant)?
maxerickson - 3 hours ago
So ultimately the issue is that push button statistics don't
work?
autokad - 5 hours ago
Almost all problems have to do with data and the selection of data;
changing the p-value threshold wouldn't help.
coverband - 5 hours ago
I upvoted you in principle, but working with a tighter
threshold would also make choosing self-serving data samples
more obvious, if not more difficult.
tw1010 - 6 hours ago
This still doesn't feel satisfying. Part of me is still not really
happy with the philosophical foundations of statistics. Does anyone
know of any legitimately competing theory to statistics? Maybe
something that doesn't rest on the same types of mathematics that
Fisher and crew relied on when all this started? Pure mathematics
has come a long way in the last fifty years but little has seeped
into the applied world.
tnecniv - 4 hours ago
> Part of me is still not really happy with the philosophical
foundations of statistics.
You mean you aren't happy with... probability?
168 - 6 hours ago
How & why exactly are you unhappy with statistics?
drabiega - 5 hours ago
It's likely that the problems with statistics are inherently due
to the nature of knowledge, so alternative formulations are not
likely to help much.
[deleted]
robterrin - 3 hours ago
Uh oh. If you don't watch out you'll end up a Bayesian.
http://www.stat.columbia.edu/~gelman/research/unpublished/p_...
Is this along the lines of what you were hoping to find? Here's
more:
http://andrewgelman.com/2016/12/13/bayesian-statistics-whats...
ArchReaper - 5 hours ago
lol what the hell are you on about
scryder - 5 hours ago
Statistics arises from a set of axioms, assumed truths, which can
be used to prove all other things in the field. You can take a
look at the three axioms people use to justify statistics. If you
are willing to accept them, all else that relies on them (without
using new axioms) must be true:
https://en.wikipedia.org/wiki/Probability_axioms
This same
logic is used to justify development in pure mathematics: choose
a set of axioms which you accept as ground truths, and prove
things using them. As long as you are unable to prove your axioms
are contradictory, and the axiom choice seems acceptable, then
the work that you've done (with respect to them) is
philosophically justified.
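For reference, the three axioms on that page amount to the following,
for a probability measure P on a sample space Omega (LaTeX shorthand):

  P(E) \ge 0  for every event E                       (non-negativity)
  P(\Omega) = 1                                       (unit measure)
  P(\bigcup_i E_i) = \sum_i P(E_i)  for pairwise disjoint E_i
                                                      (countable additivity)

Conditional probability, Bayes' theorem and the usual limit theorems
are all built on top of these plus definitions.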
BeetleB - 4 hours ago
Please don't treat probability and statistics as one.
tw1010 - 5 hours ago
Statistics and probability are different things. I'm fine with
the foundations of probability.
mturmon - 3 hours ago
Just for reference, not everyone is OK with the foundations
of probability - what you might call "conventional
mathematical probability" as axiomatized by Kolmogorov. See
http://www2.idsia.ch/cms/isipta-ecsqaru/ for the most recent
in a series of workshops. One entry into this set of ideas is
what Peter Walley has called the "Bayesian dogma of
precision" - that every event has a precise probability, that
every outcome has a known cost. There are real-world
situations when these probabilities cannot be assessed, or
may not even exist; same for utilities. Some examples are in
betting and markets (asymmetric information, bounded
rationality), and in complex simulation environments having
so many parameters and encoded physics that the
interpretation of their probabilistic predictions is unclear.
rgejman - 5 hours ago
Animal experiments will get A LOT more expensive. Will there be a
concomitant increase in agency funding to offset the increased
costs?
siginfo - 49 minutes ago
They do briefly mention "the relative cost of type I versus type
II errors". Both errors (Type I - false positive, Type II - false
negative) have some cost associated with them. Money saved by using
a small sample size is wasted trying to replicate a false positive
result, and by groups around the world that rely on that false
result. Requiring larger sample sizes would mean fewer experiments
are carried out, but we will have more confidence in the positive
results produced. The outcome is fewer experiments wasted on
following up on false positives. None of this requires a change
in funding.
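A rough back-of-the-envelope of that trade-off (hypothetical numbers):
suppose 10% of tested hypotheses are actually true and power is held at
80%; then the share of "significant" findings that are false positives
drops a lot under the stricter threshold.

  prior_true = 0.10  # hypothetical share of tested hypotheses that are true
  power = 0.80

  for alpha in (0.05, 0.005):
      false_pos = alpha * (1 - prior_true)
      true_pos = power * prior_true
      print(alpha, false_pos / (false_pos + true_pos))
  # alpha=0.05  -> ~36% of "positives" are false
  # alpha=0.005 -> ~5% of "positives" are false

Which side of that trade-off costs more is exactly the type I vs type
II question the paper raises.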
gattilorenz - 6 hours ago
It can hardly hurt, but it is still a stopgap measure. It won't
solve publication bias; people will still change the hypothesis
or the test after measurements are done. I think the situation would
improve with better teaching of philosophy of science and
statistics (this would educate better reviewers too).
epistasis - 5 hours ago
It can hurt, in that it can slow the spread of information. If
you perform 70% fewer different types of experiments because you
have to hit p=0.005 instead of p=0.05, then you explore in fewer
directions. This is a classic tradeoff between exploration and
exploitation in active learning. If your view of the world is that
there are only a very few hypotheses worth exploring, and you
have a good lay of the scientific land, then requiring a higher bar
of proof is probably good. If it's a new field that's extremely
complex and where very little is known of the governing
principles, then requiring very high stats could severely slow
progress and waste lots of research dollars. I completely agree
that rather than setting arbitrary barriers for significance, it
would seem much better to let people actually understand what was
found, at whatever significance it was. Even setting up the null
model to get a p-value requires tons of assumptions. The better
test is reproducibility and predictive models that can be
validated or invalidated. That's where the science is, and not in
the p.
nonbel - 5 hours ago
> "It can hurt, in that it can slow the spread of information."
I am not at all in favor of this proposal, but one thing it may do
is stem the tidal wave of misinformation.
btilly - 3 hours ago
Yours is a theoretical concern. The very practical concern is
that entire areas of research have been based on studies
replicated and backed up entirely through p-hacking and
selectively publishing only papers with positive results. This
is a proven issue today. See
https://en.wikipedia.org/wiki/Replication_crisis for more. It
may be that there is a pendulum that needs to swing a few times
to get to a good tradeoff. But it is clear, now, which
direction it needs to swing.
epistasis - 1 hours ago
I disagree 100%, having read that Wikipedia page and its
sources. It's something that affects a few fields, not all of
science. And the problem has been completely 100% overblown. If
the problem is that things aren't replicating, changing the
p-value cutoff for significance isn't going to fix everything. It
can just as easily be a bad null model that's the problem, in
which case you can't trust any p-value. The MRI scan problem was
closely related to that. It's a field-specific and
null-model-specific thing. Broadly changing the p-value cutoff
for everybody isn't going to fix this issue.
[deleted]
BeetleB - 4 hours ago
>I think the situation would improve with better teaching of
philosophy of science and statistics (this would educate better
reviewers too).
This is necessary, but not sufficient. What's
needed is a way to know for sure that the hypothesis was not
changed after data collection. I think predeclaring the
hypothesis is the way to go.
gattilorenz - 2 hours ago
Yeah, but in the end you can still fabricate the data, remove
"outliers", and so on. Plus it's almost impossible to imagine a
world where, before any experiment in any field, you predeclare it.
Not that education can fix all of this (you can't prevent evil),
but if reviewers, journals, and conferences started to accept
negative results more, the incentive to lie would quickly decrease.
And people would probably start to "disprove" interesting theories,
instead of trying to "prove" niche results...
gboudrias - 2 hours ago
> It won't solve publication bias; people will still change the
hypothesis or the test after measurements are done.
As a Psychology student, this is a well-known initiative:
https://cos.io/prereg/ (Though I can't confirm or deny its
widespread usage.) Publication bias is harder, and
pre-registration won't solve it. But I think this is a separate
issue, and it's important to address each issue in its own right.
I've seen the proposal from TFA before, and with my very limited
knowledge I'm still fairly certain it will never come to pass in
Psychology, as nearly half of all modern studies have
reproducibility issues (!). It would be beneficial to our field,
in the way that a band-aid is beneficial to a gaping wound, but
it would require a lot more rigor than has evidently been
displayed so far (and more rigor is more work, and time is
limited). So... don't hold your breath.
(Sorry if my comment sounds pessimistic, I don't know much and I'm
open to being corrected. I still have enough critical thought to
be skeptical of some researchers' dedication to intellectual
rigor.)
agentofoblivion - 6 hours ago
Agree. It doesn't stop p-hacking, it just makes it harder.
Definitely treating the symptom instead of the disease. We
ultimately need institutional and cultural change, but it's not
obvious how to do that in the short term, so making it harder to
claim significance might be a step in the right direction. On the
other hand, you might expect that new discoveries by nature have
less data since data is likely more expensive for brand new
research, and by extension a lower likelihood of meeting these
sorts of stringent statistical requirements. Decreasing the
p-value threshold may be counter-productive if we dismiss
legitimate new discoveries due to essentially economic
constraints with data gathering, which would have the impact of
making it less likely to get funding to pursue the problem in
more depth, thereby slowing the advance of discoveries.
jjoonathan - 5 hours ago
> It doesn't stop p-hacking, it just makes it harder.
I could
see the reverse happening, where higher p-value standards lead
to normalization of deviance in the form of worse p-hacking.
felippee - 5 hours ago
I agree with that. My prediction is that legitimate, honest
research with p~0.05 will become unpublishable, while
p-hacked bullshit will prevail in glory.
PeterisP - 2 hours ago
If you do an honest preliminary study and get a p~0.05
result, then repeating the study with 70% larger sample
size should get you p~0.005; but if you've p-hacked, it
won't.
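A rough check of that intuition under a simplified z-test model (and
the charitable assumption that the originally observed effect is the
true one; assumes scipy):

  import numpy as np
  from scipy.stats import norm

  z_orig = norm.ppf(1 - 0.05 / 2)  # effect that lands exactly at p = 0.05
  z_repl = z_orig * np.sqrt(1.7)   # same effect, 70% more data
  crit = norm.ppf(1 - 0.005 / 2)   # the stricter threshold

  print(norm.sf(crit - z_repl))  # ~0.40: chance the replication hits p < 0.005
  # versus ~0.005 if the original result was pure noise

So a real effect clears the stricter bar far more often than a p-hacked
null would, even if not every single time.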
felippee - 2 hours ago
But sometimes getting a 70% larger sample will cost you so
much money that you'd need another grant, one you'd get
after publishing your p=0.05 result, which under the new
rules you wouldn't be able to publish...
Statistics is just another method, and almost every method
can be hacked or abused. Science is not about putting
checkmarks in tables but about reading and understanding
ideas and reproducing the results. Tweaking some numeric
values is not going to help the review process, which is
fundamentally broken these days.
s17n - 5 hours ago
It used to be possible to have a successful academic career
without publishing much - for example, one of my philosophy profs
in college (at a top 10 school) had never published anything after
his dissertation (he got his PhD in the early 60s). Of course, this
system only worked because academia was a bastion of the male WASP
elites that didn't have much pretense of serving the broader
public. But at least you didn't have the torrent of mediocre
papers that you see today.
rebuilder - 5 hours ago
So... what the academics did wasn't very helpful, but at least
they didn't do much of it?
marcosdumay - 4 hours ago
Those ones taught. Whether that was helpful or not depends on how
good the teaching was and how useful the knowledge is. Not
everybody in a teaching institution should be required to push
humanity's knowledge forward. But then you get the problem of
selecting those people without easily measured objective
indicators. That's why it worked reasonably well when those
were slightly low-paying jobs restricted to a caste.
adekok - 5 hours ago
> academia was a bastion of the male WASP elites that didn't
have much pretense of serving the broader public
Have things really changed? I suspect there are fewer males, but
any job that demands 20 years of full-time concerted effort is
likely to be dominated by men. Similarly, the western world is
overwhelmingly Caucasian, so again... the best predictor (now as
then) is that white male professors will be represented
disproportionately.
> at least you didn't have the torrent of mediocre papers that you
see today.
That certainly is true. The stats for the humanities and social
sciences are that 80% of the papers have zero citations, i.e. they
make no contribution to the greater body of human work. In Physics
(my background), most papers have 2-3 citations, and only a small
percentage have 1 or fewer. I would say that if a discipline is
dominated by uncited papers, then that discipline is probably a
waste of time, and the professors who work in it are a net drain
on society.
tnecniv - 4 hours ago
As a note, WASP refers to old money families with ties going
back to the colonial era, not just middle-class/wealthy white
dudes in America. Also, at least in STEM departments, you will
see plenty of non-white names.
> In Physics (my background), most papers have 2-3 citations, and
only a small percentage have 1 or fewer
Does that account for self-citations?
dragonwriter - 3 hours ago
> WASP refers to old money families with ties going back to
the colonial era
No, it refers to White Anglo-Saxon
Protestants, an ethno-religious group that cuts across socio-
economic class divides and includes plenty of people that are
neither old money nor descended from families that have been
in the US since the founding (and excludes some old-money,
from-the-founding families.)
PeterisP - 2 hours ago
It does exclude many large subpopulations from the many
immigration waves coming after the colonial era. All the
Irish, Italian, Polish, and Jewish people; the vast majority of
19th and early 20th century (very large!) immigrants and
their descendants aren't WASPs.
nzjrs - 4 hours ago
An equally plausible interpretation is that universities have
been transformed from teaching institutions into paper factories.
cameronraysmith - 40 minutes ago
I've had some luck showing people John Kruschke's "Bayesian
estimation supersedes the t test" (BEST) and this simple
demonstration: http://www.sumsar.net/best_online/
logicallee - 4 hours ago
Can someone explain why this three-page article has 72 "authors"?
That works out to about as much writing per author as this comment.
Klockan - 4 hours ago
Easy, in academia you can be the (co)author of a paper you've
never even read.
arstin - 4 hours ago
Given the kind of paper this is, I assume the names should be
understood as an endorsement. Sorta like signatures on a
petition.
JepZ - 4 hours ago
> The choice of any particular threshold is arbitrary [...]
Sounds scientific, doesn't it?
> [...] we judge to be reasonable.
And tomorrow someone else judges it differently? Maybe they should
not try to redefine significance but simply introduce something
called 'well-reproducible' or so.
md224 - 31 minutes ago
Just curious: would this have an effect on testing the efficacy of
new drugs? I'd hate to see a false negative result for a drug that
could actually help people...
leemailll - 6 hours ago
Changing the p-value from 0.05 to 0.005 won't stop p-hacking. And
this might also lead to more disgruntled graduate students, as they
will then have to increase sample sizes to satisfy the new test,
which inevitably increases the already painfully long time span for
projects to get published.
taeric - 6 hours ago
To be fair, this is a pragmatic, not a technical, solution.
Similarly, we limit the speeds we allow in residential areas not
because it prevents wreckless driving, but because it decreases
the actual risk of it. Similarly, the technical solution involves
technology that does not require drivers and has no risk of human
error anymore. The pragmatic solution is to just limit the
acceptable speeds.
nkrisc - 5 hours ago
A bit of humorous pedantry: We seek to prevent reckless
driving. "Wreckless" driving is what we're trying to promote.
taeric - 5 hours ago
Thanks for the pedantry. It is amusing to me, because the
"w" looks way more correct to me. Can't say why, though.
noxToken - 4 hours ago
Though reck is a word, it isn't commonly used, whereas
wreck is a pretty elementary word. Reck and wreck are
homophones. That's probably your answer.
kharms - 5 hours ago
I've not spent much time in academia, but it was my impression
that p-hacking is driven primarily by ignorance rather than
willful deceit. If that is the case, it would indeed limit
p-hacking, as there's usually a finite number of variables being
looked at.
Edit: or rather, it would limit false positives that show up as a
result of accidental p-value hacking, if not the process itself.
std_throwaway - 4 hours ago
Science would benefit from a little less noise.
eelkefolmer - 3 hours ago
It's time to ditch significance levels altogether and use Bayesian
inference or analysis.
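A minimal sketch of what that looks like for a simple proportion
(hypothetical data: 18 successes in 50 trials), reporting a posterior
instead of a p-value; assumes scipy.

  from scipy.stats import beta

  successes, trials = 18, 50
  prior_a, prior_b = 1, 1  # flat Beta(1, 1) prior
  post = beta(prior_a + successes, prior_b + trials - successes)

  print(post.mean())          # posterior mean, ~0.37
  print(post.interval(0.95))  # 95% credible interval
  print(1 - post.cdf(0.25))   # posterior probability the rate exceeds 25%

The prior is an explicit modelling choice, so the analyst's judgment
moves from a threshold to a prior rather than disappearing.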
analog31 - 17 minutes ago
I'm concerned that prior-hacking will become the new p-hacking.
aheilbut - 6 hours ago
No one in biology would be able to publish anything.
pfortuny - 6 hours ago
Imagine psychology... The end of a science.
Strilanc - 1 hours ago
Funny, I was thinking the opposite. Imagine psychology... done
properly. The beginning of a science. (I realize that
"beginning" is too harsh, but psychology does have very serious
problems with replicability. At the moment, it deserves its
tarnished reputation.)
aimager - 6 hours ago
it's too far away
kingkawn - 6 hours ago
Or it never was one
nonbel - 5 hours ago
There are so many problems with this:
1) The p-value filter leads to publication bias. You should publish
your results anyway, or the study wasn't designed/performed
correctly. The raw data and description of methods should be
valuable.
2) The null hypothesis is (almost) always false anyway. Everything
in bio/psych/etc has a real (not spurious) non-zero correlation
with everything else, so the significance level just determines
how much data needs to be collected to reject it.
3) Rejection or not of the null hypothesis does not indicate
whether the theory/explanation of interest is correct, so it is
inappropriate for deciding whether a result is interesting to
begin with. Usually the null hypothesis is very precise and the
"alternative statistical hypothesis" that maps to the research
hypothesis is very vague, so many alternative research hypotheses
may explain the results.
marcosdumay - 4 hours ago
I would add that 4) Studies with a large p-value but that are
not contradicted by others are much more valuable than studies
that have a small p-value but all contradict each other.