## Monday, March 28, 2016

### A review of "Derailment", Diederik Stapel's autobiography

In 2011, social psychologist Diederik Stapel was accused of faking his data.  As allegations mounted, Stapel admitted to fraud and was fired from his university post.  The incident was widely covered in the news (I blogged about it here) and is one of the precipitating events for the current conversation in psychology about reproducibility.

Two years after the scandal broke, Stapel wrote an autobiography, "Ontsporing", or "Derailment".  An English translation of this autobiography is now freely available for anyone to read.

If Stapel is an admitted liar, why should we take his autobiography seriously?  The thing is, although less severe forms of scientific misconduct are more common, outright fraud is rare -- about 1.97% of scientists admit to committing fraud in surveys on the subject (Fanelli, 2009).  Rarer still is for someone to admit to fraud publicly and then go on to write about their experiences.  Stapel's autobiography therefore has value in that it provides a window into the psychology of someone who did not just tiptoe into the shallows of scientific misconduct, but who dove in headfirst.

In other words, I thought I might learn something about why Stapel decided to commit fraud by reading about his experiences in his own words.  So I did.

## Thursday, March 3, 2016

### Reproducibility is more than the RPP

A bit less than a year ago, in one of the biggest events in psychology in recent memory, a group of researchers published the Reproducibility Project: Psychology, a landmark effort to reproduce 100 findings in psychology journals.  The major result was that, depending on how you measure "reproducibility", between 39% and 47% of the original results were successfully reproduced.  Today, a new comment on the RPP has been published that makes the bold claim that the reproducibility estimates reported by the original team were drastically wrong, and are "statistically indistinguishable from 100%".

Although the commentary does raise some good points -- its authors note that some of the studies in the reproducibility project depart from the original studies in ways that are likely problematic -- I also think it's easy to lose sight of the broader context when critiquing a single project.  (For those interested, there also may be some problems with the basic claims of the critique.)

## Thursday, February 11, 2016

### Effect stability: (2) Simple mediation designs

In my last post, I described how a significant estimate need not be close to its population value, and how, using a clever method developed by Schönbrodt and Perugini (2013), one can estimate the sample size required to achieve stability for an estimator through simulation.

Schönbrodt and Perugini's method defines a point of stability (POS), a sample size beyond which one is reasonably confident that an estimate is within a specified range (labeled the corridor of stability, or COS) of its population value.  For more details on how the point of stability is estimated, you can read either my previous post or Schönbrodt and Perugini's paper.
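
To make the idea concrete, here is a minimal Python sketch for a single correlation.  It is a simplified reading of the procedure, not Schönbrodt and Perugini's own code: the function names, the defaults (`n_min=20`, `n_max=500`, 200 trajectories, 80% confidence), and the choice to define the corridor as the population value plus or minus `half_width` in raw $r$ units are all assumptions made for illustration.

```python
import math
import random


def running_correlations(rho, n_max, rng):
    """Simulate one sample of size n_max from a bivariate normal population
    with correlation rho; return {n: r computed on the first n cases}."""
    sx = sy = sxx = syy = sxy = 0.0
    rs = {}
    for n in range(1, n_max + 1):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = z1
        y = rho * z1 + math.sqrt(1 - rho ** 2) * z2
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
        if n >= 3:
            cov = sxy - sx * sy / n
            var_x = sxx - sx * sx / n
            var_y = syy - sy * sy / n
            rs[n] = cov / math.sqrt(var_x * var_y)
    return rs


def point_of_stability(rho, half_width, n_min=20, n_max=500,
                       n_traj=200, confidence=0.80, seed=1):
    """Hypothetical helper: estimate the POS as a percentile of the
    sample sizes at which trajectories last leave the corridor."""
    rng = random.Random(seed)
    break_points = []
    for _ in range(n_traj):
        rs = running_correlations(rho, n_max, rng)
        break_point = n_min  # trajectory never left the corridor
        # Scan backwards for the last n at which r is outside the corridor.
        for n in range(n_max, n_min - 1, -1):
            if abs(rs[n] - rho) > half_width:
                break_point = min(n + 1, n_max)
                break
        break_points.append(break_point)
    break_points.sort()
    index = math.ceil(confidence * n_traj) - 1
    return break_points[index]


if __name__ == "__main__":
    for w in (0.10, 0.15, 0.20):
        print(w, point_of_stability(rho=0.3, half_width=w))
```

Because the same seed produces the same simulated trajectories, widening the corridor can only lower (or leave unchanged) the estimated POS.  The exact numbers depend on the seed and on the assumed defaults, so they should be treated as illustrative rather than as replications of published values.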

By adapting Schönbrodt and Perugini's freely available source code, I found that, in two-group, three-group, and interaction designs, statistical stability generally requires sample sizes around 150-250.  In this post, I will apply this same method to simple mediation designs.

## Friday, February 5, 2016

### Effect stability: (1) Two-group, three-group, and interaction designs

When planning the sample size to estimate a population parameter, most psychology researchers choose the size that could allow an inference that the parameter is non-zero -- in other words, researchers attempt to maximize statistical significance.  However, both practical and scientific interest often centers on whether the estimate is good or stable -- that is, close to its population parameter.

These two criteria, significance and stability, are not the same.  Indeed, with a sample size of 20, a correlation of $r$=.58, which has a $p$-value of .007, could plausibly range between .18 and .81.
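
The plausible range quoted above can be reproduced with a short calculation.  The sketch below assumes the range is a 95% confidence interval based on the Fisher $z$ transformation, which is a standard approach but is not stated explicitly in the text; the function name `fisher_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via the Fisher z transformation."""
    z = math.atanh(r)                    # transform r to the z scale
    se = 1 / math.sqrt(n - 3)            # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


low, high = fisher_ci(0.58, 20)
print(f"{low:.2f} to {high:.2f}")  # 0.18 to 0.81
```

Under that assumption, the interval matches the figures in the text: a clearly significant correlation from 20 participants is compatible with population values ranging from small to very large.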