p-checker: The one-for-all p-value analyzer


Test-specific options

Usually, only one effect size should be extracted for each sample. Manually choose the focal effect size, or use this checkbox to include only the first ES of each study.

General options

If a t value is reported as 2.1, the actual value could also be 2.14999, which has been rounded down. If you want to be maximally generous, you can check this box, and all test statistics are automatically increased in this way (e.g., a reported 2.1 is treated as 2.14999).
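
As an illustration (hypothetical numbers, not the app's code), this is how much such rounding generosity can matter for a t value with 50 degrees of freedom:

2 * pt(2.10000, df = 50, lower.tail = FALSE)   # two-tailed p for the value as reported
2 * pt(2.14999, df = 50, lower.tail = FALSE)   # two-tailed p if the true value was rounded down to 2.1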


For information about TIVA, see replicationindex.wordpress.com.
For information about p-curve, see http://p-curve.com/.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143, 534–547. doi:10.1037/a0033242

Quick Start

Enter test statistics in the text field on the left. You can just enter the plain statistics (see the first four lines of the example), or you can add additional information (see the sample input after this list):
  • Everything before a colon is an identifier for the paper. For optimal parsing, it should have the format XXXX (YYYY) ZZ. Everything before the year in parentheses (i.e., XXXX) is an ID for the paper. Everything after the year is an ID for the study within that paper. Example: AB&C (2013) Study1. Test statistics with the same paper and study ID belong together (this is relevant for the R-Index).
  • By default, a critical two-tailed p value of .05 is assumed; for one-tailed tests, you can add ; one-tailed (or shorter: ; 1t) to set the critical p value to .10.
  • You can also directly define the critical p: ; crit = .10, for example.
  • You can check whether a p value has been reported correctly by providing the reported p value, for example p < .05 or p = .037.
  • In general, all options should be written after the test statistic and be separated by semicolons, e.g. A&B (2001) Study1: t(88)=2.1; one-tailed; p < .02.
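
For instance, a small hypothetical input block could look like this (plain statistics first, then fully annotated lines):

t(45) = 3.4
F(1, 25) = 4.4
Z = 2.02
chi2(1) = 2.4
A&B (2001) Study1: t(88) = 2.1; one-tailed; p < .02
A&B (2001) Study2: r(188) = 0.276; p < .001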


Possible test statistics:
  • t-values: t(45)=3.4
  • F-values: F(1, 25)=4.4
  • Z-values: Z=2.02
  • chi2-values: chi2(1)=2.4
  • r-values: r(188)=0.276
The numbers in parentheses generally are degrees of freedom. In the case of correlations (r values), the df are often not explicitly provided in the results section; they are N - 2.
If two numbers are provided for chi2, the first is the df and the second is the sample size (e.g., chi2(1, 326) = 3.8).
In the case of Z values and p values, the number in parentheses is the sample size (e.g., p(52) = 0.02).
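
As an illustration (not necessarily the app's exact code), these test statistics map onto two-tailed p values in R roughly like this:

2 * pt(3.4, df = 45, lower.tail = FALSE)          # t(45) = 3.4
pf(4.4, df1 = 1, df2 = 25, lower.tail = FALSE)    # F(1, 25) = 4.4
2 * pnorm(2.02, lower.tail = FALSE)               # Z = 2.02
pchisq(2.4, df = 1, lower.tail = FALSE)           # chi2(1) = 2.4
r <- 0.276; df_r <- 188                           # r(188) = 0.276 (df = N - 2)
2 * pt(abs(r) * sqrt(df_r / (1 - r^2)), df = df_r, lower.tail = FALSE)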

Manual

Data extraction:

  • All p values can be extracted, both from focal hypothesis tests and from ancillary analyses such as manipulation checks. However, only p values for which precise dfs are reported are extracted (i.e., results such as “Fs < 1, ps > .50” are not extracted).
  • Format:
    • Study ID: teststatistic; [optional] reported p value; [optional] critical p value; [optional, if one-tailed testing] one-tailed
      • [optional] reported p value: e.g., p = .03, or p < .05
      • [optional] critical p value: e.g., crit = .10, or crit = .08
      • [optional, if one-tailed testing]: write the keyword one-tailed, or just one, or 1t
    • The colon separates study ID from everything else
    • If the study ID starts with an underscore, this test statistic is not a focal test (e.g., from a manipulation check, a pre-test, or an ancillary analysis for possible alternative explanations), and will not be included in R-Index or p-curve analyses (but it will be included in the test for correct p-values)
    • The first datum after the colon must be the test statistic
    • All optional information is separated by semicolons and can be given in any order
    • At the end of a line a comment can be written after a # sign (everything after the # is ignored)
  • Examples:
    • M&E (2005) S1: t(25) = 2.1; p < .05; one-tailed
    • M&E (2005) S2: F(1, 45) = 4.56; p = .03 # wrong p value?
    • M&E (2005) S3: chi2(1) = 3.7; crit=.10
    • _M&X (2011) S1: r(123) = .08; p = .45 # this was a manipulation check (see underscore)
  • Be careful if you copy & paste the results from a PDF:
    • Sometimes there are invisible special characters. They are shown in the app as weird signs and must be removed.
    • The minus sign sometimes looks a bit longer (an “em-dash”). This should be replaced with a standard minus sign.
  • Which tests to select in the presence of interactions? Some hints from Simonsohn et al.’s (2014) p-curve paper:
    • “When the researcher’s stated hypothesis is that the interaction attenuates the impact of X on Y (e.g., people always sweat more in summer, but less so indoors), the relevant test is whether the interaction is significant (Gelman & Stern, 2006), and hence p-curve must include only the interaction’s p-value. […] Simple effects from a study examining the attenuation of an effect should not be included in p-curve, as they bias p-curve to conclude evidential value is present even when it is not.”
    • “When the researcher’s stated hypothesis is that the interaction reverses the impact of X on Y (e.g., people sweat more outdoors in the summer, but more indoors in the winter), the relevant test is whether the two simple effects of X on Y are of opposite sign and are significant, and so both simple effects’ p-values ought to go into p-curve. The interaction that is predicted to reverse the sign of an effect should not be included in p-curve, as it biases p-curve to conclude evidential value is present even when it is not.”

Sign of effects

You can provide the sign of a test statistic (e.g., t(123) = -2.8). This sign, however, is ignored by the R-Index, TIVA, and p-curve, which only use the p values. Hence, for these analyses it is currently implicitly assumed that all effects go in the predicted direction.
The meta-analysis tab, in contrast, respects the sign.
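
To illustrate how the sign carries over to the effect sizes in the meta-analysis tab, here is a rough sketch assuming two independent groups of equal size and the common approximation d = 2*t/sqrt(df); the app's actual conversion formulas (following Borenstein et al., 2011) may be more detailed:

t_to_d <- function(t, df) 2 * t / sqrt(df)   # rough conversion for equal cell sizes
t_to_d( 2.8, 123)   # about  0.50: effect in the predicted direction
t_to_d(-2.8, 123)   # about -0.50: same magnitude, opposite sign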

Special cases

  • Significant studies are used to determine the success rate in the R-Index analysis. Sometimes, however, marginally non-significant p values (e.g., p = .051) are falsely rounded downwards and cross the critical boundary only due to this error (i.e., they are reported as “p < .05”). In this case, the ES does not count as a "success" (see column “significant” in the R-Index tab), as the actual p value is not significant. If the ES has nonetheless been (falsely) interpreted as significant by the original authors, the critical value can be slightly increased (e.g., crit = .055), so that the ES is also counted as a success in the R-Index analysis. The decision whether such "near-significant" studies that are falsely interpreted as significant should be counted as "successes" in the R-Index analysis should be made a priori.
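
A hypothetical entry for such a case could look like this (with df = 48, a t value of 2.0 corresponds to a two-tailed p of about .051):

A&B (2010) S2: t(48) = 2.0; p < .05; crit = .055  # reported as significant although p is about .051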

Reproducible Analyses

Copy the link below the text entry area for a reproducible analysis. This way you can share any p-value analysis in a single link!

Roxygen-style header for your analysis

You can add a title, subtitle, details, and a URL for your analysis using the following syntax:
#' @title The title of your analysis
#' @subtitle by Slartibartfast
#' @details Go and replace the examples in the text box!
#' @url http://shinyapps.org/apps/p-checker/

Technical Details

Egger's test and PET-PEESE are implemented as:
# weighted least squares formulation:
PET   <- lm(d ~ d.se,  data = data, weights = 1/d.var)
PEESE <- lm(d ~ d.var, data = data, weights = 1/d.var)

# meta-regression formulation (metafor):
PET   <- rma(yi = d, vi = d.var, mods = d.se,  method = "DL")
PEESE <- rma(yi = d, vi = d.var, mods = d.var, method = "DL")
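
For context, here is a self-contained sketch of the conditional PET-PEESE estimator in the spirit of Stanley and Doucouliagos (2014), using hypothetical data and the metafor package; it is an illustration, not necessarily identical to the app's implementation:

library(metafor)

# hypothetical data: one row per effect size (d = Cohen's d, d.var = its sampling variance)
dat <- data.frame(d     = c(0.42, 0.31, 0.55, 0.12),
                  d.var = c(0.040, 0.025, 0.060, 0.015))
dat$d.se <- sqrt(dat$d.var)

PET   <- rma(yi = d, vi = d.var, mods = ~ d.se,  data = dat, method = "DL")
PEESE <- rma(yi = d, vi = d.var, mods = ~ d.var, data = dat, method = "DL")

# Conditional rule (one rendering of Stanley & Doucouliagos, 2014): if the PET intercept
# indicates a non-zero effect (one-tailed test at alpha = .05), report the PEESE intercept
# as the bias-corrected estimate; otherwise report the PET intercept.
pet_significant <- coef(PET)["intrcpt"] > 0 && PET$pval[1] / 2 < .05
est <- if (pet_significant) coef(PEESE)["intrcpt"] else coef(PET)["intrcpt"]
est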

Terms of Use

Have fun playing around with p-checker! This web application provides several tests for publication bias/p-hacking/indicators of data-dependent analyses, whatever term you prefer. Some of them are new, unpublished, and to some extent controversial; the purpose of this app is to provide a unified place for trying out and comparing these methods. Please use the tests with care.
When you do an actual analysis, remember:
  • It is not OK to search for single papers that score low on a certain index ("cherry-picking") and to single out these papers. Sampling variation applies to papers as well, and rare combinations of results can occur by chance.
  • Always analyze papers with a defensible a priori inclusion criterion, e.g.: "All papers from a certain journal issue that have more than 2 studies", or "The 10 most cited papers of a working group".
  • Disclose the inclusion rule.
  • Take care which p values can be included. p-curve, for example, assumes independence of p values; that means you usually extract only one p value per sample.
  • In general: RTFM of the tests you do!

I strongly recommend reading Simonsohn et al.'s (2014) p-curve paper. It offers sensible recommendations and rules of thumb about which papers and test statistics to include in an analysis.

About

(c) 2018 by Felix Schönbrodt (www.nicebread.de).

Citation

Programming this app took a considerable effort and amount of time. If you use it in your research, please consider citing the app, and of course the creators of the statistical tests:

Schönbrodt, F. D. (2018). p-checker: One-for-all p-value analyzer. Retrieved from http://shinyapps.org/apps/p-checker/. The source code of this app is licensed under the open GPL-2 license and is published on GitHub.


This Shiny app implements p-curve (Simonsohn, Nelson, & Simmons, 2014; see http://www.p-curve.com) in both its previous ("app2") and its current ("app3") version, the R-Index and the Test of Insufficient Variance, TIVA (Schimmack, 2014; see http://www.r-index.org/), and a test of whether p values are reported correctly.

The p-curve code is to a large extent adapted or copied from Uri Simonsohn. The TIVA code was adapted from Moritz Heene; the original fasterParser and several GUI functions are by Tobias Kächele.

Citation of tests

If you use the statistical tests implemented in this app, please also cite their creators:

Begg, C. B., & Mazumdar, M. (1994). Operating characteristics of a rank correlation test for publication bias. Biometrics, 50, 1088–1101.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315, 629–634.

Schönbrodt, F. D. (2015). p-checker: One-for-all p-value analyzer. Retrieved from http://shinyapps.org/apps/p-checker/.

Schimmack, U. (2014). Quantifying Statistical Research Integrity: The Replicability-Index. Retrieved from http://www.r-index.org

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143, 534–547. doi:10.1037/a0033242

Stanley, T. D., & Doucouliagos, H. (2014). Meta-regression approximations to reduce publication selection bias. Research Synthesis Methods, 5, 60–78. doi:10.1002/jrsm.1095

Disclaimer / Validity of the results

I cross-validated the results with p-curve.com and did not find differences (unsurprisingly, as I use Uri's code for p-curve to a large extent). With a single click (see the "Export" tab) you can transfer the test statistics to p-curve.com and cross-validate the results yourself. I also checked the results with the R-Index Excel-sheet and did not find differences so far.
Nonetheless, this app could contain errors, and a healthy scepticism towards the results is indicated. I always recommend performing some plausibility checks. Feel free to go to the source code and check the validity yourself. If you suspect a bug or encounter errors, please send me an email with your test statistics and a description of the error.

Comments

Any detected bugs, comments, and feature requests are welcome: felix@nicebread.de

Release notes / Version history

Version 0.7 (2018-01-15)

Changes:
  • Added PET-PEESE and Egger's test. Renamed tab to "Meta-Analysis". PLEASE read the disclaimer on top of the meta-analysis tab: The test statistics are converted to Cohen's d wherever possible, based on the formulas provided by Borenstein, Hedges, Higgins, & Rothstein (2011). Warning: These effect size conversions are based on approximative formulas; furthermore, the app always assumes equal cell sizes and other simplifications. Although these proxies work well under many conditions, this quick meta-analytic overview cannot replace a proper meta-analysis! For the same reasons, the results here might differ slightly from published results of a proper meta-analysis.
    You should rather see this as a quick prototyping/screening tool.
  • Fixed some signs and one-tailed stats in the glucose demo data.
  • Removed "Export" tab (the export link now is below the text entry field on the left side).
  • Slightly updated design
  • Fixed a bug in TIVA: p-values are now one-tailed (as it should be). Thanks to Aurelien Allard and Katie Corker for reporting the bug.

Version 0.6.2 (2016-10-04)

Changes:
  • Changed TIVA computation to log(p), which allows much smaller p-values (thanks to Rickard Carlsson @RickCarlsson for pointing out the bug).
  • Added power posing p-curve data from Joe Simmons and Uri Simonsohn (see http://datacolada.org/37)

Version 0.6.1 (2016-06-14)

Changes:
  • New "test statistic": You can now directly enter p values (optionally with df in parentheses), based on a suggestion by Roger Giner-Sorolla. If df are provided, an effect size is computed based on an approximative conversion formula.
    Examples:
    • p=.034
    • p(48)=.024

Version 0.6 (2016-02-22)

Changes:
  • Added 33% (or other) theoretical p-curve in plot
  • Moved comparison-power-slider to standard controls

Version 0.5 (2016-02-15)

Changes:
  • Included Begg's test for publication bias
  • Fixed bug in effect size plot
  • "Send to p-curve" now links to app4
  • Much improved parser (at least 100x faster)

Known issues:

  • TODO code clean-up: Clearly separate the inference functions from UI functions