adversarial collaboration: A good-faith effort to resolve scientific debates by jointly carrying out research. Originally proposed by Mellers, Hertwig, and Kahneman in 2001 as a substitute for the more common approach of publishing commentaries and rejoinders on articles. Researchers are expected to clearly express each other's theoretical views, collectively design studies to test diverging predictions, and publish the results from the studies that are performed. Sources: Mellers, B., Hertwig, R., & Kahneman, D. (2001). Do frequency representations eliminate conjunction effects? An exercise in adversarial collaboration. Psychological Science, 12(4), 269–275. https://doi.org/10.1111/1467-9280.00350
alpha level: The threshold chosen in Neyman-Pearson hypothesis testing to distinguish test results that lead to the decision to reject the null hypothesis, or not, based on the desired upper bound of the Type 1 error rate. An alpha level of 5% is most commonly used, but other alpha levels can be used as long as they are determined and preregistered by the researcher before the data is analyzed.
alpha spending function: A specification of how the total alpha level will be distributed across multiple looks at the data in a sequential design.
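One commonly used example is the Pocock-type spending function of Lan and DeMets, which allocates cumulative alpha as alpha * ln(1 + (e − 1) * t) at information fraction t; a minimal Python sketch (the two-look schedule below is an illustrative assumption):

```python
from math import e, log

def pocock_spending(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t (0 < t <= 1),
    using the Pocock-type spending function alpha * ln(1 + (e - 1) * t)."""
    return alpha * log(1 + (e - 1) * t)

# cumulative alpha available at two equally spaced looks at the data
first_look = pocock_spending(0.5)
final_look = pocock_spending(1.0)
print(first_look, final_look)
```

At t = 1 the function spends exactly the total alpha, so the overall Type 1 error rate stays at the chosen level across all looks.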
a-priori power analysis: A calculation of the sample size that is required to achieve a desired statistical power (or Type 2 error rate) when testing a hypothesis with a specific statistical test, given the alpha level and an effect size of interest.
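As a sketch of this calculation, assuming a two-sided two-sample z-test (a normal approximation to the t-test) with an illustrative effect size and power target:

```python
from math import ceil
from statistics import NormalDist

def required_n_per_group(d, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided two-sample z-test,
    via the normal approximation n = 2 * ((z_{1-a/2} + z_{power}) / d) ** 2."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(required_n_per_group(0.5))  # a 'medium' effect by Cohen's benchmarks
```

Smaller effect sizes of interest require (much) larger samples: halving d roughly quadruples the required n.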
auxiliary hypotheses: Premises or assumptions that are taken for granted and relied upon when testing a hypothesis. A negative test result therefore only means that either our hypothesis or one of the auxiliary hypotheses must be false. Source: Hempel. (1966). Philosophy of Natural Science. Pearson.
Bayes factor: The Bayes factor measures the strength of evidence for one model (e.g., the null hypothesis) relative to another model (e.g., the alternative hypothesis); this ratio of probabilities or densities of the observed data under the two models is the amount by which one's belief in one hypothesis versus another should change after having collected data.
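A minimal worked example for binomial data, assuming a point null of p = 0.5 against an alternative with a uniform Beta(1, 1) prior on p (under which the marginal likelihood of k successes in n trials is 1/(n + 1)):

```python
from math import comb

def bayes_factor_01(k, n):
    """BF_01 for k successes in n trials: evidence for the point null
    p = 0.5 relative to a uniform-prior alternative."""
    p_data_h0 = comb(n, k) * 0.5 ** n  # binomial likelihood at p = 0.5
    p_data_h1 = 1 / (n + 1)            # marginal likelihood under Beta(1, 1)
    return p_data_h0 / p_data_h1

print(bayes_factor_01(8, 10))  # < 1: data favor the alternative
print(bayes_factor_01(5, 10))  # > 1: data favor the null
```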
causal enquiry: Establishing that A is the cause of B, or assessing whether A is the cause of B.
Cohen's d: The standardized mean difference of an effect, computed by dividing the mean difference by the standard deviation, which allows the effect size to be compared across different measures.
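A minimal stdlib-Python sketch for two independent groups, using the pooled standard deviation as the standardizer (one common choice among several):

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(x, y):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_sd = sqrt(((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2)
                     / (nx + ny - 2))
    return (mean(x) - mean(y)) / pooled_sd

print(cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```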
common language effect size: Also known as the probability of superiority, the common language effect size is a percentage that expresses the probability that a randomly sampled person from one group will have a higher observed measurement than a randomly sampled person from the other group (for between-subjects designs) or (for within-subjects designs) the probability that an individual has a higher value on one measurement than the other.
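For two observed samples, this can be computed directly from all pairwise comparisons (ties counted as one half); a short sketch:

```python
def probability_of_superiority(x, y):
    """Proportion of pairs (xi, yj) in which xi exceeds yj,
    with ties counted as 0.5 -- the common language effect size."""
    wins = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)
    return wins / (len(x) * len(y))

print(probability_of_superiority([2, 3], [1, 2]))
```

By construction the two directions are complementary: the value for (x, y) and for (y, x) sum to 1.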
compromise power analysis: In a compromise power analysis the sample size and the effect size are fixed, and the error rates of the test are calculated, based on a desired ratio between the Type 1 and Type 2 error rates.
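A sketch of this calculation, assuming a two-sided two-sample z-test (normal approximation) and solving for alpha by bisection; the effect size, sample size, and 4:1 error ratio are illustrative defaults:

```python
from statistics import NormalDist

norm = NormalDist()

def beta_two_sample_z(alpha, d, n_per_group):
    """Type 2 error rate of a two-sided two-sample z-test."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5  # noncentrality for effect size d
    return norm.cdf(z_crit - ncp) - norm.cdf(-z_crit - ncp)

def compromise_alpha(d, n_per_group, ratio=4.0):
    """Bisection for the alpha at which beta / alpha equals `ratio`
    (beta / alpha decreases monotonically as alpha increases)."""
    lo, hi = 1e-6, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if beta_two_sample_z(mid, d, n_per_group) / mid > ratio:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha = compromise_alpha(0.5, 50, ratio=4.0)
print(alpha, beta_two_sample_z(alpha, 0.5, 50))
```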
computational reproducibility: Computational reproducibility means that anyone can recreate the reported results (such as test results, tables, and figures) on the basis of the available data, analysis code, and other necessary files.
confidence interval: An interval around an estimate that, in the long run, will capture the population value a desired percentage of the time.
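The long-run coverage interpretation can be checked by simulation; a stdlib-Python sketch that repeatedly samples from a known population and counts how often a z-based 95% interval captures the true mean (the sample size and number of replications are illustrative, and the z critical value slightly undercovers for small samples compared to a t-based interval):

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
reps, n, true_mu = 2000, 30, 0.0

hits = 0
for _ in range(reps):
    sample = [random.gauss(true_mu, 1) for _ in range(n)]
    m, half_width = mean(sample), z * stdev(sample) / n ** 0.5
    hits += (m - half_width <= true_mu <= m + half_width)

coverage = hits / reps
print(coverage)  # close to 0.95 in the long run
```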
critical effect size/minimal statistically detectable effect
degrees of freedom
directional test: A statistical test where the null hypothesis consists of all values smaller than a specific value (e.g., x ≤ 0) and the alternative hypothesis consists of all values in the opposite direction (e.g., x > 0). Also referred to as a one-sided test.
equivalence testing: A statistical procedure to reject the hypothesis that an effect is as extreme as, or more extreme than, the smallest effect size of interest. Equivalence tests such as the TOST (two one-sided tests) procedure can be used to falsify the claim that effects exist that are large enough to matter.
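A minimal TOST sketch, assuming a normally distributed estimate with known standard error and symmetric equivalence bounds (the numbers in the usage lines are illustrative):

```python
from statistics import NormalDist

def tost_p(estimate, se, low, high):
    """TOST: two one-sided z-tests against the lower and upper equivalence
    bounds; equivalence is declared if the LARGER p-value is below alpha."""
    norm = NormalDist()
    p_lower = 1 - norm.cdf((estimate - low) / se)   # H0: effect <= low
    p_upper = norm.cdf((estimate - high) / se)      # H0: effect >= high
    return max(p_lower, p_upper)

print(tost_p(0.0, 0.1, -0.5, 0.5))  # precise estimate near zero: equivalent
print(tost_p(0.4, 0.2, -0.5, 0.5))  # cannot reject effects beyond the bounds
```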
false negative: a decision error where the null hypothesis is not rejected, even though there is a true effect in the population.
false positive: a decision error where the null hypothesis is rejected, even though there is no true effect in the population.
false positive report probability/false positive risk
falsifiability: The extent to which a claim can be disproven, or falsified. The falsifiability of claims is an essential requirement in philosophies of science based on methodological falsificationism.
follow-up bias: A term used to indicate that power analyses based on effect size estimates from pilot studies are biased on average: such power analyses are not performed when the pilot effect size is close to zero (which would lead to infeasibly large sample sizes), so the effect size estimates that are actually used are inflated, the power analyses return sample sizes that are too small, and the resulting studies are underpowered.
HARKing: Hypothesizing after the results are known. The practice of developing a hypothesis after looking at the data, but presenting the hypothesis as if it was developed before looking at the data. HARKing leads to tests that, unbeknownst to a naive reader, lack severity.
highest density interval
institutional review board
intersection-union testing approach
interval hypothesis/range prediction
Lindley's paradox: The statistical fact that it is possible to reject the null hypothesis (e.g., p < 0.05) when the data provide stronger evidence (e.g., as indicated by a likelihood ratio) for the null hypothesis than for the alternative hypothesis.
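The paradox can be made concrete with a large-sample sketch: below, an observed mean with z = 2.1 yields p < 0.05 (rejecting the null), yet the likelihood of the data under the null exceeds its marginal likelihood under an alternative with an assumed N(0, 1) prior on the effect (the sample size and prior are illustrative choices):

```python
from statistics import NormalDist

norm = NormalDist()
n = 100_000
se = 1 / n ** 0.5
obs = 2.1 * se  # observed mean corresponding to z = 2.1

p = 2 * (1 - norm.cdf(2.1))  # two-sided p-value, below 0.05
like_h0 = NormalDist(0, se).pdf(obs)                    # effect exactly zero
like_h1 = NormalDist(0, (1 + se ** 2) ** 0.5).pdf(obs)  # N(0, 1) prior on effect
print(p, like_h0 / like_h1)  # ratio > 1: the data favor the null
```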
maximum likelihood estimator
meta-regression: The application of regression models in meta-analysis to estimate properties of the underlying model generating the distribution of effect sizes, such as the mean effect size, or the degree of funnel plot asymmetry. Examples include Egger's regression and PET-PEESE.
minimal statistically detectable effect: The smallest effect size that, if observed, would lead to a rejection of the null hypothesis.
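A sketch of this quantity for a two-sided two-sample z-test (normal approximation), where the critical Cohen's d is the critical z value rescaled by the standard error of the standardized mean difference:

```python
from statistics import NormalDist

def critical_d(n_per_group, alpha=0.05):
    """Smallest observed Cohen's d that reaches p < alpha in a
    two-sided two-sample z-test (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_crit * (2 / n_per_group) ** 0.5

print(critical_d(50))   # with 50 per group, only d >= ~0.39 can be detected
print(critical_d(200))  # larger samples can detect smaller observed effects
```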
minimum effect test: A statistical hypothesis test where the value that is tested against is not a null effect, but a smallest effect size of interest.
Neyman-Pearson approach: An approach to statistical inferences where observed data is used to make decisions about the rejection or non-rejection of hypotheses while controlling the maximum error rate.
nil null model
number needed to treat
open science: A set of practices for reproducibility, transparency, sharing, and collaboration based on openness of data and tools that allows others to reuse and scrutinize research.
positive predictive value
post-hoc power analysis
p-value: The probability of the observed data, or more extreme data, if the null hypothesis is true. The lower the p-value, the more extreme the test statistic, and the less likely it is to observe data this extreme if the null hypothesis is true.
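For a z statistic this relationship is direct; a minimal sketch of the two-sided p-value, which shrinks as the test statistic becomes more extreme:

```python
from statistics import NormalDist

def p_value_two_sided(z):
    """Two-sided p-value for a z statistic: the probability of a result
    as extreme or more extreme than z if the null hypothesis is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(p_value_two_sided(1.96))  # approximately 0.05
print(p_value_two_sided(3.0))   # a more extreme statistic, a smaller p
```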
preregistration: The practice of registering properties of a study in an online database that is (or can be made) publicly accessible. Can be used to communicate that a study is being performed (as is common practice in randomized controlled trials) and to communicate which properties of the study (such as the hypotheses, experimental design, and statistical analysis plan) were planned before the researchers had access to the data.
probability density function: The probability density function (pdf) is a function that completely characterizes the distribution of a continuous random variable. Integrating the pdf over a range of values gives the probability that the random variable will fall within that range. Source: https://www.statlect.com/glossary/probability-density-function
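This integral interpretation can be verified numerically; a sketch that integrates the standard normal pdf over (−1, 1) with a simple midpoint rule and compares the result to the difference of cumulative distribution values:

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# P(-1 < X < 1) for a standard normal: midpoint-rule integral of the pdf
steps = 10_000
width = 2 / steps
prob = sum(normal_pdf(-1 + (i + 0.5) * width) * width for i in range(steps))
print(prob)  # approximately 0.6827, the familiar 'one sigma' probability
```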
probability of superiority: See common language effect size.
randomized controlled trial
ROPE procedure: An estimation-based approach in which a posterior distribution is compared to a region of practical equivalence (ROPE) to determine whether the effect is close enough to zero to be considered the absence of a meaningful effect; conceptually similar to equivalence testing.
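A minimal sketch, assuming a normally distributed posterior for the effect and an illustrative ROPE of (−0.1, 0.1): the quantity of interest is the posterior mass inside the region.

```python
from statistics import NormalDist

def rope_mass(post_mean, post_sd, rope=(-0.1, 0.1)):
    """Posterior probability mass inside the region of practical
    equivalence, for a normal posterior distribution."""
    posterior = NormalDist(post_mean, post_sd)
    return posterior.cdf(rope[1]) - posterior.cdf(rope[0])

print(rope_mass(0.0, 0.02))  # nearly all mass inside: practically equivalent
print(rope_mass(0.5, 0.02))  # nearly no mass inside: a meaningful effect
```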
sample selection bias
sensitivity power analysis: Computation of the statistical power that is achieved across a range of possible effect sizes, given a chosen alpha level and sample size.
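A sketch of such a computation, assuming a two-sided two-sample z-test (normal approximation) with an illustrative sample size, evaluated across a range of effect sizes:

```python
from statistics import NormalDist

def power_two_sample_z(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample z-test (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5  # noncentrality for effect size d
    return 1 - norm.cdf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

# power across a range of effect sizes for n = 50 per group
for d in (0.2, 0.5, 0.8):
    print(d, round(power_two_sample_z(d, 50), 3))
```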
sequential analysis: Repeatedly testing the same hypothesis in interim analyses as data is collected while controlling the Type 1 error rate across all analyses performed.
severity: The extent to which a claim has been well-probed, or severely tested. Data support a claim to the extent that a test of this claim had a high probability of not corroborating the claim if the claim was false.
small telescopes approach
smallest effect size of interest
trim-and-fill: The trim-and-fill method aims to estimate the number of studies that are missing in a meta-analysis due to bias. Trim-and-fill aims to correct funnel plot asymmetry by trimming (removing) smaller studies, estimating the true effect size, and filling in (adding) studies assumed to be missing due to bias. The method is based on a nonparametric data augmentation technique. In addition, the trim-and-fill method aims to adjust the effect size estimate from the meta-analysis based on the augmented effect sizes. This adjustment does not adequately correct for bias, and the method can therefore not be used to calculate a 'bias corrected' effect size estimate. Trim-and-fill can be used to identify the presence of bias that leads to funnel plot asymmetry. Trim-and-fill was developed by Duval and Tweedie (2000a, 2000b). Sources: https://handbook-5-1.cochrane.org/chapter_10/10_4_4_2_trim_and_fill.htm and https://www.metafor-project.org/doku.php/plots:funnel_plot_with_trim_and_fill
two one-sided tests (TOST) procedure: See equivalence testing.
Type 1 error rate
Type 2 error rate
true negative: a correct decision where the null hypothesis is not rejected when there is no true effect in the population.
true positive: a correct decision where the null hypothesis is rejected when there is a true effect in the population.
union-intersection testing approach
version control: systems where changes in data (such as code or text) are systematically recorded. Version control makes it possible to go back to previous versions and identify when changes were introduced, and prevents multiple individuals working on the same code from accidentally overwriting each other's changes. Subversion (SVN) is one of the most popular centralized version control systems, and Git is the most popular distributed version control system.
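A minimal Git session illustrating the recording of versions (the file name, commit messages, and demo identity are illustrative; a throwaway directory is used so nothing is touched outside it):

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q

# first recorded version of an analysis script
echo "x <- 1" > analysis.R
git add analysis.R
git -c user.name=Demo -c user.email=demo@example.com commit -q -m "Add analysis script"

# a later change becomes a second recorded version
echo "x <- 2" > analysis.R
git -c user.name=Demo -c user.email=demo@example.com commit -q -am "Update analysis"

# the history shows when each change was introduced
versions=$(git log --oneline | wc -l)
echo "$versions recorded versions"
```

Each commit can be inspected or restored later (e.g., with `git log`, `git diff`, and `git checkout`).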