# 14 Computational Reproducibility

Technology has greatly improved how scientists work. The internet has made it easy to share information – including data, materials, and code – and new software and online platforms have emerged to facilitate the workflow of scientists. One important goal of a scientific workflow is to make sure that the final work you publish is computationally reproducible. Computational reproducibility means that when you use the same data as in the published article, you can reproduce the same results. In other words, if the authors of the published article send you their data and code, you should be able to get the exact same numbers as they report in their article. Current research on the computational reproducibility of scientific articles suggests it is often not possible to run the original code on the data to reproduce results. Sometimes the code will simply not run on the data, or not all analyses are part of the code.

Nevertheless, computational reproducibility is important, both for other scholars to be able to verify your results, and to build on your results. We could consider computational reproducibility a minimum standard of your own workflow. However, meeting this standard requires training. When I was a PhD student, we often had a problem known as ‘data rot’. When I submitted an article for publication, and received the reviews after several months, I could not always easily reproduce my own analyses. For example, I might not have stored how I dealt with outliers, and could not exactly reproduce the original results. Sometimes, ‘data rot’ had eaten away at either my data or my analysis code, and it no longer worked.

Obviously, there is no such thing as ‘data rot’. The problem was I did not use a reproducible workflow. In this chapter, we will learn what a computationally reproducible workflow looks like, and how you can share computationally reproducible results with your published paper. The goal of applying a computationally reproducible workflow to your projects is to allow someone else (or yourself, one year from now) to take your data, run your code, and get exactly the same results as you reported in your work.

Although there are multiple ways to achieve a fully reproducible workflow, in this chapter I aim to introduce you to what I believe might be one emerging standard in psychological research. Through an example, you will learn to work with a version control system that stores previous versions of files (such as GitHub, which integrates nicely with the Open Science Framework) as you are programming in R. You will then learn how to write a completely reproducible data analysis script (including figures) that you can save as an HTML file or a PDF file, using RMarkdown. Finally, we will take a look at Code Ocean, a novel online platform that allows you to share computationally reproducible code online, making it extremely easy for others to run (small variations of) your code. While you will not learn how to become an experienced programmer by the end of this chapter, you will see what a fully reproducible workflow would look like, and get some initial experience with tools you will most likely want to explore more in the future.

Getting software and code to work on your system might be a challenge, and regrettably, I can’t offer ICT support. Differences between Windows, Linux, and Apple operating systems mean you might need to search the internet for solutions to problems you run into – this is very normal, and even experienced programmers do this all the time. If you get stuck, you can check what you did against what the code should look like by visiting the public versions of part of this example:

GitHub repository: https://github.com/Lakens/reproducibility_assignment

OSF project: https://osf.io/jky8s/

Code Ocean container: https://codeocean.com/capsule/2529779/tree/v1

## 14.1 Step 1: Setting up a GitHub repository

In this assignment we will use GitHub, but an open source alternative is to use GitLab. If you haven’t created a GitHub account before, do so now. Go to https://github.com/ and create an account. Git is a version control system for tracking changes in computer files and coordinating work on those files among multiple people. Version control allows you to track changes to files and revert to previous versions if needed. GitHub and GitLab are web-based hosting services that make it easier to use version control with Git. We will be using GitHub because it is what I am most familiar with, and it integrates with slightly more tools, but feel free to use GitLab instead.

If you have an account, you can create a new repository. A repository is a collection of folders and files that make up your project. In the top-right of the GitHub page, click the + symbol, and select ‘New repository’ from the dropdown menu.

The first thing to do is name your repository. When it comes to naming folders and files, it is important to follow best practices for file naming:

• Keep names short, but clear. data_analysis_project is easier to understand for others than dat_an_prjct

• Do not use spaces. Options include:

• Underscore: this_is_a_file.R (this is my personal favorite)

• Camelcase: ThisIsAFile.R

• Dashes: this-is-a-file.R

• No spaces: thisisafile.R

• If you want to number multiple sequential files, do not use 1_start, 2_end, but use leading zeroes whenever you might number more than 10 files, so for example 01, 02, etc., or 001, 002, etc.

• Do not use special characters such as $#&*{}: in file names.

• If you want to use date information, use the YYYYMMDD format.

Let’s name our repository: reproducibility_assignment

You can add a short description (e.g., ‘This is an assignment to practice an open and reproducible data analysis workflow’). If you are an academic or student, you can get an academic account, which gives some extra options, such as keeping repositories private: https://education.github.com/pack

Click the checkbox before ‘Initialize this repository with a README’. A readme file is a useful way to provide a more detailed description of your project that will be visible when people visit your GitHub project page. It can also contain instructions on how to reproduce analyses, such as which files to run in which order, and any changes to files that need to be made as the files are run.

You are also asked whether you want to add a license. Adding a license is a way to easily communicate to others how they can use the data, code, and materials that you will share in your GitHub repository. Note that not making a choice about a license is also a choice: if you do not add a license your work is under exclusive copyright by default, which means others can’t easily re-use it. You can learn more about licenses, but for now, a simple choice is the MIT license, which puts only very limited restrictions on reuse, although more restrictive licenses also exist. The MIT license lets people do anything they want with your code as long as they provide attribution back to you and don’t hold you liable. You can select the license of your choice (such as the MIT license) from the dropdown menu. There are also Creative Commons licenses that you can use when you are sharing something other than software, such as research materials (for example, this educational material is shared under a CC-BY-NC-SA 4.0 license).

We are now ready to create the repository.
Click the button to create the repository.

It might feel unintuitive, but it is important to remember that you are not expected to directly interact with your new GitHub repository through the GitHub website. The repository page will give you information about the contents of the repository, and the history of the files in the repository, but it is not particularly easy to add files or download files directly through the website. The idea is that you use other software to interact with your GitHub repository.

## 14.2 Step 2: Cloning your GitHub repository into RStudio

RStudio can communicate with GitHub. To allow RStudio to work together with GitHub, you first need to set up the system. A detailed explanation for different operating systems is provided here. First, download Git: https://git-scm.com/downloads for your operating system, and install it (you can accept all defaults during the installation process). If you haven’t done so already, download and install R: https://cran.r-project.org/, and download and install the free version of RStudio (scroll down for the installers): https://www.rstudio.com/products/rstudio/download/.

In RStudio, go to Tools > Global Options, and select the Git/SVN menu option. Check if the Git executable (“git.exe”) has been found automatically. If not, you will need to click the ‘Browse…’ button and find it manually. It will always be in the location where you installed Git.

Click the ‘Create RSA Key…’ button. A window will appear; you can close it. Still under the RStudio options, click the blue hyperlink ‘View public key’. A window will appear, telling you that you can use CTRL+C to copy the key. Do so.

Go to GitHub, go to settings, and select the option ‘SSH and GPG keys’. Click ‘New SSH key’. Enter a name (e.g., RStudio) and paste the key in the correct window. Click ‘Add SSH Key’. This will allow you to send code from RStudio to your GitHub repositories without having to enter your GitHub login name and password every time.
In other words, RStudio is now connected to your GitHub account and repository. You are now ready to create a version controlled project in RStudio.

Restart RStudio. In RStudio, go to File > New Project. You get three choices. Choose the ‘Version Control’ option, and then choose the ‘Git’ option.

We will be cloning the online GitHub repository we created. Cloning is a term used in Git that means creating a local copy of all files in your repository on your computer. You can copy-paste the URL from your GitHub repository (e.g., https://github.com/Lakens/reproducibility_assignment). If you copy-paste this URL in the top field, it will automatically create a Project directory name that is similar to the name you gave your project on GitHub. You can select a folder on your computer by clicking the ‘Browse’ button to indicate where you want to save the local copy of your repository.

Click ‘Create Project’. RStudio will quickly download the files from your repository, and open the new project. You will see that the project creation was successful because the ‘Files’ tab in the RStudio interface shows we have downloaded some files from our GitHub repository (the README.md and LICENSE files). RStudio also created a .Rproj file and a .gitignore file. The project file stores information about the project, and is required to use GitHub.

We can also see this is a version control project in the top right of the interface, where there is now a ‘Git’ tab. If we click it, we see a range of buttons, such as the Diff, Commit, Pull, and Push buttons. These will be used to interact with GitHub. Many computer programmers interact with GitHub through the command line, such as:

$ git commit -m "This is a git commit message"

Learning to use git through the command line is not needed for most people who just want basic version control. Here, I will exclusively focus on version control and git through the menu options in RStudio. It is now time to create a file for which we want to control the versions we make of it.

## 14.3 Step 3: Creating an R Markdown file

R Markdown files provide a way to save and execute code, while at the same time allowing you to create reports of your data analysis (and even full scientific articles that you can submit for publication!). A complete introduction to R Markdown is available here. The main strength of R Markdown documents is that they allow you to create a fully reproducible document. This means that you do not just have some analysis code in an R script, but a manuscript that combines text and code and that you can compile to create a PDF or HTML version of the manuscript. HTML or PDF files have the advantage that people can read them with regular software. The R Markdown file contains code that performs the analyses each time the document is compiled. Instead of copy-pasting values from your analysis software into a Word document, you combine code and text in the RMarkdown file to create a manuscript where every number or figure can be traced back to the exact code that generated it. That has the advantage that everyone can use your RMarkdown file and generate the same document (e.g., your manuscript) as you.

You can still make errors in the analysis if you use R Markdown files. The important difference is that you will be making programming errors that are stored in the R Markdown document. Unlike a typo made when copying numbers from your analysis into a Word document, an error in the analysis code in your RMarkdown file will produce the same output every time the document is compiled. Because the document is reproducible, all errors are reproducible as well. It is impossible to prevent all errors, but it is possible to make them reproducible. This will make it easier to identify and correct errors. I understand you might worry about others seeing your errors if you allow them to see exactly what you have done. But we all make mistakes, and it is important for science to be able to identify and correct these mistakes. An important aspect of moving to a more reproducible workflow, and sharing all files underlying your manuscript publicly, is that we will have to learn to accept that we all make errors, and appreciate people who correct them.

Let’s start by creating a new R Markdown document in R Studio by clicking New File > R Markdown…

This gives you a new window where you can specify the title of your RMarkdown document and an author name. Enter the title ‘Main Analysis’, and feel free to change the Author subfield to anything you prefer. RMarkdown files can be compiled (also referred to as ‘knitted’) into an HTML file, a PDF document, or a Word document. To generate PDF files you need to install MiKTeX, which we won’t do for this example (a good tutorial on how to install MiKTeX is available here). So leave the default output format set to HTML and click OK.

Let’s start by saving the new file: Click the save button, and save the file under the name ‘main_analysis.Rmd’. Because we are working in an R Studio project, the file will automatically be saved in the same folder as all other files in this project. If you look at the files tab in the bottom right pane, you will see the new file appear. Now let’s take a look at the R Markdown file.

The R Markdown file by default includes several sections to get you started. First, there is a header section. In the header section, there is code that determines how the final document is rendered. This section is sensitive, in the sense that it needs to be programmed exactly right – including spaces and tabs – so it is not recommended to change it too much without looking up detailed documentation on how to change this section. If you want the technical details: An R Markdown file is fed to knitr software, which creates a normal markdown file, which then uses pandoc software to generate the specific document you requested. All of this happens automatically.
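The first step of this pipeline can also be run from the R console, which is a useful way to see what happens behind the Knit button. The sketch below assumes the knitr package is installed (it is whenever the Knit button works); the file names, title, and chunk contents are made up for illustration. It writes a tiny R Markdown file to a temporary folder and asks knitr to turn it into a normal markdown file, the step that comes before pandoc takes over.

```r
# A made-up, minimal R Markdown document: header, some text, one code chunk.
fence <- strrep("`", 3)  # three backticks, built here to keep the example readable
rmd_lines <- c(
  "---",
  "title: \"Pipeline demo\"",
  "output: html_document",
  "---",
  "",
  "Some text written in Markdown.",
  "",
  paste0(fence, "{r}"),
  "x <- c(1, 2, 3, 4)",
  "mean(x)",
  fence
)

rmd_file <- file.path(tempdir(), "pipeline_demo.Rmd")
md_file  <- file.path(tempdir(), "pipeline_demo.md")
writeLines(rmd_lines, rmd_file)

# Step 1: knitr runs the R code and writes a normal markdown file.
knitr::knit(rmd_file, output = md_file, quiet = TRUE)
md_text <- readLines(md_file)

# Step 2 (requires pandoc, which is bundled with RStudio): create the HTML file.
# rmarkdown::render(rmd_file)
```

Printing md_text shows the original text together with the R code and its output, now written as plain markdown.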

The header is followed by a set-up section where you can define general options for the entire R Markdown file. Then, we see the two main kinds of content: Markdown text, written in a plain-text markup language that can be easily converted into HTML or other formats, and R code that is used to analyze data or create figures. To see the final result of this code, hit the Knit button in the toolbar at the top of the pane.

Either a new window will appear that allows you to view the HTML file that was created, or your document will appear in the ‘viewer’ tab in RStudio. You see the formatted HTML document that combined both text and the output of R code.

Close the window – we are now ready to analyze our data.

## 14.4 Step 4: Reproducible Data Analysis in R Studio

Delete all text from ## R Markdown on down – only keep the header and set-up sections of the default document.

First, we need to analyze some data. We will download this data directly from an existing GitHub repository I created. Students in an introduction to psychology course performed a simple Stroop experiment. During the Stroop experiment participants named the colors in a congruent trial (e.g., the word 'red' written in a red font) and incongruent trial (e.g., the word 'red' written in a green font). The time they took to name all words was recorded in seconds (e.g., 21.3 seconds) for both the congruent and incongruent trial. There are four columns in the dataset:

• Participant Number

• Response Time for Congruent Stimuli

• Response Time for Incongruent Stimuli

• Year of Data Collection

Click the button ‘+C Insert’ to insert code – a dropdown menu will be visible. Select R.

In the R Markdown file, you’ll see a new section of R code that starts with three backticks followed by {r} and ends with three backticks. You can also just create these sections by manually typing in these two lines.

Copy-paste the code below – make sure to get all the text – and paste it between the start line and the end line of the R code chunk.

stroop_data <- read.table("https://raw.githubusercontent.com/Lakens/Stroop/master/stroop.txt",
                          sep = "\t", header = TRUE)

write.table(stroop_data, file = "stroop.csv", quote = FALSE, row.names = FALSE)

After copy-pasting the text, the code section should look like this:

This code creates a data.frame called ‘stroop_data’ that contains data, and then saves this data in a .csv file called ‘stroop.csv’. Click the Knit button to look at the document:

You should see something like:

This might not look very impressive – but the real action is in the file pane in the bottom right part of the screen. Close the window showing the HTML output and look at the file pane. You should now see several files:

One file is stroop.csv – this is our data file of the Stroop data that we downloaded from the internet and saved to our project folder, using R code.

There is really no need to keep downloading the file from the internet when we can also just load it from the local folder. So let’s change the code. We won’t completely delete this code – we will just comment it out by placing a # in front of it. This way, we can still remember where we downloaded the file from, but we won’t use the code.

Because it is always important to provide comments in the code you write, add this explanation above the line where we downloaded the data:

Then, select the lines of code in the chunk, and press (on Windows) CTRL+SHIFT+C (or click ‘Code’ in the toolbar and then ‘comment/uncomment lines’). This should add # in front of all lines, turning them into comments instead of code that is executed every time. You should end up with:

Now we need to add a line of code that we will run, and with which we will load the stroop.csv dataset from the local folder. Underneath the last commented out line of code, but within the R code block, add:

stroop_data <- read.csv("stroop.csv", sep = " ", header = TRUE)
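A quick check can confirm that writing and re-reading a file in this way gives back the same data. The sketch below uses a small made-up data frame (the column names and values are hypothetical, not the real Stroop data) and a temporary file, so it runs without an internet connection. It also illustrates why sep = " " is needed when reading: write.table() separates values with spaces by default, even when the file name ends in .csv.

```r
# Hypothetical stand-in for the Stroop data: four columns, three participants.
toy_data <- data.frame(
  PPNR = 1:3,
  Congruent = c(12.5, 14.1, 11.8),
  Incongruent = c(21.3, 19.9, 23.4),
  Year = c(2013L, 2013L, 2014L)
)

# Write the data the same way as in the chapter (space-separated by default).
toy_file <- file.path(tempdir(), "toy_stroop.csv")
write.table(toy_data, file = toy_file, quote = FALSE, row.names = FALSE)

# Read it back with a space as separator and compare with the original.
toy_reloaded <- read.csv(toy_file, sep = " ", header = TRUE)
same_data <- isTRUE(all.equal(toy_data, toy_reloaded))
```

If same_data is TRUE, the local copy can safely replace the download in later runs.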

Click save, or press CTRL+S, to save the file. Knit the file. We see:

Close the HTML file. We’ve done quite a lot of work. It would be a shame if this work was lost. So this seems to be the perfect time to save a version of our R Markdown file, not just locally, but also on GitHub.

## 14.5 Step 5: Committing and Pushing to GitHub

It is time to store our changes in the cloud, on GitHub. This process takes two steps. First, we record the changes to the repository (aka the code and files we created), which is called ‘commit’. This does not require an internet connection, because we are just recording the changes locally. However, then we want to make sure these recorded changes are also stored on GitHub, which requires you to push the files to GitHub.

If we look at the Git tab in the top right pane in RStudio, we see the Commit button, the Push button, and we see a bunch of files. The status of these files is indicated by two question marks in yellow. These question marks indicate these files are not yet tracked by Git. Let’s change this.

Click the commit button. A menu opens. You can choose to ‘stage’ the changes that have been made. Staging basically means selecting which of the files you want to record, or commit. You can do this in several ways, such as double clicking each file, or selecting all files and pressing ‘Enter’. When staging all files, the yellow question marks change to a green ‘A’ symbol. Every commit should be accompanied by a commit message where you describe which changes you have made – after all, we are recording our changes. You can type in anything you want – it is common to write something like ‘initial commit’ the first time you commit changes. The menu should look like the screenshot below:

Now we are ready to commit these changes. Click the ‘Commit’ button. A new window opens that shows all changes that have been committed. We see that 5 files have changed. You can close this window and close the previous menu.
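For readers curious what these buttons do behind the scenes, the sketch below repeats the staging and committing steps on a throwaway repository by calling the git program from R. It assumes git is installed and can be found on your search path; the helper function git(), the file contents, and the author details are invented for the example, and nothing is pushed to GitHub.

```r
# Hypothetical helper: run a git command inside a given folder.
git <- function(repo, ...) {
  system2("git", c("-C", shQuote(repo), ...), stdout = TRUE, stderr = TRUE)
}

# Create an empty repository in a temporary folder and add one file.
repo <- tempfile("demo_repo")
dir.create(repo)
init_message <- git(repo, "init")
writeLines("# demo", file.path(repo, "main_analysis.Rmd"))

# An untracked file is listed with two question marks.
status_before <- git(repo, "status", "--porcelain")

# Staging marks the file as added ('A').
staged <- git(repo, "add", "main_analysis.Rmd")
status_after <- git(repo, "status", "--porcelain")

# Committing records the staged changes locally, with a commit message.
committed <- git(repo, "-c", "user.name=Example",
                 "-c", shQuote("user.email=example@example.com"),
                 "commit", "-m", shQuote("initial commit"))
history <- git(repo, "log", "--oneline")
```

The two status values mirror what the Git tab shows before and after staging, and history contains one line per commit.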

RStudio now reminds you that there is a difference between the local copy of your repository, and the remote version of the repository on GitHub. In the Git tab you see a reminder: “Your branch is ahead of ‘origin/master’ by 1 commit.”

This means the files we have updated and recorded on our computer with a commit are not yet synchronized with the remote repository on GitHub. We can solve that by ‘pushing’ (aka synchronizing) the changes to the remote repository. Simply click the push button:

Another pop-up window appears:

This window informs us there were no errors, and we successfully pushed the changes to the remote version of the repository. You can close this window.

You can check that you successfully pushed all files to GitHub by visiting the GitHub page for your repository in the browser. You should see something like:

Congratulations on your first GitHub push! If you want to read a more extensive introduction to Git, see Vuorre & Curley (2018).

## 14.6 Step 6: Reproducible Data Analysis

So far, we have only read in data. The goal of an R Markdown file is to create a manuscript that contains a fully reproducible data analysis. In this chapter, I cannot teach you how to analyze data in R (but I can highly recommend learning it – there are plenty of excellent online resources). Instead of programming from scratch, visit this raw text version of the R Markdown file that will analyze the Stroop data. In the website, select all text (CTRL+A), copy it (CTRL+C). Then go to your main_analysis.Rmd file in R Studio. Select all text (CTRL+A) and press delete. That’s right – delete everything. You don’t need to worry about losing anything – you have a version controlled file in your GitHub repository, which means you can always go back to a previous version! In the (now empty) main_analysis.Rmd file, press CTRL+V and paste all text. The file should look like the first screenshot below.

This R Markdown file does a number of things, which we will explain in detail below. For example, it will automatically install libraries it needs, load the data, and create a report in HTML. You can press the Knit button, and the HTML document should load. You should see output as in the second screenshot below.

It is important to note that none of the numbers that are in this text are static, or copy-pasted. They are all calculated at the moment that the document is created, directly from the raw data. The same is true for the figures, which are created from the raw data the moment the manuscript is compiled. If you have access to the .Rmd (RMarkdown) file, you can perfectly reproduce the reported data analysis.
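The automatic installation of libraries mentioned above usually follows a simple pattern: check which packages are missing, and install only those. The following is a minimal sketch of that pattern, not necessarily the code used in the example file; the function name missing_packages() and the package list are assumptions for illustration.

```r
# Hypothetical helper: return the packages from 'pkgs' that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Replace this list with the packages your own script uses.
needed <- c("stats", "utils")

# Install only what is missing, then load everything.
to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  install.packages(to_install)
}
invisible(lapply(needed, library, character.only = TRUE))
```

Placing such a block at the top of the file means someone who knits the document on a new computer does not have to work out which packages to install first.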

Since we have made substantial changes, this is the perfect moment to commit and push the changes to GitHub! Go to the Git tab in the top right pane. Click ‘Commit’. The window below will open. If the main_analysis.Rmd file is selected, you will see red and green chunks of text. These tell you what was old (red) and what is new (green).

Select all files that have changed, and ‘stage’ them (for example by pressing enter). The checkboxes in front of the files, under the ‘Staged’ column, should be checked.

Type in a commit message, such as ‘update mean analysis’ in the ‘commit message’ field. Press the ‘Commit’ button. Close the window that pops up to inform you about the result of the commit. Then click ‘push’. Close the window that informs you about the push command, and close the commit window. You can always visit the GitHub repository online and look at the full history of your document to see all changes that have been made.

Let’s take a look at some sections of our new R Markdown document. First the header:

This sets general (global) options for the code chunks in the R Markdown file. Setting echo, warning, and message to FALSE hides the code chunks, warning messages, and other messages, while include = TRUE makes all figures appear in the text. You can set some of these options to TRUE, and hit Knit to see what they change. Sometimes you might want to share the HTML file with all code visible, for example when sharing with collaborators.
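As an illustration, a set-up chunk with the behaviour described here would contain a call like the one below. This is a sketch using knitr's opts_chunk$set() function; the exact options in the example file may differ.

```r
# Global options that apply to every code chunk in the document.
knitr::opts_chunk$set(
  echo = FALSE,     # do not show the R code in the output
  warning = FALSE,  # hide warnings produced by the code
  message = FALSE,  # hide messages, for example from loading packages
  include = TRUE    # keep chunk output, such as figures, in the document
)
```

Individual chunks can still override these defaults, for example by starting a chunk with {r, echo=TRUE}.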

If you scroll down, you can see the introduction text, the code that generates the first figure, and the code that performs the analyses. These variables are used in the Results section. Let’s look at this section:

This section shows how you can mix text and R code. The start of this code is normal text. The *M* is still normal text (the * and * make sure the M is italicized, just as further down the ~av~ indicates these letters should be subscript), but then you see R code. In R Markdown you can embed R code in the text by placing it between two backticks, with the letter r directly after the opening backtick (for example, `r 1 + 1`). Any R code within the two backticks will be executed. In this case, the mean of the Congruent reaction times is calculated, and rounded to 2 digits. You can see this number in the text.
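To see inline code at work outside the example file, the sketch below knits a one-sentence document from the console. It assumes knitr is installed; the variable name and the reaction times are invented, and the backticks are built with strrep() only to keep the example readable.

```r
fence <- strrep("`", 3)  # three backticks open and close a code chunk
tick  <- "`"             # a single backtick opens and closes inline code

rmd_lines <- c(
  paste0(fence, "{r, echo=FALSE}"),
  "congruent <- c(12.5, 14.25, 11.75)",  # made-up reaction times in seconds
  fence,
  "",
  paste0("The mean was *M* = ", tick, "r round(mean(congruent), 2)", tick, " seconds.")
)

rmd_file <- file.path(tempdir(), "inline_demo.Rmd")
md_file  <- file.path(tempdir(), "inline_demo.md")
writeLines(rmd_lines, rmd_file)

# Knit to markdown and pick out the sentence that contained the inline code.
knitr::knit(rmd_file, output = md_file, quiet = TRUE)
knitted <- readLines(md_file)
result_sentence <- grep("The mean was", knitted, value = TRUE)
```

In result_sentence the inline code has been replaced by the computed mean, so changing the data and knitting again updates the sentence automatically.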

Learning to program takes time. You can see some things are quite tricky to program. For example, the code:

## 14.9 Some points for improvement in computational reproducibility

We recently tried to computationally reproduce Registered Reports published in the psychological literature. We noticed some issues that, if solved, would easily improve the computational reproducibility of your work.

First, always add a codebook to data files. We already noted this above, and yes, it is a bit of work and not that fun to do, but it is essential to include a codebook when you share data. Data is easier to understand and more reusable if variables and their values are clearly described. Researchers should ensure that the codebook and variable names are in the same language as the article.
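A codebook does not need to be elaborate. As a minimal sketch, the four columns of the Stroop dataset described earlier could be documented in a small table that is saved next to the data. The variable names, wording, and file name below are assumptions for illustration; use the names that actually appear in your data file.

```r
# One row per variable: its name, what it means, and the unit of measurement.
codebook <- data.frame(
  variable = c("PPNR", "Congruent", "Incongruent", "Year"),
  description = c(
    "Participant number",
    "Time to name all colors in the congruent trial",
    "Time to name all colors in the incongruent trial",
    "Year of data collection"
  ),
  unit = c("none", "seconds", "seconds", "calendar year"),
  stringsAsFactors = FALSE
)

# Save the codebook as a plain text file that can be shared with the data.
# A temporary folder is used here; in a real project, save it in the project folder.
codebook_file <- file.path(tempdir(), "stroop_codebook.csv")
write.csv(codebook, codebook_file, row.names = FALSE)
```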

Second, annotate code so it is clear what the code does. Well-annotated code makes clear what the analysis code does, in which order scripts should be run if there are multiple scripts (e.g., to pre-process the raw data, compute sum scores, analyze the results, and generate graphs), and which output each section of analysis code generates. Sometimes it might even be helpful to, from the final manuscript, copy-paste the sentences in the results section back into the code file, so it is very clear how sentences in the manuscript relate to the code file. It also helps to clearly structure code (e.g., using a README) so others know which output analysis code creates and in which order code should be run.

Third, check whether the code you shared still reproduces all analyses after revisions - researchers often make changes during the peer review process, but forget to update their analysis files.

Finally, remember that most code in R relies on specific libraries (also called packages). List all the packages that the code needs to run at the top of the script. Because packages update, it is necessary to report the version numbers of packages that were used (for example using packrat, or copying the output of the sessionInfo() function as a comment in the script). Remember that folder names and folder structures differ between computers, and therefore you should use relative locations (and not absolute paths like “c:/user/myfolder/code”). RStudio projects and the 'here' package provide an easy way to use relative paths. When multiple scripts are used in the analysis, include the order in which scripts should be performed on the data in a README file.
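Two of these recommendations take only a few lines of base R. The sketch below stores the output of sessionInfo() in a text file and builds a relative path with file.path(). The file name session_info.txt and the folder name data are hypothetical, and a temporary folder is used so the example does not write into your project.

```r
# Record the R version and the versions of all loaded packages.
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)

# The version of a single package (here 'stats', which ships with R).
stats_version <- as.character(packageVersion("stats"))

# Build a path relative to the project folder instead of typing an absolute path.
data_path <- file.path("data", "stroop.csv")
```

Because data_path does not mention a drive or user folder, the same script works on any computer where the project folder has the same internal structure.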