Chapter 1 Introduction

Title image: Reproducibility in Cancer Informatics

1.1 Target Audience

The course is intended for students in the biomedical sciences and researchers who use informatics tools in their research and have not had training in reproducibility tools and methods.

This course is written for individuals who:

  • Have some familiarity with R or Python - have written some scripts.
  • Have not had formal training in computational methods.
  • Have limited or no familiarity with GitHub, Docker, or package management tools.

Intro to Reproducibility: for individuals who:

  • Have some familiarity with R or Python - have written some scripts.
  • Have not had formal training in computational methods.
  • Have limited or no familiarity with GitHub.

Advanced Reproducibility: for individuals who:

  • Have completed the intro course and/or
  • Have used GitHub somewhat.

1.2 Topics covered

This is a two-part series:

Discussed in the Introductory Reproducibility in Cancer Informatics course: organizing your project, using notebooks, making your project open source with GitHub, managing package versions, writing durable code, documenting analyses, and understanding the importance of code review.

Discussed in the sequel course, Advanced Reproducibility for Cancer Informatics: getting comfortable with GitHub concepts and workflow, utilizing version control, engaging in code review, using a Docker image, modifying a Docker image, and using automation (GitHub Actions).

1.3 Motivation

Cancer datasets are plentiful, complicated, and hold untold amounts of information regarding cancer biology. Cancer researchers are working to apply their expertise to the analysis of these vast amounts of data, but training opportunities that properly equip them for these efforts can be sparse. This includes training in reproducible data analysis methods.

Data analyses are generally not reproducible without direct contact with the original researchers and a substantial amount of time and effort (Beaulieu-Jones and Greene 2017). Reproducibility in cancer informatics (as in other fields) is still not monitored or incentivized, even though it is fundamental to the scientific method. Even without that incentive, many researchers strive for reproducibility in their own work but often lack the skills or training to achieve it effectively.

Equipping researchers with the skills to create reproducible data analyses increases the efficiency of everyone involved. Reproducible analyses are more likely to be understood, applied, and replicated by others. This helps expedite the scientific process by helping researchers avoid false-positive dead ends. Clear, open source, reproducible methods also save researchers time, because they don’t have to reinvent the proverbial wheel for methods that everyone in the field is already performing.

1.4 Curriculum

This course introduces the concepts of reproducibility and replicability in the context of cancer informatics. It uses hands-on exercises to demonstrate in practical terms how to increase the reproducibility of data analyses. The course also introduces tools relevant to reproducibility, including analysis notebooks, package managers, Git, and GitHub.

This course will demonstrate how to: understand why usability and documentation are vital, identify your user community, build documentation and tutorials to maximize the usability of developed tools, and obtain feedback from your users.

The course includes hands-on exercises that show learners how to apply reproducible code concepts to their own code. Individuals who take this course are encouraged to complete these activities as they follow along with the course material to help increase the reproducibility of their analyses.

Goal of this course:
Equip learners with reproducibility skills they can apply to their existing analysis scripts and projects. This course opts for an “ease into it” approach. We attempt to give learners doable, incremental steps to increase the reproducibility of their analyses.

What is not the goal
This course is meant to introduce learners to the reproducibility tools, but it does not necessarily represent the absolute end-all, be-all best practices for the use of these tools. In other words, this course gives a starting point with these tools, but not an ending point. The advanced version of this course is the next step toward incrementally “better practices”.

Reproducibility is on a continuum. The goal of the course is to move learners’ skills toward creating reproducible analyses. This graph shows a two-sided arrow with a gradient. On the very left is a ‘not repeatable analysis’ that was run once. To the right of that is an analysis that ‘re-runs sometimes’. To the right of this is an analysis that ‘re-runs reliably in most contexts’. And all the way to the right is a ‘perfectly reproducible analysis’ that ‘re-runs in every situation and gets the same result every time’. In red lettering we note that every analysis starts by being run once, but no analysis is ‘perfectly reproducible’.
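
As a small illustration of moving one step along this continuum, the sketch below contrasts an analysis step that uses randomness without a fixed seed against the same step with a fixed seed. It is only a minimal example under stated assumptions: the function `run_analysis`, the `measurements` values, and the seed `42` are hypothetical and are not taken from the course materials.

```python
import random
import statistics


def run_analysis(values, seed=None):
    """Estimate the mean from a random subsample of the data (hypothetical analysis step)."""
    # With seed=None the generator is seeded from the operating system,
    # so each run may draw a different subsample.
    rng = random.Random(seed)
    subsample = rng.sample(values, k=5)
    return statistics.mean(subsample)


# Hypothetical measurements, invented purely for this illustration.
measurements = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 2.5]

# Unseeded: two runs will often pick different subsamples and disagree.
unseeded_runs = [run_analysis(measurements) for _ in range(2)]

# Seeded: the same subsample, and therefore the same result, on every run.
seeded_runs = [run_analysis(measurements, seed=42) for _ in range(2)]
print(seeded_runs[0] == seeded_runs[1])  # prints True
```

Fixing the seed moves this step from ‘re-runs sometimes’ toward ‘re-runs reliably’, but only within the same software environment; a different Python version may still draw a different subsample, which is consistent with the point that no analysis is ‘perfectly reproducible’.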

1.5 How to use the course

This course is designed with busy professional learners in mind – who may have to pick up and put down the course when their schedule allows.

Each exercise gives you the option to continue with the example files as you’ve been editing them in each chapter, OR to download fresh chapter files that have been edited in accordance with the relevant part of the course. This way, if you decide to skip a chapter or find that the files you’ve been working on no longer make sense, you have a fresh starting point at each exercise.


Beaulieu-Jones, Brett K, and Casey S Greene. 2017. “Reproducibility of Computational Workflows Is Automated Using Continuous Analysis.” Nature Biotechnology 35 (4): 342–46. https://doi.org/10.1038/nbt.3780.