Chapter 4 Making your project open source with GitHub

4.1 Learning Objectives

This chapter will demonstrate how to:

  • Understand that git and GitHub are tools that help you conduct your analyses reproducibly and in an open source manner.
  • Create a GitHub account.
  • Set up a GitHub repository for your analyses.

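git is a version control system, and it is a great tool for creating reproducible analyses. What is version control? It is a way of recording the changes made to a set of files over time so that any earlier version can be recalled later. Ruby here is experiencing a lack of version control and could probably benefit from using git.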

Ruby is looking at her computer with a lot of folders with different variations on similar names. Ruby asks herself: Now was it “final_final_version_100%_up_to_date” or “final_version_edit5” that I was working from?

All of us at one point or another have created different versions of a file or document, but for analysis projects this can easily get out of hand if you don’t have a system in place. That’s where git comes in handy.

There are other version control systems as well, but git is the most popular, in part because it works with GitHub, an online hosting service for git-controlled files.

4.1.1 GitHub and git allow you to…

4.1.1.1 Maintain transparent analyses

Open and transparent analyses are a critical part of conducting open science. GitHub allows you to conduct your analyses in an open source manner. Open science also allows others to better understand your methods and potentially borrow them for their own research, saving everyone time!

4.1.1.2 Have backups of your code and analyses at every point

Life happens: sometimes you misplace a file or your computer malfunctions. If you ever lose data on your computer or need to retrieve something from an earlier version of your code, GitHub allows you to recover it.

Ruby’s computer shows a virus and has a temperature. Ruby says ‘Oh no! I lost data on my computer! Good thing all the work I have toiled on for years is on GitHub!’ The GitHub cat is in a cloud with a download sign with Ruby’s code.
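
As a rough illustration of retrieving an earlier version, here is a minimal sketch using the git command line. It assumes git is installed; the folder, the file name `analysis.txt`, its contents, and the name and email are all made up for the demo. It only shows the local side of the story; a copy on GitHub would additionally protect you if the computer itself fails.

```shell
set -e
repo=$(mktemp -d)                     # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Ruby"           # hypothetical identity used only here
git config user.email "ruby@example.com"

# First version of a (made-up) analysis file
echo "threshold = 0.05" > analysis.txt
git add analysis.txt
git commit -q -m "Set significance threshold to 0.05"

# An accidental edit that also gets committed
echo "threshold = 0.5" > analysis.txt
git commit -q -am "Tweak threshold"

# Look at the file as it was one commit ago, without changing anything
git show HEAD~1:analysis.txt

# Bring that earlier version back and record the fix as a new commit
git checkout HEAD~1 -- analysis.txt
git commit -q -m "Restore threshold of 0.05"
cat analysis.txt
```

Note that nothing is erased along the way: the mistaken version stays in the history, and the restoration is simply one more recorded change.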

4.1.1.3 Keep a documented history of your project

Over time in a project, a lot happens, especially when it comes to exploring and handling data. Sometimes the rationale behind decisions that were made around an analysis can get lost. GitHub keeps a record of communications and tracks the changes to your files so that you don’t have to revisit a question you already answered.

Ruby holds a magnifying glass and says 'Why did we write the code this way? I don’t remember… Good thing through git tracking I can look into this file’s history and remind myself how it became this.'
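
Looking into a file’s history might go something like the sketch below. It assumes git is installed; the file `cleaning_notes.txt`, the commit messages, and the identity are invented for illustration and are not part of the exercise files.

```shell
set -e
repo=$(mktemp -d)                     # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Ruby"           # hypothetical identity used only here
git config user.email "ruby@example.com"

# Two recorded changes to the same (made-up) file
echo "filter: keep all samples" > cleaning_notes.txt
git add cleaning_notes.txt
git commit -q -m "Start with all samples"

echo "filter: drop samples with missing age" > cleaning_notes.txt
git commit -q -am "Drop samples with missing age; they break the age model"

# One line per change to this file, newest first
git log --oneline -- cleaning_notes.txt

# Full detail for the latest change: who, when, why, and the exact lines edited
git show HEAD -- cleaning_notes.txt
```

The commit message is where the rationale lives, so a message that says why a change was made is what lets a future reader answer Ruby’s question.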

4.1.1.4 Collaborate with others

Analysis projects benefit greatly from good collaborations! But having multiple copies of code on multiple collaborators’ computers can be a nightmare to keep straight. GitHub allows people to work on the same set of code concurrently and still have a systematic way to integrate all of their edits.

Ruby and Avi are both working on the code. Because they are both using git version control, they are able to merge their changes to the code base. And now the main code base contains both of their changes!
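
A simplified sketch of how two people’s changes can end up merged into one code base is shown here. It assumes git is installed and, to stay self-contained, plays both collaborators on one computer; in practice each person would work on their own machine and exchange changes through GitHub. The branch names, file names, and identity are hypothetical.

```shell
set -e
repo=$(mktemp -d)                     # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Demo"           # hypothetical identity used only here
git config user.email "demo@example.com"

# The shared starting point
echo "step 1: load data" > analysis.txt
git add analysis.txt
git commit -q -m "Add loading step"
git branch -M main                    # make sure the main line is called 'main'

# Ruby's work: a cleaning step on her own branch
git checkout -q -b ruby-cleaning
echo "drop duplicate rows" > cleaning.txt
git add cleaning.txt
git commit -q -m "Add cleaning step"

# Avi's work: a plotting step on his own branch, also started from main
git checkout -q main
git checkout -q -b avi-plots
echo "draw a histogram" > plots.txt
git add plots.txt
git commit -q -m "Add plotting step"

# Bring both sets of changes into the main code base
git checkout -q main
git merge -q ruby-cleaning
git merge -q --no-edit avi-plots
ls
```

Because the two branches touch different files here, git can combine them without any help. When two people edit the same lines, git stops and asks a person to decide which version to keep.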

4.1.1.5 Experiment with your analysis

Data science projects often lead to side analyses that could be well worthwhile but might be scary to venture into if you don’t have your code under good version control. Git and GitHub allow you to pursue these side experiments without fear, since your main code is kept safe from your side venture.

Ruby says ‘I’m not sure if this side analysis I’m working on is a good idea or not, but I want to test it. Good thing I can make a separate branch and keep my original code safe from my experimenting.’ Her computer shows her main code and a branch off of it that says ‘test analysis’. After some time and work, she may decide to incorporate her test analysis into her main code.
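
Ruby’s side experiment could look roughly like this on the command line. This is a sketch that assumes git is installed; the branch name `test-analysis`, the file contents, and the identity are made up for the demo.

```shell
set -e
repo=$(mktemp -d)                     # throwaway folder for the demo
cd "$repo"
git init -q
git config user.name "Ruby"           # hypothetical identity used only here
git config user.email "ruby@example.com"

# The main code as it stands today
echo "model: linear regression" > analysis.txt
git add analysis.txt
git commit -q -m "Fit a linear model"
git branch -M main                    # make sure the main line is called 'main'

# Start the side experiment on its own branch
git checkout -q -b test-analysis
echo "model: random forest" > analysis.txt
git commit -q -am "Try a random forest instead"

# Switching back shows that main still holds the original code
git checkout -q main
cat analysis.txt

# If the experiment pays off, fold it into main;
# if not, skip this step and the branch can simply be left alone
git merge -q test-analysis
cat analysis.txt
```

The first `cat` prints the linear regression line and the second prints the random forest line, which is the whole point: main only changes at the moment Ruby decides to merge.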

4.2 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files

Click this link to download.

Now double-click your chapter zip file to unzip it. On Windows you may have to follow these instructions.

Get the R project example files

Click this link to download.

Now double-click your chapter zip file to unzip it. On Windows you may have to follow these instructions.

4.3 Exercise: Set up a project on GitHub

Go here for the video tutorial version of this exercise.

Now that we understand how useful GitHub is for creating reproducible analyses, it’s time to set ourselves up on GitHub.

Git and GitHub have a rich world of tools and terms that can get complex quickly. For this exercise, we will not worry about those terms and functionalities just yet. Instead, we will focus on getting code up on GitHub so we are ready to collaborate and conduct open analyses!

  • Go to GitHub’s main page and click Sign Up if you don’t have an account.
  • Follow these instructions to create a repository. As a general, but not absolute, rule, you will want to keep one GitHub repository for one analysis project.
    • Name the repository something that reminds you what it’s related to.
    • Choose Public.
    • Check the box that says Add a README.
  • Follow these instructions to add the example files you downloaded to your new repository.
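
The linked instructions are all you need for this exercise. For readers who are curious what the same step looks like from the command line, here is an optional sketch. It assumes git is installed. So that it can run without a GitHub account, it first builds a local stand-in for a GitHub repository that already contains a README; with a real repository you would skip that setup and use your repository’s URL in place of `standin.git`. The repository name, the file `example_script.py`, and the identity are hypothetical.

```shell
set -e
work=$(mktemp -d)                     # throwaway folder for the demo
cd "$work"

# Demo setup only: a local stand-in for a GitHub repository with a README
git init -q seed
cd seed
git config user.name "Ruby"           # hypothetical identity used only here
git config user.email "ruby@example.com"
echo "# my-analysis" > README.md
git add README.md
git commit -q -m "Initial commit"
git branch -M main
cd ..
git clone -q --bare seed standin.git

# What you would do yourself: copy the repository to your computer
git clone -q standin.git my-analysis
cd my-analysis
git config user.name "Ruby"
git config user.email "ruby@example.com"

# Put the example files in the folder (a placeholder file stands in for them)
echo "print('hello')" > example_script.py

# Record the new file and send it back to the shared repository
git add example_script.py
git commit -q -m "Add example project files"
git push -q origin main
```

After the push, the shared repository holds two recorded changes: the README and the added file.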

Congrats! You’ve started your very own project on GitHub! We encourage you to do the same with your own code and other projects!

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!