Chapter 4 Providing data
4.1 Learning Objectives
The first step of any analysis is getting all the data needed to run it. Data come in many formats and sizes, so we can't give specifics on how to share your data, but we can provide these guidelines:
4.1.1 Overview of data sharing
- The data to be shared do not contain PII (personally identifiable information) or PHI (protected health information).
- The data are accessible by a download script that runs automatically when the analysis is re-run.
- Every data file needed to run the analysis is available.
- The data are downloaded to files in an organized manner. For more about project organization, see this chapter from the Introduction to Reproducibility course.
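One way to act on the "every data file is available" guideline is to check for the expected files before the analysis starts. The following is a minimal sketch, not part of the original course material: the function name `check_data_files` and the folder and file names in the example call are hypothetical and would need to be adapted to your project.

```shell
#!/bin/bash
# Sketch: confirm that every data file the analysis needs is present.
# The function name, folder, and file names are hypothetical placeholders.

# Report each missing or empty file and return non-zero if any are missing.
# Usage: check_data_files <data_dir> <file> [<file> ...]
check_data_files() {
  local data_dir="$1"
  shift
  local missing=0
  local file
  for file in "$@"; do
    # -s is true only if the file exists and is not empty
    if [[ ! -s "${data_dir}/${file}" ]]; then
      echo "Missing or empty: ${data_dir}/${file}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (hypothetical names); uncomment after adapting:
# check_data_files data/raw expression.tsv metadata.tsv
```

Running a check like this right after the download step makes a missing file fail early with a clear message, rather than partway through the analysis.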
4.1.2 A very general example of a data download bash script
How you download your data will depend on where and how they are stored online. The most general form of a data download script might look like this:
#!/bin/bash
# This is a template script for downloading data using the wget command
# See docs here: https://www.gnu.org/software/wget/manual/wget.html
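# Create the folder to save to (-p: no error if it already exists)
mkdir -p <FOLDER_TO_SAVE_TO>
# To see wget options, run wget with -h (the help flag):
# wget -h
# Download the file at <URL> and save it to the given path
wget -O <FOLDER/FILE_TO_SAVE_TO> <URL>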
You can download this general template download file here (Shapiro et al. 2021).
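As an illustration of how the template might be filled in, here is a hedged sketch that downloads each URL passed on the command line into a single data folder and skips files that are already present. The function name `download_one`, the `DATA_DIR` default of `data/raw`, and the script name in the usage note are assumptions made for this example, not part of the original template.

```shell
#!/bin/bash
# Sketch: download each URL given on the command line into one data folder.
# download_one and the DATA_DIR default are hypothetical choices for this example.

DATA_DIR="${DATA_DIR:-data/raw}"

# Download one URL into a folder, keeping the file name from the end of the URL.
# A file that already exists and is not empty is left alone, so re-running is cheap.
download_one() {
  local url="$1"
  local dest_dir="$2"
  local dest_file
  dest_file="${dest_dir}/$(basename "$url")"

  mkdir -p "$dest_dir" || return 1
  if [[ -s "$dest_file" ]]; then
    echo "Already present, skipping: ${dest_file}"
    return 0
  fi
  # -O writes the download to the given file path
  wget -O "$dest_file" "$url"
}

# With no arguments, this loop does nothing.
for url in "$@"; do
  download_one "$url" "$DATA_DIR" || exit 1
done
```

Saved as, say, `download_data.sh`, it could be run as `bash download_data.sh <URL> <URL>`. Because the file name is taken from the end of the URL, URLs that end in a query string would need a different naming rule.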
4.1.3 Examples of data download scripts
- Downloading data from GEO with GEOquery
- Data download script for multiple files from the same place
- Data download script - refine.bio example
For more about data sharing techniques, see the Ethical Data Handling for Cancer Research course.
References
Shapiro, Joshua A., Candace L. Savonen, Allegra G. Hawkins, Chante J. Bethell, Deepashree Venkatesh Prasad, Casey S. Greene, and Jaclyn N. Taroni. 2021. Childhood Cancer Data Lab Training Modules (version 2021-june).