Chapter 9 Data

Workspaces store data in two different locations: the Workspace Bucket and the Persistent Disk.

The Workspace Bucket is a special Google Cloud Storage bucket that is governed by the built-in AnVIL security policies. This durable and scalable storage location is suitable for both raw data and analysis outputs that need to be preserved and/or shared.

In contrast, Persistent Disks provide a working directory for Cloud Environments that run Jupyter, RStudio, and Galaxy. Input data can be localized to a Persistent Disk for analysis, and output data can then be transferred to the Workspace Bucket for more reliable long-term storage.

Image shows a schematic of the data storage locations in an AnVIL Workspace. The Google Bucket is highlighted with a number "one" and the Persistent Disk is highlighted with a number "two". The `gsutil` command connects the two storage locations and allows users to copy data back and forth. The Persistent Disk is used by RStudio, Jupyter, and Galaxy. Data can also be copied to the Persistent Disk from another Workspace or SRA dataset.
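
As a rough illustration of copying data between the two locations, the sketch below builds `gsutil cp` commands from Python. It is a minimal example under several assumptions: the helper names `gsutil_cp_command` and `copy` are made up for this illustration, the file and folder names are placeholders, and the bucket URI is read from a `WORKSPACE_BUCKET` environment variable that is assumed to be set in the Cloud Environment (a placeholder bucket name is used otherwise). By default nothing is copied; the commands are only printed.

```python
import os
import subprocess


def gsutil_cp_command(source, destination, recursive=False):
    """Build the argument list for a `gsutil cp` call."""
    command = ["gsutil", "cp"]
    if recursive:
        # -r copies a directory and everything under it
        command.append("-r")
    command.extend([source, destination])
    return command


def copy(source, destination, recursive=False, dry_run=True):
    """Return the command; run it only when dry_run is False."""
    command = gsutil_cp_command(source, destination, recursive)
    if not dry_run:
        # check=True raises an error if gsutil exits with a failure code
        subprocess.run(command, check=True)
    return command


# Assumption: the bucket URI is available as WORKSPACE_BUCKET.
# The fallback value is a placeholder, not a real bucket.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")

# Localize an input file from the Workspace Bucket to the Persistent Disk
# (hypothetical paths; ./data/ is assumed to exist already).
print(" ".join(copy(f"{bucket}/raw/sample1.fastq.gz", "./data/")))

# Transfer an output folder from the Persistent Disk back to the bucket.
print(" ".join(copy("./results", f"{bucket}/", recursive=True)))
```

Calling `copy(..., dry_run=False)` would actually run the command, which requires `gsutil` to be installed and authorized for the bucket in question.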

Data Tables provide a way to organize data and metadata, including URI links to storage buckets. These tables are a convenient way to organize inputs for analyses and to track workflow outputs. More details can be found in the Terra documentation.

Image shows a schematic of the data storage locations in an AnVIL Workspace. The Data Table is highlighted with a number "three".
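
To make the idea of a Data Table more concrete, the sketch below renders a small table as tab-separated text, with one row per sample and a column holding a bucket URI. It is only an illustration: the helper name `data_table_tsv`, the column names, and the bucket path are hypothetical, and the `entity:sample_id` header follows the convention Terra uses for table load files as far as we understand it. The Terra documentation remains the authority on the exact format.

```python
import csv
import io


def data_table_tsv(entity_type, rows):
    """Render rows as tab-separated text whose first column is the entity ID."""
    # Assumed header convention for the ID column: entity:<type>_id
    id_column = f"entity:{entity_type}_id"
    # Every key except "id" becomes an ordinary column, in insertion order
    columns = [name for name in rows[0] if name != "id"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([id_column] + columns)
    for row in rows:
        writer.writerow([row["id"]] + [row[name] for name in columns])
    return buffer.getvalue()


# Hypothetical sample metadata with a URI pointing into a placeholder bucket
samples = [
    {
        "id": "sample1",
        "fastq": "gs://example-workspace-bucket/raw/sample1.fastq.gz",
        "tissue": "liver",
    },
]

print(data_table_tsv("sample", samples))
```

Keeping the URI in a column, rather than the file itself, is what lets the table act as an index: the data stays in the bucket while analyses look up where it lives.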