Chapter 13 Why use AnVIL?

The NHGRI AnVIL (Genomic Data Science Analysis, Visualization, and Informatics Lab-space) is a project powered by Terra for biomedical researchers to access data, run analysis tools, and collaborate. Both biology researchers and educators can benefit from using AnVIL (anvil.terra.bio) for their research and in the classroom.

This guide acts as a resource answering the question “why use AnVIL?”. It will discuss the research, classroom, and general benefits of using AnVIL and point to related resources throughout.

The research, classroom, and general benefits of using the AnVIL are outlined below. For research, the AnVIL platform is easy to access and offers a variety of analysis solutions and tools. You can bring your own data or access cloud-hosted open or controlled access data, and your data and analysis live in the same place, so there is no need to move data between platforms. The analysis resources are scalable, with researchers renting only what they need. Work on the AnVIL platform can also be collaborative: workspaces have role-based permissions for group members and are shareable, which enhances reproducibility. For the classroom, the AnVIL offers a unified computing system that gives learners an authentic experience with cloud computing, which is increasingly common in today’s research environment. Learners can learn and practice with a variety of tools using relevant datasets and prepared exercises. The general benefits of using the AnVIL include the ability to control cloud-computing costs, the ability to work safely with protected data, platform maintenance that is handled for users, available training and support, and a collaborative community with a development mindset.

13.1 Benefits of using AnVIL for research

Research benefits of the AnVIL include easy platform access and a variety of available analysis solutions and tools. You can bring your own data or access cloud-hosted open or controlled access data, and your data and analysis live in the same place, so there is no need to move data between platforms. The analysis resources are scalable, with researchers renting only what they need. Work on the AnVIL platform can also be collaborative: workspaces have role-based permissions for group members and are shareable, which enhances reproducibility.

13.1.1 Ease of platform access

The primary means of accessing the AnVIL platform (anvil.terra.bio) is through a web browser - users do not need to download data or install software.

Accessing AnVIL beyond the web browser

The platform (anvil.terra.bio) provides a variety of graphical user interfaces (GUIs) as well as optional application programming interface (API) libraries and command line interfaces to interact with data, analysis solutions, and workflows. Bioconductor packages (AnVIL, AnVILGCP, AnVILWorkflow) offer additional methods for users to programmatically interact with and access AnVIL resources from within AnVIL or from stand-alone computing environments such as a user’s laptop.

13.1.2 Variety of analysis solutions

A variety of analysis frameworks and tools are available on the AnVIL, and they are interoperable, so data and results in your workspace can be passed from tool to tool. Analysis solutions are available for interactive work or non-interactive batch processing (e.g., with WDL workflows).

AnVIL supports an assortment of frameworks and tools. Researchers can use their favorite tool to work with data interactively or through non-interactive batch processing. Due to this variety and interoperability with other platforms, researchers can stay within a single environment for their analysis without having to shift between platforms.

Interactive sessions

Interactive sessions are available with Jupyter, RStudio/Bioconductor, and Galaxy.

Use of Galaxy on AnVIL enables even more customizability with the ability to install specific tools/versions with Toolshed.

Non-interactive batch processing

Workflows can be user-supplied/written or imported with Dockstore and used to steer non-interactive pipelines and batch processing of data.

13.1.3 Data: yours or cloud-hosted open & controlled access

With AnVIL you can upload your own data (signified by the plus sign) or browse cloud-hosted open and controlled access datasets to identify relevant datasets, importing or requesting access to those for your workspaces. Several consortia have data available on AnVIL with data deposits growing.

AnVIL securely stores diverse, open and controlled access, cloud-hosted datasets with a browsable summary catalog so researchers can identify relevant datasets they may need to request access to.

Amount of data on AnVIL

As discussed in the flagship AnVIL paper, the AnVIL hosts data from >280,000 human genomes from >240 different cohorts spanning multiple consortia and major NHGRI projects. The AnVIL offers a browsable catalog of summary information about all of the datasets, so that even a user who isn’t yet authorized to access the data itself can judge whether the data will be helpful for their research before applying for access. AnVIL is working to facilitate data harmonization across studies, ensuring consistency and interoperability, which is critical for large-scale analyses. These efforts will increase the value of the AnVIL data and maximize its utility to the research community.


AnVIL is a FedRAMP Moderate compliant platform

As a FedRAMP Moderate compliant platform, AnVIL maintains its FedRAMP authorization to ensure that, as a cloud service provider, it meets the minimum security requirements for processing, storing, and transmitting Protected Health Information (PHI) and Personally Identifiable Information (PII) where loss of confidentiality, integrity, or availability would result in serious adverse effects or non-life-threatening harm. AnVIL manages and guarantees all steps necessary to maintain compliance, such as robust logging of access to data, periodic audits by third-party analysts, and monitoring for abnormal use patterns.

13.1.4 Data & analysis in same place

The AnVIL portal is an entry point for all parts of AnVIL, allowing visitors to launch Terra to work on AnVIL, as well as access important announcements or documentation.

AnVIL is a unified computing environment for data storage, management, and analysis. The AnVIL portal serves as an entry point to access all parts of the AnVIL system as well as training materials and announcements.

13.1.5 Scalability

AnVIL is conducive to analysis at massive scale and for data exploration and training. Researchers get access to dedicated compute resources, avoiding queue time and lack of access at some institutions. Researchers can also launch light environments or run test analyses without incurring much cost or spending a lot of time to configure.

13.1.6 Rent needed resources

AnVIL allows you to rent the computational resources that you need for occasional high demand needs rather than obtaining and maintaining the same resources yourself or paying a subscription for an allocation/constant access (with little consistent use over time). AnVIL can provide different hardware and software setups, rather than preparing the environment yourself (or relying on an institutional core to do it and waiting in the queue).

Additional considerations

Other considerations that make renting computational resources from AnVIL appealing compared to obtaining and maintaining your own resources or upgrading an institutional allocation (HPC) include:

  • AnVIL is compliant for work with protected data. Some institutional HPCs may not be.
  • Once your group is initially set up on AnVIL, adding users (with specified permissions) is easier than trying to add a user to an allocation through an email chain.
  • Because AnVIL maintains docker images, the exact version of a tool is documented and available.
  • AnVIL scales well for large numbers of samples and won’t require long waits in queues to access limited, specialized resources; AnVIL also works well for small analyses where you may not want to connect to the HPC and set up a complicated environment there.

13.1.7 Role-based permissions

A workspace is the fundamental unit of activity in AnVIL and has role-based access. The workspace shows the user’s access level (e.g., whether they are an owner, reader, or writer, and whether they can run compute or share). This helps to enable collaboration. The access level of the user is highlighted in this example workspace.

Group management can be used to control who can access specific data, analysis workspaces, and your billing resources. Workspaces provide a collaborative environment with role-based permissions. These permissions include reading, writing, or owning, with additional permissions for running compute and sharing. AnVIL’s role-based group management permission structure is especially instrumental when working with sensitive data or large amounts of data.
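The permission structure described above can be pictured with a small model. The following is a minimal sketch only, not the Terra or AnVIL API: the `AccessLevel` and `Member` names, the helper functions, and the exact rules for who may run compute or share are assumptions made for illustration, based on the roles (reader, writer, owner) and extra permissions (run compute, share) named in the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class AccessLevel(IntEnum):
    """Workspace roles named in the text, ordered from least to most privileged."""
    READER = 1
    WRITER = 2
    OWNER = 3


@dataclass(frozen=True)
class Member:
    # Hypothetical record for illustration; not an AnVIL or Terra data structure.
    email: str
    level: AccessLevel
    can_compute: bool = False  # extra permission: may run compute
    can_share: bool = False    # extra permission: may share the workspace


def may_edit(member: Member) -> bool:
    """Writers and owners can modify workspace contents; readers cannot."""
    return member.level >= AccessLevel.WRITER


def may_run_analysis(member: Member) -> bool:
    """Assumed rule: owners can always run compute; writers need the extra flag."""
    if member.level == AccessLevel.OWNER:
        return True
    return member.level == AccessLevel.WRITER and member.can_compute


def may_share(member: Member) -> bool:
    """Assumed rule: owners can always share; others need the extra flag."""
    return member.level == AccessLevel.OWNER or member.can_share


# Example group: a student who only reads, an analyst who runs jobs, and an owner.
student = Member("student@example.org", AccessLevel.READER)
analyst = Member("analyst@example.org", AccessLevel.WRITER, can_compute=True)
lead = Member("lead@example.org", AccessLevel.OWNER)

for person in (student, analyst, lead):
    print(person.email, may_edit(person), may_run_analysis(person), may_share(person))
```

Keeping the role and the extra permissions separate mirrors the distinction the text draws between access levels and the additional abilities to run compute and share. The actual rules are set by the platform and should be checked in its documentation.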

13.1.8 Shareable workspaces

A workspace can serve as a shareable record of analyses. The ability to access the permalink to the example workspace is highlighted in this image.

Workspaces can contain data, metadata, and analysis tools, as well as documentation and history of workflow runs, additionally displaying important information such as when the workspace was created and last modified. AnVIL workspaces on the web can serve as shareable, reproducible records of analyses. Research conducted on the AnVIL platform has contributed to over 115 scientific publications citing the AnVIL paper, demonstrating its role in advancing genomic and biomedical research.

Examples of AnVIL workspaces shared in publications

AnVIL workspaces have previously been shared in publications to demonstrate reproducible science.

13.1.9 Repository compliant with DMS Policy

The AnVIL serves as a cloud data repository compliant with the Data Management and Sharing (DMS) Policy. Data access controls can be specified to limit data access and use.

By submitting their data to AnVIL, researchers can not only meet the requirements of DMS Policy but also contribute to the expanding network of NIH-funded data housed in the AnVIL, furthering scientific discovery.

What is DMS Policy?

The National Institutes of Health (NIH) Data Management and Sharing (DMS) policy requires that all NIH-supported research which generates scientific data (barring ethical, legal, or technical factors limiting data sharing) must create a plan and budget for data management and sharing.

While funding announcements for some research may specify which repository should be used to comply with DMS policy, in most cases the NIH does not specify where data should be stored. The NIH does provide an interactive table listing NIH-supported data repositories that can be used and suggests that researchers use the repository most appropriate for the data generated from their research.

How is data access managed and requests granted?

AnVIL utilizes the Data Use and Oversight System (DUOS) to efficiently expedite data access and management while maintaining security. Researchers can explore datasets hosted in the AnVIL cloud and request access using DUOS. Data use limitations can be set if necessary with Data Access Committees reviewing access requests.

13.2 Benefits of using AnVIL in the classroom

For the classroom, the AnVIL offers a unified computing system that gives learners an authentic experience with cloud computing, which is increasingly common in today’s research environment. Learners can learn and practice with a variety of tools using relevant datasets and prepared exercises.

AnVIL provides all the advantages of a cloud computing environment, such as version control and a unified computing system that does not require providing physical computers with certain specifications. Additionally, AnVIL gives students authentic experience working in the cloud, which is becoming common in today’s research environment. Students can also gain experience with a variety of tools (e.g., Galaxy, RStudio, Jupyter notebooks, WDL workflows) all in one place while working with relevant datasets and prepared exercises.

Instructor Guide Available

See more in our instructor guide on why AnVIL is a good choice for your classroom.

AnVIL provides several prepared exercises that supply relevant biology background and datasets and use a variety of tools to help train students within a cloud computing environment. Four examples are the Single Cell with Bioconductor AnVIL Demo, the analysis of a SARS-CoV-2 genome with Galaxy on AnVIL (part of the Genomic Data Science Community Network or GDSCN), the GDSCN BioDIGS R Data Package, and an RNA-Seq mini CURE.

13.3 Overall benefits of AnVIL

The general benefits of using the AnVIL include the ability to control cloud-computing costs, the ability to work safely with protected data, platform maintenance that is handled for users, available training and support, and a collaborative community with a development mindset.

13.3.1 Ability to control costs

Cloud computing is not free and estimating costs may seem daunting to those considering use of the AnVIL. However, Terra provides thorough, transparent documentation explaining data storage and cloud computing costs and has been working to improve transparency and management of costs for AnVIL users through cost reporting, cost controls and estimates, and cost optimizations. Additionally, in order to debug or benchmark your work, analyses or workflows can be tested with smaller scale test datasets or light environments without incurring much cost or spending a lot of time to configure environments.
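As a rough illustration of the cost reasoning above, the sketch below compares a small test run with a full-scale run. It is a simplified model under stated assumptions, not Terra's cost estimator: the `estimate_cost` function is hypothetical, it counts only compute hours and stored gigabytes, and the rates are made-up placeholders rather than actual cloud prices.

```python
def estimate_cost(compute_hours, hourly_rate, storage_gb,
                  storage_rate_per_gb_month, months=1.0):
    """Return a rough cost: compute time plus storage held for `months`.

    Simplified on purpose: real bills can include other charges
    that this sketch ignores.
    """
    values = (compute_hours, hourly_rate, storage_gb,
              storage_rate_per_gb_month, months)
    if any(v < 0 for v in values):
        raise ValueError("all inputs must be non-negative")
    compute_cost = compute_hours * hourly_rate
    storage_cost = storage_gb * storage_rate_per_gb_month * months
    return compute_cost + storage_cost


# Hypothetical rates for illustration only; look up current prices before budgeting.
HOURLY_RATE = 0.50    # assumed price per compute hour
STORAGE_RATE = 0.02   # assumed price per GB per month

# A light test run on a small subset versus the full analysis.
test_run = estimate_cost(2, HOURLY_RATE, 10, STORAGE_RATE)
full_run = estimate_cost(200, HOURLY_RATE, 1000, STORAGE_RATE)
print(f"test run: ${test_run:.2f}, full run: ${full_run:.2f}")
```

Under these placeholder numbers the test run costs a small fraction of the full run, which is the motivation for debugging and benchmarking on a small dataset before scaling up.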

13.3.2 Work with protected data safely

Due to AnVIL maintaining compliance with FedRAMP policies, clinical data containing PHI and PII can be safely and securely stored and analyzed on AnVIL. This includes the ability to export data from clinical data collection and management tools like REDCap and import it into AnVIL Terra Tables.

13.3.3 Maintenance is handled

Since AnVIL handles the support and maintenance of the platform (including the hardware and software), you can focus on performing your work on AnVIL rather than setting up and maintaining the platform, freeing up effort for your science. This is immensely valuable for researchers who do not have deep institutional IT and system administrator support for research infrastructure.

13.3.4 Training and support is available

To equip researchers and students to work on the AnVIL, the AnVIL team provides training and support.

Training and support are available for users in multiple formats. Users can browse the AnVIL collection to see what is available. Examples include the AnVIL Getting Started book (available online), a moderated support forum, and live AnVIL Demos that are also recorded and posted to YouTube.

13.3.5 Collaborative community

The AnVIL has a collaborative community with a development mindset, soliciting and listening to feedback and supporting and maintaining forums and training for discussion and help.

AnVIL has begun hosting community conferences to collaboratively innovate during CoFests! and to discuss research performed with the platform. The community can work directly with the AnVIL team to understand current development, feature requests, and a roadmap or future directions for the platform.

Additionally, AnVIL values and routinely solicits user feedback to improve the user experience and provide the most beneficial features and enhancements for biomedical research. Feedback is gathered:

13.4 Conclusion

All of this together describes how the AnVIL provides secure, cost-effective genomic analysis at scale, and is a useful cloud-based platform for training and research.

In addition to the general benefits of working on the cloud (such as learners and analysts being able to work with identical environments with pre-installed bioinformatic tools), AnVIL provides a strong community with support and training material, bridging the gap between training and research on human genomic datasets using a secure platform for controlled-access data storage, management, and analysis. Although bioinformatic tools are pre-installed, users can install additional software and tools as needed and can pay to scale resources for quicker runtimes. Additionally, AnVIL has collaboration tools for workgroups.