Chapter 4 Computing Resources
In this chapter we will describe the basics of data size and computing capacity. We will discuss the computing and storage requirements for many types of cancer-related data, as well as options for performing informatics work that might require more computing capacity than your personal computer has.
4.1 Data Sizes
Recall that the smallest unit of data is a bit, which is either a zero or a one. A group of 8 bits is called a byte, and most computers, phones, and software programs are designed to work with groups of bytes at a time. For example, a 32-bit machine can work with 4 bytes at a time and a 64-bit machine can work with 8 bytes at a time. But how big is a file that is 2 GB? When we sequence a genome, how large is that in terms of binary data? Can our local computer work with the size of data that we would like to work with?
First let’s take a look at how the size of binary data is typically described and what this actually means in terms of bits and bytes:
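As a rough sketch of how these units relate, the Python snippet below converts between bits, bytes, and larger units. It assumes decimal units (1 KB = 1000 bytes, 1 GB = 1000 MB), which matches the "1000 GB in a TB" convention used later in this chapter; some operating systems instead count in steps of 1024. The function names are made up for illustration.

```python
# Decimal (SI) units: each step up is a factor of 1000 (an assumption;
# some tools use binary steps of 1024 instead).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def bytes_to_bits(n_bytes):
    """A byte is a group of 8 bits."""
    return n_bytes * 8


def to_bytes(value, unit):
    """Convert a value in the given unit (for example 2, 'GB') to bytes."""
    return value * 1000 ** UNITS.index(unit)


def describe(n_bytes):
    """Return a human-readable size such as '2.0 GB'."""
    size = float(n_bytes)
    for unit in UNITS:
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


two_gb = to_bytes(2, "GB")
print(two_gb)                 # 2000000000 (bytes)
print(bytes_to_bits(two_gb))  # 16000000000 (bits)
print(describe(two_gb))       # 2.0 GB
```

So a 2 GB file holds two billion bytes, or sixteen billion individual zeros and ones.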
Now that we know how to describe binary data sizes, let’s next think about how much computing capacity typical computers have today.
4.2 Computing Capacity
We have discussed a bit about CPUs and how they can help us perform more than one task at a time, but how many tasks can the CPU of an average computer do simultaneously these days? How much memory and storage do they typically have? What size of files can a typical computer handle? This information is sometimes called the specs of a computer.
These values will probably change very soon, and different computers vary widely, but currently:
- Laptops can often perform 4-8 CPU tasks at once, and typically range from 4-16 GB in memory and 250 GB-1 TB of storage.
This means that typical laptops can multitask quite well. Some have 16 gigabytes of random access memory, which allows the CPU to work on relatively large tasks (a gigabyte is actually pretty large when you think about it), and possibly 1 TB of storage on the hard drive and/or SSD, meaning that you can store thousands of photos and files like PDFs and Word documents. It turns out that you can store around 30,000 average-size photos in 250 GB, so a 1 TB laptop can store quite a bit of data. Overall, typical laptops today are pretty powerful compared to previous computers. Note, however, that some programs require 16 or even 32 GB of memory to run.
Desktops offer similar, and sometimes somewhat better, performance and storage than laptops at a similar price. Since less work needs to be done to make a desktop small and portable, you can sometimes get better storage and performance for the same price as a laptop. Desktops also often have better graphics processing capacity and displays, which might make up for any price difference (Antonio Villas-Boas 2019). This might be important to consider if you are going to need to visually inspect many images. Another benefit is that you can sometimes find desktops with larger memory and storage options right off the shelf than typical laptops. It is also generally easier to add more memory to a desktop than to a laptop (Antonio Villas-Boas 2019). Of course, desktops certainly aren’t very portable!
Some phones can compete with laptops, performing 6 CPU tasks at once with 6 GB of memory and 250 GB of storage.
Check out this link to compare the prices of different Macs and this link to compare specs for PC computers from HP.
If you want to get really in-depth comparisons for PC or Windows machines, check out this link (“UserBenchmark: Core I7-11700k Build Comparisons” n.d.).
4.2.1 Checking your computer capacity - Mac
So what about your computer? How do you know how many cores it has or how much memory and storage it has?
If you have a Mac, you can click on the Apple symbol in the top left corner of your screen. Then click on the “About This Mac” button.
You might see something like this:
First we see that the operating system is called Mojave. Next we see that the processor (which we now know is the CPU) is a 2.6 gigahertz (GHz) Intel Core i7 chip. This means that the CPU runs through 2,600,000,000 clock cycles every second, where a clock cycle is the basic unit of time in which the processor carries out a step of work (“What Is a Clock Cycle? - Definition from Techopedia” n.d.). That’s a lot compared to computers in the 1980s, which had clock rates in the megahertz range (“Clock Rate” 2021)! If we look up more about this chip, we would learn that it has 4 cores and hyper-threading, which allows it to effectively perform 8 tasks at once (“What Is Hyper-threading HP® Tech Takes” n.d.). Next we see that there are 16 gigabytes of memory; this is how much RAM the computer has. The RAM is low power double data rate (LPDDR3) memory rated at 2133 megahertz (aka 2.133 GHz), which means that it can handle roughly 2,133,000,000 data transfers every second (Josh Covington 2017; Mukherjee 2019). If you are interested, you can check out more about what this means in this blog post (Scott Thornton 2021). However, the amount of RAM is generally more important for assessing performance (Josh Covington 2017; Mukherjee 2019).
If we click on the storage button at the top, we can learn how much storage is available on the computer. If you hover over a section, it tells you which files account for that section of the storage that is already being used.
4.2.2 Checking your computer capacity - Windows/PC
If you have a PC or Windows computer, the steps may vary depending on your operating system, but try the following:
- click the “Start” button - which looks like 4 squares together
- click the “Settings” button (gear-shaped)
- click “System”
- click “About”
See this link for more information.
Here we can see that this computer has an Intel(R) Core(TM) i7-4790K CPU @ 4.00 GHz 4.00 GHz chip and 16 Gigabytes of RAM. If we look up this chip we can see that it has 4 cores and 8 threads (due to hyper-threading) allowing for 8 tasks at a time.
To find out more information about your storage click the “Storage” button within the “System” tab.
Here we can see that this computer has 466 GB + 465 GB = 931 GB across the two drives. The C drive is typically for the operating system, and the D drive is typically where you would install application programs and save files. There are 1000 GB in a TB; thus, this computer has about the same storage as the Mac that we just looked at.
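If you are comfortable running a script, some of the same information can be read programmatically on either operating system. The sketch below uses only the Python standard library; `summarize_machine` is a made-up helper name. Note that `os.cpu_count()` reports logical CPUs, so a 4-core chip with hyper-threading would typically show up as 8. Total RAM is not exposed by the standard library in a portable way, so it is left out here (a third-party package such as `psutil` would be one option).

```python
import os
import shutil


def summarize_machine(path="."):
    """Report logical CPU count and storage for the drive that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return {
        # Logical CPUs (includes hyper-threads); may be None if undetermined.
        "logical_cpus": os.cpu_count(),
        # Decimal gigabytes: 1 GB = 1000**3 bytes.
        "disk_total_gb": usage.total / 1000**3,
        "disk_free_gb": usage.free / 1000**3,
    }


if __name__ == "__main__":
    for name, value in summarize_machine().items():
        print(f"{name}: {value}")
```

On a computer with more than one drive, you would call the function once per drive, passing a path on each drive.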
4.3 File Sizes
Now let’s think about the files that we might need for our research. How big are files typically for genomic, imaging, and clinical research?
Recall this table from earlier about digital data size units:
4.3.1 Genomic data file sizes
Genomic data files can be quite large and can require quite a bit of storage and processing power.
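To get a feel for the scale, here is a back-of-the-envelope estimate of how much space a single copy of a human genome sequence would take. It assumes a genome length of roughly 3.1 billion bases (an approximate figure, not taken from this chapter) and compares storing each base as one text character (one byte) with a compact encoding of two bits per base. Real sequencing files are usually far larger than this, because each position is read many times and quality information is stored alongside the bases.

```python
# Approximate length of one copy of the human genome, in bases (assumption).
GENOME_BASES = 3_100_000_000


def sequence_size_gb(n_bases, bits_per_base):
    """Size in decimal gigabytes of a sequence stored at the given density."""
    n_bytes = n_bases * bits_per_base / 8  # 8 bits per byte
    return n_bytes / 1000**3


# One byte (8 bits) per base, as in a plain text file of A, C, G, and T.
print(f"{sequence_size_gb(GENOME_BASES, 8):.2f} GB")
# Two bits per base is enough to tell four letters apart.
print(f"{sequence_size_gb(GENOME_BASES, 2):.2f} GB")
```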
Here is an image of sizes of some common file types:
4.3.2 Imaging Data File Sizes
Imaging data, although often smaller than genomic data, can start to add up quickly with more images and samples.
Here is a table of average file sizes for various medical imaging modalities from Liu et al. (2017):
Note that depending on the study requirements, several images may be needed for each sample. Thus data storage needs can add up quickly.
4.3.3 Clinical Data File Sizes
Really large clinical datasets can also produce sizable file sizes. For example the Healthcare Cost and Utilization Project (HCUP) National (Nationwide) Inpatient Sample (NIS) contains data on more than seven million hospital stays in the United States with regional information.
According to the NIS website it “enables analyses of rare conditions, uncommon treatments, and special populations” (“NIS Database Documentation” n.d.).
Looking at the file sizes for the NIS data for different states across years, you can see that there are files for some states, such as California, that are as large as 24,000 MB, or 24 GB (“NIS Database Documentation” n.d.). You can see how this could add up across years and states quite quickly.
4.3.4 Checking file sizes on Mac
If you own a Mac and want to check the size of a particular file, look at your file within a Finder window. You can open a new Finder window by clicking on the button that looks like a square with two colors and a face, typically in the bottom left corner of your dock (the strip of icons on your screen that helps you navigate to different application programs).
Once you open a Finder window, you can navigate to one of your files.
If you select the view setting that looks like 4 lines, you will get information about the size of each file.
You can also right click on a file and click the “Get Info” button. This will give you more specific information.
4.3.5 Checking file sizes on PC/Windows
In a similar manner to checking file sizes on a Mac, with a Windows or PC computer you can navigate to files by first opening the File Explorer application, which you can find by typing “File Explorer” in the search bar next to the “Start” button.
Then navigate to a file of interest. Its size will be shown in one of the columns to the right, and if you hover over the file name, you will get more specific information.
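File sizes can also be checked from a script, which is handy when you have many files to look at. The following is a minimal sketch using Python's standard library; the helper names and the temporary example file are made up for illustration, and you would substitute the path to your own file or folder.

```python
import tempfile
from pathlib import Path


def file_size_bytes(path):
    """Size of a single file in bytes."""
    return Path(path).stat().st_size


def folder_size_bytes(folder):
    """Total size in bytes of all files in a folder and its subfolders."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


# Demonstration with a throwaway file of exactly 1500 bytes.
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "example.txt"
    example.write_bytes(b"A" * 1500)
    print(file_size_bytes(example))  # 1500
    print(folder_size_bytes(tmp))    # 1500
```

The sizes are reported in bytes, so a result of 1500 corresponds to 1.5 KB in decimal units.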
4.4 Computing Options
4.4.1 Personal computers
These are computers that your lab might own, such as laptops or desktops, that are used by one individual or maybe just a few individuals in your lab.
If you are not performing intensive computational tasks, it is possible that you will only need personal computers for your lab. However, you may find that this changes, and you might need to connect your personal computers to shared computers for more computational power and/or storage.
4.4.3 Computer Cluster
In a computing cluster several of the same type of computer (often in close proximity and connected by a local area network with actual cables or an intranet rather than the internet) work together to perform pieces of the same single task simultaneously (“Computer Cluster” 2022). The idea of performing multiple computations simultaneously is called parallel computing (“Parallel Computing” 2021).
There are different designs or architectures for clusters. One common one is the Beowulf cluster in which a master computer (called front node or server node) breaks a task up into small pieces that the other computers (called client nodes or simply nodes) perform (“Beowulf Cluster” 2022).
For example, if a large file needs to be converted to a different format, pieces of the file will be converted simultaneously by the different nodes. Thus each node is performing the same task, just with different pieces of the file. The user has to write code in a special way to specify that they want parallel processing to be used and how. See here for an introduction to how this is done (“How to Supercharge Your Bash Workflows with GNU Parallel” 2019).
It is important to realize that the CPUs in each of the node computers connected within a cluster are all performing a similar task simultaneously.
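To make the idea concrete, here is a small single-machine sketch of the same pattern: a task is split into pieces and the same function is applied to every piece by a pool of workers. It is only an analogy. The workers here are threads on one computer standing in for the nodes of a cluster, the "conversion" is a toy uppercase step, and all function names are made up. On a real cluster, the work would be distributed by the cluster's own software, and CPU-heavy Python work would normally use separate processes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, n_chunks):
    """Split a list into at most n_chunks pieces of nearly equal size."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def convert_chunk(chunk):
    """The task every worker performs: a stand-in 'format conversion'."""
    return [line.upper() for line in chunk]


def convert_in_parallel(lines, n_workers=4):
    """Apply convert_chunk to every chunk using a pool of workers."""
    chunks = split_into_chunks(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        converted = list(pool.map(convert_chunk, chunks))
    # Stitch the converted pieces back together.
    return [line for chunk in converted for line in chunk]


lines = ["acgt", "ttga", "ccat", "gatc", "aacc"]
print(convert_in_parallel(lines, n_workers=2))
# ['ACGT', 'TTGA', 'CCAT', 'GATC', 'AACC']
```

The front node in a Beowulf cluster plays a role similar to `convert_in_parallel` here: it splits the work, hands out the pieces, and collects the results.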
See here for more information (De Doncker and Hussein, n.d.).
4.4.4 Computer Grid
In a computing grid, computers that are often of different types and in different locations work towards an overall common goal by performing different tasks (“What Is Grid Computing? How It Works with Examples” n.d.).
Again, just like computer clusters, there are many types of architectures that can be rather simple to very complex. For example you can think of different universities collaborating to perform different computations for the same project. One university might perform computations using gene expression data about a particular population, while another performs computations using data from another population. Importantly each of these universities might use clusters to perform their specific task.
Both grids and clusters use a special type of software called middleware to coordinate the various computers involved. Users need to write their scripts in a way that can be performed by multiple computers simultaneously. Users also need to be conscious of how to schedule their tasks and to follow the rules and etiquette of the specific cluster or grid that they are sharing (more on that soon!).
See here and here for more information about the difference between clusters and grids (Lithmee 2018; “Difference Between Grid Computing and Cluster Computing” 2019).
4.4.5 “Cloud” computing
More recently, the “Cloud” has become a common computing option. The term “cloud” has become a widely used buzzword (Cha 2015) that actually has a few slightly different definitions that have changed over time, making it a bit tricky to keep track of. However, the “cloud” is typically meant to describe large computing resources that involve the connection of multiple servers in multiple locations to one another (“Cloud Computing” 2022) using the internet. See here for a deeper description of what the term cloud means today and how it compares to other more traditional shared computing options (“What’s the Difference Between Cloud and Virtualization?” n.d.).
Many of us use cloud storage regularly for Google Docs and backing up photos using iPhoto and Google. Cloud computing for research works in a similar way to these systems, in that you can perform computations or store data using an available server that is part of a larger network of servers. This allows for even more computational dependability than a simpler cluster or grid. Even if one or more servers are down, you can often still use the other servers for the computations that you might need.
Furthermore, this also allows for more opportunity to scale your work to a larger extent, as there is generally more computing capacity possible with most cloud resources (“Cloud Computing Vs. Traditional IT Infrastructure Leading Edge” n.d.).
Companies like Amazon, Google, Microsoft Azure, and others provide cloud computing resources. Somewhere these companies have clusters of computers that paying customers use through the internet. In addition to these commercial options, there are newer national government funded resource options like Jetstream (described in the next section). We will compare computing options in another chapter coming up.
4.5 Conclusion
We hope that this chapter has given you some more perspective on how large medical research data files can be, as well as given you more familiarity with how well your computer might be able to accommodate the files that you might work with. We also hope that this chapter has provided you with some more awareness about computing options that might be available to you, should you need more capacity than your current computer.
In conclusion, here are some of the major take-home messages:
- A bit is the smallest binary digital data unit. It is a single 0 or 1.
- A byte is a group of 8 bits. File sizes are typically described using units based on bytes.
- A typical fancy laptop today might allow for up to 1 TB of storage, however this can quickly get used up if you are working with large data files.
- Even if you have enough storage for a large file, you might not have enough RAM to actually work with a large data file. Your computer might be too slow to handle that type of work. In which case, you might want to consider using shared computing resources.
- A server (when describing hardware) is a single computer (typically a supercomputer if just one computer) or group of computers that others can share to help them perform more intensive computational tasks or store large amounts of data. People often connect to these over the internet, but servers can also be reached directly through wired connections in a local network (like in a department to different offices).
- The computers in a server are optimized for assisting users with computations or storing data.
- A supercomputer is a computer that has much more storage, memory, and computing capacity than a typical personal computer. Supercomputers are generally much more expensive than using a group of more typical computers that together would have the same collective computing and storage capacity.
- There are two general types of servers: clusters and grids. Cluster approaches work by having several computers working on pieces of the same task simultaneously in a method called parallel computing. Grid approaches work by having different types of computers working on different tasks.
- Cloud computing is essentially the use of many servers accessed through the internet. This is often more reliable because there are many servers to use, even if other users are performing large tasks or if a server goes down. We will talk more about the pros and cons of this option in the coming chapters.
- If your institute doesn’t provide you access to a shared computing resource and you don’t want to use a commercial cloud option, you could consider options like XSEDE and/or Jetstream2, which are national resources that you can request access to.