Basics of Data import for biological data
How to get started in visualising your own data in R?
There are many different types of data you might collect and visualise with ggplot2 that are not covered anywhere else. This page gives some tips and tricks on how to get started. Normally, when I get started I have a plot in mind that I would like to make, but the biggest hurdle in getting there is often how to import the data. For most lab-based experiments, data is collected in ways that are not immediately compatible with import into R. For example, plate reader data often comes as a plate layout in Excel, where each well corresponds to a specific sample and the filename contains information on which experiment it is, while flow cytometry data has 1 file per sample, and maybe the folder name contains this information. However, for most datasets, the following 5 rules of thumb might help!
1. Learn from packages + google everything
[ probably someone already tried to do the same ]
My first tip is to always google everything you might want to try. R has so many statistical packages that for every type of data I've ever come across, some amazing person somewhere has already written a whole manual for you to use.
2. Use a design table to keep an overview of which file is coming from where and facilitate data import
If you're already familiar with examining other people's analyses or have ever done some analysis on large datasets, you've probably seen design tables before. A design table specifies for each file in your analysis where it is and what its role is in the analysis. Always making a design table also helps you to figure out later which samples you used in a specific graph, and where you stored them.
I normally start by collecting my files in a specific folder, and calling:
design_table <- data.frame(fileNames = list.files("path/to/files"), fileDirectory = "path/to/files")
Using regular expressions
Most of the time, you can tell from the filename which sample is which. Yes.... I know you can! But you can tell R as well, for example using grepl() or str_extract() (the latter comes from the stringr package). These functions use regular expressions by default. These are a kind of "broad" search term with which you can search for things like:
any word starting with an X.
*anything* ending with ".bam"
4 digits preceding "myName"
.. the sky is the limit here, these are just some random examples.
With grepl() you can check which strings contain such a pattern, and with str_extract() you can pull the matching text out of the string.
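As a rough sketch, the three example searches above could be written like this with grepl(). The file names are made up for illustration, and the patterns are just one possible way to express each search:

```r
# Hypothetical file names, made up for this example
files <- c("Xenopus_sample.txt", "reads_01.bam", "2022myName.csv", "notes.txt")

# any word starting with an X: word boundary, X, then any word characters
starts_with_x <- grepl("\\bX\\w*", files)

# anything ending with ".bam": escape the dot and anchor at the end ($)
ends_with_bam <- grepl("\\.bam$", files)

# 4 digits directly preceding "myName"
four_digits <- grepl("\\d{4}myName", files)

# Keep only the files that match a pattern
bam_files <- files[ends_with_bam]
print(bam_files)
```

Each grepl() call returns one TRUE or FALSE per file name, so the result can be used directly to subset the vector of files.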
For example, say you have a list of files where each file corresponds to a condition treated with 0nM, 10nM, 100nM or 1000nM of a compound. This is probably not the only information stored in the filename. Maybe it's something like "20221007_Dorien_PlateReader_Exp1sec_luci_0nM_exp3.xlsx". Using regex, you can tell str_extract() to extract one or more (+) digits (\\d, numerical characters) directly in front of the text "nM" with the pattern "\\d+nM".
For example:
> str_extract("20221007_Dorien_PlateReader_Exp1sec_luci_0nM_exp3.xlsx", "\\d+nM")
[1] "0nM"
> str_extract("20221007_Dorien_PlateReader_Exp1sec_luci_1000nM_exp3.xlsx", "\\d+nM")
[1] "1000nM"
For anything regex I always end up googling how exactly to specify my search conditions, and this guide from the R repository is very helpful. Regex syntax differs slightly between programming languages, so if you do end up googling it, stick to the R regex syntax.
However, if this is not possible, you can make a table in Excel specifying the filenames manually. Just make sure, if you do that, to always check very carefully that you did it right. (You should actually do that with regex as well, so always check what you put in and what comes out!)
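Putting the design table and the regex together could look roughly like the sketch below. It uses base R (regexpr() and regmatches()) instead of str_extract() so that it runs without extra packages. The 10nM and 100nM file names and the column names concentration_label and concentration_nM are assumptions made up for the example:

```r
# File names following the pattern of the example above
# (the 10nM and 100nM names are invented by analogy)
file_names <- c(
  "20221007_Dorien_PlateReader_Exp1sec_luci_0nM_exp3.xlsx",
  "20221007_Dorien_PlateReader_Exp1sec_luci_10nM_exp3.xlsx",
  "20221007_Dorien_PlateReader_Exp1sec_luci_100nM_exp3.xlsx",
  "20221007_Dorien_PlateReader_Exp1sec_luci_1000nM_exp3.xlsx"
)

design_table <- data.frame(
  fileNames = file_names,
  fileDirectory = "path/to/files",
  stringsAsFactors = FALSE
)

# Base R stand-in for str_extract(): NA where the pattern is not found
match_pos <- regexpr("\\d+nM", file_names)
concentration_label <- rep(NA_character_, length(file_names))
concentration_label[match_pos > 0] <- regmatches(file_names, match_pos)

design_table$concentration_label <- concentration_label
design_table$concentration_nM <- as.numeric(sub("nM", "", concentration_label))

# Check what went in and what came out
print(design_table[, c("fileNames", "concentration_label", "concentration_nM")])
stopifnot(!anyNA(design_table$concentration_nM))
```

The final stopifnot() is a cheap way to catch file names where the pattern did not match, which would otherwise silently end up as NA.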
Do not use space in your filenames or anywhere in your file
This also makes it much easier to import data, because whitespace is by default a splitting character for some import functions (tabs and spaces both). I also avoid spaces at all cost in column names for this reason. Another good reason is that when a column name in R contains whitespace, calling dataframe$column name does not work, and you have to write dataframe$`column name`, which I just find annoying.
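A small sketch of what this looks like in practice, and one way to clean up column names after import. The data frame, its column name and the cell line names are made up for illustration:

```r
# A data frame with a space in a column name
# (check.names = FALSE keeps the name exactly as typed)
df <- data.frame(`cell type` = c("HeLa", "HEK"), check.names = FALSE)

# The column can only be reached with backticks
print(df$`cell type`)

# Replace any run of whitespace in the column names with an underscore
names(df) <- gsub("\\s+", "_", names(df))

# Now the usual $ syntax works
print(df$cell_type)
```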
3. Collect your data in long format, if possible
So, now we have the files you want to import (if not, please check out tip #4) and you want to do something with them.
A lot of the tables we generate have columns and rows, and if there are more than 2 variables we usually intuitively make multiple tables side-by-side? above-and-below? side-by-side and above-and-below..? However, for anything with ggplot (and anything with >2 variable types), the long format is much easier to work with.
Long means you have 1 observation per row, and all information relating to what this observation is, is in the extra columns (here Cell type and Concentration). You can imagine with more than 2 variables this can be extended much more easily and always in the same way, namely, by adding an extra column! With a wide table as shown on the left, this is much harder. This also makes merging data with each other much easier, because all the identifying information is in 1 row per observation.
So, if possible, collect your data in long format. If this is not possible, use the pivot functions from tidyr (pivot_longer() and pivot_wider()).
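As an illustration, here is a minimal sketch of turning a small wide table into long format. The table, its values and its column names are invented. To keep the example free of extra packages it uses base R reshape(), with the equivalent tidyr call given in a comment:

```r
# A made-up wide table: one row per cell type, one column per concentration
wide <- data.frame(
  cell_type = c("A", "B"),
  conc_0nM  = c(1.0, 1.2),
  conc_10nM = c(0.8, 0.9),
  stringsAsFactors = FALSE
)

# Long format: one observation per row
long <- reshape(
  wide,
  direction = "long",
  varying   = c("conc_0nM", "conc_10nM"),  # columns that hold the measurements
  v.names   = "signal",                    # name of the new value column
  timevar   = "concentration",             # name of the new identifying column
  times     = c("0nM", "10nM"),            # labels for the identifying column
  idvar     = "cell_type"
)
rownames(long) <- NULL
print(long)

# With tidyr installed, the same result (up to row order) would come from:
# tidyr::pivot_longer(wide, cols = c(conc_0nM, conc_10nM),
#                     names_to = "concentration", names_prefix = "conc_",
#                     values_to = "signal")
```

Adding a third variable (for example a time point) then only means adding one more column to the long table.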
4. Writing functions to import multiple files
I know many people getting started with R who use functions or for-loops very little, or not at all. You really should, because it will make your scripts much clearer and easier to use, both for yourself and for others who are reading your scripts. Some advantages:
There is 1 place where the code is written and you always use it in the same way.
If you change something, this immediately applies to all your imported datasets
...(this also applies to other repeated tasks other than data importing of course)
I personally quite like for-loops because I find them much easier to read than lapply()-based approaches when going back to old scripts a few months later. For-loops do have a reputation for being quite slow in R, but this mostly matters if your datasets are very large.
My functions for data import often read the tables that I specified in my design table into either data frames or lists (depending on the type of data I'm reading into R). To stack data frames with rbind(), they all need to have the same columns, and within one data frame every column needs to be the same length. The advantage of lists is that each element can be its own data frame with its own size. More complicated variable classes (special types of variables) that come with some packages (GatingSet from CytoML, or GRanges for anything with genomics) can often also be combined in lists.
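To make the difference concrete, here is a tiny sketch with two made-up tables that do not have the same columns. rbind() refuses to combine them, while a list simply holds both:

```r
# Two invented tables with different shapes
plate_a <- data.frame(well = c("A1", "A2"), signal = c(1, 2))
plate_b <- data.frame(well = "B1", signal = 3, temperature = 37)

# rbind() stops with an error because the columns do not match
rbind_result <- try(rbind(plate_a, plate_b), silent = TRUE)
print(inherits(rbind_result, "try-error"))

# A list keeps each table as it is
imported_list <- list(plate_a = plate_a, plate_b = plate_b)
print(sapply(imported_list, ncol))
```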
In my import functions I typically use a structure where I first initialise the data to be imported by importing the first file. The for loop then iterates over all the rows in the design table and sticks the newly imported files to the imported_data. So with every loop iteration, I add a new element to the imported data. In dummy R this looks something like this:
my_new_function_for_data_import <- function(design_table){
  # import the first file (1st row in your design table)
  imported_data <- function_for_single_file_import(design_table[1, ])
  # seq_len(...)[-1] gives 2, 3, ..., n and is empty if there is only 1 file
  for(i in seq_len(nrow(design_table))[-1]){
    new_imported_file <- function_for_single_file_import(design_table[i, ])
    imported_data <- rbind(imported_data, new_imported_file)
    # or, if your imported data is a list, you can make a list of lists
    # by using this line instead of the rbind() line:
    # imported_data <- c(imported_data, new_imported_file)
  }
  return(imported_data)
}
myDataset <- my_new_function_for_data_import(design_table)
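For a version you can run from start to finish, the sketch below first writes two small made-up CSV files to a temporary folder and then imports them through a design table. The file names, columns and the helper names import_single_file() and import_all_files() are all invented for the example. Note that it is a variant of the loop above: it collects the files in a list and binds them once at the end, instead of calling rbind() in every iteration.

```r
# Create a temporary folder with two made-up CSV files to import
tmp_dir <- tempfile("import_demo")
dir.create(tmp_dir)
write.csv(data.frame(well = c("A1", "A2"), signal = c(10, 12)),
          file.path(tmp_dir, "luci_0nM.csv"), row.names = FALSE)
write.csv(data.frame(well = c("A1", "A2"), signal = c(55, 60)),
          file.path(tmp_dir, "luci_100nM.csv"), row.names = FALSE)

# Design table: one row per file
design_table <- data.frame(
  fileNames = list.files(tmp_dir, pattern = "\\.csv$"),
  fileDirectory = tmp_dir,
  stringsAsFactors = FALSE
)

# Import a single file, described by one row of the design table
import_single_file <- function(design_row) {
  one_file <- read.csv(file.path(design_row$fileDirectory, design_row$fileNames))
  # Keep track of where each observation came from
  one_file$fileNames <- design_row$fileNames
  one_file
}

# Import all files in the design table and stack them into one data frame
import_all_files <- function(design_table) {
  imported <- vector("list", nrow(design_table))
  for (i in seq_len(nrow(design_table))) {
    imported[[i]] <- import_single_file(design_table[i, ])
  }
  do.call(rbind, imported)
}

myDataset <- import_all_files(design_table)
print(myDataset)
```

Because every imported row carries its fileNames value, the result can later be matched back to any extra columns in the design table.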
5. Check your input data
This is maybe quite obvious but I think it cannot be stressed enough that you should always check every step of the process. If you are writing functions, make sure you have checked every step it is taking. In addition, especially if you are on a system with a comma as decimal separator, check your numbers and make sure they are the actual numbers you wanted to import (and not way too big or too small)! Lastly, make sure all numbers are actually imported as numbers, and if not, convert them with as.numeric().
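The decimal comma problem can be sketched like this. The small semicolon-separated table is made up and passed in as text so the example runs without a file:

```r
# A made-up export with ";" as column separator and "," as decimal separator
csv_text <- "well;signal\nA1;1,5\nA2;2,25"

# Without telling R about the decimal comma, signal is not read as a number
wrong <- read.csv(text = csv_text, sep = ";")
print(is.numeric(wrong$signal))

# Specifying dec = "," gives real numbers
right <- read.csv(text = csv_text, sep = ";", dec = ",")
print(is.numeric(right$signal))
str(right)

# Converting after the fact: swap the comma for a dot, then as.numeric()
converted <- as.numeric(sub(",", ".", wrong$signal, fixed = TRUE))
print(converted)
```

Running str() or summary() on the imported data right after import is a quick way to spot columns that ended up as text, or values that are off by a factor of 10 or 1000.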