R for Bioinformatics: Tools, Applications, Examples

R is an open-source programming language designed for statistical computing and graphics, and it is one of the most widely used languages in bioinformatics. It can manipulate and analyze large datasets efficiently, and its extensive library of statistical and graphical methods makes it straightforward to visualize data and present results clearly. R also offers a wide range of packages for analyzing biological data.

R Programming Language in Bioinformatics. Image Source: Respective Tools Websites.

Advantages of R for bioinformatics

Some of the advantages of using R programming in bioinformatics include the following:

R is open source and free to use, which makes it accessible to everyone, including bioinformatics researchers.
R has a wide range of statistical tools and packages that can be used to analyze bioinformatics data.
R has a large and active community of users and developers who constantly create new tools and packages for specific bioinformatics research needs.
R is cross-platform: it runs on Windows, macOS, and Linux.

Tools for R Programming in Bioinformatics

R provides many packages that are designed specifically for working with genomic data. Some of these tools include:

1. Bioconductor

Bioconductor is an open-source, open-development software project for computational biology. It is a collection of R packages for bioinformatics that includes tools for data visualization, statistical analysis, and genomic data analysis.

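To install Bioconductor (current releases are installed through the BiocManager package from CRAN; the older biocLite.R script is no longer supported):

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install()

For installing specific packages like Biostrings and GenomicRanges:

BiocManager::install(c("Biostrings", "GenomicRanges"))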

Some of the major Bioconductor packages used in bioinformatics are:

GenomicRanges is a Bioconductor package that provides tools for storing, manipulating, and analyzing genomic intervals.

DESeq2 is a package for differential gene expression analysis, commonly used in RNA-seq data analysis.

Biostrings provides efficient data structures and algorithms for working with biological sequences, including DNA, RNA, and protein sequences. It is particularly useful for analyzing high-throughput sequencing data, such as whole-genome or transcriptome sequencing.

Example:

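Here is a simple example that performs a pairwise sequence alignment with Biostrings:

library(Biostrings)
# In recent Bioconductor releases, pairwiseAlignment() has moved to the pwalign
# package; if the call below is not found, also run library(pwalign).

# Two DNA sequences that differ at a single position
seq1 <- DNAString("ATGGTGACCTGACGTCGAGGTAGCCAGCTGACTAGGACGTAGGCT")
seq2 <- DNAString("ATGGTGACCTGACGTCGAGCTAGCCAGCTGACTAGGACGTAGGCT")

# Global alignment (the default type) with the default scoring scheme
alignment <- pairwiseAlignment(seq1, seq2)

print(alignment)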
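
The GenomicRanges package described above can be illustrated in a similar way. The following is a minimal sketch, assuming GenomicRanges is installed; the gene and peak coordinates, and the names `genes`, `peaks`, and `gene_id`, are hypothetical values made up for illustration rather than real annotations.

```r
library(GenomicRanges)  # also attaches IRanges and S4Vectors

# Hypothetical gene coordinates (illustrative values only)
genes <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 500, 200), end = c(300, 800, 400)),
  strand   = c("+", "-", "+"),
  gene_id  = c("geneA", "geneB", "geneC")
)

# Hypothetical unstranded peaks, e.g. from a ChIP-seq experiment
peaks <- GRanges(
  seqnames = c("chr1", "chr2", "chr2"),
  ranges   = IRanges(start = c(250, 100, 900), end = c(350, 150, 950))
)

# Find which peaks overlap which genes
hits <- findOverlaps(peaks, genes)

# Gene identifiers with at least one overlapping peak
overlapping_genes <- genes$gene_id[subjectHits(hits)]
print(overlapping_genes)

# Width of each gene interval (end - start + 1)
print(width(genes))
```

Storing intervals as GRanges objects rather than plain data frames lets overlap queries such as `findOverlaps()` take chromosome and strand into account automatically.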

2. ggplot2

ggplot2 is a popular R package for data visualization that can be installed from the Comprehensive R Archive Network (CRAN). It provides a set of functions for building a wide range of graphs that are both informative and aesthetically pleasing, which makes it well suited to exploring and presenting data.

The following basic template is used to create a ggplot:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  <GEOM_FUNCTION>()

For example, to create a line plot of gene expression over time, with time on the x-axis and expression level on the y-axis:

library(ggplot2)

ggplot(gene_expression, aes(x=time, y=expression)) + geom_line()

Here, ‘gene_expression’ is the data frame, the ‘aes’ (aesthetic) function maps its ‘time’ and ‘expression’ columns to the axes, and the ‘geom_line’ function draws the line plot.
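
The snippet above assumes that a ‘gene_expression’ data frame already exists. A self-contained sketch might look like the following, where the data values, axis labels, and title are all invented for illustration:

```r
library(ggplot2)

# Hypothetical data: expression of one gene at six time points (made-up values)
gene_expression <- data.frame(
  time       = c(0, 2, 4, 6, 8, 12),
  expression = c(1.0, 1.8, 3.5, 4.1, 3.2, 2.0)
)

# Build the plot layer by layer: a line plus a point at each measurement
p <- ggplot(gene_expression, aes(x = time, y = expression)) +
  geom_line() +
  geom_point() +
  labs(
    x = "Time (hours)",
    y = "Expression level",
    title = "Expression over time (simulated values)"
  )

print(p)  # draws the plot on the active graphics device

# Optionally save the figure to disk:
# ggsave("gene_expression.png", plot = p, width = 6, height = 4)
```

Because the plot is stored in the object `p`, further layers or themes can be added later with `+` without rebuilding it from scratch.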

3. Shiny

Shiny is an R package widely used in bioinformatics for building web-based tools and applications that let users interact with and visualize genomic data. Shiny applications can run statistical analyses on demand and present the results as interactive reports.
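
As a rough sketch of what such an application looks like, the example below defines a tiny Shiny app that shows a histogram for a selected gene. It assumes the shiny package is installed; the `expr` data frame, the gene names, and the input and output IDs are hypothetical and the values are simulated.

```r
library(shiny)

# Hypothetical expression values for three genes (simulated)
set.seed(1)
expr <- data.frame(
  gene  = rep(c("geneA", "geneB", "geneC"), each = 50),
  value = c(rnorm(50, mean = 5), rnorm(50, mean = 7), rnorm(50, mean = 6))
)

# User interface: a drop-down menu to pick a gene and a plot area
ui <- fluidPage(
  titlePanel("Expression distribution (simulated data)"),
  selectInput("gene", "Gene:", choices = unique(expr$gene)),
  plotOutput("hist")
)

# Server logic: redraw the histogram whenever the selected gene changes
server <- function(input, output) {
  output$hist <- renderPlot({
    values <- expr$value[expr$gene == input$gene]
    hist(values, main = input$gene, xlab = "Expression")
  })
}

# Bundle UI and server into an app object
app <- shinyApp(ui = ui, server = server)

# In an interactive R session, launch it with:
# runApp(app)
```

The split between `ui` and `server` is the core Shiny pattern: the UI declares inputs and outputs, and the server recomputes outputs reactively when inputs change.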

4. dplyr

dplyr is an R package for data manipulation with functions for filtering, selecting, summarizing, and arranging data.

For example, to select and filter the required data:

To select the ‘gene’, ‘sample’ and ‘organism’ columns from a data frame ‘rna’, the ‘select’ function is used:

library(dplyr)

select(rna, gene, sample, organism)

The ‘filter’ function can be used to select only the rows of the data frame where the sex column is equal to “Male”:

filter(rna, sex == "Male")
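
These verbs are usually chained with the pipe operator. The following sketch builds a small hypothetical ‘rna’ data frame (the column names follow the text above, while the `expression` column and all values are made up) and combines filtering, selecting, summarizing, and arranging:

```r
library(dplyr)

# Hypothetical 'rna' table; values are invented for illustration
rna <- data.frame(
  gene       = c("geneA", "geneA", "geneB", "geneB"),
  sample     = c("S1", "S2", "S1", "S2"),
  organism   = "Mus musculus",
  sex        = c("Male", "Female", "Male", "Female"),
  expression = c(1170, 361, 36194, 10347),
  stringsAsFactors = FALSE
)

male_summary <- rna %>%
  filter(sex == "Male") %>%                          # keep rows for male samples
  select(gene, sample, expression) %>%               # keep only the needed columns
  group_by(gene) %>%                                 # one group per gene
  summarise(mean_expression = mean(expression)) %>%  # average within each gene
  arrange(desc(mean_expression))                     # highest expression first

print(male_summary)
```

Each step takes a data frame and returns a new one, so the pipeline reads top to bottom in the order the operations are applied.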

Applications of R in Bioinformatics

R programming is used in bioinformatics for various applications, from data visualization and statistical analysis to genomics and machine learning. Some of the applications of R programming in bioinformatics are:

R can be used to create graphs and charts, which are essential for exploring and interpreting complex biological data. Popular packages for visualization include ggplot2 and Shiny.
R provides a wide range of statistical tools and techniques for analyzing biological data.
R provides tools for data manipulation, which are essential for working with large biological datasets. The dplyr package can be used to clean and preprocess data, making it easier to analyze and interpret.
R provides many packages designed specifically for working with genomic data, such as those in the Bioconductor project, including GenomicRanges.
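
As one illustration of the statistical analysis application listed above, the sketch below uses only base R to compare simulated expression values between two groups, gene by gene, and then corrects for multiple testing. The data are randomly generated, and the group sizes, effect size, and 0.05 threshold are arbitrary assumptions chosen for the example; dedicated packages such as DESeq2 use more appropriate models for real RNA-seq counts.

```r
set.seed(42)  # make the simulation reproducible

n_genes     <- 100
n_per_group <- 5

# Simulated log-expression matrices: rows are genes, columns are samples
control   <- matrix(rnorm(n_genes * n_per_group, mean = 8), nrow = n_genes)
treatment <- matrix(rnorm(n_genes * n_per_group, mean = 8), nrow = n_genes)

# Shift the first 10 genes upward in the treatment group
treatment[1:10, ] <- treatment[1:10, ] + 3

# Welch two-sample t-test for each gene
p_values <- sapply(seq_len(n_genes), function(i) {
  t.test(treatment[i, ], control[i, ])$p.value
})

# Benjamini-Hochberg adjustment to control the false discovery rate
adj_p <- p.adjust(p_values, method = "BH")

# Indices of genes called significant at the chosen threshold
significant <- which(adj_p < 0.05)
print(significant)
```

Adjusting the p-values matters here because testing many genes at once would otherwise produce false positives by chance alone.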
