Chapter 1 What is R and why use it?

NOTE: these notes are from a lecture and have not been re-written much yet

R is a powerful piece of software used for data science, from routine data analysis with conventional statistics to machine learning. In this chapter I will briefly introduce the advantages of using R, why you might want to learn it, and also indicate some alternatives and adjuncts you could consider.

1.1 How do we typically use software in science?

Most scientists rely on both general and specialized pieces of software for various parts of their work. For data entry they likely use spreadsheet software such as Excel or Google Sheets. For data analysis they might use one of many options, such as GraphPad Prism, Minitab, SAS, SPSS, or Stata. For making plots, many people will export their results back to Excel, while others use specialized software like SigmaPlot.

Many scientists also use programs specific to their field. For example, in ecology, environmental science, and geology, many researchers make and analyze maps using GIS (Geographic Information System) software such as ArcGIS or QGIS, and scientists in many fields use Mathematica to build mathematical models.

Since software can be expensive, some scientists rely on Excel for all of their work. Excel can do many things, but it can’t do everything that the specialized types of software can do. Moreover, it’s very limited in the range of statistics it can do and the graphs it can make.

1.2 What does R do?

R is amazing because it has been explicitly developed to do several things very well, particularly statistics, math, making great-looking figures, and writing computer programs to automate these tasks. Additionally, R has been extended by developers to be a powerful tool for data cleaning and organization, to be used as a GIS, and to serve as an integrated word processor and website maker for publishing work.
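
To make these four core tasks concrete, here is a minimal sketch of what each looks like in R. It assumes nothing beyond a standard R installation: `mtcars` is an example data set that ships with R, and `standard_error` is a name made up for this illustration, not a built-in function.

```r
# Statistics: summarize fuel economy and compare the two transmission types.
mean(mtcars$mpg)
t.test(mpg ~ am, data = mtcars)

# Math: ordinary arithmetic and built-in mathematical functions.
sqrt(2) * pi

# Figures: a scatterplot of fuel economy against car weight.
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Programming: a small function that automates a repeated calculation.
# (standard_error is a hypothetical helper defined here for illustration.)
standard_error <- function(x) sd(x) / sqrt(length(x))
standard_error(mtcars$mpg)
```

Each line is a command you type, which is the main difference from point-and-click software.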

1.3 Why use R?

In addition to its many capabilities, R has the advantage that it

  • is free for everyone, always
  • is used by statisticians to develop new statistical techniques, so new techniques often come out first in R
  • has a user community known for valuing cooperation and inclusiveness
  • is supported by a for-profit company, RStudio (renamed Posit in 2022), that produces many free programs and services to support the community

1.4 Who uses it?

R continues to increase in popularity. Among data scientists it is second only to Python, which borrows many of its data analysis ideas from R. Among academics it has eclipsed SAS in many fields. It is also used by analysts in many large companies, such as Facebook, and by journalists looking for stories in, or reporting on, large volumes of data. (See http://blog.revolutionanalytics.com/2014/05/companies-using-r-in-2014.html for further discussion.)

1.5 R and computational reproducibility

One factor potentially contributing to R’s popularity, or at least a major bonus of using it, is how easy it makes reproducible analysis. All commands in R are typed out, and the best way to do this is in a script file from which you send commands to R to execute. This creates a record of your analyses. This feature is shared by other statistics programs such as SAS and Stata, and by other programming languages such as MATLAB and Python. The advantage of R is that the script files are simply plain text files which anyone can open and, because R is free to download, anyone can run.

Developers have also created numerous tools that support reproducible analysis workflows and allow R to be used in all data-related aspects of a project, from data cleaning to formatting journal submissions. This means that, without becoming an expert programmer, you can set up your work so that all of your data cleaning, analyses, and graph building can be re-run with a single command in R. This makes what you’ve done auditable, transparent, and easy to re-use for future work.
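
As a rough illustration of such a workflow, the sketch below shows what a small script file might contain. The file name `analysis.R` and the output file name `ozone_vs_temp.pdf` are made up for this example. `airquality` is an example data set that ships with R, so the script should run on any standard installation.

```r
# analysis.R (hypothetical file name): every step lives in one plain text file.

# 1. Data cleaning: drop rows where the ozone measurement is missing.
raw <- airquality                       # built-in example data set
clean <- raw[!is.na(raw$Ozone), ]

# 2. Analysis: linear regression of ozone on temperature.
fit <- lm(Ozone ~ Temp, data = clean)
print(summary(fit))

# 3. Graph building: save a scatterplot with the fitted line to a PDF.
pdf("ozone_vs_temp.pdf")                # hypothetical output file name
plot(Ozone ~ Temp, data = clean,
     xlab = "Temperature (F)", ylab = "Ozone (ppb)")
abline(fit)
dev.off()
```

If this were saved as `analysis.R`, the whole sequence could be re-run from inside R with `source("analysis.R")`, or from a terminal with `Rscript analysis.R`. Anyone who receives the file can see exactly what was done and repeat it.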

1.6 Alternatives to R

R has many advantages, but it has one critical issue: the learning curve. R is a command-line driven analysis tool, which means you type out specific commands for almost every single thing R does. Excel is pretty user friendly, and several stats programs, such as SPSS, JMP, and Stata, similarly offer point-and-click interfaces. SAS also requires a lot of command writing, but is generally considered more user friendly than R.

Recently, two free point-and-click statistical analysis programs have been released that are built on R but require no programming. JASP (“Jeffreys’s Amazing Statistics Program”) has an emphasis on Bayesian statistics, particularly Bayesian hypothesis testing using Bayes factors (an approach increasing in popularity, especially in psychology, but which some Bayesians, like Andrew Gelman, disavow). While JASP is based on R, it does not currently allow access to the underlying R code.

Jamovi is similar in spirit to JASP (indeed, it was founded by developers who had worked on JASP) but is more transparent about the underlying R code being used to run the analysis.