Chapter 4 What is R and why use it?
[these notes are from a lecture and have not been re-written much yet]
R is a powerful piece of software used for data science and data analysis. In this chapter I will briefly introduce the advantages of using R, why you might want to learn it, and also indicate some alternatives and adjuncts you could consider.
4.1 How do we typically use software in science?
Most scientists rely on both general and specialized pieces of software for various parts of their work. For data entry they likely use spreadsheet software Excel, though increasingly Google Sheets. For data analysis they might use one of many options, such as GraphPad Prism, Minitab, SAS, SPSS, or STATA. For making plots, many people will export their export their results back to Excel, while others use specialized software like SigmaPlot. Many scientists also use specialized programs; in ecology many researchers do GIS in ArcGIS or QGIS, mark-recapture analysis in Program MARK or Distance, use RAMAS or Vortex for population viability analysis, or build custom mathematical programs in MatLab or Python. If they do multivariate statistics like [ordination](https://en.wikipedia.org/wiki/Ordination_(statistics) the may use a specialized stats program like PC-ORD.
Since software can be expensive, some scientists will rely on Excel for all of their work. Excel can do many things, but it can’t do everything all the specialized types of software can do. Moreover, its very limited in the range of statistics it can do and graphs it can make.
4.2 What does R do?
R is amazing because it has been explicitly developed to do several things very well, particularly statistics, math, making great-looking figures, and writing computer programs to automate these tasks. Additionally, R has been extended by developers to be able to be a powerful tool for data cleaning and organization, to be used as a GIS, and as an integrated word processor and website make for publishing work.
4.3 Why use R
In addition to is many capabilities, R has the advantage this it is
- free anyone, always
- used by statisticians to develop new statistical techniques, so new techniques often come out 1st in R
- used by almost all ecological statisticians to develop new techniques (mark recapture, distance sampling)
4.4 Who uses it?
R continues to increase in popularity. Among data scientists it is second only to Python. Among academics it has eclipsed SAS in many fields. It is also used by analyses in many large companies, such Facebook, and by journalists looking for stories in or reporting on large volumes of data
see http://blog.revolutionanalytics.com/2014/05/companies-using-r-in-2014.html for further discussion.
4.5 R and computational reproducibility
One factor potentially contributing to R’s popularity, or at least a major bonus for using it, is ease of use for making analyses reproducible. All commands in R are typed out and the best way to do this is in a static script file from which you send commands to R to execute. This creates a record of your analyses. This feature is shared by other programs such as SAS and Stats, and other programming languages such as Matlab and Python. The advantage of R is that the script files are simply plain text files which anyone can open and - if they’ve downloaded R, which is free - they can run. Developers have also created numerous tools for creating reproducible analysis workflows and which allow R to be used in all data-related aspects of a project, from data cleaning to formatting journal submissions. What this means is that without become an expert programmer you can set up your work so that you can re-run all of your data cleaning, analyses, and graph building with a single command in R. This makes what you’ve done auditable, transparent, and easy to re-use for future work.
4.6 Alternatives to R
R has many advantages, but it has one critical issue: the learning curve. R is a command-line driven analysis tool, which means you type out specific commands for almost everything single thing R does. Excel is pretty user friendly, and several stats programs similarly use point-and-click interfaces, such as SPSS, JMP, and Stata SAS also requires a lot of command writing, but is generally consider more user friendly than R.
Recently, two free point-and-click statistical analysis programs have been release that are built on R but require no programming. JASP (“Just another statistics program”) has an emphasis on Bayesian statistics, particularly Bayesian hypothesis testing using Bayes factors (an approach increasing in popularity, especially in psychology, but which some Bayesians, like Andrew Gelman, disavow). While JASP is based on R, it does not currently allow access to the underlying R code.
Jamovi has a similar spirit as JASP (indeed, it was founded by developers who had worked on JASP) but is more transparent about the underlying R code being used to run the analysis.