**Week**: 30

**Language**: The R Project

**IDE(s):** R

*History (official):*

(From the R Project “What is R?” page)

“*R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes*

*an effective data handling and storage facility,*

*a suite of operators for calculations on arrays, in particular matrices,*

*a large, coherent, integrated collection of intermediate tools for data analysis,*

*graphical facilities for data analysis and display either on-screen or on hardcopy, and*

*a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.*”

** **

Once again, this is what you get when programmers write your sales materials – nothing but facts.

Boring, tediously informative facts.

*History (real):*

In the Olden Days (™), if you wanted a computer to do your math homework, you had to use FORTRAN. It wasn’t what you might call ‘interactive’. You wrote your code, submitted it to the mainframe, which compiled and ran it. Assuming you didn’t have any typos, you got a printout of the results. (FORTRAN was my first programming language back in 1977. We used punch cards.)

This was always annoying and occasionally painful, but there were no good alternatives until the mid-70’s, when researchers at Bell Labs developed the programming language ‘S’. It was standard practice at the time to give programming languages single letter names.

I’m picturing the marketing meetings:

“How about ‘Bell Labs: We Don’t Have Time For This’?”

“Not bad. But I really like ‘Bell Labs: Smart But Terse’.”

“Love it!”

Moving on.

In the early 90’s, researchers at the University of Auckland, New Zealand developed a new version of S that they called R. Currently it’s being maintained by the R Development Core Team, with contributors from all over the world. The name R is not just a play on the name S, but is also a tribute to the original developers, Robert Gentleman and Ross Ihaka, who were known at university as “R & R”.

R is free and available for Linux, Windows and Mac OS X. The source code is also freely available so you can compile it for any platform you like.

*Discussion*:

I’ve been looking forward to this for some time. I teach undergraduate math and occasionally blog about math education so math software is a particular interest of mine. I’m a firm believer in letting machines do the grunt work of mathematics. If you understand the problem well enough to explain it to a computer, then by definition you understand the problem.

I downloaded the Mac version of R from the main project site. R is a command line based tool so I wasn’t that surprised when I started up the program and got a window with a command prompt:

** **

The window has a toolbar with easy access to common functions:

- Load data or a script file
- Open a new window (for charts and plots)
- Authorize R to run commands as root (system administrator)
- Show/hide R command history
- Set R console colors
- Open document in editor
- Create a new empty document
- Print this document
- Quit

Most of your work is done at the command line.

Before we get going, I’d like to perform the traditional “Hello World!”:

This term I’m teaching an introductory statistics course so I’ve got some sample data from classroom exercises to run R through it’s paces. R stores with data tables in variables called data frames. There are pre-loaded data frames available in the software with which to experiment. You can enter data manually or just load the data from an outside source.

My data is in spreadsheet format and there are a number of ways to import spreadsheet files directly into R. Since I use Google Sheets as my primary spreadsheet program, the easiest way for me was to save off my data in CSV format. (Excel files can be imported directly.)

R assumes files are in the current working directory. The default is my home directory so I changed that setting to where I’d saved the files.

My first test data was a simple table comparing car prices to the age of the car. I chose a specific make and model (Toyota Sienna) and pulled these numbers straight from AutoTrader.com. I converted the worksheet and loaded the data table into R:

Now I can work with it directly. First let’s give it a once-over using the summary() function:

This gives me my central tendency numbers amongst others. Now let’s do a quick plot using using the pairs() function.

I got two plots, one with age as the dependent variable and other with price. I didn’t tell R which was which so it did both. I can be more specific using the plot() function:

This tells R that price is dependent on the age of the car. This gives me a single chart:

Now I can calculate the correlation coefficient with cor() to see how strongly the two sets of data relate to each other:

So price is negatively correlated with the age of the car, which fits what the chart told us. Older cars cost less, in other words. It’s a pretty strong correlation, too, at 85%.

Now we’d like to do some prediction so we’ll perform a linear regression on the data. First create a data structure with the regression data, then pull a summary:

Now you have a processed data set and you can continue working with it.

We can get data in and manipulate it but how do we get it out? For text data, such as the correlation summary, you can just copy and paste it from the R gui window. The plots appear in a separate window. I was able to click on the image, select Copy from the Edit menu and paste directly in a document.

R is a very powerful, interactive language for scientific and math computing. So why would you use it instead of a spreadsheet?

Frankly, if you’re not a full-on numbers nerd, you may just want to stick with spreadsheets. But they’re a general purpose tool and R has more math functionality. You can write your own functions and even groups of functions (called *packages*) to extend R even further.

Another advantage of R is automation. Anything you can type in at the command prompt can be saved into a file, letting you easily set up long, complex sets of calculations that can be loaded into your workspace with a single command. If you’re doing batch processing of multiple datasets, this can save a lot of time and effort.

The documentation is very good and there are plenty of tutorials and examples available at the project homepage and around the Web.