Many articles have been written on why R is better than Excel for data analysis. In this post, I will summarize the reasons why R is advantageous in most data analysis circumstances, with a focus on fantasy football analysis. Although this post focuses on the benefits of using R for fantasy football analysis, the benefits of R extend to many other domains (e.g., finance, time series, machine learning, social sciences; for a list of different domains of R packages, see here).
What is R?
R is a software package and language for statistical analysis and graphics.
When/Why R is Better than Excel
- Data Manipulation. R allows you to manipulate (e.g., subset, recode, merge) data quickly. Some R packages have been designed specifically for these purposes, e.g., plyr. Typically, a majority of the time spent on an analysis project is spent before the analysis—preparing the data. R is much more adept and efficient in data preparation than Excel. Fantasy data scraped from websites often require many steps in data processing to be ready for analysis, so R is ideal.
- Easier Automation. R uses a scripting language rather than a GUI, so it’s much easier to automate things in R than in Excel. This can save you loads of time, especially when you plan to re-run the same analysis multiple times (e.g., every new fantasy season).
- Faster Computation. Because of the automation provided by R scripts, many operations are much faster to perform in R than Excel.
- It Reads Any Type of Data. R can basically read any type of data (.txt, .csv, .dat, etc.). There are also R packages specifically designed to read JSON, SPSS, Excel, SAS, and STATA files. You can also scrape data from websites and execute SQL queries. Scraping websites can be useful for downloading fantasy projections from ESPN and other websites for data analysis.
- Easier Project Organization. In Excel, projects are often organized in different tabs of the same file. This can make the Excel file slow, clunky, and difficult to navigate. It is easier to keep a project organized when dealing with R scripts because different tasks or sub-projects can be stored in separate files stored in the same folder and linked together in the same project with RStudio. For an example folder structure for R projects, see here.
- It Supports Larger Data Sets. Excel has restrictions for how large your data can be. Even if if your data don’t exceed this maximum size, Excel can become slow with large data sets (especially after you add tabs, formulas, and references). R supports larger data, and can support big data with packages such as Hadoop.
- Reproducibility. R has features that make it much easier to reproduce the findings of your analysis, which is important for detecting errors. First, it’s easy to add comments to your scripts to make it clear what you’re doing. Commenting your code is crucial, and can serve as a translation for someone else looking at your code, or as a reminder of what you did 6 months ago! It is difficult to document steps you’ve done in Excel. Second, data and analysis are separated in R, allowing you to see the logical progression for data analysis in the R code. In Excel, however, data and formulas are together, and it can be difficult to follow the data analyst’s train of logic. Third, you can use version control with git a) to track (and revert) changes you make over time and b) to share your scripts with others to collaborate on projects as a community. Having more people examining your work can help find and fix errors and make other improvemnts. Excel files are binary files, so you can’t track changes to Excel files. The github site hosting the R scripts for this site is located here. Feel free to use the scripts and suggest improvements!
- Accuracy. Researchers have shown that Excel and other spreadsheets show important inaccuracies for basic analyses like linear regression. R was specifically designed for statistical analysis, so it is more precise and accurate for data analysis.
- Easier to Find and Fix Errors. Because R uses scripting rather than clicking, and allows comments and version control, one can see a history of the actions taken to achieve the result. This makes it easier to find and troubleshoot errors. In Excel, however, errors can be hidden in formulas in cells that can be difficult to find. Spreadsheet errors have led to widely-publicized mistakes, including disastrous financial losses, faulty government policies, and the wrong drugs being given to cancer patients. Humans make mistakes and mistakes in data analysis are inevitable, whether with spreadsheets or with R code. The bottom line is that it’s easier to find and fix these mistakes in R than it is in Excel, making it more likely that you’re getting an accurate result in R.
- It’s Free. Enough said.
- It’s Open Source. Unlike Excel and other proprietary software used for data analysis, R is not a black box. You can examine the code for any function or computation you perform. You can even modify and improve these functions by changing the code.
- Advanced Statistics. R has many more (and more advanced) statistics capabilities than Excel does. They also tend to be faster and more flexible. Part of the advanced capabilities of R owes to the fact that R is open source and many users have contributed packages for performing specialized functions. For example, this fantasy football draft optimizer uses the Rglpk package to find your optimal starting lineup of players that maximizes the team’s projected points while minimizing its downside risk.
- State-of-the-Art Graphics. R has advanced graphics capabilities (see here for examples and code for how to create them). You can create beautiful graphics using the base R package, or with the lattice or ggplot packages. People like to digest and understand statistics visually, and R provides a better tool for creating pretty visualizations than Excel does.
- It Runs on Many Platforms. You can use R on Windows, Mac, Linux, and Unix.
- Anyone (Including You) Can Contribute Packages to the Community to Improve its Functionality. In the chance there isn’t an R package that does what you need to do, you can write a function to perform the task and can contribute it as a package to the community for others to use and improve. The number of R packages contributed to the community is increasing at a rapid rate. Chances are, if there’s an analysis you need to do, an R package exists to do it.
For these reasons,
- R is considered the top language for data science. Also see here.
- R usage is increasing exponentially.
- The number of R packages is increasing exponentially. Also see here.
- More data scientists use R than any other software for predictive modeling competitions (including 50% of winners) on Kaggle.
- An increasing number of jobs require experience with R.
- R skills attract the highest IT salaries.
- Many companies use R.
- Microsoft (the owner of Excel) bought Revolution Analytics (the leading business software implementation of R for data analysis).
When to Use Excel
- Data Entry. I don’t necessarily recommend Excel for data entry, but it does have some nice features including data validation. Data validation ensures that the entered data conform to user-specified requirements (e.g., must be integer between 0-5), and can reduce data entry errors. The problem with saving data in Excel (.xls/.xlsx) format is that these formats don’t cooperate well with other software, so if you plan to do any data manipulation or analysis with other software (e.g., R), it gets tricky. You might be able to save your data as comma-separated values (.csv), a more widely used format that can be read by most other data software. In most cases, however, using a simple text editor or relational database management software is best for data entry.
- Basic Calculations. Excel can be faster for very simple calculations (e.g., descriptive statistics) or data manipulation (e.g., basic sort, filter, pivot table). Anything beyond simple calculations and data manipulation, however, will likely be better in R.
- Viewing/Scrolling Data at a Glance. Excel can be handy for a quick visual overview of your data. Eyeballing and scrolling through the rows and columns gives you a hands-on way to quickly explore your data visually. There are ways to see your data in R, as well, but a spreadsheet-like view of your data is not presented by default.
- Data Presentation and Summarization. Excel can be helpful for presenting data that need to be summarized in an aesthetically-pleasing tabular form. For more advanced graphical summaries, however, R is likely better.
How to Download R
How to Get Data from Excel into R
Although there are R packages for importing Excel data directly (see here, here, and here), it’s generally best to export the Excel data to a .csv file and then import the .csv file in R using the read.csv() function. Here’s the code for importing a .csv file in R and storing it in an object called ‘mydata’:
mydata <- read.csv("C:/excelData.csv")
Where Can I Learn How to Use R?
- Use this intro to R: http://www.statmethods.net/
- Here’s a guide to R for Excel users (and another one).
- Check out these resources for getting started with R: http://www.ats.ucla.edu/stat/
- Watch these videos from Coursera: http://blog.revolutionanalytics.com/2012/12/coursera-videos.html
- Post to the R mailing list or forums if you have questions
- Read other blogs on R-bloggers
- Read this site!
In summary, R has many benefits over Excel and other data analysis software. If you’re using Excel for data analysis, give R a try. You’ll be thankful you did.