Data Science with R

The Johns Hopkins Coursera course on R Programming started two weeks ago, and I thought I'd share some of my thoughts on the course and my growing knowledge of R.

The programming assignments in this course are, thankfully, challenging. I was concerned that the exercises would be kept too simple, with just enough in them to teach the syntax. The problems, however, have required

  • figuring out how to generate filenames from numbers that needed leading zeros (easy)
  • combining sets of data across files (slightly harder)
  • extracting vectors of cleaned data from data frames (which are collections of vectors)
  • computing correlations only for those data sets that pass a threshold limit
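
The four steps above can be sketched roughly as follows. This is not the course's solution: the directory data_dir, the column names x and y, the function cor_above_threshold, and the reading of "threshold" as a minimum number of complete rows per file are all assumptions made for illustration, and the sample data is made up.

```r
# Made-up sample data, written to a temporary directory so the sketch is
# self-contained. The column names x and y are hypothetical.
data_dir <- file.path(tempdir(), "monitors")
dir.create(data_dir, showWarnings = FALSE)
samples <- list(
  data.frame(x = c(1, 2, 3, NA, 5),  y = c(2, 4, 6, 8, NA)),
  data.frame(x = c(NA, NA, 3, 4, 5), y = c(1, 2, NA, 8, 10)),
  data.frame(x = c(1, 2, 3, 4, 5),   y = c(5, 4, 3, 2, 1))
)
for (id in seq_along(samples)) {
  write.csv(samples[[id]], file.path(data_dir, sprintf("%03d.csv", id)),
            row.names = FALSE)
}

# 1. File names from numbers, padded with leading zeros: 001.csv, 002.csv, ...
ids <- 1:3
files <- file.path(data_dir, sprintf("%03d.csv", ids))

# 2. Combine the data across files into one data frame.
combined <- do.call(rbind, lapply(files, read.csv))

# 3. Extract a cleaned vector (no NAs) from one column of the data frame.
x_clean <- combined$x[!is.na(combined$x)]

# 4. One correlation per file, but only for files whose number of
#    complete rows exceeds the threshold (an assumed interpretation).
cor_above_threshold <- function(files, threshold = 0) {
  result <- numeric(0)
  for (f in files) {
    df <- read.csv(f)
    ok <- complete.cases(df)          # TRUE for rows with no NA at all
    if (sum(ok) > threshold) {
      result <- c(result, cor(df$x[ok], df$y[ok]))
    }
  }
  result
}
cors <- cor_above_threshold(files, threshold = 2)
```

With this sample data the second file has only two complete rows, so it is skipped when the threshold is 2.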

If I had to do these sorts of things in C#, my most familiar language at the moment, it would be pure grunt work, but I could get it done. With R, I had to make sure not only that I understood the syntax but also that I performed the correct procedure to get a result. It sounds like ordinary programming, but with a totally different language gestalt.

Having programmed (in chronological order) in Pascal, C, C++, Visual Basic, C#, JavaScript and Python, I have discovered that R is a language tailored to the data scientist, but with a lot of general-purpose programming features.

Data scientists often have to work with dirty data sets, and they need to clean the data before applying statistical techniques. There is even a whole course dedicated to data cleaning in the JH Data Science Specialization!

Since this is primarily a quantitative science, R has the concept of NaN, or Not a Number, just as some general-purpose languages do. R also supports the well-known NULL value. But because data scientists often work with data sets that have holes, that is, no data for some observations, R supports something called NA, which means Not Available (a missing value). The data sets used in the course so far, for example, are CSV files that have plenty of NA values.
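
A small sketch of how the three differ in practice, using only base R. The variable names nan_value and readings are made up for illustration.

```r
# NaN: the result of an undefined numeric operation.
nan_value <- 0 / 0
is.nan(nan_value)      # TRUE

# NA: a placeholder for a missing value; it still occupies a slot.
readings <- c(10.5, NA, 12.1)
length(readings)       # 3
is.na(readings)        # FALSE TRUE FALSE

# is.na() is also TRUE for NaN, but is.nan() is FALSE for NA.
is.na(nan_value)       # TRUE
is.nan(NA)             # FALSE

# NULL: the absence of an object; it has length 0 and vanishes inside c().
length(NULL)           # 0
length(c(1, NULL, 3))  # 2
```

The practical difference is that NA keeps its place in a column, so the remaining values stay aligned with their rows, while NULL simply is not there.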

There are built-in functions for dealing with NAs, such as is.na(), which tests each element of a vector for NA. Ah, that's another cool feature of R: vectors. Even a single value is a vector of length 1. A vector is essentially an ordered sequence of data values, all of the same class or type. The reason vectors are so useful in R is that you often have multiple values (e.g., in a column) to which you want to apply the same function.
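
A few lines to illustrate these points about vectors, as a minimal sketch in base R. The names x, prices, doubled, rounded and has_na are made up.

```r
# A single value is already a vector of length 1.
x <- 42
is.vector(x)   # TRUE
length(x)      # 1

# All elements share one type; mixing types coerces them to a common one.
class(c(1.5, 2, 3))    # "numeric"
class(c(1.5, "two"))   # "character"

# Operators and many functions apply to every element at once.
prices <- c(3.75, 82.01, 25.6)
doubled <- prices * 2
rounded <- round(prices)   # 4 82 26

# is.na() answers per element; anyNA() answers once for the whole vector.
has_na <- anyNA(c(1, NA, 3))   # TRUE
```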

Here I will define a vector of class numeric containing five values, two of which are NA, and then I'll show one way to remove the NAs:

volumes <- c(3.75, NA, 82.01, 25.6, NA)
bad <- is.na(volumes)
volumes <- volumes[!bad]

What is actually happening is that the function is.na is applied to every element of the vector volumes, and it returns not a single TRUE or FALSE but another vector, here called bad, which has a boolean value for each of the elements in volumes. Finally, indexing volumes with the bad vector (negated with '!') generates a vector containing only the good, or non-NA, elements.

Here is the same thing with a few printing steps interspersed in the R console to show what's going on. The console prompt is >, and typing just the name of an object makes the console print its contents:

> volumes <- c(3.75, NA, 82.01, 25.6, NA)
> volumes
[1]  3.75    NA 82.01 25.60    NA
> bad <- is.na(volumes)
> bad
[1] FALSE  TRUE FALSE FALSE  TRUE
> volumes <- volumes[!bad]
> volumes
[1]  3.75 82.01 25.60
> 
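
That was one way to remove the NAs. A few other base R options, sketched with the same volumes vector (the names clean and clean2 are made up):

```r
volumes <- c(3.75, NA, 82.01, 25.6, NA)

# The same filtering in one step, without the intermediate 'bad' vector.
clean <- volumes[!is.na(volumes)]

# na.omit() also drops NAs; as.numeric() strips the attributes it adds.
clean2 <- as.numeric(na.omit(volumes))

# Many summary functions can skip NAs without changing the vector itself.
mean(volumes)                # NA, because missing values propagate
mean(volumes, na.rm = TRUE)  # mean of the three non-NA values
```

Which one fits depends on whether you need the cleaned vector afterwards or only a summary of it.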

Much of my education in R over the past two weeks has been about how to manipulate data: cleaning data of NAs, extracting subsets of data, calling functions to generate a mean or a correlation coefficient. It has taken some getting used to the different ways of working with data, but for this kind of work it is clearly a better way than working with a general-purpose programming language like C.
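
To give a flavor of those operations, here is a short sketch using airquality, a data set that ships with R and contains NAs. It is not one of the course's data sets, and the variable names are made up.

```r
# Extract a subset: only the rows for May.
may <- airquality[airquality$Month == 5, ]

# Mean of a column, ignoring missing values.
mean_ozone_may <- mean(may$Ozone, na.rm = TRUE)

# Correlation coefficient between two columns, using only the rows
# where both values are present.
r <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")
```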

An interesting thing about online courses is that they tend to attract a wide spectrum of students. With data science being the new hot thing, there are thousands of students who are attracted to a new career path that promises lucrative jobs at the end of the rainbow. This particular course is being taught as if the students were in college at Johns Hopkins University. This works well for me, given that I have, I think, a pretty high aptitude for learning difficult subjects, and the not-insignificant fact that I have over twenty-five years of commercial programming experience. For those attracted to data science but with little programming experience, there are two big hurdles: understanding a new conceptual field (data science) and learning a new technical practice (programming). There is a long thread in the class forums about this. I'm glad to hear students speak out, too, because the instructors and the school need to be really clear about the challenging nature of the course for non-programmers.