# Data Science with R

The Johns Hopkins Coursera course on R Programming started two weeks ago, and I thought I'd share some of my thoughts on the course and my growing knowledge of R.

The programming assignments in this course are, thankfully, challenging. I was concerned about the exercises being too simple in order to teach the syntax. The problems, however, have required

• figuring out how to generate filenames from numbers that needed leading zeros (easy)
• how to combine sets of data across files (slightly harder)
• how to extract vectors of cleaned data from data frames (which are collections of vectors)
• how to compute correlations on data for those data sets where threshold limits are applied
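As a rough illustration of the first two items, here is a minimal sketch in base R. The three-digit file names, the temporary directory and the column names are my own assumptions for the example, not the actual course data.

```r
# Hypothetical setup: write three tiny CSV files into a temporary directory
# so the sketch is self-contained. File and column names are assumptions.
data_dir <- file.path(tempdir(), "r_course_demo")
dir.create(data_dir, showWarnings = FALSE)

ids <- 1:3
filenames <- sprintf("%03d.csv", ids)   # pad each number to three digits
paths <- file.path(data_dir, filenames)

for (i in seq_along(ids)) {
  df <- data.frame(id = ids[i], reading = c(ids[i] * 1.5, NA))
  write.csv(df, paths[i], row.names = FALSE)
}

# Read every file and stack the resulting data frames row-wise
combined <- do.call(rbind, lapply(paths, read.csv))
print(filenames)
print(combined)
```

sprintf() does the zero padding, and do.call(rbind, ...) is one common way to stack a list of data frames that share the same columns.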

If I had to do these sorts of things in C#, my most familiar language at the moment, it would be pure grunt work but I could get it done. With R, I had to make sure not only that I understood the syntax but that I performed the correct procedure to get a result. Sounds like programming but with a totally different language gestalt.

Having programmed (in chronological order) in Pascal, C, C++, Visual Basic, C#, JavaScript and Python, I have discovered that R is a language tailored to the data scientist, but with a lot of general-purpose programming features.

Data scientists often have to work with dirty data sets, and they need to clean the data before applying statistical techniques. There is even a whole course dedicated to data cleaning in the Johns Hopkins Data Science Specialization!

Since this is primarily a quantitative science, R has the concept of NaN, or Not a Number, just as some general-purpose languages do. R also supports the well-known NULL value. But because data scientists often work with data sets that have holes, that is, no data for some observations, R also supports something called NA, which means Not Available (a missing value). The data sets used in the course so far, for example, are CSV files with plenty of NA values.
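To make the distinction concrete, here is a small sketch using only base R functions. The variable names are made up for illustration.

```r
not_a_number <- 0 / 0   # an undefined numeric result: NaN
missing      <- NA      # a placeholder for a value that is not available
nothing      <- NULL    # the absence of any object at all

print(is.nan(not_a_number))   # TRUE
print(is.na(not_a_number))    # TRUE: is.na() also flags NaN
print(is.nan(missing))        # FALSE: NA is not NaN
print(is.null(nothing))       # TRUE
print(length(missing))        # 1: NA still occupies a slot
print(length(nothing))        # 0: NULL has no elements
print(missing + 1)            # NA: missingness propagates through arithmetic
```

The practical difference is that NA keeps its place in a vector, so a column with holes still lines up with the other columns, whereas NULL simply isn't there.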

There are built-in functions for dealing with NAs, such as is.na(), which tests each element of a vector for NA. Ah, that's another cool feature of R: vectors. Even a single value is a vector of length 1. A vector is essentially an ordered sequence of data values, all of the same class or type. The reason vectors are so useful in R is that you often have multiple values (e.g., in a column) to which you want to apply the same function.
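A quick sketch of what that means in practice (the temperature values are invented for the example):

```r
single <- 42
print(is.vector(single))   # TRUE: a lone number is still a vector
print(length(single))      # 1

temps_c <- c(12.5, 18.0, 21.3)
print(class(temps_c))      # "numeric"

# Arithmetic is applied to every element at once, no loop required
temps_f <- temps_c * 9 / 5 + 32
print(temps_f)

# Mixing types forces everything to a single common class
mixed <- c(1, "a")
print(class(mixed))        # "character"
```

The last two lines show the "same class" rule at work: R quietly converts the number 1 to the string "1" so that the whole vector has one type.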

Here I will define a vector of class numeric containing five values, two of which are NA, and then I'll show one way to remove the NAs:

volumes <- c(3.75, NA, 82.01, 25.6, NA)
bad <- is.na(volumes)
volumes[!bad]

What is actually happening is that the function is.na is applied to every element of the vector volumes. It returns not a single TRUE or FALSE but another vector, stored here as bad, which holds a logical (boolean) value for each element of volumes. Finally, indexing volumes with the bad vector (negated with '!') generates a vector containing only the good, or non-NA, elements.
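As an aside, the intermediate vector is optional. The sketch below collapses the filter into one expression and shows two related base R idioms; the name good is mine.

```r
volumes <- c(3.75, NA, 82.01, 25.6, NA)

# Filter in a single expression, without naming the logical vector
good <- volumes[!is.na(volumes)]
print(good)

# TRUE counts as 1, so summing the logical vector counts the NAs
print(sum(is.na(volumes)))          # 2

# Many summary functions can skip NAs themselves via na.rm
print(mean(volumes, na.rm = TRUE))  # 37.12
```

Whether to filter first or pass na.rm = TRUE is largely a matter of taste; filtering is handy when the cleaned vector is needed more than once.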

Here are the same steps with a few printing steps interspersed in the R console to show what's going on. The console prompt is >, and typing just the name of an object makes the console print its contents:

> volumes <- c(3.75, NA, 82.01, 25.6, NA)
> volumes
[1]  3.75    NA 82.01 25.60    NA
> bad <- is.na(volumes)