My Week in Review: Podcast, Resources and Warming up to R
I had an interesting week in the world of Data Science: I started learning some R programming, continued to find new and interesting data science learning resources, and ran into a surprising coincidence involving data science as practiced at my company.
Predictive Analytics and Infer
On Tuesday evening, as I left work, I listened to Paul Miller's The Cloud of Data podcast, which featured an interview with Infer CEO and founder Vik Singh about predictive analytics. Infer, simply put, uses data analytics to help companies get more customers. One of the companies in the short list Singh mentioned was New Relic! I had heard that our marketing organization was using some form of analytics, so although it was surprising to suddenly hear my employer's name mentioned, I wasn't shocked.
The next day I had lunch with someone from the marketing organization who hangs out in our internal Data Science chatroom. As it turns out, he works directly with the Infer product and on improving the predictive model! Needless to say, I had a few questions given what I had learned from the podcast.
Infer is a great example of the new type of company being formed these days: one focused on analyzing the wealth of public information available about just about anything. Infer creates a specific model for each of its customers, using the customer's proprietary data, and then brings in the weight of the public data available on the web and from other sources to help hone its predictions. I obviously cannot comment on the model for New Relic, but it is exciting to know that this forward-thinking approach is being used.
Learning Data Science
Each week I come across a new resource for learning data science. This week I'll share a couple of books and two great blogs.
Books
Learn R in a Day
I've been reading bits and pieces of this book as I've monkeyed around in RStudio. It's not comprehensive, but it is easy to follow and works well as a companion while you are starting to write code in R.
An Introduction to Statistical Learning
When I enrolled in a Stanford-based online course earlier this year, this book was the textbook. It explains concepts well and does not shy away from the math. I recommend it, but if you're like me and want to start programming right away, use it to get a deeper understanding of statistical learning, not as a primer on writing code.
Blogs
Machine Learning Mastery
Jason Brownlee's incredibly informative and helpful blog for learning about machine learning.
Data School
Kevin Markham is a student of data science and an entrepreneur. I really like his "dive right in" approach to learning which comes through in his blog.
Warming up to R
The first course module in The Data Science Specialization just covers installing the tools (R, RStudio, and git) and learning how to use git and GitHub. This first module runs for four weeks! So I've been a little bored and went looking for something to do.
I have been reading Learn R in a Day, along with bits and pieces of articles and other books. It was time to write some code, and I found a great place to start.
Kevin Markham, whose blog I mentioned earlier, is another student of data science who is maybe a few months ahead of where I am. He took (or is taking?) a Data Science course through General Assembly. Fortunately, one of the instructors, Aaron Schumacher, has put his course materials in his GitHub account. In there is a section entitled Preliminaries that contains an R source file and a Python source file (warmup.R and warmup.py) for "warming up."
I took up the challenge: implement the mean, variance, and covariance functions in R without using the built-in ones. The mean was easy, but what was more interesting was that the exercise used the "==" equality operator to compare the student's mean implementation to the built-in one. Since the data being used was floating-point, generated as
x <- rnorm(100)
exact equality will not work reliably. After digging around to find the best way to compare floating-point values in R, I created the following custom infix operator for temporary use in the program:
'%==%' <- function(x, y) {
  # Tolerant floating-point comparison instead of exact equality
  isTRUE(all.equal(x, y))
}
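For example (my own illustration, not part of the original exercise), the tolerant comparison behaves the way the warm-up needs it to:
x <- rnorm(100)
# Exact equality of two floating-point results can fail even when they are mathematically identical
sum(x) / length(x) == mean(x)        # sometimes FALSE
# The tolerant comparison succeeds
(sum(x) / length(x)) %==% mean(x)    # TRUE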
It was nice to receive a reply from Aaron Schumacher, who also accepted my pull request for the change.
I still haven't finished that "warm up" because it requires a lot of looking around and learning R as I go, but I'm getting there. Here is what I've done so far:
my.mean <- function(x) {
  # Mean: the sum of the values divided by how many there are
  sum(x) / length(x)
}

my.var <- function(x) {
  # Variance: the mean of the squared deviations from the mean
  m <- my.mean(x)
  a <- x - m
  a <- a * a
  my.mean(a)
}
My variance function, my.var, doesn't quite match the result of the built-in var function. I'll continue working on that.
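If I had to guess (I haven't verified this against the exercise), the mismatch comes from the denominator: R's built-in var computes the sample variance, dividing by n - 1, while my.var divides by n. A quick check along those lines:
x <- rnorm(100)
n <- length(x)
var(x)                     # built-in: divides by n - 1
my.var(x)                  # my version: divides by n
my.var(x) * n / (n - 1)    # rescaled; should match var(x) if that's the only difference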
Getting into R has been fun, and fortunately for an IDE-dependent developer like myself (I've been using Microsoft's Visual Studio since the mid-90s), the editor of choice for the courses I'm taking is RStudio. So far I'm impressed with it.
I've signed up for the second course in the Data Science Specialization series, R Programming. It doesn't start until July 7th but I'll be getting more into R before then.
One last note about the work I've been doing: up until today, all of the R programming I've done this week has been on a Surface Pro machine (see below). Some of the keys on the Type keyboard are showing signs of wear, so I may permanently switch over to my MacBook Pro. The Surface is nice for reading, though.