Week 2 : DS4100

Using Big Data

Posted on January 21st, 2017

This week we spent a good amount of time discussing how important it is to write good algorithms and code in order to work effectively with truly big data. As mentioned in class, any company can now have access to, and a use for, thousands of different large streams of data, which results in millions to billions of unique data points that need to be parsed, cleaned, and analysed. Because of this, it can be hard to gain clear, valuable insights from one's data sets, as there are just too many different perspectives and angles from which to look at all the information gathered. That's why I think it's always best for a data scientist to know exactly what they want to determine with their analysis, and what they need to achieve that goal, so that they won't get sidetracked by all the other, unnecessary data they have access to.

A good example of this can be seen in an article from The Upshot, the New York Times' data analysis section. In May of 2015, they published an article about upward mobility in different counties of the United States. They based their discussion on a study done by Raj Chetty and Nathaniel Hendren, who had quite a broad range of economic and sociological data at their disposal. While it would have been easy to get overwhelmed by all this data, the researchers decided to focus only on the factors that were relevant to their research. They analysed how children born in different counties would perform in the future, based on historic data. They then normalized this to account for immigration and gender, and broke it down by the economic class that the child was in throughout most of their childhood. This gives a very interesting look into which American counties are the best for upward mobility, without getting bogged down in unnecessary details.

For example, I found that my home county is quite bad for upward mobility in general, and falls in the 11th percentile of the US. I also learned that our county here at Northeastern (Suffolk County) is not ideal for upward mobility either, falling far short of Norfolk County, which is in the 80th percentile nationally and is the best in Massachusetts. These were all interesting findings, and they were very accessible. Despite the massive data sets behind this research, those involved were able to focus on only the attributes that mattered, and thus used Big Data in a way that allowed them to make interesting, key observations.