Week 1: DS4100

An introduction to the world of data

Posted on January 15th, 2017

This week we began by delving into what data collection and analysis entail, and concluded by getting introduced (in my case at least) to R and learning some of the language's unique intricacies. One of the most important aspects of data collection appears to be the 6 Vs - Velocity, Veracity, Volume, Validity, Variety, and Volatility - and the outsized role they can play in any data project.

These 6 Vs were especially interesting to me because of my most recent co-op, where a good portion of my job entailed data visualization and analysis work. To do this, I would take datasets the company had, extract them into Excel, and then use that data in Tableau. Across the many use cases I worked on, veracity was the biggest issue I had to face. A lot of the data I worked with was strictly quantitative, one example being the number of hours an employee worked on a given project. In theory this was good because it would be easier to analyse, especially since the data came through Excel. The issues stemmed from the fact that much of the information was self-reported, which, as this article shows, can undermine the quality of the results.

In this specific instance, the data could be skewed for two main reasons. Firstly, different managers had different reporting requirements, so while some people would report their hours on a project one way, others would use a different methodology, making it very hard to compare results. Secondly, it's possible that people adjusted the numbers or reported what they thought would look best, since this was something shown to management and something that directly reflected the results of their work. Of course it's possible that the data was reported completely accurately, but unfortunately there was no easy way to verify that. The experience taught me that, in nearly every case, it's very important to look into exactly how the data you have was created and obtained, as that can have a big impact on both the analysis you can perform and the overall accuracy of the data itself.
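Looking back, these are the kinds of checks I wish I had written down more formally at the time. Below is a rough sketch in R (the language we're just starting to pick up) of how I might look for those two problems in a timesheet-style data set; the column names and the numbers are entirely made up for illustration, not the actual data I worked with.

```r
# A sketch of two simple veracity checks on self-reported hours.
# The columns (manager, project, hours_reported) are hypothetical.

library(dplyr)

# Hypothetical self-reported timesheet data
timesheets <- data.frame(
  manager        = c("A", "A", "B", "B", "B", "C"),
  project        = c("X", "Y", "X", "X", "Y", "Y"),
  hours_reported = c(40, 38, 40, 45, 40, 37.5)
)

# 1. Do different managers' teams report on very different scales?
#    Large gaps in the averages hint at inconsistent reporting rules.
timesheets %>%
  group_by(manager) %>%
  summarise(
    avg_hours = mean(hours_reported),
    sd_hours  = sd(hours_reported),
    n_entries = n()
  )

# 2. What share of entries are suspiciously "round" numbers?
#    A very high proportion of multiples of 5 can suggest estimated,
#    rather than tracked, hours.
mean(timesheets$hours_reported %% 5 == 0)
```

Neither check proves anything on its own, but they give a concrete starting point for the kind of qualification I describe below.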

Of course, since this is a problem found in many if not most data sets, I also realized it was important not to treat it as something unsolvable. Instead, whenever I worked with data where veracity was an issue, I would still perform my analysis, but also include a qualification that pointed out why the data might be skewed or inaccurate. That way the analysis still gets done, but the end reader of the report doesn't overemphasize the results or misinterpret what's been produced.