Week 3: DS4100

Cleaning External Data

Posted on January 29th, 2017

This week in class we spent some time discussing how to import external data, whether in JSON or CSV form, and how that process requires data scientists to also clean the data: checking that it isn't filled with null values, incorrectly formatted fields, and outright bad records. This was very relevant to me, as on co-op I became well acquainted with working with data and spent much of my time figuring out why data was incorrect, or how to clean it so it wouldn't be.
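For a sense of what that first cleaning pass looks like, here is a minimal sketch in Python with pandas; the file name and column names are hypothetical, just stand-ins for whatever dataset you're importing:

```python
import pandas as pd

# Load a hypothetical CSV file; the file name and column names
# here are made up for illustration.
df = pd.read_csv("coops.csv")

# Drop rows missing required fields entirely.
df = df.dropna(subset=["name", "email"])

# Normalize an inconsistently formatted date column, coercing
# unparseable values to NaT so they can be inspected later.
df["start_date"] = pd.to_datetime(df["start_date"], errors="coerce")

# Filter out obviously bad data, e.g. negative ages.
df = df[df["age"] >= 0]
```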

My strongest example comes from an employee packet that the firm I worked for published internally to show new co-ops who their counterparts throughout the firm were. The company hired many, many co-ops, so this was a useful resource for finding others who worked there and seeing if you already had any connections with the other co-ops working at the time. My job was to take that packet, published internally as a PDF, and convert it into a nice visual webpage for co-ops to use. The sheer size of the file meant that I couldn't do the work manually; I needed a function that could generate the HTML for me dynamically from the information in the PDF. Sounds simple enough, but the work was anything but.
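The HTML-generation half of the job is the easy part to sketch. Assuming the packet had already been converted to a JSON list of records (the file and field names below are invented for illustration), the page could be built with something like:

```python
import json
from html import escape

# Hypothetical structure: the converted packet is a JSON list of
# co-op records with "name", "school", and "team" fields.
with open("coops.json") as f:
    coops = json.load(f)

def render_card(coop):
    # Escape the text so stray characters from the PDF conversion
    # can't break the markup.
    return (
        "<div class='coop'>"
        f"<h2>{escape(coop['name'])}</h2>"
        f"<p>{escape(coop['school'])}, {escape(coop['team'])}</p>"
        "</div>"
    )

page = "<html><body>" + "".join(render_card(c) for c in coops) + "</body></html>"

with open("coops.html", "w") as f:
    f.write(page)
```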

The reason for my difficulties was that when I first used programs to convert parts of the PDF to a usable JSON file, the output was full of false or mislabeled data. The PDF would be converted to the more code-friendly JSON, but so much unnecessary markup and so many irrelevant parts of the page came along with it that I had to take a closer look at the data. I realized the way the document was formatted led different parsers to treat many irrelevant parts of it as important, which badly hurt the accuracy of the data being produced.
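One way to cut down that kind of noise, assuming the parser's output is a JSON list, is to keep only the items that have the fields a real entry should have. A minimal sketch, with hypothetical field names:

```python
import json

# Hypothetical field names; a real entry in the packet should
# have all of these, while page headers, footers, and decorative
# text the parser picked up usually won't.
REQUIRED = {"name", "school", "team"}

def is_real_record(item):
    return isinstance(item, dict) and REQUIRED <= item.keys()

with open("converted_packet.json") as f:
    raw = json.load(f)

clean = [item for item in raw if is_real_record(item)]
```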

To resolve that issue, I realized I needed to manually go through the PDF and strip it down to its most basic, unformatted form, so that it was essentially just text data. This was not exactly ideal, but it did make my job much easier and more manageable. In the end, I realized that, while not the most exciting activity, cleaning data is a vital part of data analysis, and without doing it well one cannot accurately claim to have analyzed the data at all.
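Once the packet was reduced to plain text, parsing it became a simple line-oriented job. A sketch of what that might look like, assuming (hypothetically) that each entry is a block of three lines separated by blank lines:

```python
# Assumes each entry in the stripped-down text file is a block of
# three lines (name, school, team) with blank lines between entries.
def parse_entries(text):
    entries = []
    for block in text.strip().split("\n\n"):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) == 3:
            name, school, team = lines
            entries.append({"name": name, "school": school, "team": team})
    return entries

with open("packet.txt") as f:
    entries = parse_entries(f.read())
```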

(At least until software can do it all for us :) )