Week 6 : DS4100

Web Scraping

Posted on February 19th, 2017

This week in class we spent a good amount of time discussing web scraping and web APIs, and the role they play in data analysis. It's an interesting topic because many web services and companies take strong measures to limit web scraping, including structuring their sites in ways that make it harder to scrape them for actual information. As discussed in class, one reason for this appears to be that it can be very taxing on a company's servers to have so many requests come in constantly from different programs. Along with that, a site would likely want to keep its information to itself, and not let other people easily collect it all.
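To make the server-load point a bit more concrete, here is a minimal sketch of how a well-behaved scraper might check a site's stated rules before requesting anything. This was not covered in class; it is just an illustration using Python's standard `urllib.robotparser` module. The robots.txt content, the `my-bot` user agent, the example.com URLs, and the `plan_requests` helper are all made up for the example, and nothing here touches the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt text that we already have in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def plan_requests(parser, user_agent, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) without making any requests.

    A real scraper would sleep for delay_seconds between fetches so it
    does not flood the site's servers.
    """
    delay = parser.crawl_delay(user_agent) or default_delay
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    return allowed, delay


parser = build_parser(ROBOTS_TXT)
allowed, delay = plan_requests(
    parser,
    "my-bot",  # hypothetical user agent name
    ["https://example.com/fares", "https://example.com/private/data"],
)
```

In this made-up case only the `/fares` URL is allowed, and the scraper would wait the requested two seconds between requests. Of course, robots.txt is only a request from the site owner; it does nothing to stop a scraper that chooses to ignore it.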

One recent example of this is RyanAir. In January they decided to suspend their holiday package after only a month because a software provider was illegally scraping the site. In this instance, RyanAir was most upset that the software provider was able to scrape their site and re-sell fares, a problem that affects many airlines in the industry. It's a problem because it can reduce the profit potential for these firms and allow others to profit off services that are not theirs.

A recent article in Information Age speaks to the security effects that web scraping can have. It describes how bots are expected to make up 46% of all web traffic, and how these bots can work very quickly to obtain content. The danger lies in what type of content they pick up. All sorts of industries can be affected by web scraping, including Real Estate, E-Commerce, and Travel (as seen with RyanAir). In many of these cases, the bots can even pick up private or sensitive information. To quote the article, "Diverse actors leverage web scraping bots, including nefarious competitors, internet upstarts, hedge funds, fraudsters, hackers, and spammers, to effortlessly steal whatever pieces of content they are programmed to find, and often mimic regular user behavior, making them hard to detect and even harder to block." Essentially any content that can be found online can be scraped, so site owners must be very careful in how they structure their sites. If they aren't, proprietary information can be taken, the integrity of content can be damaged, and work that takes a great amount of time to complete can be taken without warning.