Is Web Scraping Ethical?

Last week I released the UkBuses Ruby Gem. It scrapes real-time data from the Traveline site and allows developers to use the data returned as a Hash. It quickly became the second most-starred Ruby repo of the week on GitHub and sparked some interesting debate on Hacker News. There are strong feelings on both sides of the debate but basically no legal precedent on the subject. After debating it with friends and colleagues, it’s clear that nobody really knows. First of all, let’s make the question up for debate very clear:

Should it be OK, both legally and morally, for someone to download the HTML output of a non-authenticated webpage and convert it to a format that makes it more easily usable for non-commercial purposes?

There are two important parts to the question that I have posed:

  1. non-authenticated – In other words, assume that any data to be scraped will respond to a non-authenticated HTTP request and that a typical end-user of that data has not needed to enter any sort of credentials (OAuth, HTTP Basic or anything!).

  2. non-commercial – The use of the data scraped should not ultimately be for the generation of revenue, profitable or not.

My personal belief is that if either of the two previous conditions has been violated then there are serious ethical concerns that need addressing. In the case of the UKBuses gem, the data conforms to both of those conditions, although there is nothing to stop someone from using the gem for commercial means.

So here are the arguments that have been put to me and my responses to them. I don’t claim to have the right answers and simply seek to spark debate on the subject.


Against

“[Company] has paid money to aggregate the data. Why do you have the right to take the data and make it available for free?”

This is the strongest argument I have heard so far, effectively comparing data scraping to theft, making it both illegal and immoral. However, I would only agree with this if the company that held the data made an attempt to restrict access to it by use of a paywall or other authentication, since we have already agreed that in such cases scraping would be immoral (and probably illegal).

However, in the case of the real-time UK bus data, the data is aggregated for the sole purpose of making that data available to the general public at bus stops and on the Traveline website. Therefore, one can only assume that no money has been lost that would otherwise have been made. So in this case, I would propose a counter question:

“Although [company] has paid money to aggregate the data, it is used on a daily basis by members of the general public, so why wouldn’t I have the right to make the data available in other formats?”


For

“By heavily restricting access to the data and providing no public API, [company] effectively shields itself from public accountability.”

Public transport in the UK often gets a bad time in the press for being late. Justifiably. The UK has one of the least punctual and most expensive public transport systems in the world. Yet, despite central government’s push to open up more data, the privately owned public transport companies have bucked this trend, making real-time data nearly impossible to get hold of.

It is, unfortunately, the transport companies themselves that participate in mutual back-slapping for a job well done when they produce their own reports into their punctuality. My old local train company claimed 90% punctuality in the last month that I used their services. Speaking as an admittedly unscientific sample of one, I would love to know how they got to that figure. There was a point where I could organise my day by the very fact that my train from Bournville to Birmingham New Street would be 4 minutes late.

But I digress. The point is that open data creates accountability. I would love to see someone use the UKBuses gem to aggregate data on how late buses are and see if it matches up with the figures given by the companies themselves.
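
To give a rough idea of what that kind of check might look like, here is a minimal Ruby sketch. Everything in it is an assumption for illustration: the record shape (`:route`, `:scheduled`, `:actual`), the sample arrivals, the helper names and the two-minute "on time" threshold are all made up, and none of it reflects the gem's real output or how any operator actually measures punctuality.

```ruby
require "time"

# Hypothetical record shape and made-up sample data. These keys are an
# assumption for illustration, not the actual output of the UkBuses gem.
ARRIVALS = [
  { route: "11A", scheduled: "2012-03-01 08:00", actual: "2012-03-01 08:01" },
  { route: "11A", scheduled: "2012-03-01 08:15", actual: "2012-03-01 08:19" },
  { route: "45",  scheduled: "2012-03-01 08:20", actual: "2012-03-01 08:20" },
  { route: "45",  scheduled: "2012-03-01 08:40", actual: "2012-03-01 08:52" }
].freeze

# Minutes late for one record (a negative value would mean early).
def minutes_late(record)
  (Time.parse(record[:actual]) - Time.parse(record[:scheduled])) / 60.0
end

# Percentage of arrivals that were no more than `threshold` minutes late.
# The default threshold of 2 minutes is an arbitrary choice for this sketch.
def punctuality(records, threshold: 2)
  return 0.0 if records.empty?

  on_time = records.count { |r| minutes_late(r) <= threshold }
  (100.0 * on_time / records.size).round(1)
end

puts punctuality(ARRIVALS)
```

The interesting part is how much the answer depends on the threshold: with this sample data, loosening it from two minutes to five (`punctuality(ARRIVALS, threshold: 5)`) counts one more bus as on time, which is exactly the sort of detail a self-published figure can hide.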

Central and local government in the UK are opening up more and more data because they have realised that it creates accountability. I would argue that where there is a strong public interest (I hate that phrase), certain companies should be compelled to open such data. Public transport should fall into that category.

Yes, I know that the data is technically available, as long as you fill out a lengthy form, comply with various rules and in some cases provide a deposit for any ‘damages’ caused by whatever your application of the data happens to be. The documentation is in huge PDF files with lengthy examples for how to retrieve the data. This is boring and stifles innovation. Coders like data streams to play around with; it’s part of the fun of coding. Companies with a strong interest in the general public should understand this and embrace it.
