Screen scraping: How to profit from your rival's data

[Image: flight departure board. Caption: Sites that sell time-sensitive data are often targeted by scrapers]

Some call it theft, others call it legitimately gathering business intelligence - and everyone is doing it.

Screen scraping might sound like something you do to the car windows on a frosty morning, but on the internet it means copying all the data on a target website.

"Every corporation does it, and if they tell you they're not they're lying," says Francis Irving, head of ScraperWiki, which makes tools that help many different organisations grab and organise data.

To copy a document on a computer, you select the text with a mouse or a keyboard command such as Control-A and then copy it with Control-C. Copying a website is a bit trickier because of the way the information is formatted and stored.

Typically, copying that information is a computationally intensive task that means visiting a website repeatedly to get every last character and digit.

If the information on that site changes rapidly, then scrapers will need to visit more often to ensure nothing is missed.

That is one reason why many websites actively try to stop screen scraping: it can take a heavy toll on their computational resources. Servers can be slowed down and bandwidth soaked up by scrapers scouring every webpage for data.

"Up to 40% of the data traffic visiting our clients' sites is made up of scrapers," says Mathias Elvang, head of security firm Sentor, which makes tools to thwart the data-grabbing programs.

"They can be spending a lot of money for infrastructure to serve the scrapers."

[Image caption: Betting aggregators often target the odds offered on particular sports events]

And that's the problem. Instead of serving customers, a firm's web resources are helping computer programs that have no intention of spending any money.

Data loss

What's worse is that those scrapers are likely to be working for your rivals, says Mike Gaffney, former head of IT security at Ladbrokes, who spent a lot of his time at the bookmakers combating scrapers.

"Ladbrokes was blocking about one million IP addresses on a daily basis," he says, describing the scale of the scraping effort directed against the site.

Many of those scrapers were being run by unscrupulous rivals abroad that did not want to pay to get access to the data feed Ladbrokes provides of its latest odds, he says.

Instead, they got it for free via a scraper and then combined it with similar data scraped from other sites to give visitors a rounded picture of all the odds offered by lots of different bookmakers.
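
As a rough illustration of that combining step, the sketch below merges prices that have already been collected from several bookmakers and picks out the best one for each outcome. The bookmaker names, outcomes and decimal odds are invented for the example, and the function names are hypothetical; no real site or data feed is involved.

```python
from collections import defaultdict

# Hypothetical, hand-written sample data standing in for scraped pages.
SCRAPED_ODDS = {
    "bookmaker_a": {"home win": 2.10, "draw": 3.40, "away win": 3.60},
    "bookmaker_b": {"home win": 2.05, "draw": 3.50, "away win": 3.75},
    "bookmaker_c": {"home win": 2.20, "draw": 3.30, "away win": 3.50},
}


def combine_odds(scraped):
    """Group every bookmaker's price under the outcome it applies to."""
    combined = defaultdict(dict)
    for bookmaker, prices in scraped.items():
        for outcome, price in prices.items():
            combined[outcome][bookmaker] = price
    return dict(combined)


def best_odds(combined):
    """Return, per outcome, the (bookmaker, price) pair with the highest odds."""
    return {
        outcome: max(prices.items(), key=lambda item: item[1])
        for outcome, prices in combined.items()
    }


if __name__ == "__main__":
    table = best_odds(combine_odds(SCRAPED_ODDS))
    for outcome, (bookmaker, price) in table.items():
        print(f"{outcome}: {price:.2f} at {bookmaker}")
```

An aggregator built along these lines is only as fresh as its last visit to each site, which is why fast-changing odds invite such frequent scraping.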

"It's important that your pricing information is kept as close to the chest as possible away from the competitor but is freely available to the punter," says Mr Gaffney.

The key, he says, is blocking the scraping traffic while letting the legitimate gamblers through.
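
One simple way to picture that kind of filtering is a per-address request counter. The sketch below is a minimal example under stated assumptions, not a description of what Ladbrokes ran: the class name, the threshold of five requests and the 10-second window are all made up for illustration, and the addresses come from ranges reserved for documentation.

```python
from collections import defaultdict, deque


class RateBlocker:
    """Refuse an address that makes too many requests inside a sliding window."""

    def __init__(self, max_requests=5, window_seconds=10.0):
        # Both limits are arbitrary illustrative values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # address -> timestamps of recent requests

    def allow(self, address, now):
        """Record a request at time `now` (in seconds); return True to serve it."""
        hits = self._hits[address]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        # Refused requests are recorded too, so a persistent bot stays blocked.
        hits.append(now)
        return len(hits) <= self.max_requests


if __name__ == "__main__":
    blocker = RateBlocker()
    # A hypothetical bot fires ten requests within one second.
    print([blocker.allow("203.0.113.5", t * 0.1) for t in range(10)])
    # A hypothetical human clicks three times over half a minute.
    print([blocker.allow("198.51.100.7", t) for t in (0.0, 12.0, 25.0)])
```

A check this basic only catches crude scrapers. A program that paces itself like a person, or spreads its requests across many machines, would stay under the threshold at every address.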

The sites most often targeted by scrapers are those that offer time-sensitive data. Gambling firms offering odds on sports events are popular targets as are airlines and other travel firms.

The problem, says Shay Rapaport, co-founder of anti-scraping firm Fireblade, is determining whether a visitor is a human looking for a cheap flight or an automated program, or bot, intent on sucking all the data away.

"It's growing because it's easy to scrape and there are so many tools out there on the web," he says.

The best scraping programs mimic human behaviour and spread the work out among lots of different computers. That makes it hard to separate PC from person, he adds.

In many countries scraping is not illegal, adds Mr Rapaport, so scrupulous and unscrupulous businesses alike indulge in it.

[Image caption: Scraping has helped make parliamentary debates and voting records more accessible]

"A lot of big companies scrape content," he says. "Sometimes it's published on the web and re-packaged and sometimes it's just for internal use for business leads."

Talking heads

Francis Irving, head of ScraperWiki, says that not all of that grabbing of data is bad. There are legitimate uses to which it can be put.

For instance, says Mr Irving, good scraping tools can help to index and make sense of huge corpuses of data that would otherwise be hard to search and use.

Scrapers have been used to grab data from Hansard, which publishes voting records of the UK's MPs and transcribes what they say in the Houses of Parliament.

"It's pretty uniform data because they have a style standard but it was done by humans so there's the odd mistake in it here and there," he says.

Scraping helped to organise all that information and get it online so voters can keep an eye on their elected representatives.

In addition, he says, it can be used to get around bureaucratic and organisational barriers that would otherwise stymie a data-gathering project.

And, he says, it's worth remembering that the rise of the web has been driven by two big scrapers - Google and Facebook.

In the early days the search engine scraped the web to catalogue all the information being put online and made it accessible. More recently, Facebook has used scraping to help people fill out their social network.

"Google and Facebook effectively grew up scraping," he says, adding that if there were significant restrictions on what data can be scraped then the web would look very different today.
