05/02/2020

Scraping: what is it and how does it work?

Web scraping is a technique for collecting data from a website and storing it either locally or in a database. The procedure is fully automated and is carried out by dedicated software that can read and copy thousands of web pages in a very short space of time.

There are many possible reasons for using this procedure: some are purely personal, such as wanting to save a website locally in order to view it offline, while others are connected to SEO (Search Engine Optimisation).

How scraping works, in detail

The main purpose of scraping is to extract data from a web portal so that it can be easily analysed to obtain useful information. The bots that carry out the scraping simply simulate a human browsing the site, only much faster.

There are two different approaches that can be used: the first consists of making a complete copy of the pages analysed, which are then saved in an external database; the second involves an additional step: data processing.

In the latter case, once the program has read the page, it automatically extracts the data of interest to the user and saves only that data in the database. In this way, it is possible to carry out precise market analyses; for example, an e-commerce business can view the cities where certain products are purchased more frequently, the age range of visitors to the site and lots more.
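
The second approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample page, the ‘name’ and ‘price’ class names and the ‘products’ table are all invented for the example, and a real site would need its own extraction rules. The program reads the page, keeps only the product names and prices, and stores them in an in-memory SQLite database.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this.
SAMPLE_PAGE = """
<html><body>
<h1>Shop</h1>
<ul>
  <li><span class="name">Kettle</span> <span class="price">24.90</span></li>
  <li><span class="name">Toaster</span> <span class="price">31.50</span></li>
</ul>
<p>Footer text that is of no interest.</p>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collects (name, price) pairs and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.products = []   # finished (name, price) tuples
        self._field = None   # field currently being read, if any
        self._current = {}   # fields gathered for the item in progress

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li":
            # Keep the item only if both fields were found.
            if "name" in self._current and "price" in self._current:
                self.products.append(
                    (self._current["name"], float(self._current["price"]))
                )
            self._current = {}


def extract_products(html):
    """Return the list of (name, price) tuples found in the page."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


def store_products(conn, products):
    """Save only the extracted data, not the whole page."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    conn.executemany("INSERT INTO products VALUES (?, ?)", products)
    conn.commit()


conn = sqlite3.connect(":memory:")
store_products(conn, extract_products(SAMPLE_PAGE))
rows = conn.execute("SELECT name, price FROM products ORDER BY price").fetchall()
print(rows)
```

The heading and the footer text never reach the database, which is the difference from the first approach, where the whole page would be stored.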

The potential for the use of scraping is enormous and is actively contributing to the planning of SEO and marketing strategies. Nevertheless, these techniques can also be used for negative purposes, so it is essential to take the utmost care, especially when defining scraping permissions for a website.

Illegal use of scraping

Just like any other instrument, scraping can also end up in the hands of the unscrupulous who are only too eager to abuse its potential in order to cause considerable damage. Although the functioning of scrape bots is similar to that of search engine crawlers, they can be used illegally.

Not all websites allow scraping of their pages (to view authorisations, consult the ‘robots.txt’ file on the web portal or the terms of service page). In such cases, using scraping software can constitute a deliberate infringement of the site's terms or of the law, which could result in prosecution.
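
Checking the ‘robots.txt’ file can itself be automated. The sketch below uses Python's standard urllib.robotparser module; the robots.txt content, the ‘example-bot’ name and the example.com URLs are assumptions made for illustration. A real bot would load the file from the site (for instance with the parser's set_url and read methods) and would still need to respect the terms of service, which robots.txt does not cover.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch(SAMPLE_ROBOTS_TXT, "example-bot", "https://example.com/products/"))
print(may_fetch(SAMPLE_ROBOTS_TXT, "example-bot", "https://example.com/private/data"))
```

With these sample rules the first call reports that the products page may be fetched, while the second reports that the path under /private/ is off limits.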

Due to the heavy workload to which bots subject the servers of the websites analysed during their operations, it is possible for the criminally minded to set them up in such a way as to carry out a full-blown cyber-attack. By sharply increasing the number of requests sent to the server, it is possible to make the portal crash, block it or force it to display the ‘HTTP 500’ error message.

Scraping can also be used to copy a website and post its content before its real owner has put it online, so that the copy is indexed first (a commonly used practice in Negative SEO).
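
A legitimate bot does the opposite and deliberately limits the load it places on the server. The sketch below shows one simple way of doing so, by enforcing a minimum pause between consecutive requests. The Throttle class, its parameters and the URLs are hypothetical names chosen for this example, and the actual download is left out.

```python
import time


class Throttle:
    """Enforces a minimum interval (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Hypothetical list of pages; the pause is kept short for the demonstration.
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]
throttle = Throttle(min_interval=0.05)
for url in urls:
    throttle.wait()
    print("would fetch", url)  # a real bot would download the page here
```

In practice the interval would be far longer than in this demonstration, and some sites state the delay they expect in their ‘robots.txt’ file.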

How website scraping works

Carrying out website scraping is relatively straightforward. Once the authorisations granted by the portal in question have been verified, it is possible to use automatic scraping software, such as Teleport, HTTrack, or Simple HTML DOM, which will allow the necessary data to be gathered and stored offline.

Other, more sophisticated software created in ASP or PHP will allow the data to be saved directly to an online database. Google's APIs can also be used to analyse the SERP results of the search engine itself or specific keywords on a web page.
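
To give an idea of what storing a page offline involves, the sketch below maps a URL to a file path inside a local folder and writes the page there. It is not how any of the tools named above work internally: the function names, the folder layout and the sample page are assumptions made for this example, and the code does not guard against unusual or malicious URLs.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlsplit


def local_path_for(url, root):
    """Map a URL to a file path under root, e.g. root/example.com/shop/index.html."""
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"   # directory-style URLs get a default file name
    return Path(root) / parts.netloc / path.lstrip("/")


def save_page(url, html, root):
    """Write the page content to its local path and return that path."""
    target = local_path_for(url, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html, encoding="utf-8")
    return target


# Hypothetical page content; a real tool would download it first.
with tempfile.TemporaryDirectory() as mirror_dir:
    saved = save_page("https://example.com/shop/", "<h1>Shop</h1>", mirror_dir)
    print(saved.relative_to(mirror_dir))
```

Keeping the site's own folder structure in the local copy makes it easier to find a saved page again from its original address.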

Translated by Joanne Beckwith