Whatever activity we practice, possessing the right information can change the course of things. The movie Everyone Says I Love You offers an example. The patient played by Julia Roberts confides in her therapist, and these confidences are spied on by young girls who have drilled a hole through a wall. Joe, the character played by Woody Allen, benefits from this: he serves the beauty exactly what she wants to hear and is thus able to seduce her.
Having crucial information is so valuable that a major part of intelligence activity is built around obtaining it, and another part around spreading false information in the hope that other governments will believe and act on it. But one does not need to operate on such a scale to feel the need for a few crucial tips. In a video game, knowing the weak point of the monster you have to face can make the difference between victory and game over.
In reality, the Web is a mega-reservoir of data, and one in permanent flux. The trouble is that this data often escapes rational analysis. Sites such as Google Trends show us current trends. Thus, on the day these lines were written, when the topic of the day might have seemed to be the vote on a motion of censure in the Assembly, it was, curiously, the word “spring” that most aroused the curiosity of Internet users.
By nature, the Web is thus the reflection of a multifaceted activity, an immense ballet that involves millions of protagonists. Knowing how to read the trends hidden behind this whirlwind can therefore be invaluable.
This is precisely what one technology makes possible: Web scraping, or the extraction of content from the Web.
What exactly is Web scraping?
The term Web Scraping encompasses various technologies dedicated to:
- automated extraction of structured information from a website;
- its formatting into an easily usable form, such as an Excel spreadsheet.
To better understand what this activity covers, imagine that it is carried out “by hand”.
You instruct an employee or a trainee to explore online sales sites such as Fnac, Amazon, CDiscount, Leboncoin and others in order to record, for every audio headset, its price, the number of stars assigned by users, its sales ranking… The objective: copy this information into an Excel table thousands of lines long, a table that can then be analyzed with the appropriate statistical tools or sorted by whichever column you wish.
Done by hand, this task would normally take several days. If, on the other hand, the person in charge of the mission can rely on a web scraper, the task can be completed in a much shorter time, and it can be restarted at will, for example every Monday. Each time, the tool will browse tens of thousands of pages and bring back its harvest automatically.
That said, do not assume the task will necessarily be easy.
The user of a Web scraper must first specify the address of the site or of the pages to be explored. It is then necessary to indicate, very precisely, the fields to be analyzed:
- name of an item;
- category;
- price;
- ratings…
The Web scraper will then explore the indicated pages and most often produce a .CSV file that can be opened in Excel or Google Sheets.
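To give a concrete idea of this output step, here is a minimal Python sketch that writes a few harvested records to a .CSV file readable by Excel or Google Sheets. The field names and sample values are purely illustrative assumptions, not taken from any real site.

```python
import csv

# Hypothetical records, as a scraper might collect them (illustrative values only).
records = [
    {"name": "Headset A", "category": "Audio", "price": "59.99", "rating": "4.2"},
    {"name": "Headset B", "category": "Audio", "price": "129.00", "rating": "4.7"},
]

# Write the harvest to a .CSV file that opens directly in Excel or Google Sheets.
with open("harvest.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "category", "price", "rating"])
    writer.writeheader()       # first row: column headings
    writer.writerows(records)  # one row per item
```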
What are the typical uses of Web Scraping?
Here is a non-exhaustive list of activities that can benefit from web scraping.
Price comparison
If you are marketing a given type of item, you will be keen to compare the prices offered by the major e-commerce sites, ideally on a day-to-day basis.
Statistics and trends
For many companies and executives, it is valuable to know the search trends of Internet users in a given field, or how demand is evolving in a particular sector.
Affiliation
Some sites pride themselves on finding good deals and directing their visitors to them. This is known as affiliation, because the advice site is paid according to the traffic it brings to the destination site. Web scraping naturally helps to identify the good deals in question. This activity is found in particular in real estate.
Opinion analysis
The media, politicians, but also companies are curious to know how the public feels about a theme. Web scraping can therefore help decode the general mood on a subject at a specific moment.
Types of web scraping
Web browser extensions
Quite often, web scraping is carried out from a web browser such as Chrome or Firefox. The advantage: the user can capture information “on the fly”. If he visits a site he deems worthy of interest, he can activate a web scraping extension and configure it by inspecting how the information is presented.
Custom app
In some cases, the most appropriate approach is to create a custom Web scraping tool, using a language such as Python. The downside is that it requires a good prior mastery of programming.
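As an illustration, here is a minimal sketch of such a custom tool, built with the widely used requests and BeautifulSoup libraries. The target URL and the CSS selectors are assumptions chosen for the example; a real site would require inspecting its pages to find the right selectors and handling items where a field is missing.

```python
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.example.com/audio-headsets"  # hypothetical listing page

# Fetch the page; a browser-like User-Agent avoids some trivial blocks.
response = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

rows = []
# The CSS classes below are assumptions; they must match the real page structure.
for item in soup.select("div.product"):
    rows.append({
        "name":   item.select_one("h2.title").get_text(strip=True),
        "price":  item.select_one("span.price").get_text(strip=True),
        "rating": item.select_one("span.stars").get_text(strip=True),
    })

# Save the harvest as a .CSV file, ready for Excel or Google Sheets.
with open("headsets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "rating"])
    writer.writeheader()
    writer.writerows(rows)
```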
Dedicated software
Many Web scraping programs are on the market, and they usually offer more extensive configuration options than browser extensions. It is usually necessary to dedicate a high-capacity machine to this activity, which can run for several hours at a time. Indeed, a web scraping tool may be required to analyze millions of pages.
Web scraping on the Cloud
Some Web scraping service providers let you operate from their own servers, and therefore without burdening your local computing resources.
Obstacles to web scraping
Once the mission to be accomplished by a web scraper has been clearly defined, it will normally carry out its task quickly. The preparatory stage, however, is likely to be long, for multiple reasons.
Websites are designed to be pleasant for their visitors to consult. This user-friendliness factor is essential. Website creators, however, care little about crawling and analysis programs such as web scrapers. In fact, if they can make life harder for them, they will not hesitate to do so: why let competitors benefit from their treasure trove of customer data?
Thus, a web scraping application must identify the specific sections of a web page that are of interest to the analyst. To do this, it is sometimes necessary to dig into the internal code of a web page and to have at least a minimal mastery of the languages used to create it: HTML, CSS, JavaScript, XML…
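The following sketch illustrates the point: extracting a price from a page requires knowing exactly where it sits in the markup. The HTML fragment and the selector are invented for the example.

```python
from bs4 import BeautifulSoup

# A simplified, invented fragment of a product page.
html = """
<div class="product" data-sku="A123">
  <h2 class="title">Audio headset X</h2>
  <span class="price">59,99 €</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Without knowing that the price lives in <span class="price">,
# there is no reliable way to tell the scraper what to extract.
price = soup.select_one("div.product span.price").get_text(strip=True)
print(price)  # 59,99 €
```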
Some websites use “captchas” to verify that they are being visited by humans and not by robots. Various automatic captcha-bypass systems exist, with varying degrees of effectiveness. It also happens that some sites analyze the behavior of certain visitors and identify a “bot” such as a Web scraper, in which case they block its access. Some services manage to circumvent such limitations, for example by multiplying the IP addresses from which the Web scraper appears to connect and by spacing out its requests in a way that looks natural.
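As a rough illustration of the “spacing out requests” idea, here is a hedged Python sketch that introduces random pauses between page fetches and varies the User-Agent header. The URLs, header strings and delay values are arbitrary assumptions; IP or proxy rotation would require a dedicated service and is not shown.

```python
import random
import time
import requests

# Hypothetical list of pages to visit.
urls = [f"https://www.example.com/listing?page={n}" for n in range(1, 6)]

# A small pool of browser-like User-Agent strings (illustrative values).
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

for url in urls:
    headers = {"User-Agent": random.choice(user_agents)}
    response = requests.get(url, headers=headers, timeout=10)
    print(url, response.status_code)

    # Pause for a random 2 to 8 seconds so the traffic looks less mechanical.
    time.sleep(random.uniform(2, 8))
```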
It also happens that some sites protect their information legally, and in that case, practicing web scraping can be extremely risky.
In France, the law favors companies that consider it appropriate to protect themselves against web scraping. Article 323-3 of the Penal Code states that fraudulently extracting data from an automated processing system is punishable by five years’ imprisonment and a fine of €150,000. However, the term “fraudulent” needs to be clarified. The company Leboncoin.fr took a dim view of a competing site extracting data from its real estate ads, and the courts ruled in its favor, a decision confirmed on appeal on February 2, 2021.
The moral: before embarking on a web scraping project, it is essential to check this legal factor. And in any case, hiding your IP address is highly recommended.