Plugavel.

How to use Beautiful Soup to do web scraping with Python?

2 March 2023
in Tech
The Internet is a huge source of information. An in-depth analysis of web data can reveal a great deal: trends, changes in public tastes, the average price of a given product…

Let’s say you want to retrieve price and rating information from a huge merchant site. What are you going to do? You could, of course, ask a trainee to go through the pages one by one and copy the information into an Excel table. Chances are the whole two-week internship would be spent on it. With a web scraping tool such as the Beautiful Soup library for Python, this task can be automated and you will get the desired result in a few minutes!

There is a catch, though: the websites of large companies are organized to give visitors a pleasant experience. They are in no way designed to make it easy for other companies, usually competitors, to analyze their data. So be warned: web scraping is a complex discipline that involves a good deal of preparation before Beautiful Soup itself comes into play…

What is web scraping?

Web scraping, or web data extraction, refers to technologies that help to:

  • extract content from the Web in an automated way;
  • convert it into formats that analysis applications can use: Excel, Google Sheets, Open Office Calc…

To perform web scraping, you must first know what type of data you are looking for. You then need to find a site where that information appears. Many sites are teeming with information of all kinds, especially merchant sites: items in a given category, prices, ratings given by buyers… The problem is that, as it stands, this data cannot be used by a data analyst. It has to be extracted and turned into Excel tables so that it can be sorted by various criteria, averaged, plotted…

However, each site is organized in its own way. Sometimes a tool has to be trained to locate certain types of data on the web pages it crawls. Often, you will have to do this exploration yourself. None of this is simple, but it is worth the effort.

If you wish to use Beautiful Soup for web scraping, be aware that it will generally be an arduous task, because in practice you will have to master four complementary disciplines.

HTML and XML

Web pages such as the one you are currently viewing are written in a language called HTML (and sometimes in XML, another language based on the same principle). HTML is a markup language: the text or media that appears on the screen is marked up using tags. For example, a page title is placed between the two tags <h1> and </h1>.

The HTML code:

<h1>All about black holes</h1>

thus gives the text “All about black holes” the style of a level 1 title. In the main browsers, such as Chrome or Firefox, you can display this HTML code by pressing Ctrl+Shift+I.
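As a minimal sketch of how Beautiful Soup sees such a tag, the snippet below parses the heading from the example above. It assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser`; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = "<h1>All about black holes</h1>"

# "html.parser" is the HTML parser that ships with Python itself.
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1")        # first <h1> tag found in the document
heading_text = heading.get_text()  # text between the opening and closing tags

print(heading.name)
print(heading_text)
```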

If you want to master web scraping, it is therefore essential to have some notions of HTML and to be comfortable with this code, because you will have to hunt for the information you want inside what looks at first like cryptic gibberish. HTML can be daunting at first, but it is mostly a matter of practice.

Imagine that you want to locate information such as product names. By analyzing a merchant site, you may discover that they are identified by HTML attributes such as:

  • class="product-name";
  • class="product_ide".

For prices, it could be:

  • class="prix_ttc";
  • class="current-price".

In other words, whoever created the web page set up their own naming scheme, and you need to know how to decode it. The operation itself is not excessively complex because, remember, you can point to an element of a web page and bring up the corresponding HTML code. However, you do need some familiarity with HTML to make progress.
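Once the class names are known, looking them up takes only a few lines. The sketch below runs on a hypothetical product listing; the markup, the `product` wrapper and the sample values are assumptions made for illustration, and a real site will use its own class names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; real merchant sites use their own structure.
html = """
<div class="product">
  <span class="product-name">Desk lamp</span>
  <span class="current-price">24.90</span>
</div>
<div class="product">
  <span class="product-name">Bookshelf</span>
  <span class="current-price">79.00</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The keyword is class_ (with an underscore) because "class" is reserved in Python.
names = [tag.get_text(strip=True) for tag in soup.find_all(class_="product-name")]
prices = [float(tag.get_text(strip=True)) for tag in soup.find_all(class_="current-price")]

print(names)
print(prices)
```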

Be aware, however, that some websites (about 20%) are particularly difficult to “scrape”. They were intentionally designed that way, and the difficulty can discourage a data analyst from attempting to explore them.

Another concern: some websites change their structure, and therefore their HTML, from time to time. A program written with Beautiful Soup can therefore stop working after a major site update.

Submit a request to the website

Before you can analyze a website with Beautiful Soup, you must first send a request to the site in order to access its data. A Python library such as Requests, or an API (application programming interface) such as a REST API, is used to reach the server concerned.
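A request with the Requests library might look like the sketch below. It assumes the `requests` package is installed; the function name `fetch_html` and the User-Agent string are hypothetical, and the function is only defined here, not called, so nothing is sent over the network.

```python
import requests

# Hypothetical identifier for your scraper; many sites expect a User-Agent header.
USER_AGENT = "example-scraper/0.1"


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML as text."""
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    # Raise an exception for 4xx/5xx answers, e.g. when the site refuses access.
    response.raise_for_status()
    return response.text
```

The returned string is what would then be handed to Beautiful Soup for parsing.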

It is worth knowing that certain sites, in particular fnac.com, Facebook and LinkedIn, are closed to web scraping. You can check for yourself by consulting the robots.txt file of these websites. Anyone who bypasses the protections in place and extracts information from such sites could face legal risks.
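Python's standard library can read robots.txt rules for you. The sketch below parses a made-up robots.txt held in a string, so it runs without any network access; the rules, the URLs and the `example-scraper` user agent are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) tells whether the rules allow that URL.
allowed_products = parser.can_fetch("example-scraper", "https://example.com/products/")
allowed_private = parser.can_fetch("example-scraper", "https://example.com/private/page")

print(allowed_products, allowed_private)
```

For a live site, you would instead call `parser.set_url(...)` with the address of its robots.txt file, then `parser.read()`.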

Also be aware that some sites, although initially open to exploration, close access when data is extracted too often for their taste from the same IP address.
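One common precaution is to pause between requests. The helper below is a sketch under stated assumptions: `fetch_all` and its parameters are hypothetical names, the two-second default is arbitrary, and no particular delay guarantees that a site will not block you.

```python
import time


def fetch_all(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages
```

The `fetch` argument can be any function that takes a URL and returns the page content.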

Organize data with Beautiful Soup

Once access to the site is obtained, Beautiful Soup can return the information in an organized, neatly presented form. Without such a tool, what you get back is a block of text that is extremely difficult to decipher. That is exactly what Beautiful Soup does: it turns this HTML “soup” into something presentable and therefore easier to analyze, which is where its name comes from. The tool was created by programmer Leonard Richardson in 2004.
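To give an idea of what “organized” means here, the sketch below feeds Beautiful Soup a single unindented line of HTML and then reads the page title and the links from the resulting tree. The page content is invented for the example.

```python
from bs4 import BeautifulSoup

# One long line of HTML, as it might arrive from a server (made-up content).
raw = (
    '<html><head><title>Shop</title></head><body><p>See '
    '<a href="/lamps">lamps</a> and <a href="/shelves">shelves</a>.</p></body></html>'
)

soup = BeautifulSoup(raw, "html.parser")

page_title = soup.title.string  # text of the <title> tag
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(page_title)
print(links)
print(soup.prettify())  # the same document, re-indented one tag per line
```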

Last point: in order to be able to process the data thus extracted, it is important to be able to export them to an easily readable table in Excel. The format commonly used to do this is CSV, or sequences of information separated by a particular sign, usually a comma or a semicolon. The first row of such a table usually contains the column headings.

Here is, for example, what a CSV file looks like:

  First name,Last name,Media
  Sabrina,Bounali,Art
  Fabien,Buchard,CNET
  Claudie,Gaminole,C News
  Geraldine,Guantana,D17
  Maylis,Kessyie,France 2

Opened in a tool like Excel or Google Sheets, this information appears as a clear table:

  First name   Last name   Media
  Sabrina      Bounali     Art
  Fabien       Buchard     CNET
  Claudie      Gaminole    C News
  Geraldine    Guantana    D17
  Maylis       Kessyie     France 2

Data extracted from a website with Beautiful Soup can then be written to a CSV file, for instance with Python's standard csv module.
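As a rough illustration of the CSV step, the sketch below writes the sample table to CSV text with Python's `csv` module and reads it back. It writes to an in-memory buffer so that it runs anywhere; the variable names are illustrative.

```python
import csv
import io

rows = [
    ("Sabrina", "Bounali", "Art"),
    ("Fabien", "Buchard", "CNET"),
    ("Claudie", "Gaminole", "C News"),
    ("Geraldine", "Guantana", "D17"),
    ("Maylis", "Kessyie", "France 2"),
]

buffer = io.StringIO()
writer = csv.writer(buffer)  # comma-separated by default
writer.writerow(["First name", "Last name", "Media"])  # header row
writer.writerows(rows)
csv_text = buffer.getvalue()

# Reading the text back: each record becomes a dict keyed by the header row.
records = list(csv.DictReader(io.StringIO(csv_text)))

print(csv_text)
```

To produce a real file, replace the buffer with `open("people.csv", "w", newline="", encoding="utf-8")`, where the file name is of course up to you.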

What does a Web scraping procedure with Beautiful Soup consist of?

Let’s recap. A web scraping action usually involves four steps:

  1. Inspect the site and its coding (in HTML) in order to identify the elements to be extracted.
  2. Send a request to the site; it may refuse automated access.
  3. Extract the information you want in an organized way with Beautiful Soup.
  4. Export this data in the form of a table that can be used in an application like Excel.
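Steps 1, 3 and 4 can be sketched together as follows. Step 2, the request, is left out so that the example runs offline: a hypothetical HTML sample stands in for the downloaded page. The class names, the `product` wrapper and the function name `products_to_csv` are assumptions for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup


def products_to_csv(html):
    """Extract (name, price) pairs from product blocks and return CSV text."""
    soup = BeautifulSoup(html, "html.parser")
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "price"])
    for product in soup.find_all(class_="product"):
        name = product.find(class_="product-name")
        price = product.find(class_="current-price")
        if name is None or price is None:
            continue  # skip blocks that lack one of the two fields
        writer.writerow([name.get_text(strip=True), price.get_text(strip=True)])
    return buffer.getvalue()


# Stand-in for a page obtained in step 2 (made-up content).
sample_html = """
<div class="product">
  <span class="product-name">Desk lamp</span>
  <span class="current-price">24.90</span>
</div>
<div class="product">
  <span class="product-name">Bookshelf</span>
  <span class="current-price">79.00</span>
</div>
<div class="product"><span class="product-name">No price yet</span></div>
"""

csv_text = products_to_csv(sample_html)
print(csv_text)
```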

What uses for Beautiful Soup?

Beautiful Soup is used in many industries that need to analyze data from the Web, for purposes such as:

  • product trend monitoring;
  • analysis of competitor performance;
  • analysis of public sentiment in relation to a given theme;
  • evolution of a market such as that of real estate;
  • yield estimate (rental, value of a property for sale, etc.);
  • price monitoring;
  • investment, etc.

What are the pros and cons of Beautiful Soup?

The benefits of Beautiful Soup are multiple:

  • Beautiful Soup appeared in 2004 and it has undergone many evolutions. It is therefore a “mature” tool, which is in its fourth version and covers the needs of Web extraction well;
  • once you are familiar with its commands, searching a web page for a given type of data takes only a few lines of code;
  • due to its age, it has extensive documentation;
  • a large online community offers solutions to problems that one may encounter when using Beautiful Soup.

The drawbacks are that using it requires familiarity with other concepts, in particular the HTML code of web pages, and that preparing a web scraping procedure can be tedious.

The ability to extract information from a site so that it can be analyzed is valuable, so the temptation to add Beautiful Soup to your data scientist CV will be strong. However, as we have seen, solid training is necessary, since it goes beyond simply mastering Beautiful Soup’s commands. You should therefore plan for one or more weeks of training.

What are the alternatives to Beautiful Soup?

Octoparse

A web scraping tool that is very easy to use and whose main advantage is that it requires no programming knowledge. The ease with which it can analyze certain sites, such as Amazon, is a real plus. What’s more, the paid version of Octoparse includes many “templates” (predefined models) adapted to major sites.

Web Scraper

This Chrome extension claims some 500,000 users. Just like Octoparse, it requires no programming. It takes longer to get to grips with than Octoparse, but once you have understood the principle, Web Scraper is practical to use since it is built into the browser.

Scrapy

Scrapy is an open source tool which, just like Beautiful Soup, runs under Python. It also lets you program website analyses. One of its strengths is that it can process requests asynchronously, and therefore access several target pages simultaneously and very quickly. It has a large community of users, which can be valuable when trying to accomplish an unusual task. On the other hand, it takes longer to get to grips with than Beautiful Soup.

© 2020 - 2023 Plugavel - News about technology and cars on one site Plugavel.