Webscraping with BeautifulSoup – how to export data from a website without an API?

Last month I took part in a great Natural Language Processing course organized by DataWorkshop. One of the tasks we could challenge ourselves with during the course was trying to predict the prices of flats based on their descriptions. The dataset for the exercise was provided by the organizers. The course has given me tons of ideas for blog posts and for smaller and bigger experiments. One of them, although not directly connected with NLP, was trying to obtain such a dataset on my own by webscraping.

I have done some research into various methods available on the Internet and learnt how to scrape data from websites such as Google, Facebook, Twitter or Instagram. I have also reminded myself how I once scraped Spotify data, described in this post. It's not as complicated as it seems. But what if there is no API for a specific website? No worries. The BeautifulSoup library comes to the rescue. It's a Python package for parsing HTML and XML documents. Basically, BeautifulSoup's text attribute returns a string stripped of any HTML tags and metadata. If you know the basics of CSS and its selectors, it's quite simple. If not, there are plenty of courses available online, for free on YouTube (I highly recommend this one!) or some cheaper ones on Udemy.
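As a tiny illustration of that text attribute (a made-up HTML fragment, not part of the scraper we build below):

from bs4 import BeautifulSoup

html = "<div class='price'><strong>250 000 zł</strong></div>"  # hypothetical fragment
soup = BeautifulSoup(html, 'html.parser')
print(soup.text)          # 250 000 zł  (all tags stripped away)
print(soup.div['class'])  # ['price']   (attributes remain accessible when needed)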

Webscraping possibilities

The Internet is the largest collection of data gathered by mankind: countless layers of scientific materials, articles and photos, and lots of knowledge available for free. Gathering that knowledge manually is a very time-consuming process: go to the website, copy the necessary fragments, write them down in some notebook. If there is a lot of data spread over many similar pages, doing it by hand is simply a waste of time. Investing in the stock market or bidding in online auctions requires quick reactions and constant price monitoring, and navigating to the pricing pages over and over can be tiring and time-consuming.

This is where webscraping comes in handy. A programmed bot will follow the path we set, download the indicated data and even perform an initial analysis or grouping of it. Of course, programming the bot can itself be time-consuming, but very often a few hours spent on creating a scraper will save us a few days of manual data downloading.

Is it always a good idea?

Is webscraping always applicable? Definitely not. When entering a website, we do not know at first glance whether individual pieces of data may be scraped. Most websites have Terms of Use (ToU). If a document with such a name does not contain the information you are looking for, you can usually find it in the website's regulations. Webscraping against the rules can get a user blocked, and further attempts to visit the website may result in a permanent ban for a given address.

Theoretically, we should not download any data available on a website without that website's consent, nor aggregate and process it further. In practice, this means that we cannot, for example, build our own database of cars from data collected from the advertisements of a website with automotive offers.

Fortunately, it is perfectly legal to scrape publicly available data from websites and use it for analysis. However, it is not legal to scrape confidential information for profit. For example, scraping private contact information without permission and selling it to a third party for profit is illegal. So if we do it just for statistics or our own learning purposes, we can definitely remain calm. 🙂

In which city are the cheapest apartments by the sea?

As mentioned, I have decided to scrape data on apartments for sale. I came across a great tutorial showing how to do this. I have chosen a Polish website – OLX – where users can upload various items for sale. Those are not only apartments; they can be cars, furniture or even pets, but we'll focus on this category today. Huge apologies to all non-Polish readers! I was interested in my area, so I used a Polish webpage. However, to understand the method described here, you don't need to know Polish. Although the website is Polish, you can apply the code to any kind of website. 🙂

OLX website – real estate category

Let’s scrape data about apartments located in my area (Poland, Pomeranian Voivodeship) and see where we can find the best bargains. I will scrape the titles, locations, sizes and prices of apartments. Since we rely on the HTML structure, our code will work as long as the structure of the page stays the same.

Webscraping initiation

We need to import the necessary libraries. However, BeautifulSoup is not a library that visits the site. It only parses, converting HTML into a form from which we can pick out information. So we need something that will retrieve this data from the website for us – the requests library. We can then choose any information from the page that interests us. In our case, it will be the title, location, apartment area and price.

from bs4 import BeautifulSoup
from requests import get

Since I am only downloading data, we will only be interested in the GET method. So let's create a variable that holds the address and fetch the page.

URL = "https://www.olx.pl/nieruchomosci/mieszkania/sprzedaz/pomorskie/"

page = get(URL)
bs = BeautifulSoup(page.content, 'html.parser')

Now we need selectors to pick data from the page. BeautifulSoup can only operate on content that is delivered to us directly as HTML. If the page were rendered using JavaScript, we would have a problem getting the content; in that case we could use Selenium, which imitates our browser. But that's for another case.
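To give a rough idea of what that would look like, here is a minimal sketch of the Selenium approach (an assumption on my side: it requires Selenium and a matching ChromeDriver installed, and we do not need it for OLX, where plain requests is enough):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()      # opens a real browser window (ChromeDriver must be on PATH)
driver.get(URL)                  # the browser executes the JavaScript for us
bs = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()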

Selectors

You can inspect the structure of any webpage with the “Inspect” option (right mouse click). OLX uses a rather old HTML setup with tables. Bear in mind that not every website will look like this, because technologies change. Anyway, the Internet is full of similar tutorials and guides, and I am sure you'll always find the right tips to apply webscraping to the site of your choice, even if it follows a more modern structure than the one in my example.

Please also remember to check out the BeautifulSoup documentation. We'll use it for just one case, but there are certainly more options and selectors worth investigating.
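For instance, besides find() and find_all(), BeautifulSoup also understands plain CSS selector strings through select() and select_one(). A small illustration (not used further in this post):

all_links = bs.select('table a')        # every <a> nested inside a <table>
first_strong = bs.select_one('strong')  # first <strong> element on the page (or None)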

As you can see on the screenshot, in our case the offer-wrapper class is the selector that lets us reach each offer. By selecting all elements of this class, we can get the desired information. In order to go through all of them, we're going to use a for loop.

You may wonder why we have used “class” with an underscore. It's because the word “class” is already reserved in Python syntax. We have also used break, because for now I want to download only one offer as an example.

for offer in bs.find_all('div', class_="offer-wrapper"):
    print(offer)
    break
Console output

Isn’t that beautiful? Although the website does not offer a way to export this data, we can easily generate it ourselves. Just remember to use BeautifulSoup and crawling for a good purpose, such as statistical analysis, as in our case. Never steal data!

Location

Location on the website

To get the location, I will use the footer of each advertisement. This is the “bottom-cell” class. Then, in this footer, I look for the location, which is an element with the “breadcrumb” class. To check if we chose the element correctly, we can print the variable inside the loop.

for offer in bs.find_all('div', class_='offer-wrapper'):
    footer = offer.find('td', class_="bottom-cell")
    location = footer.find('small', class_='breadcrumb').get_text()
    print(location)
    break
Console output

As you can see, we need to format this variable a bit. I would like to get rid of the newlines, which I can do with the strip() method. When I remove break, I can also see that cities often come with a comma and a district name after them.

Let’s assume that we are only interested in the city, not its district. I can tell Python that we are only interested in the first element before the comma.

for offer in bs.find_all('div', class_='offer-wrapper'):
    footer = offer.find('td', class_="bottom-cell")
    location = footer.find('small', class_='breadcrumb').get_text().strip().split(',')[0]
    print(location)
Console output

Title

Another item that I want to scrape is the title:

for offer in bs.find_all('div', class_='offer-wrapper'):
    footer = offer.find('td', class_="bottom-cell")
    location = footer.find('small', class_='breadcrumb').get_text().strip().split(',')[0]
    title = offer.find('strong').get_text().strip()

Price

The price is in an element with the price class. Unfortunately, it is in a very unfriendly format for us. We need to remove the spaces and the currency information (after all, we only have Polish zloty), and we will replace commas with dots. We would also like to have the result as a float. Let's create a dedicated function for this purpose:

def parse_price(price):
    return float(price.replace(' ', '').replace('zł', '').replace(',', '.'))

for offer in bs.find_all('div', class_='offer-wrapper'):
    footer = offer.find('td', class_="bottom-cell")
    location = footer.find('small', class_='breadcrumb').get_text().strip().split(',')[0]
    title = offer.find('strong').get_text().strip()
    price = parse_price(offer.find('p', class_='price').get_text().strip())
    print(title, location, price)
Console output

Flat area

Contrary to the previous variables, in order to obtain information about the flat area I would have to open the offer page itself, so let's first extract its link:

for offer in bs.find_all('div', class_='offer-wrapper'):
    footer = offer.find('td', class_="bottom-cell")
    location = footer.find('small', class_='breadcrumb').get_text().strip().split(',')[0]
    title = offer.find('strong').get_text().strip()
    price = parse_price(offer.find('p', class_='price').get_text().strip())
    link = offer.find('a')
    print(link['href'])
    print(title, location, price)
    break
Console output

It turns out that we have two different data sources on the website. Some of the apartments are offered by OLX, and some by OtoDom. Unfortunately, they have completely different structures. It will therefore be easier for us to apply a top-down filter on the website and limit the ads, e.g. to studios (1 room). The URL has changed, so we need to update it:

URL = "https://www.olx.pl/nieruchomosci/mieszkania/sprzedaz/pomorskie/?search%5Bfilter_enum_rooms%5D%5B0%5D=one"
for offer in bs.find_all('div', class_='offer-wrapper'):
   footer = offer.find('td', class_="bottom-cell")
   location = footer.find('small', class_='breadcrumb').get_text().strip().split(',')[0]
   title = offer.find('strong').get_text().strip()
   price = parse_price(offer.find('p', class_='price').get_text().strip())
   print(title, location, price)

Data export with sqlite3

In order to find the most advantageous apartment offers, I need to store the data somewhere. To do so, we'll need the sqlite3 library.

I import the library and define a connection to the database.

import sqlite3
db = sqlite3.connect("dane.db")

I also add a cursor variable through which we will execute database operations.

cursor = db.cursor()

As I do not want to create the table on every run, but only once, I need an additional element from the sys package. argv is a list that stores all the arguments the script was started with. How do I tell the script it should only do the setup? If it is started with the setup argument, I create the table, and with quit() I tell Python that this run was only meant to create the database.

from sys import argv

if len(argv) > 1 and argv[1] == "setup":
    cursor.execute('''CREATE TABLE offers (name TEXT, price REAL, city TEXT)''')
    quit()

Then, using the cursor, I pass the values obtained by the selectors to the database and confirm the query with a commit. You need to find a balance here and avoid committing in chunks that are either too large or too small.

for offer in bs.find_all('div', class_='offer-wrapper'):
    footer = offer.find('td', class_="bottom-cell")
    location = footer.find('small', class_='breadcrumb').get_text().strip().split(',')[0]
    title = offer.find('strong').get_text().strip()
    price = parse_price(offer.find('p', class_='price').get_text().strip())
    cursor.execute('INSERT INTO offers VALUES (?, ?, ?)', (title, price, location))
    db.commit()

Let’s not forget to close the connection!

db.close()

When the code is ready, I first execute the command on the terminal once:

python main.py setup

And then I run the script with the command:

python main.py

I can get to our data, e.g. through the free DBeaver program, by connecting to the newly created database:

Table structure in DBeaver
Offer table filled with scraped data

Pagination

Is that all? Not quite! Please note that we have only downloaded the first page of ads. To switch pages on the website I would need to click another number at the bottom, but unfortunately BeautifulSoup doesn't have such a feature. It only parses HTML content for us.

Changing page number

To download more pages I can add an extra argument to the URL, which lets me request individual pages of the listing:
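On OLX the page number simply appears as an extra query parameter. As a sketch (the page parameter is what shows up in the address bar when switching pages), fetching e.g. the second page of the filtered listing looks like this:

page = get(f'{URL}&page=2')  # the page number is appended to the existing query string
bs = BeautifulSoup(page.content, 'html.parser')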

What I will do now is select Refactor -> Extract -> Method in PyCharm:

I give the method name:

Now I can move “db.commit()” into the new place:

Now I should create a for loop which will determine how many pages I would like to parse. Around 30 would be fine:

for page in range(1,31):
    parse_page(page)

As the new page parameter is not used anywhere in the script yet, I need to include it inside parse_page():
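The extracted method for the apartment scraper then ends up looking roughly like this (a sketch based on the loop above; the exact output of PyCharm's refactoring may differ slightly):

def parse_page(number):
    # Fetch and parse a single page of results; the page number goes into the &page= parameter
    page = get(f'{URL}&page={number}')
    bs = BeautifulSoup(page.content, 'html.parser')
    for offer in bs.find_all('div', class_='offer-wrapper'):
        footer = offer.find('td', class_="bottom-cell")
        location = footer.find('small', class_='breadcrumb').get_text().strip().split(',')[0]
        title = offer.find('strong').get_text().strip()
        price = parse_price(offer.find('p', class_='price').get_text().strip())
        cursor.execute('INSERT INTO offers VALUES (?, ?, ?)', (title, price, location))
    db.commit()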

To validate this in DBeaver I first need to delete my table (watch out, we shouldn't act like this in production!):
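In DBeaver this boils down to running a simple DROP statement against the table (alternatively, you can just remove the database file):

DROP TABLE offers;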

Now I can execute my whole script using “python main.py“:

After refreshing the offers table in DBeaver, it should be populated with data from all 30 pages:

Database Querying

Now for the main point of today's activity – in which city are the cheapest apartments by the sea? We will use appropriate SQL SELECT clauses in DBeaver, e.g.:
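A query along these lines (a sketch using the columns defined in the setup step; the exact query may differ) averages the prices per city and sorts them from the cheapest:

SELECT city, ROUND(AVG(price), 0) AS avg_price, COUNT(*) AS offers_count
FROM offers
GROUP BY city
ORDER BY avg_price;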

As you can see, the cheapest studios can be found in Lębork. The results are not surprising to me, as all those cities are quite far from Tricity, where prices are the highest.

Car offers scraping

To verify my new skills I did an additional exercise – webscraping car offers. I decided to apply some filters first:

  • price: 10-20k PLN
  • manual gearbox
  • year 2010-2015
  • my area (Pomeranian Voivodeship)

Below you can find the cheapest offers by production year in my area:

Of course, to select the most attractive offers I would need to dive deeper into their parameters: analyze mileage, maybe investigate car brands and the details included in the titles. But let's leave this for another post, when I'll play more with NLP methods.

Below you can find the code I have used for the whole webscraping process:

from bs4 import BeautifulSoup
from requests import get
import sqlite3
from sys import argv

# Applied filters: price (10-20k PLN), manual gearbox, year (2010-2015)

def parse_price(price):
    return float(price.replace(' ', '').replace('zł', '').replace(',', '.'))

def parse_distance(distance):
    return float(distance.replace(' ', '').replace('km', '').replace(',', '.'))

def parse_page(number):
    print(f'I am working on the page number {number}.')
    page = get(f'{URL}&page={number}')
    bs = BeautifulSoup(page.content, 'html.parser')
    for offer in bs.find_all('div', class_='css-1sw7q4x'):
        title = offer.find('p', class_='css-cqgwae-Text eu5v0x0').get_text().strip()
        location = offer.find('p', class_='css-106ejje-Text eu5v0x0').get_text().strip()
        price = parse_price(offer.find('span').get_text().strip())
        year = offer.find('p', class_="css-1obsecn").get_text().strip()[:5]
        distance = parse_distance(offer.find('p', class_="css-1obsecn").get_text().strip()[8:])

        cursor.execute('INSERT INTO offers VALUES (?, ?, ?, ?, ?)', (title, price, location, year, distance))

    db.commit()

URL = 'https://www.olx.pl/d/motoryzacja/samochody/pomorskie/?search%5Bfilter_float_price:from%5D=10000&search%5Bfilter_float_price:to%5D=20000&search%5Bfilter_float_year:from%5D=2010&search%5Bfilter_float_year:to%5D=2015&search%5Bfilter_enum_transmission%5D%5B0%5D=manual'
db = sqlite3.connect('cars_db.db')
cursor = db.cursor()

if len(argv) > 1 and argv[1] == 'setup':
    cursor.execute('''CREATE TABLE offers (title TEXT, price REAL, location TEXT, year TEXT, distance REAL)''')
    quit()

for page in range(1,31):
    parse_page(page)

db.close()

You can also check out the whole code for both apartment and car scraping on my GitHub. 🙂

Conclusion

Summing up, what makes me happy about such experiments is that you can use your new skills not only for work-related stuff but also for your own purposes! How often do you need to make decisions when shopping online? Selecting a new laptop, sportswear, a phone or any other item available online will never be the same with webscraping! Not a fan of buying stuff online? How about scraping for hotels or summer destinations? You can scrape Booking.com or TripAdvisor data. Which restaurant to choose for a date? Google Reviews! As you can see, the sky is the limit here. You can apply webscraping to almost any hobby.

Thus let me finish this article with a question – what kind of data will you scrape first? 🙂
