
Web scraping Yell with Python, Beautiful Soup, and Requests

Published March 01, 2021 by Danny Moran


Introduction

One of my goals this year is to improve my programming knowledge, especially with Python, so when a friend told me he was manually gathering data from Yell, I realised I could use the situation to improve my skills as well as help him out.

The scope of the project

Create a small app into which a search term and location can be entered, and which returns the company name, phone number, address, and website for each result, where these are available.

The tooling used

I decided that Python paired with the Beautiful Soup and Requests libraries would be able to accomplish this. I also used the standard json library to format the output and the time library to slow down the scrape so as not to get blocked by Yell.

The Python script

from bs4 import BeautifulSoup
import requests
import json
import time


def extract(pages):
    # Build the search URL for the requested page and fetch it, sending a
    # browser User-Agent header so the request is not rejected outright
    url = f"https://www.yell.com/ucs/UcsSearchAction.do?keywords={searchTerm}&location={searchLocation}&pageNum={pages}"
    headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup


def transform(soup):
    # Each business listing sits in its own "businessCapsule" row
    article = soup.find_all("div", class_="row businessCapsule--mainRow")
    for item in article:
        try:
            companyName = item.find("h2", itemprop="name").text
            companyNumber = item.find("span", itemprop="telephone").text.strip()
            companyAddress = item.find("span", itemprop="address").text.strip().replace("\n", "").replace(",", "")
            companyWebsite = item.find("a", rel="nofollow noopener", class_="btn btn-yellow businessCapsule--ctaItem")['href']
        except (AttributeError, TypeError):
            # A missing field makes .text or ['href'] fail, so fall back to "None"
            companyName = "None"
            companyNumber = "None"
            companyAddress = "None"
            companyWebsite = "None"

        companyInfo = {
            "companyName": companyName,
            "companyNumber": companyNumber,
            "companyAddress": companyAddress,
            "companyWebsite": companyWebsite
        }
        joblist.append(companyInfo)


searchTerm = "enter+search+term+here"
searchLocation = "enter+location+here"
searchPages = 2
pages = searchPages + 1

joblist = []
for i in range(1, pages):
    print(f"Started page: {i}")
    e = extract(i)
    transform(e)
    print(f"Finished page: {i}")
    time.sleep(3)  # pause between pages so Yell does not block us

print(json.dumps(joblist, indent=1))

Prerequisites for being able to run the script

You need to have the following installed in your Python 3 environment: the Beautiful Soup (beautifulsoup4) and Requests libraries. The json and time modules ship with Python, so no extra installation is needed for those.
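Both libraries can be installed with pip:

pip install beautifulsoup4 requests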

Breakdown of how the script works

from bs4 import BeautifulSoup
import requests
import json
import time

This imports the required libraries used within the script: Beautiful Soup to parse the HTML, Requests to fetch the pages, and the standard json and time modules for output formatting and rate limiting.

def extract(pages):
    url = f"https://www.yell.com/ucs/UcsSearchAction.do?keywords={searchTerm}&location={searchLocation}&pageNum={pages}"
    headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup

This function takes the variables defined later in the script, builds a URL from them, fetches that URL with the Requests library, parses the response with Beautiful Soup, and returns the result in a variable called soup.
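For example, once searchTerm and searchLocation have been defined, calling extract with a page number gives you a parsed document you can query; the title lookup below is just a quick sanity check, not part of the script:

searchTerm = "plumbers"
searchLocation = "London"
soup = extract(1)
print(soup.title)  # the <title> tag of the results page, or None if the fetch failed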

def transform(soup):
    article = soup.find_all("div", class_="row businessCapsule--mainRow")
    for item in article:
        try:
            companyName = item.find("h2", itemprop="name").text
            companyNumber = item.find("span", itemprop="telephone").text.strip()
            companyAddress = item.find("span", itemprop="address").text.strip().replace("\n", "").replace(",","")
            companyWebsite = item.find("a", rel="nofollow noopener", class_="btn btn-yellow businessCapsule--ctaItem")['href']
        except (AttributeError, TypeError):
            companyName = "None"
            companyNumber = "None"
            companyAddress = "None"
            companyWebsite = "None"

        companyInfo = {
            "companyName": companyName,
            "companyNumber": companyNumber,
            "companyAddress": companyAddress,
            "companyWebsite": companyWebsite
        }
        joblist.append(companyInfo)

This function takes the previously returned soup object, parses out the companyName, companyNumber, companyAddress, and companyWebsite where available, saves the information into a dictionary called companyInfo, and appends that dictionary to the joblist list.
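One side effect of the single try block is that if any one lookup fails (many listings have no website link, for instance), all four fields fall back to "None" together. If you would rather keep partial records, a per-field helper is one option; safe_text below is a hypothetical name, not part of the original script:

def safe_text(tag):
    # Return the tag's stripped text, or "None" when find() matched nothing
    return tag.text.strip() if tag else "None"

companyName = safe_text(item.find("h2", itemprop="name"))
companyNumber = safe_text(item.find("span", itemprop="telephone"))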

searchTerm = "enter+search+term+here"
searchLocation = "enter+location+here"
searchPages = 2
pages = searchPages + 1

These are the variables to modify to change the results that are returned: the search term and location (written with + in place of spaces, because they are inserted directly into the URL), and the number of pages to scrape.
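If you would rather type the terms naturally, the standard library's urllib.parse.quote_plus can do the encoding for you; a small optional sketch, not part of the original script:

from urllib.parse import quote_plus

searchTerm = quote_plus("web design")         # becomes "web+design"
searchLocation = quote_plus("Milton Keynes")  # becomes "Milton+Keynes"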

joblist = []
for i in range(1, pages):
    print(f"Started page: {i}")
    e = extract(i)
    transform(e)
    print(f"Finished page: {i}")
    time.sleep(3)

This for loop runs extract and transform once per page so that all of the pages are parsed. It waits for 3 seconds after each page so as not to get blocked by Yell for sending too many requests.
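If the fixed delay still trips Yell's rate limiting, one common tweak is to add a little jitter so the requests look less mechanical; a minimal variation using the standard random module, not part of the original script:

import random

time.sleep(3 + random.uniform(0, 2))  # wait between 3 and 5 seconds instead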

print(json.dumps(joblist, indent=1))

This prints the results to the console in JSON format.
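If you want to keep the results rather than just print them, the same json module can write them to a file; a small optional addition (results.json is just an example filename):

with open("results.json", "w") as f:
    json.dump(joblist, f, indent=1)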