Description
In this blog, I will guide you through the process of automating web data extraction using Scrapy, a powerful web scraping framework. We will focus on scraping car dealer information from Yellow Pages and saving it into a CSV file.
Detailed Description
Web scraping is a vital skill for collecting data from websites, and Scrapy is one of the most powerful frameworks for this purpose. Below, we will explore a practical example of using Scrapy to scrape car dealer information from Yellow Pages in Idaho and save the data into a CSV file.
Prerequisites
Before you start, ensure you have Python installed. If not, download and install it from Python’s official website. Next, install Scrapy using pip:
pip install scrapy
The Script
Here’s the complete Python script designed to extract car dealer information from Yellow Pages:
# -*- coding: utf-8 -*-
import csv
import os

import scrapy
from scrapy.crawler import CrawlerProcess

filename = "car dealer"
url = "https://www.yellowpages.com/search?search_terms=car%20dealers&geo_location_terms=Idaho&page=2"


class YellowpagesSpider(scrapy.Spider):
    name = 'yellowpages'

    # Create the CSV file with a header row if it does not exist yet.
    if f"{filename}.csv" not in os.listdir(os.getcwd()):
        with open(f"{filename}.csv", "a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(['Category', 'Title', 'Address', 'Phone', 'Website', 'Email'])

    def start_requests(self):
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Collect links to the individual business pages on this results page.
        links = response.xpath('.//*[@class="business-name"]/@href').extract()
        category = response.xpath('.//*[@class="breadcrumb"]/span/text()').extract_first()
        for link in links:
            yield scrapy.Request(response.urljoin(link), callback=self.getdata,
                                 meta={'category': category})

        # Follow the "next page" link, if present, to handle pagination.
        nextlink = response.xpath('.//*[@class="next ajax-page"]/@href').extract_first()
        if nextlink:
            yield scrapy.Request(response.urljoin(nextlink), callback=self.parse)

    def getdata(self, response):
        # Fall back to empty strings so a missing field never crashes the spider.
        title = (response.xpath('.//h1/text()').extract_first() or '').strip()
        category = response.meta.get('category')
        address = ''.join(response.xpath('.//*[@class="address"]//text()').extract())
        phone = ''.join(response.xpath('.//*[@class="phone"]//text()').extract()).replace('Phone: ', '')
        website = response.xpath('.//a[contains(.,"Visit Website")]/@href').extract_first() or ''
        email = (response.xpath('.//a[contains(@href,"mailto:")]/@href').extract_first() or '').replace('mailto:', '')

        with open(f"{filename}.csv", "a", encoding="utf-8", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([category, title, address, phone, website, email])
        print([category, title, address, phone, website, email])


process = CrawlerProcess()
process.crawl(YellowpagesSpider)
process.start()
How It Works
- File initialization: the script first checks whether the CSV file exists. If not, it creates one and writes the header row.
- Spider definition: the YellowpagesSpider class inherits from Scrapy's Spider class. The start_requests method issues the initial request to the target URL, and the parse method extracts links to individual business pages and follows them. It also follows the "next page" link to handle pagination across multiple result pages.
- Data extraction: the getdata method extracts details such as title, address, phone number, website, and email from each business page, then appends them as a row to the CSV file.
- Crawler process: the script uses CrawlerProcess to start the spider directly from Python, without the scrapy command-line tool.
Running the Script
To run the script, save it as scrapy_spider.py and execute it:
python scrapy_spider.py
This will start the scraping process, and the collected data will be saved into car dealer.csv.
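After a run, a quick sanity check is to load the CSV back with csv.DictReader and count the rows. This sketch assumes the spider has already run and left "car dealer.csv" in the current directory:

```python
import csv
import os

path = "car dealer.csv"
if os.path.exists(path):
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Each row is a dict keyed by the header written by the spider.
    print(f"{len(rows)} dealers scraped")
    for row in rows[:3]:
        print(row["Title"], "-", row["Phone"])
else:
    print("Run the spider first to create", path)
```

Loading the file through DictReader also confirms the header row was written exactly once.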
Conclusion
Scrapy is a powerful tool for web scraping, and this example demonstrates how to use it to extract data from Yellow Pages. You can adapt this script to scrape other types of data from different websites. Happy scraping!
Feel free to leave comments or questions below, and I’ll be happy to help you out.