Description
In this blog, I will guide you through the process of automating web data extraction using Scrapy, a powerful web scraping framework. We will focus on scraping car dealer information from Yellow Pages and saving it into a CSV file.
Detailed Description
Web scraping is a vital skill for collecting data from websites, and Scrapy is one of the most powerful frameworks for this purpose. Below, we will explore a practical example of using Scrapy to scrape car dealer information from Yellow Pages in Idaho and save the data into a CSV file.
Prerequisites
Before you start, ensure you have Python installed. If not, download and install it from Python’s official website. Next, install Scrapy using pip:
pip install scrapy
The Script
Here’s the complete Python script designed to extract car dealer information from Yellow Pages:
# -*- coding: utf-8 -*-
import csv
import os

import scrapy
from scrapy.crawler import CrawlerProcess

filename = "car dealer"
url = "https://www.yellowpages.com/search?search_terms=car%20dealers&geo_location_terms=Idaho&page=2"


class YellowpagesSpider(scrapy.Spider):
    name = 'yellowpages'

    # Create the CSV file with a header row if it does not exist yet.
    if f"{filename}.csv" not in os.listdir(os.getcwd()):
        with open(f"{filename}.csv", "a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(['Category', 'Title', 'Address', 'Phone', 'Website', 'Email'])

    def start_requests(self):
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Collect links to the individual business pages on this results page.
        links = response.xpath('.//*[@class="business-name"]/@href').extract()
        category = response.xpath('.//*[@class="breadcrumb"]/span/text()').extract_first()
        for link in links:
            yield scrapy.Request(response.urljoin(link), callback=self.getdata,
                                 meta={'category': category})

        # Follow the "next page" link, if present, to handle pagination.
        nextlink = response.xpath('.//*[@class="next ajax-page"]/@href').extract_first()
        if nextlink:
            yield scrapy.Request(response.urljoin(nextlink), callback=self.parse)

    def getdata(self, response):
        # Fall back to empty strings so a missing field never crashes the spider.
        title = (response.xpath('.//h1/text()').extract_first() or '').strip()
        category = response.meta.get('category')
        address = ''.join(response.xpath('.//*[@class="address"]//text()').extract())
        phone = ''.join(response.xpath('.//*[@class="phone"]//text()').extract()).replace('Phone: ', '')
        website = response.xpath('.//a[contains(.,"Visit Website")]/@href').extract_first() or ''
        email = (response.xpath('.//a[contains(@href,"mailto:")]/@href').extract_first() or '').replace('mailto:', '')

        with open(f"{filename}.csv", "a", encoding="utf-8", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([category, title, address, phone, website, email])
        print([category, title, address, phone, website, email])


process = CrawlerProcess()
process.crawl(YellowpagesSpider)
process.start()
How It Works
- File initialization: the script first checks whether the CSV file exists. If not, it creates one and writes the header row.
- Spider definition: the YellowpagesSpider class inherits from Scrapy's Spider class. The start_requests method issues the initial request to the target URL, and the parse method extracts links to individual business pages and follows them. It also follows the "next page" link to handle pagination across multiple result pages.
- Data extraction: the getdata method extracts details such as title, address, phone number, website, and email from each business page, then appends them as a row to the CSV file.
- Crawler process: the script uses CrawlerProcess to start the spider directly from Python, without the scrapy command-line tool.
Running the Script
To run the script, save it as scrapy_spider.py and execute it:
python scrapy_spider.py
This will start the scraping process, and the collected data will be saved into car dealer.csv.
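After a run, a quick sanity check is to load the CSV back with csv.DictReader and count the rows. This sketch assumes the spider has already run and left "car dealer.csv" in the current directory:

```python
import csv
import os

path = "car dealer.csv"
if os.path.exists(path):
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Each row is a dict keyed by the header written by the spider.
    print(f"{len(rows)} dealers scraped")
    for row in rows[:3]:
        print(row["Title"], "-", row["Phone"])
else:
    print("Run the spider first to create", path)
```

Loading the file through DictReader also confirms the header row was written exactly once.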
Conclusion
Scrapy is a powerful tool for web scraping, and this example demonstrates how to use it to extract data from Yellow Pages. You can adapt this script to scrape other types of data from different websites. Happy scraping!
Feel free to leave comments or questions below, and I’ll be happy to help you out.