Web scraping is a powerful technique for extracting data from websites, making it possible to gather and analyze information that would be time-consuming and labor-intensive to collect manually. Among the many tools available for web scraping, Scrapy has emerged as one of the most popular and robust Python frameworks. Scrapy provides a comprehensive set of tools and features for efficient data extraction, including built-in support for handling web protocols, managing requests, and processing responses.
Furthermore, Scrapy’s flexibility allows users to store the extracted data in a variety of formats, such as JSON, CSV, and XML, depending on their specific needs and preferences. This versatility ensures that users can integrate the data seamlessly into different applications or databases for further analysis or reporting. In this blog post, we will undertake a detailed exploration of how to build a web scraper using Scrapy. Specifically, we will focus on a practical example to illustrate the process: collecting data on vape products from a sample e-commerce website.
To achieve this, we will first guide you through setting up a Scrapy project, including creating the necessary project structure and configuration files. Next, we will develop the spider that crawls the e-commerce site, specify the data we want to extract, and gather the relevant information. Along the way, we will cover how to handle challenges that arise during scraping, such as dealing with pagination and writing out the extracted data efficiently.
By the end of this blog post, you will have a thorough understanding of how to leverage Scrapy’s powerful features to build an effective web scraper. Consequently, you will be equipped with the knowledge and skills to apply these techniques to your own web scraping projects, making it easier to collect and analyze data from a wide range of sources.
Introduction to Scrapy
Scrapy is an open-source and collaborative web crawling framework specifically designed for Python. To start with, it boasts a design that prioritizes speed, simplicity, and flexibility. As a result, it allows you to efficiently scrape websites and extract valuable data by employing various selectors, including CSS and XPath. Additionally, Scrapy offers a range of built-in features that further enhance its functionality. For instance, it not only handles HTTP requests seamlessly but also follows links between pages automatically, ensuring that you can navigate through websites without manual intervention.
Furthermore, Scrapy facilitates data management by providing multiple export options. You can easily export the extracted data into various formats, such as JSON, CSV, or XML, depending on your needs. This versatility in data handling ensures that you can integrate the scraped information into different applications or workflows with minimal effort. Consequently, Scrapy equips you with a robust and comprehensive toolkit, making it an invaluable asset for managing and organizing your web scraping projects efficiently.
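To make the selector syntax concrete, here is a minimal, self-contained sketch. The HTML snippet and variable names are invented purely for illustration, but the `Selector` API shown is standard Scrapy:

from scrapy import Selector

# A tiny, made-up product snippet for demonstration purposes
html = '<div class="product"><h1 class="title">Mango Ice 600</h1><span class="price">£3.99</span></div>'
sel = Selector(text=html)

# The same value extracted two ways: CSS and XPath
title_css = sel.css('h1.title::text').get()                    # 'Mango Ice 600'
title_xpath = sel.xpath('//h1[@class="title"]/text()').get()   # 'Mango Ice 600'
price = sel.css('span.price::text').get()                      # '£3.99'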
Overview of the VapeSpider
In this tutorial, we’ll develop a Scrapy spider named `VapeSpider` that will:
- Crawl a vape product e-commerce website.
- Extract product details such as the title, brand, price, and stock information.
- Save the data to a JSON file.
Here’s the complete code for our `VapeSpider`:
import scrapy
import json
import os
from scrapy.http import Request
from fake_headers import Headers


class VapeSpider(scrapy.Spider):
    name = "vape_spider"
    start_urls = [
        'https://vapingwholesale.co.uk/collections/new?sort_by=created-descending&filter.v.price.gte=&filter.v.price.lte='
    ]
    custom_settings = {
        'DOWNLOAD_DELAY': 1,  # Delay between requests to the same domain
    }

    def __init__(self, *args, **kwargs):
        super(VapeSpider, self).__init__(*args, **kwargs)
        # Create an empty JSON store on the first run
        if "alldata.json" not in os.listdir(os.getcwd()):
            with open("alldata.json", "w") as f:
                json.dump({}, f, indent=4)
        with open("alldata.json", "r") as f:
            self.alldata = json.load(f)

    def parse(self, response):
        # Link to the next page of the product listing, if any
        nextlink = response.xpath('.//*[@rel="next"]/@href').extract_first()
        # Deduplicated product links on the current page
        links = list(set(response.xpath('.//*[@class="product-item__title text--strong link"]/@href').extract()))
        for l in links:
            link = 'https://vapingwholesale.co.uk' + l
            headers = Headers().generate()  # Randomised, browser-like request headers
            yield Request(url=link, headers=headers, callback=self.parse_product)
        if nextlink:
            yield response.follow(nextlink, self.parse)

    def parse_product(self, response):
        link = response.url  # Key the stored data by the product page URL
        title = response.xpath('.//*[@class="product-meta__title heading h1"]/text()').extract_first()
        brand = response.xpath('.//*[@class="product-meta__vendor link link--accented"]/text()').extract_first()
        variants_data = response.xpath('.//*[@type="application/json"][contains(.,"product")]/text()').extract_first()
        if not variants_data:
            return
        variants = json.loads(variants_data)['product']['variants']
        for variant in variants:
            options = variant['title']
            sku = variant['sku']
            price = variant['price']
            try:
                in_stock = response.xpath(
                    './/div[@class="nm-easywholesale-name nm-primary-color"]'
                    '[contains(.,"' + str(variant['title']) + '")]'
                    '/small[contains(.,"in stock:")]/text()'
                ).extract_first().split(':')[-1].replace(')', '')
            except Exception:
                in_stock = 0
            if link not in self.alldata:
                self.alldata[link] = {}
            if sku not in self.alldata[link]:
                self.alldata[link][sku] = []
            self.alldata[link][sku].append({
                'title': title,
                'options': options,
                'category': 'Disposable Vape',
                'brand': brand,
                'price': price,
                'in_stock': in_stock
            })
        self.writetojson()

    def writetojson(self):
        with open("alldata.json", "w") as f:
            json.dump(self.alldata, f, indent=4)
Understanding the Code
1. Initialization
In the `__init__` method, the spider checks whether `alldata.json` exists in the current working directory. If it doesn’t, it creates an empty JSON file, then loads the file’s contents into `self.alldata`. This file will store all the scraped data.
def __init__(self, *args, **kwargs):
    super(VapeSpider, self).__init__(*args, **kwargs)
    # Create an empty JSON store on the first run
    if "alldata.json" not in os.listdir(os.getcwd()):
        with open("alldata.json", "w") as f:
            json.dump({}, f, indent=4)
    with open("alldata.json", "r") as f:
        self.alldata = json.load(f)
2. Starting the Crawl
The `start_urls` attribute contains the initial URL for the spider to crawl. The `custom_settings` attribute sets a download delay of 1 second between requests to the same domain, which helps avoid overwhelming the server.
start_urls = [
    'https://vapingwholesale.co.uk/collections/new?sort_by=created-descending&filter.v.price.gte=&filter.v.price.lte='
]
custom_settings = {
    'DOWNLOAD_DELAY': 1,  # Delay between requests to the same domain
}
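If you want finer control over crawl politeness, Scrapy ships with further settings that can sit alongside the delay. The values below are illustrative rather than recommendations:

custom_settings = {
    'DOWNLOAD_DELAY': 1,                   # Base delay between requests
    'RANDOMIZE_DOWNLOAD_DELAY': True,      # Jitter each delay (0.5x to 1.5x of the base)
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,   # Cap parallel requests per domain
    'AUTOTHROTTLE_ENABLED': True,          # Adapt the delay to server response times
    'ROBOTSTXT_OBEY': True,                # Respect the site's robots.txt rules
}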
3. Parsing the Response
The `parse` method is the default callback for Scrapy requests. It processes the response of each URL:
- Extracts product links from the page.
- Creates a Scrapy `Request` for each product link, passing it to `parse_product`.
- Follows the pagination link to the next page, if available.
def parse(self, response):
    # Link to the next page of the product listing, if any
    nextlink = response.xpath('.//*[@rel="next"]/@href').extract_first()
    # Deduplicated product links on the current page
    links = list(set(response.xpath('.//*[@class="product-item__title text--strong link"]/@href').extract()))
    for l in links:
        link = 'https://vapingwholesale.co.uk' + l
        headers = Headers().generate()  # Randomised, browser-like request headers
        yield Request(url=link, headers=headers, callback=self.parse_product)
    if nextlink:
        yield response.follow(nextlink, self.parse)
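When developing XPath expressions like these, it helps to test them interactively in the Scrapy shell before wiring them into the spider. For example:

scrapy shell 'https://vapingwholesale.co.uk/collections/new'
>>> response.xpath('.//*[@rel="next"]/@href').extract_first()
>>> response.xpath('.//*[@class="product-item__title text--strong link"]/@href').extract()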
4. Parsing Product Details
The `parse_product` method processes each product page:
- Extracts the product title, brand, price, and stock status using XPath selectors.
- Parses JSON data embedded in the page to retrieve product variants.
- Appends the extracted data to `self.alldata` and writes it to `alldata.json`.
def parse_product(self, response):
    link = response.url  # Key the stored data by the product page URL
    title = response.xpath('.//*[@class="product-meta__title heading h1"]/text()').extract_first()
    brand = response.xpath('.//*[@class="product-meta__vendor link link--accented"]/text()').extract_first()
    variants_data = response.xpath('.//*[@type="application/json"][contains(.,"product")]/text()').extract_first()
    if not variants_data:
        return
    variants = json.loads(variants_data)['product']['variants']
    for variant in variants:
        options = variant['title']
        sku = variant['sku']
        price = variant['price']
        try:
            in_stock = response.xpath(
                './/div[@class="nm-easywholesale-name nm-primary-color"]'
                '[contains(.,"' + str(variant['title']) + '")]'
                '/small[contains(.,"in stock:")]/text()'
            ).extract_first().split(':')[-1].replace(')', '')
        except Exception:
            in_stock = 0
        if link not in self.alldata:
            self.alldata[link] = {}
        if sku not in self.alldata[link]:
            self.alldata[link][sku] = []
        self.alldata[link][sku].append({
            'title': title,
            'options': options,
            'category': 'Disposable Vape',
            'brand': brand,
            'price': price,
            'in_stock': in_stock
        })
    self.writetojson()
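For reference, each entry in `alldata.json` ends up keyed first by product URL and then by SKU. The values below are purely illustrative:

{
    "https://vapingwholesale.co.uk/products/example-vape": {
        "SKU-123": [
            {
                "title": "Example Vape 600",
                "options": "Mango Ice",
                "category": "Disposable Vape",
                "brand": "ExampleBrand",
                "price": "3.99",
                "in_stock": "42"
            }
        ]
    }
}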
5. Writing Data to JSON
The `writetojson` method saves the scraped data to the `alldata.json` file:
def writetojson(self):
    with open("alldata.json", "w") as f:
        json.dump(self.alldata, f, indent=4)
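As an aside, rewriting the whole file after every product page works for small crawls but scales poorly. A common alternative is Scrapy’s built-in feed exports: yield plain item dicts from `parse_product` and let Scrapy handle serialization. A minimal sketch, which produces a flat list of items rather than the nested structure used above:

# Inside parse_product, instead of updating self.alldata:
yield {
    'url': response.url,
    'sku': sku,
    'title': title,
    'options': options,
    'brand': brand,
    'price': price,
    'in_stock': in_stock,
}

Running the spider with `scrapy crawl vape_spider -O alldata.json` then writes every yielded item to JSON with no manual file handling (in Scrapy 2.x, `-O` overwrites the output file, while `-o` appends).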
Running the Spider
To run the spider, you need to have Scrapy installed and a Scrapy project set up. Here are the steps:
1. Install Scrapy: You can install Scrapy using pip:
pip install scrapy
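The spider also imports the `fake_headers` package to generate browser-like request headers, so install that as well (the package is published on PyPI as `fake-headers`):

pip install fake-headers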
2. Create the project: Next, generate a new Scrapy project by running the command below. This creates a structured directory with the essential configuration files, including folders for your spiders, item definitions, and pipelines, as well as the project’s settings file, giving you a solid foundation for developing your scraping logic:
scrapy startproject vapescraper
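This should produce a layout along the following lines (the exact files can vary slightly between Scrapy versions):

vapescraper/
    scrapy.cfg            # Deploy configuration
    vapescraper/
        __init__.py
        items.py          # Item definitions
        middlewares.py    # Spider and downloader middlewares
        pipelines.py      # Item pipelines
        settings.py       # Project settings
        spiders/          # Your spiders, including vape_spider.py, live here
            __init__.py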
3. Save the Spider: Save the `VapeSpider` code in a file named `vape_spider.py` inside the `spiders` directory of your Scrapy project.
4. Run the Spider: Finally, start the crawl with the command below. The spider will begin traversing the website and collecting the information as defined above:
scrapy crawl vape_spider
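As the crawl progresses, the spider writes its results incrementally to alldata.json in the directory you launched it from, so you can inspect partial data while it is still running.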
Conclusion
In this blog post, we explored how to use Scrapy to build a web scraper that extracts data from a vape product e-commerce website. We learned how to set up a Scrapy project, define a spider, parse product details, handle pagination, and save the scraped data to a JSON file. With these building blocks in place, you can apply the same techniques to your own web scraping projects.