In today’s digital age, web scraping is an indispensable tool for collecting data from the web, especially for e-commerce sites. Scrapy, a robust Python framework, simplifies this task by allowing developers to efficiently extract data from websites. In this blog post, we will explore how to create a Scrapy spider named “EI Vape Scraper” to gather product information from an e-commerce site specializing in vape products.
What is Scrapy?
Scrapy is a popular web crawling framework for Python that is designed to scrape and extract data from websites. It offers a comprehensive set of tools for handling various aspects of web scraping, such as following links, parsing HTML/XML, and storing data. Its asynchronous architecture makes it particularly suitable for large-scale scraping projects.
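To get a feel for the framework before diving into the full project, here is a minimal, self-contained spider. It is a sketch against quotes.toscrape.com, a public practice site commonly used for scraping demos (not part of the project in this post):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield the text of every quote on the page as one scraped item
        for text in response.css("div.quote span.text::text").getall():
            yield {"quote": text}

Saved as quotes_spider.py, it runs with scrapy runspider quotes_spider.py -o quotes.json.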
Setting Up the Scrapy Project
Before you start building the scraper, make sure you have Python and Scrapy installed. You can install Scrapy using pip:
pip install scrapy
Create a new Scrapy project to organize your scraper:
scrapy startproject ei_vape_scraper
Navigate to your project directory and generate a new spider:
cd ei_vape_scraper
scrapy genspider ei_vape eivape.com
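This creates a stub spider at ei_vape_scraper/spiders/ei_vape.py. The exact template varies slightly between Scrapy versions, but it looks roughly like this:

import scrapy

class EiVapeSpider(scrapy.Spider):
    name = "ei_vape"
    allowed_domains = ["eivape.com"]
    start_urls = ["https://eivape.com"]

    def parse(self, response):
        pass

In the rest of this post, the scraping logic is written as a standalone script run via CrawlerProcess, but the same spider class can live in this generated file and be run with scrapy crawl ei_vape.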
Building the EI Vape Scraper
Here is a detailed breakdown of the EI Vape Scraper implementation:
import scrapy
import json
import datetime
from datetime import timedelta
from scrapy.crawler import CrawlerProcess
from fake_headers import Headers  # available for header randomization (not used in this listing)
import os

class VapeSpider(scrapy.Spider):
    name = "vape_spider"

    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 1,
        "AUTOTHROTTLE_MAX_DELAY": 10,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    }

    # Start URLs and category mappings (dicts rather than plain URLs;
    # start_requests below knows how to unpack them)
    start_urls = [
        {'url': 'https://eivape.com/disposables/', 'category': 'disposables'},
        {'url': 'https://eivape.com/hardware/', 'category': 'hardware'},
    ]

    # Load or initialize the data file
    if "alldata.json" not in os.listdir(os.getcwd()):
        with open("alldata.json", "w") as f:
            json.dump({}, f, indent=4)
    with open("alldata.json", "r") as f:
        alldata = json.load(f)

    # Date range for data collection: from the earliest recorded date
    # (or today, if no data exists yet) through today
    try:
        first_url = list(alldata.keys())[0]
        first_sku = list(alldata[first_url].keys())[0]
        start_date = alldata[first_url][first_sku][0]['date']
    except (IndexError, KeyError):
        start_date = datetime.datetime.now().strftime("%Y-%m-%d")
    start_dt = datetime.datetime.strptime(start_date, '%Y-%m-%d')
    # A plain loop is used here because a list comprehension in a class
    # body cannot read other class-level names such as start_dt
    dates = []
    for i in range((datetime.datetime.now() + timedelta(days=1) - start_dt).days):
        dates.append(datetime.datetime.strftime(start_dt + timedelta(days=i), '%Y-%m-%d'))

    def start_requests(self):
        for url_data in self.start_urls:
            yield scrapy.Request(url=url_data['url'], callback=self.parse,
                                 meta={'category': url_data['category']})

    def parse(self, response):
        category = response.meta['category']
        next_page = response.xpath('.//*[@class="next page-numbers"]/@href').extract_first()
        links = list(set(response.xpath(
            './/*[@class="woocommerce-LoopProduct-link woocommerce-loop-product__link"]/@href'
        ).extract()))
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_product,
                                 meta={'category': category})
        if next_page:
            yield response.follow(next_page, callback=self.parse, meta={'category': category})

    def parse_product(self, response):
        category = response.meta['category']
        title = response.xpath('.//*[@class="product_title entry-title"]/text()').extract_first()
        # Variation column headers (one per variation) from the grid's header row
        thead = response.xpath('.//*[@class="variation_grid"]/thead').extract_first()
        variations = (scrapy.Selector(text=thead)
                      .xpath('.//th[1]/following-sibling::th/text()').extract()
                      if thead else [])
        rows = response.xpath('.//*[@class="variation_grid_table_wrapper"]//tbody/tr').extract()
        today_date = datetime.datetime.now().strftime('%Y-%m-%d')
        for row in rows:
            sel = scrapy.Selector(text=row)
            try:
                options = sel.xpath('.//td[1]/text()').extract_first().strip()
            except AttributeError:  # extract_first() returned None
                options = ''
            for i in range(len(variations)):
                try:
                    in_stock = sel.xpath(
                        './/td[' + str(i + 2) + ']/div[3]/text() | '
                        './/td[' + str(i + 2) + ']/div[3]/p/text()'
                    ).extract_first().replace(' in Stock', '').replace('in stock', '').strip()
                except AttributeError:
                    in_stock = ''
                sku = options + ' : ' + variations[i]
                records = self.alldata.setdefault(response.url, {}).setdefault(sku, [])
                # Record today's snapshot unless one already exists for this SKU
                if today_date not in [r['date'] for r in records]:
                    records.append({
                        'title': title,
                        'date': today_date,
                        'options': options,
                        'category': category,
                        'in_stock': in_stock,
                    })
                    self.writetojson()
                # One stock value per tracked date ('' where no record exists)
                temp = []
                for date in self.dates:
                    matches = [r['in_stock'] for r in records if r['date'] == date]
                    temp.append(matches[0] if matches else '')
                print({'header': ['title', 'department', 'Variations'] + self.dates})
                print({'data': [title, category, sku] + temp})

    def writetojson(self):
        with open("alldata.json", "w") as f:
            json.dump(self.alldata, f, indent=4)

# Run the spider
process = CrawlerProcess()
process.crawl(VapeSpider)
process.start()
Let’s break down the code for the EI Vape Scraper in detail to understand how each part works and what its purpose is. This Scrapy spider is designed to scrape product information from the e-commerce site “EI Vape,” specifically targeting the “disposables” and “hardware” categories.
Overview
The code implements a Scrapy spider that:
- Initializes a list of URLs to scrape.
- Parses product listings and details.
- Saves the data in a structured JSON format.
- Manages scraping settings for efficiency and politeness.
Let’s examine each component of the code in detail:
Imports
import scrapy
import json
import datetime
from datetime import timedelta
from scrapy.crawler import CrawlerProcess
from fake_headers import Headers  # available for header randomization (not used in this listing)
import os
- scrapy: The core library for web scraping, providing classes and functions to define spiders, parse responses, and manage the crawl process.
- json: Used for reading and writing JSON files, which are used to store scraped data.
- datetime: Provides date manipulation capabilities, which are used to track the scraping timeline.
- CrawlerProcess: Part of Scrapy, this allows you to run the spider programmatically.
- Headers: Can generate randomized, browser-like request headers to mimic real browsers and help avoid blocking. Note that the listing imports it but never calls it; a sketch of how to wire it in follows this list.
- os: Used to interact with the operating system, such as checking for file existence.
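Since Headers is imported but never called in the listing, here is one way it could be wired in: a sketch of a start_requests variant, assuming the fake_headers package’s Headers(headers=True).generate() interface, which returns a dictionary of browser-like headers:

from fake_headers import Headers

def start_requests(self):
    # Sketch: attach a freshly generated, browser-like header set to each
    # request; headers=True asks fake_headers to include common extra
    # headers alongside the User-Agent
    generator = Headers(headers=True)
    for url_data in self.start_urls:
        yield scrapy.Request(
            url=url_data['url'],
            headers=generator.generate(),
            callback=self.parse,
            meta={'category': url_data['category']},
        )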
Spider Class Definition
class VapeSpider(scrapy.Spider):
    name = "vape_spider"
- VapeSpider: The class defining the spider. It inherits from scrapy.Spider, which is the base class for all Scrapy spiders.
- name: A unique name for the spider. This is used to reference the spider when running it.
Custom Settings
custom_settings = {
    "DOWNLOAD_DELAY": 1,
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 1,
    "AUTOTHROTTLE_MAX_DELAY": 10,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
}
- DOWNLOAD_DELAY: Sets a delay between requests to the same domain, helping prevent overloading the server and avoiding being blocked.
- AUTOTHROTTLE_ENABLED: Enables Scrapy’s AutoThrottle feature, which adjusts the crawling speed dynamically based on load and response times.
- AUTOTHROTTLE_START_DELAY: Initial download delay, giving the site time to respond before adjusting speed.
- AUTOTHROTTLE_MAX_DELAY: Maximum delay between requests, providing an upper limit to how slow the spider can crawl.
- AUTOTHROTTLE_TARGET_CONCURRENCY: The target number of parallel requests to be sent to each remote server, balancing speed and server load.
Start URLs and Category Mapping
start_urls = [
    {'url': 'https://eivape.com/disposables/', 'category': 'disposables'},
    {'url': 'https://eivape.com/hardware/', 'category': 'hardware'},
]
- start_urls: A list of dictionaries, each containing a URL and its corresponding category. This allows the spider to start crawling from these specific pages and map products to their respective categories.
Data Initialization
if "alldata.json" not in os.listdir(os.getcwd()):
with open("alldata.json", "w") as f:
json.dump({}, f, indent=4)
alldata = json.loads(open("alldata.json", "r").read())
- Data File Check: Checks if alldata.json exists. If not, it creates an empty JSON file to store the scraped data.
- alldata: Loads the existing data from alldata.json into a Python dictionary for in-memory manipulation during the scraping process.
Date Range Calculation
try:
    first_url = list(alldata.keys())[0]
    first_sku = list(alldata[first_url].keys())[0]
    start_date = alldata[first_url][first_sku][0]['date']
except (IndexError, KeyError):
    start_date = datetime.datetime.now().strftime("%Y-%m-%d")
start_dt = datetime.datetime.strptime(start_date, '%Y-%m-%d')
dates = []
for i in range((datetime.datetime.now() + timedelta(days=1) - start_dt).days):
    dates.append(datetime.datetime.strftime(start_dt + timedelta(days=i), '%Y-%m-%d'))
- Start Date: Attempts to extract the earliest date from the existing data for continuous tracking. If no data exists, it defaults to the current date.
- dates: Generates a list of dates from the start date through today, used to track product stock changes over time (a short worked example follows).
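The worked example below makes the + timedelta(days=1) term clear: it ensures today’s date itself is included in the range. The dates are made up for illustration:

import datetime
from datetime import timedelta

start_date = '2024-06-01'                   # hypothetical earliest recorded date
now = datetime.datetime(2024, 6, 3, 12, 0)  # pretend "now", for illustration
start_dt = datetime.datetime.strptime(start_date, '%Y-%m-%d')
dates = []
for i in range((now + timedelta(days=1) - start_dt).days):
    dates.append(datetime.datetime.strftime(start_dt + timedelta(days=i), '%Y-%m-%d'))
print(dates)  # ['2024-06-01', '2024-06-02', '2024-06-03']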
Start Requests
def start_requests(self):
    for url_data in self.start_urls:
        yield scrapy.Request(url=url_data['url'], callback=self.parse,
                             meta={'category': url_data['category']})
- start_requests: Overrides the default method to send HTTP requests to each start URL. The meta attribute passes the category information to the parse method.
Parsing Product Listings
def parse(self, response):
    category = response.meta['category']
    next_page = response.xpath('.//*[@class="next page-numbers"]/@href').extract_first()
    links = list(set(response.xpath(
        './/*[@class="woocommerce-LoopProduct-link woocommerce-loop-product__link"]/@href'
    ).extract()))
    for link in links:
        yield scrapy.Request(url=link, callback=self.parse_product,
                             meta={'category': category})
    if next_page:
        yield response.follow(next_page, callback=self.parse, meta={'category': category})
- parse: Processes the response from a category page.
- category: Retrieves the category from the meta attribute.
- next_page: Extracts the link to the next page of products, if available.
- links: Extracts and deduplicates product links from the current page.
- yield scrapy.Request: Sends a request for each product link to parse_product.
- response.follow: Follows the link to the next page and repeats the process if a next_page is found.
Parsing Product Details
def parse_product(self, response):
    category = response.meta['category']
    title = response.xpath('.//*[@class="product_title entry-title"]/text()').extract_first()
    # Variation column headers (one per variation) from the grid's header row
    thead = response.xpath('.//*[@class="variation_grid"]/thead').extract_first()
    variations = (scrapy.Selector(text=thead)
                  .xpath('.//th[1]/following-sibling::th/text()').extract()
                  if thead else [])
    rows = response.xpath('.//*[@class="variation_grid_table_wrapper"]//tbody/tr').extract()
    today_date = datetime.datetime.now().strftime('%Y-%m-%d')
    for row in rows:
        sel = scrapy.Selector(text=row)
        try:
            options = sel.xpath('.//td[1]/text()').extract_first().strip()
        except AttributeError:  # extract_first() returned None
            options = ''
        for i in range(len(variations)):
            try:
                in_stock = sel.xpath(
                    './/td[' + str(i + 2) + ']/div[3]/text() | '
                    './/td[' + str(i + 2) + ']/div[3]/p/text()'
                ).extract_first().replace(' in Stock', '').replace('in stock', '').strip()
            except AttributeError:
                in_stock = ''
            sku = options + ' : ' + variations[i]
            records = self.alldata.setdefault(response.url, {}).setdefault(sku, [])
            # Record today's snapshot unless one already exists for this SKU
            if today_date not in [r['date'] for r in records]:
                records.append({
                    'title': title,
                    'date': today_date,
                    'options': options,
                    'category': category,
                    'in_stock': in_stock,
                })
                self.writetojson()
            # One stock value per tracked date ('' where no record exists)
            temp = []
            for date in self.dates:
                matches = [r['in_stock'] for r in records if r['date'] == date]
                temp.append(matches[0] if matches else '')
            print({'header': ['title', 'department', 'Variations'] + self.dates})
            print({'data': [title, category, sku] + temp})
- parse_product: Extracts detailed product information: the title, the variation column headers from the grid, and the stock level of each row/variation cell. For every URL-and-SKU pair it appends at most one dated record per day to alldata, persists the file via writetojson, and prints a header row plus a per-date stock row for quick inspection.
Key Features of the EI Vape Scraper
- Custom Settings: The spider is configured with custom settings to handle download delays and autothrottle, ensuring that it interacts with the website politely and efficiently.
- Dynamic URL Handling: The scraper targets two main categories—disposables and hardware—each with its specific URL, allowing for flexible scraping based on category needs.
- Data Management: The scraper maintains a JSON file (alldata.json) to track product data over time. This includes information about the product’s title, category, variations, and stock status (an illustrative example of the file follows this list).
- Asynchronous Requests: Scrapy’s asynchronous nature is utilized to follow links and parse pages efficiently, significantly speeding up the data extraction process.
- Data Parsing and Storage: Product data is parsed using XPath selectors, extracting details such as product title, SKU, and stock availability. This data is then appended to the JSON file, with historical stock data tracked by date.
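For concreteness, alldata.json is keyed by product URL, then by an “options : variation” SKU string, with one dated record appended per day. With made-up values, an entry looks like this:

{
    "https://eivape.com/product/example-disposable/": {
        "Flavor A : Single": [
            {
                "title": "Example Disposable",
                "date": "2024-06-01",
                "options": "Flavor A",
                "category": "disposables",
                "in_stock": "42"
            },
            {
                "title": "Example Disposable",
                "date": "2024-06-02",
                "options": "Flavor A",
                "category": "disposables",
                "in_stock": "37"
            }
        ]
    }
}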
Benefits of Using Scrapy for Web Scraping
- Scalability: Scrapy can handle large-scale scraping tasks efficiently, making it suitable for projects that require extensive data collection.
- Flexibility: The framework allows easy customization and extension to adapt to various scraping needs and website structures.
- Efficiency: With its asynchronous architecture, Scrapy fetches pages concurrently rather than one at a time, making it substantially faster than synchronous approaches for large crawls.
Conclusion
The EI Vape Scraper demonstrates the power and versatility of Scrapy in automating web scraping tasks for e-commerce data. By leveraging Scrapy’s robust features, you can efficiently gather and manage data from various websites, unlocking valuable insights for business analysis and decision-making.
Whether you are new to web scraping or looking to optimize your existing processes, Scrapy provides a comprehensive solution for extracting and managing web data effectively.