Web scraping is a powerful technique for extracting data from websites. This blog post details how to scrape financial data from Bizstats using Python’s Scrapy library. We will walk through a script that captures various financial metrics for different industries, saving the results into a CSV file.
Introduction to Web Scraping with Scrapy
Scrapy is an open-source web crawling framework for Python, designed to extract data from web pages in a systematic and efficient way. In this tutorial, we will use Scrapy to scrape financial data from Bizstats, a website that provides detailed financial statistics for various industries.
Prerequisites
Before you begin, make sure you have Python and Scrapy installed. You can install Scrapy using pip:
pip install scrapy
You will also need the html2text and csv libraries. You can install html2text with pip:
pip install html2text
The csv module is part of Python’s standard library, so no installation is needed.
Script Overview
The script defines a Scrapy spider that navigates Bizstats, extracts financial data for various industries, and writes the data to a CSV file.
Imports and Initial Setup
First, we import the necessary libraries and set up the HTML to text converter.
import scrapy
import html2text
import csv

# Set up an HTML-to-text converter (defined here for optional text cleanup;
# the spider below extracts values directly via XPath).
h = html2text.HTML2Text()
Spider Definition
The spider class is defined to handle the crawling and data extraction process. The start_urls list contains the URL from which the spider will start crawling.
class BizstatsSpider(scrapy.Spider):
    name = 'bizstats'
    allowed_domains = ['bizstats.com']
    start_urls = ['http://bizstats.com/sole-proprietorship-business-financials/industries.php']
CSV Initialization
Before the spider starts, we initialize the CSV file with column headers.
# Write the header row before the crawl starts. Note that append mode ("a")
# adds a duplicate header on every run; delete first.csv before re-running.
with open("first.csv", "a") as f:
    writer = csv.writer(f)
    writer.writerow(['industry', 'sales', 'inventory', 'cost_of_sales', 'cos', 'gross_profit', 'salary_wages',
                     'contract_labor_commission', 'rent', 'taxes', 'interest_paid', 'amort_dept', 'advertising',
                     'benefits_pension', 'insurance', 'home_office_expense', 'other_sga_expense', 'total_expenses',
                     'net_profit', 'total_direct_labor_np'])
Parsing the Industry Links
The parse method extracts links to individual industry pages and initiates a request to scrape data from each link.
    def parse(self, response, *args):
        # Extract every <a> element from the industry list.
        datas = response.xpath('.//ul/li/a').extract()
        for data in datas:
            # Re-parse each link fragment to pull out its href and text.
            sel = scrapy.Selector(text=data)
            link = sel.xpath('.//a/@href').extract_first()
            cat = sel.xpath('.//a/text()').extract_first()
            # Pass the category name along in the request meta.
            yield scrapy.Request(response.urljoin(link), callback=self.getdatas, meta={'cat': cat})
Parsing the Financial Data
The getdatas method extracts financial data from each industry’s page and writes it to the CSV file.
    def getdatas(self, response):
        # Follow any further sub-category links, recursing into this same callback.
        datas = response.xpath('.//ul/li/a').extract()
        for data in datas:
            sel = scrapy.Selector(text=data)
            link = sel.xpath('.//a/@href').extract_first()
            cat = sel.xpath('.//a/text()').extract_first()
            yield scrapy.Request(response.urljoin(link), callback=self.getdatas, meta={'cat': cat})

        # Only pages with an "Income-Expense Statement" table carry the financials.
        if 'Income-Expense Statement' in response.text:
            cat = response.meta.get('cat')
            # Each metric label sits in a <td>; its value is in the following sibling <td>.
            sales = response.xpath('.//td[contains(.,"Sales")]/following-sibling::td/text()').extract_first()
            inventory = response.xpath('.//td[contains(.,"Inventory")]/following-sibling::td/text()').extract_first()
            cost_of_sales = response.xpath('.//td[contains(.,"Cost of Sales")]/following-sibling::td/text()').extract_first()
            cos = response.xpath('.//td[contains(.,"COS-Labor Portion")]/following-sibling::td/text()').extract_first()
            gross_profit = response.xpath('.//td[contains(.,"Gross Profit")]/following-sibling::td/text()').extract_first()
            salary_wages = response.xpath('.//td[contains(.,"Salary-Wages")]/following-sibling::td/text()').extract_first()
            contract_labor_commission = response.xpath('.//td[contains(.,"Contract Labor-Commissions")]/following-sibling::td/text()').extract_first()
            rent = response.xpath('.//td[contains(.,"Rent")]/following-sibling::td/text()').extract_first()
            taxes = response.xpath('.//td[contains(.,"Taxes")]/following-sibling::td/text()').extract_first()
            interest_paid = response.xpath('.//td[contains(.,"Interest paid")]/following-sibling::td/text()').extract_first()
            amort_dept = response.xpath('.//td[contains(.,"Amort. & Dep.")]/following-sibling::td/text()').extract_first()
            advertising = response.xpath('.//td[contains(.,"Advertising")]/following-sibling::td/text()').extract_first()
            benefits_pension = response.xpath('.//td[contains(.,"Benefits-Pension")]/following-sibling::td/text()').extract_first()
            insurance = response.xpath('.//td[contains(.,"Insurance (non-health)")]/following-sibling::td/text()').extract_first()
            home_office_expense = response.xpath('.//td[contains(.,"Home Office Expense")]/following-sibling::td/text()').extract_first()
            other_sga_expense = response.xpath('.//td[contains(.,"Other SG&A Exp.")]/following-sibling::td/text()').extract_first()
            total_expenses = response.xpath('.//td[contains(.,"Total Expenses")]/following-sibling::td/text()').extract_first()
            net_profit = response.xpath('.//td[contains(.,"Net Profit")]/following-sibling::td/text()').extract_first()
            total_direct_labor_np = response.xpath('.//td[contains(.,"Total Direct Labor & NP")]/following-sibling::td/text()').extract_first()

            # Append one row per industry page, and echo it for progress feedback.
            row = [cat, sales, inventory, cost_of_sales, cos, gross_profit, salary_wages, contract_labor_commission,
                   rent, taxes, interest_paid, amort_dept, advertising, benefits_pension, insurance,
                   home_office_expense, other_sga_expense, total_expenses, net_profit, total_direct_labor_np]
            with open("first.csv", "a") as f:
                csv.writer(f).writerow(row)
            print(row)
Running the Spider
To run the spider, save the script as bizstats_spider.py and execute the following command in your terminal:
scrapy runspider bizstats_spider.py
This will start the scraping process, and the data will be saved into a file named first.csv.
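Once the crawl finishes, the output can be inspected with the standard csv module. A minimal sketch against in-memory sample data (the header subset and row values below are invented stand-ins for first.csv):

```python
import csv
import io

# Invented sample standing in for the contents of first.csv.
sample = io.StringIO(
    "industry,sales,net_profit\n"
    "Retail,100.0%,12.4%\n"
)

rows = list(csv.reader(sample))
header, first_row = rows[0], rows[1]
print(header)
print(first_row)
```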
Conclusion
This tutorial demonstrated how to use Scrapy to scrape financial data from Bizstats. The script navigates through the site, extracts financial metrics for various industries, and stores the data in a CSV file. This automated approach can save time and effort compared to manual data collection, providing a scalable solution for gathering financial statistics from the web.