Web Scraping Financial Data from Bizstats with Scrapy

Web scraping is a powerful technique for extracting data from websites. This blog post details how to scrape financial data from Bizstats using Python’s Scrapy library. We will walk through a script that captures various financial metrics for different industries, saving the results into a CSV file.

Introduction to Web Scraping with Scrapy

Scrapy is an open-source web crawling framework for Python. It is designed to extract the data from web pages in a systematic and efficient way. In this tutorial, we will use Scrapy to scrape financial data from Bizstats, a website that provides detailed financial statistics for various industries.

Prerequisites

Before you begin, make sure you have Python and Scrapy installed. You can install Scrapy using pip:

pip install scrapy

You will also need the html2text and csv libraries. You can install html2text with pip:

pip install html2text

The csv module is part of Python’s standard library, so no installation is needed.

Script Overview

The script defines a Scrapy spider that navigates Bizstats, extracts financial data for various industries, and writes the data to a CSV file.

Imports and Initial Setup

First, we import the necessary libraries and set up the HTML to text converter.

import scrapy
import html2text
import csv

h = html2text.HTML2Text()

Spider Definition

The spider class is defined to handle the crawling and data extraction process. The start_urls list contains the URL from which the spider will start crawling.

class BizstatsSpider(scrapy.Spider):
    name = 'bizstats'
    allowed_domains = ['bizstats.com']
    start_urls = ['http://bizstats.com/sole-proprietorship-business-financials/industries.php']

CSV Initialization

Before the spider starts, we initialize the CSV file with column headers.

    with open("first.csv", "a") as f:
        writer = csv.writer(f)
        writer.writerow(['industry', 'sales', 'inventory', 'cost_of_sales', 'cos', 'gross_profit', 'salary_wages',
                         'contract_labor_commission', 'rent', 'taxes', 'interest_paid', 'amort_dept', 'advertising',
                         'benefits_pension', 'insurance', 'home_office_expense', 'other_sga_expense', 'total_expenses',
                         'net_profit', 'total_direct_labor_np'])

Parsing the Industry Links

The parse method extracts links to individual industry pages and initiates a request to scrape data from each link.

    def parse(self, response, *args):
        datas = response.xpath('.//ul/li/a').extract()
        for data in datas:
            sel = scrapy.Selector(text=data)
            link = sel.xpath('.//a/@href').extract_first()
            cat = sel.xpath('.//a/text()').extract_first()
            yield scrapy.Request(response.urljoin(link), callback=self.getdatas, meta={'cat': cat})

Parsing the Financial Data

The getdatas method extracts financial data from each industry’s page and writes it to the CSV file.

    def getdatas(self, response):
        datas = response.xpath('.//ul/li/a').extract()
        for data in datas:
            sel = scrapy.Selector(text=data)
            link = sel.xpath('.//a/@href').extract_first()
            cat = sel.xpath('.//a/text()').extract_first()
            yield scrapy.Request(response.urljoin(link), callback=self.getdatas, meta={'cat': cat})

        if 'Income-Expense Statement' in response.text:
            cat = response.meta.get('cat')
            sales = response.xpath('.//td[contains(.,"Sales")]/following-sibling::td/text()').extract_first()
            inventory = response.xpath('.//td[contains(.,"Inventory")]/following-sibling::td/text()').extract_first()
            cost_of_sales = response.xpath('.//td[contains(.,"Cost of Sales")]/following-sibling::td/text()').extract_first()
            cos = response.xpath('.//td[contains(.,"COS-Labor Portion")]/following-sibling::td/text()').extract_first()
            gross_profit = response.xpath('.//td[contains(.,"Gross Profit")]/following-sibling::td/text()').extract_first()
            salary_wages = response.xpath('.//td[contains(.,"Salary-Wages")]/following-sibling::td/text()').extract_first()
            contract_labor_commission = response.xpath('.//td[contains(.,"Contract Labor-Commissions")]/following-sibling::td/text()').extract_first()
            rent = response.xpath('.//td[contains(.,"Rent")]/following-sibling::td/text()').extract_first()
            taxes = response.xpath('.//td[contains(.,"Taxes")]/following-sibling::td/text()').extract_first()
            interest_paid = response.xpath('.//td[contains(.,"Interest paid")]/following-sibling::td/text()').extract_first()
            amort_dept = response.xpath('.//td[contains(.,"Amort. & Dep.")]/following-sibling::td/text()').extract_first()
            advertising = response.xpath('.//td[contains(.,"Advertising")]/following-sibling::td/text()').extract_first()
            benefits_pension = response.xpath('.//td[contains(.,"Benefits-Pension")]/following-sibling::td/text()').extract_first()
            insurance = response.xpath('.//td[contains(.,"Insurance (non-health)")]/following-sibling::td/text()').extract_first()
            home_office_expense = response.xpath('.//td[contains(.,"Home Office Expense")]/following-sibling::td/text()').extract_first()
            other_sga_expense = response.xpath('.//td[contains(.,"Other SG&A Exp.")]/following-sibling::td/text()').extract_first()
            total_expenses = response.xpath('.//td[contains(.,"Total Expenses")]/following-sibling::td/text()').extract_first()
            net_profit = response.xpath('.//td[contains(.,"Net Profit")]/following-sibling::td/text()').extract_first()
            total_direct_labor_np = response.xpath('.//td[contains(.,"Total Direct Labor & NP")]/following-sibling::td/text()').extract_first()

            with open("first.csv", "a") as f:
                writer = csv.writer(f)
                writer.writerow([cat, sales, inventory, cost_of_sales, cos, gross_profit, salary_wages, contract_labor_commission,
                                 rent, taxes, interest_paid, amort_dept, advertising, benefits_pension, insurance,
                                 home_office_expense, other_sga_expense, total_expenses, net_profit, total_direct_labor_np])
                print([cat, sales, inventory, cost_of_sales, cos, gross_profit, salary_wages, contract_labor_commission,
                       rent, taxes, interest_paid, amort_dept, advertising, benefits_pension, insurance, home_office_expense,
                       other_sga_expense, total_expenses, net_profit, total_direct_labor_np])

Running the Spider

To run the spider, save the script as bizstats_spider.py and execute the following command in your terminal:

scrapy runspider bizstats_spider.py

This will start the scraping process, and the data will be saved into a file named first.csv.

Conclusion

This tutorial demonstrated how to use Scrapy to scrape financial data from Bizstats. The script navigates through the site, extracts financial metrics for various industries, and stores the data in a CSV file. This automated approach can save time and effort compared to manual data collection, providing a scalable solution for gathering financial statistics from the web.

Leave a Comment

Your email address will not be published. Required fields are marked *

Shopping Cart