Efficiently Scrape Weedmaps Data with Scrapy Using a Proxy

Introduction

In today’s data-driven world, web scraping has become an essential tool for extracting valuable information from websites. If you want to gather business data efficiently, Python libraries such as Scrapy and BeautifulSoup can be incredibly effective. In this guide, we walk through a detailed example of how to scrape business information from a website using these tools. Let’s dive into scraping Weedmaps.

Prerequisites

Before diving into the code, make sure you have the following:

  • Basic knowledge of Python
  • Familiarity with web scraping concepts
  • Python installed on your system
  • The required Python libraries: Scrapy, BeautifulSoup (beautifulsoup4 on PyPI), and fake_headers (fake-headers on PyPI)

Code Walkthrough

Below is a comprehensive breakdown of a Scrapy spider designed to scrape business data from a website, including contact details and opening hours.

Importing Libraries

import scrapy
import csv
import time
from bs4 import BeautifulSoup
import json
import os
from fake_headers import Headers
import random

In this script, we import several libraries:

  • scrapy: The main library for web scraping.
  • csv: For reading and writing CSV files.
  • time and random: Imported for pacing and randomizing requests, although the excerpt shown does not call them.
  • BeautifulSoup: For parsing HTML and XML documents.
  • json: For handling the JSON-LD data embedded in each listing page.
  • os: For interacting with the operating system (checking whether the output file exists).
  • fake_headers: For generating random HTTP headers.

Setting Up HTTP Headers and Proxies

headers = {
    "Host": "weedmaps.com",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "If-None-Match": 'W/"514af-3RoR9eELSyEvLlM2jxJ3z1sIgkw"',
    "TE": "Trailers"
}

API_KEY = "f5249954f70f58569957aa74e0319ef2"  # replace with your own ScraperAPI key
proxies = "http://scraperapi:" + API_KEY + "@proxy-server.scraperapi.com:8001"

Here, we configure HTTP headers that mimic a real browser and set up a ScraperAPI proxy so that every request is routed through its proxy server, reducing the risk of IP bans.
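The script imports fake_headers but the excerpt only ever sends the hand-written headers dict above. If you want to rotate headers per request instead, a minimal sketch (assuming fake_headers' Headers class and its generate() method) could look like this:

from fake_headers import Headers

# Factory for randomized, realistic-looking headers.
# headers=True adds Accept/Accept-Language fields alongside the User-Agent.
header_factory = Headers(browser="firefox", headers=True)

def random_headers():
    # Each call returns a fresh dict such as {"User-Agent": ..., "Accept": ...}.
    generated = header_factory.generate()
    # Keep the Host header the target site expects.
    generated["Host"] = "weedmaps.com"
    return generated

You could then pass headers=random_headers() when building each request in start_requests instead of reusing the static dict.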

Defining the Scrapy Spider

class AppSpider(scrapy.Spider):
    name = 'app'
    allowed_domains = ['weedmaps.com']
    start_urls = ['http://weedmaps.com/']

    filename = "US"
    
    if filename+'.csv' not in os.listdir(os.getcwd()):
        with open(filename+'.csv','a') as f:
            writer = csv.writer(f)
            writer.writerow(['Url','Name','City','Type','Email','Telephone','Street Address','Address Locality','State','Postal Code',
                             'image','description','website',
                             'Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'])

In this section:

  • We define the Scrapy spider with a name and allowed domain.
  • start_urls: The initial URL from which the scraping starts.
  • filename: Defines the CSV file where data will be saved.
  • If the CSV file does not exist, a new one is created and the header row is written. Because this block sits at class level, it runs once when the class is defined (a slightly more idiomatic variant is sketched below).
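As a sketch of that more idiomatic variant, the existence check can use os.path.isfile directly and open the file with newline='' so the csv module controls line endings; the behaviour is otherwise the same as the class-level block above:

import csv
import os

filename = "US"
csv_path = filename + ".csv"

# Create the output file with a header row only if it does not exist yet.
if not os.path.isfile(csv_path):
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(['Url', 'Name', 'City', 'Type', 'Email', 'Telephone',
                         'Street Address', 'Address Locality', 'State', 'Postal Code',
                         'image', 'description', 'website',
                         'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday',
                         'Friday', 'Saturday'])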

Managing Links and Requests

    # Collect URLs already written to the output CSV so re-runs can skip them.
    links = []
    try:
        with open(filename+'.csv','r') as q:
            reader = csv.reader(q)
            next(reader)  # skip the header row
            for line in reader:
                links.append(line[0])
    except:
        pass

    u_links = list(set(links))  # de-duplicated list of already-scraped URLs

    def start_requests(self):
        # The link_scraper CSV supplies the listing URLs (column 7, index 6)
        # along with the name, state, city and type that travel as request meta.
        with open('../link_scraper/'+str(self.filename)+'.csv','r') as r:
            reader = csv.reader(r)
            next(reader)  # skip the header row
            for line in reader:
                if line[6] not in self.u_links:
                    request = scrapy.FormRequest(url=line[6], method="GET", callback=self.parse, meta={
                        'name': line[1],
                        'state': line[3],
                        'city': line[4],
                        'type': line[5]
                    }, headers=headers)
                    # Route the request through the ScraperAPI proxy.
                    request.meta['proxy'] = proxies
                    yield request
                else:
                    print("Exists ...")

In this part:

  • We load existing URLs from the CSV to avoid redundant requests.
  • start_requests: Reads the listing URLs from the link_scraper CSV, attaches the name, state, city, and type as request metadata for later use, and routes every request through the proxy. (A sketch of optional request throttling follows this list.)
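The imports also bring in time and random, which the excerpt never calls. If the intent is to pace requests so neither the proxy nor the site gets hammered, Scrapy's own settings can handle it per spider. Below is a minimal sketch using the standard DOWNLOAD_DELAY, RANDOMIZE_DOWNLOAD_DELAY, CONCURRENT_REQUESTS, and RETRY_TIMES settings; the values shown are arbitrary examples, not recommendations from the original code:

import scrapy

class AppSpider(scrapy.Spider):
    name = 'app'
    allowed_domains = ['weedmaps.com']

    # Per-spider settings: wait roughly 2 seconds between requests
    # (randomized via RANDOMIZE_DOWNLOAD_DELAY) and keep concurrency
    # and retries modest.
    custom_settings = {
        "DOWNLOAD_DELAY": 2,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
        "CONCURRENT_REQUESTS": 4,
        "RETRY_TIMES": 3,
    }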

Parsing the Response

    def parse(self, response):
        # Each listing page embeds its business data as a JSON-LD <script> tag.
        rawdata = response.xpath('.//*[@type="application/ld+json"]/text()').extract_first()

        try:
            email = json.loads(rawdata)['email'].replace('mailto:','')
        except:
            email = ''

        try:
            telephone = json.loads(rawdata)['telephone']
        except:
            telephone = ''

        try:
            streetaddress = json.loads(rawdata)['address']['streetAddress']
        except:
            streetaddress = ''
        
        # Extract additional fields similarly...

        # Append one row per listing; opening hours are stored as 'opens - closes'.
        with open(self.filename+'.csv','a') as f:
            writer = csv.writer(f)
            writer.writerow([response.url,response.meta.get('name'),response.meta.get('city'),response.meta.get('type'),
                             email,telephone,streetaddress,addresslocality,response.meta.get('state'),postalcode,
                             image,description,website,
                             str(sundayopens)+' - '+str(sundaycloses),str(mondayopens)+' - '+str(mondaycloses),
                             str(tuesdayopens)+' - '+str(tuesdaycloses),str(wednesdayopens)+' - '+str(wednesdaycloses),
                             str(thursdayopens)+' - '+str(thursdaycloses),str(fridayopens)+' - '+str(fridaycloses),
                             str(saturdayopens)+' - '+str(saturdaycloses)])

Here:

  • parse: This method handles the response, extracting data from the JSON-LD schema embedded in the HTML.
  • The extracted data, including the business hours, is appended as one row to the CSV file. (A sketch of how the opening-hours fields could be pulled out of the JSON-LD follows this list.)
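The day-by-day variables written above (sundayopens, sundaycloses, and so on) come from the elided part of parse. Assuming the JSON-LD follows schema.org's openingHoursSpecification layout (a list of objects with dayOfWeek, opens, and closes keys), the hours could be pulled out roughly like this; this is a sketch, not the exact code the spider uses:

import json

def extract_hours(rawdata):
    """Return a {day: 'opens - closes'} mapping from a JSON-LD string.

    Sketch only: assumes schema.org-style openingHoursSpecification entries.
    """
    hours = {}
    try:
        data = json.loads(rawdata)
        for spec in data.get("openingHoursSpecification", []):
            days = spec.get("dayOfWeek", [])
            # dayOfWeek may be a single string or a list of names/URLs.
            if isinstance(days, str):
                days = [days]
            for day in days:
                day_name = day.rsplit("/", 1)[-1]  # ".../Monday" -> "Monday"
                hours[day_name] = f"{spec.get('opens', '')} - {spec.get('closes', '')}"
    except (json.JSONDecodeError, TypeError, AttributeError):
        pass
    return hours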

Conclusion

Scraping business data with Scrapy and BeautifulSoup can streamline data collection for various purposes. This guide provided a clear example of how to set up a Scrapy spider, configure HTTP headers and proxies, manage requests, and parse responses to extract valuable business information. By following these steps, you can efficiently gather and organize data from websites.
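If you want to try the spider without generating a full Scrapy project, a minimal runner could look like the sketch below; it assumes the spider class above is importable as AppSpider from a module named app_spider, which is a hypothetical name for this example:

# run.py - launch the spider directly with Scrapy's CrawlerProcess.
from scrapy.crawler import CrawlerProcess

from app_spider import AppSpider  # hypothetical module holding the spider above

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(AppSpider)
process.start()  # blocks until the crawl finishes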

Feel free to adapt the code to fit your specific needs and explore additional features of Scrapy and BeautifulSoup to enhance your web scraping projects.
