Automating Leafly Dispensary Data Collection with Python

Introduction

In the ever-expanding world of data, automation is key to efficiently gathering and processing large volumes of information. This blog post walks through a Python script that automates the collection of dispensary data from the Leafly website, using libraries such as requests, parsel, csv, and json. We’ll explore the process step by step so that, whether you’re on Windows or another operating system, you can set up and run the script with minimal effort.

The Script Breakdown

1. Environment Setup and Dependency Management

The script begins by checking the operating system and installing the parsel library if it’s not already installed. This ensures that the script can run on both Windows and non-Windows systems without manual intervention.

import platform
import os

# Pick the pip command for the current OS and make sure parsel is available;
# if the import fails, install the package and import it again.
if platform.system() == "Windows":
    try:
        from parsel import Selector
    except ImportError:
        os.system('python -m pip install parsel')
        from parsel import Selector
else:
    try:
        from parsel import Selector
    except ImportError:
        os.system('python3 -m pip install parsel')
        from parsel import Selector
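If you prefer not to shell out with os.system, a slightly more robust variant (a sketch, not part of the original script) uses importlib and subprocess with the running interpreter, which avoids guessing between the python and python3 command names:

import importlib
import subprocess
import sys

def ensure_parsel():
    # Try to import parsel; install it with the current interpreter's pip if absent.
    try:
        importlib.import_module("parsel")
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "parsel"])

ensure_parsel()
from parsel import Selector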

2. Importing Necessary Libraries

Next, the script imports essential libraries required for web scraping, data handling, and HTTP requests.

import json      # decode the API's JSON responses
import csv       # write rows to leafly_finder.csv
import requests  # HTTP requests to the Leafly API and store pages
from parsel import Selector  # XPath parsing for the e-mail lookup
import os        # file-system checks for the output CSV

3. Defining Request Headers

The script sets up custom headers to mimic a legitimate browser request, which helps avoid potential blocks from the server while scraping. The nested structure below is copied straight from the browser’s developer tools, so a short loop afterwards flattens it into a plain name-to-value dictionary (base_headers) that requests can use.

headers = {
    "Request Headers (1.739 KB)": {
        "headers": [
            {"name": "Accept", "value": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"},
            {"name": "Accept-Encoding", "value": "gzip, deflate, br"},
            {"name": "Accept-Language", "value": "en-US,en;q=0.5"},
            {"name": "Cache-Control", "value": "no-cache"},
            {"name": "Connection", "value": "keep-alive"},
            {"name": "Cookie", "value": "__cfduid=..."},
            {"name": "Host", "value": "www.leafly.com"},
            {"name": "Pragma", "value": "no-cache"},
            {"name": "TE", "value": "Trailers"},
            {"name": "Upgrade-Insecure-Requests", "value": "1"},
            {"name": "User-Agent", "value": "Mozilla/5.0 ..."}
        ]
    }
}

# Flatten the dev-tools export into a plain {name: value} mapping
# suitable for the headers= argument of requests.get().
base_headers = {}
for key in headers.keys():
    for data in headers[key]['headers']:
        base_headers[data['name']] = data['value']
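As a quick usage note (a minimal sketch, not part of the original script), base_headers can be attached to a requests.Session so every subsequent call reuses the same browser-like headers and underlying connection:

import requests

session = requests.Session()
session.headers.update(base_headers)  # base_headers as built above

# Any request made through the session now carries the spoofed headers.
response = session.get("https://www.leafly.com/")
print(response.status_code)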

4. Initializing the CSV File

The script checks if a CSV file named leafly_finder.csv exists in the current working directory. If not, it creates one and writes the header row.

if __name__ == "__main__":
    # Create the output CSV with a header row only if it does not exist yet;
    # subsequent runs append data rows to the existing file.
    if 'leafly_finder.csv' not in os.listdir(os.getcwd()):
        with open("leafly_finder.csv", "a", encoding="utf-8", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(['name', 'address', 'city', 'location', 'zip', 'phone', 'flags', 'email', 'image', 'link'])
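An equivalent, slightly more direct existence check (a small sketch, assuming the same file name and columns) uses os.path.isfile and opens the file in write mode only when creating it:

import csv
import os

CSV_PATH = "leafly_finder.csv"
COLUMNS = ['name', 'address', 'city', 'location', 'zip', 'phone', 'flags', 'email', 'image', 'link']

if not os.path.isfile(CSV_PATH):
    # Write the header row once; later appends add the data rows.
    with open(CSV_PATH, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerow(COLUMNS)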

5. Fetching and Parsing Data

The script defines a URL to fetch dispensary data and retrieves the total number of pages available. It then iterates through each page, extracting relevant information about each dispensary and writing it to the CSV file.

    # First request: discover how many pages of results the bounding box contains.
    url = "https://web-finder.leafly.com/api/search-this-area?topLeftLat=54.00305492458148&topLeftLon=-145.25113830000004&bottomRightLat=25.07833452918743&bottomRightLon=-43.29801330000004&userLat=40.8364&userLon=-74.1403&retailType=dispensary&page=1&limit=10"
    pages = json.loads(requests.get(url, headers=base_headers).text)['dispensaries']['pageCount']

The query string defines the bounding box of the search area (topLeftLat/topLeftLon and bottomRightLat/bottomRightLon), the user’s reference location (userLat/userLon), the retail type, and the page size (limit=10).
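Rather than hard-coding the long query string, the same request can be built from a params dictionary, which makes the bounding box easier to adjust. The sketch below reuses the values from the script; it is an alternative, not what the original code does:

import requests

API_URL = "https://web-finder.leafly.com/api/search-this-area"
params = {
    "topLeftLat": 54.00305492458148,
    "topLeftLon": -145.25113830000004,
    "bottomRightLat": 25.07833452918743,
    "bottomRightLon": -43.29801330000004,
    "userLat": 40.8364,
    "userLon": -74.1403,
    "retailType": "dispensary",
    "page": 1,
    "limit": 10,
}

# requests encodes the params dict into the same query string as the hard-coded URL;
# base_headers from step 3 can also be passed via headers= if desired.
pages = requests.get(API_URL, params=params).json()['dispensaries']['pageCount']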

6. Parsing and Storing Data

For each page, the script sends a request to the Leafly API, parses the JSON response, and extracts relevant data fields. It then writes the data to the CSV file.

    for page in range(1, pages + 1):
        # Fetch one page of results (same bounding box as above, different page number).
        jsondata = json.loads(requests.get(f"https://web-finder.leafly.com/api/search-this-area?topLeftLat=54.00305492458148&topLeftLon=-145.25113830000004&bottomRightLat=25.07833452918743&bottomRightLon=-43.29801330000004&userLat=40.8364&userLon=-74.1403&retailType=dispensary&page={page}&limit=10", headers=base_headers).text)

        for data in jsondata['dispensaries']['stores']:
            name = data['name']
            address = data['address1']
            city = data['city']
            location = data['formattedShortLocation']
            zip_code = data['zip']
            phone = data['phone']
            flags = ', '.join(data['flags'])
            image = data['coverImage']
            link = f'https://www.leafly.com/cbd-store/{data["slug"]}'
            print(name)

            # The e-mail address is not part of the API response, so fetch the store
            # page and look for a mailto: link; fall back to an empty string.
            try:
                email = Selector(text=requests.get(link, headers=base_headers).text).xpath('.//a/@href[contains(.,"mailto:")]').extract_first().replace('mailto:', '')
            except Exception:
                email = ''

            with open("leafly_finder.csv", "a", encoding="utf-8", newline="") as f:
                writer = csv.writer(f)
                writer.writerow([name, address, city, location, zip_code, phone, flags, email, image, link])
                print([name, address, city, location, zip_code, phone, flags, email, image, link])

This loop iterates through each page, sending requests and processing the data into a structured format suitable for analysis or reporting.
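One practical addition worth considering (not part of the original script) is a short pause and retry around each request, so the scraper stays gentle on the Leafly servers and survives transient network errors. A minimal sketch, assuming a hypothetical polite_get helper, might look like this:

import time
import requests

def polite_get(url, headers=None, retries=3, delay=1.0):
    # Fetch a URL, retrying on failure with a fixed delay between attempts;
    # re-raise the last error if every attempt fails.
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=15)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# Example usage: swap the direct requests.get calls in the loop for polite_get,
# and sleep briefly after each page.
# jsondata = polite_get(page_url, headers=base_headers).json()
# time.sleep(1)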

Conclusion

This Python script is a robust solution for automating the extraction and storage of dispensary data from the Leafly website. By setting up appropriate headers, handling JSON responses, and writing to a CSV file, this script ensures that the data collection process is efficient and reliable. Whether you’re using this data for analysis, market research, or integration into a larger project, this script provides a solid foundation for your web scraping needs.
