Efficient Web Scraping with Python: LoopNet Scraper Script

Web scraping is a powerful tool for gathering data from the web, and when it comes to real estate listings, it can be particularly useful for extracting information from sites like LoopNet. This blog post will walk you through a Python script designed to scrape apartment building listings from LoopNet, leveraging several libraries to handle HTTP requests, HTML parsing, and web automation.

Introduction to Web Scraping

Web scraping involves programmatically extracting data from websites. It’s a valuable skill for data analysts, researchers, and developers who need to collect large amounts of data without manual intervention. In this tutorial, we’ll explore a Python script that scrapes real estate listings from LoopNet, a prominent commercial real estate listing site.

Prerequisites

Before diving into the code, make sure you have Python installed on your system. The script will attempt to install the required libraries automatically, but it helps to have a basic understanding of the following Python libraries (a short example of how two of them fit together follows the list):

  • requests for making HTTP requests
  • parsel for parsing HTML content
  • selenium for automating web browsing
  • undetected_chromedriver for bypassing bot detection
  • fake_headers for generating realistic HTTP headers
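
As a quick taste of how two of these fit together: fake_headers generates a realistic browser header dictionary that requests can send as-is. This is a minimal sketch, not part of the scraper itself:

from fake_headers import Headers
import requests

# Headers().generate() returns a dict with User-Agent, Accept, etc.
headers = Headers().generate()
resp = requests.get("https://example.com", headers=headers)
print(resp.status_code)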

Script Overview

The script is designed to work on both Windows and Unix-based systems. It tries the imports first and, on failure, installs the missing libraries using the interpreter alias appropriate for the operating system (python on Windows, python3 elsewhere) before retrying the imports.

import os
import platform

# "python" on Windows, "python3" on Unix-like systems
PYTHON = "python" if platform.system() == "Windows" else "python3"

try:
    import requests
    from parsel import Selector
    from selenium import webdriver
    import undetected_chromedriver as uc
    import fake_headers
except ImportError:
    for package in ("requests", "parsel", "selenium",
                    "undetected-chromedriver", "fake_headers"):
        os.system(f'{PYTHON} -m pip install {package}')
    # Retry the imports now that the packages are installed
    import requests
    from parsel import Selector
    from selenium import webdriver
    import undetected_chromedriver as uc
    import fake_headers
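
If you would rather not shell out with os.system, the same bootstrap can be done with subprocess and sys.executable, which guarantees the packages land in the interpreter actually running the script. A minimal sketch:

import subprocess
import sys

# Install each dependency into the interpreter running this script
for package in ("requests", "parsel", "selenium",
                "undetected-chromedriver", "fake_headers"):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])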

Detailed Script Breakdown

Imports and Initial Setup

Next, the script imports everything used in the rest of the code. The scraping itself is driven by Firefox; undetected_chromedriver is imported but only comes into play if you adapt the script for Chrome.

from selenium import webdriver
import csv
import os
import time
from parsel import Selector
import requests
from fake_headers import Headers
import datetime
import platform
import json
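
That adaptation is straightforward. If LoopNet starts blocking the stock Firefox driver, undetected_chromedriver can stand in for the Firefox instance with a Chrome browser that patches the usual automation fingerprints. This is a sketch, not part of the script as written:

import undetected_chromedriver as uc

# Launches Chrome with the common Selenium fingerprints patched out;
# a drop-in replacement for the webdriver.Firefox() call used later
driver = uc.Chrome()
driver.get("https://www.loopnet.com")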

Function to Start Scraping

The getstart function loads a results page, pulls the listing URLs out of the JSON-LD block LoopNet embeds in the page, scrapes each listing's detail table, appends the results to a CSV file, and then recurses to the next page.

def getstart(url, county, page):
    # Load the requested results page and give it time to render
    driver.get(url + '/' + str(page))
    time.sleep(5)

    # LoopNet embeds the search results as JSON-LD; pull the listing URLs from it
    try:
        jsondata = json.loads(Selector(text=driver.page_source).xpath('.//*[@type="application/ld+json"]/text()[contains(.,"SearchResultsPage")]').extract_first())
        links = [i['item']['url'] for i in jsondata['about']]
    except Exception:
        links = []

    # When a page holds a single result, 'about' is an object rather than a list
    if not links:
        try:
            jsondata = json.loads(Selector(text=driver.page_source).xpath('.//*[@type="application/ld+json"]/text()[contains(.,"SearchResultsPage")]').extract_first())
            links = [jsondata['about']['url']]
        except Exception:
            links = []

    print(len(links))
    for link in links:
        if link not in alreadyscrapped:
            alreadyscrapped.append(link)
            date_time = datetime.datetime.now().strftime('%Y-%m-%d')  # four-digit year keeps dates unambiguous
            # Fetch the detail page with plain requests and randomized headers
            response = Selector(text=requests.get(link, headers=Headers().generate()).text)
            time.sleep(2)
            # Extract details
            try:
                title = ', '.join(Selector(text=response.xpath('.//*[@class="profile-hero-heading"]/h1').extract_first()).xpath('.//span/text()').extract())
            except Exception:
                title = ''
            try:
                location = ' '.join(response.xpath('.//*[@id="breadcrumb-section"]/h1/text()').extract())
            except Exception:
                location = ''
            price = response.xpath('.//td[contains(.,"Price")]/following-sibling::td/span/text()').extract_first()
            apartment_style = response.xpath('.//td[contains(.,"Apartment Style")]/following-sibling::td/span/text()').extract_first()
            price_per_unit = response.xpath('.//td[contains(.,"Price Per Unit")]/following-sibling::td/span/text()').extract_first()
            building_class = response.xpath('.//td[contains(.,"Building Class")]/following-sibling::td/span/text()').extract_first()
            sale_type = response.xpath('.//td[contains(.,"Sale Type")]/following-sibling::td/span/text()').extract_first()
            lot_size = response.xpath('.//td[contains(.,"Lot Size")]/following-sibling::td/span/text()').extract_first()
            cap_rate = response.xpath('.//td[contains(.,"Cap Rate")]/following-sibling::td/span/text()').extract_first()
            building_size = response.xpath('.//td[contains(.,"Building Size")]/following-sibling::td/span/text()').extract_first()
            sale_conditions = response.xpath('.//td[contains(.,"Sale Conditions")]/following-sibling::td/span/text()').extract_first()
            average_occupancy = response.xpath('.//td[contains(.,"Average Occupancy")]/following-sibling::td/span/text()').extract_first()
            no_units = response.xpath('.//td[contains(.,"No. Units")]/following-sibling::td/span/text()').extract_first()
            no_stories = response.xpath('.//td[contains(.,"No. Stories")]/following-sibling::td/span/text()').extract_first()
            property_type = response.xpath('.//td[contains(.,"Property Type")]/following-sibling::td/span/text()').extract_first()
            year_build = response.xpath('.//td[contains(.,"Year Built/Renovated")]/following-sibling::td/span/text()').extract_first()
            property_subtype = response.xpath('.//td[contains(.,"Property Subtype")]/following-sibling::td/span/text()').extract_first()

            # Save data to CSV
            with open("loopnet.csv", "a", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                writer.writerow([date_time, county, link, title, location, price, apartment_style, price_per_unit, building_class, sale_type, lot_size, cap_rate, building_size, sale_conditions,
                                 average_occupancy, no_units, no_stories, property_type, year_build, property_subtype])
                print([date_time, county, link, title, location, price, apartment_style, price_per_unit, building_class, sale_type, lot_size, cap_rate, building_size, sale_conditions,
                       average_occupancy, no_units, no_stories, property_type, year_build, property_subtype])
        else:
            print("Exists ...")

    # Recurse to the next results page; an empty page ends the crawl for this county
    if links:
        getstart(url, county, page + 1)
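
Every detail field above uses the same XPath shape: find the td whose text contains the label, then read the span in the sibling td. That repetition is easy to factor into a small helper; here is a sketch (the labels are the same ones used above):

def field(response, label):
    # Generic "label cell -> value cell" lookup for LoopNet's detail table
    return response.xpath(
        f'.//td[contains(.,"{label}")]/following-sibling::td/span/text()'
    ).extract_first()

# Usage inside the loop:
#   price = field(response, 'Price')
#   cap_rate = field(response, 'Cap Rate')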

Main Execution Block

The script creates the CSV file with a header row if it doesn't exist, reloads previously scraped links so reruns skip them, and starts the web driver. It then iterates through a dictionary mapping search URLs to county names, calling the getstart function for each one.

if __name__ == '__main__':
    # Create the CSV with a header row on the first run
    if not os.path.exists("loopnet.csv"):
        with open("loopnet.csv", "a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(['date', 'county', 'link', 'title', 'location', 'price', 'apartment_style', 'price_per_unit', 'building_class', 'sale_type', 'lot_size', 'cap_rate', 'building_size',
                             'sale_conditions', 'average_occupancy', 'no_units', 'no_stories', 'property_type', 'year_build', 'property_subtype'])

    # Reload previously scraped links so reruns skip them
    alreadyscrapped = []
    with open("loopnet.csv", "r") as r:
        reader = csv.reader(r)
        for line in reader:
            if len(line) > 2:
                alreadyscrapped.append(line[2])

    # On Linux geckodriver is expected on PATH; elsewhere it is loaded from the script's directory
    if platform.system() == "Linux":
        driver = webdriver.Firefox()
    else:
        driver = webdriver.Firefox(executable_path=os.getcwd() + '/geckodriver')

    urls = {
        'https://www.loopnet.com/search/apartment-buildings/los-angeles-county-ca/for-sale/': 'Los Angeles',
        'https://www.loopnet.com/search/apartment-buildings/fresno-county-ca/for-sale/': 'Fresno',
        'https://www.loopnet.com/search/apartment-buildings/kings-county-ca/for-sale': 'Kings',
        'https://www.loopnet.com/search/apartment-buildings/tulare-county-ca/for-sale': 'Tulare',
        'https://www.loopnet.com/search/apartment-buildings/madera-county-ca/for-sale/': 'Madera',
        'https://www.loopnet.com/search/apartment-buildings/monterey-county-ca/for-sale': 'Monterey',
        'https://www.loopnet.com/search/apartment-buildings/san-benito-county-ca/for-sale': 'San-Benito',
        'https://www.loopnet.com/search/apartment-buildings/kern-county-ca/for-sale': 'Kern',
        'https://www.loopnet.com/search/apartment-buildings/merced-county-ca/for-sale': 'Merced',
        'https://www.loopnet.com/search/apartment-buildings/sutter-county-ca/for-sale': 'Sutter',
        'https://www.loopnet.com/search/apartment-buildings/sacramento-county-ca/for-sale': 'Sacramento',
        'https://www.loopnet.com/search/apartment-buildings/el-dorado-county-ca/for-sale': 'El Dorado',
        'https://www.loopnet.com/search/apartment-buildings/amador-county-ca/for-sale': 'Amador',
        'https://www.loopnet.com/search/apartment-buildings/san-joaquin-county-ca/for-sale': 'San-Joaquin',
        'https://www.loopnet.com/search/apartment-buildings/solano-county-ca/for-sale': 'Solano',
        'https://www.loopnet.com/search/apartment-buildings/contra-costa-county-ca/for-sale': 'Contra-Costa',
        'https://www.loopnet.com/search/apartment-buildings/yolo-county-ca/for-sale': 'Yolo',
        'https://www.loopnet.com/search/apartment-buildings/placer-county-ca/for-sale': 'Placer',
        'https://www.loopnet.com/search/apartment-buildings/san-diego-county-ca/for-sale': 'San-Diego',
        'https://www.loopnet.com/search/apartment-buildings/orange-county-ca/for-sale': 'Orange',
        'https://www.loopnet.com/search/apartment-buildings/riverside-county-ca/for-sale': 'Riverside',
        'https://www.loopnet.com/search/apartment-buildings/imperial-county-ca/for-sale': 'Imperial'
    }

    for k, v in urls.items():
        getstart(k, v, 1)

    # quit() shuts down the browser and the driver process; close() only closes the current window
    driver.quit()
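
One small design note: link not in alreadyscrapped is a linear scan over a list, which gets slow once the CSV holds thousands of rows. A set gives the same dedup bookkeeping with constant-time membership tests; a sketch:

import csv

# Same dedup bookkeeping with O(1) membership tests
alreadyscrapped = set()
with open("loopnet.csv", "r") as r:
    for line in csv.reader(r):
        if len(line) > 2:
            alreadyscrapped.add(line[2])

# ...and inside getstart(), replace append() with:
#   alreadyscrapped.add(link)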

Running the Script

  1. Install Required Libraries: Run the script once; if the libraries are missing, it will attempt to install them automatically before retrying the imports.
  2. Set Up Web Driver: Make sure geckodriver is on your PATH (Linux) or sits next to the script on other systems (see the version caveat after this list).
  3. Execute the Script: Run the script in your Python environment. It will walk through the county search URLs, extract the listing details, and append them to a CSV file named loopnet.csv.
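
A caveat on step 2: the executable_path argument used in the script is Selenium 3 style. Selenium 4 removed it in favor of a Service object, so on a current Selenium install the driver setup would look like this instead:

import os
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Selenium 4 style: the geckodriver path is passed via a Service object
service = Service(executable_path=os.getcwd() + '/geckodriver')
driver = webdriver.Firefox(service=service)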

Conclusion

This script provides a working foundation for scraping real estate listings from LoopNet. By combining Selenium for page loads with requests and parsel for extraction, it automates the process of collecting and storing data, saving you time and effort. Whether you're gathering data for analysis, research, or business purposes, it is a solid starting point for your web scraping needs.

Feel free to customize the script to suit your specific requirements, such as scraping different types of properties or targeting different regions. Happy scraping!
