Web scraping is a powerful tool for gathering data from the web, and when it comes to real estate listings, it can be particularly useful for extracting information from sites like LoopNet. This blog post will walk you through a Python script designed to scrape apartment building listings from LoopNet, leveraging several libraries to handle HTTP requests, HTML parsing, and web automation.
Introduction to Web Scraping
Web scraping involves programmatically extracting data from websites. It’s a valuable skill for data analysts, researchers, and developers who need to collect large amounts of data without manual intervention. In this tutorial, we’ll explore a Python script that scrapes real estate listings from LoopNet, a prominent commercial real estate listing site.
Prerequisites
Before diving into the code, make sure you have Python installed on your system. The script will handle the installation of necessary libraries automatically, but it’s helpful to have a basic understanding of the following Python libraries:
- requests for making HTTP requests
- parsel for parsing HTML content
- selenium for automating web browsing
- undetected_chromedriver for bypassing bot detection
- fake_headers for generating realistic HTTP headers
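To see how three of these libraries fit together before reading the full script, here is a minimal sketch that generates realistic browser headers, fetches a page, and parses it with an XPath query. The URL is a stand-in, not one the script uses:

from fake_headers import Headers
import requests
from parsel import Selector

# Generate a realistic browser header set (User-Agent, Accept, etc.).
headers = Headers(browser="firefox", os="win").generate()

# Fetch a page and pull its <title> text with an XPath query.
html = requests.get("https://example.com", headers=headers).text
print(Selector(text=html).xpath("//title/text()").get())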
Script Overview
The script is designed to work on both Windows and Unix-based systems. It begins by checking the operating system so it can invoke the right interpreter name (python on Windows, python3 elsewhere) to install any libraries that are not already present.
import os
import platform

# Windows typically exposes the interpreter as "python";
# Unix-like systems as "python3".
interpreter = "python" if platform.system() == "Windows" else "python3"

try:
    import requests
    from parsel import Selector
    from selenium import webdriver
    import undetected_chromedriver as uc
    import fake_headers
except ImportError:
    # Install the missing dependencies, then retry the imports.
    for package in ("requests", "parsel", "selenium",
                    "undetected-chromedriver", "fake-headers"):
        os.system(f"{interpreter} -m pip install {package}")
    import requests
    from parsel import Selector
    from selenium import webdriver
    import undetected_chromedriver as uc
    import fake_headers
Detailed Script Breakdown
Imports and Initial Setup
With the dependencies available, the script imports everything it uses; the web driver itself is created later, in the main execution block. The script is written for Firefox but can be adapted to other browsers if needed.
# Standard library
import csv
import datetime
import json
import os
import platform
import time

# Third-party
import requests
from fake_headers import Headers
from parsel import Selector
from selenium import webdriver
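One quirk worth noting: the script installs and imports undetected_chromedriver but ultimately drives Firefox. If you would rather use Chrome with bot-detection evasion, a minimal sketch of the swap (assuming Chrome is installed locally) might look like this:

import undetected_chromedriver as uc

# uc patches ChromeDriver so common automation fingerprints are hidden.
driver = uc.Chrome()
driver.get("https://www.loopnet.com/")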
Function to Start Scraping
The getstart function navigates to a results page, extracts the listing links, scrapes each new listing's details, and appends them to a CSV file. It then calls itself with the next page number, recursing until a page returns no links.
def getstart(url, county, page):
    driver.get(url + '/' + str(page))
    time.sleep(5)
    # Listing links live in a JSON-LD block of type SearchResultsPage.
    try:
        jsondata = json.loads(Selector(text=driver.page_source).xpath(
            './/*[@type="application/ld+json"]/text()[contains(.,"SearchResultsPage")]').extract_first())
        links = [i['item']['url'] for i in jsondata['about']]
    except Exception:
        links = []
    if not links:
        # Pages with a single result expose 'about' as a dict, not a list.
        try:
            jsondata = json.loads(Selector(text=driver.page_source).xpath(
                './/*[@type="application/ld+json"]/text()[contains(.,"SearchResultsPage")]').extract_first())
            links = [jsondata['about']['url']]
        except Exception:
            links = []
    print(len(links))
    for link in links:
        if link not in already_scraped:
            already_scraped.append(link)
            date_time = datetime.datetime.now().strftime('%y-%m-%d')
            response = Selector(text=requests.get(link, headers=Headers().generate()).text)
            time.sleep(2)
            # Extract details
            try:
                title = ', '.join(Selector(text=response.xpath('.//*[@class="profile-hero-heading"]/h1').extract_first()).xpath('.//span/text()').extract())
            except Exception:
                title = ''
            try:
                location = ' '.join(response.xpath('.//*[@id="breadcrumb-section"]/h1/text()').extract())
            except Exception:
                location = ''
            price = response.xpath('.//td[contains(.,"Price")]/following-sibling::td/span/text()').extract_first()
            apartment_style = response.xpath('.//td[contains(.,"Apartment Style")]/following-sibling::td/span/text()').extract_first()
            price_per_unit = response.xpath('.//td[contains(.,"Price Per Unit")]/following-sibling::td/span/text()').extract_first()
            building_class = response.xpath('.//td[contains(.,"Building Class")]/following-sibling::td/span/text()').extract_first()
            sale_type = response.xpath('.//td[contains(.,"Sale Type")]/following-sibling::td/span/text()').extract_first()
            lot_size = response.xpath('.//td[contains(.,"Lot Size")]/following-sibling::td/span/text()').extract_first()
            cap_rate = response.xpath('.//td[contains(.,"Cap Rate")]/following-sibling::td/span/text()').extract_first()
            building_size = response.xpath('.//td[contains(.,"Building Size")]/following-sibling::td/span/text()').extract_first()
            sale_conditions = response.xpath('.//td[contains(.,"Sale Conditions")]/following-sibling::td/span/text()').extract_first()
            average_occupancy = response.xpath('.//td[contains(.,"Average Occupancy")]/following-sibling::td/span/text()').extract_first()
            no_units = response.xpath('.//td[contains(.,"No. Units")]/following-sibling::td/span/text()').extract_first()
            no_stories = response.xpath('.//td[contains(.,"No. Stories")]/following-sibling::td/span/text()').extract_first()
            property_type = response.xpath('.//td[contains(.,"Property Type")]/following-sibling::td/span/text()').extract_first()
            year_built = response.xpath('.//td[contains(.,"Year Built/Renovated")]/following-sibling::td/span/text()').extract_first()
            property_subtype = response.xpath('.//td[contains(.,"Property Subtype")]/following-sibling::td/span/text()').extract_first()
            # Save data to CSV
            row = [date_time, county, link, title, location, price, apartment_style, price_per_unit,
                   building_class, sale_type, lot_size, cap_rate, building_size, sale_conditions,
                   average_occupancy, no_units, no_stories, property_type, year_built, property_subtype]
            with open("loopnet.csv", "a", newline="", encoding="utf-8") as f:
                csv.writer(f).writerow(row)
            print(row)
        else:
            print("Exists ...")
    if links:
        # Recurse into the next results page until one comes back empty.
        getstart(url, county, page + 1)
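The sixteen detail lookups all follow one pattern: find the td whose text contains a label, then read the span in the sibling td. A small helper, hypothetical and not part of the original script, makes that pattern explicit and easier to extend; it assumes response is the parsel Selector built inside getstart:

def field(response, label):
    # Locate the label cell, then read the value span in the adjacent cell.
    return response.xpath(
        f'.//td[contains(., "{label}")]/following-sibling::td/span/text()'
    ).extract_first()

price = field(response, "Price")
cap_rate = field(response, "Cap Rate")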
Main Execution Block
The script creates loopnet.csv with a header row if it doesn't exist, loads the links already scraped so reruns skip them, and starts the web driver. It then iterates through a dictionary mapping search URLs to county names, calling the getstart function for each one.
if __name__ == '__main__':
    # Create the CSV with a header row on first run.
    if "loopnet.csv" not in os.listdir(os.getcwd()):
        with open("loopnet.csv", "a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(['date', 'county', 'link', 'title', 'location', 'price', 'apartment_style',
                             'price_per_unit', 'building_class', 'sale_type', 'lot_size', 'cap_rate',
                             'building_size', 'sale_conditions', 'average_occupancy', 'no_units',
                             'no_stories', 'property_type', 'year_built', 'property_subtype'])
    # Load previously scraped links so reruns skip them.
    already_scraped = []
    with open("loopnet.csv", "r") as r:
        for line in csv.reader(r):
            already_scraped.append(line[2])
    if platform.system() == "Linux":
        driver = webdriver.Firefox()
    else:
        driver = webdriver.Firefox(executable_path=os.getcwd() + '/geckodriver')
    urls = {
        'https://www.loopnet.com/search/apartment-buildings/los-angeles-county-ca/for-sale/': 'Los Angeles',
        'https://www.loopnet.com/search/apartment-buildings/fresno-county-ca/for-sale/': 'Fresno',
        'https://www.loopnet.com/search/apartment-buildings/kings-county-ca/for-sale': 'Kings',
        'https://www.loopnet.com/search/apartment-buildings/tulare-county-ca/for-sale': 'Tulare',
        'https://www.loopnet.com/search/apartment-buildings/madera-county-ca/for-sale/': 'Madera',
        'https://www.loopnet.com/search/apartment-buildings/monterey-county-ca/for-sale': 'Monterey',
        'https://www.loopnet.com/search/apartment-buildings/san-benito-county-ca/for-sale': 'San-Benito',
        'https://www.loopnet.com/search/apartment-buildings/kern-county-ca/for-sale': 'Kern',
        'https://www.loopnet.com/search/apartment-buildings/merced-county-ca/for-sale': 'Merced',
        'https://www.loopnet.com/search/apartment-buildings/sutter-county-ca/for-sale': 'Sutter',
        'https://www.loopnet.com/search/apartment-buildings/sacramento-county-ca/for-sale': 'Sacramento',
        'https://www.loopnet.com/search/apartment-buildings/el-dorado-county-ca/for-sale': 'El Dorado',
        'https://www.loopnet.com/search/apartment-buildings/amador-county-ca/for-sale': 'Amador',
        'https://www.loopnet.com/search/apartment-buildings/san-joaquin-county-ca/for-sale': 'San-Joaquin',
        'https://www.loopnet.com/search/apartment-buildings/solano-county-ca/for-sale': 'Solano',
        'https://www.loopnet.com/search/apartment-buildings/contra-costa-county-ca/for-sale': 'Contra-Costa',
        'https://www.loopnet.com/search/apartment-buildings/yolo-county-ca/for-sale': 'Yolo',
        'https://www.loopnet.com/search/apartment-buildings/placer-county-ca/for-sale': 'Placer',
        'https://www.loopnet.com/search/apartment-buildings/san-diego-county-ca/for-sale': 'San-Diego',
        'https://www.loopnet.com/search/apartment-buildings/orange-county-ca/for-sale': 'Orange',
        'https://www.loopnet.com/search/apartment-buildings/riverside-county-ca/for-sale': 'Riverside',
        'https://www.loopnet.com/search/apartment-buildings/imperial-county-ca/for-sale': 'Imperial'
    }
    for k, v in urls.items():
        getstart(k, v, 1)
    # quit() ends the geckodriver process as well as the browser window.
    driver.quit()
Running the Script
- Install Required Libraries: the script attempts to install any missing libraries automatically the first time it runs.
- Set Up Web Driver: make sure geckodriver (the Firefox driver) is on your PATH or in the working directory, as the script expects; if you run a recent Selenium release, see the note after this list.
- Execute the Script: run the script in your Python environment. It will walk through the county search URLs, extract the listing details, and append them to a CSV file named loopnet.csv.
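Newer Selenium 4 releases removed the executable_path argument used above. If your Selenium version rejects it, a sketch of the updated driver setup, assuming geckodriver sits in the current working directory as in the original script, would be:

import os
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Selenium 4 takes the driver path via a Service object instead.
service = Service(executable_path=os.getcwd() + "/geckodriver")
driver = webdriver.Firefox(service=service)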
Conclusion
This script provides a comprehensive solution for scraping real estate listings from LoopNet. By leveraging Python’s powerful libraries, it automates the process of collecting and storing data, saving you time and effort. Whether you’re gathering data for analysis, research, or business purposes, this script offers a solid foundation for your web scraping needs.
Feel free to customize the script to suit your specific requirements, such as scraping different types of properties or targeting different regions. Happy scraping!