Web scraping is a powerful technique for extracting information from websites. In this blog post, we will walk through a Python script that scrapes real estate listings from Zillow. The script runs on both Windows and Unix-like systems (Linux, macOS), handling the installation of necessary packages and managing its HTTP session efficiently.
Overview of the Script
The script performs the following main tasks:
- Package Installation: Installs necessary Python packages if they are not already installed.
- Session Setup: Configures a session to mimic a web browser, ensuring successful communication with Zillow.
- Data Scraping: Extracts real estate listing information and saves it to a CSV file.
- Pagination Handling: Iterates through multiple pages of listings to gather comprehensive data.
Detailed Breakdown
1. Package Installation
The script first checks whether the required packages are importable and installs any that are missing, adjusting the install command to the platform it is running on so the same code works across operating systems, as sketched below.
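A minimal sketch of this step, assuming requests is the dependency being checked (substitute whatever packages your copy of the script needs):

import importlib.util
import subprocess
import sys

def ensure_package(package):
    # Install the package with pip only if it is not already importable.
    if importlib.util.find_spec(package) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

ensure_package("requests")

Invoking pip as sys.executable -m pip points at the interpreter actually running the script, which is what makes this step portable across Windows and Unix-like systems.

2. Session Setup

Next, the script configures a requests session whose headers mimic a desktop Firefox browser, so its traffic looks like an ordinary browser session to Zillow.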
import requests

s = requests.Session()
s.headers.update(
    {
        "Host": "www.zillow.com",
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0",
        "Accept": "*/*",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
    }
)
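Because a single Session object is reused for every request, any cookies Zillow sets persist across pages, and each call automatically carries the browser-like headers above.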
3. Data Scraping
The startscraping function handles the scraping process. It decodes the searchQueryState JSON from the listing URL, sends a request to Zillow's GetSearchPageState.htm endpoint, and processes the response to extract the relevant listing information.
import csv
import json
import time
from urllib.parse import unquote, urlparse

def startscraping(page, url, reqid):
    # Pull the searchQueryState JSON out of the listing URL's query string.
    jdata = json.loads(urlparse(unquote(url)).query.split('=', 1)[-1])
    jdata['pagination'] = {"currentPage": page}
    # Reuse the shared session configured above. The endpoint expects
    # JSON-encoded parameter values, so json.dumps is used rather than str().
    r = s.get(
        "https://www.zillow.com/search/GetSearchPageState.htm",
        params={
            "searchQueryState": json.dumps(jdata),
            "wants": json.dumps({"cat1": ["listResults", "mapResults"], "cat2": ["total"]}),
            "requestId": str(reqid),
        },
        headers={
            "Alt-Used": "www.zillow.com",
        },
    )
    response = r.json()
    jsondata = response['cat1']['searchResults']['listResults']
    total_pages = response['cat1']['searchList']['totalPages']
    if total_pages < page:
        jsondata = []  # Past the last page: nothing to process, recursion stops.
    for data in jsondata:
        try:
            zpid = data['hdpData']['homeInfo']['zpid']
        except (KeyError, TypeError):
            zpid = ''
        # ... [Additional data extraction logic] ...
        if zpid not in already_scraped:
            already_scraped.append(zpid)
            with open("zillow.csv", "a", newline="") as f:
                writer = csv.writer(f)
                writer.writerow([zpid, url, status, price, full_address, street, city, state, zipcode, beds, baths, area, days_on_zillow, broker_name, agent_name, sale_by, taxed_access_value, home_type, home_status, latitude, longitude])
            print([zpid, url, status, price, full_address, street, city, state, zipcode, beds, baths, area, days_on_zillow, broker_name, agent_name, sale_by, taxed_access_value, home_type, home_status, latitude, longitude])
            time.sleep(1)  # Small pause between rows.
        else:
            print("Exists...")
    if jsondata:
        time.sleep(10)  # Longer pause between pages before fetching the next one.
        startscraping(page + 1, url, reqid)
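One caveat worth noting: startscraping calls itself once per page, so a search with an unusually large number of result pages could eventually hit Python's recursion limit (1,000 frames by default). A minimal iterative sketch of the same pagination, assuming a hypothetical scrape_page helper that wraps the per-page body above:

import time

def scrape_all_pages(url, reqid):
    # scrape_page is a hypothetical helper containing the per-page logic
    # above; it is assumed to return an empty list once past the last page.
    page = 1
    while True:
        results = scrape_page(page, url, reqid)
        if not results:
            break
        time.sleep(10)  # Same courtesy delay between pages as the original.
        page += 1

Either approach works for typical searches; the loop simply removes the depth limit.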
4. Running the Scraper
The script starts scraping from the first page of a given Zillow search URL. It initializes the already_scraped list and the reqid value, then calls the startscraping function.
if __name__ == '__main__':
    url = "https://www.zillow.com/new-york-ny/rentals/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22mapBounds%22%3A%7B%22north%22%3A41.10215308856136%2C%22east%22%3A-73.1763057558594%2C%22south%22%3A40.291081355840575%2C%22west%22%3A-74.78305624414065%7D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22usersSearchTerm%22%3A%22New%20York%20NY%22%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A6181%2C%22regionType%22%3A6%7D%5D%7D"
    already_scraped = []
    reqid = 1
    startscraping(1, url, reqid)
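One practical addition: the script appends rows to zillow.csv without ever writing a header, so the column order must be known in advance. A small sketch, assuming it runs once before the first scrape, would create the file with a header row whose names mirror the writerow call above:

import csv
import os

# Write the header once if the output file does not exist yet.
if not os.path.exists("zillow.csv"):
    with open("zillow.csv", "w", newline="") as f:
        csv.writer(f).writerow([
            "zpid", "url", "status", "price", "full_address", "street", "city",
            "state", "zipcode", "beds", "baths", "area", "days_on_zillow",
            "broker_name", "agent_name", "sale_by", "taxed_access_value",
            "home_type", "home_status", "latitude", "longitude",
        ])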
Conclusion
This script provides a robust solution for scraping real estate listings from Zillow. By handling pagination and data extraction efficiently, it collects comprehensive data and saves it to a CSV file for further analysis. This guide should help you understand the key components of the script and how to adapt it for your own web scraping projects. Happy scraping!