Web scraping allows us to extract data from websites in an automated fashion. In this blog post, we’ll explore how to scrape detailed university information from Gotouniversity using Scrapy, a powerful Python framework. The script will gather various details about universities and save them into a CSV file.
Introduction to Web Scraping with Scrapy
Scrapy is a versatile and efficient web scraping framework for Python. It provides a comprehensive set of tools to extract data from web pages and process it as needed. In this guide, we’ll create a Scrapy spider that navigates Gotouniversity, extracts university details, and stores them in a CSV file.
Prerequisites
Before getting started, ensure you have Python and Scrapy installed. You can install Scrapy using pip:
pip install scrapy
You’ll also need the html2text library for converting HTML to plain text:
pip install html2text
The csv module is part of Python’s standard library, so no additional installation is needed.
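To confirm that the installation worked, you can print the installed Scrapy version:
scrapy version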
Script Overview
The script defines a Scrapy spider that extracts university details from Gotouniversity, such as the university name, location, student support, and contact information.
Imports and Initial Setup
First, we import the necessary libraries and configure the HTML-to-text converter, telling it to skip images and links:
import scrapy
import csv
import os
import html2text
h = html2text.HTML2Text()
h.ignore_images = True
h.ignore_links = True
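As a quick illustration with a hypothetical HTML fragment, the converter strips the markup and, thanks to the flags above, drops image tags and link URLs:
print(h.handle('<p>Visit <a href="https://example.com">our campus</a> today</p>'))
# -> "Visit our campus today" (plus trailing blank lines)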
Spider Definition
The spider class, GotouniversitySpider, handles the crawling and data extraction. The start_urls list points to the university listing page, although, as we’ll see below, the spider overrides start_requests and reads its URLs from a CSV file instead.
class GotouniversitySpider(scrapy.Spider):
    name = 'gotouniversity'
    allowed_domains = ['gotouniversity.com']
    start_urls = ['https://www.gotouniversity.com/university']
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0"
    }
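Setting a browser-like USER_AGENT makes the requests look like ordinary browser traffic, which some sites require before serving full pages. While developing, you can test your XPath expressions against the site interactively with the same setting (a hypothetical session):
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0" "https://www.gotouniversity.com/university"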
CSV Initialization
Before the spider starts, we initialize the CSV file with column headers. An os.path.exists check ensures the header row is written only once, even if the script is re-run (this is also why os is imported):
with open("overview.csv","a") as f:
writer = csv.writer(f)
writer.writerow(['url', 'university_name', 'location', 'about_university', 'infrastructure_and_services',
'top_landmark_companies_airport_nearby', 'student_support', 'visas_immigration_support',
'international_student_contact', 'student_life', 'deadlines', 'address', 'email', 'phone'])
Email Decoding
The deCFEmail method decodes email addresses that Cloudflare’s email protection obfuscates in the page source:
def deCFEmail(self, fp):
    try:
        # The first byte is the XOR key; every following byte is one
        # character of the address XORed with that key
        r = int(fp[:2], 16)
        email = ''.join([chr(int(fp[i:i+2], 16) ^ r) for i in range(2, len(fp), 2)])
        return email
    except ValueError:
        pass
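To see the scheme in action, here is a hypothetical data-cfemail value that decodes to info@example.com; the leading 42 is the XOR key, and each following hex pair is one character XORed with it:
fp = "422b2c242d02273a232f322e276c212d2f"
r = int(fp[:2], 16)  # key byte: 0x42
''.join(chr(int(fp[i:i+2], 16) ^ r) for i in range(2, len(fp), 2))  # -> 'info@example.com'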
Starting the Requests
The start_requests method reads relative university URLs from link.csv and initiates a request for each one:
def start_requests(self):
    with open("link.csv", "r") as r:
        reader = csv.reader(r)
        for line in reader:
            yield scrapy.Request(
                url="https://www.gotouniversity.com" + line[0],
                callback=self.getdata
            )
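The spider therefore expects a link.csv file in the working directory, with one relative university path in the first column of each row. A hypothetical example:
/university/harvard-university
/university/university-of-toronto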
Data Extraction
The getdata method extracts the details from each university’s page and appends them as one row to the CSV file:
def getdata(self, response):
    url = response.url
    university_name = response.xpath('.//*[@class="uni-name"]/text()').extract_first().strip()
    location = response.xpath('.//*[@class="univ-state"]/text()').extract_first().strip()
    # Each section below may be missing from a given page; h.handle()
    # raises on None, so every field falls back to an empty string
    try:
        about_university = h.handle(response.xpath('.//h2[contains(.,"History")]/.. | .//p[contains(.,"History")]/following-sibling::p/text() | .//h3[contains(.,"History")]/following-sibling::p/text()').extract_first())
    except:
        about_university = ''
    try:
        infrastructure_and_services = h.handle(response.xpath('.//h2[contains(.,"Infrastructure and Services")]/.. | .//p[contains(.,"Infrastructure and Services")]/following-sibling::p/text()').extract_first())
    except:
        infrastructure_and_services = ''
    try:
        top_landmark_companies_airport_nearby = h.handle(''.join(response.xpath('.//h3[contains(.,"Top Landmarks, Companies, and Airports nearby")]/following-sibling::p').extract()))
    except:
        top_landmark_companies_airport_nearby = ''
    try:
        student_support = h.handle(response.xpath('.//h3[contains(.,"Student Support")]/following-sibling::p').extract_first())
    except:
        student_support = ''
    try:
        visas_immigration_support = h.handle(response.xpath('.//h3[contains(.,"Visas and Immigration Support")]/following-sibling::p').extract_first())
    except:
        visas_immigration_support = ''
    try:
        international_student_contact = h.handle(response.xpath('.//h3[contains(.,"International Student Contact")]/following-sibling::p').extract_first())
    except:
        international_student_contact = ''
    try:
        student_life = h.handle(response.xpath('.//h2[contains(.,"Student Life")]/..').extract_first())
    except:
        student_life = ''
    try:
        deadlines = h.handle(''.join(response.xpath('.//h2[contains(.,"Deadlines")]/../following-sibling::p').extract()))
    except:
        deadlines = ''
    address = response.xpath('.//*[@class="contact-info-subblock visit"]/p/text()').extract_first()
    try:
        email = self.deCFEmail(response.xpath('.//*[@class="contact-info-subblock mail"]//@data-cfemail').extract_first())
    except:
        email = ''
    phone = response.xpath('.//*[@class="contact-info-subblock call"]/p/text()').extract_first()
    with open("overview.csv", "a", newline="") as f:
        writer = csv.writer(f)
        # Strip html2text's Markdown asterisks and surrounding whitespace
        writer.writerow([a.replace('*', '').strip() if a else a for a in [url, university_name, location, about_university,
                         infrastructure_and_services, top_landmark_companies_airport_nearby, student_support,
                         visas_immigration_support, international_student_contact, student_life, deadlines,
                         address, email, phone]])
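One detail worth explaining in the final write: html2text produces Markdown, so the extracted sections can carry * emphasis markers around bold or italic text. The replace('*', '').strip() pass removes them before the row is written:
'**Student Support**\n'.replace('*', '').strip()  # -> 'Student Support'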
Running the Spider
To run the spider, save the script as gotouniversity_spider.py and execute the following command in your terminal:
scrapy runspider gotouniversity_spider.py
This will start the scraping process, and the data will be saved into a file named overview.csv.
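As a design note, Scrapy’s built-in feed exports could replace the manual csv handling entirely: yield a dict from the callback instead of calling writer.writerow, for example (a minimal sketch showing only three of the fields):
yield {'url': url, 'university_name': university_name, 'location': location}
Running the spider with scrapy runspider gotouniversity_spider.py -o overview.csv would then let Scrapy create the CSV file, headers included.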
Conclusion
This tutorial demonstrated how to use Scrapy to scrape detailed university information from Gotouniversity. The script navigates through the site, extracts various details about universities, and stores the data in a CSV file. This automated approach can significantly streamline the process of gathering university data for research or analysis.