Web scraping allows us to extract data from websites in an automated fashion. In this blog post, we’ll explore how to scrape detailed university information from Gotouniversity using Scrapy, a powerful Python framework. The script will gather various details about universities and save them into a CSV file.
Introduction to Web Scraping with Scrapy
Scrapy is a versatile and efficient web scraping framework for Python. It provides a comprehensive set of tools to extract data from web pages and process it as needed. In this guide, we’ll create a Scrapy spider that navigates Gotouniversity, extracts university details, and stores them in a CSV file.
Prerequisites
Before getting started, ensure you have Python and Scrapy installed. You can install Scrapy using pip:
pip install scrapy
You’ll also need the html2text library for converting HTML to plain text:
pip install html2text
The csv module is part of Python’s standard library, so no additional installation is needed.
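To confirm that the installation worked, you can print the installed Scrapy version:
scrapy version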
Script Overview
The script defines a Scrapy spider that extracts university details from Gotouniversity, such as the university name, location, student support, and contact information.
Imports and Initial Setup
First, we import the necessary libraries and configure the HTML-to-text converter, telling it to skip images and links:
import scrapy
import csv
import os
import html2text
h = html2text.HTML2Text()
h.ignore_images = True
h.ignore_links = True
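As a quick illustration with a hypothetical HTML fragment, the converter strips the markup and, thanks to the flags above, drops image tags and link URLs:
print(h.handle('<p>Visit <a href="https://example.com">our campus</a> today</p>'))
# -> "Visit our campus today" (plus trailing blank lines)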
Spider Definition
The spider class, GotouniversitySpider, handles the crawling and data extraction. The start_urls list points to the university listing page, although, as we’ll see below, the spider overrides start_requests and reads its URLs from a CSV file instead.
class GotouniversitySpider(scrapy.Spider):
    name = 'gotouniversity'
    allowed_domains = ['gotouniversity.com']
    start_urls = ['https://www.gotouniversity.com/university']
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0"
    }
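Setting a browser-like USER_AGENT makes the requests look like ordinary browser traffic, which some sites require before serving full pages. While developing, you can test your XPath expressions against the site interactively with the same setting (a hypothetical session):
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0" "https://www.gotouniversity.com/university"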
CSV Initialization
Before the spider starts, we initialize the CSV file with column headers. An os.path.exists check ensures the header row is written only once, even if the script is re-run (this is also why os is imported):
with open("overview.csv","a") as f:
writer = csv.writer(f)
writer.writerow(['url', 'university_name', 'location', 'about_university', 'infrastructure_and_services',
'top_landmark_companies_airport_nearby', 'student_support', 'visas_immigration_support',
'international_student_contact', 'student_life', 'deadlines', 'address', 'email', 'phone'])
Email Decoding
The deCFEmail method decodes email addresses that Cloudflare’s email protection obfuscates in the page source:
def deCFEmail(self, fp):
    try:
        # The first byte is the XOR key; every following byte is one
        # character of the address XORed with that key
        r = int(fp[:2], 16)
        email = ''.join([chr(int(fp[i:i+2], 16) ^ r) for i in range(2, len(fp), 2)])
        return email
    except ValueError:
        pass
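To see the scheme in action, here is a hypothetical data-cfemail value that decodes to info@example.com; the leading 42 is the XOR key, and each following hex pair is one character XORed with it:
fp = "422b2c242d02273a232f322e276c212d2f"
r = int(fp[:2], 16)  # key byte: 0x42
''.join(chr(int(fp[i:i+2], 16) ^ r) for i in range(2, len(fp), 2))  # -> 'info@example.com'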
Starting the Requests
The start_requests method reads relative university URLs from link.csv and initiates a request for each one:
def start_requests(self):
    with open("link.csv", "r") as r:
        reader = csv.reader(r)
        for line in reader:
            yield scrapy.Request(
                url="https://www.gotouniversity.com" + line[0],
                callback=self.getdata
            )
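The spider therefore expects a link.csv file in the working directory, with one relative university path in the first column of each row. A hypothetical example:
/university/harvard-university
/university/university-of-toronto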
Data Extraction
The getdata method extracts the details from each university’s page and appends them as one row to the CSV file:
def getdata(self, response):
    url = response.url
    university_name = response.xpath('.//*[@class="uni-name"]/text()').extract_first().strip()
    location = response.xpath('.//*[@class="univ-state"]/text()').extract_first().strip()
    # Each section below may be missing from a given page; h.handle()
    # raises on None, so every field falls back to an empty string
    try:
        about_university = h.handle(response.xpath('.//h2[contains(.,"History")]/.. | .//p[contains(.,"History")]/following-sibling::p/text() | .//h3[contains(.,"History")]/following-sibling::p/text()').extract_first())
    except:
        about_university = ''
    try:
        infrastructure_and_services = h.handle(response.xpath('.//h2[contains(.,"Infrastructure and Services")]/.. | .//p[contains(.,"Infrastructure and Services")]/following-sibling::p/text()').extract_first())
    except:
        infrastructure_and_services = ''
    try:
        top_landmark_companies_airport_nearby = h.handle(''.join(response.xpath('.//h3[contains(.,"Top Landmarks, Companies, and Airports nearby")]/following-sibling::p').extract()))
    except:
        top_landmark_companies_airport_nearby = ''
    try:
        student_support = h.handle(response.xpath('.//h3[contains(.,"Student Support")]/following-sibling::p').extract_first())
    except:
        student_support = ''
    try:
        visas_immigration_support = h.handle(response.xpath('.//h3[contains(.,"Visas and Immigration Support")]/following-sibling::p').extract_first())
    except:
        visas_immigration_support = ''
    try:
        international_student_contact = h.handle(response.xpath('.//h3[contains(.,"International Student Contact")]/following-sibling::p').extract_first())
    except:
        international_student_contact = ''
    try:
        student_life = h.handle(response.xpath('.//h2[contains(.,"Student Life")]/..').extract_first())
    except:
        student_life = ''
    try:
        deadlines = h.handle(''.join(response.xpath('.//h2[contains(.,"Deadlines")]/../following-sibling::p').extract()))
    except:
        deadlines = ''
    address = response.xpath('.//*[@class="contact-info-subblock visit"]/p/text()').extract_first()
    try:
        email = self.deCFEmail(response.xpath('.//*[@class="contact-info-subblock mail"]//@data-cfemail').extract_first())
    except:
        email = ''
    phone = response.xpath('.//*[@class="contact-info-subblock call"]/p/text()').extract_first()
    with open("overview.csv", "a", newline="") as f:
        writer = csv.writer(f)
        # Strip html2text's Markdown asterisks and surrounding whitespace
        writer.writerow([a.replace('*', '').strip() if a else a for a in [url, university_name, location, about_university,
                         infrastructure_and_services, top_landmark_companies_airport_nearby, student_support,
                         visas_immigration_support, international_student_contact, student_life, deadlines,
                         address, email, phone]])
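One detail worth explaining in the final write: html2text produces Markdown, so the extracted sections can carry * emphasis markers around bold or italic text. The replace('*', '').strip() pass removes them before the row is written:
'**Student Support**\n'.replace('*', '').strip()  # -> 'Student Support'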
Running the Spider
To run the spider, save the script as gotouniversity_spider.py and execute the following command in your terminal:
scrapy runspider gotouniversity_spider.py
This will start the scraping process, and the data will be saved into a file named overview.csv.
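As a design note, Scrapy’s built-in feed exports could replace the manual csv handling entirely: yield a dict from the callback instead of calling writer.writerow, for example (a minimal sketch showing only three of the fields):
yield {'url': url, 'university_name': university_name, 'location': location}
Running the spider with scrapy runspider gotouniversity_spider.py -o overview.csv would then let Scrapy create the CSV file, headers included.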
Conclusion
This tutorial demonstrated how to use Scrapy to scrape detailed university information from Gotouniversity. The script navigates through the site, extracts various details about universities, and stores the data in a CSV file. This automated approach can significantly streamline the process of gathering university data for research or analysis.