Web scraping is a powerful way to automate data extraction from websites. In this post, we'll walk through a practical example that combines Selenium and Scrapy to scrape the Esango website, a UN site that lists non-governmental organizations (NGOs). We'll go through the code step by step, explaining how it works and how you can adapt it for your own web scraping projects.
Prerequisites
Before we begin, ensure you have the following installed:
- Python
- Selenium
- Scrapy
- Firefox Browser
- Geckodriver (for Firefox)
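Before diving in, it can save time to confirm the pieces are actually importable. Here is a quick sanity check (a minimal sketch, separate from the scraper itself):

```python
# Verify the prerequisites: library imports and a geckodriver binary.
import shutil
import scrapy
import selenium

print("Selenium:", selenium.__version__)
print("Scrapy:", scrapy.__version__)
# If this prints None, geckodriver isn't on PATH; pass its location
# explicitly to Service(), as the setup code below does.
print("geckodriver on PATH:", shutil.which("geckodriver"))
```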
Setting Up Selenium
First, we use Selenium to navigate through the pages of the Esango website and extract the links to NGO profiles. Here’s the initial setup and code to achieve this:
```python
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
import csv
import scrapy

# Point Selenium at the geckodriver binary (here, the snap-installed Firefox driver).
service = Service(executable_path='/snap/bin/firefox.geckodriver')
driver = webdriver.Firefox(service=service)

# Open the NGO listing on the Esango site.
driver.get("https://esango.un.org/civilsociety/withOutLogin.do?method=getOrgsByTypesCode&orgTypeCode=6&orgTypName=Non-governmental%20organization&ngoFlag=")

# The list view shows 25 organizations per page; step through it by index.
for i in range(0, 13000, 25):
    driver.get(f"https://esango.un.org/civilsociety/displayConsultativeStatusSearch.do?method=list&show=25&from=list&col=&order=&searchType=$searchType&index={i}")
    # Feed the rendered HTML to Scrapy's Selector and keep only profile links.
    links = scrapy.Selector(text=driver.page_source).xpath('.//td/a/@href[contains(.,"showProfileDetail")]').extract()
    for link in links:
        with open("links.csv", "a") as f:
            writer = csv.writer(f)
            writer.writerow([link])
        print(link)

driver.close()
```
Explanation:
- Setup Selenium: We configure Selenium to use Firefox via Geckodriver.
- Navigate Pages: We navigate through the Esango website pages that list NGOs, 25 at a time.
- Extract Links: Using Scrapy’s Selector, we extract the links to NGO profile details and save them in a CSV file.
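The loop above works, but it reopens `links.csv` once per link and parses each page without waiting for it to load. Below is a hardened sketch of the same idea; the 13,000-row upper bound comes from the original loop, while the 15-second timeout and the single shared file handle are our own assumptions to tune:

```python
import csv
import scrapy
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

service = Service(executable_path='/snap/bin/firefox.geckodriver')
driver = webdriver.Firefox(service=service)

# One file handle for the whole run instead of reopening per link.
with open("links.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for i in range(0, 13000, 25):
        driver.get(f"https://esango.un.org/civilsociety/displayConsultativeStatusSearch.do?method=list&show=25&from=list&col=&order=&searchType=$searchType&index={i}")
        try:
            # Wait (up to an assumed 15s) until profile links are present.
            WebDriverWait(driver, 15).until(
                lambda d: "showProfileDetail" in d.page_source
            )
        except TimeoutException:
            # No profile links appeared; assume we ran past the end of the list.
            break
        links = scrapy.Selector(text=driver.page_source).xpath(
            './/td/a/@href[contains(.,"showProfileDetail")]'
        ).getall()
        writer.writerows([link] for link in links)

# quit() ends the whole browser session, not just the current window.
driver.quit()
```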
Scraping NGO Details with Scrapy
Next, we use Scrapy to visit each extracted link and scrape detailed information about each NGO. We start by defining a Scrapy spider:
```python
from typing import Iterable
import csv
import os

import scrapy
from scrapy import Request

# Browser-like User-Agent sent with every request.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"
}


class EsangoSpider(scrapy.Spider):
    name = 'esango'
    allowed_domains = ['esango.un.org']
    start_urls = ['https://esango.un.org/civilsociety/displayConsultativeStatusSearch.do?method=list&show=25&from=list&col=&order=&searchType=$searchType&index=8575']

    # Create the output CSV with a header row on the first run.
    if "esango.csv" not in os.listdir(os.getcwd()):
        with open("esango.csv", "a") as f:
            writer = csv.writer(f)
            writer.writerow(['url', 'organization_name', 'full_address', 'address', 'city', 'country', 'phone', 'fax', 'email', 'website', 'area_of_field_expertise', 'mission_statement', 'number_type_member'])

    # Load the URLs already written to esango.csv so an interrupted run can resume.
    already_scrapped = []
    with open("esango.csv", "r") as r:
        reader = csv.reader(r)
        for line in reader:
            already_scrapped.append(line[0])

    def start_requests(self) -> Iterable[Request]:
        with open("links.csv", "r") as r:
            reader = csv.reader(r)
            for line in reader:
                url = "https://esango.un.org/civilsociety/" + line[0]
                if url not in self.already_scrapped:
                    yield scrapy.Request(
                        url=url,
                        callback=self.get_datas,
                        headers=headers,
                        dont_filter=True,
                        meta={'link': url}
                    )
                else:
                    self.logger.info("Already scraped: %s", url)

    def get_datas(self, response):
        organization_name = response.xpath('.//td[contains(.,"Organization\'s name:")]/following-sibling::td/text()').extract_first()
        addresses = [i.strip() if i else i for i in response.xpath('.//td[contains(.,"Address:")]/following-sibling::td/text()').extract()]
        full_address = ', '.join(addresses)
        # The address lines end with "street, city, country", so index from the end.
        country = addresses[-1] if len(addresses) >= 1 else ''
        city = addresses[-2] if len(addresses) >= 2 else ''
        address = addresses[-3] if len(addresses) >= 3 else ''
        phone = response.xpath('.//td[contains(.,"Phone:")]/following-sibling::td/text()').extract_first()
        fax = response.xpath('.//td[contains(.,"Fax:")]/following-sibling::td/text()').extract_first()
        email = response.xpath('.//td[contains(.,"Email:")]/following-sibling::td/a/text()').extract_first()
        website = response.xpath('.//td[contains(.,"Web site:")]/following-sibling::td/a/text()').extract_first()
        # The remaining fields live on the profile's third tab.
        yield scrapy.Request(
            response.url + '&tab=3',
            callback=self.get_data,
            headers=headers,
            meta={'line': [response.meta.get('link'), organization_name, full_address, address, city, country, phone, fax, email, website]}
        )

    def get_data(self, response):
        area_of_field_expertise = '\n'.join([i.strip() if i else i for i in response.xpath('.//td[contains(.,"Areas of expertise & Fields of activity:")]/following-sibling::td/li/text()').extract()])
        mission_statement = response.xpath('.//td[contains(.,"Mission statement:")]/following-sibling::td/label/text()').extract_first()
        number_type_member = response.xpath('.//td[contains(.,"Number and type of members:")]/following-sibling::td/text()').extract_first()
        with open("esango.csv", "a") as f:
            writer = csv.writer(f)
            writer.writerow(response.meta.get('line') + [area_of_field_expertise, mission_statement, number_type_member])
```
Explanation:
- Spider Initialization: When the class is defined, it creates `esango.csv` with a header row if the file doesn't exist yet, and loads the already-scraped links so an interrupted run can resume.
- Start Requests: It reads the links from the `links.csv` file and issues a request for each one that hasn't been scraped yet.
- Extract NGO Details: In the `get_datas` method, we extract details such as the NGO's name, address, phone, email, and website.
- Extract Additional Data: The `get_data` method handles the profile's third tab (areas of expertise, mission statement, and membership details) and appends everything to the CSV file.
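To run the spider, you can drop the class into a regular Scrapy project and launch it with `scrapy crawl esango`, or run it standalone. A minimal standalone runner might look like the sketch below; it assumes the spider class and the `headers` dict above are in scope, and the one-second download delay is our own choice, not a site requirement:

```python
from scrapy.crawler import CrawlerProcess

# Run EsangoSpider without a full Scrapy project.
process = CrawlerProcess(settings={
    "USER_AGENT": headers["User-Agent"],  # reuse the header defined with the spider
    "DOWNLOAD_DELAY": 1,                  # assumed polite delay; tune as needed
})
process.crawl(EsangoSpider)
process.start()  # blocks until the crawl finishes
```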
Conclusion
Combining Selenium and Scrapy leverages the strengths of both tools. Selenium handles dynamic content and pagination, while Scrapy excels at parsing and data extraction. This example demonstrates a robust approach to scraping structured data from complex websites. Use this as a foundation for your own web scraping projects, adapting and expanding as needed.
Happy scraping!