Web scraping can be a powerful tool to gather data from websites, and Facebook pages contain a wealth of information that can be useful for various purposes. In this blog, we will walk you through creating a Facebook page detail scraper using Python, Selenium, and Scrapy. This scraper will extract details such as name, location, website, phone, and email from Facebook pages and save them to a CSV file.
Prerequisites
Before we dive into the code, make sure you have the following installed:
- Python: Ensure you have Python installed. You can download it from python.org.
- Selenium: For automating web browser interactions. Install it using pip:
pip install selenium
- Scrapy: For extracting data from web pages. Install it using pip:
pip install scrapy
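To confirm that both libraries installed correctly, a quick import check from Python helps:
import scrapy
import selenium

# Print the installed versions; any ImportError here means the
# corresponding pip install did not succeed.
print("selenium", selenium.__version__)
print("scrapy", scrapy.__version__)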
Code Overview
The following script logs into Facebook, visits each page URL listed in an input CSV file, and extracts details such as name, location, website, phone, and email. The extracted data is then saved to a separate CSV file.
import csv
import os
import time

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By

if __name__ == '__main__':
    # File name to save scraped data
    filename = "leander_workshop_google"

    # Create the output CSV file and write headers if it doesn't exist
    if filename + ".csv" not in os.listdir(os.getcwd()):
        with open(filename + ".csv", "a", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(['name', 'location', 'website', 'phone', 'email'])

    # Initialize the Selenium WebDriver (using Firefox)
    driver = webdriver.Firefox()

    # Log into Facebook (Selenium 4 replaced find_element_by_* with find_element)
    driver.get("https://www.facebook.com/login")
    driver.find_element(By.ID, "email").send_keys("[email protected]")
    driver.find_element(By.ID, "pass").send_keys("your_password")
    driver.find_element(By.ID, "loginbutton").click()

    # Read Facebook page URLs from the input CSV (e.g. "leander_workshop.csv")
    with open(filename.replace('_google', '') + ".csv", "r", newline="") as r:
        reader = csv.reader(r)
        for line in reader:
            try:
                # Navigate to the Facebook page
                driver.get(line[0])
                time.sleep(2.5)  # Wait for the page to load

                # Parse the rendered page source with a Scrapy selector
                response = scrapy.Selector(text=driver.page_source)

                try:
                    name = response.xpath('.//title/text()').extract_first().replace(' | Facebook', '')
                except Exception:
                    name = ''

                try:
                    # NOTE: these class-based selectors are tied to Facebook's
                    # auto-generated class names and break when the markup changes
                    location = ''.join([i.strip() for i in scrapy.Selector(text=response.xpath(
                        './/*[@class="l9j0dhe7 dhix69tm wkznzc2l p5pk11vy o9dbymsk j83agx80 kzx2olss aot14ch1 p86d2i9g beltcj47 m8zidbmv ccq6eem2 ellw4o9j kzizifcz g6srhlxm"]').extract_first()).xpath(
                        './/text()').extract() if i.strip()])
                except Exception:
                    location = ''

                website = response.xpath(
                    './/*[@class="oajrlxb2 g5ia77u1 qu0x051f esr5mh6w e9989ue4 r7d6kgcz rq0escxv nhd2j8a9 nc684nl6 p7hjln8o kvgmc6g5 cxmmr5t8 oygrvhab hcukyx3x jb3vyjys rz4wbd8a qt6c0cv9 a8nywdso i1ao9s8h esuyzwwr f1sip0of lzcic4wl py34i1dx gpro0wi8"]/text()').extract_first()
                phone = response.xpath(
                    './/*[@class="d2edcug0 hpfvmrgz qv66sw1b c1et5uql b0tq1wua jq4qci2q a3bd9o3v knj5qynh oo9gr5id"][contains(.,"+")]/text()').extract_first()
                # Text of any link whose href is a mailto: address
                email = response.xpath('.//a[contains(@href, "mailto")]/text()').extract_first()

                # Write the extracted data to the CSV file
                if name and "Facebook" not in name:
                    with open(filename + ".csv", "a", newline="") as f:
                        writer = csv.writer(f)
                        writer.writerow([name, location, website, phone, email])
                    print([name, location, website, phone, email])
            except Exception:
                # Skip pages that fail to load or parse
                pass

    # End the browser session (quit closes the browser entirely)
    driver.quit()
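One practical tweak: when scraping many pages, you may not want a visible browser window. Below is a minimal sketch of starting Firefox headless with Selenium's Options API; it is a drop-in replacement for the webdriver.Firefox() line above.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")  # run Firefox without a visible window
driver = webdriver.Firefox(options=options)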
Step-by-Step Explanation
- Initialization:
  - The script starts by setting the filename for the output CSV file.
  - If the CSV file does not exist, it creates one and writes the header row.
- Selenium WebDriver:
  - The Selenium WebDriver is initialized using Firefox.
  - The script navigates to the Facebook login page and logs in with the provided credentials (see the credentials sketch after this list for a safer way to supply them).
- Reading Facebook Page URLs:
  - The script reads a list of Facebook page URLs from an existing input CSV file.
- Extracting Page Details:
  - For each URL, the script navigates to the page, waits for it to load, and grabs the rendered page source (a more robust alternative to the fixed sleep is shown in the explicit-wait sketch after this list).
  - Scrapy is used to parse the page source and extract details such as name, location, website, phone, and email.
- Saving Extracted Data:
  - The extracted data is appended to the output CSV file if the name looks valid (non-empty and not a generic "Facebook" title).
- Closing the WebDriver:
  - Finally, the browser session is ended with quit().
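Hard-coding credentials in the script is fine for a quick test, but reading them from environment variables keeps them out of version control. A minimal sketch; the FB_EMAIL and FB_PASSWORD variable names are placeholders chosen for this example, not anything Facebook or Selenium requires:
import os

from selenium import webdriver
from selenium.webdriver.common.by import By

# FB_EMAIL and FB_PASSWORD are placeholder names for this sketch;
# set them in your shell before running the script.
email = os.environ["FB_EMAIL"]
password = os.environ["FB_PASSWORD"]

driver = webdriver.Firefox()
driver.get("https://www.facebook.com/login")
driver.find_element(By.ID, "email").send_keys(email)
driver.find_element(By.ID, "pass").send_keys(password)
driver.find_element(By.ID, "loginbutton").click()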
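The fixed time.sleep(2.5) is simple but fragile: slow pages may not finish rendering in time, while fast ones waste it. Selenium's explicit waits block only as long as needed. A minimal sketch, assuming an h1 heading is a reasonable signal that the page has rendered; adjust the locator to whatever element reliably appears on the pages you target:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a heading to appear instead of sleeping
# for a fixed 2.5 seconds; the //h1 locator is an assumption and may
# need replacing with a more reliable readiness signal.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//h1"))
)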
Conclusion
This script demonstrates how to automate the extraction of details from Facebook pages using Python, Selenium, and Scrapy. It can be extended and customized to suit different needs, such as extracting additional details or handling more complex page structures. Remember to handle exceptions and errors gracefully, and ensure compliance with Facebook’s terms of service when scraping data.