Web scraping has become an essential skill for developers looking to extract useful data from websites. Scrapy, an open source web crawling framework written in Python, is a powerful tool for this purpose. In this guide, we’ll walk you through creating an email ID extractor using Scrapy that pulls email addresses from selected pages of a website.
Why Scrapy?
Scrapy isn’t just a scraping library; it’s a robust framework designed for web crawling. It can manage requests, parse HTML, and follow links to crawl entire websites. Combined with regular expressions, Scrapy is especially effective for tasks like email extraction.
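To see the regular-expression side on its own, here is a minimal sketch. The sample HTML string is made up for illustration; the spider later in this guide applies the same simple pattern to real page source.
import re

# Placeholder HTML standing in for a downloaded page
html = '<p>Contact us at support@example.com or sales@example.org</p>'

# The same simple pattern used by the spider below
emails = re.findall(r'\w+@\w+\.\w+', html)
print(emails)  # ['support@example.com', 'sales@example.org']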
Prerequisites
Before you begin, make sure you have Python installed on your system. You should also install Scrapy along with Scrapy-Selenium, which is needed to handle dynamically rendered content.
pip install scrapy
pip install scrapy-selenium
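As an optional sanity check that both packages installed correctly, the following imports should succeed without errors:
# Optional check: both packages are importable
import scrapy
import scrapy_selenium

print(scrapy.__version__)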
Step 1: Setting Up the Scrapy Project
Run the following command to create a new Scrapy project.
scrapy startproject geeksemailtrack
Navigate to your project directory:
cd geeksemailtrack
Then, generate a spider inside the project:
scrapy genspider emailtrack geeksforgeeks.org
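After these commands, the project layout should look roughly like this (the exact template can vary slightly between Scrapy versions):
geeksemailtrack/
    scrapy.cfg
    geeksemailtrack/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            emailtrack.py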
Step 2: Configure the project
Update the settings.py file to integrate Scrapy-Selenium. Selenium is used here because it can render pages whose content is generated dynamically by JavaScript.
from shutil import which

# Drive Chrome through Selenium; the driver binary is looked up on the PATH
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = []

# Route requests through scrapy-selenium's downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
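If you don’t need to watch the browser window, you can run Chrome headless by passing the standard '--headless' flag through the driver arguments (optional):
SELENIUM_DRIVER_ARGUMENTS = ['--headless']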
Download the ChromeDriver build that matches your version of Chrome and either place it somewhere on your PATH (so which('chromedriver') can find it) or drop it next to your scrapy.cfg file and point SELENIUM_DRIVER_EXECUTABLE_PATH at that location directly.
Step 3: Writing the Spider
Add the necessary imports at the top of your emailtrack.py spider file:
import scrapy
import re
from scrapy_selenium import SeleniumRequest
from scrapy.linkextractors import LxmlLinkExtractor
Define the spider class and issue the first request to the target site:
class EmailtrackSpider(scrapy.Spider):
    name = 'emailtrack'

    # Collect emails in a set so duplicates are discarded automatically
    uniqueemail = set()

    def start_requests(self):
        # Render the start page through Selenium so JavaScript content is loaded
        yield SeleniumRequest(
            url="https://www.geeksforgeeks.org/",
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )
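The screenshot=True argument is optional for email extraction. According to scrapy-selenium’s documentation, the middleware attaches the rendered page’s PNG bytes to response.meta['screenshot'], so inside any callback you could persist it like this (the filename is arbitrary):
# Optional: save the screenshot taken by the Selenium middleware
with open('homepage.png', 'wb') as image_file:
    image_file.write(response.meta['screenshot'])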
Step 4: Parsing the Links
The parse method extracts every link from the landing page and filters them down to pages likely to contain email addresses, such as “Contact” or “About” pages, before scheduling a request for each one.
    def parse(self, response):
        # Pull every link from the rendered page
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        Finallinks = [str(link.url) for link in links]

        # Keep only pages likely to list email addresses
        links = [
            link for link in Finallinks
            if 'contact' in link.lower() or 'about' in link.lower()
        ]

        # Also scan the page we are already on
        links.append(str(response.url))

        for link in links:
            yield SeleniumRequest(
                url=link,
                wait_time=3,
                screenshot=True,
                callback=self.parse_link,
                dont_filter=True,
                meta={'links': links}
            )
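If you want to see what extract_links returns on its own, here is a small standalone check; the example.com URL and HTML string are made up for illustration:
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LxmlLinkExtractor

# A made-up response standing in for a downloaded page
html = b'<html><body><a href="/about-us/">About</a> <a href="/blog/">Blog</a></body></html>'
response = HtmlResponse(url='https://example.com/', body=html, encoding='utf-8')

links = LxmlLinkExtractor(allow=()).extract_links(response)
print([link.url for link in links])
# ['https://example.com/about-us/', 'https://example.com/blog/']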
Step 5: Extracting Emails
The parse_link method skips social-media URLs, pulls email addresses out of the page source with a regular expression, and then follows the next link passed along in the request meta.
    def parse_link(self, response):
        links = response.meta['links']

        # Skip social-media pages, which rarely expose useful addresses
        bad_words = ['facebook', 'instagram', 'youtube', 'twitter', 'wiki', 'linkedin']
        if not any(bad_word in response.url for bad_word in bad_words):
            # Simple pattern: word characters, '@', domain, dot, TLD
            email_list = re.findall(r'\w+@\w+\.\w+', response.text)
            self.uniqueemail.update(email_list)

        if links:
            # Keep working through the queue of links passed via meta
            yield SeleniumRequest(
                url=links.pop(0),
                callback=self.parse_link,
                dont_filter=True,
                meta={'links': links}
            )
        else:
            # Queue exhausted: hand off to the final callback
            yield SeleniumRequest(
                url=response.url,
                callback=self.parsed,
                dont_filter=True
            )
Step 6: Displaying the Results
Finally, the parsed method filters the collected addresses, keeping only those that contain a common top-level domain ('.com' or '.in'), and prints the result.
    def parsed(self, response):
        # Keep only addresses with a plausible TLD ('.com' or '.in')
        finalemail = [email for email in self.uniqueemail if '.com' in email or '.in' in email]
        print("Emails scraped:", finalemail)
Step 7: Running the Spider
Run the spider using the following command in your terminal, from the project directory.
scrapy crawl emailtrack
This will start the crawl, and your terminal will display the email addresses scraped from the specified pages.
Conclusion
This email extractor project is a simple but effective demonstration of Scrapy’s capabilities. Using Python and Scrapy, you can automate the extraction of emails and other data from websites, saving time and effort. This project is a great addition to any developer’s portfolio and can be extended to handle more complex web scraping tasks.