How to Build an Email ID Extractor

Web scraping has become an essential skill for developers looking to extract useful data from websites. Scrapy, an open-source web crawling framework written in Python, is a powerful tool for this purpose. In this guide, we’ll walk you through building an email ID extractor with Scrapy that pulls email addresses from selected pages of a website.

Why Scrapy?

Scrapy isn’t just a scraping library; it’s a robust framework designed for web crawling. It can handle requests, parse HTML, and follow links to crawl entire websites. Combined with regular expressions, Scrapy is especially effective for tasks like email extraction.
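At its core, the extraction step is simply a regular expression applied to page text. Here is a minimal standalone sketch of that idea; the sample string is made up purely for illustration and is not part of the project code:

Python
import re

# Illustrative sample text; in the spider this text comes from the rendered page source
sample = "Reach us at info@example.com or first.last@sales.example.in for details."

# A simple email pattern: local part, '@', then a dotted domain
print(re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', sample))
# ['info@example.com', 'first.last@sales.example.in']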

Prerequisites

Before you begin, make sure Python is installed on your system. You should also install Scrapy and Scrapy-Selenium, which is needed to handle pages that load content dynamically.

Shell
pip install scrapy
pip install scrapy-selenium

Step 1: Setting Up the Scrapy Project

Run the following command to create a new Scrapy project.

Shell
scrapy startproject geeksemailtrack

Navigate to your project directory:

Shell
cd geeksemailtrack

Then, generate a spider:

Shell
scrapy genspider emailtrack https://www.geeksforgeeks.org/

Step 2: Configure the project

Update the settings.py file to integrate Scrapy-Selenium. Selenium is used here because it can render and interact with dynamically loaded content on web pages.

Python
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')  # uses the driver found on PATH, if any
SELENIUM_DRIVER_ARGUMENTS = []  # e.g. ['--headless'] to run Chrome without a window
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

Download the ChromeDriver build that matches your installed version of Chrome and place it next to your scrapy.cfg file.
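If which('chromedriver') comes back as None because the driver is not on your PATH, you can point the setting at the copy sitting next to scrapy.cfg instead. A minimal sketch, assuming the file is named chromedriver (chromedriver.exe on Windows) and lives in the project root:

Python
import os

# settings.py sits one level below scrapy.cfg, so step up to the project root
PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
SELENIUM_DRIVER_EXECUTABLE_PATH = os.path.join(PROJECT_ROOT, 'chromedriver')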

Step 3: Writing the Spider

Add the necessary imports at the top of your emailtrack.py spider file:

Python
import scrapy
import re
from scrapy_selenium import SeleniumRequest
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

Define your spider class and issue the first request to the target site. The parsing methods shown in the following steps also belong inside this class:

Python
class EmailtrackSpider(scrapy.Spider):
    name = 'emailtrack'
    uniqueemail = set()  # collects every unique address found across pages

    def start_requests(self):
        # Render the start page with Selenium so dynamically loaded content is included
        yield SeleniumRequest(
            url="https://www.geeksforgeeks.org/",
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )

Step 4: Parsing the Links

The parse method extracts every link from the start page and filters them down to pages where emails are usually published, such as “Contact” or “About” pages. It then requests the first of these; the remaining URLs are passed along in meta so parse_link can work through them one page at a time.

Python
    def parse(self, response):
        # Extract every link on the start page
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        all_urls = [str(link.url) for link in links]

        # Keep only pages likely to list emails, plus the start page itself
        urls = [url for url in all_urls if 'contact' in url.lower() or 'about' in url.lower()]
        urls.append(str(response.url))

        # Request the first URL; parse_link works through the rest via meta
        yield SeleniumRequest(
            url=urls.pop(0),
            wait_time=3,
            screenshot=True,
            callback=self.parse_link,
            dont_filter=True,
            meta={'links': urls}
        )

Step 5: Extracting Emails

The parse_link method extracts email addresses from each page using a regular expression, then moves on to the next page in the list.

Python
    def parse_link(self, response):
        links = response.meta['links']
        # Skip social media and wiki pages, which rarely list useful addresses
        bad_words = ['facebook', 'instagram', 'youtube', 'twitter', 'wiki', 'linkedin']

        if not any(bad_word in response.url for bad_word in bad_words):
            # Allow dots, plus signs and hyphens so addresses like first.last@example.co.in stay intact
            email_list = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', response.text)
            self.uniqueemail.update(email_list)

        if links:
            # Move on to the next page in the list
            yield SeleniumRequest(
                url=links.pop(0),
                callback=self.parse_link,
                dont_filter=True,
                meta={'links': links}
            )
        else:
            # Nothing left to visit: re-request the current page once more to trigger parsed()
            yield SeleniumRequest(
                url=response.url,
                callback=self.parsed,
                dont_filter=True
            )

Step 6: Displaying the Results

Finally, the parsed method filters out matches that are unlikely to be real addresses, keeping only those with .com or .in domains, and prints the result.

Python
    def parsed(self, response):
        # Keep only addresses containing a common top-level domain (.com or .in) and print them
        finalemail = [email for email in self.uniqueemail if '.com' in email or '.in' in email]
        print("Emails scraped:", finalemail)

Step 7: Running the Spider

Run the spider using the following command in your terminal.

Shell
scrapy crawl emailtrack

This will start the crawl, and your terminal will display the emails extracted from the specified pages.

Conclusion

This email extractor project is a simple but effective demonstration of Scrapy’s capabilities. Using Python and Scrapy, you can automate the extraction of emails and other data from websites, saving time and effort. This project is a great addition to any developer’s portfolio and can be extended to handle more complex web scraping tasks.
