Web scraping has become an essential skill for developers looking to extract useful data from websites. Scrapy, an open source web crawling framework written in Python, is a powerful tool for this purpose. In this guide, we’ll walk you through creating an email ID extractor using Scrapy that pulls email addresses from selected pages of a website.
Why Scrapy?
Scrapy isn’t just a scraping library; it’s a robust framework designed for web crawling. It can manage requests, parse HTML, and follow links to crawl entire websites. Combined with regular expressions, Scrapy is especially effective for tasks like email extraction.
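To see the regular-expression side on its own, here is a minimal sketch. The sample HTML string is made up for illustration; the spider later in this guide applies the same simple pattern to real page source.
import re

# Placeholder HTML standing in for a downloaded page
html = '<p>Contact us at support@example.com or sales@example.org</p>'

# The same simple pattern used by the spider below
emails = re.findall(r'\w+@\w+\.\w+', html)
print(emails)  # ['support@example.com', 'sales@example.org']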
Prerequisites
Before you begin, make sure you have Python installed on your system. You should also install Scrapy along with Scrapy-Selenium, which is needed to handle dynamically rendered content.
pip install scrapy
pip install scrapy-selenium
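As an optional sanity check that both packages installed correctly, the following imports should succeed without errors:
# Optional check: both packages are importable
import scrapy
import scrapy_selenium

print(scrapy.__version__)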
Step 1: Setting Up the Scrapy Project
Run the following command to create a new Scrapy project.
scrapy startproject geeksemailtrack
Navigate to your project directory:
cd geeksemailtrack
Then, generate a spider inside the project:
scrapy genspider emailtrack geeksforgeeks.org
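After these commands, the project layout should look roughly like this (the exact template can vary slightly between Scrapy versions):
geeksemailtrack/
    scrapy.cfg
    geeksemailtrack/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            emailtrack.py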
Step 2: Configure the project
Update the settings.py file to integrate Scrapy-Selenium. Selenium is used here because it can render pages whose content is generated dynamically by JavaScript.
from shutil import which

# Drive Chrome through Selenium; the driver binary is looked up on the PATH
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = []

# Route requests through scrapy-selenium's downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
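If you don’t need to watch the browser window, you can run Chrome headless by passing the standard '--headless' flag through the driver arguments (optional):
SELENIUM_DRIVER_ARGUMENTS = ['--headless']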
Download the ChromeDriver build that matches your version of Chrome and either place it somewhere on your PATH (so which('chromedriver') can find it) or drop it next to your scrapy.cfg file and point SELENIUM_DRIVER_EXECUTABLE_PATH at that location directly.
Step 3: Writing the Spider
Add the necessary imports at the top of your emailtrack.py spider file:
import scrapy
import re
from scrapy_selenium import SeleniumRequest
from scrapy.linkextractors import LxmlLinkExtractor
Define the spider class and issue the first request to the target site:
class EmailtrackSpider(scrapy.Spider):
    name = 'emailtrack'

    # Collect emails in a set so duplicates are discarded automatically
    uniqueemail = set()

    def start_requests(self):
        # Render the start page through Selenium so JavaScript content is loaded
        yield SeleniumRequest(
            url="https://www.geeksforgeeks.org/",
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )
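The screenshot=True argument is optional for email extraction. According to scrapy-selenium’s documentation, the middleware attaches the rendered page’s PNG bytes to response.meta['screenshot'], so inside any callback you could persist it like this (the filename is arbitrary):
# Optional: save the screenshot taken by the Selenium middleware
with open('homepage.png', 'wb') as image_file:
    image_file.write(response.meta['screenshot'])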
Step 4: Parsing the Links
The parse method extracts every link from the landing page and filters them down to pages likely to contain email addresses, such as “Contact” or “About” pages, before scheduling a request for each one.
    def parse(self, response):
        # Pull every link from the rendered page
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        Finallinks = [str(link.url) for link in links]

        # Keep only pages likely to list email addresses
        links = [
            link for link in Finallinks
            if 'contact' in link.lower() or 'about' in link.lower()
        ]

        # Also scan the page we are already on
        links.append(str(response.url))

        for link in links:
            yield SeleniumRequest(
                url=link,
                wait_time=3,
                screenshot=True,
                callback=self.parse_link,
                dont_filter=True,
                meta={'links': links}
            )
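If you want to see what extract_links returns on its own, here is a small standalone check; the example.com URL and HTML string are made up for illustration:
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LxmlLinkExtractor

# A made-up response standing in for a downloaded page
html = b'<html><body><a href="/about-us/">About</a> <a href="/blog/">Blog</a></body></html>'
response = HtmlResponse(url='https://example.com/', body=html, encoding='utf-8')

links = LxmlLinkExtractor(allow=()).extract_links(response)
print([link.url for link in links])
# ['https://example.com/about-us/', 'https://example.com/blog/']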
Step 5: Extracting Emails
The parse_link method skips social-media URLs, pulls email addresses out of the page source with a regular expression, and then follows the next link passed along in the request meta.
    def parse_link(self, response):
        links = response.meta['links']

        # Skip social-media pages, which rarely expose useful addresses
        bad_words = ['facebook', 'instagram', 'youtube', 'twitter', 'wiki', 'linkedin']
        if not any(bad_word in response.url for bad_word in bad_words):
            # Simple pattern: word characters, '@', domain, dot, TLD
            email_list = re.findall(r'\w+@\w+\.\w+', response.text)
            self.uniqueemail.update(email_list)

        if links:
            # Keep working through the queue of links passed via meta
            yield SeleniumRequest(
                url=links.pop(0),
                callback=self.parse_link,
                dont_filter=True,
                meta={'links': links}
            )
        else:
            # Queue exhausted: hand off to the final callback
            yield SeleniumRequest(
                url=response.url,
                callback=self.parsed,
                dont_filter=True
            )
Step 6: Displaying the Results
Finally, the parsed method filters the collected addresses, keeping only those that contain a common top-level domain ('.com' or '.in'), and prints the result.
    def parsed(self, response):
        # Keep only addresses with a plausible TLD ('.com' or '.in')
        finalemail = [email for email in self.uniqueemail if '.com' in email or '.in' in email]
        print("Emails scraped:", finalemail)
Step 7: Running the Spider
Run the spider using the following command in your terminal, from the project directory.
scrapy crawl emailtrack
This will start the crawl, and your terminal will display the email addresses scraped from the specified pages.
Conclusion
This email extractor project is a simple but effective demonstration of Scrapy’s capabilities. Using Python and Scrapy, you can automate the extraction of emails and other data from websites, saving time and effort. This project is a great addition to any developer’s portfolio and can be extended to handle more complex web scraping tasks.