The project began when my company needed to build a larger, targeted mailing list for marketing and outreach. My task was to find relevant contact information, specifically email addresses and the names of the people in charge, from various websites. Manual collection was too time-consuming and inefficient, so I used Selenium for web scraping, supported by Decodo (Smart Proxy) to enable seamless IP rotation and avoid being blocked by the target websites.
Why Selenium?
Selenium is a versatile automation tool mainly used for testing web applications, but it also works well for web scraping. Unlike basic scrapers that only process static HTML, Selenium interacts with web pages like a real user by clicking, typing, and managing AJAX content. This makes it suitable for extracting data from modern websites that rely heavily on JavaScript.
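As a quick illustration, here is a minimal sketch of the kind of user-like interaction Selenium supports; the URL and field name below are placeholders, not part of this tutorial's target site:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/search")  # placeholder URL
box = driver.find_element(By.NAME, "q")   # hypothetical search field
box.send_keys("selenium web scraping")    # type like a real user
box.submit()                              # submit the form, triggering a page update
driver.quit()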
Why Decodo?
Decodo, formerly known as Smart Proxy, is a proxy service that offers access to a large pool of residential and mobile IPs. It helps users hide their real IP addresses, scrape data, and browse anonymously, all at competitive and affordable prices.
Tutorial Overview
This tutorial demonstrates how to perform web scraping using Selenium combined with Decodo proxies. For this example, we’ll be using:
- Python v3.12.6
- Selenium
- A test e-commerce website (https://www.scrapingcourse.com/ecommerce/)
Step 1: Create and activate a virtual environment
pyenv virtualenv 3.12.6 web-scraping-tutorial
pyenv activate web-scraping-tutorial
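If you don't use pyenv, Python's built-in venv module is a reasonable alternative:
python3 -m venv web-scraping-tutorial
source web-scraping-tutorial/bin/activate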
Step 2: Create a Jupyter Notebook
Next, create a new Jupyter notebook and name it index.ipynb to begin writing your scraping code.
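If Jupyter isn't already available in the environment, one possible way to install and launch it:
pip install notebook
jupyter notebook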
Step 3: Install Required Libraries
Use the following commands to install all the necessary Python packages for this tutorial:
%pip install --upgrade setuptools
%pip install selenium
%pip install webdriver-manager
%pip install blinker==1.7.0
%pip install selenium-wire
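The blinker pin is there because selenium-wire is known to break with newer blinker releases. As an optional sanity check, you can confirm the installed versions from inside the notebook:
from importlib.metadata import version

# Print the resolved versions of the key packages
print(version("selenium"), version("selenium-wire"), version("blinker"))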
Step 4: Import Necessary Libraries
Now, import all the required libraries to set up Selenium and interact with the web page:
# Standard imports
import random
from concurrent.futures import ThreadPoolExecutor

# Web driver setup
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# Element selection and interaction
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support import expected_conditions as EC
Step 5: WebScraper Class Setup
This step creates a WebScraper class that handles all the configuration needed for our scraping project.
class WebScraper:
    def __init__(self):
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7; rv:110.0) Gecko/20100101 Firefox/110.0"
        ]
        self.chrome_options = self.configure_chrome_options()
        self.service = Service(ChromeDriverManager().install())
        self.driver = self.initialize_driver()

    def configure_chrome_options(self):
        chrome_options = Options()
        chrome_options.add_argument("--headless=new")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--blink-settings=imagesEnabled=false")
        chrome_options.set_capability('goog:loggingPrefs', {'browser': 'ALL'})
        chrome_options.add_experimental_option(
            "prefs", {"profile.managed_default_content_settings.images": 2}
        )
        return chrome_options

    def authenticate(self):
        username = 'xxx'
        password = 'xxx'
        proxy = f"https://{username}:{password}@gate.decodo.com:7000"
        # Set selenium-wire options to use the proxy
        seleniumwire_options = {
            "proxy": {
                "https": proxy
            },
            "exclude_hosts": [
                'clients2.google.com',
                'accounts.google.com',
            ]
        }
        return seleniumwire_options

    def initialize_driver(self):
        user_agent = random.choice(self.user_agents)
        self.chrome_options.add_argument(f"user-agent={user_agent}")
        return webdriver.Chrome(
            service=self.service,
            options=self.chrome_options,
            seleniumwire_options=self.authenticate()
        )

    def scrape(self, url):
        try:
            self.driver.get(url)
            # Wait for products to load
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "li.product"))
            )
            products = []
            product_elements = self.driver.find_elements(By.CSS_SELECTOR, "li.product")
            for product in product_elements:
                try:
                    title = product.find_element(By.CSS_SELECTOR, "h2.woocommerce-loop-product__title").text
                    price = product.find_element(By.CSS_SELECTOR, "span.price bdi").text
                    products.append({
                        'title': title,
                        'price': price
                    })
                except NoSuchElementException:
                    # Skip if any element is not found for a product
                    continue
            return products
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return None

    def close(self):
        if hasattr(self, 'driver') and self.driver:
            self.driver.quit()
- User Agents: This is a list of different browser identifiers that we can rotate through to make our requests appear more like regular browser traffic (helps avoid detection as a bot).
- Chrome Configuration
- "--headless=new" # Run Chrome in headless mode (no visible window)
- "--disable-gpu" # Disable GPU hardware acceleration
- "--no-sandbox" # Bypass OS security model
- "--disable-dev-shm-usage" # Overcome limited resource problems
- "--blink-settings=imagesEnabled=false" # Disable images
- ("goog:loggingPrefs", {"browser": "ALL"}) # Enable browser logging
- ("prefs", {"profile.managed_default_content_settings.images": 2}) # Another way to disable images
- Proxy Authentication
- Proxy credentials: Uses username/password to authenticate with a proxy service
- Proxy URL: Formats the complete proxy URL with authentication
- Exclusion list: Prevents certain domains (like Google services) from going through the proxy
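Before pointing the scraper at the real target, a quick sanity check can confirm that traffic actually exits through the Decodo proxy. A minimal sketch, assuming the WebScraper class above and using httpbin.org/ip (an arbitrary IP-echo service chosen just for this check) to display the exit IP:
scraper = WebScraper()
scraper.driver.get("https://httpbin.org/ip")  # echoes the IP address the request came from
print(scraper.driver.find_element(By.TAG_NAME, "body").text)  # should show a proxy IP, not your own
scraper.close()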
Step-by-step breakdown of the scrape() method
self.driver.get(url)
- The scraper navigates to the URL you want to scrape.
WebDriverWait(self.driver, 10).until(
EC.presence_of_element_located((By.CSS_SELECTOR, "li.product"))
)
- This line waits up to 10 seconds for at least one product (li.product) to appear on the page.
- It helps ensure that the page has finished loading before scraping.
product_elements = self.driver.find_elements(By.CSS_SELECTOR, "li.product")
- Collects all matching product elements on the page into a list.
for product in product_elements:
- Loops through each product element found.
title = product.find_element(By.CSS_SELECTOR, "h2.woocommerce-loop-product__title").text
price = product.find_element(By.CSS_SELECTOR, "span.price bdi").text
- Extracts the title and price of each product using specific CSS selectors; .text returns the element's visible text content.
products.append({
'title': title,
'price': price
})
- Adds each product’s data as a dictionary into a list.
- Output example:
[
    {'title': 'Atlas Fitness Tank', 'price': '$18.00'},
    {'title': 'Atomic Endurance Running Tee (Crew-Neck)', 'price': '$29.00'},
]
Step 6: Create a wrapper function
This function makes web scraping easier by putting everything into one place.
You just give it a URL, and it will:
- Try to scrape the webpage
- Show an error message if something goes wrong, and
- Make sure the scraper is always closed at the end.
def web_scraping(url):
    scraper = WebScraper()
    try:
        return scraper.scrape(url)
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None
    finally:
        scraper.close()
Step 7: Invoke the function
url = "https://www.scrapingcourse.com/ecommerce/page/2/"
products = web_scraping(url)
print(products)
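Because the wrapper returns None on failure, a small guard around the result keeps the output readable; a minimal usage sketch:
if products:
    for item in products:
        print(f"{item['title']}: {item['price']}")
else:
    print("Scraping failed or returned no products.")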
Conclusion
This tutorial has walked you through the process of web scraping an e-commerce website using Selenium, along with IP rotation powered by Decodo. We hope you found it helpful and informative. Stay tuned for the next part, where we’ll explore how to implement multi-threading for even more efficient scraping.