The project began when my company needed to build a larger, targeted mailing list for marketing and outreach. My task was to find relevant contact information, specifically email addresses and the names of the people in charge, from various websites. Manual collection was too time-consuming and inefficient, so I used Selenium for web scraping, supported by Decodo (Smart Proxy) to enable seamless IP rotation and avoid being blocked by the target websites.
Why Selenium?
Selenium is a versatile automation tool mainly used for testing web applications, but it also works well for web scraping. Unlike basic scrapers that only process static HTML, Selenium interacts with web pages like a real user by clicking, typing, and managing AJAX content. This makes it suitable for extracting data from modern websites that rely heavily on JavaScript.
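As a quick illustration, here is a minimal sketch of the kind of user-like interaction Selenium supports; the URL and field name below are placeholders, not part of this tutorial's target site:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/search")  # placeholder URL
box = driver.find_element(By.NAME, "q")   # hypothetical search field
box.send_keys("selenium web scraping")    # type like a real user
box.submit()                              # submit the form, triggering a page update
driver.quit()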
Why Decodo?
Decodo, formerly known as Smart Proxy, is a proxy service that offers access to a large pool of residential and mobile IPs. It helps users hide their real IP addresses, scrape data, and browse anonymously, all at competitive and affordable prices.
Tutorial Overview
This tutorial demonstrates how to perform web scraping using Selenium combined with Decodo proxies. For this example, we’ll be using:
- Python v3.12.6
- Selenium
- A test e-commerce website (https://www.scrapingcourse.com/ecommerce/)
Step 1: Create and activate a virtual environment
pyenv virtualenv 3.12.6 web-scraping-tutorial
pyenv activate web-scraping-tutorial
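If you don't use pyenv, Python's built-in venv module is a reasonable alternative:
python3 -m venv web-scraping-tutorial
source web-scraping-tutorial/bin/activate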
Step 2: Create a Jupyter Notebook
Next, create a new Jupyter notebook and name it index.ipynb to begin writing your scraping code.
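If Jupyter isn't already available in the environment, one possible way to install and launch it:
pip install notebook
jupyter notebook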
Step 3: Install Required Libraries
Use the following commands to install all the necessary Python packages for this tutorial:
%pip install --upgrade setuptools
%pip install selenium
%pip install webdriver-manager
%pip install blinker==1.7.0
%pip install selenium-wire
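The blinker pin is there because selenium-wire is known to break with newer blinker releases. As an optional sanity check, you can confirm the installed versions from inside the notebook:
from importlib.metadata import version

# Print the resolved versions of the key packages
print(version("selenium"), version("selenium-wire"), version("blinker"))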
Step 4: Import Necessary Libraries
Now, import all the required libraries to set up Selenium and interact with the web page:
# Standard imports
import random
from concurrent.futures import ThreadPoolExecutor

# Web driver setup
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# Element selection and interaction
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support import expected_conditions as EC
Step 5: WebScraper Class Setup
This step creates a WebScraper class that handles all the configuration needed for our scraping project.
class WebScraper:
    def __init__(self):
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7; rv:110.0) Gecko/20100101 Firefox/110.0"
        ]
        self.chrome_options = self.configure_chrome_options()
        self.service = Service(ChromeDriverManager().install())
        self.driver = self.initialize_driver()

    def configure_chrome_options(self):
        chrome_options = Options()
        chrome_options.add_argument("--headless=new")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--blink-settings=imagesEnabled=false")
        chrome_options.set_capability('goog:loggingPrefs', {'browser': 'ALL'})
        chrome_options.add_experimental_option(
            "prefs", {"profile.managed_default_content_settings.images": 2}
        )
        return chrome_options

    def authenticate(self):
        username = 'xxx'
        password = 'xxx'
        proxy = f"https://{username}:{password}@gate.decodo.com:7000"
        # Set selenium-wire options to use the proxy
        seleniumwire_options = {
            "proxy": {
                "https": proxy
            },
            "exclude_hosts": [
                'clients2.google.com',
                'accounts.google.com',
            ]
        }
        return seleniumwire_options

    def initialize_driver(self):
        user_agent = random.choice(self.user_agents)
        self.chrome_options.add_argument(f"user-agent={user_agent}")
        return webdriver.Chrome(
            service=self.service,
            options=self.chrome_options,
            seleniumwire_options=self.authenticate()
        )

    def scrape(self, url):
        try:
            self.driver.get(url)
            # Wait for products to load
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "li.product"))
            )
            products = []
            product_elements = self.driver.find_elements(By.CSS_SELECTOR, "li.product")
            for product in product_elements:
                try:
                    title = product.find_element(By.CSS_SELECTOR, "h2.woocommerce-loop-product__title").text
                    price = product.find_element(By.CSS_SELECTOR, "span.price bdi").text
                    products.append({
                        'title': title,
                        'price': price
                    })
                except NoSuchElementException:
                    # Skip if any element is not found for a product
                    continue
            return products
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return None

    def close(self):
        if hasattr(self, 'driver') and self.driver:
            self.driver.quit()
- User Agents: This is a list of different browser identifiers that we can rotate through to make our requests appear more like regular browser traffic (helps avoid detection as a bot).
- Chrome Configuration
- "--headless=new" # Run Chrome in headless mode (no visible window)
- "--disable-gpu" # Disable GPU hardware acceleration
- "--no-sandbox" # Bypass OS security model
- "--disable-dev-shm-usage" # Overcome limited resource problems
- "--blink-settings=imagesEnabled=false" # Disable images
- ("goog:loggingPrefs", {"browser": "ALL"}) # Enable browser logging
- ("prefs", {"profile.managed_default_content_settings.images": 2}) # Another way to disable images
- Proxy Authentication
- Proxy credentials: Uses username/password to authenticate with a proxy service
- Proxy URL: Formats the complete proxy URL with authentication
- Exclusion list: Prevents certain domains (like Google services) from going through the proxy
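Before pointing the scraper at the real target, a quick sanity check can confirm that traffic actually exits through the Decodo proxy. A minimal sketch, assuming the WebScraper class above and using httpbin.org/ip (an arbitrary IP-echo service chosen just for this check) to display the exit IP:
scraper = WebScraper()
scraper.driver.get("https://httpbin.org/ip")  # echoes the IP address the request came from
print(scraper.driver.find_element(By.TAG_NAME, "body").text)  # should show a proxy IP, not your own
scraper.close()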
Step-by-step breakdown of the scrape() method
self.driver.get(url)
- The scraper navigates to the URL you want to scrape.
WebDriverWait(self.driver, 10).until(
EC.presence_of_element_located((By.CSS_SELECTOR, "li.product"))
)
- This line waits up to 10 seconds for at least one product (li.product) to appear on the page.
- It helps ensure that the page has finished loading before scraping.
product_elements = self.driver.find_elements(By.CSS_SELECTOR, "li.product")
- Collects all matching product elements on the page into a list.
for product in product_elements:
- Loops through each product element found.
title = product.find_element(By.CSS_SELECTOR, "h2.woocommerce-loop-product__title").text
price = product.find_element(By.CSS_SELECTOR, "span.price bdi").text
- Extracts the title and price of each product using specific CSS selectors; .text returns the element's visible text content.
products.append({
'title': title,
'price': price
})
- Adds each product’s data as a dictionary into a list.
- Output example:
[
    {'title': 'Atlas Fitness Tank', 'price': '$18.00'},
    {'title': 'Atomic Endurance Running Tee (Crew-Neck)', 'price': '$29.00'},
]
Step 6: Create a wrapper function
This function makes web scraping easier by putting everything into one place.
You just give it a URL, and it will:
- Try to scrape the webpage
- Show an error message if something goes wrong, and
- Make sure the scraper is always closed at the end.
def web_scraping(url):
    scraper = WebScraper()
    try:
        return scraper.scrape(url)
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None
    finally:
        scraper.close()
Step 7: Invoke the function
url = "https://www.scrapingcourse.com/ecommerce/page/2/"
products = web_scraping(url)
print(products)
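Because the wrapper returns None on failure, a small guard around the result keeps the output readable; a minimal usage sketch:
if products:
    for item in products:
        print(f"{item['title']}: {item['price']}")
else:
    print("Scraping failed or returned no products.")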
Conclusion
This tutorial has walked you through the process of web scraping an e-commerce website using Selenium, along with IP rotation powered by Decodo. We hope you found it helpful and informative. Stay tuned for the next part, where we’ll explore how to implement multi-threading for even more efficient scraping.