In the previous article, I explained how to perform web scraping using Selenium along with Decodo, a proxy service that helps rotate IP addresses. That tutorial covered the basics of scraping a single webpage, which serves as a solid foundation. However, real-world scraping tasks usually require collecting data from hundreds or thousands of pages, and doing that quickly is not always straightforward.
Problem Statement:
Scraping one page at a time is slow and inefficient, especially when dealing with websites that span hundreds or thousands of pages. On top of that, making repeated requests from the same IP can easily trigger anti-bot systems, resulting in blocks or CAPTCHAs.
Solution:
One effective way to overcome this is by using multi-threading. This allows your scraper to process multiple pages simultaneously, significantly speeding up the operation. Combined with IP rotation through proxies like Decodo, you can minimize detection and gather data more efficiently.
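As a quick illustration, here is one common way to point a Chrome WebDriver at a proxy. The endpoint below is a placeholder, not a real Decodo gateway; check your provider's dashboard for the actual host and port.

```python
from selenium import webdriver

# Hypothetical proxy endpoint: substitute your provider's real gateway.
# Note that --proxy-server does not accept inline username/password;
# authenticated proxies need a different approach (e.g., a browser
# extension or the selenium-wire package).
PROXY = "gate.example-proxy.com:7000"

options = webdriver.ChromeOptions()
# Route all browser traffic through the proxy
options.add_argument(f"--proxy-server=http://{PROXY}")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://httpbin.org/ip")  # shows the exit IP
    print(driver.page_source)
finally:
    driver.quit()
```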
Is Selenium Thread-Safe?
No, Selenium is not thread-safe. This means that sharing a single WebDriver instance across multiple threads can lead to unpredictable results, such as crashes, incorrect data extraction, or even browser failures. Each thread must create and control its own separate WebDriver session to function properly.
To implement multi-threading safely in a Selenium-based scraper:
- Instantiate a new WebDriver in each thread, which keeps browser sessions isolated from each other
- Avoid sharing data structures between threads
- Clean up resources after each thread finishes its job by properly closing the browser session (e.g., calling driver.quit() in a finally block)
In the next section, we’ll walk through a working example of using multi-threading with Selenium while ensuring each WebDriver session remains self-contained to avoid thread-safety issues.
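The example calls a webScraping() helper, whose body comes from the previous article and isn't repeated here. To keep this section self-contained, below is a minimal sketch of what it might look like; the headless option, CSS selector, and return shape are assumptions for illustration, not the original implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver
from selenium.webdriver.common.by import By

def webScraping(url):
    """Scrape a single page in its own, thread-local WebDriver session."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run without a visible window
    driver = webdriver.Chrome(options=options)  # one driver per thread
    try:
        driver.get(url)
        # Hypothetical extraction: grab product titles; adjust the
        # selector to match the page you are actually scraping.
        titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".product-name")]
        return {"url": url, "titles": titles}
    except Exception:
        return None  # a None result is filtered out later
    finally:
        driver.quit()  # always release the browser session
```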
```python
urls = [
    'https://www.scrapingcourse.com/ecommerce/page/2/',
    'https://www.scrapingcourse.com/ecommerce/page/3/',
    'https://www.scrapingcourse.com/ecommerce/page/4/',
    'https://www.scrapingcourse.com/ecommerce/page/5/',
    'https://www.scrapingcourse.com/ecommerce/page/6/'
]

futures = []
with ThreadPoolExecutor(max_workers=5) as executor:
    for url in urls:
        futures.append(executor.submit(webScraping, url))

results = [future.result() for future in futures if future.result()]
print(results)
```
Step-by-step Breakdown
```python
with ThreadPoolExecutor(max_workers=5) as executor:
```
- Creates a pool of worker threads (5 in this case) to execute tasks concurrently. The `with` statement ensures proper cleanup when done: on exit, the pool waits for all submitted tasks to finish and then shuts down.
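Under the hood, the context manager is roughly equivalent to creating the pool yourself and shutting it down in a finally block. This sketch (reusing the `urls` and `webScraping` names from above) just shows what `with` saves you from writing:

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=5)
try:
    futures = [executor.submit(webScraping, url) for url in urls]
finally:
    # Exiting a `with` block does this implicitly: block until all
    # submitted tasks complete, then release the worker threads.
    executor.shutdown(wait=True)
```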
```python
for url in urls:
    futures.append(executor.submit(webScraping, url))
```
- For each URL, we submit the `webScraping` function for execution with the URL as its argument. `executor.submit()` returns a Future object that represents the eventual result.
- We store these Future objects in the `futures` list.
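As an aside, if you want to handle results as soon as each page finishes rather than in submission order, `concurrent.futures.as_completed` is a common alternative; a brief sketch using the same names:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(webScraping, url) for url in urls]
    # as_completed yields each Future the moment its task finishes,
    # regardless of the order the URLs were submitted in.
    for future in as_completed(futures):
        result = future.result()
        if result:
            print(result)
```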
```python
results = [future.result() for future in futures if future.result()]
```
- After all tasks are submitted, we collect the results.
- `future.result()` blocks until the task completes and returns its value. (Calling it twice here is harmless, since a Future caches its result after the first call.)
- The list comprehension filters out any None results (returned when `webScraping` fails).
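One caveat: `future.result()` re-raises any exception that occurred inside the task, so a single failed page would crash this list comprehension. A more defensive collection loop, assuming the same `webScraping` helper, might look like:

```python
results = []
for future in futures:
    try:
        result = future.result()  # re-raises exceptions from the worker
        if result:
            results.append(result)
    except Exception as exc:
        # One failed page should not abort the whole run; log and move on.
        print(f"scrape failed: {exc}")
print(results)
```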
Conclusion
While Selenium is not inherently thread-safe, it can still be used effectively in a multi-threaded environment if handled with care. The key is to ensure that each thread creates and controls its own isolated WebDriver instance and that every browser session is properly closed after use. By following these best practices, you can safely use multi-threading to speed up your scraping tasks without running into stability issues.