Web pages with infinite scrolling are a common obstacle in web scraping, especially when automating with Selenium. These pages load content dynamically as the user scrolls, which is a problem for scraping projects that need the entirety of a page's content. To address this, Selenium can automate scrolling, mimicking a user's actions so that all dynamically loaded content is captured. A dedicated web scraping API can also streamline the extraction process with more advanced functionality. This guide walks through strategies for automating scrolling with Selenium, providing a step-by-step approach to handling pages with infinite scrolling.
To automate this, the JavaScript function window.scrollTo(x, y) comes in handy: executed through Selenium's driver.execute_script(), it scrolls the page to specific coordinates. To make sure we reach the bottom of an infinitely scrolling page, a while loop can repeatedly scroll down until no further content is loaded.
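As a minimal sketch of that building block (assuming a driver instance is already on the target page), a single scroll-to-bottom step followed by a height check looks like this:
# Jump to the current bottom of the page...
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# ...then read the page height to check whether more content was appended
height = driver.execute_script("return document.body.scrollHeight")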
An illustrative example of this approach can be seen when extracting information from an infinite scrolling page like web-scraping.dev/testimonials. The process involves executing a loop that scrolls to the bottom of the page until the end is reached, as demonstrated below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Chrome()
driver.get("https://web-scraping.dev/testimonials/")
prev_height = -1
max_scrolls = 100
scroll_count = 0
while scroll_count < max_scrolls:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # Allow time for new content to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == prev_height:
        break
    prev_height = new_height
    scroll_count += 1
# Retrieve all loaded testimonials
elements = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "testimonial")))
results = []
for element in elements:
    text = element.find_element(By.CLASS_NAME, "text").get_attribute('innerHTML')
    results.append(text)
print(f"Scraped: {len(results)} results!")
driver.quit()
This script scrolls through the infinitely scrolling page step by step, waiting after each scroll for new content to load and stopping once the page height no longer changes, which signals that the bottom has been reached. With scrolling complete, it collects and parses the now fully loaded page content.
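The fixed one-second pause works for a demonstration, but it can be fragile on slow connections and wasteful on fast ones. One common refinement is to wait until the number of loaded items actually grows instead of sleeping for a fixed time. The sketch below is only an illustration: it assumes the same .testimonial class and driver from the example above, and the wait_for_more_items helper name is hypothetical.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def wait_for_more_items(driver, locator, previous_count, timeout=5):
    # Block until more elements matching the locator exist than before,
    # or give up after `timeout` seconds (interpreted as "no more content").
    try:
        WebDriverWait(driver, timeout).until(
            lambda d: len(d.find_elements(*locator)) > previous_count
        )
        return True
    except TimeoutException:
        return False

# Usage inside the scrolling loop, replacing the fixed time.sleep(1):
# count = len(driver.find_elements(By.CLASS_NAME, "testimonial"))
# driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# if not wait_for_more_items(driver, (By.CLASS_NAME, "testimonial"), count):
#     break  # nothing new loaded, assume the end of the feed has been reached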