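Dynamic websites produce results in response to user actions. For example, when a web page finishes loading only as you scroll down or move the mouse over the screen, there must be some dynamic code behind it. When hovering the mouse pointer over some text reveals a set of options, that is dynamic behavior too. Here is a very good, detailed article about dynamic web pages.
You can find many articles on the internet to help you scrape dynamic websites. This one describes my approach to scraping Doordash.com, step by step.
A prerequisite for scraping a dynamic web page is loading its JavaScript in a browser. Here that is done with a headless browser (explained later).
My goal is to scrape 50,000+ menus from Doordash.com.
[Remember that Python is case sensitive, except under certain specific conditions.]
Let's start coding by importing the required libraries along with a few helper libraries we may need. As the title suggests, I will be using the Selenium library.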
- #importing required libraries
- from selenium import webdriver
- from selenium.webdriver.common.by import By
- from selenium.webdriver.support.ui import WebDriverWait
- from selenium.webdriver.support import expected_conditions as EC
- from selenium.common.exceptions import TimeoutException
- from selenium.webdriver.common.action_chains import ActionChains
- from selenium.webdriver.remote.webelement import WebElement
- from selenium_move_cursor.MouseActions import move_to_element_chrome
- from selenium.webdriver.common.keys import Keys
- from selenium.webdriver.chrome.options import Options
- import json
- import numpy as np
- import time
- import pandas as pd #to save CSV file
- from bs4 import BeautifulSoup
- import ctypes #to create text popup
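Selenium's "webdriver" module is the most important one because it controls the browser. Controlling a browser requires a matching driver, for example "chromedriver" for Google Chrome. I will use "chromedriver", and to use it we need to tell "webdriver" about it.
Let's define this browser for "webdriver" and set its option to "--headless".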
- #defining the browser and adding the "--headless" argument
- opts = Options()
- opts.add_argument('--headless')
- driver = webdriver.Chrome(options=opts) #chromedriver is resolved automatically (PATH lookup, or Selenium Manager in Selenium 4.6+)
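The "--headless" argument runs Chrome without a visible window. The page's JavaScript still executes in it, which is what lets us handle dynamic web pages.
Below are the URL and the code that opens it with "webdriver".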
- url = 'https://www.doordash.com/en-US'
- driver.maximize_window() #maximize the window
- driver.get(url) #open the URL
- driver.implicitly_wait(220) #wait up to 220 seconds for elements to appear when locating them
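I placed chromedriver in the project directory to keep the path simple. Alternatively, the path to chromedriver can be built with the "os" module and passed to "webdriver" explicitly.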
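As a minimal sketch of the "os" alternative mentioned above, the helper below builds an absolute path to the driver. The function name `chromedriver_path` and the assumption that the executable sits directly in the project directory are mine, not part of the original script.

```python
import os
import sys


def chromedriver_path(project_dir=None, platform=None):
    """Return the absolute path where chromedriver is expected to live."""
    # Assumption: the driver sits directly inside the project directory.
    project_dir = project_dir or os.getcwd()
    platform = platform or sys.platform
    # The executable carries an .exe suffix on Windows only.
    name = "chromedriver.exe" if platform.startswith("win") else "chromedriver"
    return os.path.abspath(os.path.join(project_dir, name))


print(chromedriver_path())
```

In Selenium 4 the resulting path would typically be handed to the browser through a `Service(executable_path=...)` object from `selenium.webdriver.chrome.service`.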
I looked over Doordash.com to see where our results (the menus) are located and how to reach them.
The script will
1- Open the browser
- #defining the browser and adding the "--headless" argument
- opts = Options()
- opts.add_argument('--headless')
- driver = webdriver.Chrome(options=opts) #chromedriver is resolved automatically (PATH lookup, or Selenium Manager in Selenium 4.6+)
2- Open the URL (doordash.com)
- url = 'https://www.doordash.com/en-US'
- driver.maximize_window() #maximize the window
- driver.get(url) #open the URL
- driver.implicitly_wait(220) #wait up to 220 seconds for elements to appear when locating them
3- Scroll down to load the whole page
- driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
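A single scroll may not be enough when a page keeps loading content as you go. Here is a hedged sketch of one common approach: keep scrolling until the page height stops growing. The helper name `scroll_to_bottom` and its `pause` and `max_rounds` parameters are assumptions for illustration; the only driver method it relies on is `execute_script`.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # this scroll loaded nothing new
        last_height = new_height
    return max_rounds  # gave up after max_rounds scrolls
```

With the driver from step 1 it would be called as `scroll_to_bottom(driver)`.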
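4- Navigate to "Top Cuisines Near You"
5- Click "Pizza Near Me" (I think this is enough for 50k+ menus)
- time.sleep(5)
- driver.find_element(By.XPATH, '//h2[text()="Top Cuisines Near You"]') #raises an exception if the section is missing
- element = driver.find_element(By.XPATH, '//a[@class="sc-hrWEMg fFHnHa"]') #"//" searches the whole document, so this is the first matching link on the page
- time.sleep(5)
- element.click()
- driver.implicitly_wait(220)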
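The fixed `time.sleep(5)` calls pause for the full five seconds even when the page is ready sooner. Selenium's `WebDriverWait` (imported at the top) addresses this by polling a condition. The library-free sketch below only illustrates that polling idea; `wait_until` and its parameters are hypothetical names, not Selenium API.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value  # the condition was met, hand back its result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)  # wait a little before checking again
```

A condition could be something like `lambda: driver.find_elements(By.XPATH, '//h2[text()="Top Cuisines Near You"]')`, since `find_elements` returns an empty list until the element exists.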
6- Load the page and find the page range
- #define the lists
- names = []
- prices = []
- #extract the number of pages for the searched product
- driver.implicitly_wait(120)
- time.sleep(3)
- result = driver.page_source
- soup = BeautifulSoup(result, 'html.parser')
- page = soup.find_all('div', class_="sc-cvbbAY htjLED")
- #getting numbers out of the page-label strings
- start = int(page[2].text)
- print('1st page:', start)
- last = int(page[-2].text)
- final = last + 1 #range() excludes its end value, so add 1
- print('last page:', last)
- print(f'first page: {start}, and last page with + 1: {final}')
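Picking the page numbers by fixed positions (`page[2]`, `page[-2]`) depends on the exact layout of the pagination bar. As a sketch of a more tolerant variant, the helper below keeps only the labels that are plain integers. The name `page_bounds` and the sample labels are made up for illustration.

```python
def page_bounds(labels):
    """Return (first, last) page numbers found among pagination labels."""
    # Keep only labels that are plain integers; arrows and ellipses are skipped.
    stripped = (label.strip() for label in labels)
    numbers = [int(text) for text in stripped if text.isdecimal()]
    if not numbers:
        raise ValueError("no page numbers found")
    return min(numbers), max(numbers)


labels = ["<", "1", "2", "3", "...", "12", ">"]  # hypothetical pagination labels
first, last = page_bounds(labels)
pages = range(first, last + 1)  # +1 because range() excludes its end value
print(first, last, len(pages))  # prints: 1 12 12
```

In the script, the labels would come from the `.text` of each element returned by `soup.find_all(...)`.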
7- Click each store (the page already has a default location set, China, so there is no need to worry about the location)
- #set the page range and
- #loop over all the pages of stores
- for i in range(start, final, 1):
-     time.sleep(7)
-     #find the number of stores per page
-     list_length = len(driver.find_elements(By.XPATH, "//div[@class='StoreCard_root___1p3uN']"))
-     products_per_page = list_length + 1
-     #loop through the menus of each store on a page
-     for x in range(0, list_length, 1):
-         time.sleep(7)
-         driver.execute_script("window.scrollTo({top: 75, behavior: 'smooth'})")
-         store_name = driver.find_elements(By.XPATH, '//div[@class="StoreCard_storeDetail___3C0TX"]')
-         strnm = store_name[x]
-         print(f'{x}- ', strnm.text)
-         time.sleep(4)
-         #locate the store cards again right before clicking
-         element = driver.find_elements(By.XPATH, "//div[@class='StoreCard_storeDetail___3C0TX']")
-         click = element[x]
-         move_to_element_chrome(driver, click, display_scaling=100)
-         time.sleep(7)
-         click.click()
-         driver.implicitly_wait(360)
8- Scrape the menu, then go back to the stores page
-         time.sleep(20)
-         result = driver.page_source
-         time.sleep(11)
-         soup = BeautifulSoup(result, 'html.parser')
-         div = soup.find('div', class_="sc-jwJjzT kjdEnq")
-         if div is not None:
-             time.sleep(25)
-             #the name "item" avoids overwriting the page-loop variable "i"
-             for item in div.find_all('div', class_="sc-htpNat Ieerz"):
-                 pros = item.find('div', class_="sc-jEdsij hukZqW")
-                 print('writing (', pros.text, ') to disk')
-                 names.append(pros.text)
-                 rates = item.find('span', class_="sc-bdVaJa eEdxFA")
-                 #if there is no price for the food, append 'N/A' to the list of 'prices'
-                 if rates is not None:
-                     print('price: ', rates.text)
-                     rate = rates.text
-                 else:
-                     print('N/A')
-                     rate = 'N/A'
-                 prices.append(rate)
-             #return to the stores page
-             driver.back()
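The prices are stored as raw text, with 'N/A' standing in for a missing price. If numeric prices are needed later, a small helper along these lines could convert them. This is only a sketch: `parse_price` is a hypothetical name, and the sample labels assume prices formatted like "$12.99".

```python
import re

# One run of digits, optional thousands separators, optional decimal part.
_PRICE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Convert a price label such as '$12.99' to a float; return None if absent."""
    if not text:
        return None
    match = _PRICE.search(text)
    if match is None:
        return None  # covers placeholders such as 'N/A'
    return float(match.group().replace(",", ""))


for label in ["$12.99", "N/A", "$1,250.00"]:
    print(label, "->", parse_price(label))
```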
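9- Check the number of menus in the names list
-             length = len(names)
Once about 10,000 menus are in the list, break the loop and notify us with a popup window; otherwise repeat the loop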
-             #if the menu record reaches the target, leave the store loop and show a completion message box (Windows only)
-             if 10000 < length < 10050:
-                 ctypes.windll.user32.MessageBoxW(0, f"Congratulations! We have successfully scraped {length} menus.", "Project Completion", 1)
-                 break
-         else:
-             #no menu container was found on this store page, so go back and move on to the next store
-             driver.back()
-             continue
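One thing to keep in mind with nested loops is that `break` leaves only the innermost loop, so the page loop would carry on after the store loop stops. A common way to stop both at once is to put the loops in a function and `return`. The sketch below uses plain lists in place of real pages and stores; `scrape_until` and the sample data are illustrative only.

```python
def scrape_until(pages, target):
    """Collect items page by page, store by store, stopping once target is reached."""
    names = []
    for page in pages:           # outer loop: result pages
        for store in page:       # inner loop: stores on one page
            names.extend(store)  # stand-in for scraping one store's menu
            if len(names) >= target:
                return names     # return leaves both loops at once
    return names                 # ran out of pages before reaching the target


# Two hypothetical pages, each holding the menu items of two stores.
pages = [[["a", "b"], ["c"]], [["d", "e", "f"], ["g"]]]
print(len(scrape_until(pages, target=4)))
```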
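10- The whole process keeps looping until we have about 10,000 menus.
11- If the 10,000 target is not reached after scraping all the stores on a page, click the "Next" button and keep scraping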
-     #after scraping each store on a page, report that it is moving to the next page
-     print(f'Now moving to page number {i}')
-     #click the next-page button
-     driver.find_elements(By.XPATH, '//div[@class="sc-gGBfsJ jFaVNA"]')[1].click()
12- Save the results as a CSV file.
- #save to dataframe
- df = pd.DataFrame({'Name': names, 'Price': prices})
- #export as csv file
- df.to_csv('doordash_menues.csv')
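For comparison, here is a sketch of the same export using the standard-library `csv` module instead of pandas. It also checks that the two lists line up, since a name without a matching price would shift every later row. The function name `write_menu_csv` and the sample rows are assumptions for illustration.

```python
import csv
import io


def write_menu_csv(handle, names, prices):
    """Write name/price pairs as CSV rows to an open text handle."""
    if len(names) != len(prices):
        raise ValueError("names and prices must have the same length")
    writer = csv.writer(handle)
    writer.writerow(["Name", "Price"])     # header row
    writer.writerows(zip(names, prices))   # one row per menu item


buffer = io.StringIO()  # in-memory stand-in for a file opened with newline=''
write_menu_csv(buffer, ["Margherita", "Garlic Bread"], ["$12.99", "N/A"])
print(buffer.getvalue())
```

Note that `df.to_csv(...)` also writes the DataFrame index as a first column by default; passing `index=False` leaves it out.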