Selenium basics, headless browsers (Chrome, Firefox), and locating elements

Selenium basics

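This module can send requests, parse the response, and execute JavaScript.

Selenium started out as an automated testing tool. Crawlers use it mainly because requests cannot execute JavaScript, so pages that build their content with scripts come back incomplete.

Selenium handles automated testing for the web.
Appium handles automated testing for mobile apps.

Selenium drives a real browser and simulates what a person would do in it.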

Using a browser

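Download the browser driver (Chrome):
- https://registry.npmmirror.com/binary.html?path=chromedriver/
- https://googlechromelabs.github.io/chrome-for-testing/
- https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/win64/chromedriver-win64.zip
- Firefox driver: https://github.com/mozilla/geckodriver/releases/
- The driver must match both the browser and its version
  IE, Firefox, and Chrome each have their own driver; take Chrome as the example
  Chrome has many versions, and each one needs the chromedriver built for that version
Put the driver in the Python interpreter's directory, or add its location to the PATH environment variable
Install the module: pip install selenium
Write Python code to drive the browser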

import time
from selenium import webdriver

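# Open Firefox the way a person would and keep the browser object
# (use webdriver.Chrome() instead to drive Chrome)
bro=webdriver.Firefox()

# Load the address, as if typing it into the address bar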
bro.get('https://www.baidu.com')
time.sleep(5)
bro.close()
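
If the browser fails to start, a frequent cause is that the driver from the setup steps is not on PATH. The following is a minimal sketch for checking that before launching anything. `find_driver` is a hypothetical helper name, and `chromedriver` and `geckodriver` are assumed to be the executable names of the downloaded Chrome and Firefox drivers.

```python
import shutil


def find_driver(name):
    """Return the full path of a driver executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    # Report where each driver was found, if anywhere.
    for name in ("chromedriver", "geckodriver"):
        path = find_driver(name)
        print(name, "->", path if path else "not found on PATH")
```

Recent Selenium 4 releases can also fetch a matching driver on their own through Selenium Manager, so a missing driver is not always fatal; the check is still useful when a specific driver version has been installed by hand.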

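Commands

bro is the browser object created above

Load an address: bro.get('URL')
Close the current browser window: bro.close()
Set an implicit wait: bro.implicitly_wait(10); when an element lookup finds nothing, Selenium keeps retrying for up to 10 seconds before raising an error
Maximize the window: bro.maximize_window()
HTML of the current page: bro.page_source
Selectors:
from selenium.webdriver.common.by import By
- Find one: bro.find_element(by=By.<strategy>, value='')
- Find all: bro.find_elements(by=By.<strategy>, value='')
Click: element.click()
Type into a text box: element.send_keys()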
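
The commands above can be combined into one small helper. This is a sketch under assumptions, not the only way to structure it: `fetch_page_source` is a hypothetical name, and it calls `bro.quit()` rather than `bro.close()`, because `close()` only closes the current window while `quit()` ends the whole session and also stops the driver process.

```python
def fetch_page_source(bro, url, wait_seconds=10):
    """Load url in an already-started browser and return the rendered HTML."""
    try:
        bro.implicitly_wait(wait_seconds)  # applies to later element lookups
        bro.maximize_window()
        bro.get(url)
        return bro.page_source
    finally:
        # Runs even if loading fails, so no browser is left running.
        bro.quit()
```

A call would look like `html = fetch_page_source(webdriver.Firefox(), 'https://www.baidu.com')` after `from selenium import webdriver`.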

Simulated login

from selenium import webdriver
from selenium.webdriver.common.by import By

bro = webdriver.Firefox()
bro.get('https://www.baidu.com')
bro.implicitly_wait(10)
bro.maximize_window()

# Find the login link by its visible text and click it
a_login = bro.find_element(by=By.LINK_TEXT, value='登录')
a_login.click()

# Type the username and password into the input boxes
username = bro.find_element(by=By.ID, value='TANGRAM__PSP_11__userName')
username.send_keys('13437238745')
password = bro.find_element(by=By.ID, value='TANGRAM__PSP_11__password')
password.send_keys('caimina1')

agree = bro.find_element(By.ID, 'TANGRAM__PSP_11__isAgree')
agree.click()

submit = bro.find_element(By.ID, 'TANGRAM__PSP_11__submit')
submit.click()

bro.close()
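
The login script above writes the account name and password directly into the source. One possible alternative, sketched here, is to read them from environment variables. `load_credentials` and the variable names `BAIDU_USERNAME` and `BAIDU_PASSWORD` are hypothetical and chosen only for illustration.

```python
import os


def load_credentials(env=os.environ):
    """Read the login name and password from environment variables."""
    try:
        return env["BAIDU_USERNAME"], env["BAIDU_PASSWORD"]
    except KeyError as missing:
        # Fail early with a readable message instead of typing an empty value.
        raise RuntimeError(f"environment variable {missing} is not set") from None
```

In the script, `user, pwd = load_credentials()` would come before the browser is opened, and the two `send_keys` calls would receive `user` and `pwd`.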

Other Selenium usage

Headless browsers

A crawler only needs the data, so there is no reason to display a browser window. Headless mode hides the browser's graphical interface.

Chrome

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('blink-settings=imagesEnabled=false') # Do not load images, which speeds up page loads
chrome_options.add_argument('--headless') # Run without a visible window. On Linux without a display, startup fails unless this is set
bro = webdriver.Chrome(options=chrome_options)


bro.get('https://www.cnblogs.com/liuqingzheng/p/16005896.html')

print(bro.page_source)
time.sleep(3)
bro.close()
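
When the same script has to run both with and without a window, the switches can be built from parameters instead of being hard-coded. A minimal sketch, assuming the two switches used above; `chrome_arguments` and `build_chrome_options` are hypothetical helper names.

```python
def chrome_arguments(headless=True, load_images=False):
    """Return the command-line switches to pass to Chrome."""
    args = []
    if headless:
        args.append("--headless")
    if not load_images:
        args.append("blink-settings=imagesEnabled=false")
    return args


def build_chrome_options(args):
    """Copy the switches into a selenium Options object."""
    # Imported here so chrome_arguments can be used without selenium installed.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in args:
        options.add_argument(arg)
    return options
```

The result is used exactly like `chrome_options` above: `bro = webdriver.Chrome(options=build_chrome_options(chrome_arguments()))`. Passing `headless=False` while debugging brings the window back without touching the rest of the script.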

Firefox

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("--headless")  # Run Firefox in headless mode, with no window
options.add_argument("--disable-gpu")
driver = webdriver.Firefox(options=options)
driver.get("https://www.qq.com")
print(driver.page_source)
driver.close()


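Locating elements

By id: By.ID
By name attribute: By.NAME
By tag name: By.TAG_NAME
By class name: By.CLASS_NAME
By the exact text of an a tag: By.LINK_TEXT
By partial text of an a tag: By.PARTIAL_LINK_TEXT
By CSS selector: By.CSS_SELECTOR
By XPath: By.XPATH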

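Getting an element's attributes, text, size, and location

tag is an element returned by find_element

Attribute: tag.get_attribute('src')
Text: tag.text
Size: tag.size
Location: tag.location
id (Selenium's internal element id, not the HTML id attribute; it can be ignored): tag.id
Tag name: tag.tag_name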
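
The properties above can be gathered in one place when inspecting what a lookup returned. A small sketch; `describe_tag` is a hypothetical helper, and `src` is only a default that suits img elements.

```python
def describe_tag(tag, attribute="src"):
    """Collect the commonly inspected properties of a located element."""
    return {
        "tag_name": tag.tag_name,
        "text": tag.text,
        "attribute_value": tag.get_attribute(attribute),  # None if absent
        "size": tag.size,          # dict with 'height' and 'width'
        "location": tag.location,  # dict with 'x' and 'y'
    }
```

For example, `describe_tag(bro.find_element(By.TAG_NAME, 'img'))` would report the first image's source URL together with its size and position on the page.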

Find all div elements on the page

divs=bro.find_elements(By.TAG_NAME,'div')

Find by class name

div=bro.find_element(By.CLASS_NAME,'postDesc').text

Find by CSS selector

div=bro.find_element(By.CSS_SELECTOR,'div.postDesc').text

# The div with class postDesc that is a child of a div directly under the element whose id is topics
div=bro.find_element(By.CSS_SELECTOR,'#topics > div > div.postDesc').text
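
The single-element lookups above return the first match and raise NoSuchElementException when nothing matches. When a page may contain zero or many matches, `find_elements` is the safer call because it returns a list, possibly empty. A sketch, with `collect_texts` as a hypothetical helper name:

```python
def collect_texts(bro, by, value):
    """Return the stripped text of every element matching the locator."""
    return [tag.text.strip() for tag in bro.find_elements(by, value)]
```

With the page from the headless Chrome example loaded, `collect_texts(bro, By.CSS_SELECTOR, 'div.postDesc')` would return the text of each matching div, or an empty list if the class is not present.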


Original article: https://blog.csdn.net/qq_44779250/article/details/134294505