3 - Web scraping: searching the document tree (find and find_all), other bs4 usage, CSS selectors, Selenium basics, and more Selenium (headless browser, locating tags)

1 Searching the document tree
 1.1 find and find_all
 1.2 Scraping images
2 Other bs4 usage
3 CSS selectors

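Review: traversing the document tree

-1 Using a proxy with requests
	proxies = {
      'https': '192.168.1.12:8090',
	}
    
-2 Kinds of proxies
	-Anonymity levels: high-anonymity vs. transparent
	-Free ---> scrape free proxy lists ---> open-source proxy pools
    	-https://www.zdaye.com/free/  ---> validate each proxy before using it
    -Paid
    
-3 Getting the visitor's IP in Django ---> public IP
	-If the Django service runs on an internal network, visits from inside the LAN work fine
    -A request that goes out through the public network cannot find its way back to the internal address
    -Use NAT traversal (an intranet tunneling tool) to make the service reachable
    -Public network vs. internal network
    
    
-4 Scraping a video site
	-1 Fetch the list of videos ---> work out the URL pattern ---> regular expressions
    -2 Parse out each video id and video URL
    -3 Send the Referer header
    -4 If a video will not play ---> compare a URL that plays with one that does not
    
    
-5 Scraping news    
	-requests + bs4
    -find_all
    -find
    
-6 bs4 introduction and usage
	-A parsing library ---> HTML and XML
    -Specify a parser: lxml or html.parser
    
-7 Traversing the document tree
	-soup=BeautifulSoup()
    -soup.body.title returns a Tag object
    -The BeautifulSoup class inherits from Tag, so every tag you get back supports the same traversal, attribute access and text access as the soup object itself
	- Dot access finds a tag, but only the first match
    - Chain it: .tag.tag
    - Tag name: soup.body.name
    - Tag attribute: soup.tag.attrs.get('attribute name')
    	The class attribute comes back as a list
    - Tag text:
    	-text: the text of all descendants joined together
        -string: the tag's own text, available only when that text is the tag's only content; otherwise None
        -strings: a generator over the text of all descendants
        
    -Child nodes
    -Sibling nodes
    -Parent nodes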
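
The difference between text, string and strings in item 7 is easier to see on a concrete tag. The following is a minimal sketch, assuming bs4 is installed; the markup is made up for illustration and is not from the lesson.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, chosen so that each accessor gives a different result.
html = (
    '<html><body>'
    '<p id="p1" class="title story"><b>Hello</b><span>lqz</span></p>'
    '</body></html>'
)
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.body.p                  # dot access returns the first matching Tag
tag_name = first_p.name                # the tag name as a string
classes = first_p.attrs.get('class')   # class is multi-valued, so it is a list
p_id = first_p.attrs.get('id')         # ordinary attributes are plain strings

all_text = first_p.text                # every descendant string joined together
own_string = first_p.string            # None here: the tag has more than one child
pieces = list(first_p.strings)         # generator over each descendant string
inner = first_p.b.string               # a tag with a single text child has .string

print(tag_name, classes, p_id, all_text, own_string, pieces, inner)
```

Because first_p is itself a Tag, the same accessors work on it as on soup.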
    

1 Searching the document tree

# 1 find_all: finds every match and returns a list
# 2 find: finds the first match and returns a Tag object


1.1 find and find_all

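import re

from bs4 import BeautifulSoup

# Sample markup modeled on the "three sisters" example from the bs4 documentation
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b><span>lqz</span></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Five kinds of filter: string, regular expression, list, True, function

#### String
# - You can match on tag name, on an attribute, or on text content
# - In each case the value is compared as a plain string

# p=soup.find('p')
# Find the p tag whose class is story
# p=soup.find(name='p',class_='story')
#### Match on tag name plus text, or on an attribute alone
# (recent bs4 releases prefer string= over the older text= keyword)
# obj=soup.find(name='span',text='lqz')
# obj=soup.find(href='http://example.com/tillie')

# Attributes can also be passed as a dict
# obj=soup.find(attrs={'class':'title'})
# print(obj)

#### Regular expression: tag name, attribute or text is matched against the pattern
# Tags whose name starts with b, tags whose name ends with y,
# tags whose href starts with http:, strings that contain i
# obj=soup.find_all(name=re.compile('^b'))
# obj=soup.find_all(name=re.compile('y$'))
# obj=soup.find_all(href=re.compile('^http:'))
# obj=soup.find_all(text=re.compile('i'))
# print(obj)


### List: a tag matches if it matches any item in the list
# obj=soup.find_all(name=['p','a'])
# obj = soup.find_all(class_=['sister', 'title'])
# print(obj)


### True: matches any tag that has the attribute, whatever its value
# obj=soup.find_all(id=True)
# obj=soup.find_all(href=True)
# obj=soup.find_all(name='img',src=True)
# print(obj)


### Function: called once per tag; the tag is kept when the function returns True
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(name=has_class_but_no_id))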
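
One practical consequence of the two return types is that find gives None when nothing matches, while find_all gives an empty list. The sketch below illustrates this on made-up markup; the helper name has_href_but_no_id is hypothetical and simply mirrors the function filter above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<p class="story"><a id="link1" href="/a">A</a><a href="/b">B</a></p>'
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')          # list-like result, possibly empty
first = soup.find('a')              # a single Tag
missing = soup.find('img')          # None when nothing matches
none_found = soup.find_all('img')   # empty list when nothing matches

# Guard against None before touching attributes of a find() result.
src = missing.attrs.get('src') if missing is not None else None


def has_href_but_no_id(tag):
    """Function filter: keep tags that have an href but no id."""
    return tag.has_attr('href') and not tag.has_attr('id')


filtered = [a.text for a in soup.find_all(has_href_but_no_id)]
print(len(links), first.text, missing, none_found, filtered)
```

Checking for None first avoids an AttributeError when a page does not contain the tag you expected.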

1.2 Scraping images

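import os

import requests
from bs4 import BeautifulSoup

res = requests.get('https://pic.netbian.com/tupian/32518.html')
res.encoding = 'gbk'  # decode the page as GBK
# print(res.text)

soup = BeautifulSoup(res.text, 'html.parser')
ul = soup.find('ul', class_='clearfix')
img_list = ul.find_all(name='img', src=True)
os.makedirs('./img', exist_ok=True)  # open() below fails if the folder is missing
for img in img_list:
    try:
        url = img.attrs.get('src')
        # Relative paths need the site prefix
        if not url.startswith('http'):
            url = 'https://pic.netbian.com' + url
        print(url)
        res1 = requests.get(url)
        # Use the part after the last '-' as the file name
        name = url.split('-')[-1]
        with open('./img/%s' % name, 'wb') as f:
            for chunk in res1.iter_content(chunk_size=1024):
                f.write(chunk)
    except Exception as e:
        # Report the failure and move on to the next image
        print('failed:', e)
        continue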
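
Splitting the URL on '-' only gives a usable file name when the URL happens to contain a dash. A more general approach, sketched here with the standard library only, is to resolve the src against the page URL and take the last path segment. The function names and the sample image path are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def absolute_url(page_url, src):
    """Resolve an img src (relative or absolute) against the page it came from."""
    return urljoin(page_url, src)


def filename_from_url(url, default='image.jpg'):
    """Use the last path segment as the file name, ignoring any query string."""
    name = posixpath.basename(urlparse(url).path)
    return name or default


page = 'https://pic.netbian.com/tupian/32518.html'
full = absolute_url(page, '/uploads/allimg/demo-1a2b.jpg')  # hypothetical path
print(full, filename_from_url(full))
```

urljoin leaves a src that is already absolute unchanged, so the startswith('http') check becomes unnecessary.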


2 Other bs4 usage

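# 1 Besides traversing and searching the document tree, bs4 can also modify XML
	-Java configuration files are commonly written in XML
    -.conf
    -.ini
    -.yaml
    -.xml
    
# 2 Other find_all parameters
	-limit=n   caps the number of results; limit=1 returns a single match, still inside a list
    -recursive   when False, only direct children are searched (the default is True)
    
# 3 Searching and traversing can be mixed; reading attributes and text works the same way as before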
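
The three points above can be tried on a small fragment. This is a sketch on made-up markup, assuming bs4 is installed; it shows limit, recursive, mixing dot access with find_all, and a simple in-place modification.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two direct <p> children and one nested <p>.
html = '<div><p>a</p><section><p>b</p></section><p>c</p></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

everything = [p.text for p in div.find_all('p')]                 # all descendants
first_only = [p.text for p in div.find_all('p', limit=1)]        # stop after one match
direct = [p.text for p in div.find_all('p', recursive=False)]    # direct children only

# Mixing traversal and search: dot access first, then find_all.
nested = [p.text for p in soup.div.section.find_all('p')]

# Modifying the tree: replace the text and set an attribute on the first <p>.
div.p.string = 'changed'
div.p['class'] = 'note'
modified = str(div.p)

print(everything, first_only, direct, nested, modified)
```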

3 CSS selectors

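# id selector
	 #id
# tag selector
	tag name
# class selector
	.class name
    
# The ones to remember:
	#id
    .sister
    head
    div>a  # a tags that are direct children of a div
    div a  # a tags that are descendants of a div at any depth
 

# Once you know CSS selectors, you can use them in any parsing library that supports them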
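
Each selector in the list above maps directly onto a call to select. The sketch below runs them against made-up markup, assuming a bs4 version that ships select and select_one (4.7 or later, via soupsieve).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one direct <a> child of the div and one nested <a>.
html = (
    '<head><title>demo</title></head>'
    '<div id="box">'
    '<a class="sister" href="/1">one</a>'
    '<p><a href="/2">two</a></p>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.select('#box')                               # id selector
by_class = [a.text for a in soup.select('.sister')]       # class selector
by_tag = soup.select('head')                              # tag selector
children = [a.text for a in soup.select('div > a')]       # direct children only
descendants = [a.text for a in soup.select('div a')]      # any depth
first_href = soup.select_one('a[href="/2"]').attrs.get('href')  # attribute selector

print(len(by_id), by_class, len(by_tag), children, descendants, first_href)
```

select always returns a list, while select_one returns the first match or None.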


import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.cnblogs.com/liuqingzheng/p/16005896.html')
# print(res.text)
soup = BeautifulSoup(res.text, 'html.parser')
# a=soup.find(name='a',title='下载哔哩哔哩视频')
# print(a.attrs.get('href'))

# p=soup.select('#cnblogs_post_body p:nth-child(2) a:nth-child(5)')[0].attrs.get('href')
# p=soup.select('#cnblogs_post_body > p:nth-child(2) > a:nth-child(5)')[0].attrs.get('href')  # selector copied straight from the browser's dev tools
p=soup.select('a[title="下载哔哩哔哩视频"]')[0].attrs.get('href')  # or match on the title attribute
print(p)

4 Selenium basics

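# This module can send requests, parse the result, and execute JavaScript
# Selenium started out as an automated testing tool; scrapers use it mainly because requests cannot execute JavaScript

# Selenium is used for automated testing of web applications
# Appium is used for automated testing of mobile apps

# Selenium drives a real browser and imitates what a person would do


# How to use it
	1 Download the browser driver: https://registry.npmmirror.com/binary.html?path=chromedriver/
        https://googlechromelabs.github.io/chrome-for-testing/
        https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/win64/chromedriver-win64.zip
    	The driver must match the browser and its exact version
        IE, Firefox, Chrome: Chrome is used here
        Chrome has many versions, and each one needs the matching driver
    2 Install selenium: pip install selenium
    3 Write Python code that drives the browser
    import time
    from selenium import webdriver
    # Open Chrome the way a person would and get a browser object
    bro=webdriver.Chrome()
    # Enter the address in the address bar
    bro.get('https://www.baidu.com')
    time.sleep(5)
    bro.close()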
	


4.1 Simulating a login

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

bro = webdriver.Chrome()
bro.get('https://www.baidu.com')
bro.implicitly_wait(10)  # implicit wait: if a tag is not found yet, keep retrying for up to 10 seconds
# Maximize the window
bro.maximize_window()
# print(bro.page_source) # HTML of the current page
# Find the login button; any of these locator strategies works
# a_login=bro.find_element(by=By.NAME,value='tj_login')
# a_login=bro.find_element(by=By.ID,value='s-top-loginbtn')
a_login = bro.find_element(by=By.LINK_TEXT, value='登录')  # text of the a tag's link
time.sleep(2)
# Click it
a_login.click()

# Find "SMS login" and click it
sms_login = bro.find_element(by=By.ID, value='TANGRAM__PSP_11__changeSmsCodeItem')
sms_login.click()
time.sleep(1)
# Switch back to username/password login
user_login = bro.find_element(by=By.ID, value='TANGRAM__PSP_11__changePwdCodeItem')
user_login.click()
time.sleep(1)
username = bro.find_element(by=By.NAME, value='userName')
# Type into the input box
username.send_keys('lqz@qq.com')
password = bro.find_element(by=By.ID, value='TANGRAM__PSP_11__password')
# Type into the input box
password.send_keys('lqz@qq.com')

# Tick the agreement checkbox
agree = bro.find_element(By.ID, 'TANGRAM__PSP_11__isAgree')
agree.click()
time.sleep(1)

# Submit the form
submit = bro.find_element(By.ID, 'TANGRAM__PSP_11__submit')
submit.click()

time.sleep(3)
bro.close()


5 More Selenium usage

5.1 Headless browser

# A scraper only needs the data, so the browser window does not have to be shown ---> hide the browser's graphical interface


import time

from selenium import webdriver
from selenium.webdriver.common.by import By


from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('blink-settings=imagesEnabled=false') # do not load images, which speeds things up
chrome_options.add_argument('--headless') # run without a visible window; on Linux without a display, startup fails unless this is set
bro = webdriver.Chrome(options=chrome_options)


bro.get('https://www.cnblogs.com/liuqingzheng/p/16005896.html')

print(bro.page_source)
time.sleep(3)
bro.close()


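5.2 Locating tags

1 Locating tags
By.ID  # find by id
By.NAME  # find by the name attribute
By.TAG_NAME  # find by tag name
By.CLASS_NAME # find by class name
By.LINK_TEXT # exact text of an a tag
By.PARTIAL_LINK_TEXT # partial text of an a tag

--------- the locators above are Selenium's own; the two below use general selector syntaxes --------
By.CSS_SELECTOR # find by CSS selector
By.XPATH  # find by XPath


2 Reading a tag's attributes, text, size and position
print(tag.get_attribute('src'))
print(tag.id)  # Selenium's internal element id, not the HTML id attribute; can be ignored
print(tag.location)
print(tag.tag_name)
print(tag.size)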


import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By


from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('blink-settings=imagesEnabled=false') # do not load images, which speeds things up
chrome_options.add_argument('--headless') # run without a visible window; on Linux without a display, startup fails unless this is set
bro = webdriver.Chrome(options=chrome_options)


bro.get('https://www.cnblogs.com/liuqingzheng/p/16005896.html')

#### Parsing page_source with bs4 works but is not recommended; use the lookups Selenium provides
# soup=BeautifulSoup(bro.page_source,'html.parser')
# print(soup.find(title='下载哔哩哔哩视频').attrs.get('href'))


# Lookups provided by Selenium
# By.ID  # find by id
# By.NAME  # find by the name attribute
# By.TAG_NAME  # find by tag name
# By.CLASS_NAME # find by class name
# By.LINK_TEXT # exact text of an a tag
# By.PARTIAL_LINK_TEXT # partial text of an a tag
#--------- the locators above are Selenium's own; the two below use general selector syntaxes --------
# By.CSS_SELECTOR # find by CSS selector
# By.XPATH  # find by XPath

#### After finding a tag, read its attributes, text, position, size and so on
# print(tag.get_attribute('src'))
# print(tag.id)  # Selenium's internal element id, not the HTML id attribute; can be ignored
# print(tag.location)
# print(tag.tag_name)
# print(tag.size)
div=bro.find_element(By.ID,'cnblogs_post_body')
# res=div.get_attribute('class')   # read an attribute of the tag
print(div.get_attribute('class'))
print(div.id)  # Selenium's internal element id, not the HTML id attribute; can be ignored
print(div.location) # position on the page: {'x': ..., 'y': ...}
print(div.tag_name) # tag name
print(div.size) # size of the tag: {'height': ..., 'width': ...}
print(div.text) # text content


## Find every div on the page
# divs=bro.find_elements(By.TAG_NAME,'div')
# print(len(divs))

# Find by class name
# div=bro.find_element(By.CLASS_NAME,'postDesc').text
# print(div)


# Find by CSS selector
# div=bro.find_element(By.CSS_SELECTOR,'div.postDesc').text
# div=bro.find_element(By.CSS_SELECTOR,'#topics > div > div.postDesc').text
# print(div)

# Find by XPath; XPath syntax is a topic of its own
# div=bro.find_element(By.XPATH,'//*[@id="topics"]/div/div[3]').text
# print(div)
# print(div)


time.sleep(1)
bro.close()
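
The location and size values printed above are plain dicts, so they can be combined without any further Selenium calls, for example to work out the rectangle an element occupies on the page. The sketch below is an assumption-labeled illustration: the helper names are hypothetical, and the sample numbers stand in for what div.location and div.size would return.

```python
def element_box(location, size):
    """Return (left, top, right, bottom) in page pixels for an element."""
    left = location['x']
    top = location['y']
    return (left, top, left + size['width'], top + size['height'])


def element_center(location, size):
    """Return the (x, y) midpoint of the element's rectangle."""
    left, top, right, bottom = element_box(location, size)
    return ((left + right) / 2, (top + bottom) / 2)


# Sample values shaped like Selenium's return values (hypothetical numbers).
location = {'x': 100, 'y': 40}
size = {'height': 30, 'width': 120}

box = element_box(location, size)
center = element_center(location, size)
print(box, center)
```

A rectangle in this form could be used, for instance, to crop the element out of a full-page screenshot.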

Original article: https://blog.csdn.net/weixin_44145338/article/details/134245898
