Web Scraping: Parsing Data with Bs4

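I. Introduction

Bs4 (beautifulsoup4) is a Python library for extracting data from HTML and XML files.

Bs4 vs. XPath

XPath: locates data by its path in the document tree.

Bs4: retrieves data through ready-made methods and attributes on the parsed objects.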

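II. Usage

Install the third-party libraries:

pip install beautifulsoup4

pip install lxml

# Import the library
from bs4 import BeautifulSoup
# Create the object: the first argument is the page source, the second names the parser
soup = BeautifulSoup(html_doc, 'lxml')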
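
As a small sketch of the parser argument: Bs4 raises FeatureNotFound when the requested parser is not installed, so one option is to fall back to Python's built-in html.parser. The helper name make_soup and the sample markup are hypothetical and do not come from the original post.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when it is installed, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is missing; html.parser ships with Python itself
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)  # hello
```

The two parsers can build slightly different trees for malformed markup, so it is worth fixing one parser per project once the choice is made.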

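III. Bs4 Object Types

# Import the library
from bs4 import BeautifulSoup

# Sample data
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# features specifies the parser
soup = BeautifulSoup(html_doc, features='lxml')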

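1. Tag: a tag

# soup.title looks up the first title tag
print(soup.title)  # <title>The Dormouse's story</title>
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.p)  # <p class="title"><b>The Dormouse's story</b></p>
print(soup.a)  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>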

2. NavigableString: a navigable string

# The text content inside a tag
title_tag = soup.title
print(title_tag.string)  # The Dormouse's story
print(type(title_tag.string))  # <class 'bs4.element.NavigableString'>

3. BeautifulSoup: the object that represents the whole document

print(type(soup))  # <class 'bs4.BeautifulSoup'>

4. Comment: a comment

html = '<b><!--Keep studying Python--></b>'
soup2 = BeautifulSoup(html, "lxml")
print(soup2.b.string)  # Keep studying Python
print(type(soup2.b.string))  # <class 'bs4.element.Comment'>
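
Because Comment is a subclass of NavigableString, a comment can look like ordinary text when it is printed. The sketch below shows one way to tell the two apart with isinstance. The markup and the variable names are made up for illustration, and html.parser is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>Hello<!-- hidden note -->world</p>"
demo = BeautifulSoup(markup, "html.parser")

# .contents holds the direct children: two text nodes and one comment
nodes = demo.p.contents
visible = [str(n) for n in nodes if not isinstance(n, Comment)]
comments = [str(n) for n in nodes if isinstance(n, Comment)]

print(visible)   # ['Hello', 'world']
print(comments)  # [' hidden note ']
```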

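IV. Traversing the Document Tree

# Import the library
from bs4 import BeautifulSoup

# Sample data
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and the
ir names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# features specifies the parser
soup = BeautifulSoup(html_doc, features='lxml')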

1. Traversing child nodes

1) contents: returns a list of all direct child nodes

print(soup.head.contents)  # [<title>The Dormouse's story</title>]

2) children: returns an iterator over the direct child nodes

# Loop over the iterator to get its contents
for i in soup.head.children:
    print(i)  # <title>The Dormouse's story</title>

3) descendants: returns a generator that walks every descendant, not only the direct children

for i in soup.head.descendants:
    print(i)
# <title>The Dormouse's story</title>
# The Dormouse's story
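
To make the difference between contents and descendants concrete, here is a minimal sketch on a small, made-up fragment (the markup and variable names are illustrative only). contents stops at the direct children, while descendants also visits the nested tags and their text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = "<div><p>one <b>two</b></p><p>three</p></div>"
demo = BeautifulSoup(markup, "html.parser")

direct = demo.div.contents            # only the two <p> tags
nested = list(demo.div.descendants)   # tags and text nodes at every depth
tag_names = [node.name for node in nested if isinstance(node, Tag)]

print(len(direct))  # 2
print(len(nested))  # 6
print(tag_names)    # ['p', 'b', 'p']
```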

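2. Getting node content

1) string: gets the text inside a tag

# Works only when the tag holds a single string; otherwise it returns None
print(soup.title.string)  # The Dormouse's story
print(soup.head.string)  # The Dormouse's story
print(soup.html.string)  # None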

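2) strings: returns a generator that yields the text of every nested tag

# Returns a generator
print(soup.html.strings)  # <generator object ...>
# Loop over the generator to get its contents; strings also yields the newlines between tags
for i in soup.html.strings:
    print(i)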
# The Dormouse's story
#
#
#
#
# The Dormouse's story
#
#
# Once upon a time there were three little sisters; and the
# ir names were
#
# Elsie
# ,
#
# Lacie
#  and
#
# Tillie
# ;
# and they lived at the bottom of a well.
#
#
# ...
#
#

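3) stripped_strings: works like strings, but strips surrounding whitespace and skips whitespace-only strings

print(soup.html.stripped_strings)  # <generator object ...>
# Loop over the generator to get its contents
for i in soup.html.stripped_strings:
    print(i)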
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and the
# ir names were
# Elsie
# ,
# Lacie
# and
# Tillie
# ;
# and they lived at the bottom of a well.
# ...
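
The three attributes can be compared side by side on one tag. The sketch below uses a made-up one-line fragment; string returns None as soon as a tag has more than one child, which is when strings or stripped_strings become useful.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<p>  Hello, <b>world</b>!  </p>", "html.parser")
p = demo.p

print(p.string)                       # None: <p> has three children
print(list(p.strings))                # ['  Hello, ', 'world', '!  ']
print(list(p.stripped_strings))       # ['Hello,', 'world', '!']
print(" ".join(p.stripped_strings))   # Hello, world !
```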

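3. Traversing parent nodes

1) parent: gets the direct parent node

print(soup.title.parent)  # <head><title>The Dormouse's story</title></head>
# In Bs4, the parent of the html tag is the BeautifulSoup object
print(soup.html.parent)
# <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="title"><b>The Dormouse's story</b></p>
# <p class="story">Once upon a time there were three little sisters; and the
# ir names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p>
# </body></html>
print(type(soup.html.parent))  # <class 'bs4.BeautifulSoup'>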

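2) parents: gets all ancestor nodes

# parents yields every ancestor, from the nearest one outward
print(soup.a.parents)  # <generator object ...>
for i in soup.a.parents:
    # soup.name is always '[document]', the name of the BeautifulSoup object itself
    print(i.name, soup.name)
# p [document]
# body [document]
# html [document]
# [document] [document]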
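
One practical use of parents is to build the path from the document root down to a tag. This is only a sketch: the helper name tag_path and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<html><body><div><p><a>x</a></p></div></body></html>", "html.parser"
)


def tag_path(tag):
    """Join the names of all ancestors, outermost first, with the tag itself."""
    names = [parent.name for parent in tag.parents]
    names.reverse()          # parents runs from nearest to farthest
    names.append(tag.name)
    return "/".join(names)


print(tag_path(demo.a))  # [document]/html/body/div/p/a
```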

4. Traversing sibling nodes

# Import the library
from bs4 import BeautifulSoup

# Sample data
html_doc = '''
<a><b>bbb</b><c>ccc</c><d>ddd</d></a>
'''

# features specifies the parser
soup = BeautifulSoup(html_doc, features='lxml')

1) next_sibling: the next sibling node

# The node immediately after this one
print(soup.b.next_sibling)  # <c>ccc</c>

2) previous_sibling: the previous sibling node

print(soup.c.previous_sibling)  # <b>bbb</b>

3) next_siblings: all sibling nodes that follow

for i in soup.b.next_siblings:
    print(i)
# <c>ccc</c>
# <d>ddd</d>

4) previous_siblings: all sibling nodes that come before

for i in soup.d.previous_siblings:
    print(i)
# <c>ccc</c>
# <b>bbb</b>
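
In real pages, tags are usually separated by line breaks, and those line breaks are text nodes too. The sketch below (made-up markup) shows that next_sibling can then return a whitespace string, while find_next_sibling() skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n</ul>", "html.parser")
first = demo.li

# The immediate sibling is the newline between the two <li> tags
print(first.next_sibling == "\n")      # True

# find_next_sibling looks only at tags, so the newline is skipped
second = first.find_next_sibling("li")
print(second)                          # <li>b</li>
```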

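V. Common Methods

# Import the library
from bs4 import BeautifulSoup

# Sample data (the href values and the even/odd row classes are placeholders)
html_doc = """
<table>
    <tbody>
        <tr>
            <td>Position</td>
            <td>Category</td>
            <td>Openings</td>
            <td>Location</td>
            <td>Date Posted</td>
        </tr>
        <tr class="even">
            <td><a href="https://example.com/jobs/1">22989-Financial Cloud Blockchain Senior R&D Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td><a href="https://example.com/jobs/2">22989-Financial Cloud Senior Backend Developer</a></td>
            <td>Technical</td>
            <td>2</td>
            <td>Shenzhen</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td><a href="https://example.com/jobs/3">SNG16-Tencent Music Operations Development Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>2</td>
            <td>Shenzhen</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td><a href="https://example.com/jobs/4">SNG16-Tencent Music Business Operations Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td><a href="https://example.com/jobs/5">TEG03-Senior R&D Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td><a href="https://example.com/jobs/6">TEG03-Senior Image Algorithm R&D Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td><a href="https://example.com/jobs/7">TEG11-Senior AI Development Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>4</td>
            <td>Shenzhen</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td><a href="https://example.com/jobs/8">15851-Backend Development Engineer</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td><a href="https://example.com/jobs/9">15851-Backend Development Engineer</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td><a href="https://example.com/jobs/10">SNG11-Senior Business Operations Engineer (Shenzhen)</a></td>
            <td>Technical</td>
            <td>1</td>
            <td>Shenzhen</td>
            <td>2017-11-24</td>
        </tr>
    </tbody>
</table>
"""

# features specifies the parser
soup = BeautifulSoup(html_doc, features='lxml')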

1. find_all()

Returns every matching tag as a list.

2. find()

Returns the first matching tag.

# 1. Get the first tr tag
tr = soup.find('tr')  # find returns only the first match
print(tr)

# 2. Get all tr tags
trs = soup.find_all('tr')  # returns a list with one entry per tr
print(trs)

# 3. Get the second tr tag
tr2 = soup.find_all('tr')[1]  # index into the returned list
print(tr2)

# 4. Get the tags with class="odd"
# Option 1: class is a Python keyword, so the argument is spelled class_
odd = soup.find_all('tr', class_='odd')
for row in odd:
    print(row)
# Option 2: pass the attribute in a dict
odd2 = soup.find_all('tr', attrs={'class': 'odd'})
for row in odd2:
    print(row)

# 5. Get the href attribute of every a tag
lst = soup.find_all('a')
for a in lst:
    print(a['href'])

# 6. Get all the job titles
lst_data = soup.find_all('a')
for a in lst_data:
    print(a.string)
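
The lookups above can be combined to turn each table row into a record. The following is a sketch under stated assumptions: the three-row table, its hrefs, and the field names are invented for illustration and are not the data from this article.

```python
from bs4 import BeautifulSoup

markup = """
<table>
  <tr class="even"><td><a href="/jobs/1">Engineer A</a></td><td>2</td></tr>
  <tr class="odd"><td><a href="/jobs/2">Engineer B</a></td><td>1</td></tr>
  <tr class="even"><td><a href="/jobs/3">Engineer C</a></td><td>4</td></tr>
</table>
"""
demo = BeautifulSoup(markup, "html.parser")

jobs = []
for row in demo.find_all("tr"):
    link = row.find("a")
    cells = row.find_all("td")
    jobs.append({
        "title": link.get_text(strip=True),
        # .get() returns None for a missing attribute instead of raising KeyError
        "href": link.get("href"),
        "openings": int(cells[1].get_text(strip=True)),
    })

# limit stops the search after the given number of matches
first_two = demo.find_all("tr", limit=2)

print(jobs[1])         # {'title': 'Engineer B', 'href': '/jobs/2', 'openings': 1}
print(len(first_two))  # 2
```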

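VI. CSS Selectors

# Import the library
from bs4 import BeautifulSoup

# Sample data
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# features specifies the parser
soup = BeautifulSoup(html_doc, features='lxml')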

Selector syntax reference: http://www.w3cmap.com/cssref/css-selectors.html

# Get the elements whose class is sister
print(soup.select('.sister'))

# Get the element whose id is link1
print(soup.select('#link1'))

# Get the text inside the title tag
print(soup.select('title')[0].string)

# Get the a tags nested anywhere under a p tag
print(soup.select('p a'))
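
A few more selector forms are often useful alongside the ones above: select_one() for a single match, the child combinator >, and attribute selectors. This is a sketch on invented markup, so the ids and hrefs carry no meaning beyond the example.

```python
from bs4 import BeautifulSoup

markup = (
    '<p class="story">'
    '<a class="sister" href="/a" id="link1">A</a>'
    '<span><a class="sister" id="link2">B</a></span>'
    "</p>"
)
demo = BeautifulSoup(markup, "html.parser")

first = demo.select_one(".sister")   # first match, or None when nothing matches
nested = demo.select("p a")          # a tags at any depth under <p>
direct = demo.select("p > a")        # only a tags that are direct children of <p>
with_href = demo.select("a[href]")   # a tags that have an href attribute

print(first["id"])                     # link1
print([a["id"] for a in nested])       # ['link1', 'link2']
print([a["id"] for a in direct])       # ['link1']
print([a["id"] for a in with_href])    # ['link1']
```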

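VII. Case Study

Target site: http://www.weather.com.cn/textFC/hb.shtml

Goal: collect the weather for the whole country, including each city's name and its lowest temperature, and save the data to a CSV file.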

# Imports (html5lib is a separate install: pip install html5lib)
import requests
from bs4 import BeautifulSoup
import csv

# Rows collected for the CSV table
lst = []

# Fetch the page source
def get_html(url):
    # Send the request
    response = requests.get(url, timeout=10)
    # The text comes back garbled by default, so set the encoding explicitly
    response.encoding = 'utf-8'
    # Return the page source to the caller
    return response.text

# Parse the page data
def parse_html(html):
    # Create the object
    soup = BeautifulSoup(html, 'html5lib')
    # Locate the container that holds the forecast tables
    conMidtab = soup.find('div', class_='conMidtab')
    tables = conMidtab.find_all('table')
    for table in tables:
        # Skip the first two rows of each table
        trs = table.find_all('tr')[2:]
        for index, tr in enumerate(trs):
            dic = {}
            tds = tr.find_all('td')
            if index == 0:
                # First city row: the city name is in the second cell
                city_td = tds[1]
            else:
                # Other city rows: the city name is in the first cell
                city_td = tds[0]
            temp_td = tds[-2]
            # Pull the text out of the selected cells
            dic['city'] = list(city_td.stripped_strings)[0]
            dic['temp'] = temp_td.string
            lst.append(dic)

# Save the data
def save_data():
    # Column headers
    head = ('city', 'temp')
    # Write the CSV file
    with open('weather.csv', 'w', encoding='utf-8-sig', newline='') as f:
        # Create the csv writer
        writer = csv.DictWriter(f, fieldnames=head)
        # Write the header row
        writer.writeheader()
        # Write the data rows
        writer.writerows(lst)

# Collect the URL of each region's page
def area(link):
    # Fetch the page source
    html = get_html(link)
    # Create the object
    soup = BeautifulSoup(html, 'html5lib')
    # Locate the list of region tabs
    tabs = soup.find('ul', class_='lq_contentboxTab2')
    # Find the a links
    tagas = tabs.find_all('a')
    # Build the list of absolute URLs
    hrefs = []
    for i in tagas:
        hrefs.append('http://www.weather.com.cn' + i.get('href'))
    return hrefs

# Main logic
def main():
    # Starting URL
    link = 'http://www.weather.com.cn/textFC/hb.shtml'
    # URLs of the regional pages (named urls so it does not shadow the global lst)
    urls = area(link)
    for url in urls:
        # Fetch the page source
        html = get_html(url)
        # Parse the data
        parse_html(html)
    # Save the results
    save_data()

# Run the program
if __name__ == '__main__':
    main()
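
The parse-then-save flow can also be tried without any network access. The sketch below is an assumption-laden miniature, not the real page: the table layout, city rows, and temperature values are invented sample data, and the CSV is written to an in-memory buffer instead of weather.csv.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for one forecast table; the values are sample data only
markup = """
<table>
  <tr><td>City</td><td>Low</td></tr>
  <tr><td><a href="#">Beijing</a></td><td>12</td></tr>
  <tr><td><a href="#">Tianjin</a></td><td>14</td></tr>
</table>
"""
demo = BeautifulSoup(markup, "html.parser")

rows = []
for tr in demo.find_all("tr")[1:]:          # skip the header row
    tds = tr.find_all("td")
    rows.append({
        "city": next(tds[0].stripped_strings),   # first non-blank text in the cell
        "temp": tds[-1].string,
    })

# Write to memory so the example leaves no file behind
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=("city", "temp"))
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue().splitlines())  # ['city,temp', 'Beijing,12', 'Tianjin,14']
```

Keeping the parsing step separate from the download step, as here, makes it possible to check the selectors against a saved copy of a page before running the full crawl.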

These are my study notes. Discussion is welcome. Please respect original work and credit the source when reposting.

Original article: https://blog.csdn.net/muyuhen/article/details/132827449
