Python Web Scraping: Using Common BeautifulSoup Functions


1. The find_all() function

find_all() (commonly used) searches the descendants of the current tag, tests each node against the filter conditions, and returns every match as a list.

Syntax:

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

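Parameters:

name: the tag name to match, such as a or img. Text nodes (strings) are skipped automatically when a name is given.
attrs: a dictionary of attribute names and values, used to match tags that carry specific attributes.
recursive: whether to search recursively. If True (the default), all descendants of the tag are searched; if False, only its direct children.
text: match on the text content of a tag rather than on its attributes. Beautiful Soup 4.4.0 and later call this argument string; text is the older name.
limit: the maximum number of results to return. find() behaves like find_all() with limit=1.
**kwargs: keyword arguments; each one selects tags whose attribute of that name has the given value.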
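
As a minimal sketch of how these parameters behave, the snippet below applies each of them to a small inline document. The HTML string and the variable names are made up for illustration, and html.parser is used instead of lxml only so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used only for illustration
html = """
<div id="box">
  <a class="s1" href="/one">first</a>
  <p><a class="s2" href="/two">second</a></p>
  <a class="s1" href="/three">third</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
box = soup.find("div")

# name: filter by tag name
all_links = box.find_all("a")
# attrs: filter by a dictionary of attribute values
s1_links = box.find_all("a", attrs={"class": "s1"})
# recursive=False: only direct children of box are considered
direct_links = box.find_all("a", recursive=False)
# limit: stop after the first N matches
first_two = box.find_all("a", limit=2)
# keyword argument: filter on the attribute with that name
by_href = box.find_all("a", href="/two")
# string (called text in older releases): filter on the tag's text
by_text = box.find_all("a", string="second")

print(len(all_links), len(s1_links), len(direct_links))
print([a.get_text() for a in direct_links])
```

The link inside the p tag is a descendant of the div but not a direct child, which is why recursive=False leaves it out.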

The HTML file to parse:

<!DOCTYPE html>
<html lang="en">
  <head>
   <title>
   </title>
  </head>
  <body>
   <p class="title"/>

   <a href="http://localhost:8080"/>
   <p class="story">
        <a class="s1" href="www.baidu.com" id="l1"></a>
        <a class="s2" href="" id="l2"></a>
        <a class="s3" href="" id="l3"></a>
       <span> span </span>
   </p>
  </body>
</html>

Usage examples:

Instantiate a BeautifulSoup object with the lxml parser:

from bs4 import BeautifulSoup

# Use the lxml parser; the with block closes the file after parsing
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

Find all a tags:

print(soup.find_all('a'))

Find all a tags and span tags:

print(soup.find_all(['a', 'span']))

Find a tags, returning only the first 3 results:

print(soup.find_all("a", limit=3))

Find by attribute name and value:

print(soup.find_all("a", class_="s2"))
print(soup.find_all("a", id="l3"))
print(soup.find_all("a", id="id"))
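
The attribute filters above can be sketched on their own as follows. The markup and variable names are invented for illustration, html.parser stands in for lxml so that the snippet has no extra dependency, and the id=True form is an extra that does not appear in the examples above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = (
    '<a class="s2 big" href="" id="l2"></a>'
    '<a class="s3" href="" id="l3"></a>'
    '<a href="/no-id"></a>'
)
soup = BeautifulSoup(html, "html.parser")

# class_ matches if any one of the tag's classes equals the value
s2_links = soup.find_all("a", class_="s2")
# id="l3" requires that exact attribute value
l3_links = soup.find_all("a", id="l3")
# True matches every tag that has the attribute, whatever its value
with_id = soup.find_all("a", id=True)
# A value that no tag carries gives an empty list
no_links = soup.find_all("a", id="id")

print([a["id"] for a in s2_links])
print([a["id"] for a in with_id])
print(no_links)
```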

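Output:

First output: a list of all four a tags.
Second output: a list of the four a tags followed by the span tag.
Third output: a list of the first three a tags.
Fourth output: a list containing only the a tag with class "s2".
Fifth output: a list containing only the a tag with id "l3".
Sixth output: nothing matches, so an empty list is returned.
[]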

2. The find() function

find() searches the descendants of the current tag and returns the first result that matches the filter conditions.

Syntax:

find(name=None, attrs={}, recursive=True, text=None, **kwargs)

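Parameters:

name: the tag name to match, such as a or img. Text nodes (strings) are skipped automatically when a name is given.
attrs: a dictionary of attribute names and values, used to match tags that carry specific attributes.
recursive: whether to search recursively. If True (the default), all descendants of the tag are searched; if False, only its direct children.
text: match on the text content of a tag rather than on its attributes. Beautiful Soup 4.4.0 and later call this argument string; text is the older name.
**kwargs: keyword arguments; each one selects tags whose attribute of that name has the given value.
Because find() returns only one matching result, it has no limit parameter.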
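
The relationship between find() and find_all() with limit=1 can be checked with a short sketch. The inline HTML and the variable names are assumptions made for this example, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = '<p><a id="l1" href="/a">one</a><a id="l2" href="/b">two</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                   # a single Tag
limited = soup.find_all("a", limit=1)    # a list holding that same Tag

print(first["id"])
print(len(limited))
print(first is limited[0])
```

The only difference is the wrapper: find() hands back the tag itself, while find_all(limit=1) hands back a one-element list.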

The HTML file to parse:

<!DOCTYPE html>
<html lang="en">
    <head>
        <title>
        </title>
    </head>
    <body>
        <p class="title"/>
        <a href="http://localhost:8080"/>
        <p class="story">
            <a class="s1" href="www.baidu.com" id="l1"></a>
            <a class="s2" href="" id="l2"></a>
            <a class="s3" href="" id="l3"></a>
        </p>
        <p class="story"/>
    </body>
</html>

Usage examples:

Instantiate a BeautifulSoup object with the lxml parser:

from bs4 import BeautifulSoup
import re

# Use the lxml parser; the with block closes the file after parsing
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

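Get the first a tag:

print(soup.find('a'))

Find an a tag by its href value; None is returned when nothing matches:

print(soup.find('a', href='www.baidu.com'))
print(soup.find('a', href='www.alibaba.com'))

Use an attribute selector to find a tags that have a class attribute (select() returns a list):

print(soup.select('a[class]'))

Find an a tag by its class value. class is a Python keyword, so the argument is written class_ with a trailing underscore:

print(soup.find('a', class_='s2'))

Match an attribute value with a regular expression:

print(soup.find(class_=re.compile('le')))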
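
A regular expression filter matches wherever the pattern is found inside the attribute value, so 'le' matches the class "title". The sketch below illustrates this and adds an anchored pattern for contrast; the markup, the second pattern, and the variable names are assumptions for illustration, and html.parser replaces lxml to keep the snippet dependency-free.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = (
    '<p class="title">t</p>'
    '<a class="s1" href="/a">x</a>'
    '<a class="s2" href="/b">y</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: any class value that contains "le"
first = soup.find(class_=re.compile("le"))
# Anchored pattern: class values made of "s" followed by one digit
s_links = soup.find_all("a", class_=re.compile(r"^s\d$"))

print(first.name, first["class"])
print([a["href"] for a in s_links])
```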

Match with the attrs parameter:

print(soup.find(attrs={'class': 'title'}))
print(soup.find(attrs={'id': 'l3'}))

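Output:

First output: the first a tag in the document.
Second output: the a tag whose href is www.baidu.com.
Third output: None, because no a tag has that href.
None
Fourth output: a list of the three a tags that have a class attribute.
Fifth output: the a tag with class "s2".
Sixth output: the first tag whose class contains "le", here the p tag with class "title".
Seventh output: the p tag with class "title", then the a tag with id "l3".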

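Differences between find() and find_all()

find_all() returns a list, while find() returns the first matching node.
When nothing matches, find_all() returns an empty list [], while find() returns None.
Some attribute names cannot be passed as keyword arguments when searching, for example HTML5 data-* attributes, because a name containing a hyphen is not a valid Python identifier. The following raises a SyntaxError: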
```
soup.find_all(data-foo='value')
```
Such attributes can still be searched by passing a dictionary to the attrs parameter of find_all():
```
soup.find_all(attrs={"data-foo": "value"})
```
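
The differences listed above can be sketched in one runnable snippet. The markup and variable names are hypothetical, html.parser is used so that lxml is not required, and compile() is only a convenient way to show that the keyword form is rejected by Python itself before Beautiful Soup is ever called.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = '<div data-foo="value">payload</div><a href="/x">link</a>'
soup = BeautifulSoup(html, "html.parser")

# No match: find() gives None, find_all() gives an empty list
missing_one = soup.find("img")
missing_all = soup.find_all("img")

# Guard against None before reading attributes of a find() result
link = soup.find("a")
href = link["href"] if link is not None else None

# data-foo is not a valid identifier, so the keyword form does not compile
try:
    compile("soup.find_all(data-foo='value')", "<example>", "eval")
    keyword_form_ok = True
except SyntaxError:
    keyword_form_ok = False

# The attrs dictionary accepts any attribute name
matches = soup.find_all(attrs={"data-foo": "value"})

print(missing_one, len(missing_all), href, keyword_form_ok)
print(matches)
```

Checking for None before indexing matters in practice, because subscripting the None returned by a failed find() raises a TypeError.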

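3. The select() function

BeautifulSoup supports most CSS selectors, including the common tag, class, and id selectors as well as hierarchical (descendant and child) selectors. Pass a CSS selector string to select() to retrieve the matching content from the HTML document; the result is returned as a list.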
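
To relate select() to the functions from the previous sections, the sketch below runs one CSS selector and a roughly equivalent find()/find_all() chain against the same document. The markup and variable names are assumptions for illustration, and html.parser is used in place of lxml so that only bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = (
    '<p class="story">'
    '<a class="s1" id="l1" href="/a">one</a>'
    '<a class="s2" id="l2" href="/b">two</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# One CSS selector: a.s2 somewhere inside p.story
by_selector = soup.select("p.story a.s2")
# The same search written as a find()/find_all() chain
by_find_all = soup.find("p", class_="story").find_all("a", class_="s2")
# Like find_all(), select() gives an empty list when nothing matches
no_match = soup.select("img")

print(by_selector)
print(by_selector[0] is by_find_all[0])
print(len(no_match))
```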

The HTML file to parse:

<!DOCTYPE html>
<html lang="en">
<head>
<title>test</title>
</head>
<body>
<p class="title"/>

<a href="http://localhost:8080"/>
<div>
    <p class="story">
        <a class="s1" href="www.baidu.com" id="l1"></a>
        <a class="s2" href="" id="l2"></a>
        <a class="s3" href="" id="l3"></a>
        <span> span </span>
    </p>
</div>
</body>
</html>

Usage examples:

Instantiate a BeautifulSoup object with the lxml parser:

from bs4 import BeautifulSoup

# Use the lxml parser; the with block closes the file after parsing
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

Find by tag name:

print(soup.select('a'))

Find by several tag names:

print(soup.select('a,span'))

Use an attribute selector to find a tags that have a class attribute:

print(soup.select('a[class]'))

Find a tags whose class attribute equals s3:

print(soup.select('a[class="s3"]'))

Find by class selector:

print(soup.select('.s1'))

Find by id selector:

print(soup.select('#l1'))

Descendant selector: find the title tag under head under html:

print(soup.select('html head title'))

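Child combinator (direct children only): find the span tag that is a child of a p that is a child of a div. In bs4 the spaces around > can be omitted:

print(soup.select('div > p > span'))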
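
The difference between the descendant selector and the child combinator is easiest to see on a document where a span sits at two depths. The markup below is invented for that purpose, the variable names are arbitrary, and html.parser is used so that lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one span directly under p, one nested inside b
html = "<div><p><span>direct</span><b><span>nested</span></b></p></div>"
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every span anywhere below div
descendants = soup.select("div span")
# Child combinator: only spans that are direct children of div > p
children = soup.select("div > p > span")
# The same child selector written without spaces
compact = soup.select("div>p>span")

print([s.get_text() for s in descendants])
print([s.get_text() for s in children])
print([s.get_text() for s in compact])
```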

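Output:

First output: a list of all four a tags.
Second output: a list of the four a tags followed by the span tag.
Third output: a list of the three a tags that have a class attribute.
Fourth output: a list containing only the a tag with class "s3".
Fifth output: a list containing only the a tag with class "s1".
Sixth output: a list containing only the a tag with id "l1".
Seventh output:
[<title>test</title>]
Eighth output:
[<span> span </span>]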

Original article: https://blog.csdn.net/wpc2018/article/details/126179938