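Data Extraction, Part 2

Data extraction with regular expressions

Goals: master the common syntax of regular expressions

Master the common methods of the re module

Master the use of the raw-string prefix r

1. What is a regular expression

A regular expression is a pattern string built from a set of predefined special characters and combinations of them. The pattern expresses a filtering rule that is applied to other strings.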

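2. Common regular expression syntax

Key points:

Characters in a regular expression

Predefined character classes

Quantifiers

Regular expression syntax is extensive and cannot all be reviewed here; look up other constructs as you need them. For example, | expresses alternation ("or").

Exercise: what does the following print?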

import re
str1 = 'adacc/sd/sdef/24'

result = re.findall(r'<.*>', str1)
print(result)

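3. Common methods of the re module

pattern.match (one match, anchored at the start of the string)

pattern.search (the first match anywhere in the string)

pattern.findall (all matches)

Greedy mode: provided the whole expression still matches, match as much as possible

Non-greedy mode: provided the whole expression still matches, match as little as possible

findall returns a list; if nothing matches, the list is empty

 re.findall(r"\d", "aef5teacher2") >>>>  ['5', '2']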
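The difference between greedy and non-greedy matching is easiest to see side by side. The following is a minimal sketch; the sample string is made up for illustration and is not taken from the exercise above.

```python
import re

# Hypothetical sample string, used only for illustration.
html = "<b>one</b><b>two</b>"

# Greedy: .* extends to the last '>' that still lets the pattern match,
# so the whole string becomes a single match.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first '>' that lets the pattern match,
# so each tag is matched separately.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>', '</b>', '<b>', '</b>']
```

Adding ? after a quantifier (*?, +?, ??) is what switches it to non-greedy mode.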

pattern.sub (replace)

re.sub(r"\d", "_", "aef5teacher2") >>>>  'aef_teacher_'

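re.compile (compile)

Returns a compiled pattern object p that offers the same methods as the re module, but with different arguments: the pattern is no longer passed on each call

Matching flags must be passed to compile

p = re.compile(r"\d", re.S)
p.findall("aef_teacher")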
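The methods listed in this section can be compared on a single input. The sketch below uses only the standard re module; the input strings are illustrative.

```python
import re

text = "aef5teacher2"

# match succeeds only if the pattern matches at position 0.
m = re.match(r"\d", text)            # None: the string starts with "a"
# search scans forward and returns the first match object.
s = re.search(r"\d", text)           # matches "5"
# findall returns every non-overlapping match as a list of strings.
all_digits = re.findall(r"\d", text)
# sub returns a new string with every match replaced.
replaced = re.sub(r"\d", "_", text)

# A compiled pattern has the same methods, minus the pattern argument.
p = re.compile(r"\d")
compiled_digits = p.findall(text)

# Flags go to compile: with re.S the dot also matches a newline.
dot_all = re.compile(r"a.b", re.S).findall("a\nb")
no_flag = re.compile(r"a.b").findall("a\nb")

print(m, s.group(), all_digits, replaced, compiled_digits)
print(dot_all, no_flag)
```

Note that match and search return a match object (or None), while findall returns a list and sub returns a string.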

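4. The raw-string prefix r in Python

Definition: in a raw string every character is taken literally; a backslash does not introduce an escape sequence for a special or non-printable character. Raw strings matter mainly where such special characters are involved. For example, the raw string r"\n" is the same as the ordinary string "\\n".

Length of a raw string:

len('\n')
# result: 1
len(r'\n')
# result: 2
'\n'[0]
# result: '\n'
r'\n'[0]
# result: '\\'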
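Raw strings are most useful in regular expressions, where the backslash is also the regex escape character. The sketch below, with an illustrative input string, matches a literal backslash with and without the r prefix.

```python
import re

# A four-character string: a, backslash, n, b (no newline involved).
text = "a\\nb"

# With the r prefix the regex engine receives \\ and matches one literal backslash.
raw_result = re.findall(r"a\\nb", text)

# Without the r prefix each backslash must be doubled again for Python itself.
plain_result = re.findall("a\\\\nb", text)

print(len(text))                    # 4
print(raw_result == plain_result)   # True
```

Writing patterns as raw strings avoids this double escaping, which is why the examples in this section use r"...".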

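Data extraction with the lxml module and XPath

Goals: understand what XPath is

Get to know lxml

Master XPath syntax

lxml is a high-performance Python HTML/XML parser. With XPath we can quickly locate specific elements and read node information.

1. The lxml module and XPath syntax

To extract specific content from HTML or XML text, we need to know how to use the lxml module and XPath syntax.

The lxml module uses XPath expressions to quickly locate specific elements in an HTML/XML document and read node information (text content, attribute values)
XPath (XML Path Language) is a language for finding information in HTML/XML documents; it can be used to traverse the elements and attributes in them.
- W3School documentation: http://www.w3school.com.cn/xpath/index.asp
Extracting data from XML or HTML requires using the lxml module together with XPath syntax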

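2. Installing and using the XPath Helper extension for Chrome

Extracting data with lxml requires knowing XPath syntax. The XPath Helper extension lets us practice XPath syntax **(the installer is in the courseware "tools" folder)**

Download the Chrome extension XPath Helper
- It is available from the Chrome Web Store
Extract the rar archive into the current folder
Open Chrome ----> three-dot menu in the upper right ----> More tools ----> Extensions
On the Extensions page, turn on the switch in the upper right to enter developer mode, then drag the xpath folder onto the page and release the mouse

Once it is installed, verify that it works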

3. Node relationships in XPath

[Figure: nodes]

[Figure: relationships between nodes in XPath]

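4. XPath syntax: basic node selection

XPath uses path expressions to select nodes or node sets in an XML document.
These path expressions look very much like the paths used in an ordinary computer file system.
When you select a tag with the Chrome extension, the selected tag gets the attribute class="xh-highlight"

XPath syntax for locating nodes and extracting attributes or text

Expression	Description
nodename	Selects the element with this name.
/	Selects from the root node, or separates one element step from the next.
//	Selects matching nodes in the document from the current node onward, regardless of their position.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.
text()	Selects text.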
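The expressions in the table can be tried directly with lxml. The following is a minimal sketch; the HTML fragment is hypothetical and only serves to show each expression once.

```python
from lxml import etree

# Hypothetical fragment; etree.HTML wraps it in <html><body> automatically.
doc = etree.HTML("""
<div id="nav">
  <a href="/home">Home</a>
  <p><a href="/about">About</a></p>
</div>
""")

all_hrefs = doc.xpath("//a/@href")                 # // finds a at any depth, @ reads an attribute
all_texts = doc.xpath("//a/text()")                # text() reads text nodes
child_hrefs = doc.xpath("/html/body/div/a/@href")  # / only steps to direct children

first_a = doc.xpath("//a")[0]
own_href = first_a.xpath("./@href")                # . is the current node
parent_id = first_a.xpath("../@id")                # .. is the parent node

print(all_hrefs, all_texts, child_hrefs, own_href, parent_id)
```

The third query returns only /home because the second a sits inside a p and is therefore not a direct child of the div.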

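5. XPath syntax: node qualifiers (predicates)

Specific nodes can be picked out by attribute value, position, and so on

5.1 Node qualifier syntax

Path expression	Result
//title[@lang="eng"]	Selects all title elements whose lang attribute is eng
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()>1]	Selects the book elements under bookstore, starting from the second one
//book/title[text()='Harry Potter']	Selects the title elements under any book whose text is exactly Harry Potter
/bookstore/book[price>35.00]/title	Selects the title elements of those book elements under bookstore whose price element has a value greater than 35.00.

5.2 Positions in XPath

In XPath the first element is at position 1
The last element is at position last()
The second-to-last is at last()-1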
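The predicates above can be checked against a small document. This sketch assumes a made-up bookstore with three books; the titles, languages, and prices are illustrative only.

```python
from lxml import etree

# Hypothetical bookstore document, for illustration only.
xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title lang="zh">XPath Notes</title><price>45.00</price></book>
</bookstore>
"""
root = etree.fromstring(xml)

first = root.xpath("/bookstore/book[1]/title/text()")              # positions start at 1
last = root.xpath("/bookstore/book[last()]/title/text()")
second_last = root.xpath("/bookstore/book[last()-1]/title/text()")
after_first = root.xpath("/bookstore/book[position()>1]/title/text()")
english = root.xpath("//title[@lang='eng']/text()")
expensive = root.xpath("/bookstore/book[price>35.00]/title/text()")

print(first, last, second_last)
print(after_first, english, expensive)
```

Because positions are 1-based, book[1] is the first book, unlike Python list indexing where [0] is first.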

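6. XPath syntax: other common node selections

Uses of //

//a all a elements on the current HTML page
bookstore//book all book elements under bookstore

Using @

//a/@href the href of every a
//title[@lang="eng"] title tags with lang=eng

Using text()

//a/text() the text directly under every a
//a[text()='下一页'] a tags whose text is exactly 下一页 ("next page")
a//text() all text under a, at any depth

Finding specific nodes with XPath

//a[1] the first a (within its parent)
//a[last()] the last one
//a[position()<4] the first three

Contains

//a[contains(text(),"下一页")] a tags whose text contains 下一页
//a[contains(@class,'n')] a tags whose class contains n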
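The difference between an exact text() comparison, contains(), and //text() shows up clearly on a small example. The markup below is hypothetical, and the class names were chosen to show that contains() is a plain substring test.

```python
from lxml import etree

# Hypothetical pager markup, for illustration only.
doc = etree.HTML("""
<div>
  <a class="n" href="/page/1">上一页</a>
  <a class="nav n" href="/page/3">下一页</a>
  <a class="banner" href="/help"><span>Help</span> center</a>
</div>
""")

exact = doc.xpath("//a[text()='下一页']/@href")            # text must match exactly
by_text = doc.xpath("//a[contains(text(),'一页')]/@href")  # text contains the substring
by_class = doc.xpath("//a[contains(@class,'n')]/@href")    # substring test on the attribute
direct = doc.xpath("//a[@class='banner']/text()")          # text directly under a
nested = doc.xpath("//a[@class='banner']//text()")         # text at any depth under a

print(exact, by_text, by_class)
print(direct, nested)
```

Note that contains(@class,'n') also matches class="banner", since it only checks for the letter n anywhere in the attribute value.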

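7. Installing lxml and usage examples

lxml is a third-party module; install it before use

7.1 Installing lxml

lxml extracts data from the XML or HTML response content obtained by sending a request

pip/pip3 install lxml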

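7.2 What a crawler extracts from HTML

The text content of a tag
The value of a tag's attribute
- For example, extract the href attribute of an a tag to get a URL, then send a further request to it

7.3 Using lxml

Import the etree module from lxml

from lxml import etree
Use etree.HTML to convert an HTML string (bytes or str) into an Element object. An Element object has an xpath method, which returns a list of results
```
html = etree.HTML(text) 
ret_list = html.xpath("XPath expression string")
```
The list returned by xpath falls into three cases
- An empty list: the XPath expression did not locate any element
- A list of strings: the XPath expression matched text content or attribute values
- A list of Element objects: the XPath expression matched tags; each Element in the list can be queried with xpath again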
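The three kinds of return value can be seen on a one-line document. This is a minimal sketch; the HTML string is an illustrative stand-in for real response content.

```python
from lxml import etree

# Illustrative one-item list.
html = etree.HTML('<ul><li class="item-1"><a href="link1.html">first item</a></li></ul>')

no_match = html.xpath("//p")        # nothing matches, so the list is empty
strings = html.xpath("//a/@href")   # attribute values come back as strings
elements = html.xpath("//li")       # tags come back as Element objects

# An Element can be queried again, relative to itself.
nested = elements[0].xpath("./a/text()")

print(no_match, strings, elements, nested)
```

Checking for the empty-list case before indexing with [0] avoids an IndexError when an expression matches nothing.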

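lxml usage example

Run the code below and look at what it prints

from lxml import etree
text = ''' 
<div> 
  <ul> 
    <li class="item-1">
      <a href="link1.html">first item</a>
    </li> 
    <li class="item-1">
      <a href="link2.html">second item</a>
    </li> 
    <li class="item-inactive">
      <a href="link3.html">third item</a>
    </li> 
    <li class="item-1">
      <a href="link4.html">fourth item</a>
    </li> 
    <li class="item-0">
      <a href="link5.html">fifth item</a>
  </ul> 
</div>
'''

html = etree.HTML(text)

# Get the list of hrefs and the list of titles
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")

# Pair them up into dictionaries (assumes both lists line up one-to-one)
for href, title in zip(href_list, title_list):
    item = {}
    item["href"] = href
    item["title"] = title
    print(item)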
34

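Exercise

In the HTML string below, treat each li tag with class item-1 as one news item. Extract the text and the link of its a tag and assemble them into a dictionary.

text = ''' <div> <ul> 
        <li class="item-1"><a href="link1.html">first item</a></li> 
        <li class="item-1"><a href="link2.html">second item</a></li> 
        <li class="item-inactive"><a href="link3.html">third item</a></li> 
        <li class="item-1"><a href="link4.html">fourth item</a></li> 
        <li class="item-0"><a href="link5.html">fifth item</a> 
        </ul> </div> '''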

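Notes:
- Group first, then extract the data; this keeps values from different items from getting mixed up
- Check for empty values

Then extract the data within each group

html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")  # group first: one Element per news item
for li in li_list:
    item = {}
    href_list = li.xpath("./a/@href")
    title_list = li.xpath("./a/text()")
    # an empty list means no value was found, so fall back to None
    item["href"] = href_list[0] if href_list else None
    item["title"] = title_list[0] if title_list else None
    print(item)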

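##### Key point: use XPath syntax in lxml to locate elements and extract attribute values or text
##### Using the etree.tostring function in lxml
####> Run the code below and compare the original HTML string with the printed output

from lxml import etree
html_str = ''' <div> <ul> 
      <li class="item-1"><a href="link1.html">first item</a></li> 
      <li class="item-1"><a href="link2.html">second item</a></li> 
      <li class="item-inactive"><a href="link3.html">third item</a></li> 
      <li class="item-1"><a href="link4.html">fourth item</a></li> 
      <li class="item-0"><a href="link5.html">fifth item</a> 
      </ul> </div> '''

html = etree.HTML(html_str)

handled_html_str = etree.tostring(html).decode()
print(handled_html_str)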

Observations and conclusions

Compared with the original string, the printed result:

Automatically completes the missing closing li tag
Automatically adds the html and other enclosing tags

<html><body><div> <ul> 
<li class="item-1"><a href="link1.html">first item</a></li> 
<li class="item-1"><a href="link2.html">second item</a></li> 
<li class="item-inactive"><a href="link3.html">third item</a></li> 
<li class="item-1"><a href="link4.html">fourth item</a></li> 
<li class="item-0"><a href="link5.html">fifth item</a> 
</li></ul> </div> </body></html>

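Conclusions:

lxml.etree.HTML(html_str) automatically completes missing tags
lxml.etree.tostring converts an Element object back into an HTML string
When a crawler extracts data with lxml, write the XPath expressions against the output of lxml.etree.tostring, because that is the structure lxml actually queries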

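After-class exercises

Getting started

We use lxml to parse HTML. A simple example:

# lxml_test.py

# Use the etree module from lxml
from lxml import etree

# Note: the last li below is missing its closing </li> tag
html = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# Use etree.HTML to parse the string into an HTML document
xml_doc = etree.HTML(html)

# Serialize the HTML document back to a string (tostring returns bytes)
html_doc = etree.tostring(xml_doc).decode()

print(html_doc)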

Output:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

lxml can repair HTML automatically: in this example it not only completed the li tag but also added the body and html tags.

Reading from a file:

Besides parsing a string directly, lxml can also read content from a file. Create a file named hello.html:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

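Then read the file with the etree.parse() method.

# lxml_parse.py

from lxml import etree

# Read the external file hello.html. etree.parse() defaults to a strict XML parser,
# so pass an HTMLParser when the file is HTML rather than well-formed XML
html = etree.parse('./hello.html', etree.HTMLParser())
result = etree.tostring(html, pretty_print=True).decode()

print(result)

The output is similar to before, now with the span inside the third item (lxml may also print a DOCTYPE line first):

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>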

XPath examples

1. Get all <li> tags

# xpath_li.py

from lxml import etree

xml_doc = etree.parse('hello.html')
print(type(xml_doc))  # show the type returned by etree.parse()

result = xml_doc.xpath('//li')

print(result)  # print the list of li elements
print(len(result))
print(type(result))
print(type(result[0]))

Output:

<class 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
5
<class 'list'>
<class 'lxml.etree._Element'>

2. Get the class attribute of every <li> tag

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/@class')

print(result)

Result

['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']

3. Get the <a> tag under <li> whose `href` is `link1.html`

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/a[@href="link1.html"]')

print(result)

Result

[<Element a at 0x...>]

4. Get all <span> tags under <li> tags

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')

#result = html.xpath('//li/span')
# Note that writing it this way is wrong:
# / selects child elements, and <span> is not a direct child of <li>, so a double slash is needed

result = html.xpath('//li//span')

print(result)

Result

[<Element span at 0x...>]

5. Get every class inside the <a> tags under <li> tags

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/a//@class')

print(result)

Result

['bold']

6. Get the href attribute of the <a> inside the last <li>

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')

result = html.xpath('//li[last()]/a/@href')
# The predicate [last()] selects the last element

print(result)

Result

['link5.html']

7. Get the content of the second-to-last element

# xpath_li.py

from lxml import etree

# For example, the content of <a href="www.xxx.com">abcd</a> is abcd

html = etree.parse('hello.html')
result = html.xpath('//li[last()-1]/a')

# The text attribute returns the element's content
print(result[0].text)

Result

fourth item

8. Get the name of the tag whose `class` is `bold`

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')

result = html.xpath('//*[@class="bold"]')

# The tag attribute returns the tag name
print(result[0].tag)

Result

span

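Data extraction with the BeautifulSoup module and CSS selectors (extension)

"""
# BeautifulSoup
An efficient web page parsing library that extracts data from HTML or XML files
It supports different parsers, for example for HTML, XML, and HTML5
A very powerful tool and a staple of web scraping
A flexible and convenient parsing library that is efficient and supports multiple parsers
With it you can scrape web page information without writing regular expressions
"""

# Install: pip3 install beautifulsoup4


# Tag selectors
### Selecting by tag
#### .string  -- get the text node and its content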

html = """

    
        The Dormouse's story
    
    
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
    ,
    Lacie and
    Tillie;
    and they lived at the bottom of a well.
    ...
"""
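from bs4 import BeautifulSoup   # import the class
soup = BeautifulSoup(html, 'lxml')  # argument 1: the HTML to parse; argument 2: the parser

# print(soup.prettify())  # pretty-print the completed document

print(soup.html.head.title.string)

print(soup.title.string)  # title is a node; .string is an attribute that returns its text

# Select the whole head, including the tag itself
print(soup.head) # everything in head, including the head tag itself

print(soup.p) # returns the first match


#%% md

### Getting the name
#### .name  -- get the name of the tag itself  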

#%%

html = """
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print(soup.title.name)  # the name of the tag itself --> title
print(soup.p.name)  # --> the tag name

#%% md

### Getting attribute values

#### .attrs  -- look up a value by attribute name 

#%%

html = """
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

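print(soup.p.attrs['name'])  # value of the p tag's name attribute

print(soup.p.attrs['id']) # value of the p tag's id attribute
print(soup.p['id']) # shorthand for the same lookup

print(soup.p['class']) # returned as a list
print(soup.a['href'])  # again, only the first matching tag is used

#%% md

### Nested selection

Each tag in the chain must be nested inside the previous one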

#%%

html = """
The Dormouse's story

The abc Dormouse's story
Once upon a time there were three little sisters; and their names were
,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.body.p.b.string)  # walk down level by level
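The tag-access attributes covered so far (.string, .name, .attrs, and nested selection) can be tried on one self-contained document. This is a sketch under assumptions: the markup and its attribute values (id="intro", name="dromouse", the URL) are invented for illustration, and html.parser is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; attribute values are invented for illustration.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="intro" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a href="http://example.com/elsie">Elsie</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # text inside <title>
tag_name = soup.p.name           # name of the first <p>
p_id = soup.p["id"]              # same as soup.p.attrs["id"]
p_class = soup.p["class"]        # class is multi-valued, so this is a list
first_href = soup.a["href"]      # only the first <a> is used
nested = soup.body.p.b.string    # walk down from body to p to b

print(title_text, tag_name, p_id, p_class, first_href, nested)
```

Accessing a tag as an attribute (soup.p, soup.a) always returns the first match only; the find_all and select methods are the way to get every match.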

#%% md

### Child and descendant nodes

#%%

html = """

    
        The Dormouse's story
    
    
        
            Once upon a time there were three little sisters; and their names were
            
                Elsie
            
            Lacie 
            and
            Tillie
            and they lived at the bottom of a well.
        
        ...
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

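# Tag access returns only the first match, not everything; how do we get the rest?

# The .contents attribute returns a tag's direct children as a list
# print(soup.p.contents)  # all direct children of the p tag, as a list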

for i in soup.p.contents:
    print(i)


#%%



#%%

html = """

    
        The Dormouse's story
    
    
        
            Once upon a time there were three little sisters; and their names were
            
                Elsie
            
            Lacie 
            and
            Tillie
            and they lived at the bottom of a well.
        
        ...
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# .children is an iterator over the direct children, not a list
print(soup.p.children)  # direct children, returned as an iterator

for i in soup.p.children:
    print(i)

for i, child in enumerate(soup.p.children):  
    print(i, child)

#%%

html = """

    
        The Dormouse's story
    
    
        
            Once upon a time there were three little sisters; and their names were
            
                Elsie
            
            Lacie 
            and
            Tillie
            and they lived at the bottom of a well.
        
        ...
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # all descendants, returned as a generator
for i, child in enumerate(soup.p.descendants):
    print(i, child)
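The difference between .contents, .children, and .descendants is mostly about depth and return type. A minimal sketch on a hypothetical one-line paragraph (html.parser is an assumption made to keep the example dependency-free):

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with one nested <span>.
html = '<p class="story">Once <a href="#1"><span>Elsie</span></a> and <a href="#2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

contents = soup.p.contents              # list of direct children (tags and strings)
children = list(soup.p.children)        # the same nodes, produced by an iterator
descendants = list(soup.p.descendants)  # children, grandchildren, and so on

print(len(contents), len(children), len(descendants))  # 4 4 7
```

The direct children are two strings and two a tags; the descendants additionally include the span and the text nodes nested inside the tags.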

#%% md

### Parent and ancestor nodes

#%%

html = """

    
        The Dormouse's story
    
    
        
            Once upon a time there were three little sisters; and their names were
            
                Elsie
            
            Lacie 
            and
            Tillie
            and they lived at the bottom of a well.
        
        ...
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # the parent node

#%%

html = """

    
        The Dormouse's story
    
    
        
            Once upon a time there were three little sisters; and their names were
            
                Elsie
            
            Lacie 
            and
            Tillie
            and they lived at the bottom of a well.
        
        ...
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))  # all ancestor nodes

#%% md

### Sibling nodes

#%%

html = """

    
        The Dormouse's story
    
    
        
            abcqweasd
            Once upon a time there were three little sisters; and their names were
            
                Elsie
            
            Lacie 
            and
            Tillie
            and they lived at the bottom of a well.
        
        ...
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))  # all siblings after this node
print('---'*15)
print(list(enumerate(soup.a.previous_siblings))) # all siblings before this node
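Parent, ancestor, and sibling navigation can be checked on a compact fragment. The markup below is hypothetical and deliberately contains no whitespace between tags, so every sibling is a tag rather than a text node.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with no whitespace between tags.
html = '<div><p><b>intro</b><a href="#1">Elsie</a><a href="#2">Lacie</a><a href="#3">Tillie</a></p></div>'
soup = BeautifulSoup(html, "html.parser")
second = soup.find_all("a")[1]   # the "Lacie" link

parent_name = second.parent.name
ancestor_names = [tag.name for tag in second.parents]
next_texts = [s.get_text() for s in second.next_siblings]
prev_texts = [s.get_text() for s in second.previous_siblings]

print(parent_name)      # p
print(ancestor_names)   # ['p', 'div', '[document]']
print(next_texts)       # ['Tillie']
print(prev_texts)       # ['Elsie', 'intro']
```

previous_siblings walks backwards from the current node, so the nearest sibling comes first. In real pages the whitespace between tags appears as extra string siblings.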

#%% md

## In practice: the standard search methods

#%% md

### find_all( name , attrs , recursive , text , **kwargs )

#%% md

Searches the document by tag name, attributes, or text content

#%% md

#### Searching by tag name with find_all

#%%

html='''

    
        Hello
    
    
        
            Foo
            Bar
            Jay
        
        
            Foo-2
            Bar-2
        
    

'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print(soup.find_all('ul'))  # every ul tag together with its contents
print(soup.find_all('ul')[0])

ul = soup.find_all('ul')
print(ul) # the full list of ul tags and their contents
print('____________'*10)

for ul in soup.find_all('ul'):
#     print(ul)  # each ul tag
    for li in ul:
#         print(li)  # each child of ul (li tags and the whitespace between them)
        print(li.string)  # the text of each child

#%% md

#### Getting text values

#%%

for ul in soup.find_all('ul'):
    for i in ul.find_all("li"):
        print(i.string)

#%% md

#### Searching by attribute

#%%

html='''

    
        Hello
    
    
        
            Foo
            Bar
            Jay
        
        
            Foo
            Bar
        
    

'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# First form: pass attrs
# print(soup.find_all(attrs={'id': 'list-1'})) # by id attribute
print("-----"*10)
# print(soup.find_all(attrs={'name': 'elements'}))  # by name attribute

for ul in soup.find_all(attrs={'name': 'elements'}):
    print(ul)  
    print(ul.li.string)  # only the first li is returned
# # # #     print('-----')
    for li in ul:
#         print(li)
        print(li.string)

#%% md

#### Searching by special attributes  

#%%

html='''

    
        Hello
    
    
        
            Foo
            Bar
            Jay
        
        
            Foo
            Bar
        
    

'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

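# Second form: keyword arguments
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, so a trailing underscore is used

# Recommended form: li tags filtered by their class attribute
print(soup.find_all('li', {'class': 'element'}))  

#%% md

####  Selecting by text value with text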

#%%

html='''

    
        Hello
    
    
        
            Foo
            Bar
            Jay
        
        
            Foo
            Bar
        
    

'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print(soup.find_all(text='Foo')) # useful for counting occurrences
print(soup.find_all(text='Bar'))
print(len(soup.find_all(text='Foo'))) # number of matches

#%% md

### find( name , attrs , recursive , text , **kwargs )

#%% md

find returns a single element; find_all returns all matching elements

#%%

html='''

    
        Hello
    
    
        
            Foo
            Bar
            Jay
        
        
            Foo
            Bar
        
    

'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul')) # only the first match is returned
print(soup.find('li'))

print(soup.find('page')) # returns None if the tag does not exist
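The find_all and find variants above can be run against one self-contained document. This is a sketch under assumptions: the markup, including the id, name, and class values, is invented to match the attribute names used in this section, and html.parser stands in for lxml to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; ids, names, and classes are invented for illustration.
html = """
<div class="panel">
  <ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

uls = soup.find_all("ul")                            # by tag name
by_id = soup.find_all(id="list-1")                   # by keyword argument
by_attrs = soup.find_all(attrs={"name": "elements"}) # by attrs dictionary
by_class = soup.find_all("li", class_="element")     # tag name plus class
foo_strings = soup.find_all(string="Foo")            # string= is the newer name for text=
first_li = soup.find("li")                           # first match only
missing = soup.find("page")                          # None when nothing matches

print(len(uls), len(by_id), len(by_attrs), len(by_class), len(foo_strings))
print(first_li.string, missing)
```

The name attribute has to go through attrs because name is already the first positional parameter of find_all.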

#%% md

### find_parents()  find_parent()

#%% md

find_parents() returns all ancestor nodes; find_parent() returns the direct parent.

#%% md

### find_next_siblings()  find_next_sibling()

#%% md

find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

#%% md

### find_previous_siblings()  find_previous_sibling()

#%% md

find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.

#%% md

### find_all_next()  find_next()

#%% md

find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it

#%% md

### find_all_previous() and find_previous()

#%% md

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it
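The relative search methods described above have no example in this section, so here is a minimal sketch. The markup is hypothetical, and the starting point is the middle li so that there is something on both sides of it.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with no whitespace between tags.
html = "<div><ul><li>Foo</li><li>Bar</li><li>Jay</li></ul><p>after</p></div>"
soup = BeautifulSoup(html, "html.parser")
bar = soup.find_all("li")[1]   # start from the middle item

parent = bar.find_parent("ul").name                      # closest matching ancestor
ancestors = [t.name for t in bar.find_parents()]         # all ancestors, nearest first
next_sib = bar.find_next_sibling("li").string            # first following sibling
prev_sibs = [t.string for t in bar.find_previous_siblings("li")]
following = [t.name for t in bar.find_all_next()]        # every tag later in the document
preceding_li = bar.find_previous("li").string            # nearest earlier <li>

print(parent, ancestors[:2], next_sib, prev_sibs, following, preceding_li)
```

The sibling methods stay on the same level of the tree, while find_all_next and find_all_previous follow document order and therefore also reach nodes outside the current parent, such as the p here.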

#%% md

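## CSS selectors

#%% md

Pass a CSS selector directly to select() to make a selection

Consider this approach if you are already comfortable with CSS selectors in HTML

#%% md

Notes:

    1. In a CSS selector, tag names are written bare, class names are prefixed with . and ids are prefixed with # 
    
    2. The method to use is soup.select(), and it returns a list
    
    3. Separate multiple filters with spaces; they are applied level by level from left to right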
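The three notes above can be sketched on a small document. The markup is hypothetical (class and id values are invented to mirror the selectors used in this section), and html.parser is an assumption to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class and id values are invented for illustration.
html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

items = soup.select("ul li")                     # bare tag names, nested by a space
heading = soup.select(".panel .panel-heading")   # class names start with .
by_id = soup.select("#list-1 .element")          # ids start with #

texts = [li.get_text() for li in items]          # text of each selected tag
ul = soup.select("ul")[0]
ul_id, ul_class = ul["id"], ul.attrs["class"]    # attributes work as with find_all results

print(len(items), len(heading), len(by_id))
print(texts, ul_id, ul_class)
```

select() always returns a list, even for a single match, so index it or loop over it before reading attributes or text.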


#%%

html='''
q321312321

    
        Hello
    
    
        
            Foo
            Bar
            Jay
        
        
            Foo
            Bar
        
    

'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# Descendant selector: ul li
print(soup.select('ul li'))  # tag names are written bare
print("----"*10)
print(soup.select('.panel .panel-heading')) # class names are prefixed with .
print("----"*10)

print(soup.select('#list-1 .element')) 
print("----"*10)


#%%

html='''

    
        Hello
    
    
        
            Foo
            Bar
            Jay
        
        
            Foo
            Bar
        
    

'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    for i in ul.select('li'):
        print(i.string)
        

### Getting attributes


html='''

    
        Hello
    
    
        
            Foo
            Bar
            Jay
        
        
            Foo
            Bar
        
    

'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# [] reads the id attribute; attrs reads the class attribute
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['class'])


### Getting content
### get_text()    

html='''

    
        Hello
    
    
        
            Foo
            Bar
            Jay
        
        
            Foo
            Bar
        
    

'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.string)
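    print(li.get_text())  # get the content

* Prefer the lxml parser; fall back to html.parser when necessary
* Selecting by tag attribute (soup.tag) is fast but offers weak filtering
* Use find() and find_all() to match a single result or multiple results
* If you know CSS selectors well, use select()
* Remember the common ways to get attributes and text values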

Data extraction with CSS selectors

CSS syntax overview

Anyone familiar with front-end development will know CSS selectors; jQuery, for example, uses CSS selector syntax for its DOM operations

Data extraction performance comparison

Using CSS selectors in a crawler: code walkthrough

>>> from requests_html import HTMLSession
>>> session = HTMLSession()

# Returns a Response object
>>> r = session.get('https://python.org/')

# Get all links
>>> r.html.links
{'/users/membership/', '/about/gettingstarted/'}

# Select an element with a CSS selector
>>> about = r.html.find('#about')[0]

>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure

Original article: https://blog.csdn.net/weixin_51550438/article/details/126005977
