[Beginner-Friendly] Python Web Scraping from Basics to Practice (Part 1)

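Contents

I. Parsing with Beautiful Soup

1.1 Basic usage of Beautiful Soup

Installation

Parsers

Parser comparison

Quick start

How to use it

Kinds of objects

Tag

Tag names

The name and attributes properties

NavigableString (strings)

BeautifulSoup

Comment

1.2 Navigating the tree with Beautiful Soup

Child nodes

.contents and .children

.descendants

Node content

.string

.text

Multiple strings

.strings

.stripped_strings

Parent nodes

.parent

.parents

1.3 Searching the tree with Beautiful Soup

find_all

The name argument

Keyword arguments

The text argument

The limit argument

find()

find_parents() and find_parent()

1.4 CSS selectors in Beautiful Soup

Selecting by tag name

Selecting by class name

Selecting by id

Combined selectors

Selecting by attribute

II. Parsing with XPath

1.1 Installing and using XPath

1.2 Parsing workflow and usage

1.3 XPath syntax

Path expressions

Predicates

Selecting unknown nodes

Logical operators

Attribute queries

Selecting the n-th tag (indexes start at 1)

Fuzzy matching

Content queries

Getting attribute values

Getting all nodes

Converting node content to a string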

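I. Parsing with Beautiful Soup

1.1 Basic usage of Beautiful Soup

Simply put, Beautiful Soup is a Python library whose main purpose is extracting data from web pages. The official description reads:

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.

Installation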
```
pip install beautifulsoup4
```
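Parsers

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, the built-in parser (html.parser) is used. The lxml parser is more capable and faster, so installing it is recommended.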
```
pip install lxml
```
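As a rough sketch of how the parser choice plays out in code, the snippet below tries lxml first and falls back to the built-in html.parser when lxml is not installed. `make_soup` is a hypothetical helper written for this illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)
```

Naming the parser explicitly keeps results consistent across machines, since different parsers can build slightly different trees from the same markup.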
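Parser comparison

See the official Beautiful Soup documentation for a comparison of the parsers.

Quick start

The HTML snippet below is used as an example throughout. It is an excerpt from Alice's Adventures in Wonderland (referred to below as the Alice document):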
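```
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
```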
Parsing this snippet with BeautifulSoup gives a BeautifulSoup object, which can print the document with standard indentation:
```
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
# Pretty-print the HTML
print(soup.prettify())
```
Output:
```
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
```
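A few simple ways to browse the structured data:
```
soup.title  # get the <title> tag
# <title>The Dormouse's story</title>

soup.title.name   # get the tag name
# 'title'

soup.title.string   # get the text inside <title>
# "The Dormouse's story"

soup.title.parent  # get the parent tag
# <head><title>The Dormouse's story</title></head>

soup.title.parent.name  # get the parent tag's name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']  # get the class attribute of <p> (multi-valued, so a list)
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")  # get the tag whose id is link3
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```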
Find the links of all <a> tags in the document:
```
for link in soup.find_all('a'):
    print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie
```
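Real pages often mix relative and absolute links, and some <a> tags have no href at all. The sketch below collects (text, absolute URL) pairs. The sample markup and `base_url` are made up for illustration; `urljoin` comes from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up sample markup: one relative link, one absolute link, one <a> without href.
html = (
    '<a href="/elsie">Elsie</a>'
    '<a href="http://example.com/lacie">Lacie</a>'
    '<a>No link</a>'
)
base_url = "http://example.com/"  # assumed address of the page being parsed

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute.
links = [
    (a.get_text(), urljoin(base_url, a["href"]))
    for a in soup.find_all("a", href=True)
]
print(links)
```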
Get all the text content of the document:
```
print(soup.get_text())
```
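By default `get_text()` simply concatenates every string in the document. A small sketch with made-up markup shows how its `separator` and `strip` arguments change the result:

```python
from bs4 import BeautifulSoup

html = "<p>Hello, <b>world</b>!</p><p>Bye</p>"  # made-up sample markup
soup = BeautifulSoup(html, "html.parser")

# Plain concatenation: the two paragraphs run together.
plain = soup.get_text()

# Strip each piece of text and join the pieces with a single space.
spaced = soup.get_text(separator=" ", strip=True)

print(repr(plain))
print(repr(spaced))
```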
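How to use it

Pass a document to the BeautifulSoup constructor to get a document object. You can pass either a string or an open file handle.
```
from bs4 import BeautifulSoup

# Parse from an open file handle
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'lxml')

# Parse from a string
soup = BeautifulSoup("<html>data</html>", 'lxml')
```
If no parser is named, Beautiful Soup picks the most suitable parser it can find. If you name one explicitly, as above, it uses that parser.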

Kinds of objects

Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:

Tag, NavigableString, BeautifulSoup, and Comment.

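Tag

Put simply, a Tag is one of the tags in the HTML. A Tag object corresponds to a tag in the original XML or HTML document:
```
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
```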
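Tag names

Going back to the soup object built from the Alice html_doc: the simplest way to navigate the tree is to name the tag you want. To get the <head> tag, just use soup.head:
```
soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>
```
This trick can be chained to step down through the tree. The code below gets the first <b> tag inside <body>:
```
soup.body.b
# <b>The Dormouse's story</b>
```
Attribute-style access returns only the first tag with that name:
```
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```
To get all the <a> tags, or anything more than the first tag with a given name, use one of the methods described in Searching the tree, such as find_all():
```
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
In short, soup followed by a tag name is a convenient way to get a tag, but note that it returns only the first matching tag in the whole document.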
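A minimal sketch of the difference, using made-up markup: attribute-style access gives the first match (or None when nothing matches), while find_all() gives every match.

```python
from bs4 import BeautifulSoup

html = "<div><a id='x'>1</a><a id='y'>2</a></div>"  # made-up sample markup
soup = BeautifulSoup(html, "html.parser")

first = soup.a                  # only the first <a>
all_links = soup.find_all("a")  # every <a>, in document order
missing = soup.table            # no <table> in the document, so None

print(first["id"], [a["id"] for a in all_links], missing)
```

Because a missing tag yields None, chaining such as soup.table.tr raises AttributeError, so check for None before going deeper.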

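The name and attributes properties

A Tag has many methods and properties. The two most important are name and attributes.

Every tag has a name, available as .name, and its attributes are accessed like dictionary entries:
```
tag.name
# 'b'
tag['class']
# ['boldest']
tag.attrs
# {'class': ['boldest']}
```
Tag attributes can be added, removed, or modified. Again, a tag's attributes behave like a dictionary (for reference):
```
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">Extremely bold</b>

del tag['class']
del tag['id']
tag
# <b>Extremely bold</b>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
```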
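Here is a small sketch of the dictionary-style access in practice, using made-up markup. Note that class is a multi-valued attribute, so Beautiful Soup returns it as a list, and `.get()` avoids the KeyError for attributes that may be missing.

```python
from bs4 import BeautifulSoup

html = '<a class="sister external" href="/elsie">Elsie</a>'  # made-up sample
soup = BeautifulSoup(html, "html.parser")
link = soup.a

classes = link["class"]            # multi-valued attribute: a list of class names
href = link.get("href")            # like dict.get: returns None if missing
missing = link.get("id", "no-id")  # a default can be supplied, as with dict.get

link["target"] = "_blank"          # add a new attribute
print(classes, href, missing, link.attrs)
```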

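NavigableString (strings)

Now that we have a tag, how do we get the text inside it? Simply use .string.

Strings are usually contained in tags. Beautiful Soup wraps the strings inside a tag in the NavigableString class:
```
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
```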

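BeautifulSoup

The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a special kind of Tag, and read its type, name, and attributes:
```
print(type(soup.name))
# <class 'str'>
print(soup.name)
# [document]
print(soup.attrs)
# {} (an empty dictionary)
```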

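Comment

If the string content is a comment, it is a Comment object.
```
html_doc = '<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.a.string)
#  Elsie
print(type(soup.a.string))
# <class 'bs4.element.Comment'>
```
The content of the <a> tag is actually a comment, but .string prints it with the comment markers removed. This can cause unexpected trouble if the type is not checked.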
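One way to guard against this is to check the type before using the value. The sketch below assumes you want to ignore comment text entirely; `visible_string` is a hypothetical helper, not a Beautiful Soup function.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = "<a><!-- Elsie --></a><b>Lacie</b>"  # made-up sample markup
soup = BeautifulSoup(html, "html.parser")


def visible_string(tag):
    """Hypothetical helper: return tag.string unless it is missing or a comment."""
    value = tag.string
    if value is None or isinstance(value, Comment):
        return None
    return str(value)


print(visible_string(soup.a))  # the comment is skipped
print(visible_string(soup.b))
```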

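1.2 Navigating the tree with Beautiful Soup

We use the Alice document as the example again:
```
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
```
This example is used to show how to move from one part of a document to another.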

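Child nodes

A Tag may contain strings and other Tags. These are all children of that Tag. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

Note: string nodes in Beautiful Soup do not support these attributes, because a string has no children.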

.contents and .children

A tag's .contents attribute returns its children as a list:
```
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
```
A string has no .contents attribute, because a string has no children:
```
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
```
.children does not return a list, but you can iterate over it to get all the child nodes. Printing .children shows that it is an iterator object (the exact type depends on the Beautiful Soup version).

Use a tag's .children iterator to loop over its children:
```
print(title_tag.children)
# an iterator object, e.g. <list_iterator object at 0x...>
for child in title_tag.children:
    print(child)
    # The Dormouse's story
```
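As a quick illustration with made-up markup, .contents can be indexed and measured like any list, while .children is meant to be consumed in a loop or comprehension:

```python
from bs4 import BeautifulSoup

html = "<ul><li>a</li><li>b</li></ul>"  # made-up sample markup
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

count = len(ul.contents)                    # .contents is a real list
first_item = ul.contents[0]                 # so it supports indexing
names = [child.name for child in ul.children]  # .children is iterated

print(count, first_item, names)
```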

.descendants

The .contents and .children attributes only cover a tag's direct children. For example, the <head> tag has a single direct child, the <title> tag:
```
head_tag.contents
# [<title>The Dormouse's story</title>]
```
But the <title> tag itself has a child: the string "The Dormouse's story". That string is therefore also a descendant of the <head> tag.

The .descendants attribute iterates recursively over all of a tag's descendants:
```
for child in head_tag.descendants:
    print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story
```
In the example above, the <head> tag has only one child but two descendants: the <title> node and the <title> node's child. The BeautifulSoup object has the <html> node as its direct child element, yet it has many descendants:
```
len(list(soup.children))
# 1 (can be larger if the parser keeps whitespace around <html>)
len(list(soup.descendants))
# 25 (the exact count depends on the parser and on whitespace)
```
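A smaller made-up document makes the counts easy to check by hand. The sketch also separates tags from strings, since .descendants yields both:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div><p>one <b>two</b></p></div>"  # made-up sample markup
soup = BeautifulSoup(html, "html.parser")
div = soup.div

direct = list(div.children)        # just the <p> tag
all_nodes = list(div.descendants)  # <p>, 'one ', <b>, 'two'

# Keep only the Tag objects among the descendants.
tag_names = [node.name for node in all_nodes if isinstance(node, Tag)]

print(len(direct), len(all_nodes), tag_names)
```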

Node content

.string

If a tag has a single child of type NavigableString, .string returns that child. If a tag's only child is another tag, .string returns the same result as that child's .string.

Put simply: if a tag contains no other tags, .string returns its text. If it contains exactly one tag, .string returns the innermost text. For example:
```
print(soup.head.string)
# The Dormouse's story
print(soup.title.string)
# The Dormouse's story
```
If a tag contains more than one child, it is unclear which child .string should refer to, so .string is None:
```
print(soup.html.string)
# None
```

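.text

If a tag contains several children, .text returns all the text inside it:
```
print(soup.html.text)
```
Note:

Both .strings and .text give you all the text content.

The difference is that .text returns a single str, while .strings is a generator.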
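The three attributes can be compared side by side on a small made-up document:

```python
from bs4 import BeautifulSoup

html = "<p><b>only</b></p><div>a<i>b</i></div>"  # made-up sample markup
soup = BeautifulSoup(html, "html.parser")

single = soup.p.string             # <p> has one child tag, so its string is used
ambiguous = soup.div.string        # <div> has two children, so this is None
all_text = soup.div.text           # one str with every piece of text joined
pieces = list(soup.div.strings)    # the same text, yielded piece by piece

print(single, ambiguous, all_text, pieces)
```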

Multiple strings

The .strings and .stripped_strings attributes

.strings

Use .strings to get multiple strings. It has to be iterated over, as in this example:
```
for string in soup.strings:
    print(repr(string))
# '\n'
# "The Dormouse's story"
# '\n'
# '\n'
# "The Dormouse's story"
# '\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n'
# '...'
# '\n'
```
.stripped_strings

The strings may include a lot of extra spaces and blank lines. Use .stripped_strings to drop whitespace-only strings and strip the rest:
```
for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'
```
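A common use of .stripped_strings is to turn messy markup into one clean line of text. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up sample markup with stray spaces and newlines.
html = "<div>\n <p> Hello </p>\n <p>World</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# Whitespace-only strings are skipped and the rest are stripped.
words = list(soup.stripped_strings)
clean = " ".join(words)

print(words)
print(clean)
```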

Parent nodes

Continuing through the tree: every tag and every string has a parent, that is, it is contained in some tag.

.parent

Use the .parent attribute to get an element's parent. In the Alice document, the <head> tag is the parent of the <title> tag:
```
title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
```
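As a short sketch with made-up markup, .parent gives the immediate parent, and .parents (listed in the contents above) walks all the way up to the BeautifulSoup object, whose name is "[document]":

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>x</a></p></body></html>"  # made-up sample markup
soup = BeautifulSoup(html, "html.parser")
link = soup.a

parent_name = link.parent.name                    # the immediate parent tag
ancestors = [node.name for node in link.parents]  # every ancestor, innermost first

print(parent_name)
print(ancestors)
```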

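| Expression | Description |
|---|---|
| / | Selects from the root node. |
| // | Selects matching nodes anywhere in the document, regardless of their position. |
| ./ | Runs the XPath relative to the current node. |
| @ | Selects attributes. |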

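| Path expression | Result |
|---|---|
| /html | Selects the root element html. Note: a path that starts with a slash ( / ) is always an absolute path to an element. |
| //li | Selects all li elements, wherever they are in the document. |
| //ul//a | Selects all a elements that are descendants of a ul element, wherever they sit below the ul. |
| node.xpath('./div') | Selects the div elements that are direct children of the current node object. |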
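The path expressions above can be tried out with lxml. This is only a sketch on made-up markup; `etree.HTML` parses the string and returns the root html element.

```python
from lxml import etree

# Made-up sample markup.
html = etree.HTML(
    "<html><body><ul>"
    "<li><a href='/a'>A</a></li>"
    "<li><a href='/b'>B</a></li>"
    "</ul></body></html>"
)

items = html.xpath("//li")              # all li elements in the document
links = html.xpath("//ul//a")           # all a elements below a ul
ul = html.xpath("/html/body/ul")[0]     # absolute path from the root
children = ul.xpath("./li")             # relative to the current node
hrefs = html.xpath("//a/@href")         # @ selects attribute values

print(len(items), len(links), len(children), hrefs)
```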

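| Path expression | Result |
|---|---|
| /ul/li[1] | Selects the first li element that is a child of ul. |
| /ul/li[last()] | Selects the last li element that is a child of ul. |
| /ul/li[last()-1] | Selects the second-to-last li element that is a child of ul. |
| //ul/li[position()<3] | Selects the first two li elements that are children of a ul element. |
| //a[@title] | Selects all a elements that have a title attribute. |
| //a[@title='xx'] | Selects all a elements whose title attribute has the value xx. |
| //a[@title>10] | Selects all a elements whose title attribute has a value greater than 10. The other comparison operators (<, >=, <=, !=) work the same way. |
| /bookstore/book[price>35.00]/title | Selects all title elements of the book elements in bookstore whose price element has a value greater than 35.00. |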
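A minimal lxml sketch of these predicates, run against made-up sample data (an HTML fragment for the list and link cases, a small XML bookstore for the numeric comparison):

```python
from lxml import etree

# Made-up HTML fragment; etree.HTML wraps it in <html><body>.
page = etree.HTML(
    "<ul><li>one</li><li>two</li><li>three</li></ul>"
    "<a title='xx' href='/1'>titled</a><a href='/2'>untitled</a>"
)

first = page.xpath("//ul/li[1]/text()")           # indexes start at 1
last = page.xpath("//ul/li[last()]/text()")
second_last = page.xpath("//ul/li[last()-1]/text()")
first_two = page.xpath("//ul/li[position()<3]/text()")
titled = page.xpath("//a[@title]/@href")
titled_xx = page.xpath("//a[@title='xx']/@href")

# Made-up XML document for the numeric comparison.
store = etree.fromstring(
    "<bookstore>"
    "<book><title>A</title><price>29.99</price></book>"
    "<book><title>B</title><price>39.95</price></book>"
    "</bookstore>"
)
expensive = store.xpath("/bookstore/book[price>35.00]/title/text()")

print(first, last, second_last, first_two, titled, titled_xx, expensive)
```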

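| Wildcard | Description |
|---|---|
| * | Matches any element node. It often appears in XPath copied from the browser. |
| @* | Matches any attribute node. |
| node() | Matches any node of any kind. |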

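| Path expression | Result |
|---|---|
| /ul/* | Selects all child elements of the ul element. |
| //* | Selects all elements in the document. |
| //title[@*] | Selects all title elements that have at least one attribute. |
| //node() | Selects all nodes. |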
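The wildcards can be checked the same way. The sketch below uses a made-up XML bookstore in which only the first title carries an attribute:

```python
from lxml import etree

# Made-up XML sample: only the first <title> has an attribute.
store = etree.fromstring(
    "<bookstore>"
    "<book><title lang='en'>A</title><price>29.99</price></book>"
    "<book><title>B</title><price>39.95</price></book>"
    "</bookstore>"
)

children = store.xpath("/bookstore/*")             # every child element of the root
everything = store.xpath("//*")                    # every element in the document
with_attrs = store.xpath("//title[@*]/text()")     # titles with any attribute

print([c.tag for c in children], len(everything), with_attrs)
```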

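| Path expression | Result |
|---|---|
| //book/title \| //book/price | Selects all title and price elements of book elements. |
| //title \| //price | Selects all title and price elements in the document. |
| /bookstore/book/title \| //price | Selects all title elements of book elements in bookstore, plus all price elements in the document. |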
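A brief lxml sketch of the union operator on a made-up XML bookstore. The combined result comes back in document order, not grouped by sub-expression:

```python
from lxml import etree

# Made-up XML sample.
store = etree.fromstring(
    "<bookstore>"
    "<book><title>A</title><price>29.99</price></book>"
    "<book><title>B</title><price>39.95</price></book>"
    "</bookstore>"
)

# "|" merges the node sets selected by the two paths.
nodes = store.xpath("//book/title | //book/price")
tags = [node.tag for node in nodes]
values = [node.text for node in nodes]

print(tags)
print(values)
```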
