• Tech Learning: Python (16) | The lxml Module and XPath (Web Scraping)


    Event page: CSDN 21-Day Learning Challenge

    🏮1 Web Crawlers

    • What is a crawler?
      Simply put, it is a program that automatically fetches information from the internet.

    • How a crawler extracts data from a web page (see the sketch after the links below)
      (figure: crawler data-extraction workflow)

    • The lxml module and XPath
      lxml is a Python wrapper around libxml2, an XML parsing library, available as a Python package. lxml can parse both XML and HTML, supports XPath queries, and parses quite efficiently.

    Key reference documentation: https://lxml.de/
    Open-source project: https://github.com/lxml/lxml
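    As a rough illustration of that workflow, here is a minimal sketch (not from the original post) that downloads a page with the standard-library urllib and hands the HTML to lxml. The URL is a placeholder, and real sites usually require proper headers, throttling, and permission to scrape.

    from urllib.request import Request, urlopen
    from lxml import etree

    url = "https://example.com/"  # placeholder URL; substitute the page you actually want
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # many sites reject the default urllib UA

    with urlopen(req) as resp:
        body = resp.read().decode("utf-8", errors="replace")

    # Parse the downloaded HTML and pull data out with XPath (covered in section 3).
    html = etree.HTML(body)
    print(html.xpath("//title/text()"))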

    🏮2 The lxml Module

    Within the lxml library, the most heavily used module is lxml.etree, followed by the Element object. When parsing data, most of the code is written against the Element object's API.

    🎈2.1 Installation

    Open a terminal and run the install command pip install lxml; if the output reports Successfully, the installation worked.

    Aion.$ python -m pip install lxml
    Collecting lxml
      Downloading lxml-4.9.1.tar.gz (3.4 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 23.8 kB/s eta 0:00:00
      Preparing metadata (setup.py) ... done
    Building wheels for collected packages: lxml
      Building wheel for lxml (setup.py) ... done
      Created wheel for lxml: filename=lxml-4.9.1-cp310-cp310-macosx_12_0_x86_64.whl size=1672209 sha256=5b6b5d05a4dc63d8a25a267b045a908dc6249f1e4bc0e847b0b8fbdf2d1a100c
      Stored in directory: xxx/a4/ec/7b/8acde6da24b5aabeee049213d5bec12d1e9214d3cae276387b
    Successfully built lxml
    Installing collected packages: lxml
    Successfully installed lxml-4.9.1
    

    The download can be a bit slow; expect to wait a few minutes.
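    To double-check that the install went through, a quick sanity check (a small sketch, not part of the original log) is to print the version from Python:

    import lxml
    from lxml import etree

    print(lxml.__version__)    # e.g. 4.9.1
    print(etree.LXML_VERSION)  # the same version as a tuple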

    🎈2.2 Parsing an HTML String

    >>> from lxml import etree
    >>> text = '''
    ... <div class="key">
    ...     <div class="name">小明</div>
    ...     <div class="age">21</div>
    ...     <div class="address">广东省广州市天河区白云路123号</div>
    ... </div>
    ... '''
    # Initialize by passing in an HTML-style string; etree.HTML() adds the missing <html><body> wrapper
    >>> html = etree.HTML(text)
    >>> print(html)
    <Element html at 0x10af8cbc0>
    >>> print(type(html))
    <class 'lxml.etree._Element'>
    # Serialize the parsed tree back into an HTML string
    >>> result = etree.tostring(html).decode('utf-8')
    >>> print(result)
    <html><body><div class="key">
        <div class="name">小明</div>
        <div class="age">21</div>
        <div class="address">广东省广州市天河区白云路123号</div>
    </div></body></html>
    >>> print(type(result))
    <class 'str'>
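    The Element that etree.HTML() returns can be queried directly; as a small preview of the XPath syntax covered in section 3, this sketch (continuing from the html variable above) pulls the text out of the inner <div> tags:

    # Relative to the session above: text content of the name/age divs.
    print(html.xpath('//div[@class="name"]/text()'))
    print(html.xpath('//div[@class="age"]/text()'))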

    🎈2.3 Parsing an HTML File

    • Create a test file

    To try parsing from an HTML file, first create a test file named c17.html; its content can simply be the string shown in 2.2.

    • Parse it
    >>> from lxml import etree
    >>> html = etree.parse('c17.html')
    >>>
    >>> result = etree.tostring(html).decode('utf-8')
    >>>
    >>> print(result)
    <html>
    	<body>
    	<div class="key">
    		<div class="name">小明</div>
    		<div class="age">21</div>
    		<div class="address">广东省广州市天河区白云路123号</div>
    	</div>
    	</body>
    </html>
    >>>
    >>> print(type(result))
    <class 'str'>
    >>>
    >>> html = etree.HTML(result)
    >>>
    >>> print(html)
    <Element html at 0x10d605880>
    >>>
    >>> print(type(html))
    <class 'lxml.etree._Element'>
    >>>
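    Note that etree.parse() uses the strict XML parser by default, so it only works when the file is well-formed. For ordinary, messier HTML files it is safer to pass an HTML parser explicitly; a minimal sketch:

    from lxml import etree

    # Use the lenient HTML parser instead of the strict XML default.
    parser = etree.HTMLParser()
    html = etree.parse('c17.html', parser)
    print(etree.tostring(html).decode('utf-8'))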

🏮3 XPath in Detail

XPath is a language for finding information in XML documents; it can be used to navigate elements and attributes. XPath is a core component of the W3C XSLT standard, and both XQuery and XPointer are built on top of XPath expressions.

🎈3.1 Create an HTML File

I grabbed an arbitrary snippet of markup from the Douban Movies homepage and copied it into a file named c16.html.


<html>

<div class="billboard-bd">
        <table>
                <tbody>
                    <tr>
                        <td class="order">1</td>
                        <td class="title"><a onclick="moreurl(this, {from:'mv_rk'})" href="https://movie.douban.com/subject/35056243/">13条命</a></td>
                    </tr>
                    <tr>
                        <td class="order">2</td>
                        <td class="title"><a onclick="moreurl(this, {from:'mv_rk'})" href="https://movie.douban.com/subject/35338562/">全职</a></td>
                    </tr>
                    <tr>
                        <td class="order">3</td>
                        <td class="title"><a onclick="moreurl(this, {from:'mv_rk'})" href="https://movie.douban.com/subject/35073886/">分手的决心</a></td>
                    </tr>
                    <tr>
                        <td class="order">4</td>
                        <td class="title"><a onclick="moreurl(this, {from:'mv_rk'})" href="https://movie.douban.com/subject/35447590/">失踪的莱昂纳多</a></td>
                    </tr>
                    <tr>
                        <td class="order">5</td>
                        <td class="title"><a onclick="moreurl(this, {from:'mv_rk'})" href="https://movie.douban.com/subject/27124451/"></a></td>
                    </tr>
                    <tr>
                        <td class="order">6</td>
                        <td class="title"><a onclick="moreurl(this, {from:'mv_rk'})" href="https://movie.douban.com/subject/26305582/">猫王</a></td>
                    </tr>
                    <tr>
                        <td class="order">7</td>
                        <td class="title"><a onclick="moreurl(this, {from:'mv_rk'})" href="https://movie.douban.com/subject/35408460/">太他妈好相处了</a></td>
                    </tr>
                    <tr>
                        <td class="order">8</td>
                        <td class="title"><a onclick="moreurl(this, {from:'mv_rk'})" href="https://movie.douban.com/subject/26642033/">小黄人大眼萌:神偷奶爸前传</a></td>
                    </tr>
                    <tr>
                        <td class="order">9</td>
                        <td class="title"><a onclick="moreurl(this, {from:'mv_rk'})" href="https://movie.douban.com/subject/34940681/">小野田的丛林万夜</a></td>
                    </tr>
                    <tr>
                        <td class="order">10</td>
                        <td class="title"><a onclick="moreurl(this, {from:'mv_rk'})" href="https://movie.douban.com/subject/35490166/">同车异路</a></td>
                    </tr>
                </tbody>
        </table>
</div>

</html>

🎈3.2 Get All <tr> Tags

Oddly enough, when I followed the steps in the teacher's post, I got an error:

>>> html_c16 = etree.parse("c16.html")
Traceback (most recent call last):
  File "", line 1, in <module>
  File "src/lxml/etree.pyx", line 3538, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1902, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1805, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "c16.html", line 7
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xCC 0xF5 0xC3 0xFC, line 7, column 138

Looking at the cause, lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !, it finally made sense: the parser wants an explicit encoding (really because my HTML file was not saved cleanly; the reported bytes suggest it was in a non-UTF-8 encoding such as GBK). Since it needs that parameter, let's compromise and pass it in.

>>> parser = etree.HTMLParser(encoding='utf-8')
>>> html_c16 = etree.parse('c16.html', parser=parser)

>>> html_tr = html_c16.xpath('//tr')
>>> print(html_tr)
[<Element tr at 0x10b915980>, <Element tr at 0x10b9159c0>, <Element tr at 0x10b915a00>, <Element tr at 0x10b915a40>, <Element tr at 0x10b915a80>, <Element tr at 0x10b915b00>, <Element tr at 0x10b915b40>, <Element tr at 0x10b915b80>, <Element tr at 0x10b915bc0>, <Element tr at 0x10b915ac0>]
>>>
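Each item in that list is an Element, so further relative XPath queries can be run on the rows themselves. A small sketch continuing from html_tr (the './/' prefix restricts the search to the current <tr>):

for tr in html_tr:
    order = tr.xpath('.//td[@class="order"]/text()')
    title = tr.xpath('.//td[@class="title"]/a/text()')
    print(order, title)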

🎈3.3 Get All class Attributes

>>> html_c16 = etree.parse('c16.html',parser=parser)
>>> HTML_td_class = html_c16.xpath('//td/@class')
>>>
>>> print(HTML_td_class)
['order', 'title', 'order', 'title', 'order', 'title', 'order', 'title', 'order', 'title', 'order', 'title', 'order', 'title', 'order', 'title', 'order', 'title', 'order', 'title']
>>>
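The same idea works for text content: end the expression with text() instead of @class to pull the movie titles out of the title cells (a sketch against the same c16.html):

titles = html_c16.xpath('//td[@class="title"]/a/text()')
print(titles)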

🎈3.4 Get a Specific <a> Tag Under <td>

For example, here I want the <a> tag under <td> whose href is https://movie.douban.com/subject/35408460/.

>>>
>>> html_a = html_c16.xpath('//td/a[@href="https://movie.douban.com/subject/35408460/"]')
>>>
>>> print(html_a)
[<Element a at 0x10b916740>]
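If only part of the URL is known, XPath's contains() does a substring match instead of the exact comparison above (a sketch):

# Substring match on the href attribute.
html_a = html_c16.xpath('//td/a[contains(@href, "35408460")]')
print(html_a[0].get('href'))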

🎈3.5 Get All <a> Tags Under All <td> Tags

>>> html_a_list = html_c16.xpath('//td//a')
>>> print(html_a_list)
[<Element a at 0x10b916840>, <Element a at 0x10b916880>, <Element a at 0x10b9168c0>, <Element a at 0x10b916900>, <Element a at 0x10b916940>, <Element a at 0x10b9169c0>, <Element a at 0x10b916740>, <Element a at 0x10b916a00>, <Element a at 0x10b916a40>, <Element a at 0x10b916980>]
>>>

🎈3.6 Get the onclick Methods (Functions) of All <a> Tags Under <td>

>>> html_a_method = html_c16.xpath('//td/a//@onclick')
>>> print(html_a_method)
["moreurl(this, {from:'mv_rk'})", "moreurl(this, {from:'mv_rk'})", "moreurl(this, {from:'mv_rk'})", "moreurl(this, {from:'mv_rk'})", "moreurl(this, {from:'mv_rk'})", "moreurl(this, {from:'mv_rk'})", "moreurl(this, {from:'mv_rk'})", "moreurl(this, {from:'mv_rk'})", "moreurl(this, {from:'mv_rk'})", "moreurl(this, {from:'mv_rk'})"]
>>>

🎈3.7 Get the Last href Attribute

I tried this, but nothing was printed on my end; see the teacher's original post (and the sketch after 3.8 below).

🎈3.8 Get the Content of the Third-to-Last a Element in td

Same here: nothing was printed; see the teacher's original post.
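For reference, expressions along these lines are the usual way to write 3.7 and 3.8; whether they print anything depends on the exact file, so treat this as an untested sketch against the c16.html above:

# 3.7: the last href attribute in the document.
print(html_c16.xpath('(//td/a/@href)[last()]'))

# 3.8: text of the third-to-last <a> among the <td> cells.
print(html_c16.xpath('(//td/a)[last()-2]/text()'))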

🎈3.9 Get All Tags with class="order"

>>> index_result = html_c16.xpath('//*[@class="order"]')
>>> print(index_result)
[<Element td at 0x10b915e00>, <Element td at 0x10b915f00>, <Element td at 0x10b916000>, <Element td at 0x10b916140>, <Element td at 0x10b916240>, <Element td at 0x10b916300>, <Element td at 0x10b916400>, <Element td at 0x10b916500>, <Element td at 0x10b916600>, <Element td at 0x10b916700>]
>>> print(index_result[0].tag)
td
>>>

🏮4 Summary

🎈4.1 (omitted)

🎈4.2 (omitted)

  • Original post: https://blog.csdn.net/L_Lycos/article/details/126395230