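An XPath expression describes a path from one node to another: you locate an ancestor tag by its name or attributes, then follow the path from that tag to any of its descendant tags.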
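from lxml import etree
# Parse a local file; HTMLParser tolerates HTML that is not well-formed XML.
html = etree.parse("test1.html", etree.HTMLParser())
# Prints the ElementTree object itself, not the page content.
print(html)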
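
To try the parsing step without a file on disk, the sketch below feeds an in-memory string to the same `etree.parse` call. The `sample` markup is a made-up stand-in for `test1.html`, not the author's actual file.

```python
from io import StringIO

from lxml import etree

# Hypothetical page content standing in for test1.html.
sample = "<html><body><p class='title'>Hello</p></body></html>"

# etree.parse accepts a file-like object as well as a filename.
tree = etree.parse(StringIO(sample), etree.HTMLParser())
root = tree.getroot()

print(type(tree).__name__)  # the parsed document is an _ElementTree
print(root.tag)             # the root element is <html>
```

Calling `tree.xpath(...)` on the result works the same way as on the object returned by `etree.HTML`.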

import requests
from lxml import etree

# Send a browser-like User-Agent so the request looks like it comes from a browser.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69"
}
response = requests.get(url="https://www.baidu.com/", headers=headers)
result = response.text
# Parse the HTML string into an element tree that supports XPath queries.
html = etree.HTML(result)

import requests
from lxml import etree
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69"
}
response = requests.get(url="https://www.baidu.com/",headers=headers)
result = response.text
html = etree.HTML(result)
# Select every element, whatever its tag, whose class attribute is exactly "title".
print(html.xpath('//*[@class="title"]'))
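
The wildcard query can be checked offline against a small inline document. This is only a sketch: the markup and the names `doc`, `matches`, and `tags` are made up for illustration. It also shows that `@class="title"` compares the whole attribute string, so an element with `class="title big"` is not matched.

```python
from lxml import etree

# Hypothetical markup; etree.HTML wraps it in <html><body> automatically.
doc = etree.HTML(
    '<div>'
    '<p class="title">First</p>'
    '<span class="title">Second</span>'
    '<p class="title big">Third</p>'
    '</div>'
)

# xpath() returns a list of Element objects in document order.
matches = doc.xpath('//*[@class="title"]')
tags = [el.tag for el in matches]
texts = [el.text for el in matches]

print(tags)   # ['p', 'span']
print(texts)  # ['First', 'Second']
```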

You can also replace * with a specific tag name to restrict the match to that tag, for example html.xpath('//p[@class="title"]').
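Logical operators can be used inside the predicate as well: html.xpath('//p[@class="title" and @name="color"]') matches p elements that satisfy both conditions, while html.xpath('//p[@class="title" or @name="color"]') matches p elements that satisfy at least one of them.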
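
A minimal sketch of tag-name and `and`/`or` predicates, run against made-up markup (the element contents A to D and all variable names are hypothetical):

```python
from lxml import etree

# Hypothetical markup covering each attribute combination.
doc = etree.HTML(
    '<div>'
    '<p class="title" name="color">A</p>'
    '<p class="title">B</p>'
    '<p name="color">C</p>'
    '<span class="title" name="color">D</span>'
    '</div>'
)

# Tag name instead of *: only <p> elements are considered.
only_p = [el.text for el in doc.xpath('//p[@class="title"]')]

# and: both attribute tests must hold.
both = [el.text for el in doc.xpath('//p[@class="title" and @name="color"]')]

# or: at least one attribute test must hold.
either = [el.text for el in doc.xpath('//p[@class="title" or @name="color"]')]

print(only_p)  # ['A', 'B']
print(both)    # ['A']
print(either)  # ['A', 'B', 'C']
```

The `<span>` is never returned because every query here starts with `//p`.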
html.xpath('//*[@class="title"]/text()')
html.xpath('//*[@class="title"]//text()')
html.xpath('//*[@class="title"]/@id')
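/text(): extracts the text that belongs directly to the node
//text(): extracts all text under the node, including text inside its descendants
/@attribute_name: extracts the value of the named attribute on the node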
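
The difference between the three forms is easiest to see on a node that mixes its own text with a child element. The markup below is a made-up example, and the variable names are hypothetical.

```python
from lxml import etree

# Hypothetical node with direct text and a nested <b> child.
doc = etree.HTML('<div class="title" id="main">Hello <b>bold</b> world</div>')

# /text(): only text nodes that are direct children of the <div>.
direct = doc.xpath('//*[@class="title"]/text()')

# //text(): text nodes at any depth under the <div>.
all_text = doc.xpath('//*[@class="title"]//text()')

# /@id: the value of the id attribute.
ids = doc.xpath('//*[@class="title"]/@id')

print(direct)    # ['Hello ', ' world']
print(all_text)  # ['Hello ', 'bold', ' world']
print(ids)       # ['main']
```

Each query returns a list of strings, so `"".join(all_text)` gives the full visible text of the node.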
