Web Scraping from Beginner to Expert, Framework Series 18 (Using Selectors in Scrapy): selector, xpath, css, re

Official documentation

 Using selectors

To explain how to use the selectors we’ll use the Scrapy shell (which provides interactive testing) and an example page located in the Scrapy documentation server:
https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
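```
<html>
  <head>
    <base href='http://example.com/' />
    <title>Example website</title>
  </head>
  <body>
    <div id='images'>
      <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' alt='image1'/></a>
      <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' alt='image2'/></a>
      <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' alt='image3'/></a>
      <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' alt='image4'/></a>
      <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' alt='image5'/></a>
    </div>
  </body>
</html>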
```
Start the interactive Scrapy shell:
```
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
```
Enter:
```
response.selector
```
Output: the Selector object built into the response.

XPath selectors

Let's construct an XPath for selecting the text inside the title tag:
```
response.xpath("//title/text()")
```
The output is a list of selectors, each shown together with the data it matched.

CSS selectors

The same title text can be selected with CSS; get() returns the first match as a string:
```
response.css("title::text").get()
```
Combining XPath and CSS

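Use XPath to find the div whose id is images:
```
response.xpath('//div[@id="images"]')
```
The result is itself a selector list, so a CSS query can be chained onto it:
```
response.xpath('//div[@id="images"]').css("img")
```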
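In CSS, ::attr() retrieves an attribute:
```
response.xpath('//div[@id="images"]').css("img::attr(src)").extract()
```
default: when a query matches nothing, extract_first() (or get()) returns the value passed as its default argument instead of None.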

The href attribute:

contains

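To find every link whose href value contains image, use the XPath contains() function. Its first argument is the attribute and its second argument is the substring to look for.
```
response.xpath('//a[contains(@href,"image")]/@href').extract()
```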
The CSS equivalent:
```
response.css('a[href*=image]::attr(href)').extract()
```
To select the src attribute of the img inside each of those a tags, use contains() again:
```
response.xpath('//a[contains(@href,"image")]/img/@src').extract()
```
In CSS, note the space after the closing ]; it means "any img inside the matched a":
```
response.css('a[href*=image] img::attr(src)').extract()
```
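Regular expressions

Extracting content

Extracting the text after the colon calls for a regular expression. The colon has no special meaning in a regular expression, so it needs no backslash, and a raw string (r'...') keeps Python from interpreting any backslashes in the pattern.
```
response.css('a::text').re(r'Name:(.*)')
```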
Just as extract() has extract_first(), re() has a counterpart that returns only the first match: re_first()
```
response.css('a::text').re_first(r'Name:(.*)')
```
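You can then call strip() to remove leading and trailing whitespace from the result. re_first() returns None when nothing matches, so this is only safe when a match is guaranteed:
```
response.css('a::text').re_first(r'Name:(.*)').strip()
```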
Summary

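The response and the selectors it returns provide these extraction methods:
- xpath
- css
- re
xpath() and css() return a list of Selector objects, so calls can be chained and the results looped over for nested selection. re() returns a list of plain strings, so it ends the chain.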
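(a) With CSS:
- Text inside a tags: response.css('a::text')
- An attribute of a tags: response.css('a::attr(name)'), where name is the attribute name
(b) With XPath:
- Text inside a tags: response.xpath('//a/text()')
- An attribute of a tags: response.xpath('//a/@href')
The two approaches differ in syntax but give similar results.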

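To turn a selector into data, append .extract(), .extract_first(), or .extract()[x] (where x is an index into the list). The .getall() and .get() methods are equivalents of .extract() and .extract_first().
For more specific information, append .re() or .re_first() to apply a regular expression to the selected text.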
Original article: https://blog.csdn.net/weixin_41865866/article/details/136664800
