I. Install the Scrapy package
pip install Scrapy
II. Create a Scrapy project (tutorial)
scrapy startproject tutorial
The project directory contains the following:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
III. Write a spider (quotes_spider.py) in the tutorial/spiders directory
1. Spider (version 1.0)
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
The QuotesSpider class must subclass scrapy.Spider and define the following attributes and methods:
name: the spider's name (it must be unique within a project)
start_requests(): generates the spider's initial requests (the method must return an iterable of Requests)
Each request is built as scrapy.Request(url=url, callback=self.parse)
parse(): handles the response downloaded for each request (the response argument holds the page content)
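To run the spider, use the crawl command (with the name defined above) from the project's top-level directory:

    scrapy crawl quotes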
2. Spider (version 2.0)
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
If you only define the start_urls attribute, there is no need to override start_requests(): the default start_requests() implementation generates the initial requests from start_urls.
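For intuition, the default implementation behaves roughly like the sketch below (a simplification for illustration, not Scrapy's verbatim source; the class name is hypothetical):

    import scrapy

    class DefaultStartRequestsSketch(scrapy.Spider):
        name = "sketch"  # hypothetical name, for illustration only
        start_urls = ['http://quotes.toscrape.com/page/1/']

        def start_requests(self):
            # Roughly what scrapy.Spider provides out of the box:
            # one Request per start URL, handled by self.parse.
            for url in self.start_urls:
                yield scrapy.Request(url, dont_filter=True)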
3. Spider (version 3.0)
Page HTML:
- <div class="quote">
- <span class="text">“The world as we have created it is a process of our
- thinking. It cannot be changed without changing our thinking.”span>
- <span>
- by <small class="author">Albert Einsteinsmall>
- <a href="/author/Albert-Einstein">(about)a>
- span>
- <div class="tags">
- Tags:
- <a class="tag" href="/tag/change/page/1/">changea>
- <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughtsa>
- <a class="tag" href="/tag/thinking/page/1/">thinkinga>
- <a class="tag" href="/tag/world/page/1/">worlda>
- div>
- div>
Spider code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
This spider extracts the text, author, and tags from each page and logs them to the terminal.
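To store the scraped items instead of only logging them, run the spider with a feed export; for example, the following command writes every yielded item to a JSON file (-O overwrites any existing file, -o appends):

    scrapy crawl quotes -O quotes.json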
4. Spider (version 4.0)
Next-page HTML:
- <ul class="pager">
- <li class="next">
- <a href="/page/2/">Next <span aria-hidden="true">→span>a>
- li>
- ul>
Spider code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
This spider extracts the text, author, and tags from the page and logs them to the terminal, then extracts the next-page URL and requests it, recursing until there is no next page.
    # response.urljoin(url) resolves url against response.url with the same
    # rules as urllib.parse.urljoin; a few tests of those rules:
    >>> from urllib.parse import urljoin
    >>> urljoin("http://www.xoxxoo.com/a/b/c.html", "d.html")
    'http://www.xoxxoo.com/a/b/d.html'
    >>> urljoin("http://www.xoxxoo.com/a/b/c.html", "/d.html")
    'http://www.xoxxoo.com/d.html'
    >>> urljoin("http://www.xoxxoo.com/a/b/c.html", "../d.html")
    'http://www.xoxxoo.com/a/d.html'
5. Spider (version 5.0)
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
The difference between version 4.0 and version 5.0: scrapy.Request expects an absolute URL, whereas response.follow also accepts a relative URL (resolving it against the current page) and returns a Request instance, so the explicit urljoin call is no longer needed.
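response.follow also accepts a Selector instead of a URL string; for <a> elements it uses their href attribute automatically, so the pagination step could equally be written like this:

    # Follow every matching pagination link; href is extracted
    # automatically from each <a> element.
    for a in response.css('li.next a'):
        yield response.follow(a, callback=self.parse)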
6. Spider (version 6.0)
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
When a selector matching <a> elements is passed to follow or follow_all, the href attribute is used automatically; that is why response.css('li.next a') works here just as response.css('li.next a::attr(href)') would.
The difference between response.follow_all and response.follow: the former returns an iterable of Request instances, while the latter returns a single Request instance.
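follow_all additionally accepts css and xpath keyword arguments as shortcuts, so the pagination lines in the spider above could be shortened to:

    yield from response.follow_all(css='li.next a', callback=self.parse)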
Scrapy filters out requests for URLs that have already been visited, avoiding duplicate requests; this behavior is controlled by the DUPEFILTER_CLASS setting.
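The default filter is the request-fingerprint-based RFPDupeFilter; you would only set DUPEFILTER_CLASS explicitly in settings.py to swap in a custom implementation:

    # settings.py -- this is already Scrapy's default, shown for illustration:
    DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'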
IV. Using spider arguments
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Run from the command line:
scrapy crawl quotes -O quotes-humor.json -a tag=humor
Arguments passed with -a are handed to the spider's __init__ method and become spider attributes by default. In this example, the spider reads that attribute with getattr() and uses it to build the start URL.
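Equivalently, the spider can accept the argument explicitly by overriding __init__; a minimal sketch (this explicit-__init__ style is common practice, not part of the original spider):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def __init__(self, tag=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # -a tag=humor arrives here; setting self.tag explicitly
            # replaces the getattr() lookup in start_requests().
            self.tag = tag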
V. CSS selectors and XPath
scrapy shell is an interactive shell for quickly trying out scraping code, and especially for testing data-extraction expressions (CSS selectors and XPath). Launch it like this:
scrapy shell "http://quotes.toscrape.com/page/1/"
Page source:
<head>
    <meta charset="UTF-8">
    <title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
Extracting with CSS selectors:
    # Get a list of all matching results
    >>> response.css('title::text').getall()
    # Get the first match; returns None when nothing matches
    >>> response.css('title::text').get()
    # Get the first match; raises IndexError when nothing matches
    >>> response.css('title::text')[0].get()
    # Run a regular expression over the matches after selecting with CSS
    >>> response.css('title::text').re(r'Quotes.*')
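Against the page source shown above, these calls return, for example:

    >>> response.css('title::text').getall()
    ['Quotes to Scrape']
    >>> response.css('title::text').get()
    'Quotes to Scrape'
    >>> response.css('title::text').re(r'Quotes.*')
    ['Quotes to Scrape']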
Extracting with XPath:
    >>> response.xpath('//title')
    [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
    >>> response.xpath('//title/text()').get()
    'Quotes to Scrape'
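The CSS queries used in the earlier spiders have direct XPath equivalents; for instance, assuming the quote markup shown in section III, the first quote and author on page 1 can be read like this:

    >>> response.xpath('//div[@class="quote"]//span[@class="text"]/text()').get()
    '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
    >>> response.xpath('//div[@class="quote"]//small[@class="author"]/text()').get()
    'Albert Einstein'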
For more crawler knowledge and example source code, follow the WeChat official account: angry_it_man