Scrapy Basic Concepts: Command-Line Tool

I. Command-line usage when building a project

1. Directory structure with multiple projects


scrapy.cfg
firstproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
secondproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

2. Configuring multiple projects (scrapy.cfg)


[settings]
default = firstproject.settings
project2 = secondproject.settings

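3. Switching between projects

Set the SCRAPY_PROJECT environment variable to one of the keys defined in scrapy.cfg. The example below uses the Windows set command; on Linux or macOS use export SCRAPY_PROJECT=project2 instead.

>>> scrapy settings --get BOT_NAME
firstproject
>>> set SCRAPY_PROJECT=project2
>>> scrapy settings --get BOT_NAME
secondproject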
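
The lookup behind this switch can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not Scrapy's own code: the helper name resolve_settings_module is hypothetical, and it assumes the two entries from step 2 live under a [settings] section of scrapy.cfg.

```python
import configparser

# Contents of scrapy.cfg as in step 2 (assumed to sit under a [settings] header).
SCRAPY_CFG = """
[settings]
default = firstproject.settings
project2 = secondproject.settings
"""


def resolve_settings_module(cfg_text, environ):
    """Return the settings module selected by SCRAPY_PROJECT (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # When SCRAPY_PROJECT is unset, fall back to the "default" entry.
    project = environ.get("SCRAPY_PROJECT", "default")
    return parser.get("settings", project)


if __name__ == "__main__":
    print(resolve_settings_module(SCRAPY_CFG, {}))
    print(resolve_settings_module(SCRAPY_CFG, {"SCRAPY_PROJECT": "project2"}))
```

Passing the environment as a plain dictionary keeps the sketch easy to test; in a real script os.environ would take its place.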

4. Creating a project

scrapy startproject <project_name> [project_dir]

5. Entering the project directory

cd <project_dir>

6. Creating a spider

scrapy genspider [-t template] <name> <domain>

7. Getting help for scrapy and for a specific command

scrapy -h
scrapy <command> -h

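II. Global commands

1. startproject


Syntax: scrapy startproject <project_name> [project_dir]
Usage: creates a new Scrapy project named project_name under project_dir; if project_dir is omitted, a directory with the same name as the project is used
Requires project: no
Example: scrapy startproject myproject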

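2. genspider


Syntax: scrapy genspider [-t template] <name> <domain>
Usage: creates a new spider from a template, in the current folder or, when run inside a project, in the project's spiders folder
Requires project: no
Example: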
>>> scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
>>> scrapy genspider example example.com
Created spider 'example' using template 'basic'
>>> scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'

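3. settings


Syntax: scrapy settings [options]
Usage: gets the value of a Scrapy setting; inside a project it shows the project's value, otherwise the Scrapy default
Requires project: no
Example: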
>>> scrapy settings --get BOT_NAME
scrapybot
>>> scrapy settings --get DOWNLOAD_DELAY
0

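4. runspider


Syntax: scrapy runspider <spider_file.py>
Usage: runs a spider that is self-contained in a Python file, without having to create a project
Requires project: no
Example: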
>>> scrapy runspider myspider.py
[ ... spider starts crawling ... ]
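
For reference, a file such as myspider.py could look like the source embedded in the sketch below. The block only writes that file and assembles the matching command line, so it runs without Scrapy installed; the class name, the spider name, the URL and the helper write_spider are all placeholders chosen for illustration.

```python
import tempfile
from pathlib import Path

# Source of a minimal self-contained spider (names and URL are placeholders).
SPIDER_SOURCE = '''\
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # Yield one item per page with the URL and the <title> text.
        yield {"url": response.url, "title": response.css("title::text").get()}
'''


def write_spider(directory):
    """Write myspider.py into `directory` and return (path, runspider argv)."""
    path = Path(directory) / "myspider.py"
    path.write_text(SPIDER_SOURCE, encoding="utf-8")
    return path, ["scrapy", "runspider", str(path)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path, argv = write_spider(tmp)
        # Syntax check only; actually running the spider requires Scrapy.
        compile(path.read_text(encoding="utf-8"), str(path), "exec")
        print(" ".join(argv))
```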

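5. shell


Syntax: scrapy shell [url]
Requires project: no
Usage: starts the Scrapy shell for the given URL, or an empty shell if no URL is given

Options:
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-c code: evaluate the code in the shell, print the result and exit
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them)
Example: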
>>> scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]
>>> scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')
# shell follows HTTP redirects by default
>>> scrapy shell --nolog http://httpbin.org/redirect-to?url=http://example.com/ -c '(response.status, response.url)'
(200, 'http://example.com/')
# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
>>> scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http://example.com/ -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http://example.com/')
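
The redirect URLs above contain a ? character, which some shells treat specially when it is left unquoted. One way to sidestep quoting is to assemble the command as an argument list in Python. The following is only a sketch under that assumption: build_shell_argv is a hypothetical helper, and it uses nothing beyond the flags shown in the examples above.

```python
import shlex


def build_shell_argv(url, code, no_redirect=False):
    """Assemble a `scrapy shell` command line from the options listed above."""
    argv = ["scrapy", "shell", "--nolog"]
    if no_redirect:
        argv.append("--no-redirect")
    argv += [url, "-c", code]
    return argv


if __name__ == "__main__":
    argv = build_shell_argv(
        "http://httpbin.org/redirect-to?url=http://example.com/",
        "(response.status, response.url)",
        no_redirect=True,
    )
    # shlex.join quotes each argument so the line can be pasted into a POSIX shell.
    print(shlex.join(argv))
    # To actually run it (requires Scrapy): subprocess.run(argv, check=True)
```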

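6. fetch


Syntax: scrapy fetch <url>
Requires project: no
Usage: downloads the given URL and writes the response to standard output. Inside a project the spider's settings apply; outside a project the default Scrapy downloader settings are used
Options:
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
--headers: print the response's HTTP headers instead of the response body
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them)
Example: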
>>> scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]
>>> scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263   '],
 'Connection': ['close     '],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}

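7. view


Syntax: scrapy view <url>
Requires project: no
Usage: downloads the page using the spider's settings and opens it in a browser, which shows the page as the spider "sees" it
Options:
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them)
Example: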
>>> scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]

8. version


Syntax: scrapy version [-v]
Requires project: no
Usage: prints the Scrapy version. With -v it also prints Python, Twisted and platform information

III. Project commands

1. crawl


Syntax: scrapy crawl <spider>
Usage: starts crawling with the given spider
Requires project: yes
Example:
>>> scrapy crawl myspider
[ ... myspider starts crawling ... ]

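2. check


Syntax: scrapy check [-l] <spider>
Usage: runs contract checks; with -l it only lists the callbacks that have contracts
Requires project: yes
Example: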
>>> scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item
>>> scrapy check
[FAILED] first_spider:parse_item
'RetailPricex' field is missing
[FAILED] first_spider:parse
Returned 92 requests, expected 0..4
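
Contracts are declared in the docstring of a spider callback, on lines that start with @ (for example @url, @returns and @scrapes). The sketch below shows such a docstring and a simplified way of pulling the contract lines out of it. It is an illustration only, not Scrapy's own implementation: parse_item, the field names and the helper extract_contracts are hypothetical.

```python
import re


def parse_item(response):
    """Hypothetical callback annotated with Scrapy-style contracts.

    @url http://www.example.com/some/page.html
    @returns items 1 1
    @returns requests 0 4
    @scrapes Title RetailPrice
    """


def extract_contracts(func):
    """Return (name, args) pairs for every docstring line starting with '@'."""
    contracts = []
    for line in (func.__doc__ or "").splitlines():
        line = line.strip()
        if line.startswith("@"):
            name, args = re.match(r"@(\w+)\s*(.*)", line).groups()
            contracts.append((name, args.split()))
    return contracts


if __name__ == "__main__":
    for name, args in extract_contracts(parse_item):
        print(name, args)
```

Read against the output above, a "field is missing" failure corresponds to a @scrapes contract and a "Returned 92 requests, expected 0..4" failure to a @returns requests 0 4 contract.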

3. list


Syntax: scrapy list
Usage: lists all available spiders in the current project, one per line
Requires project: yes
Example:
>>> scrapy list
spider1
spider2

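4. edit


Syntax: scrapy edit <spider>
Usage: opens the given spider in the editor defined by the EDITOR environment variable or, if that is unset, by the EDITOR setting
Requires project: yes
Example: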
>>> scrapy edit spider1

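5. parse


Syntax: scrapy parse [options] <url>
Usage: fetches the given URL and parses it with the spider that handles it, using the method passed with --callback (parse by default)
Requires project: yes
Options:
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as the callback for parsing the response
--meta or -m: additional request meta passed to the callback request; must be a valid JSON string. Example: --meta='{"foo" : "bar"}'
--cbkwargs: additional keyword arguments passed to the callback; must be a valid JSON string. Example: --cbkwargs='{"foo" : "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: do not show scraped items
--nolinks: do not show extracted links
--nocolour: avoid using Pygments to colorize the output
--depth or -d: depth level to which requests are followed recursively (default: 1)
--verbose or -v: display information for each depth level
--output or -o: dump scraped items to a file
Example: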
>>> scrapy parse --spider=toscrape-css -c parse -d 2 http://quotes.toscrape.com/
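
Because --meta and --cbkwargs must be valid JSON strings, it can be safer to generate them with json.dumps than to type them by hand. The following sketch assembles a parse command that way; build_parse_argv is a hypothetical helper, and the spider name and URL are taken from the example above.

```python
import json


def build_parse_argv(url, spider, callback, meta=None, cb_kwargs=None, depth=1):
    """Assemble a `scrapy parse` command line with JSON-encoded options."""
    argv = ["scrapy", "parse", f"--spider={spider}", "-c", callback, "-d", str(depth)]
    if meta is not None:
        # json.dumps guarantees the value is a valid JSON string.
        argv.append("--meta=" + json.dumps(meta))
    if cb_kwargs is not None:
        argv.append("--cbkwargs=" + json.dumps(cb_kwargs))
    argv.append(url)
    return argv


if __name__ == "__main__":
    argv = build_parse_argv(
        "http://quotes.toscrape.com/",
        spider="toscrape-css",
        callback="parse",
        meta={"foo": "bar"},
        depth=2,
    )
    print(argv)
    # To actually run it (requires Scrapy and a project): subprocess.run(argv, check=True)
```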

6. bench


Syntax: scrapy bench
Usage: runs a quick benchmark test
Requires project: no

Original article: https://blog.csdn.net/xuyuanfan77/article/details/127947164