• Scrapy Part 2: Building a Scrapy Project


    1. Install dependencies

    pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple

    # or, if Anaconda is already installed:
    conda install -c conda-forge scrapy

    If the conda install fails with "CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/...>", switch Anaconda to the Tsinghua mirror:

    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
    conda config --set show_channel_urls yes

    This creates a .condarc configuration file under C:\Users\Administrator (your Windows user directory) that lists the mirror channels. For details see https://blog.csdn.net/wenxingchen/article/details/120161236
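
    Once installed (via either method), you can verify the installation from the command line:

    scrapy version

    This prints the installed Scrapy version and confirms that the scrapy command is on the PATH.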

    2. Create the project

    Create a folder named scrapyproject to hold the project, then open a command prompt and cd into it.

    Create the project:

    scrapy startproject <project name>
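
    For example, using the project name testscrapy (the same name that appears in the settings file later in this post):

    scrapy startproject testscrapy

    This generates a scrapy.cfg deploy file and a testscrapy/ package containing items.py, middlewares.py, pipelines.py, settings.py and a spiders/ directory.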

    Open the project in PyCharm.

     Create a spider: open the Terminal and run the following (the allowed domain must be a plain domain name and must not include an http:// prefix):

    scrapy genspider <spider name> <allowed domain>

    After the command finishes, a basic spider file is created under the spiders folder.
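
    As a rough sketch of that file, assuming the spider was created with "scrapy genspider testspider example.com" (both names are placeholders, and the exact template varies slightly between Scrapy versions):

    import scrapy


    class TestspiderSpider(scrapy.Spider):
        name = 'testspider'                    # the name passed to "scrapy crawl"
        allowed_domains = ['example.com']      # requests outside this domain are filtered out
        start_urls = ['http://example.com/']   # initial URLs to request

        def parse(self, response):
            # fill in your extraction logic here
            pass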

    3. Debug the spider

    Create a file named main.py in the project root with the following content (replace the spider name with your own):

    from scrapy.cmdline import execute
    import os
    import sys

    if __name__ == '__main__':
        # make sure the project root is on sys.path so Scrapy can locate the project settings
        sys.path.append(os.path.dirname(os.path.abspath(__file__)))
        execute(['scrapy', 'crawl', '<spider name>'])

    4. Run the spider

    Running main.py starts the crawler.
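
    This is equivalent to running the crawl command from a terminal in the project root:

    scrapy crawl <spider name>

    The advantage of main.py is that it can be launched under PyCharm's debugger, so you can set breakpoints inside your spider and pipeline code.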

    5. Default settings

    The generated settings.py defines only four parameters out of the box (BOT_NAME, SPIDER_MODULES, NEWSPIDER_MODULE, ROBOTSTXT_OBEY); everything else is commented out.

    6. Other settings

    # Scrapy settings for testscrapy project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'testscrapy'  # bot name; by default the user-agent used while crawling is based on this name

    SPIDER_MODULES = ['testscrapy.spiders']    # modules where Scrapy looks for spiders
    NEWSPIDER_MODULE = 'testscrapy.spiders'    # module where newly generated spiders are placed

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    # USER_AGENT = 'testscrapy (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True  # whether to obey the site's robots.txt; set to False to crawl regardless of whether the site allows it

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    # CONCURRENT_REQUESTS = 32  # number of concurrent requests; if the target has no anti-crawling measures, a high value returns a lot of data quickly

    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    # DOWNLOAD_DELAY = 3  # delay in seconds between downloads; throttles request frequency to avoid getting banned and to avoid putting too much pressure on the target site

    # The download delay setting will honor only one of:
    # CONCURRENT_REQUESTS_PER_DOMAIN = 16  # limit on concurrent requests per domain (default 16)
    # CONCURRENT_REQUESTS_PER_IP = 16  # concurrency limit per IP; if set, the per-domain limit is ignored and the download delay is applied per IP instead, since one domain can map to many IPs (a company may run many servers behind a single domain)

    # Disable cookies (enabled by default)
    # COOKIES_ENABLED = False  # whether cookies from responses are parsed and sent on later requests

    # Disable Telnet Console (enabled by default)
    # TELNETCONSOLE_ENABLED = False  # the Telnet console lets you inspect the running crawler's status and operate on it; open cmd, run "telnet 127.0.0.1 6023", then call est() to enter the console

    # Override the default request headers:
    # DEFAULT_REQUEST_HEADERS = {
    #     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #     'Accept-Language': 'en',
    # }  # default headers attached to every request; headers needed on each request can be configured here

    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    # SPIDER_MIDDLEWARES = {
    #     'testscrapy.middlewares.TestscrapySpiderMiddleware': 543,
    # }  # spider middlewares

    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    # DOWNLOADER_MIDDLEWARES = {
    #     'testscrapy.middlewares.TestscrapyDownloaderMiddleware': 543,
    # }  # downloader middlewares

    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    # EXTENSIONS = {
    #     'scrapy.extensions.telnet.TelnetConsole': None,
    # }  # other extensions

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    # ITEM_PIPELINES = {
    #     'testscrapy.pipelines.TestscrapyPipeline': 300,
    # }  # custom pipelines, mainly used to store scraped data; the integer after each entry determines the run order: items pass through the pipelines from low to high, and the numbers are conventionally kept in the 0-1000 range

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    # AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    # AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    # AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    # AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    # HTTPCACHE_ENABLED = True
    # HTTPCACHE_EXPIRATION_SECS = 0
    # HTTPCACHE_DIR = 'httpcache'
    # HTTPCACHE_IGNORE_HTTP_CODES = []
    # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
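
    As a minimal sketch of how ITEM_PIPELINES is used (the class path testscrapy.pipelines.TestscrapyPipeline comes from the commented template above; the logging body is only an illustration, adapt it to whatever storage you actually need):

    # settings.py: activate the pipeline; 300 is its position in the 0-1000 run order
    ITEM_PIPELINES = {
        'testscrapy.pipelines.TestscrapyPipeline': 300,
    }

    # pipelines.py: log each scraped item and pass it on unchanged
    class TestscrapyPipeline:
        def process_item(self, item, spider):
            spider.logger.info('Scraped item: %r', item)
            return item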
