• Python Crawler: scrapy (4)


    Contents

    Disclaimer

    Goal

    Process

    First, modify the configuration file

    Then modify pipelines.py

    The final result

    read.py

    pipelines.py

    items.py

    settings.py

    Scrapy log messages and log levels

    settings.py configuration

    A quick experiment with Baidu

    Specifying the log level

    WARNING

    Log file

    Note

    POST requests in Scrapy

    Overview

    Scraping Baidu Translate

    Summary


    Disclaimer

    This article is for learning and exchange purposes only; it has no commercial use whatsoever.

    Goal

    This time we will learn how to store the scraped data in a database.

    Process

    First, modify the configuration file

    Add the following to settings.py:

    # todo: MySQL database configuration
    # this is my Alibaba Cloud host; put your own MySQL address here
    DB_HOST = 'xx.xx.xx.xx'
    DB_PORT = 3306
    DB_USER = 'root'
    DB_PASSWORD = '12345678'
    DB_NAME = 'spider01'
    # use 'utf8' here, not 'utf-8' (see the note in the full settings.py below)
    DB_CHARSET = 'utf8'
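
    The pipeline built below writes rows into a table called book2 inside the spider01 database, but the table itself is never shown. The following is a minimal sketch of a one-off setup script that could create it; the id column and the column sizes are my assumptions, not something taken from the original.

    import pymysql

    # hypothetical one-off setup script: create the database and table
    # that MysqlPipeline inserts into
    conn = pymysql.connect(host='xx.xx.xx.xx', port=3306,
                           user='root', password='12345678', charset='utf8')
    with conn.cursor() as cursor:
        cursor.execute("CREATE DATABASE IF NOT EXISTS spider01 DEFAULT CHARSET utf8")
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS spider01.book2 (
                id   INT AUTO_INCREMENT PRIMARY KEY,  -- assumed surrogate key
                name VARCHAR(255),                    -- book title (assumed length)
                src  VARCHAR(500)                     -- cover image URL (assumed length)
            ) DEFAULT CHARSET utf8
        """)
    conn.commit()
    conn.close()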

    Then modify pipelines.py

    Add the following code:

    class MysqlPipeline:
        def process_item(self, item, spider):
            return item

    Then register it in the configuration:

    ITEM_PIPELINES = {
       "scrapy_readbook_090.pipelines.ScrapyReadbook090Pipeline": 300,
       # MysqlPipeline
       "scrapy_readbook_090.pipelines.MysqlPipeline": 301
    }

    … (then fill in the MysqlPipeline logic; the complete code is shown below)

    The final result looks like this:

    read.py

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy_readbook_090.items import ScrapyReadbook090Item


    class ReadSpider(CrawlSpider):
        name = "read"
        allowed_domains = ["www.dushu.com"]
        start_urls = ["https://www.dushu.com/book/1188_1.html"]

        rules = (Rule(LinkExtractor(allow=r"/book/1188_\d+\.html"),
                      callback="parse_item",
                      # follow controls whether extracted links are followed
                      # with follow=True every matching page gets crawled
                      follow=True),)

        def parse_item(self, response):
            img_list = response.xpath('//div[@class="bookslist"]//img')
            for img in img_list:
                name = img.xpath('./@alt').extract_first()
                img_src = img.xpath('./@data-original').extract_first()
                book = ScrapyReadbook090Item(name=name, src=img_src)
                yield book

    pipelines.py

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

    # useful for handling different item types with a single interface
    from itemadapter import ItemAdapter


    class ScrapyReadbook090Pipeline:
        def open_spider(self, spider):
            self.fp = open('book.json', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            self.fp.write(str(item))
            return item

        def close_spider(self, spider):
            self.fp.close()


    # load the project settings file
    from scrapy.utils.project import get_project_settings
    import pymysql


    class MysqlPipeline:
        def open_spider(self, spider):
            settings = get_project_settings()
            self.host = settings['DB_HOST']
            self.port = settings['DB_PORT']
            self.user = settings['DB_USER']
            self.password = settings['DB_PASSWORD']
            self.name = settings['DB_NAME']
            self.charset = settings['DB_CHARSET']
            self.connect()

        def connect(self):
            self.conn = pymysql.connect(
                host=self.host,
                port=self.port,
                user=self.user,
                password=self.password,
                db=self.name,
                charset=self.charset
            )
            # cursor for executing SQL statements
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            sql = 'insert into book2(name,src) values("{}","{}")'.format(item['name'], item['src'])
            # execute the SQL statement
            self.cursor.execute(sql)
            # commit the transaction
            self.conn.commit()
            return item

        def close_spider(self, spider):
            self.cursor.close()
            self.conn.close()
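
    A quick caveat about the SQL above: process_item builds the INSERT statement with str.format, so a book title containing a double quote breaks the statement and the values are not escaped at all. Below is a hedged alternative, a minimal sketch rather than part of the original article; the class name MysqlPipelineSafe is made up, and it simply reuses the connection set up by MysqlPipeline while letting pymysql parameterize the values.

    class MysqlPipelineSafe(MysqlPipeline):
        """Variant of MysqlPipeline that uses a parameterized INSERT."""

        def process_item(self, item, spider):
            # %s placeholders let pymysql escape the values, so quotes in
            # a book title no longer break the statement
            sql = 'INSERT INTO book2 (name, src) VALUES (%s, %s)'
            self.cursor.execute(sql, (item['name'], item['src']))
            self.conn.commit()
            return item

    If you prefer this version, point ITEM_PIPELINES at MysqlPipelineSafe instead of MysqlPipeline.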

    items.py

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html

    import scrapy


    class ScrapyReadbook090Item(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        name = scrapy.Field()
        src = scrapy.Field()

    settings.py

    # Scrapy settings for scrapy_readbook_090 project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

    BOT_NAME = "scrapy_readbook_090"

    SPIDER_MODULES = ["scrapy_readbook_090.spiders"]
    NEWSPIDER_MODULE = "scrapy_readbook_090.spiders"

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = "scrapy_readbook_090 (+http://www.yourdomain.com)"

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    #    "Accept-Language": "en",
    #}

    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    "scrapy_readbook_090.middlewares.ScrapyReadbook090SpiderMiddleware": 543,
    #}

    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    "scrapy_readbook_090.middlewares.ScrapyReadbook090DownloaderMiddleware": 543,
    #}

    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    "scrapy.extensions.telnet.TelnetConsole": None,
    #}

    # todo: MySQL database configuration
    DB_HOST = '8.137.20.36'
    # the port must be an integer
    DB_PORT = 3306
    DB_USER = 'root'
    DB_PASSWORD = '12345678'
    DB_NAME = 'spider01'
    # write 'utf8', not 'utf-8': drop the dash
    DB_CHARSET = 'utf8'

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        "scrapy_readbook_090.pipelines.ScrapyReadbook090Pipeline": 300,
        # MysqlPipeline
        "scrapy_readbook_090.pipelines.MysqlPipeline": 301
    }

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = "httpcache"
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

    # Set settings whose default value is deprecated to a future-proof value
    REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    FEED_EXPORT_ENCODING = "utf-8"

    In the end, 4,000 records were collected.

            Probably because of the order in which the inserts reached the server, 军娃 is not the very last row, but at 40 books per page and 100 pages in total the count works out exactly. (* ^ ▽ ^ *)


    Scrapy log messages and log levels

            Scrapy is a Python-based web crawling framework with powerful logging built in. Its log levels are as follows:

    1. DEBUG: debugging level, used for detailed diagnostic output; mainly useful during development and testing.

    2. INFO: informational level, used for important events such as spider start-up messages and requested URLs.

    3. WARNING: warning level, used for less serious problems, e.g. a page that fails to parse without affecting the rest of the crawl.

    4. ERROR: error level, used for errors such as a misconfigured spider or a broken network connection.

    5. CRITICAL: critical level, used for very serious failures, e.g. broken core spider logic or being unable to reach the target site at all.

        The default log level is DEBUG.
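
    Inside a spider, messages at each of these levels can be emitted through the spider's built-in logger. The snippet below is only an illustrative sketch (the spider name and URL are made up), not code from the project above:

    import scrapy


    class LogLevelDemoSpider(scrapy.Spider):
        # hypothetical spider used only to show the five log levels
        name = "log_level_demo"
        start_urls = ["https://www.baidu.com"]

        def parse(self, response):
            self.logger.debug("detailed debugging output")
            self.logger.info("crawled %s", response.url)
            self.logger.warning("something looks off, but the crawl continues")
            self.logger.error("something went wrong while parsing this page")
            self.logger.critical("a fatal problem, the crawl cannot continue")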

        Scrapy's log output can be printed directly to the console or saved to a file. The log level and the output destination can be changed through the settings file or command-line options.

    Here is an example of Scrapy's log output:

    2021-01-01 12:00:00 [scrapy.core.engine] INFO: Spider opened
    2021-01-01 12:00:01 [scrapy.core.engine] DEBUG: Crawled 200 OK
    2021-01-01 12:00:01 [scrapy.core.engine] DEBUG: Crawled 404 Not Found
    2021-01-01 12:00:02 [scrapy.core.engine] WARNING: Ignoring response <404 Not Found>
    2021-01-01 12:00:02 [scrapy.core.engine] DEBUG: Crawled 200 OK
    2021-01-01 12:00:02 [scrapy.core.engine] ERROR: Spider error processing <GET http://example.com>: Error parsing HTML
    2021-01-01 12:00:03 [scrapy.core.engine] DEBUG: Crawled 200 OK
    2021-01-01 12:00:03 [scrapy.core.engine] INFO: Closing spider (finished)
    2021-01-01 12:00:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats

    settings.py configuration

    The default level is DEBUG, which shows all of the messages above.

    In the settings.py configuration file:

    LOG_FILE: writes everything that would appear on screen to a file instead, and nothing is shown in the console anymore; by convention the file name ends in .log.

    LOG_LEVEL: sets the log level, i.e. which messages are shown and which are suppressed.
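
    The two settings can also be combined in settings.py; a minimal sketch (the file name is just an example):

    # settings.py: keep the console quiet and write WARNING and above to a file
    LOG_FILE = 'logDemo.log'
    LOG_LEVEL = 'WARNING'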

    A quick experiment with Baidu

    First, tear up the "gentlemen's agreement" by commenting out the robots.txt rule:

    # ROBOTSTXT_OBEY = True

    Specifying the log level

    WARNING

    Add the following to settings.py:

    # specify the log level
    LOG_LEVEL = 'WARNING'

    The ========== markers are lines I print from log.py.

    With this in place, no log output appears anymore.
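
    The log.py mentioned above is the small test spider used for this Baidu experiment; its contents are not shown in the article, so the following is only a rough sketch of what such a file might look like (the spider name, domain, and URL are assumptions):

    import scrapy


    class LogSpider(scrapy.Spider):
        # sketch of the throwaway test spider; all details here are assumptions
        name = "log"
        allowed_domains = ["www.baidu.com"]
        start_urls = ["https://www.baidu.com"]

        def parse(self, response):
            # a plain print, so the marker shows up even when logging is silenced
            print("==========")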

    Log file

    First remove the LOG_LEVEL setting added above, then add the following:

    # log file
    LOG_FILE = 'logDemo.log'

    Run it.

    The console stays perfectly clean,

    but the log has been written to the log file.

    Note

    In general, it is better not to change the log level: if something goes wrong, a raised level makes the cause much harder to track down. So when the goal is simply to keep the console from printing so much, write the log to a file instead.


    POST requests in Scrapy

    Overview

            In Scrapy, a POST request can be sent with the scrapy.FormRequest class. Here is an example:

    import scrapy


    class MySpider(scrapy.Spider):
        name = 'example.com'
        start_urls = ['http://www.example.com/login']

        def parse(self, response):
            # extract the csrf token from the login page
            csrf_token = response.css('input[name="csrf_token"]::attr(value)').get()
            # build the form data for the POST request
            formdata = {
                'username': 'myusername',
                'password': 'mypassword',
                'csrf_token': csrf_token
            }
            # send the POST request
            yield scrapy.FormRequest(url='http://www.example.com/login', formdata=formdata, callback=self.after_login)

        def after_login(self, response):
            # check whether the login succeeded
            if response.url == 'http://www.example.com/home':
                self.log('Login successful')
                # handle the logged-in response here
                # ...
            else:
                self.log('Login failed')

            In the example above, the parse method fetches the login page and extracts its csrf token. A dictionary containing the username, password, and csrf token is then built and passed to FormRequest via the formdata argument. Finally, yield sends the POST request, with after_login registered as the callback that handles the response after logging in.

            In after_login, the response URL can be used to tell whether the login succeeded: if it is the post-login home page URL, the login worked; otherwise it failed. On success you can continue processing, for example scraping user information and printing it to the console or the log.

            Note that FormRequest always encodes its body as application/x-www-form-urlencoded. Simply overriding the Content-Type header does not turn the body into JSON; to send a JSON payload, use scrapy.http.JsonRequest, or a plain scrapy.Request with method='POST' and a JSON-encoded body, as sketched below.
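
    A hedged sketch of a real JSON POST (the spider name, URL, and payload below are made up for illustration):

    import json

    import scrapy
    from scrapy.http import JsonRequest


    class JsonPostSpider(scrapy.Spider):
        # hypothetical spider; endpoint and payload are placeholders
        name = "json_post_demo"

        def start_requests(self):
            payload = {'username': 'myusername', 'password': 'mypassword'}

            # Option 1: JsonRequest serializes the dict and sets the JSON headers
            yield JsonRequest(url='http://www.example.com/api/login',
                              data=payload, callback=self.after_login)

            # Option 2: a plain Request with an explicit JSON body;
            # dont_filter avoids the duplicate filter, since both requests
            # hit the same endpoint with the same body
            yield scrapy.Request(url='http://www.example.com/api/login',
                                 method='POST',
                                 body=json.dumps(payload),
                                 headers={'Content-Type': 'application/json'},
                                 callback=self.after_login,
                                 dont_filter=True)

        def after_login(self, response):
            self.logger.info("login response: %s", response.text[:200])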

            Also note that FormRequest has no built-in parameter for file uploads; a multipart upload means building the request body yourself. For more on POST requests and their options, see the official Scrapy documentation.

    Scraping Baidu Translate

    Only testpost.py, the spider file we created ourselves, needs to be modified:

    import scrapy
    import json


    class TestpostSpider(scrapy.Spider):
        name = "testpost"
        allowed_domains = ["fanyi.baidu.com"]
        # A POST request without parameters is pointless,
        # so start_urls is useless here
        # and the parse method is no longer needed either,
        # which is why both are commented out.
        # TODO
        # start_urls = ["https://fanyi.baidu.com/sug"]
        #
        # def parse(self, response):
        #     print("==========================")

        # POST requests are issued from start_requests
        def start_requests(self):
            url = 'https://fanyi.baidu.com/sug'
            data = {
                'kw': 'final'
            }
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse_second)

        def parse_second(self, response):
            content = response.text
            # json.loads no longer accepts an encoding argument (removed in Python 3.9)
            obj = json.loads(content)
            print(obj)
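
    As a follow-up, the sketch below is a drop-in replacement for parse_second that pulls the individual suggestions out of the reply. It assumes the sug endpoint returns its usual shape of {'errno': ..., 'data': [{'k': ..., 'v': ...}]}; that structure is an assumption, not something verified in the original article.

        def parse_second(self, response):
            # response.json() parses the body without any encoding argument
            obj = response.json()
            # assumed response shape: {'errno': ..., 'data': [{'k': ..., 'v': ...}]}
            for entry in obj.get('data', []):
                print(entry.get('k'), '->', entry.get('v'))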

    Summary

            From February 29 to today, March 9, ten days have passed and the crawler basics are done, all the way from urllib to Scrapy. The road is long but not complicated; Python package version problems kept getting in my way, yet I solved every one of them. Difficulties are only difficulties; where there is a will there is a way, and my fate is in my own hands. Keep going!!! ヾ(◍°∇°◍)ノ゙

    ヾ( ̄▽ ̄)Bye~Bye~

    The series is complete, time to celebrate!

  • Original article: https://blog.csdn.net/DDDDWJDDDD/article/details/136572727