Scrapy Usage and Study Notes


    Preface

    Scrapy is an excellent crawler framework built on the Twisted asynchronous programming framework, and its use of yield is elegant. Through the scheduler and downloader, Scrapy can be extended programmatically; the plugin ecosystem is rich, and integration with Selenium or Playwright is straightforward.

    Of course, Scrapy by itself cannot handle the AJAX requests made inside a page, but combined with mitmproxy it can do almost anything: Scrapy + Playwright simulates user clicks, while mitmproxy captures the traffic in the background. Log in once, run for a whole day.
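    That division of labor can be sketched as a small mitmproxy addon that saves the JSON responses flowing through the proxy while Scrapy + Playwright clicks through the site. This is only a sketch under assumptions: the output directory, the file naming, and the content-type filter are illustrative, not part of any particular project.

    # dump_json.py -- a hypothetical mitmproxy addon, run with: mitmdump -s dump_json.py
    from pathlib import Path

    from mitmproxy import http

    OUT_DIR = Path("captured")        # assumed output directory
    OUT_DIR.mkdir(exist_ok=True)


    class DumpJson:
        def __init__(self):
            self.count = 0

        def response(self, flow: http.HTTPFlow) -> None:
            # keep only AJAX/JSON responses; everything else just passes through
            if "application/json" in flow.response.headers.get("content-type", ""):
                self.count += 1
                (OUT_DIR / f"{self.count:06d}.json").write_bytes(flow.response.content)


    addons = [DumpJson()]

    The Playwright-driven browser is then pointed at the proxy (for example via the browser launch proxy option), so every background request the page makes passes through mitmproxy and gets recorded.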

    In the end I tied these tools together with asyncio and achieved stable, unattended automated operation: article after article is fed into my Elasticsearch cluster, passes through the knowledge-factory pipeline, and becomes a knowledge product.
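    One concrete way to feed articles into Elasticsearch is a Scrapy item pipeline that indexes each item as it is scraped. The sketch below assumes the elasticsearch-py 8.x client, a local cluster at localhost:9200, and an index called articles; all of these names are placeholders, not the original pipeline.

    # pipelines.py -- a minimal sketch of an Elasticsearch indexing pipeline
    from elasticsearch import Elasticsearch


    class ElasticsearchPipeline:
        def open_spider(self, spider):
            self.es = Elasticsearch("http://localhost:9200")   # assumed local cluster

        def process_item(self, item, spider):
            # index every crawled article; "articles" is a hypothetical index name
            self.es.index(index="articles", document=dict(item))
            return item

    The pipeline would then be enabled through ITEM_PIPELINES in settings.py.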

    "Crawlers + data, algorithms + intelligence" is one technologist's ideal.

    Configuration and Running

    Installation:

    pip install scrapy

    With scrapy.cfg and settings.py present in the current directory, Scrapy can be run.
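    A minimal layout, assuming the project is named ispider as in the settings.py shown further below, looks roughly like this, with scrapy.cfg pointing at the settings module:

    scrapy.cfg
    ispider/
        __init__.py
        settings.py
        spider.py        # spider module listed in SPIDER_MODULES

    # scrapy.cfg
    [settings]
    default = ispider.settings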

    Run from the command line:

    scrapy crawl ArticleSpider

    There are three ways to run it from within a program. The first uses scrapy.cmdline.execute:

    # Equivalent to running `scrapy crawl ArticleSpider` on the command line;
    # note that execute() does not return (it exits the process when the crawl ends)
    from scrapy.cmdline import execute

    execute('scrapy crawl ArticleSpider'.split())
    

    Using CrawlerRunner:

    # Using CrawlerRunner: the caller installs and drives the Twisted reactor itself
    from twisted.internet import asyncioreactor
    asyncioreactor.install()  # install the asyncio-based reactor before the reactor is imported

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    runner = CrawlerRunner(settings)
    d = runner.crawl(ArticleSpider)
    d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl finishes
    reactor.run()
    

    Using CrawlerProcess:

    # Using CrawlerProcess: it installs and manages the reactor for you
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(ArticleSpider)
    process.start()
    

    Integration with Playwright

    A big advantage of using Playwright is automated data collection with a headless browser. A headless browser is a special kind of web browser that exposes an API for automation. By installing the asyncio reactor, the asyncio-based libraries that drive the headless browser can be integrated.

    import scrapy
    from playwright.async_api import async_playwright
    
    class PlaywrightSpider(scrapy.Spider):
        name = "playwright"
        start_urls = ["data:,"]  # avoid using the default Scrapy downloader
    
        async def parse(self, response):
            async with async_playwright() as pw:
                browser = await pw.chromium.launch()
                page = await browser.new_page()
                await page.goto("https://example.org")
                title = await page.title()
                return {"title": title}
    

    Using playwright-python directly, as in the example above, bypasses most Scrapy components (middlewares, the dupefilter, and so on). For proper integration, scrapy-playwright is recommended.

    Installation:

    pip install scrapy-playwright
    playwright install                    # install the default browsers
    playwright install firefox chromium   # or install only specific browsers

    settings.py configuration

    BOT_NAME = 'ispider'

    SPIDER_MODULES = ['ispider.spider']

    # Run Scrapy on the asyncio reactor (required by scrapy-playwright)
    TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
    # Route http/https downloads through scrapy-playwright
    DOWNLOAD_HANDLERS = {
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }

    CONCURRENT_REQUESTS = 32                  # overall request concurrency
    PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4      # limit open pages per browser context
    CLOSESPIDER_ITEMCOUNT = 100               # stop the spider after this many items

    # Connect to an already running browser over CDP instead of launching one
    PLAYWRIGHT_CDP_URL = "http://localhost:9900"
    
    
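    PLAYWRIGHT_CDP_URL makes scrapy-playwright connect to an already running browser over the Chrome DevTools Protocol instead of launching its own. A matching browser has to be listening on that port, started for example like this (the binary name varies between chrome, chromium and chromium-browser depending on the system):

    chromium --headless --remote-debugging-port=9900

    If the setting is omitted, scrapy-playwright simply launches its own browser.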

    Spider definition

    import asyncio
    import logging
    from typing import AsyncGenerator, Optional

    from scrapy import Request, Spider
    from scrapy.http import Response

    logger = logging.getLogger(__name__)


    class ArticleSpider(Spider):

        name = "ArticleSpider"
        custom_settings = {
            # The scrapy-playwright settings can also be set per spider instead of in settings.py:
            # "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            # "DOWNLOAD_HANDLERS": {
            #     "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            #     "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            # },
            # "CONCURRENT_REQUESTS": 32,
            # "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 4,
            # "CLOSESPIDER_ITEMCOUNT": 100,
        }
        start_urls = ["https://blog.csdn.net/nav/lang/javascript"]

        def __init__(self, name=None, **kwargs):
            super().__init__(name, **kwargs)
            logger.debug('ArticleSpider initialized.')

        def start_requests(self):
            for url in self.start_urls:
                yield Request(
                    url,
                    meta={
                        "playwright": True,
                        "playwright_context": "first",
                        "playwright_include_page": True,
                        "playwright_page_goto_kwargs": {
                            "wait_until": "domcontentloaded",
                        },
                    },
                )

        async def parse(self, response: Response, current_page: Optional[int] = None) -> AsyncGenerator:
            # The Playwright page is available because playwright_include_page is True
            page = response.meta["playwright_page"]
            title = await page.title()
            # Scroll down repeatedly so the page keeps loading more articles
            for _ in range(10):
                await page.mouse.wheel(delta_x=0, delta_y=200)
                await asyncio.sleep(3)
            content = await page.content()
            await page.close()
            yield {"title": title, "html": content}
    
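    Once the scrolling loop has loaded enough entries, the rendered HTML can be handed back to a Scrapy selector to pull out the article links. The helper below is only a sketch: the CSS selector is a placeholder and has to be adapted to the real page markup.

    from scrapy.selector import Selector


    async def extract_article_links(page) -> list[str]:
        # build a Selector from the fully rendered page and collect article URLs;
        # "div.blog-list a::attr(href)" is a hypothetical selector for the list page
        html = await page.content()
        return Selector(text=html).css("div.blog-list a::attr(href)").getall()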
