• Hand-written myscrapy (Part 8)


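    Project repository: https://gitee.com/wyu_001/myscrapy
    This part continues by explaining how to run multiple spider scripts concurrently with threads.
    The project root contains a batch.py file, the script for running several spiders in one batch. It uses a thread pool to run the spider classes under spider concurrently. The spider files to run can also be set in setting.py: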

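    #batch
    #By default, a batch run executes every subclass of MySpider found under spider
    #Batch run parameter: number of threads running concurrently
    
    BATCH_THREADS = 10
    
    #batch run files in list
    #Restrict the run to these script files under spider
    BATCH_FILES = ['dxyqueryhospital.py',
                   'haodfqueryhospital.py'
                   ]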
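
As a rough illustration of what a thread count like BATCH_THREADS controls, here is a minimal sketch assuming the value is passed to ThreadPoolExecutor as max_workers, which is what batch.py does. The fake_spider function, the counters, and the value 3 are hypothetical stand-ins; they only record how many jobs run at the same time.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

BATCH_THREADS = 3  # hypothetical value for the sketch; the setting above uses 10

_lock = threading.Lock()
_running = 0
peak_running = 0  # highest number of jobs observed running at once


def fake_spider(name):
    """Hypothetical stand-in for a spider's start_request()."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    time.sleep(0.05)  # pretend to fetch pages
    with _lock:
        _running -= 1
    return name


def run_all(names, workers=BATCH_THREADS):
    """Submit every job up front; the pool never runs more than `workers` at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fake_spider, n) for n in names]
        # result() re-raises any exception raised inside a job
        return sorted(f.result() for f in as_completed(futures))


finished = run_all([f"spider{i}" for i in range(8)])
```

Eight jobs are submitted at once, yet peak_running never exceeds BATCH_THREADS, because the extra jobs wait in the pool's queue until a worker is free.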
    

    The batch.py script is shown below:

    import inspect
    from os import listdir,getcwd
    from os.path import isfile,join
    import importlib
    
    from config.setting import BATCH_THREADS
    from config.setting import BATCH_FILES
    
    from concurrent.futures import ThreadPoolExecutor,as_completed
    
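    crawls=[]
    
    # every regular file in the spider directory is a candidate
    lib_dir = "spider"
    file_path = join(getcwd(),lib_dir)
    crawl_files = [ f for f in listdir(file_path) if isfile(join(file_path,f))]
    
    crawls_sets = set(crawl_files)
    batch_sets = set(BATCH_FILES)
    
    
    # a non-empty BATCH_FILES restricts the run to the listed files that actually exist
    if len(batch_sets):
        crawl_files = list(crawls_sets.intersection(batch_sets))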
    
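    for file in crawl_files:
    
        if file != "__init__.py" :
            # relative import: ".name" resolved against the spider package
            file = f'.{file.split(".")[0]}'
            module = importlib.import_module(file,lib_dir)
    
            # instantiate every class whose direct parent is named MySpider
            for name ,obj in inspect.getmembers(module,inspect.isclass):
                if obj.__base__.__name__ == "MySpider":
                    crawls.append(obj())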
    
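    thread_num = 0
    
    tasks = []
    
    # max_workers already caps how many spiders run at the same time
    with ThreadPoolExecutor(max_workers=BATCH_THREADS) as tp:
    
        while crawls:
            task = tp.submit(crawls.pop().start_request)
            tasks.append(task)
    
            thread_num += 1
            # once a full batch is submitted, wait for everything submitted so far
            if thread_num >= BATCH_THREADS:
                for future in as_completed(tasks):
                    finish = future.result()
    
                thread_num = 0
    
        # wait for the last, possibly partial, batch; result() re-raises spider exceptions
        for future in as_completed(tasks):
            finish = future.result()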
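
The check on `obj.__base__.__name__` in batch.py only matches classes whose direct parent is named MySpider. Below is a minimal sketch of an alternative that uses issubclass, which also picks up indirect subclasses. The MySpider class, the spider classes, and the in-memory module here are hypothetical stand-ins so the example runs on its own; they are not the project's real classes.

```python
import inspect
import types


class MySpider:
    """Hypothetical stand-in for the project's base spider class."""

    def start_request(self):
        return type(self).__name__


class HospitalSpider(MySpider):
    pass


class PagedHospitalSpider(HospitalSpider):  # indirect subclass of MySpider
    pass


class Helper:  # unrelated class that must be ignored
    pass


def find_spiders(module, base=MySpider):
    """Return every class in `module` that derives from `base`, excluding `base` itself."""
    return [
        obj
        for _, obj in inspect.getmembers(module, inspect.isclass)
        if issubclass(obj, base) and obj is not base
    ]


# Build a throwaway module so the sketch runs without a spider/ package on disk.
fake_module = types.ModuleType("fake_spiders")
for cls in (MySpider, HospitalSpider, PagedHospitalSpider, Helper):
    setattr(fake_module, cls.__name__, cls)

spider_names = [cls.__name__ for cls in find_spiders(fake_module)]
```

In a real spider module, inspect.getmembers also returns classes that the module merely imported, which is why the base class itself is excluded here.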
    
    
  • Original article: https://blog.csdn.net/semicolon_hello/article/details/136303804