• Crawler learning diary, part 7 (querying GitHub's repository search API, which doesn't really count as crawling)


    GitHub provides a repository search API at https://api.github.com/

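    import mysql.connector
    import requests

    # Connect to the database
    db = mysql.connector.connect(
        host="***",
        user="***",
        password="***",
        database="***"
    )
    # Create a cursor
    cursor = db.cursor()
    # Read the CVE IDs from the database. The string literals are values stored in
    # the table: '无CVE' means "no CVE", '暂无可利用代码' means "no exploit code available yet".
    cursor.execute("SELECT cve_id FROM vules WHERE cve_id != '无CVE' AND poc != '暂无可利用代码'")
    cve_ids = cursor.fetchall()
    print(cve_ids)

    # Iterate over the list of CVE IDs
    for cve_id in cve_ids:
        cve_id = cve_id[0]  # Each row is a 1-tuple; extract the CVE ID value
        # Search GitHub for the CVE ID
        URL = f'https://api.github.com/search/repositories?q={cve_id}&sort=stars'
        r = requests.get(URL)
        response_dict = r.json()
        print(response_dict)
        repo_dicts = response_dict['items']
        # Collect the web URL of every repository in the result
        results = [repo["html_url"] for repo in repo_dicts]
        print(results)
    # Close the database connection
    db.close()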
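The request and the result handling above can be pulled into two small helpers. This is only a sketch: the names `build_search_url` and `extract_repo_urls` are made up for illustration, and `order` and `per_page` are extra query parameters that the original code does not send. The query string goes through `urlencode`, so a search term containing spaces or other special characters does not break the URL.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(cve_id, per_page=30):
    """Build the repository search URL for one CVE ID (hypothetical helper)."""
    query = urlencode({
        "q": cve_id,           # search term, URL-encoded by urlencode
        "sort": "stars",       # same sort key as the original code
        "order": "desc",       # assumption: most-starred repositories first
        "per_page": per_page,  # assumption: page size, not set in the original
    })
    return f"{SEARCH_ENDPOINT}?{query}"


def extract_repo_urls(response_dict):
    """Return the html_url of every repository in a search response.

    An error payload (for example a rate-limit message) has no 'items' key,
    so it yields an empty list instead of raising KeyError.
    """
    return [repo["html_url"] for repo in response_dict.get("items", [])]
```

With these in place, the loop body would reduce to something like `extract_repo_urls(requests.get(build_search_url(cve_id)).json())`.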
    

    This fails with an error: the API rate limit has been exceeded.
    {'message': "API rate limit exceeded for ******. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)", 'documentation_url': 'https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting'}
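Besides the JSON message, GitHub reports the quota in the response headers (`X-RateLimit-Remaining`, `X-RateLimit-Reset`, and in some cases `Retry-After`). A rough sketch of turning those headers into a wait time follows; `seconds_until_reset` is a hypothetical helper, and it assumes the headers arrive as a plain dict with exactly this capitalisation (the headers object in `requests` is case-insensitive, a plain dict is not).

```python
import time


def seconds_until_reset(headers, now=None):
    """Estimate how long to wait before the next request (hypothetical helper).

    headers: mapping of response header names to string values.
    now: current Unix time; defaults to time.time(), injectable for testing.
    """
    now = time.time() if now is None else now

    # If the server names an explicit delay in seconds, use it.
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        return max(0.0, float(retry_after))

    # Quota left (or header missing): no need to wait.
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0

    # Quota used up: wait until the reset time, given as Unix epoch seconds.
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```

In the loop this could be used as `sleep(seconds_until_reset(r.headers))` before the next request.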

    Authentication needs to be added.
    On GitHub, go to Settings / Developer settings / Personal access tokens / Generate new token, then copy the generated token and save it.

    headers = {'User-Agent': 'Mozilla/5.0',
               'Authorization': 'token ***',  # fill in the token you obtained
               'Content-Type': 'application/json',
               'Accept': 'application/json'
              }
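Rather than pasting the token into the source file, it can be read from the environment. The sketch below assumes the token is stored in an environment variable called `GITHUB_TOKEN` (a name chosen here for illustration) and swaps in `application/vnd.github+json` as the `Accept` value, which is the media type GitHub recommends for its REST API; `build_headers` is a hypothetical helper, not something from the original code.

```python
import os


def build_headers(token=None):
    """Build request headers for the GitHub API (hypothetical helper).

    token: personal access token; if omitted, it is read from the
    GITHUB_TOKEN environment variable (an assumed name).
    """
    token = token or os.environ.get("GITHUB_TOKEN")
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/vnd.github+json",
    }
    # Only send Authorization when a token is actually available;
    # without it the request falls back to the unauthenticated limit.
    if token:
        headers["Authorization"] = f"token {token}"
    return headers
```

The result is passed to `requests.get(URL, headers=build_headers())`, and the token never ends up in a published post or a repository.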
    

    With the token the rate limit improved somewhat, but the error came back:

    {'message': 'API rate limit exceeded for user ID ******. If you reach out to GitHub Support for help, please include the request ID FCA0:25083D:2521AF:27C7D1:6528B0F7.', 'documentation_url': 'https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting'}

    Multithreading is out of the question, so the only option is to add sleep and fetch slowly, wrapped in try/except.

    from time import sleep

    for cve_id in cve_ids:
        try:
            cve_id = cve_id[0]  # Each row is a 1-tuple; extract the CVE ID value
            print(cve_id)
            # Search GitHub for the CVE ID
            URL = f'https://api.github.com/search/repositories?q={cve_id}&sort=stars'
            r = requests.get(URL, headers=headers)
            response_dict = r.json()
            print(response_dict)
            # A rate-limited response has no 'items' key; the KeyError lands in the except branch
            repo_dicts = response_dict['items']
            results = [repo["html_url"] for repo in repo_dicts]
            # Store the repository URLs as one comma-separated string
            result = ','.join(results)
            sql = "UPDATE vules SET repositories=%s WHERE cve_id=%s;"
            values = (result, cve_id)
            cursor.execute(sql, values)
            db.commit()

            print(results)
            sleep(1)
        except Exception as e:
            # Print the exception, wait a few seconds, then move on to the next CVE ID
            print("Exception occurred:", str(e))
            sleep(5)
            continue
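One thing to note about the loop above: when a request is rate-limited, the `except` branch sleeps and then moves on, so that CVE ID never gets its repositories saved. A possible variation is to retry the same ID a few times with a growing delay. The sketch below is only an illustration: `search_with_retry`, `max_attempts` and `base_delay` are made-up names, and `fetch` stands for any zero-argument function that returns the decoded JSON, for example `lambda: requests.get(URL, headers=headers).json()`.

```python
import time


def search_with_retry(fetch, max_attempts=3, base_delay=5.0, sleep_fn=time.sleep):
    """Call fetch() until the payload contains 'items' (hypothetical helper).

    fetch: zero-argument callable returning the decoded JSON response.
    sleep_fn: injectable so the waiting can be faked in tests.
    Returns the list of repository URLs, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        payload = fetch()
        if "items" in payload:
            return [repo["html_url"] for repo in payload["items"]]
        # No 'items' key: most likely a rate-limit message. Wait longer after
        # each failure (5 s, 10 s, ...), but do not sleep after the last try.
        if attempt < max_attempts - 1:
            sleep_fn(base_delay * 2 ** attempt)
    return None
```

As far as I can tell from GitHub's documentation, the search endpoints have their own lower limit (around 30 requests per minute when authenticated), so a pause of about 2 seconds between successful calls may be a safer choice than `sleep(1)`.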
    
  • Original article: https://blog.csdn.net/qq_55675216/article/details/133805853