• Crawler Learning Diary, Part 7 (fetching GitHub's repository search API, which isn't really crawling)


    GitHub provides a repository search API at https://api.github.com/
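
    Before wiring it into the database loop below, here is a minimal sketch of what the search endpoint returns. The q and sort query parameters and the total_count / items response fields are the ones used in the rest of this post; the CVE in the query is just a sample value:

    import requests

    # Unauthenticated request to the repository search endpoint
    resp = requests.get(
        'https://api.github.com/search/repositories',
        params={'q': 'CVE-2021-44228', 'sort': 'stars'},
    )
    data = resp.json()
    print(data['total_count'])           # how many repositories matched
    print(data['items'][0]['html_url'])  # URL of the top-starred match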

    import mysql.connector
    import requests

    # Connect to the database
    db = mysql.connector.connect(
        host="***",
        user="***",
        password="***",
        database="***"
    )
    # Create a cursor
    cursor = db.cursor()
    # Read the CVE IDs from the database
    # ('无CVE' = "no CVE", '暂无可利用代码' = "no exploit code available" are values stored in the table)
    cursor.execute("SELECT cve_id FROM vules WHERE cve_id != '无CVE' AND poc != '暂无可利用代码'")
    cve_ids = cursor.fetchall()
    print(cve_ids)

    # Iterate over the CVE ID list
    for cve_id in cve_ids:
        cve_id = cve_id[0]  # extract the CVE ID value
        # Search GitHub for the CVE ID
        URL = f'https://api.github.com/search/repositories?q={cve_id}&sort=stars'
        r = requests.get(URL)
        response_dict = r.json()
        print(response_dict)
        repo_dicts = response_dict['items']
        results = []
        for repo_dict in repo_dicts:
            results.append(repo_dict["html_url"])
        print(results)
    # Close the database connection
    db.close()
    

    This errored, since GitHub rate-limits API access:
    {'message': "API rate limit exceeded for ******. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)", 'documentation_url': 'https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting'}
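
    Before adding authentication, one way to see how much quota is left is GitHub's rate_limit endpoint. A small sketch, assuming an unauthenticated call (the docs note this endpoint does not count against your quota):

    import requests

    r = requests.get('https://api.github.com/rate_limit')
    rate = r.json()['rate']
    print(rate['remaining'])  # requests left in the current window
    print(rate['reset'])      # Unix timestamp when the window resets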

    Authentication needs to be added.
    On your GitHub profile, go to Settings / Developer settings / Personal access tokens / Generate new token, then copy and save the generated token.

    headers = {'User-Agent': 'Mozilla/5.0',
               'Authorization': 'token ef802a122df2e4d29d9b1b868a6fefb14f22b272',    # fill in the token you obtained
               'Content-Type': 'application/json',
               'Accept': 'application/json'
              }
    

    With the token added, the rate limit held up better for a while, but it errored again:

    {'message': 'API rate limit exceeded for user ID ******. If you reach out to GitHub Support for help, please include the request ID FCA0:25083D:2521AF:27C7D1:6528B0F7.', 'documentation_url': 'https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting'}

    Multithreading is out of the question, so the only option left was to add sleep and read slowly, plus a try/except.

        # Inside the loop over cve_ids; requires `from time import sleep`
        try:
            print(cve_id)
            cve_id = cve_id[0]  # extract the CVE ID value
            # Search GitHub for the CVE ID
            URL = f'https://api.github.com/search/repositories?q={cve_id}&sort=stars'
            r = requests.get(URL, headers=headers)
            response_dict = r.json()
            print(response_dict)
            repo_dicts = response_dict['items']
            results = []
            for repo_dict in repo_dicts:
                results.append(repo_dict["html_url"])
            result = ','.join(results)
            # Write the matching repository URLs back to the database
            sql = "UPDATE vules SET repositories=%s WHERE cve_id=%s;"
            values = (result, cve_id)
            cursor.execute(sql, values)
            db.commit()

            print(results)
            sleep(1)
        except Exception as e:
            # On an exception, print it, wait a few seconds, then continue the loop
            print("Exception occurred:", str(e))
            sleep(5)
            continue
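
    Instead of a flat sleep, an alternative (not what the script above does) is to watch the X-RateLimit-Remaining and X-RateLimit-Reset headers that come back with every API response and only wait when the quota is actually exhausted. search_with_backoff below is a hypothetical helper sketched under that assumption:

    import time
    import requests

    def search_with_backoff(url, headers):
        r = requests.get(url, headers=headers)
        remaining = int(r.headers.get('X-RateLimit-Remaining', '1'))
        if remaining == 0:
            # Sleep until the rate-limit window resets, then retry once
            reset_at = int(r.headers.get('X-RateLimit-Reset', '0'))
            wait = max(reset_at - time.time(), 0) + 1
            print(f'Rate limit hit, sleeping {wait:.0f} seconds')
            time.sleep(wait)
            r = requests.get(url, headers=headers)
        return r.json()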
    
  • Original article: https://blog.csdn.net/qq_55675216/article/details/133805853