• Fixing urllib.error.URLError, plus web-scraping basics


    urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>

    Adding these two lines at the top of the script resolves the error (note that this disables HTTPS certificate verification for the whole process, so it is only appropriate for local testing):

    import ssl
    ssl._create_default_https_context = ssl._create_unverified_context

    The full code:

    import urllib.parse
    import urllib.request
    import ssl

    ssl._create_default_https_context = ssl._create_unverified_context
    data1 = bytes(urllib.parse.urlencode({'name': 'geometry'}), encoding='utf-8')
    response = urllib.request.urlopen('https://www.httpbin.org/post', data=data1)
    print(response.read().decode('utf-8'))

    The output below shows that the POSTed form data reached the server and was echoed back:

    {
      "args": {},
      "data": "",
      "files": {},
      "form": {
        "name": "geometry"
      },
      "headers": {
        "Accept-Encoding": "identity",
        "Content-Length": "13",
        "Content-Type": "application/x-www-form-urlencoded",
        "Host": "www.httpbin.org",
        "User-Agent": "Python-urllib/3.10",
        "X-Amzn-Trace-Id": "Root=1-630329c8-79bee34a3feaa8dc06de8d21"
      },
      "json": null,
      "origin": "117.30.119.96",
      "url": "https://www.httpbin.org/post"
    }

    Next, on to the requests library.

    Install it: pip3 install requests

    pip3 install requests
    Collecting requests
    Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ca/91/6d9b8ccacd0412c08820f72cebaa4f0c0441b5cda699c90f618b6f8a1b42/requests-2.28.1-py3-none-any.whl (62 kB)
    |████████████████████████████████| 62 kB 502 kB/s
    Collecting charset-normalizer<3,>=2
    Downloading https://pypi.tuna.tsinghua.edu.cn/packages/db/51/a507c856293ab05cdc1db77ff4bc1268ddd39f29e7dc4919aa497f0adbec/charset_normalizer-2.1.1-py3-none-any.whl (39 kB)
    Requirement already satisfied: certifi>=2017.4.17 in /Users/apple/PycharmProjects/spydemo1/venv/lib/python3.10/site-packages (from requests) (2022.6.15)
    Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/apple/PycharmProjects/spydemo1/venv/lib/python3.10/site-packages (from requests) (1.26.9)
    Requirement already satisfied: idna<4,>=2.5 in /Users/apple/PycharmProjects/spydemo1/venv/lib/python3.10/site-packages (from requests) (3.3)
    Installing collected packages: charset-normalizer, requests
    Successfully installed charset-normalizer-2.1.1 requests-2.28.1

    A basic example:

    import requests

    urlbase = 'https://www.baidu.com'
    r = requests.get(urlbase)
    print(type(r))
    print(r.status_code)
    print(type(r.text))
    print(r.text[:100])
    print(r.cookies)

    Output:

    <class 'requests.models.Response'>
    200
    <class 'str'>
    html>
    <html> <head><meta http-equiv=content-type content=text/html;charse
    <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

    requests supports the basic HTTP verbs over HTTPS: GET, POST, DELETE, and so on. A short sketch of POST and DELETE follows, and then a GET request with parameters.
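    As a minimal sketch against the same httpbin service (the payload here is illustrative, not from the original post), the other verbs look like this:

    import requests

    # POST form data; httpbin echoes it back under "form"
    r = requests.post('https://www.httpbin.org/post', data={'name': 'germey'})
    print(r.status_code)

    # DELETE; httpbin echoes the request details back
    r = requests.delete('https://www.httpbin.org/delete')
    print(r.status_code)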

    GET with parameters

    urlparam = 'https://www.httpbin.org/get?name=germey&age=25'
    r1 = requests.get(urlparam)
    print(r1.text)

    Output: the server echoes back the name=germey, age=25 query data in args:

    {
      "args": {
        "age": "25",
        "name": "germey"
      },
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "www.httpbin.org",
        "User-Agent": "python-requests/2.28.1",
        "X-Amzn-Trace-Id": "Root=1-63032c31-5d23881d3de2fc404c11c5e5"
      },
      "origin": "117.30.119.129",
      "url": "https://www.httpbin.org/get?name=germey&age=25"
    }
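    The query string can also be built for you by passing a dict through the params argument; a minimal equivalent of the request above:

    import requests

    payload = {'name': 'germey', 'age': 25}
    r1 = requests.get('https://www.httpbin.org/get', params=payload)
    print(r1.url)  # https://www.httpbin.org/get?name=germey&age=25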

    The response can also be parsed directly into JSON:

    print(r1.json())

    {'args': {'age': '25', 'name': 'germey'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'www.httpbin.org', 'User-Agent': 'python-requests/2.28.1', 'X-Amzn-Trace-Id': 'Root=1-63032cb8-3f780dc507c9344509ebe52b'}, 'origin': '117.30.118.141', 'url': 'https://www.httpbin.org/get?name=germey&age=25'}
    

    Fetching binary data

    Images, audio, and video are transferred as raw bytes. r.text decodes those bytes into a string, which looks like garbage for an image; r.content returns the bytes themselves, and writing them to a file recovers the picture, as the sketch below shows.
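    A minimal sketch (the favicon URL is an arbitrary example, not from the original post):

    import requests

    r = requests.get('https://github.com/favicon.ico')
    print(type(r.text), type(r.content))  # str vs bytes

    # write the raw bytes to a file; opening favicon.ico shows the image
    with open('favicon.ico', 'wb') as f:
        f.write(r.content)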

     

    requests often needs a request header, typically a browser-style User-Agent, before a site will return the real content.

    Trying a request without custom headers can fail; for example:

    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='ssr1.csrape.centor', port=443): Max retries exceeded with url: / (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
    

    The request errors out, so add a request header. The code:

    import requests

    urlbase4 = 'https://www.sina.com.cn/'
    myheaders = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        # 'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)'
    }
    r3 = requests.get(urlbase4, headers=myheaders)
    print(r3.text[:400])

    Getting a usable request header is easy: capture a real browser request and copy it. The code prints the first 400 characters of the Sina homepage; the Chinese in the body comes out garbled because the response encoding has not been handled yet:

    html>
    <html>
    Content-type" content="text/html; charset=utf-8" />
    content="IE=edge" />
    æ°æµªé¦é¡µ
    content="æ°æµª,æ°æµªç½,SINA,sina,sina.com.cn,æ°æµªé¦é¡µ,é¨æ·,èµè®¯" />
    content="æ°æµªç½ä¸ºå¨çç¨
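    The garbling above is the undecoded Chinese just mentioned. A minimal sketch of the fix: set the encoding on the response before reading r3.text (Sina's page declares charset=utf-8, as visible in the output):

    r3 = requests.get(urlbase4, headers=myheaders)
    r3.encoding = 'utf-8'   # tell requests how to decode the body
    print(r3.text[:400])    # the Chinese text now prints correctly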

    requests and urllib speak the HTTP/1.1 protocol only; if you need HTTP/2.0, use httpx.

    Install httpx:

    Installing collected packages: rfc3986, h11, anyio, httpcore, httpx
    Attempting uninstall: h11
    Found existing installation: h11 0.13.0
    Uninstalling h11-0.13.0:
    Successfully uninstalled h11-0.13.0
    Successfully installed anyio-3.6.1 h11-0.12.0 httpcore-0.15.0 httpx-0.23.0 rfc3986-1.5.0

    Usage is basically the same as requests:

    import httpx

    urlbase = 'https://www.httpbin.org/get'
    with httpx.Client() as client:
        response = client.get(urlbase)
        print(response)
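    Note that httpx speaks HTTP/1.1 by default; HTTP/2 needs the optional h2 dependency (pip3 install 'httpx[h2]') and the http2=True flag. A minimal sketch:

    import httpx

    # http2=True enables HTTP/2 negotiation; requires the h2 package
    with httpx.Client(http2=True) as client:
        response = client.get('https://www.httpbin.org/get')
        print(response.http_version)  # 'HTTP/2' if the server supports it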

    Asynchronous crawling

    Asynchronous crawling starts with asyncio coroutines. The example below shows that calling an async function only creates a coroutine object; its body runs only when an event loop executes it:

    import asyncio

    async def execute(x):
        print('Number:', x)
        return x

    coroutine = execute(1)                   # only creates a coroutine object; nothing runs yet
    print('Coroutine:', coroutine)
    print('After calling execute')

    task = asyncio.ensure_future(coroutine)  # wrap the coroutine in a Task
    print('Task:', task)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(task)            # the coroutine body actually runs here
    print('Task:', task)
    print('After calling loop')
  • Original post: https://blog.csdn.net/keny88888/article/details/126466179