

    Python Web Scraping Basics

    1.1 Theory

    In a browser, append /robots.txt to a site's root URL to see which paths the site allows crawlers to visit (a sketch after the listing below shows how to check this from Python).

    For example, visit: https://www.csdn.net/robots.txt

    User-agent: *
    Disallow: /scripts
    Disallow: /public
    Disallow: /css/
    Disallow: /images/
    Disallow: /content/
    Disallow: /ui/
    Disallow: /js/
    Disallow: /scripts/
    Disallow: /article_preview.html*
    Disallow: /tag/
    Disallow: /?
    Disallow: /link/
    Disallow: /tags/
    Disallow: /news/
    Disallow: /xuexi/
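
    Rather than reading robots.txt by eye, the Python standard library can check a path for you. A minimal sketch using urllib.robotparser (the user agent "*" and the two sample URLs are just illustrations):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://www.csdn.net/robots.txt")
    rp.read()

    # can_fetch(user_agent, url) returns True if the rules allow crawling that path
    print(rp.can_fetch("*", "https://www.csdn.net/tags/"))  # False: /tags/ is disallowed
    print(rp.can_fetch("*", "https://www.csdn.net/"))       # True: the root is not listed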

    Use the Python Requests library to send HTTP (Hypertext Transfer Protocol) requests.

    Use the Python Beautiful Soup library to parse the HTML content that comes back.

    HTTP request and HTTP response

    [Figures omitted: diagrams of the structure of an HTTP request and an HTTP response]
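
    As a minimal sketch of this request/response cycle, the fields printed below correspond to the pieces the two diagrams showed (status line, headers, body); the URL is the demo site used in section 1.2:

    import requests

    response = requests.get("http://books.toscrape.com/")

    # the response carries the status code, headers, and body
    print(response.status_code)               # e.g. 200
    print(response.headers["Content-Type"])   # e.g. text/html
    print(response.text[:100])                # first 100 characters of the HTML body

    # the request that was actually sent is attached to the response
    print(response.request.method)            # GET
    print(response.request.headers)           # request headers, including User-Agent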

    1.2 Practice code [get prices & book titles]

    import requests
    # parse the HTML
    from bs4 import BeautifulSoup
    
    # disguise the program as a normal browser request
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
    response = requests.get("http://books.toscrape.com/", headers=head)
    # set the encoding explicitly; otherwise Requests may guess wrong
    # and the £ sign decodes as two garbled characters
    response.encoding = 'utf-8'
    if response.ok:
        # file = open(r'C:\Users\root\Desktop\Bug.html', 'w')
        # file.write(response.text)
        # file.close()
        content = response.text
        # html.parser selects the parser BeautifulSoup uses for the HTML
        soup = BeautifulSoup(content, "html.parser")

        # get the prices
        all_prices = soup.find_all("p", attrs={"class": "price_color"})
        for price in all_prices:
            print(price.string[1:])  # drop the leading £ sign

        # get the titles
        all_titles = soup.find_all("h3")
        for title in all_titles:
            # the first a element under each h3 holds the title link
            print(title.find("a").string)
    else:
        print(response.status_code)
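
    The two loops above print prices and titles as separate streams. A hedged sketch that pairs each title with its price instead, to run in place of the two loops inside the response.ok branch, assuming each book on books.toscrape.com sits inside an article with class product_pod (the full title appears to live in the a tag's title attribute, since the link text is truncated):

    for book in soup.find_all("article", attrs={"class": "product_pod"}):
        title = book.find("h3").find("a").get("title")
        price = book.find("p", attrs={"class": "price_color"}).string
        print(title, price)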
    

    1.3 Practice code [get the Top 250 movie titles]

    import requests
    # parse the HTML
    from bs4 import BeautifulSoup
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
    # fetch the Top 250 movie titles, 25 per page
    for i in range(0, 250, 25):
        response = requests.get(f"https://movie.douban.com/top250?start={i}", headers=head)
        if response.ok:
            content = response.text
            soup = BeautifulSoup(content, "html.parser")
            all_titles = soup.find_all("span", attrs={"class": "title"})
            for title in all_titles:
                # spans containing "/" are the foreign-language alternate titles; skip them
                if "/" not in title.string:
                    print(title.string)
        else:
            print(response.status_code)
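
    When looping over many pages like this, it is good manners (and less likely to get the client blocked) to pause between requests. A small variation of the loop above, assuming the same head dict is in scope:

    import time

    titles = []
    for i in range(0, 250, 25):
        response = requests.get(f"https://movie.douban.com/top250?start={i}", headers=head)
        if response.ok:
            soup = BeautifulSoup(response.text, "html.parser")
            titles += [t.string for t in soup.find_all("span", attrs={"class": "title"})
                       if "/" not in t.string]
        time.sleep(1)  # wait one second between page requests
    print(len(titles))  # 250 if every page parsed cleanly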
    

    1.4 Practice code [download images]

    import requests
    # parse the HTML
    from bs4 import BeautifulSoup
    import os
    
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
    response = requests.get("https://www.maoyan.com/", headers=head)
    if response.ok:
        os.makedirs('img', exist_ok=True)  # make sure the target directory exists
        soup = BeautifulSoup(response.text, "html.parser")
        for img in soup.find_all("img", attrs={"class": "movie-poster-img"}):
            # the poster URL is lazy-loaded, so it sits in data-src rather than src
            img_url = img.get('data-src')
            alt = img.get('alt')
            path = 'img/' + alt + '.jpg'
            res = requests.get(img_url, headers=head)
            with open(path, 'wb') as f:
                f.write(res.content)
    else:
        print(response.status_code)
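
    One fragile spot above: alt is used directly as a file name, which fails if it contains characters such as / or :. A small sanitizer you could apply before building path (safe_filename is a hypothetical helper, not part of any library):

    import re

    def safe_filename(name: str) -> str:
        # replace characters that are not legal in file names with underscores
        return re.sub(r'[\\/:*?"<>|]', "_", name)

    # usage: path = 'img/' + safe_filename(alt) + '.jpg'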
    
    

    1.5 Practice code [Qiantuwang (58pic.com) images: scrape and download]

    import requests
    # parse the HTML
    from bs4 import BeautifulSoup
    import os
    
    
    # scrape images from Qiantuwang (58pic.com)
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
    # response = requests.get("https://www.58pic.com/piccate/53-0-0.html", headers=head)
    # response = requests.get("https://www.58pic.com/piccate/53-598-2544.html", headers=head)
    response = requests.get("https://www.58pic.com/piccate/53-527-1825.html", headers=head)
    if response.ok:
        os.makedirs('imgqiantuwang', exist_ok=True)  # make sure the target directory exists
        soup = BeautifulSoup(response.text, "html.parser")
        for img in soup.find_all("img", attrs={"class": "lazy"}):
            # lazy-loaded images keep their real URL in data-original, without a scheme
            img_url = "https:" + img.get('data-original')
            alt = img.get('alt')
            path = 'imgqiantuwang/' + str(alt) + '.jpg'
            res = requests.get(img_url, headers=head)
            with open(path, 'wb') as f:
                f.write(res.content)
    else:
        print(response.status_code)
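
    Note that the image URL is read from data-original rather than src: the site lazy-loads its images, so src initially points at a placeholder and the real address sits in data-original. Which attribute holds the real URL varies from site to site, so inspect the page in the browser's developer tools first, and check the site's robots.txt and terms of service before downloading images in bulk.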
    
  • Original article: https://blog.csdn.net/qq_43935317/article/details/134249236