• Crawler in Practice: From Web Page to Local Disk, an Easy Way to Read Novels Offline


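    Today we continue our crawler practice. Besides the usual scraping of web page data, we add a download feature: the main task is to crawl a novel's content and save it to local disk so that it can be read offline later.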

    So that beginners do not get lost as the features grow, I have drawn a feature architecture diagram:

    [Figure: feature architecture diagram]

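    Let's dig into today's subject: the novel site (www.readnovel.com)

    Parsing the novel site

    Getting the book list

    The site shows several recommendation lists. We only need to parse one of them; there is no need to reproduce everything the page displays, which gets us the information we want with less work.

    Here is a sample snippet to make this concrete: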

    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup as bf

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
    req = Request("https://www.readnovel.com/", headers=headers)
    # Send the request; the response body is bytes, so decode it into a string
    html = urlopen(req)
    html_text = html.read().decode('utf-8')
    soup = bf(html_text, 'html.parser')

    for li in soup.select('#new-book-list li'):
        a_tag = li.select_one('a[data-eid="qd_F24"]')
        p_tag = li.select_one('p')
        if a_tag is None or p_tag is None:
            continue  # skip list items that do not match the expected layout
        book = {
            'href': a_tag['href'],
            'title': a_tag.get('title'),
            'content': p_tag.get_text()
        }
        print(book)
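
    The request-and-decode steps repeat in every snippet of this article, so they could be pulled into one helper. Below is a minimal sketch, assuming the server either declares its charset in the Content-Type header or serves UTF-8. The names `fetch_html` and `decode_body` are hypothetical and do not appear in the original code, and the User-Agent is shortened for readability.

```python
from urllib.request import Request, urlopen

# Shortened placeholder; the article uses a full browser User-Agent string.
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def decode_body(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 with replacement characters."""
    try:
        return raw.decode(charset or 'utf-8')
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: degrade gracefully.
        return raw.decode('utf-8', errors='replace')


def fetch_html(url, timeout=10):
    """Request a page with a browser-like User-Agent and return its HTML as str."""
    req = Request(url, headers=HEADERS)
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset()
        return decode_body(resp.read(), charset)
```

    With such a helper, each parsing step would start from `soup = bf(fetch_html(url), 'html.parser')` instead of repeating the four request lines.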
    
    

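    Book synopsis

    Normally we browse the book list first and then look at what a book is roughly about, so we simply parse that information from the book's page. Sample code: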

    # `link` is the 'href' of one book taken from the book list
    free_trial_link = []
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
    req = Request(f"https://www.readnovel.com{link}#Catalog", headers=headers)
    # Send the request; the response body is bytes, so decode it into a string
    html = urlopen(req)
    html_text = html.read().decode('utf-8')
    soup = bf(html_text, 'html.parser')
    og_title = soup.find('meta', property='og:title')['content']
    og_description = soup.find('meta', property='og:description')['content']
    og_novel_author = soup.find('meta', property='og:novel:author')['content']
    og_novel_update_time = soup.find('meta', property='og:novel:update_time')['content']
    og_novel_status = soup.find('meta', property='og:novel:status')['content']
    og_novel_latest_chapter_name = soup.find('meta', property='og:novel:latest_chapter_name')['content']
    # Collect the chapter links from the catalog block
    div_tag = soup.find('div', id='j-catalogWrap')
    list_items = div_tag.find_all('li', attrs={'data-rid': True}) if div_tag else []
    for li in list_items:
        link_text = li.find('a').text
        # Chapter titles on the site contain the character '第' (as in "第1章")
        if '第' in link_text:
            link_url = li.find('a')['href']
            link_obj = {'link_text': link_text,
                        'link_url': link_url}
            free_trial_link.append(link_obj)
    print(f"Title: {og_title}")
    print(f"Synopsis: {og_description}")
    print(f"Author: {og_novel_author}")
    print(f"Last updated: {og_novel_update_time}")
    print(f"Status: {og_novel_status}")
    print(f"Latest chapter: {og_novel_latest_chapter_name}")
    

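    Note that besides the book's summary, this step also parses its table of contents. Keeping the chapter list makes trial reading easier later: once a book catches our interest we will probably read a little of it, and if we really like it we may add it to our reading list. To avoid parsing the page again at reading time, the chapter links are stored right here.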
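
    The code above keeps the chapter list only in memory. If you want it to survive between runs, one option is to write it to a JSON file. This is a minimal sketch under that assumption; `save_catalog`, `load_catalog` and the file path are hypothetical names, not part of the original script.

```python
import json
from pathlib import Path


def save_catalog(catalog, path):
    """Write a list of {'link_text', 'link_url'} dicts to a UTF-8 JSON file."""
    text = json.dumps(catalog, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding='utf-8')


def load_catalog(path):
    """Read the saved chapter list, or return an empty list if nothing was saved."""
    file = Path(path)
    if not file.exists():
        return []
    return json.loads(file.read_text(encoding='utf-8'))
```

    For example, `save_catalog(free_trial_link, 'catalog.json')` after parsing a book, and `load_catalog('catalog.json')` on the next start.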

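    Free trial reading

    In this step we parse the chapter title and the chapter body and print them, in preparation for wrapping the logic into functions for downloading or reading. Organizing the code this way keeps it reusable and easier to maintain. Sample code: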

    # `link` is the 'link_url' of one chapter collected from the catalog
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
    req = Request(f"https://www.readnovel.com{link}", headers=headers)
    # Send the request; the response body is bytes, so decode it into a string
    html = urlopen(req)
    html_text = html.read().decode('utf-8')
    soup = bf(html_text, 'html.parser')
    name = soup.find('h1', class_='j_chapterName')
    chapter = {
        'name': name.get_text(),
        'text': ''  # stays empty if the chapter body cannot be found
    }
    print(name.get_text())
    ywskythunderfont = soup.find('div', class_='ywskythunderfont')
    if ywskythunderfont:
        p_tags = ywskythunderfont.find_all('p')
        if p_tags:
            chapter['text'] = p_tags[0].get_text()
        print(chapter)
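
    The chapter body arrives as one long string. The download function later in this article replaces each pair of full-width spaces (U+3000) with a newline, which suggests that paragraphs are indented with those characters. Under that assumption, a small helper could split the body into clean paragraphs; `split_paragraphs` is a hypothetical name used only for illustration.

```python
import re


def split_paragraphs(text):
    """Split chapter text on runs of full-width spaces and drop empty pieces."""
    parts = re.split(r'\u3000+', text)
    return [part.strip() for part in parts if part.strip()]
```

    The result can be joined with `'\n'.join(...)` before printing or writing to a file.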
    

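    Downloading the novel

    Once parsing is done we already have the chapter content, so all that remains is the download step. In case you have forgotten how to write a file, here is a quick refresher.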

    file_name = 'a.txt'
    # Open the file in write mode with UTF-8 encoding and write some text
    with open(file_name, 'w', encoding='utf-8') as file:
        file.write('Download test')
    print(f'File {file_name} downloaded!')
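
    Chapter titles are later used as file names, and a title may contain characters such as `:` or `?` that some file systems (notably Windows) reject. A minimal sketch of a safer write step follows; `safe_file_name`, `save_chapter` and the `directory` parameter are hypothetical and not part of the original code.

```python
import re
from pathlib import Path


def safe_file_name(name, default='chapter'):
    """Replace characters that are invalid in Windows file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', name).strip(' .')
    return cleaned or default


def save_chapter(chapter, directory='.'):
    """Write chapter['text'] to '<directory>/<chapter name>.txt' and return the path."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / (safe_file_name(chapter['name']) + '.txt')
    path.write_text(chapter['text'], encoding='utf-8')
    return path
```

    Returning the path lets the caller print where the file ended up.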
    

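    Wrapping it into functions

    As usual, here is the full source. Even if you do not feel like typing it out, you can copy, paste and run it, then work through the details on your own to understand how the pieces fit together.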

    # Import urlopen and Request from the urllib library
    from urllib.request import urlopen, Request
    # Import BeautifulSoup
    from bs4 import BeautifulSoup as bf
    from random import choice
    from colorama import init
    from termcolor import colored
    from readchar import readkey
    FGS = ['green', 'yellow', 'blue', 'cyan', 'magenta', 'red']
    book_list = []
    free_trial_link = []
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
    
    def get_hot_book():
        print(colored('Searching for the book list!', choice(FGS)))
        book_list.clear()
        req = Request("https://www.readnovel.com/", headers=headers)
        # Send the request; the response body is bytes, so decode it into a string
        html = urlopen(req)
        html_text = html.read().decode('utf-8')
        soup = bf(html_text, 'html.parser')

        for li in soup.select('#new-book-list li'):
            a_tag = li.select_one('a[data-eid="qd_F24"]')
            p_tag = li.select_one('p')
            if a_tag is None or p_tag is None:
                continue  # skip list items that do not match the expected layout
            book = {
                'href': a_tag['href'],
                'title': a_tag.get('title'),
                'content': p_tag.get_text()
            }
            book_list.append(book)
    
    def get_book_detail(link):
        # The list is only mutated, never rebound, so no `global` statement is needed
        free_trial_link.clear()
        req = Request(f"https://www.readnovel.com{link}#Catalog", headers=headers)
        # Send the request; the response body is bytes, so decode it into a string
        html = urlopen(req)
        html_text = html.read().decode('utf-8')
        soup = bf(html_text, 'html.parser')
        og_title = soup.find('meta', property='og:title')['content']
        og_description = soup.find('meta', property='og:description')['content']
        og_novel_author = soup.find('meta', property='og:novel:author')['content']
        og_novel_update_time = soup.find('meta', property='og:novel:update_time')['content']
        og_novel_status = soup.find('meta', property='og:novel:status')['content']
        og_novel_latest_chapter_name = soup.find('meta', property='og:novel:latest_chapter_name')['content']
        # Collect the chapter links from the catalog block
        div_tag = soup.find('div', id='j-catalogWrap')
        list_items = div_tag.find_all('li', attrs={'data-rid': True}) if div_tag else []
        for li in list_items:
            link_text = li.find('a').text
            # Chapter titles on the site contain the character '第' (as in "第1章")
            if '第' in link_text:
                link_url = li.find('a')['href']
                link_obj = {'link_text': link_text,
                            'link_url': link_url}
                free_trial_link.append(link_obj)
        print(colored(f"Title: {og_title}", choice(FGS)))
        print(colored(f"Synopsis: {og_description}", choice(FGS)))
        print(colored(f"Author: {og_novel_author}", choice(FGS)))
        print(colored(f"Last updated: {og_novel_update_time}", choice(FGS)))
        print(colored(f"Status: {og_novel_status}", choice(FGS)))
        print(colored(f"Latest chapter: {og_novel_latest_chapter_name}", choice(FGS)))
    
    def free_trial(link):
        req = Request(f"https://www.readnovel.com{link}", headers=headers)
        # Send the request; the response body is bytes, so decode it into a string
        html = urlopen(req)
        html_text = html.read().decode('utf-8')
        soup = bf(html_text, 'html.parser')
        name = soup.find('h1', class_='j_chapterName')
        chapter = {
            'name': name.get_text(),
            'text': ''  # stays empty if the chapter body cannot be found
        }
        print(colored(name.get_text(), choice(FGS)))
        ywskythunderfont = soup.find('div', class_='ywskythunderfont')
        if ywskythunderfont:
            p_tags = ywskythunderfont.find_all('p')
            if p_tags:
                chapter['text'] = p_tags[0].get_text()
        return chapter
    
    def download_chapter(chapter):
        file_name = chapter['name'] + '.txt'
        with open(file_name, 'w', encoding='utf-8') as file:
            # Each pair of full-width spaces becomes a line break
            file.write(chapter.get('text', '').replace('\u3000\u3000', '\n'))
        print(colored(f'File {file_name} downloaded!', choice(FGS)))
    
    def print_book():
        # Print the book titles three per row, each prefixed with its index
        for i in range(0, len(book_list), 3):
            names = [f'{i + j}:{book_list[i + j]["title"]}' for j in range(3) if i + j < len(book_list)]
            print(colored('\t\t'.join(names), choice(FGS)))

    def read_book(page):
        if not free_trial_link:
            print(colored('No book selected, nothing to read!', choice(FGS)))
            return  # without this, the lookup below raises IndexError on an empty list

        print(colored(free_trial(free_trial_link[page]['link_url']).get('text', ''), choice(FGS)))
    
    init()  # enable colored terminal output before anything is printed
    get_hot_book()

    print(colored('Search finished!', choice(FGS)))
    print(colored('m: back to the book list', choice(FGS)))
    print(colored('d: free trial reading', choice(FGS)))
    print(colored('x: download all', choice(FGS)))
    print(colored('n: next chapter', choice(FGS)))
    print(colored('b: previous chapter', choice(FGS)))
    print(colored('q: quit', choice(FGS)))
    my_key = ['q', 'm', 'd', 'x', 'n', 'b']
    current = 0
    while True:
        while True:
            move = readkey()
            if move in my_key:
                break
        if move == 'q':  # 'q' quits
            break
        if move == 'd':
            read_book(current)
        if move == 'x':  # demo only: downloads the first chapter instead of looping over all of them
            if free_trial_link:
                download_chapter(free_trial(free_trial_link[0]['link_url']))
            else:
                print(colored('No book selected, nothing to download!', choice(FGS)))
        if move == 'b':
            current = max(current - 1, 0)
            read_book(current)
        if move == 'n':
            current = current + 1
            if current >= len(free_trial_link):  # clamp to the last valid index
                current = max(len(free_trial_link) - 1, 0)
            read_book(current)
        if move == 'm':
            print_book()
            current = 0
            try:
                num = int(input('Enter a book number: =====>'))
            except ValueError:
                continue  # ignore non-numeric input
            if 0 <= num < len(book_list):
                get_book_detail(book_list[num]['href'])
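
    The `x` key in the script only downloads the first chapter for demonstration purposes. If you want to fetch every chapter, a loop along the following lines could work. This is only a sketch: `download_all` and its `delay` parameter are hypothetical, the fetch and save steps are passed in as functions, and the one-second pause is an arbitrary choice to stay polite to the site.

```python
import time


def download_all(links, fetch_chapter, save_chapter, delay=1.0):
    """Fetch and save every chapter in order; return how many were saved."""
    saved = 0
    for index, item in enumerate(links):
        chapter = fetch_chapter(item['link_url'])
        if chapter.get('text'):  # skip chapters whose body could not be parsed
            save_chapter(chapter)
            saved += 1
        if delay and index < len(links) - 1:
            time.sleep(delay)  # pause between requests to avoid hammering the site
    return saved
```

    With the functions from the script, the call would look like `download_all(free_trial_link, free_trial, download_chapter)`.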
    

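    Summary

    In today's crawler practice we went beyond plain page scraping and added a download feature: crawling a novel and saving it locally for offline reading. To keep things clear, I drew a feature architecture diagram. We first parsed the novel site, covering the book list, the book synopsis and the free trial chapters, and wrote code for each step: fetching the book list, fetching a book's details, parsing a trial chapter, and downloading it. Finally, we wrapped these steps into functions so they are easy to call. This exercise gives us a practical foundation for later crawler projects.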

  • Original article: https://www.cnblogs.com/guoxiaoyu/p/18069448