Scraping a Novel with Python

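1. Fetch the web page

2. Parse the page and extract the content

3. Save the content to a text file

The detailed steps are in the code below, and the comments explain each one.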


# Scrape a novel

# requests is a widely used HTTP library for sending requests to a site and reading the responses.
# pip install requests
# lxml is a Python parsing library that handles HTML and XML and supports XPath queries.
# pip install lxml
from lxml import etree
import requests
# Starting page URL
url = "https://dldl1.nsbuket.cc/xiaoshuo/douluodalu/1.html"
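while True:
    # Send a browser-like User-Agent so the request looks like a normal visitor
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0'
    }
    # Send a GET request; the timeout keeps a stalled connection from hanging the loop
    resp = requests.get(url, headers=headers, timeout=10)
    # Set the response encoding
    resp.encoding = 'utf-8'
    # Uncomment to inspect the raw response
    # print(resp.text)
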
    # Parse the HTML and extract the text
    e = etree.HTML(resp.text)
    info = '\n'.join(e.xpath('//div[@class="m-post"]/p/text()'))  # chapter body
    title = e.xpath('//h1/text()')[0]  # chapter title
    # print(title)
    # print(info)

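    # Look up the link to the next chapter
    next_href = e.xpath("//td[2]/a/@href")
    print(title)

    # Append the chapter to the output file
    with open('斗罗大陆.txt', 'a', encoding='utf-8') as f:
        f.write(title + '\n\n' + info + '\n\n')

    # Exit the loop when there is no next link or the next URL is the end page
    if not next_href:
        break
    url = f'https://dldl1.nsbuket.cc{next_href[0]}'
    if url == 'https://dldl1.nsbuket.cc/xiaoshuo/douluodalu/217333.html':
        break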
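
Building the next-chapter URL is the step most likely to break if the page layout changes: an XPath query that matches nothing returns an empty list, and an href is not always root-relative. Below is a minimal sketch of a more defensive version, using only the standard library. The helper name next_chapter_url and the constant LAST_URL are hypothetical and not part of the original script; the end URL is copied from the exit check above.

```python
from urllib.parse import urljoin

# Assumed end marker, copied from the exit check in the script above
LAST_URL = "https://dldl1.nsbuket.cc/xiaoshuo/douluodalu/217333.html"


def next_chapter_url(current_url, hrefs, last_url=LAST_URL):
    """Return the absolute URL of the next chapter, or None to stop.

    hrefs is the list returned by an XPath query such as
    e.xpath("//td[2]/a/@href"); it may be empty if nothing matched.
    """
    if not hrefs:
        # No "next" link on the page: nothing more to fetch
        return None
    # urljoin resolves both root-relative and page-relative hrefs
    candidate = urljoin(current_url, hrefs[0])
    if candidate == last_url:
        # Same stop condition as the original loop
        return None
    return candidate
```

Inside the loop, this could replace the URL-building and exit lines with something like `url = next_chapter_url(url, e.xpath("//td[2]/a/@href"))`, followed by `if url is None: break`.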


Original article: https://blog.csdn.net/qq_42683732/article/details/138635447