Web Scraping Study Diary

Introduction:

1. Language: Python

HTML:


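<!DOCTYPE html>
<html>
    <head>
        <title>czy_demo</title>
        <meta charset="UTF-8">
    </head>
    <body>
        <h1>Level-1 heading (h1 to h6)</h1>
        <p>Plain text<b>bold</b><i>italic</i><u>underline</u></p>
        <img src="1.jpg" width="500px">
        <br><a href="http://t.csdnimg.cn/DvHJ6" target="_blank">CSDN link</a>
        <p>Several spans side by side: <span style="background-color: bisque">span1</span><span style="background-color: aquamarine">span2</span></p>
        <ol>
            <li>Ordered list item</li>
            <li>Ordered list item</li>
            <li>Ordered list item</li>
        </ol>
        <ul>
            <li>Unordered list item</li>
            <li>Unordered list item</li>
            <li>Unordered list item</li>
        </ul>

    <table border="1">
        <thead>
            <tr><th colspan="3">First header row (add one tr per header row)</th></tr>
            <tr><th colspan="3">Second header row</th></tr>
        </thead>
        <tbody>
            <tr>
                <td>Row 1, cell 1</td>
                <td>Row 1, cell 2</td>
                <td>Row 1, cell 3</td>
            </tr>
            <tr>
                <td>Row 2, cell 1</td>
                <td>Row 2, cell 2</td>
                <td>Row 2, cell 3</td>
            </tr>
        </tbody>

    </table>

    </body>
</html>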

Scraper code

1. The two required packages


from bs4 import BeautifulSoup
import requests

2. Fetching the page source


response = requests.get('http:.......')
print(response)              # the Response object
print(response.status_code)  # HTTP status code; 200 means OK
print(response.text)         # the page source as text
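
The fetch above prints whatever comes back, even for an error page. A minimal sketch of a slightly more defensive version follows, assuming you want a timeout and an exception on 4xx/5xx responses. The helper name fetch_html and its get parameter are hypothetical, introduced only so the function can be tried without a real network call.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page source as text, raising on HTTP error statuses.

    `get` is a hypothetical hook (defaults to requests.get) that makes it
    possible to pass in a stand-in function when experimenting offline.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text
```

Calling fetch_html with a real address returns the same text that response.text holds in the snippet above, but a failed request stops the script instead of being parsed as if it were the target page.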

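3. Extracting specific content


response = requests.get('http:........')
content = response.text
soup = BeautifulSoup(content, "html.parser")  # parse with the built-in HTML parser

# Find <p> tags filtered by class; the class is left empty here, fill in the one you want
all_p = soup.find_all("p", attrs={"class": ""})
for p in all_p:
    print(p.string)  # .string is None when the tag has more than one child

# Find every <a> nested inside an <h3> and print its text
all_h3 = soup.find_all("h3")
for h3 in all_h3:
    links = h3.find_all("a")
    for a in links:
        print(a.string)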
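
To see what these lookups return without hitting a real site, here is a small sketch that runs the same kind of queries on an inline HTML string. The HTML snippet and the variable names are made up for illustration. It also shows the difference between .string and get_text() when a tag contains nested tags.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, standing in for response.text
html = """
<html><body>
  <p class="intro">Plain text</p>
  <p>Text with <b>bold</b> inside</p>
  <h3><a href="/a">First link</a></h3>
  <h3><a href="/b">Second link</a></h3>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# .string is None when a tag has more than one child;
# get_text() concatenates all text found inside the tag
paragraphs = [(p.string, p.get_text()) for p in soup.find_all("p")]

# Collect the text and href of every <a> nested inside an <h3>
links = [
    (a.get_text(strip=True), a.get("href"))
    for h3 in soup.find_all("h3")
    for a in h3.find_all("a")
]

print(paragraphs)
print(links)
```

If a loop over .string prints None for some tags, switching to get_text() is usually the simplest fix.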

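4. Downloading images


from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': '[replace with the User-Agent to send to the target page]'
}
response = requests.get('http://data.shouxi.com/item.php?id=1239786', headers=headers)
# Set the encoding to match the page; use 'GBK' instead if the text comes out garbled
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, "html.parser")  # parse with the built-in HTML parser

# print(response.text)

images = soup.find_all("img")

num = 1
for img in images:
    img_url = img.get("src")
    if not img_url:
        continue  # skip <img> tags that have no src attribute
    if not img_url.startswith(('http:', 'https:')):
        img_url = "http:....[replace with the page address]" + img_url  # turn a relative address into an absolute one
    # Request the image and save its bytes to disk
    img_response = requests.get(img_url, headers=headers)
    with open(f'image.{num}.jpg', mode='wb') as f:
        f.write(img_response.content)
        print(f'Image saved: image.{num}.jpg')
    num = num + 1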
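
Gluing the page address onto a relative src by hand works for simple cases but breaks on paths that start with "/" or on images served from another host. As a possible alternative, the sketch below uses urljoin from the standard library. The helper names resolve_image_url and image_filename, and the example.com addresses, are hypothetical and only serve the illustration.

```python
import os
from urllib.parse import urljoin, urlparse


def resolve_image_url(page_url, src):
    """Turn an <img> src value into an absolute URL, or None if src is missing."""
    if not src:
        return None
    return urljoin(page_url, src)


def image_filename(img_url, index):
    """Build a local file name, keeping the URL's extension when it has one."""
    ext = os.path.splitext(urlparse(img_url).path)[1] or ".jpg"
    return f"image.{index}{ext}"


page_url = "http://example.com/gallery/item.html"  # hypothetical page address
sources = ["1.jpg", "/static/2.png", "https://cdn.example.com/3.gif", None]

for index, src in enumerate(sources, start=1):
    img_url = resolve_image_url(page_url, src)
    if img_url is None:
        continue  # nothing to download for this tag
    print(img_url, "->", image_filename(img_url, index))
```

In the download loop, img_url could then come from resolve_image_url(page_url, img.get("src")), and the name passed to open() from image_filename, so that PNG or GIF files are not all saved with a .jpg extension.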


Original article: https://blog.csdn.net/wuhu_czy/article/details/140360967