Python Security Tool Development Notes (6): An Introduction to the BeautifulSoup Module for Python Crawlers

I. Python crawler basics

1.1 Relevant Python libraries

1. requests, re
2. BeautifulSoup
3. hackhttp

1.2 Python BeautifulSoup

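Basic usage of the Python BeautifulSoup module:

1. Parsing content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly to avoid a warning

2. Browsing data
soup.title
soup.title.string

3. Using regular expressions with BeautifulSoup
soup.find_all(name='x', attrs={'xx': re.compile('xxx')})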
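
The three steps above can be tried offline on a small HTML string. This is only a minimal sketch: the html_doc markup, its URLs and its class names are made up for illustration, and it uses Python's built-in 'html.parser' so that no extra parser has to be installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for this illustration
html_doc = """
<html><head><title>Demo page</title></head>
<body>
<a class="nav" href="https://example.com/news">News</a>
<a class="nav" href="https://example.org/about">About</a>
<a class="ext" href="https://example.com/contact">Contact</a>
</body></html>
"""

# 1. Parse the content
soup = BeautifulSoup(html_doc, 'html.parser')

# 2. Browse the data
title_tag = soup.title           # the whole <title> tag
title_text = soup.title.string   # only the text inside it

# 3. Filter with a regular expression: <a> tags whose href contains "example.com"
links = soup.find_all(name='a', attrs={'href': re.compile(r'example\.com')})
hrefs = [a['href'] for a in links]

print(title_tag)
print(title_text)
print(hrefs)
```

The compiled pattern is matched anywhere inside the attribute value, so it does not need to describe the whole URL.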

II. A simple Python crawler

Example 1:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ichunqiu.com/competition/all'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0',
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')  # parse the HTML

print(soup.title)         # print the <title> tag
print(soup.title.string)  # print the text inside the <title> tag
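
Example 1 hands r.content, the raw response bytes, to BeautifulSoup and leaves the character encoding for BeautifulSoup to work out. If a page's encoding is known in advance, it can be named with the from_encoding argument. A minimal sketch, assuming a made-up GBK-encoded page body in place of a real response:

```python
from bs4 import BeautifulSoup

# Hypothetical page body, encoded as GBK to stand in for the bytes in r.content
# (the title text means "test title")
raw_bytes = '<html><head><title>测试标题</title></head></html>'.encode('gbk')

# from_encoding tells BeautifulSoup which encoding to use when decoding the bytes
soup = BeautifulSoup(raw_bytes, 'html.parser', from_encoding='gbk')
title_text = soup.title.string

print(title_text)
```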

Example 2:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ichunqiu.com/competition/all'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0',
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')  # parse the HTML

# print(soup.title)         # print the <title> tag
# print(soup.title.string)  # print the text inside the <title> tag

# name is the tag name; attrs (optional) filters on the tag's attributes, for example:
# com_nes = soup.find_all(name='a', attrs={'class': 'ui_colorG'})
com_nes = soup.find_all(name='a')
for coms in com_nes:
    if coms.string is not None:
        print(coms.string)
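
The None check in Example 2 matters because .string is None whenever a tag holds more than one child node or no text at all. The sketch below, on made-up HTML, compares .string with get_text(), which collects all nested text instead:

```python
from bs4 import BeautifulSoup

# Hypothetical links: plain text, nested markup, and an image-only link
html_doc = """
<a href="/a">Plain text</a>
<a href="/b"><span>Nested</span> text</a>
<a href="/c"><img src="logo.png"></a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
anchors = soup.find_all('a')

# .string only works when the tag contains a single piece of text
strings = [a.string for a in anchors]

# get_text() joins every nested string; here with a space, stripping whitespace
texts = [a.get_text(' ', strip=True) for a in anchors]

print(strings)
print(texts)
```

Which of the two to use depends on whether links with nested markup should be skipped or kept.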

Example 3:

import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.ichunqiu.com/competition/all'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0',
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')  # parse the HTML

# print(soup.title)         # print the <title> tag
# print(soup.title.string)  # print the text inside the <title> tag

# Collect every <a> tag whose href contains "ichunqiu"
com_nes = soup.find_all(name='a', attrs={'href': re.compile('ichunqiu')})

for coms in com_nes:
    print(coms)
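
Example 3 prints whole <a> tags. A crawler usually wants the link targets themselves, so a possible next step is to read each href, resolve relative paths against the page URL, and drop duplicates. A minimal sketch, assuming a made-up page URL and made-up links:

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page URL and HTML, used only for this illustration
base_url = 'https://www.example.com/competition/all'
html_doc = """
<a href="/competition/1">First</a>
<a href="https://www.example.com/competition/2">Second</a>
<a href="/competition/1">First again</a>
<a href="/about">About</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

seen = set()
urls = []
for a in soup.find_all('a', attrs={'href': re.compile('competition')}):
    full = urljoin(base_url, a['href'])  # turn relative paths into absolute URLs
    if full not in seen:                 # keep first occurrence, preserve order
        seen.add(full)
        urls.append(full)

print(urls)
```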

The find_all pattern above serves as an all-purpose way of using BeautifulSoup.

III. Python crawler basics

3.1 Python hackhttp

Basic usage of the Python hackhttp module:

Installation: pip install hackhttp

import hackhttp
hh = hackhttp.hackhttp()
url = "http://www.baidu.com"
code, head, html, redirect_url, log = hh.http(url)

hackhttp can send GET and POST requests as well as raw HTTP packets.

Further reading on hackhttp:
http://www.voidcc.com/project/hack-requests

Example 1:

# Note: this example imports HackRequests (the project at the link above), not hackhttp
import HackRequests

hack = HackRequests.hackRequests()
url = "http://www.hacking8.com"
hh = hack.http(url)
print(hh.status_code)  # HTTP status code
print(hh.content())    # response body




Original article: https://blog.csdn.net/weixin_40412037/article/details/126817566