Python Web Scraping Libraries and How to Use Each One

Contents

Common Python scraping libraries

Code examples

requests + BeautifulSoup

Scrapy

Selenium

PyQuery

Axios

requests-html

pyppeteer

Summary

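Python is a very popular programming language, valued by developers for being easy to learn and for its wide range of applications. Many Python libraries support crawler development and help fetch data from the web quickly. This article introduces some commonly used Python scraping libraries and how to use them.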

Common Python scraping libraries

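Python has a rich set of scraping libraries. Here are some commonly used ones and how to use them:
1. requests: sends HTTP requests and returns the response. Usage: install and import the library, send a request with get or post, then extract the information you need from the returned response object.
2. BeautifulSoup: parses HTML or XML documents and extracts data from them. Usage: install and import the library, pass the page source to the BeautifulSoup constructor, locate elements with selectors, then read the data through attributes or methods.
3. Scrapy: a crawling framework built on Twisted, suited to large-scale data collection. Usage: install Scrapy, create a Scrapy project, write components such as Spiders and Item Pipelines, then run the scrapy command to collect and store data.
4. Selenium: drives a real browser to retrieve dynamically generated page data. Usage: install and import the library, create a WebDriver object, use it to perform browser actions (clicking, typing, and so on), then read the dynamically generated data.
5. PyQuery: parses HTML or XML documents with jQuery-style selectors. Usage: install and import the library, pass the page source to the PyQuery constructor, locate elements with selectors, then read the data through attributes or methods.
6. Axios: sends HTTP requests and returns the response, with support for Promises and async/await. Note that Axios is a JavaScript library for the browser and Node.js, not a Python package; it plays the role that requests plays in Python. Usage: install and import it in a JavaScript project, send a request with get or post, then extract the information you need from the response object.
7. requests-html: an extension built on requests that can also parse HTML pages. Usage: install and import the library, send a request with get or post, then extract the information you need from the response object's html attribute.
8. pyppeteer: drives a browser to retrieve dynamically generated page data, with support for headless mode. Usage: install and import the library, launch a Browser object, create a Page object from it, perform browser actions (clicking, typing, and so on), then read the dynamically generated data.
These are some commonly used scraping libraries and their usage. Each suits different scenarios and needs: requests with BeautifulSoup or PyQuery for static pages, Scrapy for large crawls, and Selenium or pyppeteer for pages rendered by JavaScript. Choosing the right one can greatly improve the efficiency and accuracy of data collection.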

Code examples

requests + BeautifulSoup
```
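import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
# Set a timeout so the request cannot hang forever, and fail fast on HTTP errors
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Get the page title
title = soup.title.string
print('Page title:', title)

# Get the text of the first paragraph (get_text also works when the <p> has child tags)
content = soup.p.get_text()
print('Page content:', content)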
```
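The library list mentions locating elements with selectors, which the example above does not show. A minimal sketch of that step, assuming a made-up inline HTML snippet (`SAMPLE_HTML`) and a hypothetical helper name (`extract_links`) so that it runs without network access:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used so the example runs without network access
SAMPLE_HTML = """
<html>
  <head><title>Example</title></head>
  <body>
    <ul id="links">
      <li><a class="item" href="/a">First</a></li>
      <li><a class="item" href="/b">Second</a></li>
    </ul>
  </body>
</html>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a class="item"> inside ul#links."""
    soup = BeautifulSoup(html, "html.parser")
    # select() takes a CSS selector and returns a list of matching tags
    anchors = soup.select("ul#links a.item")
    return [(a.get_text(strip=True), a["href"]) for a in anchors]


if __name__ == "__main__":
    for text, href in extract_links(SAMPLE_HTML):
        print(text, href)
```

On a real page, the selector string would be whatever matches the target site's markup; `select_one()` returns only the first match.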
Scrapy
```
import scrapy  
  
class ExampleSpider(scrapy.Spider):  
    name = 'example'  
    start_urls = ['https://www.example.com']  
  
    def parse(self, response):  
        # Extract the page title and the text of the first paragraph
        title = response.css('title::text').get()  
        content = response.css('p::text').get()  
        yield {'title': title, 'content': content}
```
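The library list also mentions Item Pipelines, which the spider above does not use. A Scrapy pipeline can be a plain Python class with a `process_item` method, so a rough sketch needs no imports. The class name `CleanTextPipeline` and its whitespace-trimming behavior are illustrative assumptions, and the sketch uses the classic `process_item(item, spider)` signature:

```python
class CleanTextPipeline:
    """Hypothetical pipeline that trims whitespace from string fields."""

    def process_item(self, item, spider):
        # Scrapy calls process_item once for every item a spider yields
        for key in list(item):
            value = item[key]
            if isinstance(value, str):
                item[key] = value.strip()
        # Return the item so later pipelines (or the exporter) receive it
        return item


if __name__ == "__main__":
    # Called by hand here; inside a project, Scrapy makes this call itself
    pipeline = CleanTextPipeline()
    print(pipeline.process_item({"title": "  Example Domain \n"}, spider=None))
```

In a Scrapy project the class would be enabled through the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {"myproject.pipelines.CleanTextPipeline": 300}`, where the module path `myproject.pipelines` is a placeholder.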
Selenium
```
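from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Start a WebDriver session using Chrome
driver = webdriver.Chrome()
try:
    # Open the target URL
    driver.get('https://www.example.com')

    # Locate the username field and type into it
    # (find_element_by_id was removed in Selenium 4; use find_element with By.ID)
    driver.find_element(By.ID, 'username').send_keys('myusername')

    # Locate the password field, type into it, and submit the form
    password = driver.find_element(By.ID, 'password')
    password.send_keys('mypassword')
    password.submit()

    # Wait up to 10 seconds for the welcome message, then check its text
    message = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'welcome-message'))
    )
    assert 'Welcome, myusername!' in message.text
finally:
    # Close the browser and end the session, even if a step above failed
    driver.quit()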
```
PyQuery
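```
from pyquery import PyQuery as pq

# Load an HTML document
html = """
<html>
<head>
    <title>Example</title>
</head>
<body>
    <h1>Hello, World!</h1>
    <p>This is a paragraph.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</body>
</html>
"""
doc = pq(html)

# Select elements with jQuery-style selectors and read their text
print(doc('h1').text())
print(doc('p').text())
for item in doc('li').items():
    print(item.text())
```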

Axios
```
// Axios is a JavaScript library; this snippet runs under Node.js, not Python
const axios = require('axios');

// Send a GET request and log the response body
axios.get('https://api.example.com/data')
  .then(function (response) {
    console.log(response.data);
  })
  .catch(function (error) {
    console.log(error);
  });

// Send a POST request with a JSON payload
axios.post('https://api.example.com/data', {
  name: 'John Doe',
  email: 'john@example.com'
})
  .then(function (response) {
    console.log(response.data);
  })
  .catch(function (error) {
    console.log(error);
  });
```

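requests-html
```
from requests_html import HTMLSession

# Create an HTMLSession instance
session = HTMLSession()

# Fetch a page with the get method
response = session.get('https://example.com')

# response.html is requests-html's own parsed document (not a BeautifulSoup object)
html = response.html

# Print the page title
print(html.find('title', first=True).text)

# Print the text of every paragraph
for p in html.find('p'):
    print(p.text)
```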

pyppeteer
```
import asyncio
from pyppeteer import launch


async def main():
    # Launch the browser (headless by default)
    browser = await launch()
    page = await browser.newPage()
    # Open the page
    await page.goto('http://example.com')
    # Save a screenshot
    await page.screenshot({'path': 'example.png'})
    # Close the browser
    await browser.close()


# asyncio.run creates the event loop, runs main(), and closes the loop
asyncio.run(main())
```
