Below are some common Python web scraping libraries, roughly ordered by popularity.
The requests library, used for sending HTTP requests, is by far the most popular thanks to its simple API and broad community support. Here are simple example use cases for each library:
Requests:
import requests
# Send a GET request
response = requests.get("https://www.example.com")
print(response.text)
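requests also makes it easy to attach query parameters and headers to a request; a minimal sketch (the URL, query string, and User-Agent value are placeholders, not from the original example):
import requests

# Pass query parameters and a custom User-Agent header
response = requests.get(
    "https://www.example.com/search",
    params={"q": "python"},
    headers={"User-Agent": "my-crawler/0.1"},
    timeout=10,
)
print(response.status_code)
print(response.url)  # final URL including the encoded query string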
Beautiful Soup:
from bs4 import BeautifulSoup
import requests
# Send a GET request and parse the HTML
response = requests.get("https://www.example.com")
soup = BeautifulSoup(response.text, 'html.parser')
# Extract the title text
title = soup.title.string
print("Title:", title)
Scrapy:
Use Scrapy to scrape the headings from a site. First create a project:
scrapy startproject myproject
Then create a Spider and define its scraping rules:
import scrapy

class MySpider(scrapy.Spider):
    name = "example"
    start_urls = [
        "https://www.example.com",
    ]

    def parse(self, response):
        # Yield the text of every <h1> heading on the page
        for title in response.css('h1::text'):
            yield {
                'title': title.get(),
            }
Run the Spider:
scrapy crawl example
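Scrapy can also write the yielded items straight to a file through its feed exports, so no extra code is needed (the output file name here is just an example):
scrapy crawl example -o titles.json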
Selenium:
Use Selenium to open a web page and take a screenshot:
from selenium import webdriver
# Launch the Chrome browser
driver = webdriver.Chrome()
# Open the page
driver.get("https://www.example.com")
# Take a screenshot and save it to a file
driver.save_screenshot("screenshot.png")
# Close the browser
driver.quit()
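Besides screenshots, Selenium can locate elements and read their text; a minimal sketch using the Selenium 4 By locator (the h1 tag is just an example target):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com")
# Find the first <h1> element and print its visible text
heading = driver.find_element(By.TAG_NAME, "h1")
print("Heading:", heading.text)
driver.quit()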
lxml:
Use lxml to parse the HTML and extract links:
from lxml import html
import requests
# Send a GET request
response = requests.get("https://www.example.com")
# Parse the HTML
tree = html.fromstring(response.content)
# Extract every href attribute from the <a> tags
links = tree.xpath('//a/@href')
for link in links:
print("Link:", link)
PyQuery:
Use PyQuery to parse the HTML and extract the title:
from pyquery import PyQuery as pq
import requests
# Send a GET request
response = requests.get("https://www.example.com")
# Create a PyQuery document from the response body
doc = pq(response.text)
# Extract the title text
title = doc('title').text()
print("Title:", title)
Splash:
Use Splash to render JavaScript and fetch the rendered page content:
import requests

# Ask the Splash HTTP API to render the page
url = "http://localhost:8050/render.html"
params = {
    'url': "https://www.example.com",
    'wait': 2,  # wait 2 seconds so the JavaScript has time to finish loading
}
response = requests.get(url, params=params)
# The response body is the rendered HTML
rendered_html = response.text
print(rendered_html)
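The example above assumes a Splash instance is already listening on localhost:8050; if one is not running, a common way to start it (assuming Docker is installed) is:
docker run -p 8050:8050 scrapinghub/splash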
Tornado:
Use Tornado to build a simple asynchronous crawler:
import tornado.ioloop
import tornado.httpclient

async def fetch_url(url):
    http_client = tornado.httpclient.AsyncHTTPClient()
    response = await http_client.fetch(url)
    print("Fetched URL:", url)
    return response.body

async def main():
    urls = ["https://www.example.com", "https://www.example2.com"]
    for url in urls:
        html = await fetch_url(url)
        # Process the HTML content here

if __name__ == "__main__":
    tornado.ioloop.IOLoop.current().run_sync(main)
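The loop above awaits each URL one at a time; to download them concurrently, the coroutines can be awaited together with tornado.gen.multi. A rough sketch:
import tornado.gen
import tornado.httpclient
import tornado.ioloop

async def fetch_url(url):
    http_client = tornado.httpclient.AsyncHTTPClient()
    response = await http_client.fetch(url)
    print("Fetched URL:", url)
    return response.body

async def main():
    urls = ["https://www.example.com", "https://www.example2.com"]
    # gen.multi runs the fetches concurrently and returns their results as a list
    bodies = await tornado.gen.multi([fetch_url(url) for url in urls])
    print("Fetched", len(bodies), "pages")

if __name__ == "__main__":
    tornado.ioloop.IOLoop.current().run_sync(main)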
Gevent:
Use Gevent to fetch multiple URLs concurrently:
from gevent import monkey
monkey.patch_all()  # patch the standard library so blocking I/O yields to other greenlets

import gevent
import requests

def fetch_url(url):
    response = requests.get(url)
    print("Fetched URL:", url)
    # Process the HTML content here

urls = ["https://www.example.com", "https://www.example2.com"]
jobs = [gevent.spawn(fetch_url, url) for url in urls]
gevent.joinall(jobs)
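When the URL list is long, a gevent.pool.Pool bounds how many greenlets run at once; a minimal sketch (the pool size of 2 is arbitrary):
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

def fetch_url(url):
    response = requests.get(url)
    print("Fetched URL:", url, response.status_code)

urls = ["https://www.example.com", "https://www.example2.com"]
# At most 2 greenlets (and therefore 2 requests) run at the same time
pool = gevent.pool.Pool(2)
pool.map(fetch_url, urls)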
Aiohttp:
Use Aiohttp to fetch multiple URLs asynchronously:
import aiohttp
import asyncio

async def fetch_url(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()
            print("Fetched URL:", url)
            # Process the HTML content here

async def main():
    urls = ["https://www.example.com", "https://www.example2.com"]
    await asyncio.gather(*(fetch_url(url) for url in urls))

asyncio.run(main())
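Opening a new ClientSession per URL works, but reusing one session lets aiohttp pool connections across requests; a minimal sketch of that variant:
import aiohttp
import asyncio

async def fetch_url(session, url):
    async with session.get(url) as response:
        html = await response.text()
        print("Fetched URL:", url, len(html), "bytes")

async def main():
    urls = ["https://www.example.com", "https://www.example2.com"]
    # One shared session reuses TCP connections across all the requests
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch_url(session, url) for url in urls))

asyncio.run(main())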