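Note: This article records what I learned while studying. It is meant for my own review and as a reference for others. If you spot any mistakes, please point them out so we can discuss them and improve together.
Sources:
Outline and main content: Cui Qingcai (崔庆才), Python3网络爬虫开发实战, 2nd edition
Requests documentation (Chinese): https://www.w3cschool.cn/requests2/
The requests library is a higher-level wrapper that is more convenient to use than urllib (internally it builds on urllib3). Most companies use requests in practice, so studying it side by side with urllib helps the material stick.
1. Installing the requests library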
pip install requests
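2. requests.get() compared with urllib.request.urlopen()
Input:
import requests
res = requests.get('https://www.baidu.com/')
print(type(res))        # type of the response object
print(res.status_code)  # status code
print(type(res.text))   # type of the response body
print(res.text[:100])   # response body (only the first 100 characters, the full body is long)
print(res.headers)      # response headers
print(res.history)      # redirect history of the request
print(res.cookies)      # cookies
print(res.url)          # final URL of the response
Output: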
<class 'requests.models.Response'>
200
<class 'str'>
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charse
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 04 Oct 2022 02:40:51 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:24:33 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
[]
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
https://www.baidu.com/
Summary: besides get(), requests also provides post(), put(), delete() and other methods.
① GET requests
a. How do you add parameters to the request headers and the request body?
Input:
import requests
url = 'http://www.httpbin.org/get'
# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0',
    'Host': 'www.httpbin.org'
}
# Form fields; data= sends them in the request body
data = {
    'username': 'jack',
    'password': 'abc123456'
}
res = requests.get(url=url, data=data, headers=headers)
print(res.text)
Output:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br",
"Content-Length": "32",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "www.httpbin.org",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0",
"X-Amzn-Trace-Id": "Root=1-633ba026-08f0fed010e0327633825313"
},
"origin": "120.227.32.26",
"url": "http://www.httpbin.org/get"
}
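Compare this with the original response: the form fields travel in the request body (note the Content-Length and Content-Type headers), while "args" stays empty because no URL query parameters were passed.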
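To see the difference between URL query parameters and body data without sending anything over the network, the request can be built and inspected locally. The following is a minimal sketch, assuming the same httpbin URL and form fields as in the example; requests.Request(...).prepare() is used here only for illustration, since it builds a request without sending it.

```python
import requests

url = 'http://www.httpbin.org/get'
fields = {'username': 'jack', 'password': 'abc123456'}

# params= encodes the fields into the URL query string (httpbin reports these under "args")
with_params = requests.Request('GET', url, params=fields).prepare()

# data= form-encodes the fields into the request body
with_data = requests.Request('GET', url, data=fields).prepare()

print(with_params.url)   # query string appended to the URL
print(with_params.body)  # no request body
print(with_data.url)     # URL unchanged
print(with_data.body)    # form-encoded request body
print(with_data.headers['Content-Type'])
```

For a GET request, params= is usually the better fit; data= is shown because the example sends its fields in the body.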
b. Fetching binary data
Input:
import requests
res = requests.get('https://scrape.center/favicon.ico')
print(res.text)
print(res.content)
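Output: res.text shows garbled characters because the binary data is decoded as text, while res.content holds the raw bytes. Writing those bytes to a local file named favicon.ico saves the image to disk:
import requests
res = requests.get('https://scrape.center/favicon.ico')
# 'wb' opens the file for writing in binary mode
with open('favicon.ico', 'wb') as f:
    f.write(res.content)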
② POST requests: these work much like GET requests; just pass the parameters.
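As a rough illustration of passing parameters to a POST request, the sketch below prepares two requests locally (nothing is sent): one with data= for a form-encoded body and one with json= for a JSON body. The URL http://www.httpbin.org/post and the field values are assumptions reused from the other examples.

```python
import json

import requests

url = 'http://www.httpbin.org/post'
payload = {'username': 'jack', 'password': 'abc123456'}

# data= sends the dict as an HTML form (application/x-www-form-urlencoded)
form_req = requests.Request('POST', url, data=payload).prepare()

# json= serializes the dict to JSON and sets the Content-Type accordingly
json_req = requests.Request('POST', url, json=payload).prepare()

# The JSON body may be bytes, so decode it before parsing it back
json_body = json_req.body.decode('utf-8') if isinstance(json_req.body, bytes) else json_req.body

print(form_req.headers['Content-Type'], form_req.body)
print(json_req.headers['Content-Type'], json.loads(json_body))
```

To actually send the request, requests.post(url, data=payload) or requests.post(url, json=payload) takes the same arguments.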
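3. Responses
The common response attributes were introduced above. In addition, every status code has one or more named attributes on requests.codes: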
# Informational status codes
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
# Success status codes
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),
# Redirection status codes
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
'resume_incomplete', 'resume',), # These 2 to be removed in 3.0
# Client error status codes
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),
# Server error status codes
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication')
For example, to check whether the result is a 404, compare res.status_code with the built-in constant requests.codes.not_found.
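The comparison can be sketched as follows. The helper name describe is hypothetical and only serves the illustration; requests.codes.ok and requests.codes.not_found are the named constants from the table above.

```python
import requests


def describe(status_code):
    """Map a status code to a short label using the named constants in requests.codes."""
    if status_code == requests.codes.ok:
        return 'ok'
    if status_code == requests.codes.not_found:
        return 'not found'
    return 'other'


print(requests.codes.ok, requests.codes.not_found)
print(describe(404))
```

In real code the argument would be res.status_code from a response.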
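4. Advanced usage
get() accepts further parameters, largely the same ones as post().
① File upload
The files parameter takes file objects, which must be opened in binary mode.
Input:
import requests
# Open the file in binary mode; the with block closes it after the upload
with open('favicon.ico', 'rb') as f:
    res = requests.post('http://www.httpbin.org/post', files={'file': f})
print(res.text)
Output: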
{
"args": {},
"data": "",
"files": {
"file": "data:application/octet-stream;base64,AAABAAEAICAAAAEAIACoEAAAFgAAACgAAAAgAAAAQAAAAAEAIAAAAAAAABAAABILAAASCwAAAAAAAAAAAABXP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+
v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1
c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1hA6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WEDr/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9YQOv/Ujrq/0ox6v9LMur/SzLq/0sy6v9LMu
r/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0ox6v9SOur/WEDr/1c/6/9XP+v/Vz/r/1c/6/9XP+v/WEDr/1I66v9yXe7/n5H0/5qK8/+bi/P/m4vz/5uL8/+bi/P/m4vz/5uL8/+bi/P/m4vz/5uL8/+bi/P/m4vz/5uL8/+bi/P/m4vz/5
uL8/+bi/P/m4vz/5uL8/+ZivP/n5H0/3Jd7v9SOur/WEDr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///////////////////////////////////////////////////////////////////////////////////////////////////z8////////mozz/0sy6v9aQuv/Vz/r/1c/6/9XP+
v/Vz/r/1pC6/9LMur/mYrz///////6+f7//Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P//+vn+//////+aivP/SzLq/1pC6/9XP+v/Vz/r/1c/6/9XP+v/WkLr/0sy6v+ajPP///////z8//////////////7+///+/v///v7///
7+///+/v///v7///7+///+/v///v7///7+///+/v///v7///7+///+/v/////////////8/P///////5uL8/9LMur/WkLr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///////////////////////////////////////////////////////////////////////////////////////////
////////z8////////m4vz/0sy6v9aQuv/Vz/r/1c/6/9XP+v/Vz/r/1pC6/9LMur/mozz///////8/P///v7///////+VhfL/dF/v/3tn7/96Zu//embv/3pm7/96Zu//embv/3pm7/96Zu//embv/3tn7/90X+7/lYXz///////+/v///Pz///////+bi/P/SzLq/1pC6/9XP+v/Vz/r/1c/6/9XP+v/WkLr/0
sy6v+ajPP///////z8///+/v///////3Rg7v9HLen/UTjq/0826v9PNur/Tzbq/0826v9PNur/Tzbq/0826v9PNur/UTjq/0cu6f90X+7///////7+///8/P///////5uL8/9LMur/WkLr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///7+////////e2jv/1E46v9aQ+v/WUHr/1lC6/9bRO
z/W0Ts/1tE7P9bROz/W0Ts/1tE7P9dRuz/VDzr/31q8P///////v7///z8////////m4vz/0sy6v9aQuv/Vz/r/1c/6/9XP+v/Vz/r/1pC6/9LMur/mozz///////8/P///v7///////96Zu7/Tzbq/1lB6/9YQOv/VDzr/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0006v9DKen/cVzu///////+/v///Pz///
////+bi/P/SzLq/1pC6/9XP+v/Vz/r/1c/6/9XP+v/WkLr/0sy6v+ajPP///////z8///+/v///////3pm7v9PNur/WULr/1Q76/9mT+v/morz/5iI8/+YiPP/mIjz/5iI8/+YiPP/mYn0/5SD8/+toPb////////////8/P///////5uL8/9LMur/WkLr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8/////
///Pz///7+////////embu/0826v9aQ+v/Tzbr/3xq6f////7//v7///////////////////////////////////////////////////z8////////m4vz/0sy6v9aQuv/Vz/r/1c/6/9XP+v/Vz/r/1pC6/9LMur/mozz///////8/P///v7///////96Zu7/Tzbq/1pD6/9PNuv/e2rq/////v/7+////Pz///
z8///8/P///Pz///z8///8/P///f3//////////////Pz///////+ai/P/SzLq/1pC6/9XP+v/Vz/r/1c/6/9XP+v/WkLr/0sy6v+ajPP///////z8///+/v///////3pm7v9PNur/WkPr/0826/98aur////+//39///+/v///v7///7+///+/v///v7///7+///+/v///v7///7+///7+////////5qL8/9LMu
r/WkLr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///7+////////embu/0826v9aQ+v/Tzbr/3xq6f////7//v7//////v////7////+/////v////7////+/////v////7////+//z8/v//////m4zz/0sy6v9aQuv/Vz/r/1c/6/9XP+v/Vz/r/1pC6/9LMur/mozz///////8/P///v7///
////96Zu7/Tzbq/1lB6/9VPev/X0np/31s6f98aur/fGrq/3xq6v98aur/fGrq/3xq6v98aur/fGrq/3xq6v98aur/e2rq/39t6v9mUOr/VDzr/1hA6/9XP+v/Vz/r/1c/6/9XP+v/WkLr/0sy6v+ajPP///////z8///+/v///////3pm7v9PNur/WUHr/1c/6/9VPev/TzXr/0826/9PNuv/Tzbr/0826/9PNu
v/Tzbr/0826/9PNuv/Tzbr/0826/9PNuv/TjXr/1Q76/9YQOv/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///7+////////e2jv/1E46v9aQ+v/WUHr/1lB6/9bQ+v/WkPr/1pD6/9aQ+v/WkPr/1pD6/9aQ+v/WkPr/1pD6/9aQ+v/WkPr/1pD6/9bQ+v/WEHr/1c/6/9XP+v/Vz/r/1
c/6/9XP+v/Vz/r/1pC6/9LMur/mozz///////8/P///v7///////90YO7/SC3p/1E46v9PNur/Tzbq/0826v9PNur/Tzbq/0826v9PNur/Tzbq/0826v9PNur/Tzbq/0826v9PNur/UDfq/0826v9UPOv/WEDr/1c/6/9XP+v/Vz/r/1c/6/9XP+v/WkLr/0sy6v+ajPP///////z8///+/v///////5WF8v9zYO
7/e2jv/3pm7/96Zu//embv/3pm7/96Zu//embv/3pm7/96Zu//embv/3pm7/96Zu//embv/3pm7/95Zu//fGnv/2RP7f9VPOv/WEDr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///////////////////////////////////////////////////////////////////////////////////
////////////////z8////////mozz/0sy6v9aQuv/Vz/r/1c/6/9XP+v/Vz/r/1pC6/9LMur/mozz///////8/P/////////////+/v///v7///7+///+/v///v7///7+///+/v///v7///7+///+/v///v7///7+///+/v///v7///7+///+/v///Pv///////+ai/P/SzLq/1pC6/9XP+v/Vz/r/1c/6/9XP+
v/WkLr/0sy6v+ZivP///////r5/v/8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///6+f7//////5mK8/9LMur/WkLr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///////////////////////////////////
////////////////////////////////////////////////////////////////z8////////m4zz/0sy6v9aQuv/Vz/r/1c/6/9XP+v/Vz/r/1hA6/9SOur/cl7u/5+R9P+ZivP/mozz/5qM8/+ajPP/mozz/5qM8/+ajPP/mozz/5qM8/+ajPP/mozz/5qM8/+ajPP/mozz/5qM8/+ajPP/mozz/5qM8/+ajP
P/mYrz/5+R9P9yXe7/Ujrq/1hA6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1hA6/9SOur/SjHq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SjHq/1I66v9YQOv/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1
hA6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WEDr/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+
v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1
c/6/9XP+v/Vz/r/1c/6/9XP+v/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="
},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br",
"Content-Length": "4433",
"Content-Type": "multipart/form-data; boundary=4e29f7f88aa42e12b942b3607d866316",
"Host": "www.httpbin.org",
"User-Agent": "python-requests/2.28.1",
"X-Amzn-Trace-Id": "Root=1-633bba12-33dc358e435dfae534d00d8a"
},
"json": null,
"origin": "120.227.32.26",
"url": "http://www.httpbin.org/post"
}
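The upload reported application/octet-stream because only a bare file object was passed. As a hedged sketch, a (filename, content, content_type) tuple can be passed instead to set the filename and MIME type explicitly. The bytes below are a placeholder rather than a real icon, and the request is only prepared locally, not sent.

```python
import requests

# Placeholder bytes standing in for the real icon data (assumption for this sketch)
fake_icon = b'\x00\x00\x01\x00'

# A (filename, content, content_type) tuple sets the filename and MIME type explicitly
files = {'file': ('favicon.ico', fake_icon, 'image/x-icon')}

# Build the request locally without sending it
prepared = requests.Request('POST', 'http://www.httpbin.org/post', files=files).prepare()

print(prepared.headers['Content-Type'])             # multipart/form-data with a generated boundary
print(b'filename="favicon.ico"' in prepared.body)   # the filename is part of the multipart body
```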
② Cookies (compare with setting cookies in urllib)
Input:
import requests
res = requests.get('https://www.gitee.com')
print(res.cookies)
# Iterate over the cookie jar as name/value pairs
for key, value in res.cookies.items():
    print(key + ':' + value)
Output:
<RequestsCookieJar[<Cookie gitee-session-n=dllHb0lPWTR2eTkwS3gvM2dNdy9GcmZ5Q2tDQzFIb1Y1TDh3VnR6TDVHL1A5MVZJclhRLzN3UEN1Q3U2Yko4L2hIekE5b1lGSE1ETmRZL0FxYTRKNktoYnB2M25qVXJYY1RsdkFXT2NSajZnUnZGR25UbzZ0eGY2S0Y0ak5tMTFsQnc1dVRoZlRYZlBzc
HI1ZlZua3djWWNrTDZUaHRVYzBvNGVCTTJmaUZzNzhjUlBheEpYTXc3VTArc25HcTNJLS1qc2V1TXZSaDkwNVlIUzVUZ1hGQnFRPT0%3D--599965dcd2feafe5aa01b712881b026ab2eaf05a for .gitee.com/>, <Cookie user_locale=zh-CN for .gitee.com/>, <Cookie oschina_new_us
er=false for gitee.com/>]>
gitee-session-n:dllHb0lPWTR2eTkwS3gvM2dNdy9GcmZ5Q2tDQzFIb1Y1TDh3VnR6TDVHL1A5MVZJclhRLzN3UEN1Q3U2Yko4L2hIekE5b1lGSE1ETmRZL0FxYTRKNktoYnB2M25qVXJYY1RsdkFXT2NSajZnUnZGR25UbzZ0eGY2S0Y0ak5tMTFsQnc1dVRoZlRYZlBzcHI1ZlZua3djWWNrTDZUaHRVYzBv
NGVCTTJmaUZzNzhjUlBheEpYTXc3VTArc25HcTNJLS1qc2V1TXZSaDkwNVlIUzVUZ1hGQnFRPT0%3D--599965dcd2feafe5aa01b712881b026ab2eaf05a
user_locale:zh-CN
oschina_new_user:false
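The example above only reads the cookies from a response. To send cookies with a request, a dict can be passed through the cookies parameter. A minimal sketch, assuming a made-up cookie value and preparing the request locally instead of sending it:

```python
import requests

# Hypothetical cookie value for illustration
cookies = {'user_locale': 'zh-CN'}

# cookies= attaches the cookies to the request; prepare() builds it without sending
prepared = requests.Request('GET', 'https://www.gitee.com', cookies=cookies).prepare()

print(prepared.headers['Cookie'])
```

The same cookies argument works with requests.get() and requests.post().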
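③ Maintaining a session
Calling get() and post() from requests directly does simulate web requests, but each call runs in its own session, as if the two requests were opened in two different browsers. Suppose the first request logs in via POST and the second fetches the logged-in user's profile via GET. To make the second request behave like a new tab in the same browser rather than a new browser, without attaching cookies to every request by hand (which is tedious), use a Session object.
import requests
s = requests.Session()
# The server sets the cookie number=123456 on this session
s.get('https://www.httpbin.org/cookies/set/number/123456')
# The same session sends the cookie back automatically
r = s.get('https://www.httpbin.org/cookies')
print(r.text)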
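What the Session adds can also be inspected offline. The sketch below stores a cookie on the session by hand (standing in for the /cookies/set call, which does this through the server) and compares a request prepared through the session with one prepared on its own. Nothing is sent over the network.

```python
import requests

session = requests.Session()
# Store a cookie on the session by hand instead of receiving it from a server
session.cookies.set('number', '123456')

request = requests.Request('GET', 'https://www.httpbin.org/cookies')

# Prepared through the session: the session's cookies are merged in
via_session = session.prepare_request(request)
# Prepared on its own: no session state, so no Cookie header
standalone = request.prepare()

print(via_session.headers.get('Cookie'))
print(standalone.headers.get('Cookie'))
```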
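④ SSL certificate verification
Some sites have no HTTPS certificate configured, or use one that is not trusted by a CA. Browsers then show an SSL certificate error, and requesting such a site directly fails with an invalid-certificate error (requests.exceptions.SSLError).
The verify parameter controls this check. It defaults to True, so certificates are verified automatically; pass verify=False to skip verification.
import requests
res = requests.get("https://ssr2.scrape.center/", verify=False)
print(res.status_code)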
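⑤ Timeouts
As with urllib, pass timeout in the request arguments. timeout=1 means an exception is raised if the server does not respond within 1 second.
A request actually has two phases, connect and read, and each can be given its own limit with a tuple such as timeout=(5, 30). A single value such as timeout=1 is applied to each phase separately; it is not the sum of the two.
To wait indefinitely, set timeout to None or omit the timeout parameter.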
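The (connect, read) tuple can be tried without an external site. The sketch below starts a deliberately slow HTTP server on localhost and requests it twice with different read limits. SlowHandler, try_get and DELAY are hypothetical names introduced for this illustration, and the local server is an assumption made so that the example does not depend on the network.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import requests

DELAY = 1.0  # seconds the test server waits before answering


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that simulates a slow server."""

    def do_GET(self):
        time.sleep(DELAY)
        try:
            self.send_response(200)
            self.send_header('Content-Length', '2')
            self.end_headers()
            self.wfile.write(b'ok')
        except OSError:
            pass  # the client gave up and closed the connection

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def try_get(session, url, timeout):
    """Return the status code, or the name of the timeout exception that was raised."""
    try:
        return session.get(url, timeout=timeout).status_code
    except requests.exceptions.ConnectTimeout:
        return 'ConnectTimeout'
    except requests.exceptions.ReadTimeout:
        return 'ReadTimeout'


server = ThreadingHTTPServer(('127.0.0.1', 0), SlowHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local test

# (connect, read): the read limit is shorter than the server delay
too_short = try_get(session, url, timeout=(3, 0.2))
# The read limit is longer than the server delay
long_enough = try_get(session, url, timeout=(3, 5))
print(too_short, long_enough)

session.close()
server.shutdown()
server.server_close()
```

Catching requests.exceptions.Timeout instead would cover both ConnectTimeout and ReadTimeout in a single except clause.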
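⑥ Authentication
If a page requires a login via HTTP Basic authentication, pass the auth parameter with the request; a (username, password) tuple is enough.
import requests
res = requests.get('https://ssr3.scrape.center/', auth=('admin', 'admin'))
print(res.status_code)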
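What the auth tuple does can be checked locally. The sketch below prepares the request without sending it and shows that the tuple is shorthand for requests.auth.HTTPBasicAuth, which puts the base64-encoded "username:password" pair into the Authorization header.

```python
import base64

import requests
from requests.auth import HTTPBasicAuth

url = 'https://ssr3.scrape.center/'

# A (username, password) tuple is shorthand for HTTPBasicAuth
with_tuple = requests.Request('GET', url, auth=('admin', 'admin')).prepare()
with_class = requests.Request('GET', url, auth=HTTPBasicAuth('admin', 'admin')).prepare()

# Basic auth sends "username:password", base64-encoded, in the Authorization header
expected = 'Basic ' + base64.b64encode(b'admin:admin').decode('ascii')

print(with_tuple.headers['Authorization'])
print(with_tuple.headers['Authorization'] == with_class.headers['Authorization'] == expected)
```

Because base64 is only an encoding and not encryption, Basic authentication should be used over HTTPS.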
⑦ Proxies
These will be covered in a full chapter later.