Two problems came up at work recently: the data team needed to crawl some image data from the web, and we also needed to batch-download the attachments from the company's internal document-collaboration site. So I wrote two scripts to get the job done.
Step 1: Send a request to a known URL and receive the server's response. For pages that require login, either copy the cookie from the browser into the request header manually, or simulate the login to obtain the cookie automatically.
Step 2: Parse the response and locate the tags of interest (usually the URLs of the images or files we want).
Step 3: Send requests to those target URLs and save the data locally.
Python offers full crawling frameworks such as Scrapy and Pyspider, but since we only need small, one-off features, a few off-the-shelf libraries are enough.
(If you want to simulate the login and obtain the cookie automatically, see 爬虫实战学习笔记_2 网络请求urllib模块+设置请求头+Cookie+模拟登陆-CSDN博客.)
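The gist of that approach is to let urllib manage cookies through a `CookieJar`. Below is only a minimal sketch, assuming a form-based login endpoint; the login URL and the form field names are placeholders that have to be replaced with the real site's values.

```python
import urllib.parse
import urllib.request
import http.cookiejar

# An opener backed by a CookieJar keeps any cookies returned by the server
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))

# Placeholder login endpoint and form fields -- replace with the site's real ones
login_data = urllib.parse.urlencode({
    "username": "your_account",
    "password": "your_password",
}).encode("utf-8")
opener.open("https://example.com/dologin", login_data)

# Later requests through the same opener carry the session cookie automatically
response = opener.open("https://example.com/some/protected/page")
print(response.getcode())
```

The script I actually used, shown below, simply reuses a cookie copied from a logged-in browser session.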
```python
import os
import urllib.request

import requests
from bs4 import BeautifulSoup

# Cookie copied from a logged-in browser session (truncated here)
headers = {
    "Cookie": 'confluence.list.pages.cookie=list-content-tree;.......'
}

# Step 1: request the page that lists the attachments
# (url is the address of the target page and must be set beforehand)
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)

# Step 2: parse the HTML and pick out the attachment links
html = response.read().decode("utf8")
soup = BeautifulSoup(html, "lxml")
a_list = soup.find_all("a")
for a in a_list:
    if "class" in a.attrs:
        if "filename" in a["class"]:
            filename = a.text.strip()
            download_url = a['href']
            print(download_url)

            # Step 3: download each attachment and save it locally
            file = requests.get(download_url, headers=headers)
            save_path = './download/'
            if not os.path.exists(save_path):
                os.mkdir(save_path)
            save_file = open(os.path.join(save_path, filename), 'wb')
            save_file.write(file.content)
            save_file.close()
            print('save ok')
```
The script above can crawl the attachments of a specific page, but collecting the addresses of all the pages first is the tricky part. The only approach I found so far is to look for a pattern in the URLs: the pageId is a nine-digit numeric string, so after roughly narrowing down the range I simply brute-force over it.
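A rough sketch of that brute-force walk is given below; the base URL, the id range, and the validity check are placeholders (the site appears to be a Confluence instance judging by the cookie name, so the `viewpage.action?pageId=...` pattern is assumed), and they would need to be adapted to the real host.

```python
import requests

# Placeholder base URL following Confluence's viewpage pattern; replace with the real host
base_url = "https://wiki.example.com/pages/viewpage.action"
# Reuse the same Cookie header as in the script above
headers = {
    "Cookie": 'confluence.list.pages.cookie=list-content-tree;.......'
}

valid_page_ids = []
# Walk a (narrowed-down) slice of the nine-digit pageId space
for page_id in range(123450000, 123460000):
    resp = requests.get(base_url, headers=headers, params={"pageId": page_id})
    # Heuristic check: an existing page answers 200 and contains attachment links
    if resp.status_code == 200 and "filename" in resp.text:
        valid_page_ids.append(page_id)

print(valid_page_ids)
```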
There is plenty of open-source code online for crawling keyword-based images from Baidu or Google. I took one such project and tweaked it slightly, which is enough for my current needs. The code is attached below for reference.
```python
# -*- coding: UTF-8 -*-
import json
import os

import requests
import tqdm


def configs(search, page, number):
    # Request parameters for Baidu's image-search JSON interface
    url = 'https://image.baidu.com/search/acjson'
    params = {
        "tn": "resultjson_com",
        "logid": "11555092689241190059",
        "ipn": "rj",
        "ct": "201326592",
        "is": "",
        "fp": "result",
        "queryWord": search,
        "cl": "2",
        "lm": "-1",
        "ie": "utf-8",
        "oe": "utf-8",
        "adpicid": "",
        "st": "-1",
        "z": "",
        "ic": "0",
        "hd": "",
        "latest": "",
        "copyright": "",
        "word": search,
        "s": "",
        "se": "",
        "tab": "",
        "width": "",
        "height": "",
        "face": "0",
        "istype": "2",
        "qc": "",
        "nc": "1",
        "fr": "",
        "expermode": "",
        "force": "",
        "pn": str(60 * page),
        "rn": number,
        "gsm": "1e",
        "1617626956685": ""
    }
    return url, params


def loadpic(number, page, path):
    # Keep requesting result pages until `number` images have been saved
    while True:
        if number == 0:
            break
        url, params = configs(search, page, number)
        try:
            response = requests.get(url, headers=header, params=params).content.decode('utf-8')
            result = json.loads(response)
            url_list = []
            for data in result['data'][:-1]:  # skip the last entry, which is usually empty
                url_list.append(data['thumbURL'])
            for i in range(len(url_list)):
                getImg(url_list[i], 60 * page + i, path)
                bar.update(1)
                number -= 1
                if number == 0:
                    break
            page += 1
        except Exception as e:
            print(e)
            continue
    print("\nfinish!")


def getImg(url, idx, result_path):
    # Download a single thumbnail and save it as <idx+1>.jpg
    img = requests.get(url, headers=header)
    file = open(os.path.join(result_path, str(idx + 1) + '.jpg'), 'wb')
    file.write(img.content)
    file.close()


if __name__ == '__main__':
    search = "溜冰"  # search keyword
    number = 100  # target number of images
    result_path = os.path.join(os.getcwd(), search)
    if not os.path.exists(result_path):
        os.mkdir(result_path)
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'}

    bar = tqdm.tqdm(total=number)
    page = 0
    loadpic(number, page, result_path)
```
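If several keywords need to be crawled in one run, a small driver on top of the script also works. The sketch below is only an illustration meant to replace the body of the `__main__` block: the keyword list is an example, `header` is assumed to be defined as above, and `search` and `bar` are reassigned before each call because `loadpic` reads them as globals.

```python
keywords = ["溜冰", "滑雪"]  # example keywords, replace with your own
number = 100  # images per keyword

for search in keywords:
    # One output directory per keyword
    result_path = os.path.join(os.getcwd(), search)
    if not os.path.exists(result_path):
        os.mkdir(result_path)
    bar = tqdm.tqdm(total=number)
    loadpic(number, 0, result_path)
```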