scrapy startproject project_name
Project created successfully:
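Roughly, the generated project looks like this (the exact files can vary slightly between Scrapy versions):

project_name/
    scrapy.cfg            # deploy configuration
    project_name/
        __init__.py
        items.py          # Item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider files live here
            __init__.py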
scrapy genspider name www.xxx.com
# name is the spider's name, www.xxx.com is the site it is allowed to crawl
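This creates spiders/name.py with a skeleton roughly like the following (newer Scrapy versions format it slightly differently):

import scrapy


class NameSpider(scrapy.Spider):
    name = 'name'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        pass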
Before running, tweak the settings configuration, then start the spider:
scrapy crawl name
# name is the spider's name attribute (the spider file's name)
Print the response to check it:
It prints successfully:
But you will notice a lot of log output; you can silence it like this:
scrapy crawl name --nolog
You can see there is no log output any more:
You can also restrict logging to errors only by setting the log level in settings.py:
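That is a single line in settings.py (the same setting appears again in the full configuration later in this post):

# settings.py
LOG_LEVEL = 'ERROR'   # only ERROR-level messages are printed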
After that a plain scrapy crawl name is enough; it won't flood the console with logs and will only show error messages.
Scrapy has XPath selectors built in.
I want to grab the novel titles and author names.
A quick test:
import scrapy


class BiquSpider(scrapy.Spider):
    name = 'biqu'
    # allowed domains
    # allowed_domains = ['www.bbiquge.net']
    start_urls = ['http://www.bbiquge.net/']

    def parse(self, response):
        ul_list = response.xpath('//*[@id="mainleft"]/div[2]/ul')
        text = ul_list[0].xpath('.//text()')
        print(text)
What it returns is a list of Selector objects.
Change it to print(text.extract()):
extract() pulls the value out of each Selector object's data attribute.
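A minimal sketch of the difference (the XPath here is just a placeholder):

# inside parse()
titles = response.xpath('//li/a/text()')
print(titles)                  # a SelectorList of Selector objects
print(titles.extract())        # a list of plain strings
print(titles.extract_first())  # the first string, or None if nothing matched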
Modify the code a bit:
import scrapy


class BiquSpider(scrapy.Spider):
    name = 'biqu'
    # allowed domains
    # allowed_domains = ['www.bbiquge.net']
    start_urls = ['http://www.bbiquge.net/']

    def parse(self, response):
        ul_list = response.xpath('//*[@id="mainleft"]/div[2]/ul')
        aut = ul_list.xpath('./li/span/text()')
        book = ul_list.xpath('./li/a/text()')
        for i in range(0, len(aut)):
            print(book[i].extract(), aut[i].extract())
Fetched successfully:
scrapy crawl name -o ./filename.csv
Let's export the novel list this way:
import scrapy


class BiquSpider(scrapy.Spider):
    name = 'biqu'
    # allowed domains
    # allowed_domains = ['www.bbiquge.net']
    start_urls = ['http://www.bbiquge.net/']

    def parse(self, response):
        ul_list = response.xpath('//*[@id="mainleft"]/div[2]/ul')
        aut = ul_list.xpath('./li/span/text()').extract()
        book = ul_list.xpath('./li/a/text()').extract()
        # collect the parsed data and return it so -o can export it
        datas = []
        dic = {
            "书名": book,  # book titles
            "作者": aut    # authors
        }
        datas.append(dic)
        return datas
Saved successfully.
But -o can only export the following formats: json, jsonlines, jl, csv, xml, marshal, pickle.
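As a side note, assuming Scrapy 2.1 or later, the same export can also be configured in settings.py with the FEEDS setting instead of passing -o on the command line:

# settings.py
FEEDS = {
    'book_list.csv': {'format': 'csv', 'encoding': 'utf8'},
}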
To store the data in any other format you need an item pipeline:
with pipelines, the parsed data is packed into Item objects, which are then handed to the pipeline for storage.
1. Modify the spider file:
import scrapy
from ..items import ScrapyBiquItem


class BiquSpider(scrapy.Spider):
    name = 'biqu'
    # allowed domains
    # allowed_domains = ['www.bbiquge.net']
    start_urls = ['http://www.bbiquge.net/']

    def parse(self, response):
        ul_list = response.xpath('//*[@id="mainleft"]/div[2]/ul')
        aut = ul_list.xpath('./li/span/text()').extract()
        book = ul_list.xpath('./li/a/text()').extract()
        for i in range(0, len(aut)):
            item = ScrapyBiquItem()
            item['aut'] = aut[i]
            item['book'] = book[i]
            yield item  # hand the item over to the pipeline
2. Add an Item class in items.py:
import scrapy


class ScrapyBiquItem(scrapy.Item):
    # define the fields for your item here like:
    aut = scrapy.Field()
    book = scrapy.Field()
3. Add the methods in pipelines.py:
class ScrapyBiquPipeline:
    fp = None

    # Overridden parent-class method, called once when the spider starts
    def open_spider(self, spider):
        print('start')
        self.fp = open('./book_list.txt', 'w', encoding='utf-8')

    # Called once for every item the spider yields
    def process_item(self, item, spider):
        book = item['book']
        aut = item['aut']
        self.fp.write(book + '--' + aut + '\n')
        return item

    # Called once when the spider closes
    def close_spider(self, spider):
        print('end')
        self.fp.close()
4. Modify the configuration file (settings.py):
BOT_NAME = 'scrapy_biqu'
SPIDER_MODULES = ['scrapy_biqu.spiders']
NEWSPIDER_MODULE = 'scrapy_biqu.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Only show ERROR-level log messages
LOG_LEVEL = 'ERROR'
ITEM_PIPELINES = {
    'scrapy_biqu.pipelines.ScrapyBiquPipeline': 300,  # the number is the priority; see the note below
}
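If more than one pipeline is enabled, the priority number controls the order: items flow through lower numbers first. A hypothetical sketch (JsonWriterPipeline is a made-up second pipeline, only for illustration):

ITEM_PIPELINES = {
    'scrapy_biqu.pipelines.ScrapyBiquPipeline': 300,   # runs first
    'scrapy_biqu.pipelines.JsonWriterPipeline': 400,   # hypothetical, would run second
}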
5. Run the spider:
Saved successfully:
You can also add User-Agent spoofing and IP proxies with a downloader middleware (in middlewares.py):
import random

from scrapy import signals
from fake_useragent import UserAgent


class ScrapyBiquDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    # Free proxy pools, split by the scheme of the request they will serve
    http = ['122.9.101.6:8888',
            '101.200.127.149:3129',
            '183.236.123.242:8060',
            '58.20.184.187:9091',
            '47.106.105.236:80',
            '202.55.5.209:8090']
    https = [
        '139.196.155.96:8080',
        '47.102.193.144:8890'
    ]

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    # Intercept every request: set a random User-Agent and a random proxy
    def process_request(self, request, spider):
        request.headers['User-Agent'] = UserAgent().random
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(self.http)
        else:
            request.meta['proxy'] = 'https://' + random.choice(self.https)
        return None

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Add this to the configuration file:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_biqu.middlewares.ScrapyBiquDownloaderMiddleware': 543,
}
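To try this out you also need the fake-useragent package that the middleware imports; something like:

pip install fake-useragent

should do. After that, scrapy crawl biqu sends every request with a random User-Agent and one of the proxies from the lists above (those proxy addresses are only examples and may well be dead by the time you run this).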