[Practical Tools Series: Crawlers] Crawling News Data with Python

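Series

1. [Practical Tools Series: Crawlers] Crawling Proxy IPs with Python (to Get Around Anti-Crawler Measures)
2. [Practical Tools Series: Crawlers] Crawling News Data with Python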

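Preface

In a big-data architecture, data collection and data storage form the foundation that everything else builds on, and web crawlers are one of the main ways that data gets collected.

This article implements a simple, quick crawler that sends its requests through proxy IPs. For how to obtain the proxy IPs, see my earlier article, [Practical Tools Series: Crawlers] Crawling Proxy IPs with Python (to Get Around Anti-Crawler Measures).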

Article by szZack

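Proxy IPs

Proxy IP site: xicidaili

For the details, see [Practical Tools Series: Crawlers] Crawling Proxy IPs with Python (to Get Around Anti-Crawler Measures).
The resulting proxy IP data is saved to 'proxy_ip.pkl'.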
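
As a rough sketch of how such a file could be written and read back: the snippet below assumes the pickle holds a plain list of requests-style proxy dicts. The helper names `save_proxy_ips` and `load_proxy_ips` and the sample address are hypothetical; the real `crawl_proxy_ip.load_proxy_ip` from the companion article may store its data differently.

```python
import pickle


def save_proxy_ips(proxy_ip_list, path):
    """Write the proxy list to disk (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(proxy_ip_list, f)


def load_proxy_ips(path):
    """Read the proxy list back; return an empty list if the file is missing."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return []


# Assumed layout: one dict per proxy, in the form requests accepts for `proxies=`.
# The address is a documentation-range placeholder, not a working proxy.
sample = [{"http": "http://192.0.2.10:8080", "https": "http://192.0.2.10:8080"}]
```

Only unpickle files you produced yourself, since loading a pickle from an untrusted source can execute arbitrary code.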

Crawling code

This article crawls a small amount of financial news data as an example.

Website
- Address: http://xxx
- Fields crawled:
  url, title, click_number, html_content
- The data is saved as CSV with these columns:
  url, title, click_number, html_content, crawl_time
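
To make the record layout concrete, here is a minimal sketch that serializes one row in that column order. It uses the standard-library `csv` module rather than pandas so it runs without extra packages; the function name `rows_to_csv` and the sample record are made up for illustration.

```python
import csv
import io
import time

COLUMNS = ["url", "title", "click_number", "html_content", "crawl_time"]


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module quotes commas, quotes and newlines."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buf.getvalue()


# Made-up record for illustration; the URL and HTML are placeholders.
now = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
row = ["http://example.com/finance/1.html", "Sample title", "4297",
       '<p class="a">line one,\nline two</p>', now]
csv_text = rows_to_csv([row])
```

The point worth noticing is that raw HTML is full of commas, quotes and line breaks, so the field has to go through a real CSV writer rather than a plain `','.join(...)`.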
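Walkthrough
- Steps:
  1. Crawl the home page and extract its URLs as layer 1
  2. Crawl the layer-1 URLs; the URLs found on those pages form layer 2
  3. Crawl the layer-2 URLs; the URLs found on those pages form layer 3
  4. Stop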
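
The steps above amount to a depth-limited, breadth-first crawl. A minimal sketch of that idea, kept separate from any networking: `crawl_layers` and its parameters are hypothetical names, and `get_links` stands in for whatever function downloads a page and returns the URLs on it.

```python
def crawl_layers(start_url, get_links, max_depth=3, max_pages=1000):
    """Breadth-first crawl: layer 1 is the links on the start page, and so on.

    `get_links(url)` is a caller-supplied function returning the URLs found on a page.
    """
    seen = {start_url}
    frontier = [start_url]
    layers = []
    for _ in range(max_depth):
        next_layer = []
        for page in frontier:
            for link in get_links(page):
                # Skip URLs seen before, and stop adding once the cap is reached
                if link not in seen and len(seen) < max_pages:
                    seen.add(link)
                    next_layer.append(link)
        if not next_layer:
            break
        layers.append(next_layer)
        frontier = next_layer
    return layers


# Toy in-memory link graph standing in for real pages (hypothetical data).
toy_site = {
    "home": ["a", "b"],
    "a": ["b", "c"],
    "b": ["home", "d"],
    "c": [],
    "d": ["e"],
}
layers = crawl_layers("home", lambda url: toy_site.get(url, []))
# layers == [["a", "b"], ["c", "d"], ["e"]]
```

Keeping the `seen` set in one place is what prevents pages that link to each other from being crawled twice.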
Environment
- pandas
- python3
- Ubuntu16.04
- requests

Code
crawl_finance_news.py

1. Import dependencies

import crawl_proxy_ip  # helper module from the companion proxy-IP article
import pandas as pd
import re, time, sys, random
import requests


2. Global variables

# URLs already seen; shared by all functions so that no page is recorded twice
url_set = {}

3. Core crawling code

def crawl_finance_news(start_url):
    
    # Record format: url, title, click_number, html_content, crawl_time
    
    proxy_ip_list = crawl_proxy_ip.load_proxy_ip('proxy_ip.pkl')
    
    # Crawl the home page
    start_html = crawl_web_data(start_url, proxy_ip_list)
    #open('tmp.txt', 'w').write(start_html)
    global url_set
    url_set[start_url] = 0
    
    # Extract layer 1
    web_content_list = extract_web_content(start_html, proxy_ip_list)
    
    # Extract layer 2: follow only the entries without a click count
    length = len(web_content_list)
    for i in range(length):
        if len(web_content_list[i][2]) == 0:
            html = crawl_web_data(web_content_list[i][0], proxy_ip_list)
            web_content_list += extract_web_content(html, proxy_ip_list)
            if len(web_content_list) > 1000:  # cap for testing only
                break
    
    # Extract layer 3
    length = len(web_content_list)
    for i in range(length):
        if len(web_content_list[i][2]) == 0:
            html = crawl_web_data(web_content_list[i][0], proxy_ip_list)
            web_content_list += extract_web_content(html, proxy_ip_list)
            if len(web_content_list) > 1000:  # cap for testing only
                break
            
    # Save the data
    columns = ['url', 'title', 'click_number', 'html_content', 'crawl_time']
    df = pd.DataFrame(columns = columns, data = web_content_list)
    df.to_csv('finance_data.csv', encoding='utf-8')
    print('data_len:', len(web_content_list))
    
    
def crawl_web_data(url, proxy_ip_list):

    # Give up once every proxy has been discarded
    if len(proxy_ip_list) == 0:
        return ''
    # Pick a random proxy for this request
    proxy_ip_dict = random.choice(proxy_ip_list)
    
    try:
        html = download_by_proxy(url, proxy_ip_dict)
        print(url, 'ok')
            
    except Exception as e:
        #print('50 e', e)
        # Drop the proxy that failed, then retry with another one
        proxy_ip_list.remove(proxy_ip_dict)
        print('proxy_ip_list', len(proxy_ip_list))
        
        return crawl_web_data(url, proxy_ip_list)
        
    return html
    

def download_by_proxy(url, proxies):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.103 Safari/537.36', 'Connection':'close'}
    response = requests.get(url=url, proxies=proxies, headers=headers, timeout=10)
    response.encoding = 'utf-8'
    html = response.text
    return html
    

def extract_web_content(html, proxy_ip_list):

    # Record format: url, title, click_number, html_content, crawl_time
    
    web_content_list = []
    
    html_content = html
    # Strip whitespace and target="_blank" so that every link looks like href="...">title<
    html = html.replace(' target ="_blank"', '')
    html = html.replace(' ', '')
    html = html.replace('\r', '')
    html = html.replace('\n', '')
    html = html.replace('\t', '')
    html = html.replace('"target="_blank', '')
    
    # Sample title: "CSRC: proposal to raise the maximum prison term for securities violations"
    res = re.search('href="(http[^"><]*finance[^"><]*)">([^<]*)<', html)  # the URL must contain "finance", i.e. financial news only
    while res is not None:
        url, title = res.groups()
        #print('url, title', url, title)
        global url_set
        if url in url_set:  # skip duplicates
            html = html.replace('href="%s">%s<' %(url, title), '')
            res = re.search('href="(http[^"><]*finance[^"><]*)">([^<]*)<', html)
            continue
            
        else:
            url_set[url] = 0
        click_number = get_click_number(url, proxy_ip_list)
        #print('click_number', click_number)
        now_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        if len(click_number) == 0:  # keep the HTML only for entries that have a click count
            html_content = ''
        web_content_list.append([url, title, click_number, html_content, now_time])
        if len(web_content_list) > 200:  # test: cap of about 200 records per page
            break
        
        # Remove the link just handled so the next search finds the following one
        html = html.replace('href="%s">%s<' %(url, title), '')
        res = re.search('href="(http[^"><]*finance[^"><]*)">([^<]*)<', html)
        
    return web_content_list
    
def get_click_number(url, proxy_ip_list):
    
    html = crawl_web_data(url, proxy_ip_list)
    # e.g. 4297
    # Note: as written, this matches the first run of digits anywhere in the page
    res = re.search(r'(\d{1,})', html)
    if res is not None:
        return res.groups()[0]
        
    return ''
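
The retry logic in `crawl_web_data` can also be written as a loop instead of recursion, which avoids deep call stacks when many proxies are dead. A minimal sketch under that assumption: `fetch_with_rotation`, `max_attempts` and `fake_download` are hypothetical names, and the downloader is passed in so the idea can be tried without any network access.

```python
import random


def fetch_with_rotation(url, proxies, download, max_attempts=5):
    """Try random proxies until one works; drop each proxy that fails.

    `download(url, proxy)` is any callable that returns the page text or raises.
    Returns '' when the pool runs dry or max_attempts is reached.
    """
    for _ in range(max_attempts):
        if not proxies:
            break
        proxy = random.choice(proxies)
        try:
            return download(url, proxy)
        except Exception:
            proxies.remove(proxy)  # mutate the shared pool so later calls skip it
    return ""


# Stand-in downloader for illustration: only the proxy named "good" succeeds.
def fake_download(url, proxy):
    if proxy["http"] != "good":
        raise ConnectionError("proxy failed")
    return "<html>" + url + "</html>"


pool = [{"http": "bad1"}, {"http": "good"}, {"http": "bad2"}]
page = fetch_with_rotation("http://example.com/", pool, fake_download)
```

In the script above, `download_by_proxy` could be passed as the `download` argument, since it takes the same two arguments.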
    
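
The link extraction in `extract_web_content` searches, deletes the match from the page, and searches again. One possible alternative is a single pass with `re.finditer` over the same pattern. This is only a sketch: `extract_links` is a hypothetical name, the sample HTML is made up, and it assumes the whitespace and `target` attributes have already been stripped as in the listing.

```python
import re

# Same pattern as in the listing: an http(s) href containing "finance",
# immediately followed by the link text.
LINK_RE = re.compile(r'href="(http[^"><]*finance[^"><]*)">([^<]*)<')


def extract_links(html, seen):
    """Return (url, title) pairs not yet in `seen`, adding them to `seen` as it goes."""
    links = []
    for match in LINK_RE.finditer(html):
        url, title = match.groups()
        if url in seen:
            continue
        seen.add(url)
        links.append((url, title))
    return links


# Made-up snippet for illustration (placeholder URLs).
sample_html = (
    '<a href="http://example.com/finance/1.html">First</a>'
    '<a href="http://example.com/sports/2.html">Skip me</a>'
    '<a href="http://example.com/finance/1.html">Duplicate</a>'
    '<a href="http://example.com/finance/3.html">Third</a>'
)
found = extract_links(sample_html, set())
```

Here `found` holds the first and third finance links only: the sports link does not match the pattern, and the repeated URL is skipped by the `seen` set.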

4. Test

if __name__ == '__main__':
    
    # xx site: xxx/
    # Usage: python crawl_finance_news.py 'xxx/'
    if len(sys.argv) == 2:
        crawl_finance_news(sys.argv[1])
        

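5. Notes on the code
1. First crawl the proxy IPs: python crawl_proxy_ip.py
2. Then crawl the financial news: python crawl_finance_news.py 'xxx/'
3. This is only a test run, so crawling stops once roughly 1,000 records have been collected
4. The data is saved to finance_data.csv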
6. Sample of the crawled content

Original article: https://blog.csdn.net/zengNLP/article/details/126647866