[Web Scraping] (1) fossies.org

Contents

Preface
Result
Observation
Analysis
Running
Afterword

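Preface

My graduation project is based on machine learning, so I need a large number of samples to train the model and evaluate the results. I therefore use scrapers, within what is legal and compliant, to collect the resources I need, and I am recording the process here.

The site scraped this time is https://fossies.org/windows/misc/

The complete code is posted in the Running section…

To be clear: this post is for learning purposes only. Please do not use it for anything else!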

Result

[Screenshot of the result; image not preserved]

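Observation

Opening the site shows a fairly plain index page.

Click any entry to look further: a click downloads the file directly, ~~so this is a comparatively simple site~~.

Next comes writing the script.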

Analysis

1. First, send a request and see whether it works:

import requests

url = 'https://fossies.org/windows/misc/'
print(requests.get(url).status_code)

# 200

2. Once the request succeeds, move on to the next step: press F12 in the browser to inspect the page elements and look for a pattern.

[Screenshot of the page elements in the browser developer tools; image not preserved]

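3. The elements can be located through the DOM, through XPath, ~~or with regular expressions~~; it is purely a matter of taste. First select every table row (tr):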

import requests
from lxml import etree

url = 'https://fossies.org/windows/misc/'
html = etree.HTML(requests.get(url).text)
trs = html.xpath('//*[@id="archlist"]/table/tr')

[<Element tr at 0x7fa3663fcf40>, <Element tr at 0x7fa3663fc480>, ..., <Element tr at 0x7fa365d73180>]

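Note that there is a nasty pitfall here: the element tree shown in the browser is not always identical to the HTML that requests receives. Browsers insert an implicit tbody into tables while parsing the page, and JS may modify the page further, so some elements seen in the developer tools (tbody, for example) are absent from the response and must be left out of the XPath.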
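One way to see which tags are really present in the raw source, as opposed to the tree shown in the developer tools, is to list them before writing the XPath. The following is only a minimal sketch: it uses the standard library's html.parser as a stand-in rather than the lxml code from this post, and both the RAW_HTML snippet and the TagCollector class are made up for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag that literally appears in the source."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Hypothetical snippet shaped like the listing table (not copied from the site)
RAW_HTML = (
    '<div id="archlist"><table>'
    '<tr><td><a href="md5.zip">md5.zip</a></td></tr>'
    '</table></div>'
)

collector = TagCollector()
collector.feed(RAW_HTML)
collector.close()

# This parser only reports tags written in the source, so no tbody shows up
has_tbody = 'tbody' in collector.tags
print(collector.tags)
print(has_tbody)
```

In practice the same check could be run on the text of the real response; if a tag from the developer tools is missing from the list, it should not appear in the XPath either.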

4. Next, read the attribute of the a tag inside each tr. The href and the file name are identical, so fetching just one of them is enough:

for tr in trs:
    href = tr.xpath('td/a')[0].get('href')
    print(href)  # the href doubles as the file name

WinSCP-5.21.3-Portable.zip
WinSCP-5.21.3-Setup.exe
WinSCP-5.21.3-Source.zip
neo4j-community-4.4.11-windows.zip
WiresharkPortable64_3.6.8.paf.exe
Wireshark-win64-3.6.8.exe

...

pdftk_free-2.02-win-setup.exe
FreeMat-4.2-Setup.exe
md5.zip
dia-setup-0.97.2-2-unsigned.exe
unz600dn.zip
unz600xn.exe
zip300xn.zip
zip300xn-x64.zip
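The loop in step 4 takes element [0] of every row, which raises an IndexError if a row has no link (a header row, for example). Whether the real page has such rows is an assumption here, not something checked in this post. The sketch below shows the guard on a hand-written, well-formed snippet, using the standard library's xml.etree.ElementTree as a stand-in for lxml; SNIPPET and its header row are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the listing table; the th row is an assumption
SNIPPET = """
<div id="archlist">
  <table>
    <tr><th>Archive</th></tr>
    <tr><td><a href="md5.zip">md5.zip</a></td></tr>
    <tr><td><a href="unz600dn.zip">unz600dn.zip</a></td></tr>
  </table>
</div>
""".strip()

root = ET.fromstring(SNIPPET)

hrefs = []
for tr in root.findall('./table/tr'):
    link = tr.find('td/a')  # None when the row has no td/a
    if link is None:
        continue            # skip rows without a download link
    hrefs.append(link.get('href'))

print(hrefs)
```

With lxml the equivalent guard would be to check that tr.xpath('td/a') is non-empty before indexing it.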

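5. As you can see, a file's address is simply the page URL as a prefix plus the href obtained above, so the download URL can be built by concatenation: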

down_urls = []

for tr in trs:
    href = tr.xpath('td/a')[0].get('href')
    down_urls.append(url+href)
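Plain concatenation works here because the page URL ends with a slash and each href is a bare file name. As a slightly more defensive alternative, the standard library can do the joining and the file-name extraction. This is a sketch only; build_download_url and filename_from_url are helper names invented for this example.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

BASE_URL = 'https://fossies.org/windows/misc/'


def build_download_url(base, href):
    """Resolve href against the page URL, as a browser would."""
    return urljoin(base, href)


def filename_from_url(url):
    """Take the last path segment, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


full_url = build_download_url(BASE_URL, 'md5.zip')
print(full_url)
print(filename_from_url(full_url))

# Without the trailing slash, the last path segment is replaced, not extended
print(build_download_url('https://fossies.org/windows/misc', 'md5.zip'))
```

The last line illustrates why the trailing slash in the base URL matters, both for urljoin and for simple string concatenation.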

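Running

You can add logging, a progress bar and so on yourself, or use coroutines, threads or processes to speed things up. The site is hosted overseas, so it is quite slow without a proxy.

The full code is as follows: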

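import os
import socket
import requests
import urllib.request
from lxml import etree

# Local proxy endpoints; adjust the ports to match your own proxy client
PROXY = '127.0.0.1:10809'  # used by urllib for the file downloads

# SOCKS5 proxy for requests (needs PySocks: pip install requests[socks])
proxies = {
    'http': 'socks5://127.0.0.1:10808',
    'https': 'socks5://127.0.0.1:10808'
}

headers = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
}

# Route urllib downloads through the proxy and give every socket a 20 s timeout
proxy = urllib.request.ProxyHandler({'https': PROXY})
open_proxy = urllib.request.build_opener(proxy)
urllib.request.install_opener(open_proxy)
socket.setdefaulttimeout(20)

def get_down_urls():
    """Return the download URL of every file listed on the index page."""
    down_urls = []
    url = 'https://fossies.org/windows/misc/'
    try:
        resp = requests.get(url, headers=headers, proxies=proxies, timeout=20)
        html = etree.HTML(resp.text)
        trs = html.xpath('//*[@id="archlist"]/table/tr')
        for tr in trs:
            links = tr.xpath('td/a')
            if not links:  # skip rows that contain no link
                continue
            down_urls.append(url + links[0].get('href'))
    except Exception as e:
        print(e)
    return down_urls  # always a list, so the caller can iterate safely

def download(url, name):
    """Save url into the current working directory under the given name."""
    try:
        filepath = os.path.join(os.getcwd(), name)
        urllib.request.urlretrieve(url, filepath)
    except Exception as e:
        print(e)

if __name__ == '__main__':
    urls = get_down_urls()
    for url in urls:
        name = url.split('/')[-1]  # the last path segment is the file name
        download(url, name)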
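The thread suggestion at the start of this section could look roughly like the sketch below. It is not part of the original script: download_all is a hypothetical helper, and fake_download is a stand-in used so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, download_one, max_workers=4):
    """Run download_one(url) for every URL in a thread pool.

    Returns (succeeded, failed): a list of URLs and a dict of url -> exception.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, u): u for u in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()       # re-raises whatever download_one raised
                succeeded.append(url)
            except Exception as exc:
                failed[url] = exc
    return succeeded, failed


def fake_download(url):
    """Offline stand-in for a real download; fails for one made-up URL."""
    if url.endswith('bad.zip'):
        raise IOError('simulated failure')


ok, bad = download_all(['a/md5.zip', 'a/bad.zip', 'a/zip300xn.zip'], fake_download)
print(sorted(ok))
print(sorted(bad))
```

To plug in the real download function, it would need a one-argument wrapper that derives the file name from the URL and lets exceptions propagate instead of printing them, so that failures end up in the failed dict. A small max_workers value also keeps the load on the server modest.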

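Afterword

This post is simply a record of the sites I scraped during my graduation project.

Once again: this post is for learning purposes only. Please do not use it for anything else!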


Original post: https://blog.csdn.net/weixin_46263782/article/details/126779311