A Complete Python Crawler for Grabbing Website Templates

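As a hobby I like tinkering with my personal website. I wasn't happy with my old blog template, and after seeing how good other people's sites look I wanted to pull one down for reference, for personal use only. Saving files one by one is tedious, and a ready-made tool would first have to be tracked down, so I decided to write it in Python, which is very well suited to crawling.

My previous blog was hosted on Huawei Cloud; the address is here: personal blog

Below is the complete implementation for grabbing a website template; I have tested it myself. (Note: for hobbyist research only; do not use it for any illegal purpose.)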

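Environment setup

I use a 64-bit Python 3 environment, so the third-party libraries below need to be installed first.

BeautifulSoup (bs4 for short) is a widely used scraping library that extracts data from HTML or XML files and makes it easy to read DOM tags and their attribute values.

lxml is a fast HTML/XML parser for Python; its main job is parsing HTML and XML and extracting data from them.

urllib is part of the Python standard library, so it does not need to be installed. It issues network requests. In general requests is recommended instead; it offers a higher-level interface and is built on urllib3. Note the differences between Python 2 and Python 3: Python 2 has no urllib.request, so a Python 2 call such as urllib2.urlopen() must be changed to urllib.request.urlopen() in Python 3.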
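As a small illustration of the Python 2 / Python 3 difference described above, here is a minimal sketch of an import that works on both. It only covers urlopen and is not part of the original script.

```python
# Import urlopen from whichever module the running interpreter provides.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

# urlopen can now be called the same way on either version,
# e.g. urlopen('http://www.baidu.com').read()
```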

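Installing the libraries

Because the default package index is often slow or unreachable from mainland China, you need to switch to a mirror for downloads to succeed. For Python 3, install with pip or pip3; pip2 is reserved for Python 2 when both versions are installed on the same machine.

Use a mirror for a single install:


$ pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple some-package  # Tsinghua mirror
$ pip3 install -i http://pypi.douban.com/simple --trusted-host pypi.douban.com some-package  # Douban mirror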

Upgrade pip itself through a domestic mirror:

$ pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pip -U

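If passing the mirror on every command is inconvenient, you can set it permanently in pip's configuration file.

On Linux the file lives at:


$HOME/.config/pip/pip.conf
# or
$HOME/.pip/pip.conf

On Windows the file lives at:


%APPDATA%\pip\pip.ini
# or
%HOME%\pip\pip.ini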
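The article does not show what goes into that file. As a rough sketch, pip reads the mirror from an index-url key in a [global] section; the snippet below renders that content with Python's configparser so it can be pasted into pip.conf or pip.ini. The helper name render_pip_config is made up for this illustration.

```python
import configparser
import io


def render_pip_config(index_url):
    """Return the text of a minimal pip config that points at index_url."""
    config = configparser.ConfigParser()
    config['global'] = {'index-url': index_url}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Prints a [global] section with the Tsinghua mirror as index-url
text = render_pip_config('https://pypi.tuna.tsinghua.edu.cn/simple')
print(text)
```

Save the printed text to one of the locations listed above; pip then uses the mirror without the -i option.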

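How it works

Start by analyzing the page. The principle is simple and works just like visiting the site in a browser: if you can open a page, you can view its source and find the links, JS scripts and CSS files it references. A template is nothing more than the relevant CSS and JS files plus the HTML pages. So the approach is to crawl the page, find its script, link and a href tags, extract the URLs and save them to a file, then deduplicate them and call urlretrieve() to download each remote resource to local disk. To download a page or a file, you only need to call urlretrieve() with the right arguments.

urlretrieve(url, filename=None, reporthook=None, data=None)

For example, to save the Baidu home page: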


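#!/usr/bin/env python
# coding=utf-8
import os
from urllib.request import urlretrieve


def cbk(a, b, c):
    '''Progress callback.
    @a: number of data blocks transferred so far
    @b: size of each data block, in bytes
    @c: total size of the remote file (-1 if the server does not report it)
    '''
    if c <= 0:
        return  # size unknown, no percentage to show
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print('%.2f%%' % per)


url = 'http://www.baidu.com'
work_dir = os.path.abspath('.')
work_path = os.path.join(work_dir, 'baidu.html')
urlretrieve(url, work_path, cbk)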
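The extraction step described at the start of this section (find the script, link and a tags, collect their URLs, deduplicate) can be sketched as follows. This is only an illustration under a few assumptions: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages, it resolves relative links with urljoin rather than string concatenation, and the names AssetLinkParser and extract_links as well as the example.com URLs are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetLinkParser(HTMLParser):
    """Collect href/src values from <link>, <a> and <script> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Which attribute carries the URL depends on the tag
        attr_name = {'link': 'href', 'a': 'href', 'script': 'src'}.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:  # inline <script> blocks have no src and are skipped
            self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = AssetLinkParser(base_url)
    parser.feed(html)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(parser.links))


sample = '''<html><head>
<link rel="stylesheet" href="/css/style.css">
<script src="js/app.js"></script>
<script>var inline = 1;</script>
</head><body>
<a href="about/">About</a>
<a href="/css/style.css">duplicate</a>
</body></html>'''

links = extract_links(sample, 'http://example.com/blog/')
for link in links:
    print(link)
```

Each URL in the resulting list could then be passed to urlretrieve() as in the Baidu example.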

Full source code


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# by yangyongzhen
# 2016-12-06
 
from bs4 import BeautifulSoup
import urllib, urllib.request, os, time
import re
import lxml
 
rootpath = os.getcwd() + '/grabbed_templates/'
 
 
def makedir(path):
    if not os.path.isdir(path):
        os.makedirs(path)
 
 
# Create the root directory for the grabbed files
makedir(rootpath)
 
 
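# Show download progress
def Schedule(a, b, c):
    '''
    a: number of data blocks transferred so far
    b: size of each data block, in bytes
    c: total size of the remote file
    '''
    if c <= 0:
        return  # size unknown, no percentage to show
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print('%.2f%%' % per)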
 
 
def grabHref(url, listhref, localfile):
    html = urllib.request.urlopen(url).read()
    # Assumes the page is GB2312-encoded; change the codec for other sites
    html = str(html, 'gb2312', 'ignore').encode('utf-8', 'ignore')

    # <link> tags: stylesheets, icons and similar resources
    content = BeautifulSoup(html, features="lxml").findAll('link')
    myfile = open(localfile, 'w', encoding='utf-8')
    pat = re.compile(r'href="([^"]*)"')
    pat2 = re.compile(r'http')
    for item in content:
        h = pat.search(str(item))
        if not h:
            continue  # tag has no href attribute
        href = h.group(1)
        if pat2.search(href):
            ans = href
        else:
            ans = url + href
        if url not in ans:
            continue  # skip off-site links
        if ans.endswith('/'):
            ans += 'index.html'
        listhref.append(ans)
        myfile.write(ans)
        myfile.write('\r\n')
        print(ans)

    # <script> tags: external JS files
    content = BeautifulSoup(html, features="lxml").findAll('script')
    pat = re.compile(r'src="([^"]*)"')
    pat2 = re.compile(r'http')
    for item in content:
        h = pat.search(str(item))
        if not h:
            continue  # inline script without a src attribute
        href = h.group(1)
        if pat2.search(href):
            ans = href
        else:
            ans = url + href
        listhref.append(ans)
        myfile.write(ans)
        myfile.write('\r\n')
        print(ans)

    # <a> tags: links to other pages of the site
    content = BeautifulSoup(html, features="lxml").findAll('a')
    pat = re.compile(r'href="([^"]*)"')
    pat2 = re.compile(r'http')
    for item in content:
        h = pat.search(str(item))
        if not h:
            continue  # anchor without an href attribute
        href = h.group(1)
        if pat2.search(href):
            ans = href
        else:
            ans = url + href
        if url not in ans:
            continue  # skip off-site links
        if ans.endswith('/'):
            ans += 'index.html'
        listhref.append(ans)
        myfile.write(ans)
        myfile.write('\r\n')
        print(ans)

    myfile.close()
 
 
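def _progress(block_num, block_size, total_size):
    '''Alternative progress callback that rewrites a single console line.
       @block_num: number of data blocks transferred so far
       @block_size: size of each data block, in bytes
       @total_size: total size of the remote file
    '''
    if total_size <= 0:
        return  # size unknown, no percentage to show
    percent = float(block_num * block_size) / float(total_size) * 100.0
    print('\r>> Downloading %.1f%%' % percent, end='', flush=True)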
 
 
def main():
    url = "http://www.helongx.com/"  # address of the page to grab
    listhref = []  # collected links
    localfile = 'ahref.txt'  # local file that stores the collected links
    grabHref(url, listhref, localfile)
    listhref = list(set(listhref))  # remove duplicate links

    start = time.perf_counter()
    for item in listhref:
        curpath = rootpath
        name = item.split('/')[-1]
        print('name:' + name)
        # Skip empty names, bare host names and URLs with query strings
        if len(name) == 0:
            continue
        if 'www' in name:
            continue
        if '?' in name:
            continue
        # Rebuild the remote directory structure under rootpath
        fdir = item.split('/')[3:-1]
        for i in fdir:
            curpath += i
            curpath += '/'
        print(curpath)
        makedir(curpath)
        local = curpath + name
        print(local)
        try:
            urllib.request.urlretrieve(item, local, Schedule)  # download to the local file
        except Exception as e:
            print(e)

    end = time.perf_counter()
    print('Template grab finished!')
    print('Total time:', end - start, 'seconds')


if __name__ == "__main__":
    main()
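The loop in main() derives the local path by splitting the URL on '/'. A hedged alternative is to let urllib.parse do the splitting, which also drops query strings and handles a bare host name. The helper url_to_local_path below is hypothetical and not part of the original script.

```python
import os
from urllib.parse import urlparse


def url_to_local_path(url, root):
    """Map a URL to a file path under root, mirroring the remote folders."""
    path = urlparse(url).path  # query string and fragment are dropped here
    if path == '' or path.endswith('/'):
        path += 'index.html'  # assumed default page name
    # Ignore empty segments and '.'/'..' so the result stays inside root
    parts = [p for p in path.split('/') if p and p not in ('.', '..')]
    return os.path.join(root, *parts)


print(url_to_local_path('http://example.com/css/style.css?v=3', 'out'))
print(url_to_local_path('http://example.com/blog/', 'out'))
```

Before downloading, the parent folder can be created with os.makedirs(os.path.dirname(local), exist_ok=True).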

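Notes

For each site you need to look at the page source and work out the pattern of its links. For example, some sites' home page is just www.xxx.xxx with no index.html suffix, or the suffix is something else such as index.aspx or index.php. You can modify the script to add special handling, for example automatically appending the home page name and grabbing only content from the same site: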


    for item in content:
        h = pat.search(str(item))
        if not h:
            continue  # tag has no href attribute
        href = h.group(1)
        if pat2.search(href):
            ans = href
        else:
            ans = url + href
        # Do not grab links that point off-site
        if url not in ans:
            continue
        # Append the home page file name
        if ans.endswith('/'):
            ans += 'index.html'
        listhref.append(ans)
        myfile.write(ans)
        myfile.write('\r\n')
        print(ans)

    content = BeautifulSoup(html, features="lxml").findAll('script')
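The same two rules (same-site only, append a home page name) can also be written as a small standalone helper. This is a sketch rather than the script's actual code: it compares host names with urlparse instead of using a substring test, resolves relative links with urljoin, and takes the index file name as a parameter so that index.php or index.aspx can be used. The name normalize_link is made up for this example.

```python
from urllib.parse import urljoin, urlparse


def normalize_link(base_url, href, index_name='index.html'):
    """Return an absolute same-site URL, or None for off-site links."""
    absolute = urljoin(base_url, href)
    # Compare host names rather than checking whether base_url is a substring
    if urlparse(absolute).netloc != urlparse(base_url).netloc:
        return None
    # Directory-style URLs get the site's home page file name appended
    if absolute.endswith('/'):
        absolute += index_name
    return absolute


print(normalize_link('http://example.com/', '/blog/'))
print(normalize_link('http://example.com/', 'http://cdn.example.org/x.js'))
```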

Original article: https://blog.csdn.net/qq8864/article/details/127107263