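I once needed to split a large archive into pieces for transfer and then restore it on the other side. I found a useful Python program for the job.
This package compresses and splits large files. Starting from a root directory, it walks all subdirectories and scans every file. Any file whose size exceeds the threshold size is compressed and split into multiple archive files, each no larger than the partition size. Compression/splitting works for any file extension.
Example:
Given the directory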
- $ tree --du -h ~/MyFolder
-
- └── [415M] My Datasets
- │   ├── [6.3K] Readme.txt
- │   └── [415M] Data on Leaf-Tailed Gecko
- │       ├── [ 35M] DatasetA.zip
- │       ├── [ 90M] DatasetB.zip
- │       ├── [130M] DatasetC.zip
- │       └── [160M] Books
- │           ├── [ 15M] RegularBook.pdf
- │           └── [145M] BookWithPictures.pdf
- └── [818M] Video Conference Meetings
-     ├── [817M] Discussion_on_Fermi_Paradox.mp4
-     └── [1.1M] Notes_on_Discussion.pdf
running
$ python3 src/main.py --root_dir ~/MyFolder
turns the directory into
- $ tree --du -h ~/MyFolder
-
- └── [371M] My Datasets
- │   ├── [6.3K] Readme.txt
- │   └── [371M] Data on Leaf-Tailed Gecko
- │       ├── [ 35M] DatasetA.zip
- │       ├── [ 90M] DatasetB.zip
- │       ├── [ 95M] DatasetC.zip.7z.001
- │       ├── [ 18M] DatasetC.zip.7z.002
- │       └── [133M] Books
- │           ├── [ 15M] RegularBook.pdf
- │           ├── [ 95M] BookWithPictures.pdf.7z.001
- │           └── [ 23M] BookWithPictures.pdf.7z.002
- └── [794M] Video Conference Meetings
-     ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.001
-     ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.002
-     ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.003
-     ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.004
-     ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.005
-     ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.006
-     ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.007
-     ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.008
-     ├── [ 33M] Discussion_on_Fermi_Paradox.mp4.7z.009
-     └── [1.1M] Notes_on_Discussion.pdf
Running
$ python3 src/reverse.py --root_dir ~/MyFolder
restores the original files.
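Before pointing the tool at a real directory, it can help to preview which files would be touched. The following is only a sketch of such a dry run, not part of the original package: the function name `find_large_files` and the byte-based threshold parameter are my own assumptions. It walks the tree and compares sizes, and never calls 7z.

```python
import os
import tempfile


def find_large_files(root_dir, threshold_bytes):
    """Return sorted (path, size) pairs for files of at least threshold_bytes."""
    hits = []
    for root, _, files in os.walk(root_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.stat(path).st_size
            if size >= threshold_bytes:
                hits.append((path, size))
    return sorted(hits)


if __name__ == "__main__":
    # Small self-contained demo: one 2 KiB file and one tiny file, 1 KiB threshold
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "big.bin"), "wb") as fh:
            fh.write(b"\0" * 2048)
        with open(os.path.join(tmp, "small.txt"), "wb") as fh:
            fh.write(b"hi")
        for path, size in find_large_files(tmp, 1024):
            print(size, path)
```

For the defaults described in this post, the threshold to pass would be 100 * 10**6 bytes.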
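Prerequisites: Python 3.x.x installed locally, plus 7z available on PATH (both scripts check for it and abort if it is missing).
Although src/main.py traverses the directories serially, the compression/splitting of each file by 7z is parallelized by default.
Reversing with src/reverse.py is entirely serial.
The code for splitting large files, main.py, is as follows: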
- import sys  # used to exit the program
- import os  # file and directory operations
- import shutil  # used to locate the 7z executable
- import subprocess  # used to run the 7z command
- import argparse  # command-line argument parsing
-
-
- def str2bool(value):
-     # argparse's type=bool treats any non-empty string (even "False") as True,
-     # so parse the text explicitly instead
-     if value.lower() in ("true", "t", "yes", "y", "1"):
-         return True
-     if value.lower() in ("false", "f", "no", "n", "0"):
-         return False
-     raise argparse.ArgumentTypeError("Expected a boolean value, got %r" % value)
-
-
- def parse_arguments():
-     # Parse command-line arguments
-     parser = argparse.ArgumentParser(description='GitHub-ForceLargeFiles')
-
-     parser.add_argument('--root_dir', type=str, default=os.getcwd(),
-                         help="Root directory to start traversing. Defaults to current working directory.")
-     parser.add_argument('--delete_original', type=str2bool, default=True,
-                         help="Do you want to delete the original (large) file after compressing to archives?")
-     parser.add_argument('--partition_ext', type=str, default="7z", choices=["7z", "xz", "bzip2", "gzip", "tar", "zip", "wim"],
-                         help="Extension of the partitions. Recommended: 7z due to compression ratio and inter-OS compatibility.")
-     parser.add_argument('--cmds_into_7z', type=str, default="a",
-                         help="Commands to pass in to 7z.")
-     parser.add_argument('--threshold_size', type=int, default=100,
-                         help="Max threshold of the original file size to split into archive. I.e. files with sizes below this arg are ignored.")
-     parser.add_argument('--threshold_size_unit', type=str, default='m', choices=['b', 'k', 'm', 'g'],
-                         help="Unit of the threshold size specified (bytes, kilobytes, megabytes, gigabytes).")
-     parser.add_argument('--partition_size', type=int, default=95,
-                         help="Max size of an individual archive. May result in actual partition size to be higher than this value due to disk formatting. In that case, reduce this arg value.")
-     parser.add_argument('--partition_size_unit', type=str, default='m', choices=['b', 'k', 'm', 'g'],
-                         help="Unit of the partition size specified (bytes, kilobytes, megabytes, gigabytes).")
-
-     args = parser.parse_args()
-     return args
-
-
- def check_7z_install():
-     # Abort if 7z is not installed (not found on PATH)
-     if shutil.which("7z"):
-         return True
-     else:
-         sys.exit("ABORTED. You do not have 7z properly installed at this time. Make sure it is added to PATH.")
-
-
- def is_over_threshold(f_full_dir, args):
-     # Check whether the file size reaches the threshold (decimal units: 1 m = 10^6 bytes)
-     size_dict = {
-         "b": 1e-0,
-         "k": 1e-3,
-         "m": 1e-6,
-         "g": 1e-9
-     }
-     return os.stat(f_full_dir).st_size * size_dict[args.threshold_size_unit] >= args.threshold_size
-
-
- def traverse_root_dir(args):
-     # Walk the root directory and compress/split every file over the threshold
-     for root, _, files in os.walk(args.root_dir):
-         for f in files:
-             f_full_dir = os.path.join(root, f)
-
-             if is_over_threshold(f_full_dir, args):
-                 # The archive name is the full original name plus the partition extension
-                 # (this also avoids a double dot for files without an extension);
-                 # 7z appends .001, .002, ... to each volume
-                 prc = subprocess.run(["7z", "-v" + str(args.partition_size) + args.partition_size_unit, args.cmds_into_7z,
-                                       f_full_dir + "." + args.partition_ext, f_full_dir])
-
-                 # Delete the original only if 7z reported success
-                 if args.delete_original and prc.returncode == 0:
-                     os.remove(f_full_dir)
-
-
- if __name__ == '__main__':
-     check_7z_install()  # make sure 7z is installed
-     traverse_root_dir(parse_arguments())  # compress and split the files
-
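This code walks all subdirectories starting from root_dir and compresses every file of 100 MB or more into smaller archive files of at most about 95 MB each. By default the original (large) file is deleted after compression, but this option can be turned off.
Execution log: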
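One detail worth spelling out is that the two sizes are not measured in the same units. The threshold in main.py uses decimal factors (1e-3, 1e-6, 1e-9), while the volume size is handed to 7z's -v switch, which the execution log in this post shows as binary units (one 95m volume is 99614720 bytes, that is 95 MiB). The sketch below just does that arithmetic; the helper names are hypothetical and not part of the package.

```python
# Decimal multipliers matching the factors in is_over_threshold (unit -> bytes)
THRESHOLD_MULTIPLIERS = {"b": 1, "k": 10**3, "m": 10**6, "g": 10**9}
# Binary multipliers, as observed for 7z's -v switch in the log (unit -> bytes)
VOLUME_MULTIPLIERS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def threshold_in_bytes(size, unit):
    """Smallest file size, in bytes, that main.py would split."""
    return size * THRESHOLD_MULTIPLIERS[unit]


def volume_in_bytes(size, unit):
    """Maximum size, in bytes, of one 7z volume."""
    return size * VOLUME_MULTIPLIERS[unit]


print(threshold_in_bytes(100, "m"))  # 100000000
print(volume_in_bytes(95, "m"))      # 99614720
```

With the defaults, a full volume stays just under the threshold, so running main.py a second time over the same directory should leave the generated volumes alone.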
- D:\tmp\git_di>python main.py --root_dir "D:\tmp\git_di"
-
- 7-Zip 23.01 (x64) : Copyright (c) 1999-2023 Igor Pavlov : 2023-06-20
-
- Scanning the drive:
- 1 file, 3329165073 bytes (3175 MiB)
-
- Creating archive: D:\tmp\git_di\testfile.zip.7z
-
- Add new data to archive: 1 file, 3329165073 bytes (3175 MiB)
-
-
- Files read from disk: 1
- Archive size: 3304152719 bytes (3152 MiB)
- Volumes: 34
- Everything is Ok
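As the log shows, multiple archive volumes (testfile.zip.7z.001, testfile.zip.7z.002, ...) were generated in the current directory.
The code for restoring large files, reverse.py, is as follows: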
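The volume count in the log can be cross-checked by hand: the archive is 3304152719 bytes and each volume holds at most 95 MiB. A small sketch of that calculation follows (the function name `expected_volumes` is made up for this illustration, and it assumes a positive archive size):

```python
def expected_volumes(archive_bytes, volume_bytes):
    """Return (number of volumes, size of the last volume in bytes)."""
    full, remainder = divmod(archive_bytes, volume_bytes)
    if remainder == 0:
        return full, volume_bytes
    return full + 1, remainder


# Archive size from the log above, split into 95 MiB volumes
print(expected_volumes(3304152719, 95 * 1024**2))  # (34, 16866959)
```

This matches the "Volumes: 34" line reported by 7z: 33 full volumes plus a final one of roughly 16 MiB.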
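- import sys  # used to exit the program
- import os  # file and directory operations
- import glob  # used to find all volumes of an archive
- import shutil  # used to locate the 7z executable
- import subprocess  # used to run the 7z command
- import argparse  # command-line argument parsing
-
-
- def str2bool(value):
-     # argparse's type=bool treats any non-empty string (even "False") as True,
-     # so parse the text explicitly instead
-     if value.lower() in ("true", "t", "yes", "y", "1"):
-         return True
-     if value.lower() in ("false", "f", "no", "n", "0"):
-         return False
-     raise argparse.ArgumentTypeError("Expected a boolean value, got %r" % value)
-
-
- def parse_arguments():
-     # Parse command-line arguments
-     parser = argparse.ArgumentParser(description='GitHub-ForceLargeFiles_reverse')
-
-     parser.add_argument('--root_dir', type=str, default=os.getcwd(),
-                         help="Root directory to start traversing. Defaults to current working directory.")
-     parser.add_argument('--delete_partitions', type=str2bool, default=True,
-                         help="Do you want to delete the partition archives after extracting the original files?")
-
-     args = parser.parse_args()
-     return args
-
-
- def check_7z_install():
-     # Abort if 7z is not installed (not found on PATH)
-     if shutil.which("7z"):
-         return True
-     else:
-         sys.exit("ABORTED. You do not have 7z properly installed at this time. Make sure it is added to PATH.")
-
-
- def is_partition(f_full_dir):
-     # Check whether the file is the first volume of a split archive
-     return any(f_full_dir.endswith(ext) for ext in
-                [".7z.001", ".xz.001", ".bzip2.001", ".gzip.001", ".tar.001", ".zip.001", ".wim.001"])
-
-
- def reverse_root_dir(args):
-     # Walk the root directory and extract every split archive found
-     for root, _, files in os.walk(args.root_dir):
-         for f in files:
-             f_full_dir = os.path.join(root, f)
-             if is_partition(f_full_dir):
-                 # Extract with 7z, starting from the first volume
-                 prc = subprocess.run(["7z", "e", f_full_dir, "-o" + root])
-                 if args.delete_partitions and prc.returncode == 0:
-                     # Strip the ".001" suffix, then delete every volume of this archive.
-                     # glob + os.remove is portable; a shell "rm" call fails in the Windows command prompt
-                     f_noext, _ = os.path.splitext(f)
-                     for part in glob.glob(os.path.join(glob.escape(root), glob.escape(f_noext) + "*")):
-                         os.remove(part)
-
-
- if __name__ == '__main__':
-     check_7z_install()  # make sure 7z is installed
-     reverse_root_dir(parse_arguments())  # extract the split archives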
-
Test
Place the archive volumes (testfile.zip.7z.001, testfile.zip.7z.002, ...) in the directory D:\tmp\git_di, with reverse.py in the same directory.
Execution log:
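Since the whole point is to move the volumes somewhere else, it seems worth checking that none of them got lost on the way before extracting. Here is a rough sketch of such a check. It is not part of the original package; the function `missing_volumes` and its `expected_count` parameter are assumptions of mine. Without `expected_count` it can only detect gaps in the numbering, not missing trailing volumes.

```python
import os
import re

# Matches names such as "testfile.zip.7z.001" (base name + numeric volume suffix)
VOLUME_RE = re.compile(r"^(?P<base>.+)\.(?P<index>\d{3,})$")


def missing_volumes(directory, expected_count=None):
    """Map each archive base name to the sorted list of missing volume numbers."""
    found = {}
    for name in os.listdir(directory):
        match = VOLUME_RE.match(name)
        if match:
            found.setdefault(match.group("base"), set()).add(int(match.group("index")))

    gaps = {}
    for base, indices in found.items():
        last = max(indices)
        if expected_count is not None:
            last = max(last, expected_count)
        missing = sorted(set(range(1, last + 1)) - indices)
        if missing:
            gaps[base] = missing
    return gaps
```

For the test below, `missing_volumes(r"D:\tmp\git_di", expected_count=34)` should return an empty dict if all 34 volumes reported by 7z during splitting arrived.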
- D:\tmp\git_di>python reverse.py --root_dir "D:\tmp\git_di"
-
- 7-Zip 23.01 (x64) : Copyright (c) 1999-2023 Igor Pavlov : 2023-06-20
-
- Scanning the drive for archives:
- 1 file, 99614720 bytes (95 MiB)
-
- Extracting archive: D:\tmp\git_di\testfile.zip.7z.001
- --
- Path = D:\tmp\git_di\testfile.zip.7z.001
- Type = Split
- Physical Size = 99614720
- Volumes = 34
- Total Physical Size = 3304152719
- ----
- Path = testfile.zip.7z
- Size = 3304152719
- --
- Path = testfile.zip.7z
- Type = 7z
- Physical Size = 3304152719
- Headers Size = 162
- Method = LZMA2:24
- Solid = -
- Blocks = 1
-
- Everything is Ok
-
- Size: 3329165073
- Compressed: 3304152719
- 'rm' is not recognized as an internal or external command,
- operable program or batch file.
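The extraction succeeded and the file testfile.zip was regenerated. The trailing 'rm' message comes from the cleanup step of this run, which shelled out to rm; that command does not exist in the Windows command prompt, so the partition files were not deleted.
Reference (GitHub link):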
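To be sure the restored file is byte-for-byte identical to the one that was split, a checksum can be taken before splitting and again after restoring. The helper below is only a sketch, not something the original scripts do, and `sha256_of` is a name I chose. It reads the file in chunks so that a multi-gigabyte file such as testfile.zip does not have to fit in memory.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b""
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Record `sha256_of("testfile.zip")` on the sending side, run it again on the receiving side after reverse.py has finished, and compare the two strings.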
https://github.com/sisl/GitHub-ForceLargeFiles
over.