Chapter 6: Understanding glob.glob and os.walk (Utilities)

glob.glob

Introduction to the glob module

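glob is part of the Python standard library, so it is available wherever Python is installed. It is mainly used to find directories and files, matching paths with three wildcards: *, ?, and [].

*: matches zero or more characters
?: matches exactly one character
[]: matches one character from the given set or range, e.g. [0-9] matches a digit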
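
The three wildcards can be tried without touching the file system. The sketch below uses the standard fnmatch module, which implements the same shell-style wildcards; the file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only to illustrate the wildcards.
names = ["a1.txt", "a2.txt", "a3.py", "ab.gif", "1.gif", "card.gif"]

star = [n for n in names if fnmatchcase(n, "*.txt")]         # * : zero or more characters
question = [n for n in names if fnmatchcase(n, "?.gif")]     # ? : exactly one character
bracket = [n for n in names if fnmatchcase(n, "[0-9].gif")]  # []: one character from a range

print(star)      # ['a1.txt', 'a2.txt']
print(question)  # ['1.gif']
print(bracket)   # ['1.gif']
```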

The official documentation describes the module as "Unix style pathname pattern expansion":

glob module documentation: https://docs.python.org/3/library/glob.html

Using the glob module

Listing the module's attributes and functions

>>> import glob
>>> dir(glob)
['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', 
'__name__', '__package__', '__spec__', '_glob0', '_glob1', '_glob2', '_iglob', 
'_ishidden', '_isrecursive', '_iterdir', '_rlistdir', 'escape', 'fnmatch', 
'glob', 'glob0', 'glob1', 'has_magic', 'iglob', 'magic_check', 
'magic_check_bytes', 'os', 're']
>>>

The two most commonly used functions in the glob module are glob.glob() and glob.iglob(). glob.glob() is described in detail below.
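
The difference between the two functions is only in how results are delivered. A minimal sketch, assuming a throwaway directory with made-up file names:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files created only for this demonstration.
    for name in ("a1.txt", "a2.txt", "a3.py"):
        open(os.path.join(tmp, name), "w").close()

    pattern = os.path.join(tmp, "*.txt")
    as_list = glob.glob(pattern)    # a list, built all at once
    as_iter = glob.iglob(pattern)   # an iterator that yields paths one by one

    is_list = isinstance(as_list, list)
    is_lazy = not isinstance(as_iter, list)
    same_paths = sorted(as_list) == sorted(as_iter)

print(is_list, is_lazy, same_paths)  # True True True
```

iglob is the better fit when a directory holds many files and the paths are processed one at a time.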

Using glob.glob(pathname, *, recursive=False)

Definition of glob.glob():

def glob(pathname, *, recursive=False):
    """Return a list of paths matching a pathname pattern.
    The pattern may contain simple shell-style wildcards a la
    fnmatch. However, unlike fnmatch, filenames starting with a
    dot are special cases that are not matched by '*' and '?'
    patterns.
    If recursive is true, the pattern '**' will match any files and
    zero or more directories and subdirectories.
    """
    return list(iglob(pathname, recursive=recursive))
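
The docstring's remark about file names starting with a dot can be checked directly. A small sketch with two made-up files in a temporary directory:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files: one regular name, one starting with a dot.
    for name in ("visible.txt", ".hidden.txt"):
        open(os.path.join(tmp, name), "w").close()

    star_matches = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(tmp, "*"))
    )
    dot_matches = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(tmp, ".*"))
    )

print(star_matches)  # ['visible.txt']
print(dot_matches)   # ['.hidden.txt']
```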

Parameters and return value of glob.glob()

```
def glob(pathname, *, recursive=False):
```
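- pathname: the path pattern to match
- recursive: if True, the pattern ** matches files and directories at any depth; the default is False
Returns a list of the matching paths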

glob.glob() examples

The directory structure used for testing:

test_dir/
├── a1.txt
├── a2.txt
├── a3.py
├── sub_dir1
│   ├── b1.txt
│   ├── b2.py
│   └── b3.py
└── sub_dir2
    ├── c1.txt
    ├── c2.py
    └── c3.txt
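
To follow along, the tree above can be recreated with a few lines. This is a sketch: the helper name make_test_dir is made up, and it builds the tree in a temporary directory rather than the current one.

```python
import os
import tempfile

def make_test_dir(base):
    """Create the test_dir tree shown above under `base` and return its path."""
    layout = {
        "": ["a1.txt", "a2.txt", "a3.py"],
        "sub_dir1": ["b1.txt", "b2.py", "b3.py"],
        "sub_dir2": ["c1.txt", "c2.py", "c3.txt"],
    }
    root = os.path.join(base, "test_dir")
    for sub, files in layout.items():
        folder = os.path.join(root, sub)
        os.makedirs(folder, exist_ok=True)
        for name in files:
            open(os.path.join(folder, name), "w").close()  # empty placeholder file
    return root

with tempfile.TemporaryDirectory() as tmp:
    root = make_test_dir(tmp)
    top_level = sorted(os.listdir(root))

print(top_level)
# ['a1.txt', 'a2.txt', 'a3.py', 'sub_dir1', 'sub_dir2']
```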

1. Return the directory path itself as a list

>>> path_list1 = glob.glob('./test_dir/')
>>> path_list1
['./test_dir/']

2. Match all directories and files directly under './test_dir/' and return the list of paths

>>> path_list2 = glob.glob('./test_dir/*')
>>> path_list2
['./test_dir/a3.py', './test_dir/a2.txt', './test_dir/sub_dir1', './test_dir/sub_dir2', './test_dir/a1.txt']

3. Match .py files under ./test_dir/ without recursion (the pattern */*.py matches exactly one directory level down)

>>> path_list3 = glob.glob('./test_dir/*.py')
>>> path_list3
['./test_dir/a3.py']
>>> path_list4 = glob.glob('./test_dir/*/*.py')
>>> path_list4
['./test_dir/sub_dir1/b2.py', './test_dir/sub_dir1/b3.py', './test_dir/sub_dir2/c2.py']

4. Recursively match all directories and files under ./test_dir/** and return the list of paths

>>> path_list5 = glob.glob('./test_dir/**', recursive=True)
>>> path_list5
['./test_dir/', './test_dir/a3.py', './test_dir/a2.txt', './test_dir/sub_dir1', './test_dir/sub_dir1/b2.py', './test_dir/sub_dir1/b3.py', './test_dir/sub_dir1/b1.txt', './test_dir/sub_dir2', './test_dir/sub_dir2/c3.txt', './test_dir/sub_dir2/c1.txt', './test_dir/sub_dir2/c2.py', './test_dir/a1.txt']
>>> path_list6 = glob.glob('./test_dir/**/*.py', recursive=True)
>>> path_list6
['./test_dir/a3.py', './test_dir/sub_dir1/b2.py', './test_dir/sub_dir1/b3.py', './test_dir/sub_dir2/c2.py']

Note:

To recurse into a path, the pattern must include two asterisks (**). Passing recursive=True alone does not recurse:

>>> path_list = glob.glob('./test_dir/', recursive=True)
>>> path_list
['./test_dir/']
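
The interaction between ** and recursive can be seen side by side. A sketch on a small made-up tree; the helper found is hypothetical and only normalizes the results for printing:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical tree: tmp/a3.py and tmp/sub_dir1/b2.py
    os.makedirs(os.path.join(tmp, "sub_dir1"))
    for rel in ("a3.py", os.path.join("sub_dir1", "b2.py")):
        open(os.path.join(tmp, rel), "w").close()

    def found(pattern, recursive):
        """Return matches for `pattern`, relative to tmp, with '/' separators."""
        paths = glob.glob(os.path.join(tmp, pattern), recursive=recursive)
        return sorted(os.path.relpath(p, tmp).replace(os.sep, "/") for p in paths)

    without_stars = found("*.py", recursive=True)      # flag alone: no recursion
    with_stars = found("**/*.py", recursive=True)      # ** plus flag: recursion
    stars_no_flag = found("**/*.py", recursive=False)  # ** without flag acts like *

print(without_stars)  # ['a3.py']
print(with_stars)     # ['a3.py', 'sub_dir1/b2.py']
print(stars_no_flag)  # ['sub_dir1/b2.py']
```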

More examples with the wildcards `*`, `?`, and `[]`

>>> import glob
>>> glob.glob('./[0-9].*')
['./1.gif', './2.txt']
>>> glob.glob('*.gif')
['1.gif', 'card.gif']
>>> glob.glob('?.gif')
['1.gif']
>>> glob.glob('**/*.txt', recursive=True)
['2.txt', 'sub/3.txt']
>>> glob.glob('./**/', recursive=True)
['./', './sub/']

os.walk

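Introduction

Python code often needs to work with files, which means first obtaining their paths. The built-in os module makes this easy. This section shows how to use os.walk to traverse a folder and all of its subfolders and obtain the path of every file.

The full signature of os.walk is: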

os.walk(top, topdown=True, onerror=None, followlinks=False)

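Parameters:

top: the path of the directory to traverse.
topdown: if true, the top directory is visited before its subdirectories; otherwise the subdirectories are visited first (true by default).
onerror: a callable that is invoked with the OSError when walk fails to list a directory.
followlinks: if true, walk also descends into directories that symbolic links point to (off by default).

Of these four parameters, usually only the first (the folder path) needs to be given; the rest can be left at their defaults.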
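
The effect of topdown and onerror can be observed on a tiny tree. A minimal sketch; the folder names and the missing path are made up:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "sub"))  # hypothetical single subfolder

    # topdown=True (default): a directory is reported before its subdirectories.
    order_down = [os.path.basename(root) for root, dirs, files in os.walk(top)]
    # topdown=False: subdirectories are reported first.
    order_up = [
        os.path.basename(root)
        for root, dirs, files in os.walk(top, topdown=False)
    ]
    top_first = order_down[-1] == "sub"
    sub_first = order_up[0] == "sub"

    # onerror receives the OSError raised while listing a directory.
    errors = []
    missing = os.path.join(top, "does_not_exist")  # hypothetical missing path
    visited = list(os.walk(missing, onerror=errors.append))

print(top_first, sub_first)  # True True
print(visited, len(errors))  # [] 1
```

Without onerror, the failure on the missing path would be ignored silently and the walk would simply yield nothing.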

Using `os.walk`

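os.walk returns a generator, so its contents are obtained by iterating over it in a loop (printing it directly only shows the generator object).

Each iteration yields a 3-tuple (root, dirs, files):

root is the path of the folder currently being visited
dirs is a list of the names of the directories directly inside that folder (deeper levels are not included)
files is a list of the names of the files directly inside that folder (deeper levels are not included)

Note that the function updates root automatically so that every subfolder is visited. The number of 3-tuples yielded therefore equals the number of subfolders at every depth plus 1 (for the top folder).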
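
The count rule can be verified on a small made-up tree with three folders below the top one:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Hypothetical tree: two subfolders, one of which has its own subfolder.
    os.makedirs(os.path.join(top, "a", "deep"))
    os.makedirs(os.path.join(top, "b"))
    open(os.path.join(top, "a", "note.txt"), "w").close()

    triples = list(os.walk(top))
    # Every folder below the top appears exactly once in some `dirs` list.
    subfolder_count = sum(len(dirs) for root, dirs, files in triples)

print(len(triples), subfolder_count + 1)  # 4 4
```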

Example:

Tested on a directory with the following structure:

$ cd namesort/
$ tree
.
|-- namelist.txt
|-- nameout.txt
|-- namesorttest.py
`-- test
    |-- name2.txt
    |-- namelist.txt
    |-- nameout.txt
    `-- namesort.py

Run:

import os
path = '/home/jhxie/Workspace/namesort'
for root, dirs, files in os.walk(path):
    print(root, dirs, files)

Output:

/home/jhxie/Workspace/namesort ['test'] ['nameout.txt', 'namelist.txt', 'namesorttest.py']
/home/jhxie/Workspace/namesort/test [] ['nameout.txt', 'name2.txt', 'namelist.txt', 'namesort.py']

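Getting the paths of all files (using `os.path.join`)

os.walk yields folder and file names rather than full file paths, so the pieces have to be joined to form a path.

The standard library function os.path.join does this. Its syntax is: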

os.path.join(path1[, path2[, ...]])

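The nested [] mark optional arguments. The components are joined in argument order, so higher-level directories come first and lower-level ones follow.

Example: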

os.path.join("home", "me", "mywork")

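On Linux this returns home/me/mywork
On Windows this returns home\me\mywork

As you may have noticed, this is not plain string concatenation. There is no need to put separators into the arguments: the function inserts the separator appropriate for the current system, which is the whole point of it.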
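
Both results can be reproduced on any machine, because the standard library ships the POSIX and Windows implementations of os.path as the modules posixpath and ntpath:

```python
import ntpath      # the Windows implementation of os.path
import os
import posixpath   # the POSIX (Linux, macOS) implementation of os.path

parts = ("home", "me", "mywork")

posix_result = posixpath.join(*parts)
windows_result = ntpath.join(*parts)
native_result = os.path.join(*parts)   # whichever of the two matches this system

print(posix_result)    # home/me/mywork
print(windows_result)  # home\me\mywork
print(native_result == os.sep.join(parts))  # True
```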

os.path.join() is therefore a natural fit for processing the traversal results produced above:

import os
path = '/home/jhxie/Workspace/namesort'
for root, dirs, files in os.walk(path):
    for file in files:
        print(os.path.join(root, file))

Output:

/home/jhxie/Workspace/namesort/nameout.txt
/home/jhxie/Workspace/namesort/namelist.txt
/home/jhxie/Workspace/namesort/namesorttest.py
/home/jhxie/Workspace/namesort/test/nameout.txt
/home/jhxie/Workspace/namesort/test/name2.txt
/home/jhxie/Workspace/namesort/test/namelist.txt
/home/jhxie/Workspace/namesort/test/namesort.py
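
The same loop can be wrapped in a function and compared with a recursive glob. This is a sketch: the name list_files and the two-file tree are made up, and the two approaches only agree here because the tree has no dot-files, which glob would skip.

```python
import glob
import os
import tempfile

def list_files(top):
    """Return the sorted paths of all files under `top`, using os.walk."""
    paths = []
    for root, dirs, files in os.walk(top):
        for name in files:
            paths.append(os.path.join(root, name))
    return sorted(paths)

with tempfile.TemporaryDirectory() as top:
    # Hypothetical tree: top/namelist.txt and top/test/name2.txt
    os.makedirs(os.path.join(top, "test"))
    for rel in ("namelist.txt", os.path.join("test", "name2.txt")):
        open(os.path.join(top, rel), "w").close()

    walked = list_files(top)
    globbed = sorted(
        p for p in glob.glob(os.path.join(top, "**"), recursive=True)
        if os.path.isfile(p)  # ** also matches directories, so keep files only
    )
    same = walked == globbed
    names = [os.path.relpath(p, top).replace(os.sep, "/") for p in walked]

print(same)   # True
print(names)  # ['namelist.txt', 'test/name2.txt']
```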

Examples: generating the same label file, first with glob.iglob, then with os.walk

import os
import glob
train_txt_path = os.path.join("data", "LEDNUM", "train.txt")
train_dir = os.path.join("data", "LEDNUM", "train_data")
def gen_txt(train_txt_path, train_dir):
    with open(train_txt_path, 'w') as f:
        for img_path in glob.iglob(os.path.join(train_dir, '**/*.jpg'), recursive=True):
            label = os.path.basename(img_path).split('_')[0]
            line = img_path + ' ' + label + '\n'
            f.write(line)
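
A quick way to check the glob-based version is to run it on a throwaway directory. The sketch below repeats the function so that it is self-contained; the folder layout and the file naming scheme `<label>_<index>.jpg` are assumptions made for the test, not something the original data set is known to use.

```python
import glob
import os
import tempfile

def gen_txt(train_txt_path, train_dir):
    """Write one 'path label' line per .jpg found recursively under train_dir."""
    pattern = os.path.join(train_dir, "**", "*.jpg")
    with open(train_txt_path, "w") as f:
        for img_path in glob.iglob(pattern, recursive=True):
            label = os.path.basename(img_path).split("_")[0]
            f.write(img_path + " " + label + "\n")

with tempfile.TemporaryDirectory() as tmp:
    train_dir = os.path.join(tmp, "train_data")
    # Hypothetical layout: one folder per class, plus a non-image file.
    for rel in (("0", "0_001.jpg"), ("7", "7_001.jpg"), ("7", "notes.txt")):
        os.makedirs(os.path.join(train_dir, rel[0]), exist_ok=True)
        open(os.path.join(train_dir, *rel), "w").close()

    txt_path = os.path.join(tmp, "train.txt")
    gen_txt(txt_path, train_dir)
    with open(txt_path) as f:
        labels = sorted(line.split()[-1] for line in f)

print(labels)  # ['0', '7']
```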


import os
train_txt_path = os.path.join("data", "LEDNUM", "train.txt")
train_dir = os.path.join("data", "LEDNUM", "train_data")
def gen_txt(train_txt_path, train_dir):
    with open(train_txt_path, 'w') as f:
        for root, s_dirs, _ in os.walk(train_dir, topdown=True):  # walk train_dir and get the subfolder names
            for sub_dir in s_dirs:
                i_dir = os.path.join(root, sub_dir)             # path of each class folder
                img_list = os.listdir(i_dir)                    # names of all entries in the class folder
                for img_name in img_list:
                    if not img_name.endswith('jpg'):            # skip anything that is not a jpg file
                        continue
                    label = img_name.split('_')[0]
                    img_path = os.path.join(i_dir, img_name)
                    line = img_path + ' ' + label + '\n'
                    f.write(line)

Original article: https://blog.csdn.net/weixin_44302770/article/details/134481176
