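I had always used the Dataset class, so when I first ran into the IterableDataset class I looked up the difference.
There are two kinds of datasets:
A map-style dataset implements the __getitem__() and __len__() protocols and represents a map from (possibly non-integer) indices/keys to data samples.
For example, dataset[idx] can read the idx-th sample from disk.
An iterable-style dataset is an instance of a subclass of IterableDataset. It implements the __iter__() protocol and represents an iterable over data samples. This kind of dataset is particularly suitable when random reads are expensive or even impossible, and when the batch size depends on the fetched data.
For example, iter(dataset) can return a stream of data read from a database or a remote server, or even logs generated in real time. With multi-process loading, every worker iterates over its own copy of the dataset, so each sample is emitted num_workers times unless the dataset splits the work itself.
In Python, an iterable (Iterable) is not a specific data type. It is a container-like object whose elements can be accessed through the __iter__ method or the __getitem__ method.
The __iter__ method lets an object be traversed with a for...in loop, while the __getitem__ method lets its elements be accessed by index. Traversal is done with for...in: any iterable can be used directly in a for...in loop, and the statement does two things:
1. It calls __iter__ to obtain an iterator.
2. It repeatedly calls __next__ on that iterator until the iterator is exhausted.
An Iterable is turned into an Iterator by calling __iter__.
# Build an iterable-style dataset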
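# Imports are an assumption: the Tensor output further down matches paddle.io
import numpy as np
from paddle.io import Dataset, IterableDataset, DataLoader

class iter_Dataset(IterableDataset):
    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __iter__(self):
        # Yield the samples one by one; no random access is needed
        for i in range(self.num_samples):
            label = np.array(i)
            yield label

# Build a map-style dataset
class normal_Dataset(Dataset):
    def __init__(self, num_samples):
        self.num_samples = num_samples
        # Keep all samples in memory so they can be indexed
        self.data = list(range(self.num_samples))

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return self.data[idx]
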
iter_dataset = iter_Dataset(10)
dataset = normal_Dataset(10)
print(f"normal dataset type: {type(dataset)}")
for lbl in dataset:
    print(f"normal Dataset: {lbl}")
>>normal dataset type: <class '__main__.normal_Dataset'>
normal Dataset: 0
normal Dataset: 1
normal Dataset: 2
print(f"iterable dataset type: {type(iter_dataset)}")
for lbl in iter_dataset:
    print(f"iter dataset: {lbl}")
>>iterable dataset type: <class '__main__.iter_Dataset'>
iter dataset: 0
iter dataset: 1
iter dataset: 2
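
Both loops above rely on the iteration protocol described earlier. As a minimal sketch in plain Python (the classes Countdown and Squares and the helper manual_for are hypothetical names made up for this illustration, not part of any library), the following spells out the two steps a for...in statement performs, and shows that an object with only __getitem__ can still be iterated through Python's index-based fallback:

```python
class Countdown:
    """Iterable via __iter__: each call returns a fresh generator iterator."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        n = self.start
        while n > 0:
            yield n
            n -= 1


class Squares:
    """No __iter__ here; iteration works only through __getitem__."""

    def __init__(self, size):
        self.size = size

    def __getitem__(self, idx):
        if idx >= self.size:
            raise IndexError(idx)  # IndexError ends the index-based iteration
        return idx * idx


def manual_for(iterable):
    """Spell out what `for item in iterable` does under the hood."""
    items = []
    iterator = iter(iterable)      # step 1: obtain an iterator
    while True:
        try:
            item = next(iterator)  # step 2: call __next__ repeatedly
        except StopIteration:      # raised when the iterator is exhausted
            break
        items.append(item)
    return items


countdown_items = manual_for(Countdown(3))
square_items = manual_for(Squares(4))
print(countdown_items)  # [3, 2, 1]
print(square_items)     # [0, 1, 4, 9]
```

The second case is presumably why a map-style dataset that defines only __getitem__ and __len__ can appear in a for loop at all: Python asks for indices 0, 1, 2, ... until an IndexError is raised.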
With num_workers = 1 and batch_size = 1, both loaders yield the same values. The only difference is that the iterable-style loader wraps each batch in a list:
iter_dataset = iter_Dataset(10)
iter_dataloader = DataLoader(
    iter_dataset,
    num_workers=1,
    batch_size=1,
)
dataset = normal_Dataset(10)
dataloader = DataLoader(
    dataset,
    num_workers=1,
    batch_size=1,
)
for lbl in dataloader:
    print(f"normal Dataset: {lbl}")
>>normal Dataset: Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[0])
normal Dataset: Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[1])
normal Dataset: Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[2])
for lbl in iter_dataloader:
    print(f"iter dataset: {lbl}")
>>iter dataset: [Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[0])]
iter dataset: [Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[1])]
iter dataset: [Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[2])]
With num_workers = 2 and batch_size = 1, the iterable-style loader emits every sample twice. This confirms that each worker iterates over the entire dataset, so the data is repeated once per worker:
>>normal Dataset: Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[0])
normal Dataset: Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[1])
normal Dataset: Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[2])
>>iter dataset: [Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[0])]
iter dataset: [Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[0])]
iter dataset: [Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[1])]
iter dataset: [Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[1])]
iter dataset: [Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[2])]
iter dataset: [Tensor(shape=[1], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[2])]
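
One common way to avoid this duplication is to let each worker read only its own slice of the data. The following is a minimal sketch in plain Python, not the loader's actual mechanism: the class ShardedRange is hypothetical, and worker_id and num_workers are passed in by hand here, whereas in a real DataLoader they would have to come from the framework's worker-info helper (I believe paddle.io and torch.utils.data both provide a get_worker_info() function for this, but check the documentation of the version in use).

```python
class ShardedRange:
    """Iterable that yields only the samples assigned to one worker."""

    def __init__(self, num_samples, worker_id=0, num_workers=1):
        self.num_samples = num_samples
        self.worker_id = worker_id      # hypothetical: supplied by hand here
        self.num_workers = num_workers  # hypothetical: supplied by hand here

    def __iter__(self):
        # Worker k takes samples k, k + num_workers, k + 2 * num_workers, ...
        for i in range(self.worker_id, self.num_samples, self.num_workers):
            yield i


num_workers = 2
shards = [
    list(ShardedRange(10, worker_id=w, num_workers=num_workers))
    for w in range(num_workers)
]
print(shards)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]

# Taken together, the shards cover every sample exactly once.
merged = sorted(x for shard in shards for x in shard)
print(merged == list(range(10)))  # True
```

Striding by num_workers keeps the shards balanced without knowing the data in advance; contiguous blocks per worker would work as well when the total length is known.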
With num_workers = 2 and batch_size = 2, every batch now holds two elements, which confirms the effect of batch_size:
>>normal Dataset: Tensor(shape=[2], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[0, 1])
normal Dataset: Tensor(shape=[2], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[2, 3])
normal Dataset: Tensor(shape=[2], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[4, 5])
>>iter dataset: [Tensor(shape=[2], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[0, 1])]
iter dataset: [Tensor(shape=[2], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[0, 1])]
iter dataset: [Tensor(shape=[2], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[2, 3])]
iter dataset: [Tensor(shape=[2], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[2, 3])]
iter dataset: [Tensor(shape=[2], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[4, 5])]
iter dataset: [Tensor(shape=[2], dtype=int64, place=Place(gpu_pinned), stop_gradient=True,
[4, 5])]
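
The pattern in the iterable-style output above ([0, 1] twice, then [2, 3] twice, and so on) can be reproduced with a small simulation. This is only a sketch under assumptions: the helpers batches and simulate_loader are hypothetical, and the strict round-robin order between workers is assumed for illustration rather than taken from the loader's implementation.

```python
from itertools import islice


def batches(iterable, batch_size):
    """Group consecutive samples of one stream into lists of batch_size."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def simulate_loader(make_stream, num_workers, batch_size):
    """Interleave batches from num_workers independent copies of a stream."""
    workers = [batches(make_stream(), batch_size) for _ in range(num_workers)]
    out = []
    while workers:
        for worker in list(workers):
            batch = next(worker, None)
            if batch is None:
                workers.remove(worker)  # this worker's stream is exhausted
            else:
                out.append(batch)
    return out


result = simulate_loader(lambda: range(6), num_workers=2, batch_size=2)
print(result)  # [[0, 1], [0, 1], [2, 3], [2, 3], [4, 5], [4, 5]]
```

Each simulated worker batches its own full copy of the stream, which is why every batch shows up once per worker.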