In PyTorch, data loading is done through the Dataset and DataLoader classes: the Dataset defines the format of the data and the transforms applied to it, and the DataLoader reads batches of data iteratively. This article walks through the PyTorch data-loading workflow.
It follows 深入浅出PyTorch (Thorough PyTorch) to systematically fill in the fundamentals.
Contents of this section
- Common ways to load data in PyTorch
- Building your own data-loading pipeline
Dataset
We can define our own Dataset class for flexible data loading; it must subclass PyTorch's own Dataset class and implement three main methods:
- `__init__`: passes external arguments into the class and sets up the sample collection
- `__getitem__`: reads individual elements from the sample collection, optionally applies transforms, and returns the data needed for training/validation
- `__len__`: returns the number of samples in the dataset
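As a minimal sketch of this three-method contract, a map-style Dataset wrapping in-memory tensors might look like the following (the class name and the random data are hypothetical, for illustration only):

```python
import torch
from torch.utils.data import Dataset

class TensorPairDataset(Dataset):
    def __init__(self, features, labels):
        # __init__: receive external arguments and set up the sample collection
        self.features = features
        self.labels = labels

    def __getitem__(self, index):
        # __getitem__: read one element; transforms could be applied here
        return self.features[index], self.labels[index]

    def __len__(self):
        # __len__: report the number of samples
        return len(self.features)

# Hypothetical data: 100 samples with 8 features each
dataset = TensorPairDataset(torch.randn(100, 8), torch.zeros(100))
```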
Built-in Dataset
```python
import torch
from torchvision import datasets

train_data = datasets.ImageFolder(train_path, transform=data_transform)
val_data = datasets.ImageFolder(val_path, transform=data_transform)
```
This uses the ImageFolder class that ships with torchvision to read image data stored in a fixed structure: the path argument points to the directory holding the images, which contains a number of subdirectories, each corresponding to the images of one class.
Here data_transform applies transforms to the images, such as flips and crops, and can be defined by yourself.
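For example, a typical data_transform pipeline built with torchvision.transforms might look like the following sketch (the 224×224 crop size is an assumption, not from the original article):

```python
from torchvision import transforms

# Hypothetical augmentation pipeline: flip, crop, then convert to tensor
data_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),  # random left-right flip
    transforms.RandomResizedCrop(224),  # random crop resized to 224x224
    transforms.ToTensor(),              # PIL image -> float tensor in [0, 1]
])
```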
Custom Dataset
Here is another example, where the images live in a single folder and a separate csv file maps each image name to its label. In this case we need to define the Dataset class ourselves:
```python
import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data_dir, info_csv, image_list, transform=None):
        """
        Args:
            data_dir: path to image directory.
            info_csv: path to the csv file containing image indexes
                with corresponding labels.
            image_list: path to the txt file containing image names of the
                training/validation set.
            transform: optional transform to be applied on a sample.
        """
        label_info = pd.read_csv(info_csv)
        image_file = open(image_list).readlines()
        self.data_dir = data_dir
        self.image_file = image_file
        self.label_info = label_info
        self.transform = transform

    def __getitem__(self, index):
        """
        Args:
            index: the index of item
        Returns:
            image and its labels
        """
        # Look up the image name, then its label row in the csv
        image_name = self.image_file[index].strip('\n')
        raw_label = self.label_info.loc[self.label_info['Image_index'] == image_name]
        label = raw_label.iloc[:, 0]
        image_name = os.path.join(self.data_dir, image_name)
        image = Image.open(image_name).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        return image, label

    def __len__(self):
        return len(self.image_file)
```
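A usage sketch with hypothetical file paths (adapt them to your own directory layout):

```python
# Hypothetical paths; the csv must contain an 'Image_index' column
train_data = MyDataset(
    data_dir='images/',
    info_csv='labels.csv',
    image_list='train_list.txt',
    transform=data_transform,
)
image, label = train_data[0]  # __getitem__ is called under the hood
print(len(train_data))        # __len__ gives the number of samples
```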
DataLoader
Once the Dataset is built, we can use DataLoader to read the data in batches:
```python
from torch.utils.data import DataLoader

train_loader = DataLoader(train_data, batch_size=batch_size, num_workers=4,
                          shuffle=True, drop_last=True)
val_loader = DataLoader(val_data, batch_size=batch_size, num_workers=4,
                        shuffle=False)
```
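Iterating over the loader then yields one batch per step; a minimal sketch of inspecting the first batch from the loaders above:

```python
# Each step yields one batch: a tensor of images and a tensor of labels
for images, labels in train_loader:
    print(images.shape, labels.shape)  # e.g. [batch_size, 3, H, W] and [batch_size]
    break  # stop after the first batch in this sketch
```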
Where:
- batch_size: samples are read in batches; batch_size is the number of samples read in each batch
- num_workers: the number of processes used to load data; set it to 0 on Windows, while 4 or 8 are common on Linux, depending on your machine's configuration
- shuffle: whether to shuffle the data as it is read; usually True for the training set and False for the validation set
- drop_last: drops the trailing samples that do not fill a complete batch, so they no longer take part in training
DataLoader takes many more parameters and supports very powerful data pipelines; its signature in the PyTorch 2 documentation is as follows:
```python
torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=None, sampler=None,
                            batch_sampler=None, num_workers=0, collate_fn=None,
                            pin_memory=False, drop_last=False, timeout=0,
                            worker_init_fn=None, multiprocessing_context=None,
                            generator=None, *, prefetch_factor=None,
                            persistent_workers=False, pin_memory_device='')
```
Parameters:
- dataset (Dataset) – dataset from which to load the data.
- batch_size (int, optional) – how many samples per batch to load (default: `1`).
- shuffle (bool, optional) – set to `True` to have the data reshuffled at every epoch (default: `False`).
- sampler (Sampler or Iterable, optional) – defines the strategy to draw samples from the dataset. Can be any `Iterable` with `__len__` implemented. If specified, `shuffle` must not be specified.
- batch_sampler (Sampler or Iterable, optional) – like `sampler`, but returns a batch of indices at a time. Mutually exclusive with `batch_size`, `shuffle`, `sampler`, and `drop_last`.
- num_workers (int, optional) – how many subprocesses to use for data loading. `0` means that the data will be loaded in the main process. (default: `0`)
- collate_fn (Callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
- pin_memory (bool, optional) – if `True`, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your `collate_fn` returns a batch that is a custom type, see the official documentation for an example.
- drop_last (bool, optional) – set to `True` to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If `False` and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: `False`)
- timeout (numeric, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: `0`)
- worker_init_fn (Callable, optional) – if not `None`, this will be called on each worker subprocess with the worker id (an int in `[0, num_workers - 1]`) as input, after seeding and before data loading. (default: `None`)
- generator (torch.Generator, optional) – if not `None`, this RNG will be used by RandomSampler to generate random indexes and by multiprocessing to generate `base_seed` for workers. (default: `None`)
- prefetch_factor (int, optional, keyword-only arg) – number of batches loaded in advance by each worker. `2` means there will be a total of `2 * num_workers` batches prefetched across all workers. (The default depends on `num_workers`: if `num_workers=0` the default is `None`; if `num_workers > 0` the default is `2`.)
- persistent_workers (bool, optional) – if `True`, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers' Dataset instances alive. (default: `False`)
- pin_memory_device (str, optional) – the data loader will copy Tensors into device pinned memory before returning them if `pin_memory` is set to `True`.
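To tie a few of these options together, here is a hedged sketch that reuses the train_data Dataset from earlier and fixes the shuffling seed through a generator (the batch size and seed are assumptions for illustration):

```python
import torch
from torch.utils.data import DataLoader

# Fixed-seed generator so shuffling is reproducible across runs
g = torch.Generator()
g.manual_seed(0)

train_loader = DataLoader(
    train_data,
    batch_size=32,            # assumed batch size
    shuffle=True,             # reshuffle at every epoch
    num_workers=4,            # four loading subprocesses
    pin_memory=True,          # page-locked memory for faster GPU transfer
    persistent_workers=True,  # keep workers alive between epochs
    drop_last=True,           # drop the final incomplete batch
    generator=g,
)
```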
References
- https://datawhalechina.github.io/thorough-pytorch/第三章/3.3 数据读入.html
- https://pytorch.org/docs/stable/data.html