Training Strategy and PyTorch Document
Last updated on July 28, 2023
This post records some pitfalls I have run into with PyTorch, along with some model-training techniques.
PyTorch Document
Multi-GPU on a Single Machine
DP - Data Parallel
- Transfer the minibatch from RAM to GPU-0 (the master GPU).
- GPU-0 scatters sub-minibatches and replicates the model parameters to the other GPUs.
- Run the forward pass on every GPU.
- Gather the outputs on the master GPU and compute the loss.
- Scatter the loss back to the GPUs and run backward to compute gradients.
- Gather the gradients on GPU-0.
- Update the model parameters (see the sketch below).
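A minimal sketch of this flow with `nn.DataParallel`; the toy model, data, and optimizer here are placeholders, not part of the original post:

```python
import torch
import torch.nn as nn

# hypothetical toy setup: a linear model and random data
model = nn.Linear(128, 10).cuda()   # parameters live on GPU-0 (the master GPU)
model = nn.DataParallel(model)      # replicate across all visible GPUs

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(64, 128).cuda()          # minibatch is moved to GPU-0 first;
targets = torch.randint(0, 10, (64,)).cuda()  # DP then scatters it across GPUs

outputs = model(inputs)             # forward pass runs on every GPU,
loss = criterion(outputs, targets)  # outputs are gathered on GPU-0 for the loss
loss.backward()                     # gradients are reduced back onto GPU-0
optimizer.step()                    # parameters are updated once, on GPU-0
optimizer.zero_grad()
```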
DDP - Distributed Data Parallel
- Use multiple worker processes to parallelize data loading, sending a different minibatch to each GPU.
- Run the forward pass on each GPU to get its output.
- Compute the loss on each GPU.
- Run the backward pass on each GPU and all-reduce the gradients across all GPUs.
- Update the parameters on every GPU at the same time, so all replicas keep identical weights.
Note:
- The `torch.distributed` package and `torch.utils.data.DistributedSampler` are needed.
- The model should already be on its GPU (GPU-0 for the first process) before it is wrapped by DDP.
- Usage of the DataLoader should follow the Dataset and DataLoader section below. If you want to fully shuffle the data, set `shuffle=True` on the `train_sampler` instead of on the `DataLoader`.
- If you want `drop_last`, set it only on the `DataLoader`.
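Putting these notes together, a minimal single-node sketch, assuming a launch via `torchrun` with one process per GPU; the dataset and model are stand-ins:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend='nccl')      # torch.distributed is required
    local_rank = int(os.environ['LOCAL_RANK'])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # stand-in dataset; replace with a real one
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

    # shuffle on the sampler, NOT on the DataLoader; drop_last only on the DataLoader
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler, drop_last=True)

    model = nn.Linear(128, 10).cuda(local_rank)  # move the model to its GPU first,
    model = DDP(model, device_ids=[local_rank])  # then wrap it with DDP

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(2):
        sampler.set_epoch(epoch)                 # reshuffle differently each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = criterion(model(x), y)
            loss.backward()                      # gradients all-reduced across ranks
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == '__main__':
    main()  # launch with: torchrun --nproc_per_node=NUM_GPUS script.py
```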
Dataset and DataLoader
Dataset
All custom datasets should inherit from `torch.utils.data.Dataset` and implement `__len__` and `__getitem__`.
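A minimal sketch of such a map-style dataset; the random tensors are placeholders:

```python
import torch
from torch.utils.data import Dataset

class RandomVectorDataset(Dataset):
    """Hypothetical dataset returning (feature, label) pairs."""

    def __init__(self, num_samples=100, dim=16):
        self.data = torch.randn(num_samples, dim)
        self.labels = torch.randint(0, 2, (num_samples,))

    def __len__(self):
        # number of samples in the dataset
        return self.data.shape[0]

    def __getitem__(self, idx):
        # return one sample and its label
        return self.data[idx], self.labels[idx]
```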
Note: `__getitem__` returns a single sample's data. If samples do not all have the same shape, the DataLoader needs a custom `collate_fn` to pad and stack them into a batch; otherwise the default `collate_fn` can stack the equally-sized samples. An example of using `collate_fn` is as follows:
```python
import torch

def __getitem__(self, idx):
    # pick the video id depending on whether we iterate over entries or videos
    vid = self.corresponding_vid[idx] if (self.mode == 'train' and not self.save_on_disk) or self.is_total else self.video_ids[idx]
    choose_idx = 0
    feature2d = self.backbone_2d_dict[vid]
    feature3d = self.backbone_3d_dict[vid]
    objects = self.objects_dict[vid]
    if (self.mode == 'train' and not self.save_on_disk) or self.is_total:
        numberic_cap, vp_semantics, caption_semantics, nouns, nouns_vec, vocab_ids, vocab_probs, fillmasks = self.total_entries[idx]
    else:
        numberic_cap, vp_semantics, caption_semantics, nouns, nouns_vec = self.vid2language[vid][choose_idx][1:]
        vocab_ids, vocab_probs, fillmasks = None, None, None
    captions = self.vid2captions[vid]
    nouns_dict = {'nouns': nouns, 'vec': torch.FloatTensor(nouns_vec)}
    return torch.FloatTensor(feature2d), torch.FloatTensor(feature3d), torch.FloatTensor(objects), \
           torch.LongTensor(numberic_cap), \
           torch.FloatTensor(vp_semantics), \
           torch.FloatTensor(caption_semantics), captions, nouns_dict, vid, \
           vocab_ids, vocab_probs, fillmasks


def collate_fn_caption(batch):
    # transpose the list of samples into per-field tuples
    feature2ds, feature3ds, objects, numberic_caps, \
        vp_semantics, caption_semantics, captions, nouns_dict_list, vids, \
        vocab_ids, vocab_probs, fillmasks = zip(*batch)

    bsz, obj_dim = len(feature2ds), objects[0].shape[-1]
    longest_objects_num = max([item.shape[0] for item in objects])

    # zero-pad the variable-length object features up to the longest in the batch
    ret_objects = torch.zeros([bsz, longest_objects_num, obj_dim])
    ret_objects_mask = torch.ones([bsz, longest_objects_num])
    for i in range(bsz):
        ret_objects[i, :objects[i].shape[0], :] = objects[i]
        ret_objects_mask[i, :objects[i].shape[0]] = 0.0  # 0.0 = real object, 1.0 = padding

    # equally-sized fields are simply stacked along a new batch dimension
    feature2ds = torch.cat([item[None, ...] for item in feature2ds], dim=0)  # (bsz, sample_numb, dim_2d)
    feature3ds = torch.cat([item[None, ...] for item in feature3ds], dim=0)  # (bsz, sample_numb, dim_3d)
    vp_semantics = torch.cat([item[None, ...] for item in vp_semantics], dim=0)  # (bsz, dim_sem)
    caption_semantics = torch.cat([item[None, ...] for item in caption_semantics], dim=0)  # (bsz, dim_sem)
    numberic_caps = torch.cat([item[None, ...] for item in numberic_caps], dim=0)  # (bsz, seq_len)
    masks = numberic_caps > 0  # True for real tokens (nonzero ids), False for padding
    captions = [item for item in captions]
    nouns = list(nouns_dict_list)
    vids = list(vids)
    vocab_ids = torch.cat([item[None, ...] for item in vocab_ids], dim=0).long() if vocab_ids[0] is not None else None  # (bsz, seq_len, 50)
    vocab_probs = torch.cat([item[None, ...] for item in vocab_probs], dim=0).float() if vocab_probs[0] is not None else None  # (bsz, seq_len, 50)
    fillmasks = torch.cat([item[None, ...] for item in fillmasks], dim=0).float() if fillmasks[0] is not None else None  # (bsz, seq_len)
    return feature2ds.float(), feature3ds.float(), ret_objects.float(), ret_objects_mask.float(), \
           vp_semantics.float(), caption_semantics.float(), \
           numberic_caps.long(), masks.float(), captions, nouns, vids, \
           vocab_ids, vocab_probs, fillmasks
```
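Note the design here: only the variable-length `objects` tensors need padding, and `ret_objects_mask` marks padded positions with `1.0` and real objects with `0.0` so that downstream attention can ignore the padding; fields whose shapes already match across samples (e.g. `feature2ds`) are simply stacked along a new batch dimension.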
DataLoader
PyTorch uses `torch.utils.data.DataLoader` to feed data to the network. An example is as follows:
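For instance, a sketch that wires a dataset and the `collate_fn_caption` above into a loader; `caption_dataset` and the argument values are illustrative, not from the original post:

```python
from torch.utils.data import DataLoader

# `caption_dataset` is assumed to be a Dataset like the one sketched above
train_loader = DataLoader(
    caption_dataset,
    batch_size=64,                  # samples per batch
    shuffle=True,                   # reshuffle every epoch (omit when using a sampler)
    num_workers=4,                  # subprocesses used for loading
    collate_fn=collate_fn_caption,  # pads and stacks the variable-sized fields
    pin_memory=True,                # page-locked memory for faster GPU transfer
    drop_last=True,                 # drop the final incomplete batch
)

for batch in train_loader:
    feature2ds, feature3ds, objects, objects_mask, *rest = batch
```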
The main arguments are:
- dataset – the dataset from which to load the data.
- batch_size – how many samples per batch to load (default: `1`).
- shuffle – set to `True` to have the data reshuffled at every epoch (default: `False`).
- sampler – defines the strategy to draw samples from the dataset. Can be any `Iterable` with `__len__` implemented. If specified, `shuffle` must not be specified. -> This is what the DDP strategy uses (`DistributedSampler`).
- batch_sampler – like `sampler`, but returns a batch of indices at a time. Mutually exclusive with `batch_size`, `shuffle`, `sampler`, and `drop_last`.
- num_workers – how many subprocesses to use for data loading. `0` means the data will be loaded in the main process (default: `0`).
- collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used with batched loading from a map-style dataset.
- pin_memory – if `True`, the data loader will copy Tensors into device/CUDA pinned memory before returning them (see the usage sketch after this list).
- drop_last – set to `True` to drop the last incomplete batch if the dataset size is not divisible by the batch size. If `False` and the dataset size is not divisible by the batch size, the last batch will be smaller (default: `False`).
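As a usage note on `pin_memory`: pinned (page-locked) host memory allows asynchronous host-to-GPU copies, so it is usually paired with `non_blocking=True`. A sketch, assuming `loader` was built with `pin_memory=True`:

```python
import torch

device = torch.device('cuda')
for inputs, targets in loader:  # `loader` built with pin_memory=True
    # non_blocking=True lets the host-to-GPU copy overlap with computation;
    # this only takes effect because the source tensors sit in pinned memory
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward / backward as usual
```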