Training Strategy and PyTorch Document

Last updated on July 28, 2023

This post records some pitfalls I have hit in PyTorch and some model-training techniques.

PyTorch Documentation

Multi-GPU on a Single Machine
DP - Data Parallel
net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
  1. Transfer the minibatch from RAM to GPU-0 (the master GPU).
  2. GPU-0 scatters sub-minibatches and replicates the model parameters to the other GPUs.
  3. Run the forward pass on every GPU.
  4. Gather the outputs on the master GPU and compute the loss.
  5. Scatter the loss back to the GPUs and run backward to compute gradients.
  6. Reduce the gradients onto GPU-0.
  7. Update the model parameters on GPU-0 (a usage sketch follows this list).
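
A minimal usage sketch of DP, assuming the model is already defined and using hypothetical train_loader, criterion, and optimizer names that are not from the original post:

import torch
import torch.nn as nn

model = model.to('cuda:0')                          # master GPU holds the original parameters
net = nn.DataParallel(model, device_ids=[0, 1, 2])  # replicas are created on GPUs 1 and 2 each forward pass

for inputs, targets in train_loader:
    inputs, targets = inputs.to('cuda:0'), targets.to('cuda:0')  # step 1: move the minibatch to GPU-0
    outputs = net(inputs)                   # steps 2-4: scatter, parallel forward, gather on GPU-0
    loss = criterion(outputs, targets)      # loss is computed on the master GPU
    optimizer.zero_grad()
    loss.backward()                         # steps 5-6: backward on replicas, gradients reduced to GPU-0
    optimizer.step()                        # step 7: update the parameters on GPU-0
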
DDP - Distributed Data Parallel
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.data import DataLoader, DistributedSampler

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if multi_device:
    local_rank = int(os.environ["LOCAL_RANK"])       # GPU index of this process on the current node
    torch.cuda.set_device(local_rank)                # bind this process to its GPU
    dist.init_process_group('nccl', init_method='env://')   # initialize the NCCL backend

    # rank = int(os.environ['RANK'])                 # global rank, not needed here
    # world_size = int(os.environ['WORLD_SIZE'])

    device = torch.device(f'cuda:{local_rank}')      # all following tensors should use this device
    model.to(device)                                 # the model must be on its GPU before wrapping
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

    train_sampler = DistributedSampler(train_dataset, shuffle=True)   # each process draws a distinct shard
    train_data = DataLoader(train_dataset, batch_size, sampler=train_sampler,
                            pin_memory=True, drop_last=True)          # pass the sampler instead of shuffle=True
else:
    model.to(device)
  1. Multiple worker processes load data in parallel, and each process sends its own minibatch to its own GPU.
  2. Run the forward pass on each GPU and obtain its output.
  3. Compute the loss on each GPU.
  4. Run the backward pass on each GPU and all-reduce the gradients across all GPUs.
  5. Update the parameters on every GPU simultaneously, so all replicas keep identical weights.

Note:

  1. The torch.distributed package and torch.utils.data.DistributedSampler are both required.
  2. The model must already be on its own GPU (cuda:local_rank) before it is wrapped by DDP.
  3. Usage of DataLoader should follow the Dataset and DataLoader section below. If you want the data fully shuffled, set shuffle=True on the DistributedSampler instead of on the DataLoader (see the sketch after this note).
  4. If you want drop_last, set it only in the DataLoader.
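
When shuffling through the DistributedSampler, the sampler must be told the current epoch before each epoch, otherwise every epoch reuses the same shuffle order. A minimal sketch of the per-epoch loop (the script is normally launched with torchrun, which sets LOCAL_RANK, RANK, and WORLD_SIZE; criterion, optimizer, and num_epochs are placeholder names, not from the original code):

for epoch in range(num_epochs):
    train_sampler.set_epoch(epoch)      # reseed the sampler so each epoch shuffles differently
    for inputs, targets in train_data:
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        outputs = model(inputs)         # forward pass on this process's GPU
        loss = criterion(outputs, targets)
        optimizer.zero_grad()
        loss.backward()                 # DDP all-reduces gradients across processes here
        optimizer.step()                # every replica applies the same update
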
Dataset and DataLoader
Dataset

Every custom dataset should inherit from torch.utils.data.Dataset.

from torch.utils.data import Dataset

class dataset(Dataset):        # must implement the following three methods
    def __init__(self):
        ...                    # load or index the underlying data
    def __getitem__(self, idx):
        ...                    # return a single sample by index
    def __len__(self):
        ...                    # return the number of samples
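
For instance, a minimal concrete dataset over in-memory tensors could look like this (TensorPairDataset and its arguments are illustrative, not part of the original code):

import torch
from torch.utils.data import Dataset

class TensorPairDataset(Dataset):
    # Wraps a feature tensor and a label tensor of equal length.
    def __init__(self, features, labels):
        assert len(features) == len(labels)
        self.features = features
        self.labels = labels

    def __getitem__(self, idx):
        # Return one (sample, label) pair; the DataLoader stacks these into batches.
        return self.features[idx], self.labels[idx]

    def __len__(self):
        return len(self.features)

toy_dataset = TensorPairDataset(torch.randn(100, 16), torch.randint(0, 2, (100,)))
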

Note:

  1. __getitem__ returns a single sample. If the samples in a batch do not all have the same shape, the DataLoader needs a custom collate_fn to pad and stack them; samples of equal shape can use the default collate_fn.

    An example of a custom collate_fn is as follows:

    def __getitem__(self, idx):
        vid = self.corresponding_vid[idx] if (self.mode == 'train' and not self.save_on_disk) or self.is_total else self.video_ids[idx]
        choose_idx = 0

        feature2d = self.backbone_2d_dict[vid]
        feature3d = self.backbone_3d_dict[vid]
        objects = self.objects_dict[vid]

        if (self.mode == 'train' and not self.save_on_disk) or self.is_total:
            numberic_cap, vp_semantics, caption_semantics, nouns, nouns_vec, vocab_ids, vocab_probs, fillmasks = self.total_entries[idx]
        else:
            numberic_cap, vp_semantics, caption_semantics, nouns, nouns_vec = self.vid2language[vid][choose_idx][1:]
            vocab_ids, vocab_probs, fillmasks = None, None, None

        captions = self.vid2captions[vid]
        nouns_dict = {'nouns': nouns, 'vec': torch.FloatTensor(nouns_vec)}

        return torch.FloatTensor(feature2d), torch.FloatTensor(feature3d), torch.FloatTensor(objects), \
            torch.LongTensor(numberic_cap), \
            torch.FloatTensor(vp_semantics), \
            torch.FloatTensor(caption_semantics), captions, nouns_dict, vid, \
            vocab_ids, vocab_probs, fillmasks


    def collate_fn_caption(batch):
        feature2ds, feature3ds, objects, numberic_caps, \
            vp_semantics, caption_semantics, captions, nouns_dict_list, vids, \
            vocab_ids, vocab_probs, fillmasks = zip(*batch)

        bsz, obj_dim = len(feature2ds), objects[0].shape[-1]
        longest_objects_num = max([item.shape[0] for item in objects])
        ret_objects = torch.zeros([bsz, longest_objects_num, obj_dim])
        ret_objects_mask = torch.ones([bsz, longest_objects_num])
        for i in range(bsz):
            ret_objects[i, :objects[i].shape[0], :] = objects[i]
            ret_objects_mask[i, :objects[i].shape[0]] = 0.0

        feature2ds = torch.cat([item[None, ...] for item in feature2ds], dim=0)  # (bsz, sample_numb, dim_2d)
        feature3ds = torch.cat([item[None, ...] for item in feature3ds], dim=0)  # (bsz, sample_numb, dim_3d)

        vp_semantics = torch.cat([item[None, ...] for item in vp_semantics], dim=0)  # (bsz, dim_sem)
        caption_semantics = torch.cat([item[None, ...] for item in caption_semantics], dim=0)  # (bsz, dim_sem)

        numberic_caps = torch.cat([item[None, ...] for item in numberic_caps], dim=0)  # (bsz, seq_len)
        masks = numberic_caps > 0

        captions = [item for item in captions]
        nouns = list(nouns_dict_list)
        vids = list(vids)
        vocab_ids = torch.cat([item[None, ...] for item in vocab_ids], dim=0).long() if vocab_ids[0] is not None else None  # (bsz, seq_len, 50)
        vocab_probs = torch.cat([item[None, ...] for item in vocab_probs], dim=0).float() if vocab_probs[0] is not None else None  # (bsz, seq_len, 50)
        fillmasks = torch.cat([item[None, ...] for item in fillmasks], dim=0).float() if fillmasks[0] is not None else None  # (bsz, seq_len)

        return feature2ds.float(), feature3ds.float(), ret_objects.float(), ret_objects_mask.float(), \
            vp_semantics.float(), caption_semantics.float(), \
            numberic_caps.long(), masks.float(), captions, nouns, vids, \
            vocab_ids, vocab_probs, fillmasks

DataLoader

PyTorch uses torch.utils.data.DataLoader to feed data to the network. An example is as follows:

from torch.utils.data import DataLoader
loader = DataLoader(dataset=dataset, batch_size=cfgs.bsz,
                    shuffle=True, collate_fn=collate_fn,
                    num_workers=0)
  • dataset - dataset from which to load the data.
  • batch_size – how many samples per batch to load (default: 1).
  • shuffle – set to True to have the data reshuffled at every epoch (default: False).
  • sampler – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified. This is the argument the DistributedSampler is passed through in the DDP strategy above.
  • batch_sampler – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
  • num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
  • collate_fn – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
  • pin_memory – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them, which enables asynchronous host-to-GPU copies (a usage sketch follows this list). If your data elements or the batches returned by your collate_fn are a custom type, they are only pinned if that type defines its own pin_memory() method.
  • drop_last – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
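
As an illustration of num_workers, pin_memory, and drop_last together (device is assumed to be defined as in the DDP section above, and the dataset name is a placeholder):

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=4,      # four worker subprocesses prepare batches in the background
                    pin_memory=True,    # page-locked host memory enables asynchronous host-to-GPU copies
                    drop_last=True)     # drop the final incomplete batch

for inputs, labels in loader:
    # non_blocking=True overlaps the copy with computation because the source memory is pinned
    inputs = inputs.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)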

Training strategies