Training Process
The training pipeline supports distributed training, mixed precision, and dynamic learning rate scheduling.
Training Concepts
1. Loop Structure
Standard PyTorch training loop:
- Forward Pass: Compute model predictions.
- Loss Calculation: Compute the loss with CrossEntropyLoss.
- Backward Pass: Backpropagate gradients.
- Optimization Step: Update weights (Adam optimizer).
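The sketch below illustrates this loop. It is a minimal example, assuming the model, DataLoader, and optimizer are constructed elsewhere; the function and variable names are placeholders, not the actual train.py API.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device):
    """One epoch of the standard loop: forward, loss, backward, step."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        logits = model(inputs)             # forward pass
        loss = criterion(logits, targets)  # loss calculation
        loss.backward()                    # backward pass
        optimizer.step()                   # Adam update
```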
2. Mixed Precision (AMP)
We use torch.cuda.amp (Automatic Mixed Precision) to speed up training and reduce memory usage.
- Scaler: GradScaler is used to prevent gradient underflow/overflow.
- BFloat16: Used for matrix multiplications on supported hardware.
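A minimal sketch of a single AMP training step, assuming the model, criterion, optimizer, and a GradScaler are already constructed; the function name and the bfloat16 autocast dtype are illustrative rather than taken from train.py.

```python
import torch
from torch.cuda.amp import GradScaler

def amp_train_step(model, inputs, targets, criterion, optimizer, scaler):
    """One training step under autocast with gradient scaling."""
    optimizer.zero_grad()
    # Run the forward pass and loss in reduced precision where safe.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(inputs)
        loss = criterion(logits, targets)
    # Scale the loss so small gradients do not underflow, then step and update the scale.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

# scaler = GradScaler()  # created once, outside the training loop
```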
3. Distributed Data Parallel (DDP)
The train.py script is designed to run on multiple GPUs.
- DistributedSampler: Ensures each GPU sees a unique subset of the data.
- Sync: Gradients are synchronized across GPUs during the backward pass.
- Metric Reduction: Validation metrics are aggregated from all ranks.
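A sketch of the DDP pieces listed above, assuming one process per GPU launched via torchrun; the helper names and the NCCL backend are assumptions, not necessarily what train.py uses.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup_ddp(model, dataset, batch_size, local_rank):
    """Wrap the model in DDP and shard the data across ranks."""
    dist.init_process_group(backend="nccl")   # env:// init when launched with torchrun
    torch.cuda.set_device(local_rank)
    model = DDP(model.to(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)     # each rank sees a unique subset of the data
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return model, loader

def reduce_metric(value: torch.Tensor) -> torch.Tensor:
    """Average a validation metric across all ranks."""
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    return value / dist.get_world_size()
```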
4. Hyperparameters
Key hyperparameters (defaults):
- Hidden Size: 384
- Layers: 4
- Learning Rate: 1e-3
- Scheduler: ReduceLROnPlateau (factor 0.2, patience 3)
- Dropout: 0.3–0.5
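The snippet below shows how the optimizer and scheduler defaults above could be wired together; it is a sketch, and the helper function is not part of train.py.

```python
import torch

def build_optimizer_and_scheduler(model, lr=1e-3):
    """Adam with the default learning rate plus ReduceLROnPlateau."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.2, patience=3
    )
    return optimizer, scheduler

# After each validation epoch: scheduler.step(val_loss)
```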
Checkpointing
The model is saved only when validation loss improves.
- Format: .pth file containing model state, optimizer state, and scheduler state.
- Naming: checkpoint_{timestamp}-signs_{num_signs}/{epoch}.pth
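A minimal sketch of the improvement-gated save; the dictionary keys and function name are illustrative, and the path is expected to follow the naming scheme above.

```python
import torch

def maybe_save_checkpoint(model, optimizer, scheduler, val_loss, best_loss, path):
    """Save model, optimizer, and scheduler state only when validation loss improves."""
    if val_loss >= best_loss:
        return best_loss
    torch.save(
        {
            "model_state": model.state_dict(),         # key names are illustrative
            "optimizer_state": optimizer.state_dict(),
            "scheduler_state": scheduler.state_dict(),
            "val_loss": val_loss,
        },
        path,
    )
    return val_loss  # new best validation loss
```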