DeepSpeed - Learning Rate Scheduler



DeepSpeed provides an optimizer and a learning rate scheduler that address the major challenges of large-scale deep learning training.

The DeepSpeed optimizer reduces memory consumption and improves training efficiency through ZeRO, mixed precision training, and gradient checkpointing. The DeepSpeed scheduler adjusts the learning rate dynamically during training so that the model converges faster and reaches better final performance.
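
These optimizer-side features are switched on through the DeepSpeed configuration. The fragment below is a minimal sketch with illustrative values, enabling ZeRO stage 2 and fp16 mixed precision alongside the optimizer (activation checkpointing has its own config section, which is omitted here) −

# Illustrative DeepSpeed config fragment: ZeRO stage 2 partitions optimizer
# states and gradients across data-parallel workers, and fp16 enables
# mixed precision training.
optimizer_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-3}
    }
}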

Put together, these components let developers train models that would otherwise be far too large to handle effectively.

What is a Learning Rate Scheduler?

The DeepSpeed scheduler plays a crucial role in model training because it controls the learning rate. By adjusting the learning rate dynamically, it stabilizes training and speeds up convergence. It also supports several common scheduling techniques, such as linear decay, cosine decay, and step decay, for different training settings.
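
For example, a warm-up followed by a linear decay can be expressed with DeepSpeed's WarmupDecayLR scheduler. The sketch below shows only the scheduler section of a configuration; the step counts and learning rates are placeholders −

# Scheduler section of a DeepSpeed config using WarmupDecayLR:
# the learning rate ramps up for warmup_num_steps and then decays
# linearly over the remaining steps up to total_num_steps.
decay_scheduler = {
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "total_num_steps": 10000,
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 1000
        }
    }
}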

Key Features of DeepSpeed Scheduler

The following are the key features of DeepSpeed Scheduler −

1. Dynamic Learning Rate Adjustment

The scheduler adjusts the learning rate during training by following a predefined schedule, which improves convergence and helps prevent overfitting.
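
As a plain-Python illustration of the idea (this is not DeepSpeed API, only the concept the scheduler automates), a step-decay schedule maps the current training step to a learning rate −

# Conceptual illustration (not DeepSpeed API): a predefined schedule
# mapping the current training step to a learning rate.
def step_decay_lr(step, base_lr=0.01, decay_factor=0.5, decay_every=1000):
    # Halve the learning rate every `decay_every` steps
    return base_lr * (decay_factor ** (step // decay_every))

print(step_decay_lr(0))      # 0.01
print(step_decay_lr(2500))   # 0.0025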

2. Warm-up Schedulers

The library provides warm-up strategies that gradually raise the learning rate from a very low value at the start of training.
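
A typical warm-up configuration uses the WarmupLR scheduler, which also appears in the full example later in this chapter. The warmup_type field below is assumed to accept "linear" in addition to the default logarithmic ramp; the values are placeholders −

# WarmupLR raises the learning rate from warmup_min_lr to warmup_max_lr
# over warmup_num_steps and then holds it constant.
warmup_scheduler = {
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 1e-5,
            "warmup_max_lr": 1e-3,
            "warmup_num_steps": 500,
            "warmup_type": "linear"   # assumed option; "log" is the default
        }
    }
}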

3. Multi-Phase Schedulers

It is possible to configure multiple phases in your schedule, each defining different learning rate behavior.
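
One multi-phase option is DeepSpeed's OneCycle scheduler, which raises the learning rate in a first phase, lowers it in a second, and then decays it. The parameter names below follow the documented OneCycle options, but treat the values as placeholders and verify them against your installed version −

# OneCycle: phase 1 ramps the learning rate from cycle_min_lr to
# cycle_max_lr, phase 2 ramps it back down, then a decay phase follows.
one_cycle_scheduler = {
    "scheduler": {
        "type": "OneCycle",
        "params": {
            "cycle_min_lr": 1e-4,
            "cycle_max_lr": 1e-2,
            "cycle_first_step_size": 1000,
            "decay_step_size": 1000,
            "decay_lr_rate": 0.001
        }
    }
}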

Example of Using DeepSpeed Scheduler

The following example shows how to use the DeepSpeed scheduler −

import torch
import torch.nn as nn
import deepspeed

# Model definition
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

# Initialize the model (DeepSpeed builds the optimizer from the config below)
model = SimpleModel()

# DeepSpeed configuration for optimizer and scheduler
ds_config = {
    "train_batch_size": 8,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.01,
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0.001,
            "warmup_max_lr": 0.01,
            "warmup_num_steps": 100
        }
    }
}

# Initialize DeepSpeed; the Adam optimizer and WarmupLR scheduler are created from the config
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)

# Sample input and forward pass
inputs = torch.randn(8, 10).to(model_engine.device)
outputs = model_engine(inputs)
loss = outputs.mean()

# Backward pass and step
# (model_engine.step() also advances the WarmupLR scheduler defined in the config)
model_engine.backward(loss)
model_engine.step()

# Report the learning rate and loss after the step
print(f"Learning rate after warm-up: {optimizer.param_groups[0]['lr']:.4f}")
print(f"Loss: {loss.item():.4f}")
print("Training step completed")

Output

The above Python code produces output similar to the following (exact values will vary) −

Learning rate after warm-up: 0.0023
Loss: 0.0214
Training step completed

Running the script in a terminal prints the learning rate after the warm-up step, so you can verify that the scheduler adjusted it as expected.
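
To observe the warm-up over time, the single training step above can be extended into a short loop. The sketch below reuses model_engine and optimizer from the earlier example and prints the learning rate every 50 steps −

# Run a few training steps and watch the learning rate climb during warm-up
for step in range(1, 201):
    inputs = torch.randn(8, 10).to(model_engine.device)
    loss = model_engine(inputs).mean()

    model_engine.backward(loss)
    model_engine.step()   # also advances the WarmupLR scheduler

    if step % 50 == 0:
        current_lr = optimizer.param_groups[0]["lr"]
        print(f"step {step:4d}  lr {current_lr:.5f}  loss {loss.item():.4f}")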

The examples and outputs in this chapter should make it easier to apply these tools in your own deep learning workflow.

DeepSpeed Optimizer and Scheduler Work Together

The DeepSpeed optimizer and scheduler work hand in hand. The optimizer manages memory efficiently and performs the gradient-based parameter updates, while the scheduler dynamically adjusts the learning rate for better convergence and overall training performance. Because DeepSpeed integrates both through a single configuration, large models can be trained faster, with efficient resource utilization and stable training.
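
As a sketch of how the two sides are declared together, a single configuration can define the optimizer, the scheduler, and the memory-saving features at once; all values below are illustrative −

# One config drives both the optimizer and the scheduler, plus ZeRO and fp16
combined_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 3e-4, "weight_decay": 0.01}
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "total_num_steps": 10000,
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 3e-4,
            "warmup_num_steps": 500
        }
    }
}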
