
Monitoring BERT Model Training with TensorBoard

Gradient Flow and Update Ratios

https://unsplash.com/@tobiastu

In the previous article, we explained all the building blocks of the BERT model. Now we are going to train the model while monitoring the training process in TensorBoard, looking at the gradient flow, the update-to-parameter ratios, the loss and the evaluation metrics.

Why would we monitor gradient flow and update ratios instead of simply looking at the loss and evaluation metrics? When we train a model on a large amount of data, we might run many iterations before realising from the loss and evaluation metrics that the model is not training. By looking at the gradient magnitudes and update ratios, we can spot that something is wrong much earlier, which saves time and money.


Data preparation

We will use the 20newsgroups dataset (License: Public Domain / Source: http://qwone.com/~jason/20Newsgroups/) from sklearn in this example, restricted to 4 categories: alt.atheism, talk.religion.misc, comp.graphics and sci.space. We tokenize the data with BertTokenizer from the transformers library and wrap it in a BertDataset class, which inherits from torch.utils.data.Dataset. This lets a DataLoader batch and shuffle the data and conveniently feed it to the model.

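import pandas as pd
import torch
from sklearn.datasets import fetch_20newsgroups
from torch.utils.data import DataLoader, Dataset
from transformers import BertConfig, BertTokenizer

# Load the train and test splits for the four selected categories
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
X_train = pd.DataFrame(newsgroups_train['data'])
y_train = pd.Series(newsgroups_train['target'])
X_test = pd.DataFrame(newsgroups_test['data'])
y_test = pd.Series(newsgroups_test['target'])
BATCH_SIZE = 16
max_length = 256
config = BertConfig.from_pretrained("bert-base-uncased")
config.num_labels = len(y_train.unique())
config.max_position_embeddings = max_length
# Assumption: the tokenizer is not defined in this snippet, so the pretrained bert-base-uncased tokenizer is used
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_encodings = tokenizer(X_train[0].tolist(), truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(X_test[0].tolist(), truncation=True, padding=True, max_length=max_length)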

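# Assumption: device is not defined in this snippet, so we use the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class BertDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Build one example as a dict of tensors moved to the target device
        item = {key: torch.tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx]).to(device)
        return item

    def __len__(self):
        return len(self.labels)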

train_dataset = BertDataset(train_encodings, y_train)
test_dataset = BertDataset(test_encodings, y_test)
train_dataset_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dataset_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)
for d in train_dataset_loader:
    print(d)
    break

# output : 
{'input_ids': tensor([[ 101, 2013, 1024,  ...,    0,    0,    0],
         [ 101, 2013, 1024,  ..., 1064, 1028,  102],
         [ 101, 2013, 1024,  ...,    0,    0,    0],
         ...,
         [ 101, 2013, 1024,  ..., 2620, 1011,  102],
         [ 101, 2013, 1024,  ..., 1012, 4012,  102],
         [ 101, 2013, 1024,  ..., 3849, 2053,  102]], device='cuda:0'),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0'),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0'),
 'labels': tensor([3, 0, 2, 1, 0, 2, 2, 1, 1, 0, 1, 3, 3, 0, 2, 1], device='cuda:0')}

TensorBoard usage

TensorBoard lets us write and save different types of data, including scalars and images, for later analysis. First, let’s install TensorBoard with pip:

pip install tensorboard

To write to TensorBoard, we will use the SummaryWriter from torch.utils.tensorboard:

from torch.utils.tensorboard import SummaryWriter
# SummaryWriter takes log directory as argument
writer = SummaryWriter('tensorboard/runs/bert_experiment_1')

To write scalars, we use:

writer.add_scalar('loss/train', loss, counter_train)

The counter_train variable records the step number at which a value was written to TensorBoard. To write an image, we use the following:

writer.add_figure("gradients", myfig, 
        global_step=counter_train, close=True, walltime=None)

Model training

Now let’s look at our training function:

def train_epoch(  model : BertForSequenceClassification,
                  data_loader : DataLoader,
                  optimizer : AdamW,
                  scheduler : get_linear_schedule_with_warmup,
                  n_examples : int,
                  out_tensorboard : bool = True,
                  out_every : int = 30,
                  step_eval : int = None,
                  test_data_loader : DataLoader= None,
                  len_test_dataset : int = None
                ) -> Tuple[float, float]:
    """
    out_every : every how many steps add gradients and ratios figures and train loss to the tensorboard
    out_tensorboard : write to tensorboard or not
    step_eval : every how many step evaluate the model on test data. If None is passed then we will evaluate only at the end of the epoch.
    """
    print(f"Overall number of steps for training : {len(data_loader) * EPOCHS}")
    print(f"Tensorboard will save {(len(data_loader) * EPOCHS) // out_every} figures")
    global counter_train
    SKIP_PROB = 0
    model = model.train()
    losses = []
    correct_predictions = 0

    tot_batches = len(data_loader) * EPOCHS
    steps_out_stats = list(np.arange(0, tot_batches, out_every))

    running_losses = []
    for d in tqdm(data_loader, desc="Train batch"):

        outputs = model(**d)
        preds = outputs.logits.argmax(1)
        loss = outputs.loss
        correct_predictions += torch.sum(preds == d['labels'])
        running_losses.append(loss.item())
        losses.append(loss.item())
        loss.backward()

        if counter_train in steps_out_stats and out_tensorboard:
            curr_params = copy.deepcopy(optimizer.param_groups[0]['params'])

            print("writing gradients and ratios..")
            # write gradients to tensorboard
            myfig = plot_grad_flow(model.named_parameters(), skip_prob=SKIP_PROB)
            writer.add_figure("gradients", myfig, global_step=counter_train, close=True, walltime=None)

            named_params = copy_model_params(model.named_parameters())

            writer.add_scalar('loss/train', np.mean(running_losses), counter_train)
            # reset the running window after logging its mean
            running_losses = []

            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()

            next_params = copy.deepcopy(optimizer.param_groups[0]['params'])
            ratios = compute_ratios(curr_params, next_params, named_params)
            optimizer.zero_grad()

            fig_ratio = plot_ratios(ratios, skip_prob=SKIP_PROB)
            writer.add_figure("gradient ratios", fig_ratio, global_step=counter_train, close=True, walltime=None)

        else:
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        writer.add_scalar('learning_rate', scheduler.get_last_lr()[0], counter_train)

        if step_eval is not None and counter_train % step_eval == 0 and counter_train > 1:
            print("evaluating the model..")
            val_acc, val_loss = eval_model(model, test_data_loader, len_test_dataset)
            writer.add_scalar('loss/test', val_loss, counter_train)
            writer.add_scalar('accuracy/test', val_acc, counter_train)
            model = model.train()

        counter_train += 1
    return correct_predictions.cpu().numpy() / n_examples, np.mean(losses)

The out_every variable controls how often we write to TensorBoard, measured in number of steps. We can also evaluate more often than once per epoch using the step_eval variable. The gradient flow and update ratio figures are returned by the plot_grad_flow and plot_ratios functions respectively.
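
As a rough illustration of how out_every and step_eval interact with the global step counter, the sketch below reproduces only the step arithmetic of train_epoch in plain Python. The helper name logging_plan and the example numbers (120 batches per epoch) are assumptions made for illustration and are not part of the training code.

```python
def logging_plan(n_batches, epochs, out_every, step_eval=None):
    """Return the steps at which figures are written and evaluation runs.

    Mirrors the two conditions used in train_epoch:
    - figures: counter_train is a multiple of out_every (starting at 0)
    - evaluation: counter_train % step_eval == 0 and counter_train > 1
    """
    total_steps = n_batches * epochs
    figure_steps = list(range(0, total_steps, out_every))
    eval_steps = []
    if step_eval is not None:
        eval_steps = [s for s in range(total_steps) if s % step_eval == 0 and s > 1]
    return figure_steps, eval_steps


# Hypothetical run: 120 batches per epoch, 5 epochs,
# figures every 30 steps, evaluation every 100 steps.
figure_steps, eval_steps = logging_plan(n_batches=120, epochs=5, out_every=30, step_eval=100)
print(len(figure_steps))  # 20 gradient/ratio figures
print(eval_steps)         # [100, 200, 300, 400, 500]
```

Note that step 0 is never evaluated because of the counter_train > 1 condition, so the first evaluation happens at step_eval.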

We pass the model’s named parameters to plot_grad_flow, and for better visualization we can skip some of the layers at random with the skip_prob parameter. We plot the mean, max and standard deviation of the absolute gradients of each layer, ignoring bias parameters as they are less interesting. You can remove the "bias" not in n condition on the 17th line of the function if you want to display the bias gradients as well. We also display the fraction of zero gradients in each layer. Note that because BERT uses the GELU rather than the ReLU activation function, this last plot might be less useful here, but if you are using a different model with ReLU, which suffers from the dying neurons problem, displaying the fraction of zero gradients is actually helpful.

def plot_grad_flow(named_parameters : iter, skip_prob : float = 0.5, verbose : bool = False, seed : int = 0) -> plt.figure:
    '''Plots the gradients flowing through different layers in the net during training.
    Can be used for checking for possible gradient vanishing / exploding problems.

    skip_prob : skip some random layers for better visualization if there are a lot of layers
    seed : random seed to skip layers
    '''

    np.random.seed(seed)
    name_replace = {"encoder" : "enc", "layer" : "l"}
    plt.rcParams["figure.figsize"] = (20, 12)
    # plt.rcParams['figure.dpi']= 150

    mean_grads, max_grads, zero_grads, stds = [], [], [], []
    name_layers = []
    for (n, p) in (named_parameters):
        if (p.requires_grad) and ("bias" not in n):
            if np.random.rand() < skip_prob:
                if verbose:
                    print(f"skipped {n}")
                continue
            name_layers.append('.'.join([name_replace.get(el, el) for el in n.split(".") if el not in ['weight', 'bert', 'self']]))
            mean_grads.append(p.grad.abs().mean().detach().cpu().item())
            max_grads.append(p.grad.abs().max().detach().cpu().item())
            stds.append(p.grad.abs().std().detach().cpu().item())
            zero_grads.append(torch.sum(p.grad.abs() == 0.).detach().cpu().item() / p.grad.nelement())
        elif not (p.requires_grad):
            print(f"{n} does not require grad")

    sns.set_style("darkgrid", {"axes.facecolor": ".9"})
    sns.set(font_scale = 1.2)
    fig, ax = plt.subplots(3)
    sns.barplot(x=name_layers, y=max_grads, palette=['b']*len(name_layers), alpha=0.9, ax=ax[2], label="max grads",)
    sns.barplot(x=name_layers, y=mean_grads, palette=['r']*len(name_layers), alpha=0.9, ax=ax[2], label="mean grads",)

    sns.barplot(x=name_layers, y=zero_grads, palette=['k']*len(name_layers), alpha=0.9, ax=ax[0], label="percentage zero grads")

    sns.barplot(x=name_layers, y=stds, palette=['c']*len(name_layers), alpha=0.9, ax=ax[1], label="standard dev grads")

    ax[2].set_ylim([-0.005, 0.05])
    ax[1].set_ylim([-0.005, 0.1])
    ax[0].set_ylim([-0.005, 1.05])
    ax[2].set_xticklabels(ax[2].get_xticklabels(), rotation = 90)
    ax[0].set(xticklabels=[])
    ax[1].set(xticklabels=[])

    ax[0].legend()
    ax[1].legend()
    plt.legend()
    plt.close()
    return fig
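
To see why the zero-gradient plot matters for ReLU networks, here is a small, self-contained sketch of a "dead" ReLU layer. The layer sizes, the bias value of -100 and the helper name zero_grad_fraction are assumptions chosen for illustration; the fraction itself is computed in the same way as zero_grads in plot_grad_flow.

```python
import torch
from torch import nn


def zero_grad_fraction(param: torch.Tensor) -> float:
    """Fraction of entries in param.grad that are exactly zero."""
    grad = param.grad
    return (grad == 0).sum().item() / grad.numel()


torch.manual_seed(0)
layer = nn.Linear(4, 3)
with torch.no_grad():
    # Push every pre-activation far below zero so that ReLU outputs 0 everywhere
    layer.bias.fill_(-100.0)

x = torch.rand(8, 4)  # inputs in [0, 1)
loss = torch.relu(layer(x)).sum()
loss.backward()

# ReLU passes no gradient for negative inputs, so the whole layer receives zero gradient
dead_fraction = zero_grad_fraction(layer.weight)
print(dead_fraction)  # 1.0
```

A layer that shows a value close to 1 in the zero-gradient plot for many consecutive steps is therefore a good candidate for closer inspection.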

The plot_ratios function displays the update / parameter ratio for each layer. This is a normalized measure, since the norm of the update is divided by the norm of the parameter values, and it helps to understand how the neural network is learning. As a rough heuristic, this value should be around 1e-3: if it is much lower, the learning rate might be too low; if it is much higher, the learning rate is likely too high. Here too, you can reduce the number of layers displayed with the skip_prob parameter.

def plot_ratios(ratios : dict,
                skip_prob : float = 0.5,
                verbose : bool = False,
                seed : int = 0) -> plt.figure:
    '''Plots the update/param ratio.
    Can be used for checking for possible gradient vanishing / exploding problems.

    This ratio should be around 1e-3.
    If it is lower than this then the learning rate might be too low.
    If it is higher then the learning rate is likely too high

    skip_prob : skip some random layers for better visualization if there are a lot of layers
    seed : random seed to skip layers
    '''
    np.random.seed(seed)
    name_replace = {"encoder" : "enc", "layer" : "l"}
    plt.rcParams["figure.figsize"] = (20, 12)
    # plt.rcParams['figure.dpi']= 150

    ratios_list = []
    name_layers = []
    for (n, r) in (ratios.items()):
        if np.random.rand() < skip_prob:
            if verbose:
                print(f"skipped {n}")
            continue
        name_layers.append('.'.join([name_replace.get(el, el) for el in n.split(".") if el not in ['weight', 'bert', 'self']]))
        ratios_list.append(r)

    sns.set_style("darkgrid", {"axes.facecolor": ".9"})
    sns.set(font_scale = 1.2)
    fig, ax = plt.subplots()
    sns.barplot(x=name_layers, y=ratios_list, palette=['b']*len(name_layers), ax=ax, label="ratio update/param",)

    ax.set_ylim([-0.0005, 0.005])
    ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)

    plt.legend()
    plt.close()
    return fig

def compute_ratios(param_prev : list,
                   param_next : list,
                   named_params : dict,
                   ignore_bias : bool = True) -> dict:
    """ Compute update/param ratio
    """
    updates = {}
    param_names = []
    for i, (name, val) in enumerate(named_params.items()):
        param_names.append(name)
        updates[name] = copy.deepcopy((param_next[i] - param_prev[i]).cpu().detach().numpy())

    ratio_updates = {}
    for n in param_names:
        if ignore_bias:
            if 'bias' in n: continue
        k = named_params[n]
        param_scale = np.linalg.norm(k.ravel())
        update = updates[n]
        update_scale = np.linalg.norm(update.ravel())
        ratio_updates[n] = update_scale/param_scale
    return ratio_updates
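
To make the update / parameter ratio concrete, here is a tiny worked example in plain Python. It assumes a single plain SGD step on a toy layer with four weights, which is a simplification (the training above uses AdamW); the helper names l2_norm and update_ratio are hypothetical.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def update_ratio(params_before, params_after):
    """Norm of the update divided by the norm of the parameters before the step."""
    update = [after - before for before, after in zip(params_before, params_after)]
    return l2_norm(update) / l2_norm(params_before)


# Toy layer with four weights and one plain SGD step: w <- w - lr * grad
weights = [0.5, -0.5, 0.5, -0.5]   # norm = 1.0
grads = [0.1, 0.1, -0.1, -0.1]
lr = 0.01
new_weights = [w - lr * g for w, g in zip(weights, grads)]

ratio = update_ratio(weights, new_weights)
print(f"{ratio:.1e}")  # 2.0e-03, close to the 1e-3 heuristic
```

Each weight moves by 0.001, so the update has norm 0.002 while the weights have norm 1.0, giving a ratio of about 2e-3.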

During model training, you can start TensorBoard with the following command, run from the tensorboard/runs directory:

tensorboard --samples_per_plugin images=100 --logdir bert_experiment_1

Alternatively, you can create a .bat file (on Windows) for a quicker launch. Create a new file with a .bat extension, for example run_tensorboard.bat, and copy and paste the command below, modifying path_to_anaconda_env, path_to_saved_results and env_name accordingly. Then double-click the file to launch TensorBoard.

cmd /k "cd path_to_anaconda_env\Scripts & activate env_name & cd path_to_saved_results\tensorboard\runs & tensorboard --samples_per_plugin images=100 --logdir bert_experiment_1"

By setting --samples_per_plugin images=100 we display all the images in this task; by default, TensorBoard displays only some of them.


Results

By monitoring these plots, we can quickly spot when something is not going as expected. For example, if the gradients are zero for many layers, you might have a vanishing gradient problem. Similarly, if the ratios are very low or very high, you might want to dig deeper immediately, without waiting for several epochs or the end of training.
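
Such a check can also be automated. The sketch below flags layers whose update / parameter ratio falls outside a band around the 1e-3 heuristic. The function name flag_ratios, the band of one order of magnitude on either side, and the example values are all assumptions made for illustration; they are not measurements from the training runs in this article.

```python
def flag_ratios(ratios, low=1e-4, high=1e-2):
    """Return the layer names whose ratio is below `low` or above `high`.

    The default band (one order of magnitude around 1e-3) is an assumption,
    not a rule; tune it to your model.
    """
    too_low = sorted(name for name, r in ratios.items() if r < low)
    too_high = sorted(name for name, r in ratios.items() if r > high)
    return too_low, too_high


# Made-up ratios in the format returned by compute_ratios, for illustration only
example_ratios = {
    "enc.l.0.attention.query": 2e-6,
    "enc.l.5.output.dense": 1.2e-3,
    "classifier": 4e-2,
}

too_low, too_high = flag_ratios(example_ratios)
print("ratio too low :", too_low)
print("ratio too high:", too_high)
```

In a training loop, the output of compute_ratios could be passed to such a helper every out_every steps to print a warning alongside the figures.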

For example, looking at the ratios and the gradients at step 240 with the configuration below, we can see that things look good and training is proceeding well, so we can expect good results at the end.

EPOCHS = 5
optimizer = AdamW(model.parameters(), lr=3e-5, correct_bias=False)
total_steps = len(train_dataset_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
                   optimizer,
                   num_warmup_steps= 0.7 * total_steps,
                   num_training_steps=total_steps
                                            )
Ratio updates, Image by Author
Gradients flow, Image by Author

Indeed, the final results are:

train/test accuracy, Image by Author

If we instead change the scheduler settings to this:

scheduler = get_linear_schedule_with_warmup(
 optimizer,
 num_warmup_steps= 0.1 * total_steps,
 num_training_steps=total_steps
 )

We notice from the graphs at step 180 that the model is not learning, and we can stop the training to investigate further. In this case, setting num_warmup_steps to 0.1 * total_steps means the learning rate peaks after only 10% of the steps and starts decreasing shortly after the start of training, and we end up with vanishing gradients which do not propagate back to the first layers of the network, effectively stopping the learning process.
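
To compare the two warmup settings, the sketch below re-implements in plain Python the multiplier that, to our understanding, get_linear_schedule_with_warmup applies to the base learning rate: a linear ramp from 0 to 1 during warmup, then a linear decay to 0. The function name linear_warmup_factor and the total of 1000 steps are assumptions for illustration only.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay: go linearly from 1 down to 0 at the last step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5
total_steps = 1000  # hypothetical number of training steps
settings = {"warmup 0.7 * total_steps": 700, "warmup 0.1 * total_steps": 100}

for label, warmup_steps in settings.items():
    lrs = [base_lr * linear_warmup_factor(s, warmup_steps, total_steps)
           for s in (50, 100, 500, 900)]
    print(label, [f"{lr:.2e}" for lr in lrs])
```

With the longer warmup, the learning rate is still well below its peak at step 100, whereas with the shorter warmup it has already reached 3e-5 at step 100 and decays from there.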

Ratio updates, Image by Author
Gradients flow, Image by Author
train/test accuracy, Image by Author

You can find the full code in this GitHub repo to try and experiment yourself.


Conclusions

After reading this and the previous articles, you should have all the tools and understanding needed to train a BERT model in your own projects! The things described here are some of those you want to monitor during training, but TensorBoard offers other functionality, such as the Embedding Projector for exploring your embedding layer, and much more that you can find here.

References

https://cs231n.github.io/neural-networks-3/

