
In the previous article, we walked through all the building blocks of the BERT model. Now we are going to train the model while monitoring the training process in TensorBoard, looking at the gradient flow, the update/parameter ratios, the loss and the evaluation metrics.
Why monitor gradient flow and update ratios instead of simply looking at the loss and evaluation metrics? When we start training on a large amount of data, we might run many iterations before realising, from the loss and evaluation metrics alone, that the model is not learning. By looking at the gradient magnitudes and update ratios we can immediately spot that something is wrong, which saves us time and money.
Data preparation
We will use the 20newsgroups dataset (License: Public Domain / Source: http://qwone.com/~jason/20Newsgroups/) from sklearn in this example, restricted to 4 categories: alt.atheism, talk.religion.misc, comp.graphics and sci.space. We tokenize the data with BertTokenizer from the transformers library and wrap it into a BertDataset class which inherits from torch.utils.data.Dataset, allowing us to batch and shuffle the data and conveniently load it into the model.
import pandas as pd
import torch
from sklearn.datasets import fetch_20newsgroups
from torch.utils.data import Dataset, DataLoader
from transformers import BertConfig, BertTokenizer

# device used below to move tensors to the GPU when available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
X_train = pd.DataFrame(newsgroups_train['data'])
y_train = pd.Series(newsgroups_train['target'])
X_test = pd.DataFrame(newsgroups_test['data'])
y_test = pd.Series(newsgroups_test['target'])
BATCH_SIZE = 16
max_length = 256
config = BertConfig.from_pretrained("bert-base-uncased")
config.num_labels = len(y_train.unique())
config.max_position_embeddings = max_length
train_encodings = tokenizer(X_train[0].tolist(), truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(X_test[0].tolist(), truncation=True, padding=True, max_length=max_length)
class BertDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx]).to(device)
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = BertDataset(train_encodings, y_train)
test_dataset = BertDataset(test_encodings, y_test)
train_dataset_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dataset_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

for d in train_dataset_loader:
    print(d)
    break
# output :
{'input_ids': tensor([[ 101, 2013, 1024, ..., 0, 0, 0],
[ 101, 2013, 1024, ..., 1064, 1028, 102],
[ 101, 2013, 1024, ..., 0, 0, 0],
...,
[ 101, 2013, 1024, ..., 2620, 1011, 102],
[ 101, 2013, 1024, ..., 1012, 4012, 102],
[ 101, 2013, 1024, ..., 3849, 2053, 102]], device='cuda:0'),
'token_type_ids': tensor([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], device='cuda:0'),
'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:0'),
'labels': tensor([3, 0, 2, 1, 0, 2, 2, 1, 1, 0, 1, 3, 3, 0, 2, 1], device='cuda:0')}
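The rest of the article assumes a model already instantiated and moved to the device. This is not shown in the original post, but a minimal sketch of how it could be created from the config defined above looks like this (note that overriding max_position_embeddings to 256, versus the pretrained 512, means the position embeddings cannot be loaded as-is, hence ignore_mismatched_sizes=True):
from transformers import BertForSequenceClassification

# Hypothetical setup: load the pretrained weights, with the classification head
# sized by config.num_labels; mismatched position embeddings are re-initialized.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", config=config, ignore_mismatched_sizes=True
)
model = model.to(device)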
TensorBoard usage
TensorBoard lets us write and save different types of data for later analysis, including images and scalars. First of all, let's install it with pip:
pip install tensorboard
To write to TensorBoard we will use the SummaryWriter from torch.utils.tensorboard:
from torch.utils.tensorboard import SummaryWriter
# SummaryWriter takes log directory as argument
writer = SummaryWriter('tensorboard/runs/bert_experiment_1')
To write scalars, we use:
writer.add_scalar('loss/train', loss, counter_train)
The counter_train variable keeps track of the step number at which something was written to TensorBoard. To write a figure, we use the following:
writer.add_figure("gradients", myfig,
global_step=counter_train, close=True, walltime=None)
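As a quick self-contained check that everything is wired up (dummy values only; during the real training we keep a single writer open until the end), you can log a few scalars and flush them to disk:
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('tensorboard/runs/bert_experiment_1')
# each call logs one point under the same tag, indexed by the step number,
# which is what ends up on the x-axis in TensorBoard
for step, dummy_loss in enumerate([1.2, 0.9, 0.7]):
    writer.add_scalar('loss/train', dummy_loss, step)
writer.flush()   # make sure everything is written to disk
writer.close()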
Model training
Now let's take a look at our training function:
def train_epoch(model: BertForSequenceClassification,
                data_loader: DataLoader,
                optimizer: AdamW,
                scheduler: get_linear_schedule_with_warmup,
                n_examples: int,
                out_tensorboard: bool = True,
                out_every: int = 30,
                step_eval: int = None,
                test_data_loader: DataLoader = None,
                len_test_dataset: int = None
                ) -> Tuple[float, float]:
    """
    out_every : every how many steps to write gradient and ratio figures and the train loss to TensorBoard
    out_tensorboard : whether to write to TensorBoard
    step_eval : every how many steps to evaluate the model on test data. If None, evaluate only at the end of the epoch.
    """
    print(f"Overall number of steps for training : {len(data_loader) * EPOCHS}")
    print(f"Tensorboard will save {(len(data_loader) * EPOCHS) // out_every} figures")
    global counter_train
    SKIP_PROB = 0
    model = model.train()
    losses = []
    correct_predictions = 0
    tot_batches = len(data_loader) * EPOCHS
    steps_out_stats = list(np.arange(0, tot_batches, out_every))
    running_losses = []
    for d in tqdm(data_loader, desc="Train batch"):
        outputs = model(**d)
        preds = outputs.logits.argmax(1)
        loss = outputs.loss
        correct_predictions += torch.sum(preds == d['labels'])
        running_losses.append(loss.item())
        losses.append(loss.item())
        loss.backward()
        if counter_train in steps_out_stats and out_tensorboard:
            curr_params = copy.deepcopy(optimizer.param_groups[0]['params'])
            print("writing gradients and ratios..")
            # write gradients to tensorboard
            myfig = plot_grad_flow(model.named_parameters(), skip_prob=SKIP_PROB)
            writer.add_figure("gradients", myfig, global_step=counter_train, close=True, walltime=None)
            named_params = copy_model_params(model.named_parameters())
            writer.add_scalar('loss/train', np.mean(running_losses), counter_train)
            running_losses = []
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
            next_params = copy.deepcopy(optimizer.param_groups[0]['params'])
            ratios = compute_ratios(curr_params, next_params, named_params)
            optimizer.zero_grad()
            fig_ratio = plot_ratios(ratios, skip_prob=SKIP_PROB)
            writer.add_figure("gradient ratios", fig_ratio, global_step=counter_train, close=True, walltime=None)
        else:
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        writer.add_scalar('learning_rate', scheduler.get_last_lr()[0], counter_train)
        if step_eval is not None and counter_train % step_eval == 0 and counter_train > 1:
            print("evaluating the model..")
            val_acc, val_loss = eval_model(model, test_data_loader, len_test_dataset)
            writer.add_scalar('loss/test', val_loss, counter_train)
            writer.add_scalar('accuracy/test', val_acc, counter_train)
            model = model.train()
        counter_train += 1
    return correct_predictions.cpu().numpy() / n_examples, np.mean(losses)
The out_every variable controls how often (in steps) we write to TensorBoard. We can also evaluate more often than once per epoch through the step_eval variable. The gradient flow and update ratio figures are returned by the plot_grad_flow and plot_ratios functions respectively.
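Before looking at those plotting functions, note that train_epoch also relies on two helpers that are not shown in this article: copy_model_params and eval_model. The exact implementations below are not from the original post, just a minimal sketch: copy_model_params returns parameter values as NumPy arrays keyed by name (which is what compute_ratios below expects), and eval_model mirrors the training loop without gradient updates.
from typing import Tuple

import numpy as np
import torch


def copy_model_params(named_parameters) -> dict:
    """Snapshot of the current parameter values as NumPy arrays keyed by name.
    Assumes the iteration order matches optimizer.param_groups[0]['params'],
    which holds when the optimizer was built from model.parameters()."""
    return {n: p.detach().cpu().numpy().copy() for n, p in named_parameters}


def eval_model(model, data_loader, n_examples) -> Tuple[float, float]:
    """Evaluate on a held-out set; returns (accuracy, mean loss)."""
    model = model.eval()
    losses = []
    correct_predictions = 0
    with torch.no_grad():
        for d in data_loader:
            outputs = model(**d)
            preds = outputs.logits.argmax(1)
            correct_predictions += torch.sum(preds == d['labels'])
            losses.append(outputs.loss.item())
    return correct_predictions.cpu().numpy() / n_examples, np.mean(losses)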
To plot_grad_flow we pass the model's parameters and, for better visualization, we can decide to skip some of the layers with the skip_prob parameter. We write the mean, max and standard deviation of the gradients of each layer, ignoring bias parameters as they are less interesting; you can remove the ("bias" not in n) check if you want to display the bias gradients as well. We also display the percentage of zero gradients in each layer. It's important to highlight that, since BERT uses the GELU rather than the ReLU activation function, this last plot might be less useful; but if you are using a different model with ReLU, which suffers from the dying neurons problem, displaying the percentage of zero gradients is actually helpful.
def plot_grad_flow(named_parameters: iter, skip_prob: float = 0.5, verbose: bool = False, seed: int = 0) -> plt.figure:
    '''Plots the gradients flowing through different layers in the net during training.
    Can be used for checking for possible gradient vanishing / exploding problems.
    skip_prob : skip some random layers for better visualization if there are a lot of layers
    seed : random seed to skip layers
    '''
    np.random.seed(seed)
    name_replace = {"encoder": "enc", "layer": "l"}
    plt.rcParams["figure.figsize"] = (20, 12)
    # plt.rcParams['figure.dpi'] = 150
    mean_grads, max_grads, zero_grads, stds = [], [], [], []
    name_layers = []
    for (n, p) in named_parameters:
        if p.requires_grad and ("bias" not in n):
            if np.random.rand() < skip_prob:
                if verbose:
                    print(f"skipped {n}")
                continue
            name_layers.append('.'.join([name_replace.get(el, el) for el in n.split(".") if el not in ['weight', 'bert', 'self']]))
            mean_grads.append(p.grad.abs().mean().detach().cpu().item())
            max_grads.append(p.grad.abs().max().detach().cpu().item())
            stds.append(p.grad.abs().std().detach().cpu().item())
            zero_grads.append(torch.sum(p.grad.abs() == 0.).detach().cpu().item() / p.grad.nelement())
        elif not p.requires_grad:
            print(f"{n} does not require grad")
    sns.set_style("darkgrid", {"axes.facecolor": ".9"})
    sns.set(font_scale=1.2)
    fig, ax = plt.subplots(3)
    sns.barplot(x=name_layers, y=max_grads, palette=['b'] * len(name_layers), alpha=0.9, ax=ax[2], label="max grads")
    sns.barplot(x=name_layers, y=mean_grads, palette=['r'] * len(name_layers), alpha=0.9, ax=ax[2], label="mean grads")
    sns.barplot(x=name_layers, y=zero_grads, palette=['k'] * len(name_layers), alpha=0.9, ax=ax[0], label="percentage zero grads")
    sns.barplot(x=name_layers, y=stds, palette=['c'] * len(name_layers), alpha=0.9, ax=ax[1], label="standard dev grads")
    ax[2].set_ylim([-0.005, 0.05])
    ax[1].set_ylim([-0.005, 0.1])
    ax[0].set_ylim([-0.005, 1.05])
    ax[2].set_xticklabels(ax[2].get_xticklabels(), rotation=90)
    ax[0].set(xticklabels=[])
    ax[1].set(xticklabels=[])
    ax[0].legend()
    ax[1].legend()
    plt.legend()
    plt.close()
    return fig
The plot_ratios function displays the update/parameter ratio for each parameter. Since the update is divided by the parameter value, this is a standardized measure that helps to understand how the network is learning. As a rough heuristic, this value should be around 1e-3: if it is lower, the learning rate might be too low; if it is higher, the learning rate is likely too high. Here too you can reduce the number of layers displayed through the skip_prob parameter.
def plot_ratios(ratios: dict,
                skip_prob: float = 0.5,
                verbose: bool = False,
                seed: int = 0) -> plt.figure:
    '''Plots the update/param ratio.
    Can be used for checking for possible gradient vanishing / exploding problems.
    This ratio should be around 1e-3.
    If it is lower than this then the learning rate might be too low.
    If it is higher then the learning rate is likely too high.
    skip_prob : skip some random layers for better visualization if there are a lot of layers
    seed : random seed to skip layers
    '''
    np.random.seed(seed)
    name_replace = {"encoder": "enc", "layer": "l"}
    plt.rcParams["figure.figsize"] = (20, 12)
    # plt.rcParams['figure.dpi'] = 150
    ratios_list = []
    name_layers = []
    for (n, r) in ratios.items():
        if np.random.rand() < skip_prob:
            if verbose:
                print(f"skipped {n}")
            continue
        name_layers.append('.'.join([name_replace.get(el, el) for el in n.split(".") if el not in ['weight', 'bert', 'self']]))
        ratios_list.append(r)
    sns.set_style("darkgrid", {"axes.facecolor": ".9"})
    sns.set(font_scale=1.2)
    fig, ax = plt.subplots()
    sns.barplot(x=name_layers, y=ratios_list, palette=['b'] * len(name_layers), ax=ax, label="ratio update/param")
    ax.set_ylim([-0.0005, 0.005])
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
    plt.legend()
    plt.close()
    return fig
def compute_ratios(param_prev: dict,
                   param_next: dict,
                   named_params: dict,
                   ignore_bias: bool = True) -> dict:
    """Compute update/param ratio."""
    updates = {}
    param_names = []
    for i, (name, val) in enumerate(named_params.items()):
        param_names.append(name)
        updates[name] = copy.deepcopy((param_next[i] - param_prev[i]).cpu().detach().numpy())
    ratio_updates = {}
    for n in param_names:
        if ignore_bias:
            if 'bias' in n:
                continue
        k = named_params[n]
        param_scale = np.linalg.norm(k.ravel())
        update = updates[n]
        update_scale = np.linalg.norm(update.ravel())
        ratio_updates[n] = update_scale / param_scale
    return ratio_updates
During model training you can start TensorBoard using the following command:
tensorboard --samples_per_plugin images=100 --logdir bert_experiment_1
Otherwise, you can create a .bat file (on Windows) for a quicker launch. Create a new file, for example run_tensorboard.bat, and copy-paste the command below, modifying path_to_anaconda_env, path_to_saved_results and env_name accordingly. Then click on the file to launch TensorBoard.
cmd /k "cd path_to_anaconda_envScripts & activate env_name & cd path_to_saved_resultstensorboardruns & tensorboard - samples_per_plugin images=100 - logdir bert_experiment_1"
Setting --samples_per_plugin images=100 makes TensorBoard display all the images written in this task; by default it would only display some of them.
Results
Monitoring these plots, we can quickly spot if something is not going as expected. For example, if the gradients are zero for many layers you might have a vanishing gradient problem. Similarly, if the ratios are very low or very high you might want to dig deeper immediately, without waiting until the end of training or for several epochs.
For example, looking at the ratios and the gradients at step 240 with the configuration below, we can see that things look good and our training is proceeding well, so we can expect good results at the end.
EPOCHS = 5
optimizer = AdamW(model.parameters(), lr=3e-5, correct_bias=False)
total_steps = len(train_dataset_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps= 0.7 * total_steps,
num_training_steps=total_steps
)
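For completeness, here is a sketch of the surrounding training driver, which is not shown in the original post. It assumes the objects defined above (model, optimizer, scheduler, data loaders, writer); the globals counter_train and EPOCHS used inside train_epoch live at this level, and the step_eval value is purely illustrative.
counter_train = 0  # global step counter used by train_epoch for TensorBoard

for epoch in range(EPOCHS):
    print(f"Epoch {epoch + 1}/{EPOCHS}")
    train_acc, train_loss = train_epoch(
        model,
        train_dataset_loader,
        optimizer,
        scheduler,
        n_examples=len(train_dataset),
        out_tensorboard=True,
        out_every=30,
        step_eval=60,                    # evaluate every 60 steps (illustrative)
        test_data_loader=test_dataset_loader,
        len_test_dataset=len(test_dataset),
    )
    print(f"train loss {train_loss:.4f} | train accuracy {train_acc:.4f}")

    # end-of-epoch evaluation
    val_acc, val_loss = eval_model(model, test_dataset_loader, len(test_dataset))
    writer.add_scalar('loss/test', val_loss, counter_train)
    writer.add_scalar('accuracy/test', val_acc, counter_train)
    print(f"test loss {val_loss:.4f} | test accuracy {val_acc:.4f}")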


Indeed, the final results are:

If instead we change the scheduler setting to this:
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps= 0.1 * total_steps,
num_training_steps=total_steps
)
We notice from the graphs, already at the 180th step, that the model is not learning, and we can stop the training to investigate further. In this case, setting num_warmup_steps to 0.1 * total_steps makes the learning rate decrease and become very small shortly after the start of training, and we end up with vanishing gradients which do not propagate back to the first layers of the network, effectively stopping the learning process.
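If you want to see how the warmup fraction shapes the learning rate over training, you can also plot the schedule directly. This small standalone sketch is not part of the original post; the step count and plotting details are illustrative.
import matplotlib.pyplot as plt
import torch
from transformers import get_linear_schedule_with_warmup

total_steps = 1000  # illustrative number of training steps

for warmup_frac in (0.7, 0.1):
    # dummy parameter and optimizer, only used to drive the scheduler
    dummy_opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=3e-5)
    sched = get_linear_schedule_with_warmup(
        dummy_opt,
        num_warmup_steps=int(warmup_frac * total_steps),
        num_training_steps=total_steps,
    )
    lrs = []
    for _ in range(total_steps):
        dummy_opt.step()
        sched.step()
        lrs.append(sched.get_last_lr()[0])
    plt.plot(lrs, label=f"warmup {warmup_frac:.0%}")

plt.xlabel("step")
plt.ylabel("learning rate")
plt.legend()
plt.show()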



You can find the full code in this GitHub repo to try and experiment yourself.
Conclusions
After reading this and the previous articles, you should have all the tools and understanding needed to train a BERT model in your own projects! The quantities described here are some of the ones you want to monitor during training, but TensorBoard offers other functionality, like the Embedding Projector that you can use to explore your embedding layers, and much more that you can find here.
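For instance, the Embedding Projector can be fed with writer.add_embedding. A minimal sketch, not from the original post and using illustrative tensor names, could log the [CLS] embeddings of one test batch so they can be explored in the Projector tab:
batch = next(iter(test_dataset_loader))
with torch.no_grad():
    hidden = model.bert(input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'],
                        token_type_ids=batch['token_type_ids']).last_hidden_state
cls_embeddings = hidden[:, 0, :]               # [CLS] vector of each example
writer.add_embedding(cls_embeddings,
                     metadata=batch['labels'].tolist(),
                     global_step=counter_train,
                     tag="cls_embeddings")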