
A step further in the creation of a sign language translation system based on artificial…

A strategy to bring accessibility at scale.

Photo by Jo Hilton on Unsplash

Communication is fundamental in our society: people use it daily to express themselves and to access basic services like public transport, school, and health care. Sign language is used in every country by people with severe hearing loss, a condition that affects millions of people worldwide. The problem is that most hearing people do not know sign language, creating a barrier that makes social interaction difficult for deaf people.

To overcome this obstacle, we can use Artificial Intelligence techniques, like Convolutional Neural Networks, to create a sign translation system that generates captions for the signs being performed.

Another interesting point is that, although every country has its own sign language, a deep learning architecture can generalize well to problems in the same domain, requiring only some retraining and hyperparameter optimization.

However, there is one part of this process that is expensive, time-consuming, and repetitive for every sign language in the world: dataset creation.

Imagine that someone created a state-of-the-art architecture that recognizes signs and generates captions with very high accuracy in American Sign Language (ASL). To put this solution into practice and bring accessibility to the real world, scientists in every country that doesn’t use ASL would need to create a huge dataset (covering the most common words used daily, for example) to retrain the network. So it’s clear that one of the main bottlenecks is dataset creation!

Based on this, I’ll explore some findings on how to create a sign language dataset more efficiently and still train a high-accuracy model, serving as a guide for future work.

This article is based on my paper: Efficient sign language recognition system and dataset creation method based on deep learning and image processing.

Experimental Dataset

The main idea of this work is to create a sign language recognition system based on a cheap dataset, helping future projects that will need to do the same.

But what is a cheap dataset?

In my opinion, it’s a dataset that uses a simple sensor, like an RGB camera, a few interpreters, and the same background in every recording.

That’s why we created a dataset by recording videos with two different smartphones, against the same standard background, and with two interpreters, resulting in a simple and easy setup.

Another question was which frame rate (FPS) to use when capturing the recordings and when subsampling them into images, since it can lead to different results in the final performance. So we created two datasets using the same procedure: the first was recorded at 60 FPS and the second at 30 FPS.

Furthermore, we also subsampled the first dataset to 30 and 20 FPS and the second to 20 FPS, to test whether we can reduce the number of images after the recording is done without impacting the results.
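For illustration, frame subsampling can be done with a few lines of OpenCV. The sketch below is an assumption of how such a step could look (the helper name and file paths are hypothetical), not the exact script used in the paper.

```python
# Minimal sketch (assumed, not the authors' exact script): subsample a video
# recorded at a higher frame rate into an image sequence at a target FPS.
import cv2

def extract_frames(video_path, target_fps, out_pattern="frame_{:05d}.jpg"):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS)                 # e.g. 60 or 30
    step = max(int(round(native_fps / target_fps)), 1)     # keep every `step`-th frame

    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(out_pattern.format(saved), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example: turn a 60 FPS recording into a 20 FPS image sequence.
# extract_frames("sign_hello_60fps.mp4", target_fps=20)
```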

In the end, we recorded 14 signs used daily by deaf people, repeating each sign three times with some variation between executions. The dataset can be found on Kaggle.

List of words considered in our dataset.

To evaluate the model’s generalization, we created a final validation dataset with the same 14 signs, but recorded over different backgrounds and lighting conditions, trying to reproduce a real-world scenario.

Train and test images, using the same background and light condition. (Image by Author)
Validation set, using different backgrounds and light conditions. (Image by Author)

Hypotheses and experiments

These are the main hypotheses of this work, based on questions that emerged during the research, along with the experiments we ran to answer them.

60 FPS may be better than 30 FPS, since it reduces motion blur.

Producing a sign requires movement, and movement can introduce blur in video recordings. In this context, a higher frame rate should reduce the blur, which could improve the model’s accuracy.

Experiment: We will compare the accuracy obtained with the datasets recorded at 30 and 60 FPS.

Artificial background creation may improve the generalization of the model

Recording all signs against the same static background is much easier than physically changing the scene, moving the equipment and the people. But we believe it can introduce a bias that hurts the model’s accuracy on the validation dataset.

Experiment: We will use semantic segmentation to create new backgrounds for the dataset and train a model with distinct scenes.

Geometrical transformations are better than intensity transformations for data augmentation

If we analyze how humans understand a sign, it’s easy to notice that geometrical features (like hand position and shape) are fundamental for recognition. On the other hand, background, skin color, clothes, hair, and other accessories are not relevant to us. That’s why we think that geometrical transformations (rotation, zoom, shear) will work better than intensity transformations (brightness, channel inversion).

Our transformations used to augment data and the ranges of application. (Image by Author)

Experiment: We’ll test different data augmentation techniques individually.
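To make the comparison concrete, here is a hedged sketch of the two augmentation families using tf.keras’s ImageDataGenerator; the specific ranges are illustrative assumptions, not necessarily the ones listed in the table above.

```python
# Hedged sketch of the two augmentation families compared in this work.
# Ranges are illustrative; EfficientNet from keras.applications expects raw
# [0, 255] pixels, so no rescaling is applied here.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Geometric transformations: hand position and shape are what matter.
geometric_aug = ImageDataGenerator(
    rotation_range=15,       # degrees
    zoom_range=0.2,
    shear_range=10,          # degrees
)

# Intensity transformations: brightness and color shifts.
intensity_aug = ImageDataGenerator(
    brightness_range=(0.6, 1.4),
    channel_shift_range=50,
)

# Each generator then feeds an identical model for a fair comparison, e.g.:
# train_flow = geometric_aug.flow_from_directory("dataset/train", target_size=(224, 224))
```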

Model creation

The CNN model was based on EfficientNet-B0, due to its reduced number of parameters and good accuracy. After the feature extractor, we added a neural network to predict the sign. Every test was repeated three times with the same setup, taking the average of the results and comparing them statistically using analysis of variance (ANOVA) or Student’s t-test. The data were randomly split into 80% of the samples for training and 20% for testing. You can check the details and implementation in the Google Colab notebook.
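As a rough sketch of this setup (head size, dropout, and optimizer here are assumptions rather than the exact published configuration), the classifier can be built on top of EfficientNet-B0 as follows:

```python
# Rough sketch of the classifier: EfficientNet-B0 feature extractor plus a
# small dense head for the 14 signs. Head size, dropout, and optimizer are
# assumptions, not the exact published configuration.
import tensorflow as tf

def build_model(num_classes=14, input_shape=(224, 224, 3)):
    base = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet",
        input_shape=input_shape, pooling="avg")
    x = tf.keras.layers.Dropout(0.3)(base.output)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# 80/20 random split, as in the experiments (X: image array, y: integer labels).
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# model = build_model()
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30)
```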

Results

Data Augmentation

The next table shows the mean accuracy of each data augmentation technique tested individually.

Mean accuracy of three executions, considering the 20 FPS dataset. (Image by Author)

As the main goal of data augmentation is to improve the model’s invariance, we focused on the results of the validation set, where it’s clear that geometrical transformations performed considerably better at improving sign recognition.

Besides that, we noticed that data augmentation successfully reduced overfitting, as shown in the next image.

Accuracy without and with data augmentation in the 20 FPS dataset, showing how it mitigates train (orange) and test (blue) variance. (Image by Author)

The general accuracy decreased with data augmentation, due to the image invariance introduced, but the results improved considerably on the validation set. In addition, our hypothesis about geometrical transformations was confirmed, as highlighted in the table: they reached a higher accuracy than the intensity transformations.

Artificial background creation

The images below show the result of the semantic segmentation, based on DeepLabV3, used to change the background. The output resolution was 331×331 pixels due to the computational cost.

Sign execution with the new artificial background. (Image by Author)
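As a hedged illustration of this step, the snippet below swaps the background using a pretrained DeepLabV3 from torchvision (the exact model, weights, and post-processing in the paper may differ, and the API call assumes a recent torchvision release):

```python
# Hedged sketch of background replacement: segment the interpreter with a
# pretrained DeepLabV3 and composite them over a new scene. Uses torchvision
# for illustration; the paper's exact pipeline may differ.
import numpy as np
import torch
import torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.segmentation.deeplabv3_resnet101(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def replace_background(frame, scene):
    """Paste the segmented person from `frame` (PIL image) onto `scene` (PIL image)."""
    with torch.no_grad():
        out = model(preprocess(frame).unsqueeze(0))["out"][0]
    person_mask = (out.argmax(0) == 15).numpy()        # class 15 = person (Pascal VOC)
    scene = scene.resize(frame.size)
    composite = np.where(person_mask[..., None], np.array(frame), np.array(scene))
    return Image.fromarray(composite.astype(np.uint8))

# new_frame = replace_background(Image.open("sign_frame.jpg"), Image.open("new_scene.jpg"))
```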

We used 5 different scenes to replace the background of each sign. The next table shows the results.

Comparing average model accuracy with artificial background replacement ('background' suffix) and without it. (Image by Author)

To understand these results better, we used LIME, a tool that explains model predictions by highlighting the parts of the image that contributed most to the inference, as illustrated next.

Using LIME to explain model predictions in the validation set. (Image by Author)

The explanations suggest that the model focuses on the correct part of the image, using the position of the interpreters’ hands to infer the sign. This shows that the background is not biasing the results, which is why the replacement does not add relevant features and acts just like a color transformation.
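For reference, explanations like these can be generated with the lime package; the sketch below is a generic example for a Keras image classifier, with parameter values chosen for illustration rather than taken from the paper.

```python
# Generic sketch of producing a LIME explanation for a Keras image classifier.
# `model` is assumed to be the trained classifier and `image` an HxWx3 array
# with pixel values in [0, 255]; parameter values are illustrative.
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def explain_prediction(model, image):
    explainer = lime_image.LimeImageExplainer()
    explanation = explainer.explain_instance(
        image.astype("double"),
        lambda batch: model.predict(batch),   # LIME expects batched class probabilities
        top_labels=3,
        hide_color=0,
        num_samples=1000,
    )
    img, mask = explanation.get_image_and_mask(
        explanation.top_labels[0], positive_only=True,
        num_features=5, hide_rest=False)
    # Overlay the boundaries of the superpixels that drove the top prediction.
    return mark_boundaries(img / 255.0, mask)
```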

Frames per second comparison

At first, we studied the most suitable FPS at which to subsample the videos into training images, using the dataset recorded at 60 FPS. Table 5 shows the results.

Comparison between the subsamples of images. (Image by Author)

Remarkably, 60 FPS does not compensate for the computational resources required, since it obtained about 10% lower accuracy on the validation set. This is probably because this frame rate yields almost 2 and 3 times more images than 30 and 20 FPS, respectively (as figure 2 showed), which may contribute to overfitting and cause greater variance on the validation set. Besides that, consecutive frames of a video are similar to each other, providing little additional information.

The Student’s t-test reveals that 30 and 20 FPS have a significant difference on the test set, leading to the conclusion that it is the best choice for this situation. However, this may vary with the dataset size, since training takes longer at 30 FPS, and with the exploration of spatio-temporal features, which influences the amount of information that needs to be extracted from the video.
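For clarity, a comparison like this boils down to a two-sample t-test over the per-run accuracies; the snippet below uses SciPy with purely hypothetical numbers to show the mechanics.

```python
# Mechanics of the statistical comparison: a two-sample t-test on the per-run
# accuracies with SciPy. The numbers below are purely hypothetical.
from scipy import stats

acc_30fps = [0.93, 0.94, 0.92]   # accuracy of the three runs (illustrative)
acc_20fps = [0.95, 0.96, 0.95]

t_stat, p_value = stats.ttest_ind(acc_30fps, acc_20fps)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # p < 0.05 -> significant difference
```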

The last test involving frame rates compared the performance of the dataset captured at 30 FPS with the one captured at 60 FPS, as shown in table 6.

Comparison between the datasets captured at 60 FPS and 30 FPS. (Image by Author)

On the test set, the results favor the dataset captured at 60 FPS, as the greater number of images helps the model fit better during training (as shown in table 5). On the other hand, on the validation set there is no significant difference between the datasets captured at 60 or 30 FPS (with a p-value of 0.58). Another relevant fact is that the dataset captured at 30 FPS had fewer images than the one captured at 60 FPS and subsampled to 30, owing to a faster execution of the signs, a normal variation that depends on the interpreter and the situation. Therefore, instructing the interpreters to execute the signs slowly should help to further mitigate these accuracy differences between capture rates, mainly in a well-lit scene, where motion blur is less noticeable.

Thus, in an uncontrolled scenario, with varying lighting conditions and sign execution speeds, capturing the video at 60 FPS and resampling it to 30 FPS should be the best choice, yielding a large number of images while avoiding motion blur, with the drawback of requiring a better sensor and more storage space. In a well-controlled scenario, however, capturing at 30 FPS will produce satisfactory results.

Improving validation accuracy with multi-stream CNN

As a final test, we created a multi-stream CNN to capture local and global information in the image, as shown in the next figure.

Multi-CNN architecture for sign recognition. (Image by Author)

To locate the hands, we used EfficientDet as the object detector, passing the individual images to the feature extractor and then to a neural network. Our final results show an accuracy of 96% on the test set and 81% on the validation set, demonstrating that it’s possible to achieve good results and generalize to more complicated situations even with a simple training dataset.
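As a hedged sketch of the multi-stream idea (not the exact published architecture), the model below feeds the full frame and a hand crop, assumed to come from a separate detector step such as EfficientDet, through a shared EfficientNet-B0 backbone and concatenates the two feature vectors before classification:

```python
# Hedged sketch of a two-stream classifier (not the exact published model):
# one input is the full frame, the other a hand crop from an object detector.
# A single EfficientNet-B0 backbone is shared between the streams for brevity.
import tensorflow as tf

def build_two_stream_model(num_classes=14, input_shape=(224, 224, 3)):
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet",
        input_shape=input_shape, pooling="avg")

    full_frame = tf.keras.Input(shape=input_shape, name="full_frame")
    hand_crop = tf.keras.Input(shape=input_shape, name="hand_crop")

    global_feats = backbone(full_frame)      # global context
    local_feats = backbone(hand_crop)        # local hand information

    merged = tf.keras.layers.Concatenate()([global_feats, local_feats])
    merged = tf.keras.layers.Dense(256, activation="relu")(merged)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(merged)

    model = tf.keras.Model([full_frame, hand_crop], outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```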

Conclusions and final thoughts

In a few years, sign language recognition will become considerably easier thanks to new algorithms and technologies, but I hope this article has made clear the need for further studies on the efficiency of dataset creation, since every new translation system will need a huge amount of data behind the scenes.

We saw that with just a few interpreters, a simple recording setup, a single background, and the right data augmentation choices, it’s possible to generalize to real-world scenarios. In future work, a deeper analysis could be done, with more people and more signs, to test whether the same patterns observed here hold.

For further reading and concepts, please refer to the original paper. Thanks for reading!

