Stop Creating Bad DAGs – Optimize Your Airflow Environment By Improving Your Python Code

Valuable tips to reduce your DAGs’ parse time and save resources.
Photo by Dan Roizer on Unsplash

Apache Airflow is one of the most popular orchestration tools in the data field, powering workflows for companies worldwide. However, anyone who has already worked with Airflow in a production environment, especially in a complex one, knows that it can occasionally present some problems and weird bugs.

Among the many aspects you need to manage in an Airflow environment, one critical metric often flies under the radar: DAG parse time. Monitoring and optimizing parse time is essential to avoid performance bottlenecks and ensure the correct functioning of your orchestrations, as we’ll explore in this article.

That said, this tutorial aims to introduce [airflow-parse-bench](https://github.com/AlvaroCavalcante/airflow-parse-bench), an open-source tool I developed to help Data engineers monitor and optimize their Airflow environments, providing insights to reduce code complexity and parse time.

Why Parse Time Matters

In Airflow, DAG parse time is an often-overlooked metric. Parsing occurs every time Airflow processes your Python files to build the DAGs dynamically.

By default, all your DAGs are parsed every 30 seconds – a frequency controlled by the configuration variable min_file_process_interval. This means that every 30 seconds, all the Python code that's present in your dags folder is read, imported, and processed to generate DAG objects containing the tasks to be scheduled. Successfully processed files are then added to the DAG Bag.

Two key Airflow components handle this process:

  • The DagFileProcessorManager, which decides which Python files need to be processed and when.
  • The DagFileProcessorProcess, which parses each file and turns it into DAG objects.

Together, both components (commonly referred to as the DAG processor) are executed by the Airflow Scheduler, ensuring that your DAG objects are updated before being triggered. However, for scalability and security reasons, it is also possible to run your DAG processor as a separate component in your cluster.

If your environment only has a few dozen DAGs, it’s unlikely that the parsing process will cause any kind of problem. However, it’s common to find production environments with hundreds or even thousands of DAGs. In this case, if your parse time is too high, it can lead to:

  • Delayed DAG scheduling.
  • Increased resource utilization.
  • Environment heartbeat issues.
  • Scheduler failures.
  • Excessive CPU and memory usage, wasting resources.

Now, imagine having an environment with hundreds of DAGs containing unnecessarily complex parsing logic. Small inefficiencies can quickly turn into significant problems, affecting the stability and performance of your entire Airflow setup.

How to Write Better DAGs?

When writing Airflow DAGs, there are some important best practices to bear in mind to create optimized code. Although you can find a lot of tutorials on how to improve your DAGs, I’ll summarize some of the key principles that can significantly enhance your DAG performance.

Limit Top-Level Code

One of the most common causes of high DAG parsing times is inefficient or complex top-level code. Top-level code in an Airflow DAG file is executed every time the Scheduler parses the file. If this code includes resource-intensive operations, such as database queries, API calls, or dynamic task generation, it can significantly impact parsing performance.

The following code shows an example of a non-optimized DAG:
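A representative sketch of such a DAG is shown below; the API endpoint, DAG ID, and task names are illustrative placeholders rather than the original example.

import requests
import pandas as pd
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Top-level code: this runs on EVERY parse, not only when the DAG executes.
response = requests.get("https://example.com/api/items")  # hypothetical endpoint
df = pd.DataFrame(response.json())
items = df["item_id"].tolist()

with DAG("non_optimized_dag", start_date=datetime(2025, 1, 1), schedule_interval=None) as dag:
    # Dynamic task generation driven by the API call above, repeated at every parse.
    for item in items:
        PythonOperator(
            task_id=f"process_{item}",
            python_callable=lambda i=item: print(f"processing {i}"),
        )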

In this case, every time the file is parsed by the Scheduler, the top-level code is executed, making an API request and processing the DataFrame, which can significantly impact the parse time.

Another important factor contributing to slow parsing is top-level imports. Every library imported at the top level is loaded into memory during parsing, which can be time-consuming. To avoid this, you can move imports into functions or task definitions.

The following code shows a better version of the same DAG:
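Again, this is a sketch rather than the exact original snippet; the heavy imports and the API call are moved inside the task callable, so they only run at execution time.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_and_process():
    # Heavy imports and the API request now happen only when the task runs,
    # not every time the Scheduler parses this file.
    import requests
    import pandas as pd

    response = requests.get("https://example.com/api/items")  # hypothetical endpoint
    df = pd.DataFrame(response.json())
    print(df.head())


with DAG("optimized_dag", start_date=datetime(2025, 1, 1), schedule_interval=None) as dag:
    PythonOperator(task_id="fetch_and_process", python_callable=fetch_and_process)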

Avoid XComs and Variables in Top-Level Code

Still on the same topic, it's particularly important to avoid using XComs and Variables in your top-level code. As stated in Google's Cloud Composer documentation:

If you are using Variable.get() in top level code, every time the .py file is parsed, Airflow executes a Variable.get() which opens a session to the DB. This can dramatically slow down parse times.

To address this, consider using a JSON dictionary to retrieve multiple variables in a single database query, rather than making multiple Variable.get() calls. Alternatively, use Jinja templates, as variables retrieved this way are only processed during task execution, not during DAG parsing.
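A minimal sketch of both alternatives is shown below; the variable names are hypothetical.

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def use_config():
    # A single DB query retrieving a JSON dict, executed at run time instead of parse time.
    config = Variable.get("my_dag_config", deserialize_json=True)
    print(config["bucket_name"], config["api_url"])


with DAG("variables_example", start_date=datetime(2025, 1, 1), schedule_interval=None) as dag:
    PythonOperator(task_id="use_config", python_callable=use_config)

    # Jinja alternative: the template is only rendered when the task runs.
    BashOperator(
        task_id="use_jinja_variable",
        bash_command="echo {{ var.value.my_plain_variable }}",
    )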

Remove Unnecessary DAGs

Although it seems obvious, it’s always important to remember to periodically clean up unnecessary DAGs and files from your environment:

  • Remove unused DAGs: Check your dags folder and delete any files that are no longer needed.
  • Use .airflowignore: Specify the files Airflow should intentionally ignore, skipping parsing.
  • Review paused DAGs: Paused DAGs are still parsed by the Scheduler, consuming resources. If they are no longer required, consider removing or archiving them.

Change Airflow Configurations

Lastly, you can change some Airflow configurations to reduce the Scheduler's resource usage, as shown in the example after this list:

  • min_file_process_interval: This setting controls how often (in seconds) Airflow parses your DAG files. Increasing it from the default 30 seconds can reduce the Scheduler’s load at the cost of slower DAG updates.
  • dag_dir_list_interval: This determines how often (in seconds) Airflow scans the dags directory for new DAGs. If you deploy new DAGs infrequently, consider increasing this interval to reduce CPU usage.
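For reference, in recent Airflow versions both settings live in the [scheduler] section of airflow.cfg (or can be set through the equivalent environment variables); the values below are only illustrative.

[scheduler]
# Parse DAG files every 5 minutes instead of every 30 seconds
min_file_process_interval = 300

# Scan the dags folder for new files every 10 minutes
dag_dir_list_interval = 600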

How to Measure DAG Parse Time?

We’ve discussed a lot about the importance of creating optimized DAGs to maintain a healthy Airflow environment. But how do you actually measure the parse time of your DAGs? Fortunately, there are several ways to do this, depending on your Airflow deployment or operating system.

For example, if you have a Cloud Composer deployment, you can easily retrieve a DAG parse report by executing the following command with the gcloud CLI:

gcloud composer environments run $ENVIRONMENT_NAME \
  --location $LOCATION \
  dags report

While retrieving parse metrics is straightforward, measuring the effectiveness of your code optimizations can be less so. Every time you modify your code, you need to redeploy the updated Python file to your cloud provider, wait for the DAG to be parsed, and then extract a new report – a slow and time-consuming process.

Another possible approach, if you’re on Linux or Mac, is to run this command to measure the parse time locally on your machine:

time python airflow/example_dags/example.py

However, while simple, this approach is not practical for systematically measuring and comparing the parse times of multiple DAGs.

To address these challenges, I created the airflow-parse-bench, a Python library that simplifies measuring and comparing the parse times of your DAGs using Airflow’s native parse method.

Measuring and Comparing Your DAG’s Parse Times

The [airflow-parse-bench](https://github.com/AlvaroCavalcante/airflow-parse-bench) tool makes it easy to store parse times, compare results, and standardize comparisons across your DAGs.

Installing the Library

Before installation, it’s recommended to use a virtualenv to avoid library conflicts. Once set up, you can install the package by running the following command:

pip install airflow-parse-bench

Note: This command only installs the essential dependencies (related to Airflow and Airflow providers). You must manually install any additional libraries your DAGs depend on.

For example, if a DAG uses boto3 to interact with AWS, ensure that boto3 is installed in your environment. Otherwise, you’ll encounter parse errors.

After that, it’s necessary to initialize your Airflow database. This can be done by executing the following command:

airflow db init

In addition, if your DAGs use Airflow Variables, you must define them locally as well. However, it’s not necessary to put real values on your variables, as the actual values aren’t required for parsing purposes:

airflow variables set MY_VARIABLE 'ANY TEST VALUE'

Without this, you’ll encounter an error like:

error: 'Variable MY_VARIABLE does not exist'

Using the Tool

After installing the library, you can begin measuring parse times. For example, suppose you have a DAG file named dag_test.py containing the non-optimized DAG code used in the example above.

To measure its parse time, simply run:

airflow-parse-bench --path dag_test.py

This execution produces the following output:

Execution result. Image by author.

As observed, our DAG presented a parse time of 0.61 seconds. If I run the command again, I’ll see some small differences, as parse times can vary slightly across runs due to system and environmental factors:

Result of another execution of the same DAG. Image by author.

To get a more precise number, it's possible to aggregate multiple executions by specifying the number of iterations:

airflow-parse-bench --path dag_test.py --num-iterations 5

Although it takes a bit longer to finish, this calculates the average parse time across five executions.

Now, to evaluate the impact of the aforementioned optimizations, I replaced the code in my dag_test.py file with the optimized version shared earlier. After executing the same command, I got the following result:

Parse result of the optimized code. Image by author.

As shown, just applying some good practices reduced the DAG parse time by almost 0.5 seconds, highlighting the importance of the changes we made!

Further Exploring the Tool

There are a few other features of the tool that are worth sharing.

As a reminder, if you have any doubts or problems using the tool, you can access the complete documentation on GitHub.

Besides that, to view all the parameters supported by the library, simply run:

airflow-parse-bench --help

Testing Multiple DAGs

In most cases, you likely have dozens of DAGs whose parse times you want to test. To address this use case, I created a folder named dags and put four Python files inside it.

To measure the parse times for all the DAGs in a folder, just specify the folder path in the --path parameter:

airflow-parse-bench --path my_path/dags

Running this command produces a table summarizing the parse times for all the DAGs in the folder:

Testing the parse time of multiple DAGs. Image by author.

By default, the table is sorted from the fastest to the slowest DAG. However, you can reverse the order by using the --order parameter:

airflow-parse-bench --path my_path/dags --order desc
Inverted sorting order. Image by author.

Skipping Unchanged DAGs

The --skip-unchanged parameter can be especially useful during development. As the name suggests, this option skips the parse execution for DAGs that haven’t been modified since the last execution:

airflow-parse-bench --path my_path/dags --skip-unchanged

As shown below, when the DAGs remain unchanged, the output reflects no difference in parse times:

Output with no difference for unchanged files. Image by author.

Resetting the Database

All DAG information, including metrics and history, is stored in a local SQLite database. If you want to clear all stored data and start fresh, use the --reset-db flag:

airflow-parse-bench --path my_path/dags --reset-db

This command resets the database and processes the DAGs as if it were the first execution.

Conclusion

Parse time is an important metric for maintaining scalable and efficient Airflow environments, especially as your orchestration requirements become increasingly complex.

For this reason, the [airflow-parse-bench](https://github.com/AlvaroCavalcante/airflow-parse-bench) library can be an important tool for helping data engineers create better DAGs. By testing your DAGs' parse time locally, you can easily and quickly find your code bottlenecks, making your DAGs faster and more performant.

Since the code is executed locally, the measured parse time won't be exactly the same as the one in your Airflow cluster. However, if you can reduce the parse time on your local machine, the same improvement is likely to carry over to your cloud environment.

Finally, this project is open for collaboration! If you have suggestions, ideas, or improvements, feel free to contribute on GitHub.

References

maximize the benefits of Cloud Composer and reduce parse times | Google Cloud Blog

Optimize Cloud Composer via Better Airflow DAGs | Google Cloud Blog

Scheduler – Airflow Documentation

Best Practices – Airflow Documentation

GitHub – AlvaroCavalcante/airflow-parse-bench: Stop creating bad DAGs! Use this tool to measure and…

How to Optimize Object Detection Models for Specific Domains

Design better and faster models to solve your specific problem

Photo by paolo candelo on Unsplash

Object Detection is widely employed across different domains, from academia to industry sectors, thanks to its ability to provide great results at a low computational cost. However, despite the abundance of open-source architectures publicly available, most of these models are designed to address general-purpose problems and may not be a good fit for specific contexts.

As an example, we can mention the Common Objects in Context (COCO) dataset, which is typically used as a baseline for research in this field, influencing the hyperparameters and architectural details of the models. This dataset comprises 90 distinct classes under various lighting conditions, backgrounds, and sizes. It turns out that, sometimes, the detection problem you are facing is relatively simple. You may want to detect just a few distinct objects without many scene or size variations. In this case, if you train your model using a generic set of hyperparameters, you would likely end up with a model that incurs unnecessary computational costs.

With this perspective in mind, the primary goal of this article is to provide guidance on optimizing various object detection models for less complex tasks. I want to assist you in selecting a more efficient configuration that reduces computational costs without compromising the mean Average Precision (mAP).


Providing some context

One of the goals of my master’s degree was to develop a sign language recognition system with minimal computational requirements. A crucial component of this system is the preprocessing stage, which involves the detection of the interpreter’s hands and face, as depicted in the figure below:

Samples from the HFSL dataset that were created in this work. Image by author.

As illustrated, this problem is relatively straightforward, involving only two distinct classes and three concurrently appearing objects in the image. For this reason, my aim was to optimize the models’ hyperparameters to maintain a high mAP while reducing the computational cost, thus enabling efficient execution on edge devices such as smartphones.

Object detection architectures and setup

In this project, the following object detection architectures were tested: EfficientDetD0, Faster-RCNN, SSD320, SSD640, and YoloV7. However, the concepts presented here can be applied to adapt various other architectures.

For model development, I primarily utilized Python 3.8 and the TensorFlow framework, with the exception of YoloV7, where PyTorch was employed. While most examples provided here relate to TensorFlow, you can adapt these principles to your preferred framework.

In terms of hardware, the testing was conducted using an RTX 3060 GPU and an Intel Core i5–10400 CPU. All the source code and models are available on GitHub.

Fine-tuning of object detectors

When using TensorFlow for object detection, it’s essential to understand that all the hyperparameters are stored in a file named "pipeline.config". This protobuf file holds the configurations used to train and evaluate the model, and you’ll find it in any pre-trained model downloaded from TF Model Zoo, for instance. In this context, I will describe the modifications I’ve implemented in the pipeline files to optimize the object detectors.

It’s important to note that the hyperparameters provided here were specifically designed for hand and face detection (2 classes, 3 objects). Be sure to adapt them for your own problem domain.

General simplifications

The first change that can be applied to all models is reducing the maximum number of predictions per class and the number of generated bounding boxes from 100 to 2 and 4, respectively. You can achieve this by adjusting the "max_number_of_boxes" property inside the "train_config" object:

...
train_config {
  batch_size: 128
  sync_replicas: true
  optimizer { ... }
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  num_steps: 50000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 4 # <------------------ change this line
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "classification"
  fine_tune_checkpoint_version: V2
}
...

After that, change the "max_total_detections" and the "max_detections_per_class" that are inside the "post_processing" of the object detector:

post_processing {
  batch_non_max_suppression {
    score_threshold: 9.99999993922529e-09
    iou_threshold: 0.6000000238418579
    max_detections_per_class: 2 # <------------------ change this line
    max_total_detections: 4     # <------------------ change this line
    use_static_shapes: false
  }
  score_converter: SIGMOID
}

Those changes are important, especially in my case, as there are only three objects and two classes appearing in the image simultaneously. By decreasing the number of predictions, fewer iterations are required to eliminate overlapping bounding boxes through Non-maximum Suppression (NMS). Therefore, if you have a limited number of classes to detect and objects appearing in the scene, it could be a good idea to change this hyperparameter.
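To make the cost of this step more concrete, the sketch below shows the idea behind greedy NMS (a simplified version, not TensorFlow's implementation): the fewer candidate boxes enter the loop, the fewer IoU comparisons are required.

import numpy as np

def iou(box, boxes):
    # Boxes are [x1, y1, x2, y2]; returns the IoU of `box` against each row of `boxes`.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.6, max_detections=4):
    order = np.argsort(scores)[::-1]  # candidates sorted by confidence
    keep = []
    while order.size > 0 and len(keep) < max_detections:
        best = order[0]
        keep.append(best)
        # Discard remaining boxes that overlap too much with the best one.
        overlaps = iou(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps <= iou_threshold]
    return keep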

Additional adjustments were applied individually, taking into account the specific architectural details of each object detection model.

Single Shot Multibox Detector (SSD)

It’s always a good idea to test different resolutions when working with object detection. In this project, I utilized two versions of the model, SSD320 and SSD640, with input image resolutions of 320×320 and 640×640 pixels, respectively.

For both models, one of the primary modifications was to reduce the depth of the Feature Pyramid Network (FPN) from 5 to 4 by removing the most superficial layer. FPN is a powerful feature extraction mechanism that operates on multiple feature map sizes. However, for larger objects, the most superficial layer, designed for higher image resolutions, might not be necessary. That said, if the objects that you are trying to detect are not too small, it’s probably a good idea to remove this layer. To implement this change, adjust the "min_level" attribute from 3 to 4 within the "fpn" object:

...
feature_extractor {
  type: "ssd_mobilenet_v2_fpn_keras"
  depth_multiplier: 1.0
  min_depth: 16
  conv_hyperparams {
    regularizer { ... }
    initializer { ... }
    activation: RELU_6
    batch_norm {...}
  }
  use_depthwise: true
  override_base_feature_extractor_hyperparams: true
  fpn {
    min_level: 4 # <------------------ change this line
    max_level: 7
    additional_layer_depth: 108 # <------------------ change this line
  }
}
...

I also simplified the higher-resolution model (SSD640) by reducing the "additional_layer_depth" from 128 to 108. Likewise, I adjusted the "multiscale_anchor_generator" depth from 5 to 4 layers for both models, as shown below:

...
anchor_generator {
  multiscale_anchor_generator {
    min_level: 4 # <------------------ change this line
    max_level: 7
    anchor_scale: 4.0
    aspect_ratios: 1.0
    aspect_ratios: 2.0
    aspect_ratios: 0.5
    scales_per_octave: 2
  }
}
...

Finally, the network responsible for generating the bounding box predictions ("box_predictor") had the number of layers reduced from 4 to 3. Regarding SSD640, the box predictor depth was also decreased from 128 to 96, as shown below:

...
box_predictor {
  weight_shared_convolutional_box_predictor {
    conv_hyperparams {
      regularizer { ... }
      initializer { ... }
      activation: RELU_6
      batch_norm { ... }
    }
    depth: 96 # <------------------ change this line
    num_layers_before_predictor: 3 # <------------------ change this line
    kernel_size: 3
    class_prediction_bias_init: -4.599999904632568
    share_prediction_tower: true
    use_depthwise: true
  }
}
...

These simplifications were driven by the fact that we have a limited number of distinct classes with relatively straightforward patterns to detect. Therefore, it’s possible to reduce the number of layers and the depth of the model, since even with fewer feature maps we can still effectively extract the desired features from the images.

EfficientDet-D0

Concerning EfficientDet-D0, I reduced the depth of the Bidirectional Feature Pyramid Network (Bi-FPN) from 5 to 4. Additionally, I decreased the Bi-FPN iterations from 3 to 2 and feature map kernels from 64 to 48. Bi-FPN is a sophisticated technique of multi-scale feature fusion, which can yield excellent results. However, it comes at the cost of higher computational demands, which can be a waste of resources for simpler problems. To implement the aforementioned adjustments, simply update the attributes of the "bifpn" object as follows:

...
bifpn {
      min_level: 4 # <------------------ change this line
      max_level: 7
      num_iterations: 2 # <------------------ change this line
      num_filters: 48 # <------------------ change this line
    }
...

Besides that, it’s also important to reduce the depth of the "multiscale_anchor_generator" in the same manner as we did with SSD. Lastly, I reduced the layers of the box predictor network from 3 to 2:

...
box_predictor {
  weight_shared_convolutional_box_predictor {
    conv_hyperparams {
      regularizer { ... }
      initializer { ... }
      activation: SWISH
      batch_norm { ... }
      force_use_bias: true
    }
    depth: 64
    num_layers_before_predictor: 2 # <------------------ change this line
    kernel_size: 3
    class_prediction_bias_init: -4.599999904632568
    use_depthwise: true
  }
}
...

Faster R-CNN

The Faster R-CNN model relies on the Region Proposal Network (RPN) and anchor boxes as its primary techniques. Anchors are the central point of a sliding window that iterates over the last feature map of the backbone CNN. For each iteration, a classifier determines the probability of a proposal containing an object, while a regressor adjusts the bounding box coordinates. To ensure the detector is translation-invariant, it employs three different scales and three aspect ratios for the anchor boxes, which increases the number of proposals per iteration.

Although this is a simplified explanation, it's apparent that this model is considerably more complex than the others due to its two-stage detection process. However, it's possible to simplify it and enhance its speed while retaining its high accuracy.

To do so, the first important modification involves reducing the number of generated proposals from 300 to 50. This reduction is feasible because there are only a few objects present in the image simultaneously. You can implement this change by adjusting the "first_stage_max_proposals" property, as demonstrated below:

...
first_stage_box_predictor_conv_hyperparams {
  op: CONV
  regularizer { ... }
  initializer { ... }
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.7
first_stage_max_proposals: 50 # <------------------ change this line
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 14
maxpool_kernel_size: 2
maxpool_stride: 2
...

After that, I eliminated the largest anchor box scale (2.0) from the model. This change was made because the hands and face maintain a consistent size due to the interpreter’s fixed distance from the camera, and having large anchor boxes might not be useful for proposal generation. Additionally, I removed one of the aspect ratios of the anchor boxes, given that my objects have similar shapes with minimal variation in the dataset. These adjustments are visually represented below:

first_stage_anchor_generator {
  grid_anchor_generator {
    scales: [0.25, 0.5, 1.0] # <------------------ change this line
    aspect_ratios: [0.5, 1.0] # <------------------ change this line
    height_stride: 16
    width_stride: 16
  }
}
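As a rough illustration of what these settings mean (a simplified sketch, not the TF Object Detection API code), each scale and aspect-ratio pair produces one anchor per sliding-window position, so the reduced configuration above generates 3 x 2 = 6 anchors per position instead of the 3 x 3 = 9 of the setup described earlier.

from itertools import product

def anchors_per_position(scales, aspect_ratios, base_size=256):
    # One (width, height) pair for every scale/aspect-ratio combination at a single
    # sliding-window position; base_size is an arbitrary reference anchor size.
    anchors = []
    for scale, ratio in product(scales, aspect_ratios):
        width = base_size * scale * (ratio ** 0.5)
        height = base_size * scale / (ratio ** 0.5)
        anchors.append((round(width), round(height)))
    return anchors

# Reduced configuration from the snippet above: 3 scales x 2 aspect ratios = 6 anchors.
print(len(anchors_per_position([0.25, 0.5, 1.0], [0.5, 1.0])))        # 6
# Setup described earlier: 3 scales x 3 aspect ratios = 9 anchors.
print(len(anchors_per_position([0.25, 0.5, 1.0], [0.5, 1.0, 2.0])))   # 9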

That said, it’s crucial to consider the size and aspect ratios of your target objects. This consideration allows you to eliminate less useful anchor boxes and significantly decrease the computational cost of the model.

YoloV7

In contrast, minimal changes were applied to YoloV7 to preserve the architecture’s functionality. The main modification involved simplifying the CNN responsible for feature extraction, in both the backbone and the model’s head. To achieve this, I decreased the number of kernels/feature maps for nearly every convolutional layer, creating the following model:

backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [22, 3, 1]],  # 0
   [-1, 1, Conv, [44, 3, 2]],  # 1-P1/2      
   [-1, 1, Conv, [44, 3, 1]],
   [-1, 1, Conv, [89, 3, 2]],  # 3-P2/4  
   [-1, 1, Conv, [44, 1, 1]],
   [-2, 1, Conv, [44, 1, 1]],
   [-1, 1, Conv, [44, 3, 1]],
   [-1, 1, Conv, [44, 3, 1]],
   [-1, 1, Conv, [44, 3, 1]],
   [-1, 1, Conv, [44, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [179, 1, 1]],  # 11
   [-1, 1, MP, []],
   [-1, 1, Conv, [89, 1, 1]],
   [-3, 1, Conv, [89, 1, 1]],
   [-1, 1, Conv, [89, 3, 2]],
   [[-1, -3], 1, Concat, [1]],  # 16-P3/8  
   [-1, 1, Conv, [89, 1, 1]],
   [-2, 1, Conv, [89, 1, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [512, 1, 1]],  # 24
   [-1, 1, MP, []],
   [-1, 1, Conv, [89, 1, 1]],
   [-3, 1, Conv, [89, 1, 1]],
   [-1, 1, Conv, [89, 3, 2]],
   [[-1, -3], 1, Concat, [1]],  # 29-P4/16  
   [-1, 1, Conv, [89, 1, 1]],
   [-2, 1, Conv, [89, 1, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [716, 1, 1]],  # 37
   [-1, 1, MP, []],
   [-1, 1, Conv, [256, 1, 1]],
   [-3, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, -3], 1, Concat, [1]],  # 42-P5/32  
   [-1, 1, Conv, [128, 1, 1]],
   [-2, 1, Conv, [128, 1, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [716, 1, 1]],  # 50
  ]

# yolov7 head
head:
  [[-1, 1, SPPCSPC, [358]], # 51
   [-1, 1, Conv, [179, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [37, 1, Conv, [179, 1, 1]], # route backbone P4
   [[-1, -2], 1, Concat, [1]],
   [-1, 1, Conv, [179, 1, 1]],
   [-2, 1, Conv, [179, 1, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [[-1, -2, -3, -4, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [179, 1, 1]], # 63
   [-1, 1, Conv, [89, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [24, 1, Conv, [89, 1, 1]], # route backbone P3
   [[-1, -2], 1, Concat, [1]],
   [-1, 1, Conv, [89, 1, 1]],
   [-2, 1, Conv, [89, 1, 1]],
   [-1, 1, Conv, [44, 3, 1]],
   [-1, 1, Conv, [44, 3, 1]],
   [-1, 1, Conv, [44, 3, 1]],
   [-1, 1, Conv, [44, 3, 1]],
   [[-1, -2, -3, -4, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [89, 1, 1]], # 75
   [-1, 1, MP, []],
   [-1, 1, Conv, [89, 1, 1]],
   [-3, 1, Conv, [89, 1, 1]],
   [-1, 1, Conv, [89, 3, 2]],
   [[-1, -3, 63], 1, Concat, [1]],
   [-1, 1, Conv, [179, 1, 1]],
   [-2, 1, Conv, [179, 1, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [-1, 1, Conv, [89, 3, 1]],
   [[-1, -2, -3, -4, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [179, 1, 1]], # 88
   [-1, 1, MP, []],
   [-1, 1, Conv, [179, 1, 1]],
   [-3, 1, Conv, [179, 1, 1]],
   [-1, 1, Conv, [179, 3, 2]],
   [[-1, -3, 51], 1, Concat, [1]],
   [-1, 1, Conv, [179, 1, 1]],
   [-2, 1, Conv, [179, 1, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [[-1, -2, -3, -4, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [358, 1, 1]], # 101
   [75, 1, RepConv, [179, 3, 1]],
   [88, 1, RepConv, [358, 3, 1]],
   [101, 1, RepConv, [716, 3, 1]],
   [[102,103,104], 1, IDetect, [nc, anchors]],   # Detect(P3, P4, P5)
  ]

As discussed earlier, removing some layers and feature maps from the detectors is typically a good approach for simpler problems, since feature extractors are initially designed to detect dozens or even hundreds of classes in diverse scenarios, requiring a more robust model to address these complexities and ensure high accuracy.

With these adjustments, I decreased the number of parameters from 36.4 million to just 14.1 million, representing a reduction of approximately 61%. Furthermore, I used an input resolution of 512×512 pixels instead of the suggested 640×640 pixels in the original paper.

Additional tip

Another valuable tip when training object detectors is to use a k-means model for unsupervised adjustment of the anchor box proportions, fitting the widths and heights of the boxes to maximize the Intersection over Union (IoU) with the training set. By doing this, we can better adapt the anchors to the given problem domain, thereby improving model convergence by starting with adequate aspect ratios. The figure below exemplifies this process, comparing three anchor boxes used by default in the SSD algorithm (in red) next to three boxes with optimized proportions for the hand and face detection task (in green).

Comparing different bounding boxes' aspect ratios. Image by author.
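A possible sketch of this anchor fitting with scikit-learn is shown below. It uses plain Euclidean k-means on the normalized box widths and heights (the YOLO papers use an IoU-based distance, so treat this as an approximation), then reports the mean IoU between each ground-truth box and its closest anchor, assuming shared centers.

import numpy as np
from sklearn.cluster import KMeans

def fit_anchor_ratios(box_dims, n_anchors=3):
    """box_dims: array of shape (N, 2) with normalized (width, height) of ground-truth boxes."""
    kmeans = KMeans(n_clusters=n_anchors, n_init=10, random_state=42).fit(box_dims)
    anchors = kmeans.cluster_centers_  # (n_anchors, 2) anchor widths and heights

    # Mean IoU between each ground-truth box and its best-matching anchor,
    # treating the boxes as if they shared the same center.
    inter = np.minimum(box_dims[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(box_dims[:, None, 1], anchors[None, :, 1])
    union = (box_dims[:, 0] * box_dims[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    mean_iou = (inter / union).max(axis=1).mean()
    return anchors, mean_iou

# Example with random data standing in for real annotations:
dims = np.random.rand(500, 2)
anchors, mean_iou = fit_anchor_ratios(dims, n_anchors=3)
print(anchors, mean_iou)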

Showing the results

I trained and evaluated each detector using my own dataset, called the Hand and Face Sign Language (HFSL) dataset, considering the mAP and the Frames Per Second (FPS) as the main metrics. The table below provides a summary of the results, with values in parentheses representing the FPS of the detector before implementing any of the described optimizations.

Object detection results.

We can observe that most of the models showed a significant reduction in inference time while maintaining a high mAP across various levels of Intersection over Union (IoU). More complex architectures, such as Faster R-CNN and EfficientDet, increased the FPS on GPU by 200.80% and 231.78%, respectively. Even SSD-based architectures showed a huge increase in performance, with 280.23% and 159.59% improvements for the 640 and 320 versions, respectively. Considering YoloV7, although the FPS difference is most noticeable on the CPU, the optimized model has 61% fewer parameters, reducing memory requirements and making it more suitable for edge devices.

Conclusion

There are instances when computational resources are limited, or tasks must be executed quickly. In such scenarios, we can further optimize the open-source object detection models to find a combination of hyperparameters that can reduce the computational requirements without affecting the results, thereby offering a suitable solution for diverse problem domains.

I hope this article has assisted you in making better choices to train your object detectors, resulting in significant efficiency gains with minimal effort. If you didn’t understand some of the explained concepts, I recommend you dive deeper into how your object detection architecture works. Additionally, consider experimenting with different hyperparameter values to further streamline your models based on the specific problem you are addressing!

Auto-Labeling Tool for Object Detection

Stop wasting all of your time labeling datasets

Photo by Kyle Hinkson on Unsplash

Anyone who has worked with object detection knows that the labeling/annotation process is the hardest part. It isn't hard because it's complex, like training a model, but because the process is very tedious and time-consuming.

In my previous work with this technology, I had to deal with datasets containing thousands of images, or a few hundred images with dozens of objects each. In both situations, my only options were to spend days creating labels or to use a lot of human resources to do so.

Considering this annoying bottleneck, I've created a simple (yet effective) auto annotation tool to make this process easier. Although it doesn't completely replace the manual annotation process, it'll save you a lot of time. This article explains how the tool works and how you can use it to simplify your next object detection project!


How it works

The auto annotation tool is based on the idea of a semi-supervised architecture, where a model trained with a small amount of labeled data is used to produce the new labels for the rest of the dataset. In short, the library uses an initial, simplified object detection model to generate the XML files with the image annotations (in the PASCAL VOC format). This process can be illustrated by the following image:

Auto annotation process. (Image by author)
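For reference, each annotation generated in the PASCAL VOC format is an XML file roughly like the one below (folder, file name, sizes, and coordinates are illustrative):

<annotation>
  <folder>dataset_images</folder>
  <filename>dog_001.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
    <depth>3</depth>
  </size>
  <object>
    <name>dog</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>120</xmin>
      <ymin>80</ymin>
      <xmax>420</xmax>
      <ymax>460</ymax>
    </bndbox>
  </object>
</annotation>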

As a semi-supervised solution, it unfortunately can't avoid manual annotation entirely, but you'll only need to label a small portion of your data.

It's hard to determine how many images to label manually, as it depends on the complexity of your problem. If you want to detect dogs and cats and have 2,000 images in the dataset, for example, 200 images are probably enough (100 per class). On the other hand, if you have dozens of classes or objects that are hard to detect, you will need more manual annotations to see the benefits of the semi-supervised approach.

That said, there are some interesting advantages to spending at least some time annotating images manually. First of all, you'll take a closer look at the data, which could help you discover problems (e.g. objects too close to each other, or lighting conditions different than you expected) and determine the model's constraints.

Besides that, a reduced version of the dataset is often used for hyperparameter tuning and neural architecture search, so this might be a great moment to try to find the model configurations that best fit your problem.

That said, once you have labeled some images and trained an initial model, you’ll be able to use the auto annotation tool to speed up this process for the rest of your dataset!

Using the auto annotation tool

This project is completely open-source and available on GitHub. The code is written in Python and currently supports only TensorFlow models (PyTorch support is coming soon).

You can install the library using pip, as shown below:

$ pip install auto-annotate

It’s recommended to use a Python virtual environment to avoid any compatibility issues with your TensorFlow version. After the installation, you can use the library both from the command line or directly in your Python code. For both, you’ll have the same set of parameters:

  • saved_model_path: The path of the saved_model folder with the initial model.
  • label_map_path: The path of the label_map.pbtxt file.
  • imgs_path: The path of the folder with the dataset images to label.
  • xml_path (optional): Path to save the resulting XML files. The default behavior is to save in the same folder of the dataset images.
  • threshold: Confidence threshold to accept the detections made by the model. The default is 0.5.

Command line

The easier way to use the library is to call it from the command line. To do so, in your terminal, execute the following command with your own parameters:

python -m auto_annotate --label_map_path /example/label_map.pbtxt \
  --saved_model_path /example/saved_model \
  --imgs_path /example/dataset_images \
  --xml_path /example/dataset_labels \
  --threshold 0.65

Python code

If you would rather use the library directly in your Python code, you can do that as well.
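The exact entry point may change between versions, so check the project's README; a sketch of the call, using the same parameters as the command line, could look like this (the class and method names below are assumptions, not the confirmed API):

# NOTE: the class and method names are assumptions for illustration;
# refer to the project's README on GitHub for the actual entry point.
from auto_annotate import AutoAnnotate

ann_tool = AutoAnnotate(
    saved_model_path="/example/saved_model",
    label_map_path="/example/label_map.pbtxt",
    imgs_path="/example/dataset_images",
    xml_path="/example/dataset_labels",
    threshold=0.65,
)
ann_tool.generate_annotations()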

If everything worked correctly, you will see a progress bar of the dataset annotation. During the execution, you can use an annotation tool like LabelImg to open your images and auto-generated labels to verify if they are being generated as expected.

Post-action: Review the labels

As we already know, every Machine Learning model makes mistakes, and this is not different in this labeling process. If you retrain your initial model with the brand new labels generated by this tool, you’ll be assuming weak supervision, as you’ll find some noise in the labels/annotations.

Weak supervision has its own problems, and you'd like to avoid them if possible. That said, it's recommended to review the labels after the auto annotation process to find and fix wrong predictions. Again, this is a manual process, but reviewing and improving the quality of some labels is considerably faster than drawing the bounding boxes from scratch.

Furthermore, the quality of the generated predictions will depend on the accuracy of the initial model and the confidence threshold. If the confidence threshold is high (close to 1) the model will generate fewer incorrect predictions (false positives), and you’ll have to draw the boxes for the missing objects (false negatives).

On the other hand, a threshold near 0 will generate more incorrect predictions, but you will just need to erase and fix the skewed bounding boxes. The best confidence value will be a parameter to tune based on your problem and model performance.

After the review, you’ll be ready to retrain the initial model with the whole dataset.

Project extension: MLOps use case

After auto-annotating my whole dataset and training an awesome model I just finished my project, right? Well, maybe not…

Machine learning engineering and model productization are very important areas nowadays: a model in production needs to be monitored and improved, and this is no different for deep learning models like object detectors.

When you release an image-based project, the users will probably send images that are not so common in the training dataset. Consider the cat and dog detector, for example: you may not have many images of dogs on the beach in your dataset, but you may receive a lot of them during the summer vacations!

That said, a great use case of this project is to create an auto-annotation pipeline to constantly generate new labels from the images sent by the users in production. You could then integrate the new images and the generated annotations into an automatic training pipeline to retrain your Object Detection model every month.

With this approach, you ensure that your model stays up to date with user behavior and keeps performing well. Also, as your model becomes more robust, less manual validation will be required to annotate the new images.

Conclusion and final thoughts

As deep learning keeps advancing, with new state-of-the-art architectures and more computing power available, data is still a huge bottleneck in artificial intelligence.

To train an object detection model that is good enough to go to production, a lot of annotated data is necessary, and this can easily become the most expensive part of the project.

Given that fact, an auto annotation tool can be a great helper in this process, as it speeds up the human task by inferring the location and class of objects in the image.

Among other options, the auto annotation tool introduced in this article has the advantage of being free, open-source, and easy to use. Although the manual part is still necessary, this library has helped me with a lot of different projects so far, and I think it can also help other people!

A step further in the creation of a sign language translation system based on artificial intelligence

A strategy to bring accessibility at scale

Communication is fundamental in our society: people use it daily to express themselves and to access basic services like public transport, school, and health care. Sign language is used in every country by people with severe hearing loss, a condition that affects millions of people worldwide. The problem is that most hearing people do not know sign language, creating a barrier that makes social interaction difficult for deaf people.

To overcome this obstacle, we can use artificial intelligence techniques, like convolutional neural networks, to create a sign translation system that generates captions for the signs being performed.

Another interesting point is that although every country has its own sign language, a deep learning architecture can generalize well to problems within the same domain, requiring only some training and hyperparameter optimization.

However, there is one part of this process that is expensive, time-consuming, and repetitive for every sign language in the world: dataset creation.

Imagine that someone created a state-of-the-art architecture that recognizes signs and generates captions with very high accuracy in American Sign Language (ASL). To implement this solution in practice and bring accessibility to the real world, scientists in every country that doesn't use ASL would need to create a huge dataset (with the most common words used daily, for example) to retrain the network. So, it's clear that one of the main bottlenecks is dataset creation!

Based on this, I'll explore some findings on how to create a sign language dataset more effectively to train a high-accuracy model, serving as a guide for future work.

This article is based on my paper called: Efficient sign language recognition system and dataset creation method based on deep learning and image processing.

Experimental Dataset

The main idea in this work is to create a sign language recognition system based on a cheap dataset, helping future works that will need to do that.

But what is a cheap dataset?

In my opinion, it’s a dataset that uses a simple sensor, like an RGB camera, few interpreters, and the same background in the recordings.

That's why we created a dataset recording the videos with two different smartphones, against the same standard background, and with two interpreters, making for a simple and easy setup.

Another question was which frame rate (FPS) to use when capturing the recordings or subsampling the images, since it can lead to different results in the final performance. For this reason, we created two datasets using the same procedure, where the first was recorded at 60 FPS and the second one at 30 FPS.

Furthermore, we also subsampled the first dataset to 30 and 20 FPS and the second to 20 FPS, to test whether we could reduce the number of images after recording without impacting the results.

In the end, we recorded 14 signs used daily by deaf people, repeating each sign three times with some variation between the executions. The dataset can be found on Kaggle.

List of words considered in our dataset.

To prove the model efficiency, we created a final validation dataset, considering the same 14 signs but recording over different backgrounds and light conditions, trying to reproduce a real-world scenario.

Train and test images, using the same background and light condition. (Image by Author)
Validation set, using different backgrounds and light conditions. (Image by Author)

Hypotheses and experiments

These are the main hypotheses of this work, based on questions that emerged during the research, along with the experiments we ran to answer them.

60 FPS may be better than 30 FPS, since it reduces motion blur.

Producing a sign requires movement, and movement can cause blur in video recordings. In this context, a higher frame rate should reduce the blur, which could improve model accuracy.

Experiment: We will compare the accuracy of the datasets recorded at 30 and 60 FPS.

Artificial background creation may improve the generalization of the model

Recording all signs against the same static background is much easier than programmatically changing the scene or moving the equipment and the people. However, we believed this could introduce a bias that would impact the model's accuracy on the validation dataset.

Experiment: We will use semantic segmentation to create new dataset backgrounds and train a model with distinct scenes.

Geometrical transformations are better than intensity transformations for data augmentation

If we analyze how humans understand a sign, it's easy to note that geometrical features (like hand position and shape) are fundamental for recognition. On the other hand, background, skin color, clothes, hair, and other accessories are not relevant for us. That's why we think that geometrical transformations (rotation, zoom, shear) will be better than intensity transformations (brightness, channel inversion).

Our transformations used to augment data and the ranges of application. (Image by Author)

Experiment: We’ll test different data augmentation techniques individually.
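As an illustration of how such transformations can be applied in TensorFlow, the sketch below builds one generator with geometrical transformations and one with intensity transformations; the ranges shown are illustrative, not the exact values we used.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Geometrical transformations (rotation, zoom, shear, shifts), with illustrative ranges.
geometric_augmenter = ImageDataGenerator(
    rotation_range=15,
    zoom_range=0.2,
    shear_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
)

# Intensity transformations (brightness, channel shifts) for comparison.
intensity_augmenter = ImageDataGenerator(
    brightness_range=(0.7, 1.3),
    channel_shift_range=50.0,
)

# Each generator can then feed model.fit, for example:
# train_gen = geometric_augmenter.flow_from_directory("dataset/train", target_size=(224, 224))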

Model creation

The CNN model was based on EfficientNet-B0 due to its reduced number of parameters and good accuracy. After the feature extractor, we created a neural network to predict the sign. Every test was repeated 3 times in the same setup, capturing the average results and comparing them statistically using analysis of variance (ANOVA) or Student's t-test. The data were randomly split into 80% of the samples for training and 20% for testing. You can check the details and implementation in the Google Colab.

Results

Data Augmentation

The next table shows the mean accuracy of each data augmentation technique tested individually.

Mean accuracy of three executions, considering the 20 FPS dataset. (Image by Author)

As the main goal of data augmentation is to improve the model's invariance, we focused our attention on the results of the validation set, where it's clear that geometrical transformations performed considerably better at improving sign recognition.

Besides that, we noticed that data augmentation successfully reduced overfitting, as shown in the next image.

Accuracy without and with data augmentation in the 20 FPS dataset, showing how it mitigates train (orange) and test (blue) variance. (Image by Author)

The overall accuracy decreased with data augmentation, due to the image invariance introduced, but the results improved considerably in the validation set. In addition, our hypothesis about geometrical transformations was confirmed, as highlighted in the table below, with accuracy higher than that of the intensity transformations.

Artificial background creation

The images below show the result of semantic segmentation, based on DeepLabV3, used to change the background. The resulting resolution was 331×331 pixels due to the computational cost.

Sign execution with the new artificial background. (Image by Author)

We used 5 different scenes to replace the background of each sign. The next table shows the results.

Comparing average model accuracy with artificial background replacement ('background' suffix) and without it. (Image by Author)

To understand more about these results, we used a tool to explain the model predictions called LIME, highlighting the parts of the image that contributed more to the inference, as illustrated next.

Using LIME to explain model predictions in the validation set. (Image by Author)

The explanations suggest that the model focuses on the correct part of the image, considering the position of the interpreters' hands to infer the sign. This shows that the background is not biasing the results, which is why the replacement does not add relevant features, acting just like a color transformation.

Frames per second comparison

At first, we studied the most suitable FPS for subsampling the videos into images to train the model, using the recordings made at 60 FPS. Table 5 shows the results.

Comparison between the subsamples of images. (Image by Author)

Remarkably, 60 FPS does not compensate for the computational resources required, since it obtained about 10% lower accuracy in the validation set. This is probably because this frame rate has almost 2 and 3 times more images than 30 and 20 FPS, respectively, as figure 2 showed, which may contribute to overfitting, causing greater variance in the validation set. Besides that, consecutive frames of a video are similar to each other, providing little additional information.

The Student's t-test reveals that 30 and 20 FPS have a significant difference in the test set, leading to the conclusion that it is the best choice for this situation. However, this may vary depending on the dataset size, since the training time is longer at 30 FPS, and on the exploration of spatio-temporal features, which influences the amount of information that needs to be extracted from the video.

The last test involving frame rates was to compare the performance between a dataset captured at 30 FPS and another one captured at 60 FPS, shown in table 6.

Comparison between the datasets captured at 60 FPS and 30 FPS. (Image by Author)

In the test set, the results favor the dataset captured at 60 FPS, as the greater number of images helps the model fit better during training (as shown in table 5). On the other hand, in the validation set there is no significant difference between the datasets captured at 60 and 30 FPS (with a p-value of 0.58). Another relevant fact is that the dataset captured at 30 FPS had fewer images than the one captured at 60 FPS and subsampled to 30, owing to the faster execution of the signs, which is a normal variation depending on the interpreter and the situation. Therefore, instructing the interpreters to execute the signs slowly should help further mitigate these accuracy differences between capture rates, mainly in a well-lit scene, where motion blur is less noticeable.

Thus, in an uncontrolled scenario, with different lighting conditions and sign execution speeds by the interpreters, capturing the video at 60 FPS and resampling it to 30 FPS should be the best choice, yielding a large number of images while avoiding motion blur, with the drawback of requiring a better sensor and more storage space. In a well-controlled scenario, however, capturing at 30 FPS will produce satisfactory results.

Improving validation accuracy with multi-stream CNN

As a final test, we created a multi-stream CNN to capture local and global information in the image, as shown in the next figure.

Multi-CNN architecture for sign recognition. (Image by Author)

To segment the hands, we used EfficientDet as the object detector, passing the individual images to the feature extractor and then to a neural network. Our final results show an accuracy of 96% on the test set and 81% on the validation set, demonstrating that it's possible to achieve good results and generalize to more complicated situations even with a simple training dataset.
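A rough sketch of such a multi-stream model in Keras is shown below, with one input for the full frame and one for the hand crop produced by the detector; the shared backbone, input sizes, and layer widths are simplifications, not the exact architecture from the paper.

from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB0

# A single EfficientNet-B0 backbone shared by both streams (weight sharing keeps the
# sketch small; the paper's architecture may use separate feature extractors).
backbone = EfficientNetB0(include_top=False, pooling="avg", input_shape=(224, 224, 3))

frame_in = layers.Input(shape=(224, 224, 3), name="frame")      # global context
hand_in = layers.Input(shape=(224, 224, 3), name="hand_crop")   # local detail, resized crop

frame_feat = backbone(frame_in)
hand_feat = backbone(hand_in)

merged = layers.Concatenate()([frame_feat, hand_feat])
x = layers.Dense(256, activation="relu")(merged)
outputs = layers.Dense(14, activation="softmax")(x)  # 14 signs in the dataset

model = Model(inputs=[frame_in, hand_in], outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])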

Conclusions and final thoughts

In a few years, sign language recognition will be considerably easier for machine learning, thanks to new algorithms and technologies, but I hope this work has made clear the need for further studies on the efficiency of dataset creation, since every new translation system will need a huge amount of data behind the scenes.

We saw that with just a few interpreters, a simple recording setup, the same background, and the right data augmentation choices, it's possible to generalize to real-world scenarios. In future work, a deeper analysis could be done, with more people and signs, to test whether the same patterns observed here hold.

For further reading and concepts, please refer to the original paper. Thanks for reading!

TensorFlow Semi-Supervised Object Detection Architecture

An easy way to auto-label your images while testing the overall model performance.

Photo by Tobias Keller on Unsplash

Object Detection is one of the most popular computer vision methods nowadays. The intention is not only to determine whether an object is present in the image, as in most common classification problems, but also to point out the location of these objects of interest, making it the necessary approach for situations where multiple objects may appear simultaneously in the image.

One of the challenges of this method is creating the dataset, since it's necessary to manually mark the positions of all objects in the image, which takes a lot of time over a large number of observations.

This process is inefficient, expensive, and time-consuming, especially in problems that require labeling dozens of objects in each image or demand specialized knowledge.

Based on this, I created the TensorFlow Semi-Supervised Object Detection Architecture (TSODA) to iteratively train an object detection model and use it to automatically label new images based on a confidence threshold, aggregating them into the next training cycle.

In this article, I’ll show you the necessary steps to reproduce this approach in your object detection project. With this, you’ll be able to create labels in your images automatically while measuring the model performance!

Table of contents:

  1. How TSODA Works
  2. Example Application
  3. Implementation
  4. Results
  5. Conclusion

How TSODA works

It works like any other semi-supervised method: training is done with both labeled and unlabeled data, unlike the more common supervised approach.

An initial model is trained on strongly labeled data created by hand, learns some features from it, and then runs inference on the unlabeled data so that the newly labeled images can be aggregated into a new training process.

The whole idea can be illustrated by the following image:

(source: Author)

This operation is repeated until the stop criterion is reached: either a maximum number of executions or no remaining unlabeled data.

As we saw in the schema, a confidence threshold of 80% was initially configured. This is an important parameter, since the new images will be used in a new training process, and if incorrectly labeled they could create undesirable noise, undermining the model's performance.
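Conceptually, the whole TSODA cycle boils down to a short self-training loop. The sketch below only illustrates the idea; the callables (train_fn, predict_fn, annotate_fn) are placeholders for the actual training, inference, and XML-generation code described later.

```python
def tsoda_loop(labeled, unlabeled, train_fn, predict_fn, annotate_fn,
               confidence_threshold=0.8, max_iterations=10):
    """Self-training loop: train, pseudo-label confident images, repeat."""
    for _ in range(max_iterations):
        if not unlabeled:
            break                                  # stop criterion: nothing left to label
        model = train_fn(labeled)                  # train on everything labeled so far
        still_unlabeled = []
        for image in unlabeled:
            detections = predict_fn(model, image)  # list of (class_name, score, box)
            confident = [d for d in detections if d[1] >= confidence_threshold]
            if confident:
                labeled.append(annotate_fn(image, confident))  # e.g. write a VOC XML
            else:
                still_unlabeled.append(image)      # try again in the next iteration
        unlabeled = still_unlabeled
    return labeled, unlabeled
```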

The purpose of TSODA is to introduce a simple and fast way to use semi-supervised learning in your object detection project.

Example Application

To exemplify the approach and test whether everything is working properly, a random sample of 1,100 images from the Asirra dataset was taken, in a proportion of 50% per class.

The images were labeled manually for later comparison; you can download the same data on Kaggle.

I used the Single Shot MultiBox Detector (SSD) as the object detection architecture and Inception as the base network, instead of VGG-16 as in the original paper.

SSD and Inception offer a good trade-off between training speed and accuracy, so I think they're a great starting point, mainly because in each iteration TSODA needs to save a checkpoint of the trained model, infer new images, and load the model to train it again; faster training makes it possible to iterate more and aggregate these images into the learning process.

Testing performance

To test TSODA's performance, just 100 labeled images of each class were provided, to be split into training and test sets, while 900 were left unlabeled, simulating a situation where only a little time was spent creating the labeled dataset. The results were compared to a model trained with all the manually labeled images.

The data were randomly split into 80% of images for training and 20% for testing.

Implementation

As the name suggests, the whole architecture is built in the TensorFlow environment, version 2.x.

This TF version is not yet fully compatible with the Object Detection API, and some parts were difficult to adapt, but in the coming months it will become the default and most widely used version of TF in all projects, which is why I think it's important to adapt the code to use it.

To create TSODA, new scripts and folders were added in a fork of the TF Model Garden repository, so you can easily clone it and, with just small modifications, run your own semi-supervised project; it's also a familiar structure for those who work with TF.

You can clone my repository to easily follow these steps or adapt your TF model repository.

The work was done inside models/research/object_detection, where you will find the following folders and files:

  • inference_from_model.py: This file will be executed to use the model to infer new images.
  • generate_xml.py and generate_tfrecord.py: Will both be used to create the train and test TFRecords used in training the object detection model (these scripts are adapted from the raccoon dataset).
  • test_images and train_images folders: Contain the JPG images and XML files that will be used.
  • unlabeled_images and labeled_images folders: Contain, respectively, all images without labels and the images automatically labeled by the algorithm, which are divided into training and test folders to keep the split ratio.

Inside the utils folder we also have:

  • generate_xml.py: This script is responsible for taking the model's inferences and generating a new XML file, which is stored inside the labeled_images folder.
  • visualization_utils.py: This file also has some modifications to capture the model's inferences and pass them to the "generateXml" class.

That’s it, this is all you need to have in your repository!

Preparing Environment

To run this project, you will need practically nothing!

The training process runs in a Google Colab notebook, so it's fast and simple to train your model: you literally just need to replace my images with yours and choose another base model if you want.

Make a copy of the original Colab Notebook to your Google Drive and execute it.

If you really want to run TSODA on your machine, at the beginning of the Jupyter notebook you'll see the installation requirements; just follow them, but don't forget to also install TF 2.x. I recommend creating a virtual environment.

Understanding the code

The inference_from_model.py script is responsible for loading the saved_model.pb created during training and using it to run new inferences on the unlabeled images. Most of the code was taken from object_detection_tutorial.ipynb, found in the colab_tutorials folder.

If you don’t want to use Colab for training you’ll need to replace the paths at the beginning of the file.

Another important method in this file is partition_data, which is responsible for splitting the inferred images (placed in the labeled_images folder) into training and test sets to keep the same ratio.

A change you may want to make is the split ratio: in my case, I chose an 80/20 proportion, but if you want something different, you can set it in the method's parameter.
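A minimal version of such a split could look like the sketch below, assuming each JPG has an XML annotation with the same base name. This is a simplified illustration, not the exact partition_data implementation:

```python
import os
import random
import shutil

def partition_data(labeled_dir, train_dir, test_dir, train_ratio=0.8, seed=42):
    """Move image/XML pairs from labeled_dir into train and test folders."""
    images = [f for f in os.listdir(labeled_dir) if f.lower().endswith(".jpg")]
    random.Random(seed).shuffle(images)
    cut = int(len(images) * train_ratio)
    for i, image_name in enumerate(images):
        target = train_dir if i < cut else test_dir
        xml_name = os.path.splitext(image_name)[0] + ".xml"
        shutil.move(os.path.join(labeled_dir, image_name), os.path.join(target, image_name))
        shutil.move(os.path.join(labeled_dir, xml_name), os.path.join(target, xml_name))
```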

The visualization_utils.py file is where the bounding boxes are drawn onto the image, so we use it to get the boxes' positions, class name, and file name, and pass them to our XML generator. The following code shows most of the process:
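A simplified sketch of that hook (not the original gist) looks like this; category_index is the usual label-map dictionary from the Object Detection API, and the returned list is what gets handed to generate_xml.py:

```python
# Simplified sketch: collect every detection above the threshold so it can be
# handed to the XML generator in generate_xml.py.
def collect_confident_detections(boxes, classes, scores, category_index,
                                 min_score_thresh=0.8):
    detections = []
    for box, cls, score in zip(boxes, classes, scores):
        if score >= min_score_thresh:
            ymin, xmin, ymax, xmax = box  # normalized coordinates from the detector
            detections.append({
                "class": category_index[int(cls)]["name"],
                "score": float(score),
                "box": (xmin, ymin, xmax, ymax),
            })
    return detections  # an empty list means the image stays in unlabeled_images
```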

An XML file is generated if a box is detected in the image with a confidence level higher than the specified threshold.

All this information arrives in generate_xml.py, and the XML is created using ElementTree.
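For reference, building a Pascal VOC-style annotation with ElementTree goes roughly like this. It's a minimal sketch; the real generate_xml.py handles more fields and the train/test partitioning:

```python
import xml.etree.ElementTree as ET

def build_voc_xml(file_name, width, height, detections, output_path):
    """Write a minimal Pascal VOC-style XML for the given detections."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = file_name
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"

    for det in detections:  # det: {"class": str, "box": (xmin, ymin, xmax, ymax)} in pixels
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = det["class"]
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), det["box"]):
            ET.SubElement(bndbox, tag).text = str(int(value))

    ET.ElementTree(root).write(output_path)
```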

Inside the code, there are comments that will help you to understand how everything is working.

Results

To evaluate the model's performance, the mean Average Precision (mAP) was used; if you have any doubts about how it works, check out this.
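As a quick refresher (my summary, not part of the linked explanation): the AP of each class is the area under its precision-recall curve, and the mAP is simply the average over all classes:

```latex
\mathrm{AP} = \int_{0}^{1} p(r)\,dr
\qquad
\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i
```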

The first test trained a model for 4,000 epochs using all the strongly labeled images.

The training took about twenty-one minutes and the results are shown in Table 1.

Table 1: mAP using all images correctly labeled for training and test. (source: Author)

As expected, the model got a high mAP, mainly at lower IoU thresholds.
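For context, the IoU between a predicted and a ground-truth box, both given as (xmin, ymin, xmax, ymax), can be computed with the small helper below; this is a generic sketch rather than code from the repository:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```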

The second test used the same configurations but ran TSODA with just 100 labeled images. In each iteration, the model was trained for 1,000 epochs and then used to infer and create new labeled images. The results are shown in Figure 2.

Figure 2: Model convergence in TSODA. (source: Author)

The whole training process took thirty-eight minutes, about seventeen minutes more than the previous one, and the model reached a worse final mAP, as shown in Table 2:

Table 2: Final mAP in the second test (TSODA). (source: Author)

As Table 3 reveals, most images were successfully annotated in the first iteration and aggregated into the training set. This could mean that the minimum confidence threshold isn't high enough, since during the first 1,000 training epochs the model hasn't converged properly yet, possibly creating wrong annotations.

Table 3: Number of remaining unlabeled images per iteration. (source: Author)

TSODA requires more time and epochs to improve model performance and get close to the original method. This happens because adding new images to the training set initially leads to a drop in mAP, since the model needs to learn how to generalize to new patterns, as shown in Figure 2, where the mAP decreases as new images are included and only starts increasing again once the model learns the new features.

Figure 3 shows some examples of automatically annotated images. Notably, some labels are not tightly fitted, but they are enough to provide additional information to the model.

Figure 3: Samples of auto-annotated images. As seen, the labels could fit the objects more tightly if drawn by a human. (source: Author)

Some new experiments were performed considering a different epoch-increment behavior as well as a higher confidence threshold. The results are presented in Table 4:

Table 4: Results using the second configuration. (source: Author)

Setting the confidence threshold to 90% increases the chance that predicted labels are correct, which is an important factor for model convergence. In addition, training in the initial iteration ran for 2,500 epochs instead of just 1,000, since the first iteration is where most images get labeled and the model needs to learn enough features to exceed the higher confidence threshold. After the first iteration, each subsequent one adds 1,500 epochs, up to a limit of 8,500. These new configurations improved the final results.
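One possible reading of that schedule, expressed as a small helper (values taken from the paragraph above, names purely illustrative):

```python
# Illustrative epoch schedule for the second configuration.
CONFIDENCE_THRESHOLD = 0.9
INITIAL_EPOCHS = 2500
EPOCH_INCREMENT = 1500
MAX_EPOCHS = 8500

def epochs_for_iteration(iteration):
    """Epochs to train at a given TSODA iteration (0-based)."""
    return min(INITIAL_EPOCHS + iteration * EPOCH_INCREMENT, MAX_EPOCHS)

# epochs_for_iteration(0) -> 2500, epochs_for_iteration(1) -> 4000, capped at 8500.
```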

TSODA may perform differently depending on the kind of object of interest and its complexity. The results could be improved by training for more epochs or setting a higher confidence threshold, with the drawback of increasing training time. The epoch increment per iteration should also change depending on the problem, to control model convergence based on the number of unlabeled images and the threshold.

Nevertheless, this is a good alternative, since training time is cheaper than the manual labeling time of a human, and TSODA was built so that, with just a few modifications, it's possible to train a completely new large-scale model from scratch.

The auto-created labels could also be manually adjusted in some images, which can improve the overall performance and is faster than creating all the labels manually.

Conclusion

The proposed TSODA can achieve satisfactory results in creating new labels for unlabeled images, reaching results similar to a strongly labeled training approach but with considerably less human effort. The solution is also adaptable to any other CNN detector architecture and is easy and fast to implement, helping the dataset creation process while measuring the overall object detector performance.

References

For more details and context about this semi-supervised project, see my preprint.

The post TensorFlow Semi-Supervised Object Detection Architecture appeared first on Towards Data Science.

]]>