Case Study: Improving Object Detection Performance by Leveraging Synthetic Data

December 13, 2021
Real-world data is indispensable when training object detection models for autonomous vehicle (AV) perception systems. Unfortunately, the real world doesn’t always easily provide all the data needed for successful training. For example, classes such as cyclists and motorcyclists may occur less frequently than pedestrians and cars, making it difficult for perception models trained on real-world data to correctly detect them (Figure 1). Similarly, the most dangerous situations, such as accidents, may appear only in the last few percent of test-drive data.
Figure 1: Cyclists are often underrepresented in real-world datasets. This makes it difficult for a perception model trained only on real-world data to detect cyclists.
Even though underrepresented classes and long-tail events occur less frequently in the real world, object detection models still need to handle them as well as they handle more common classes and situations. In the past few years, perception teams have started to use synthetic data to help address some of these limitations of real-world datasets. A domain gap between real-world and synthetic data still exists and is important to acknowledge, but recent methods are narrowing it by combining improved synthetic data with new machine learning training strategies.

To demonstrate this use case, Applied Intuition’s perception team has conducted a case study that uses synthetic data generated by Spectral as a supplemental training resource to address a class imbalance found in a real-world dataset. The study shows that synthetic data may be used to help mitigate class imbalances and address areas where real-world data is limited.

Goal and Scope

This case study uses nuImages, a widely used dataset from Motional, as the baseline training dataset. In this dataset, the cyclist class occurs 170 times less frequently than more prominent classes such as cars and pedestrians (Figure 2).
Figure 2: The class distribution for five of the classes in the nuImages training set used in this case study. People and cars occur more frequently (together ~90% of instances across these five classes), while cyclists account for only 0.3%.
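The imbalance in Figure 2 can be quantified as each class's share of annotated instances and as how many times more often each class occurs than the minority class. The sketch below assumes per-class instance counts are already available; the counts in the example are made up for illustration and are not the actual nuImages statistics, and the function names are hypothetical.

```python
def class_shares(counts: dict[str, int]) -> dict[str, float]:
    """Fraction of all annotated instances that belong to each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def relative_frequency(counts: dict[str, int], minority: str) -> dict[str, float]:
    """How many times more often each class occurs than the minority class."""
    base = counts[minority]
    return {name: n / base for name, n in counts.items()}


# Hypothetical instance counts for illustration only (NOT real nuImages numbers).
toy_counts = {"car": 1700, "person": 1700, "truck": 300, "motorcycle": 80, "cyclist": 10}

shares = class_shares(toy_counts)
ratios = relative_frequency(toy_counts, minority="cyclist")
print(f"cyclist share: {shares['cyclist']:.2%}")  # 10 / 3790 ≈ 0.26%
print(f"car vs. cyclist: {ratios['car']:.0f}x")   # 1700 / 10 = 170x
```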
The study generates and uses a synthetic dataset to improve the perception algorithm’s object detection performance on cyclists while retaining or improving object detection performance on other classes. It also explores whether the use of synthetic data can reduce the amount of real-world data needed to improve the model’s object detection performance.

Implementation

The study consists of the following steps:
  1. Analyze a baseline model trained only on real-world data from the nuImages dataset.
  2. Generate labeled synthetic data that specifically targets the underrepresented cyclist class, so that the synthetic dataset contains many more cyclist examples.
  3. Use the above synthetic data as a supplemental training resource in addition to the nuImages data.

1. Baseline model analysis

First, the study measures how a perception model responds to the class imbalance in the real-world nuImages data. A Cascade Mask R-CNN model is trained on this dataset until convergence, and its object detection performance on the cyclist class is lower than on every other class (Figure 3).
Figure 3: Object detection performance of the baseline perception model trained on the nuImages data. Aggregate performance (bounding box, segmentation) and per-class performance (car, truck, cyclist, motorcycle, person) are reported as mean average precision (mAP), a standard measure of object detection accuracy, averaged over Intersection-over-Union (IoU) thresholds from 0.5 to 0.95. IoU measures how much a predicted region overlaps the ground truth.
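To make the metric in Figure 3 concrete, here is a simplified sketch of mAP averaged over IoU thresholds 0.50 to 0.95 in steps of 0.05, the COCO convention that this threshold range usually denotes (an assumption; the case study does not name its evaluation code). It uses greedy, score-ordered matching and all-point interpolation, and for brevity treats all boxes as belonging to one image. The official COCO evaluator additionally uses 101-point interpolation, per-image matching, detection limits, and crowd/area handling, so numbers from this sketch will not match it exactly. Segmentation mAP follows the same scheme with mask IoU in place of box IoU.

```python
def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, iou_thr):
    """AP for one class at one IoU threshold.

    preds: list of (score, box); gts: list of ground-truth boxes.
    Each prediction, in descending score order, is matched to the unmatched
    ground-truth box with the highest IoU and counts as a true positive only
    if that IoU is >= iou_thr.
    """
    if not gts:
        return 0.0
    matched = [False] * len(gts)
    tp = 0
    precisions, recalls = [], []
    for rank, (_, box) in enumerate(sorted(preds, key=lambda p: -p[0]), start=1):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            iou = box_iou(box, gt)
            if not matched[j] and iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched[best_j] = True
            tp += 1
        precisions.append(tp / rank)
        recalls.append(tp / len(gts))
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the precision-recall curve (all-point interpolation).
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# IoU thresholds 0.50, 0.55, ..., 0.95 (rounded to avoid float drift).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def ap_50_95(preds, gts):
    """AP averaged over the ten IoU thresholds."""
    return sum(average_precision(preds, gts, t) for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


def mean_ap(per_class):
    """mAP: mean of per-class AP@[0.5:0.95]; per_class maps name -> (preds, gts)."""
    return sum(ap_50_95(p, g) for p, g in per_class.values()) / len(per_class)


gt = [(0, 0, 10, 10)]
print(ap_50_95([(0.9, (0, 0, 10, 10))], gt))  # exact match → 1.0
print(ap_50_95([(0.9, (0, 0, 10, 6))], gt))   # IoU 0.6 passes 3 of 10 thresholds → 0.3
```

Averaging over a range of IoU thresholds rewards tight localization: a detection that overlaps the ground truth only loosely scores at the lower thresholds but not at the stricter ones.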

2. Synthetic data generation

Next, synthetic data is generated in Applied’s perception simulation tool Spectral to upsample the underrepresented cyclist class (Figure 4). This case study uses procedural 3D environment generation, automatic scenario creation, and a synthetic data generation pipeline to enable this process.
Figure 4: Class distribution of real-world (nuImages) and synthetic datasets. Cyclists (yellow) are upsampled in the synthetic dataset (right).
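One back-of-the-envelope way to size an upsampling set like the one in Figure 4 is to solve for how many extra minority-class instances lift that class to a target share of the combined data. The helper below is hypothetical and assumes the synthetic data adds only cyclist instances. The actual synthetic dataset in Figure 4 also contains other classes, and the study does not state a target share; the numbers in the example are toy values.

```python
import math


def synthetic_instances_needed(n_total: int, n_minority: int, target_share: float) -> int:
    """Minority-class instances to add so the class reaches `target_share` of all instances.

    Assumes only minority-class instances are added. Solving
        (n_minority + x) / (n_total + x) = target_share
    for x gives
        x = (target_share * n_total - n_minority) / (1 - target_share).
    """
    if not 0.0 <= target_share < 1.0:
        raise ValueError("target_share must be in [0, 1)")
    x = (target_share * n_total - n_minority) / (1.0 - target_share)
    return max(0, math.ceil(x))  # 0 if the class already meets the target


# Toy numbers (hypothetical): 10 cyclists among 3,790 instances, target share 10%.
print(synthetic_instances_needed(3790, 10, 0.10))  # 410, since 420 / 4200 = 10%
```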
Automatic scenario generation makes it possible to define realistic distributions for all scenario parameters and to sample from them to cover a wide variety of scenarios. The synthetic dataset used in this study combines three forms of “scenario” generation:
  1. Sequential scenario creation by defining individual actor behaviors
  2. Sequential scenario creation using traffic generators
  3. Non-sequential data frames using distributions and smart actors
Examples of each of these methods are shown in the following images, along with their ground truth data (Figure 5 a) - 5 c)).
Figure 5 a): Spectral synthetic images using sequential scenarios with per-actor definition.
Figure 5 b): Spectral synthetic images using sequential scenarios with traffic generators.
Figure 5 c): Spectral synthetic images using non-sequential frames with randomized smart actors.
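Non-sequential frame generation (method 3 above) amounts to sampling each frame's parameters independently from distributions. The sketch below is not Spectral's API: the parameter names, ranges, and distributions are hypothetical and chosen only to show the pattern. Sampling many independent frames from explicit distributions gives broad coverage, and guaranteeing at least one cyclist per frame upsamples the minority class.

```python
import random
from dataclasses import dataclass


@dataclass
class FrameParams:
    """Hypothetical parameters for one non-sequential synthetic frame."""
    hour_of_day: float                # local time, hours
    fog_density: float                # 0 = clear, 1 = densest fog
    cyclist_distances_m: list[float]  # distance of each cyclist from the ego vehicle


def sample_frame(rng: random.Random, max_cyclists: int = 4) -> FrameParams:
    """Draw one frame's parameters from simple, made-up distributions."""
    hour = rng.uniform(6.0, 20.0)              # daylight hours, uniform
    fog = rng.betavariate(1.0, 8.0)            # skewed toward clear weather
    n_cyclists = rng.randint(1, max_cyclists)  # every frame contains >= 1 cyclist
    # Log-normal distances (median about 12 m) cover both near and far cyclists.
    distances = [rng.lognormvariate(2.5, 0.6) for _ in range(n_cyclists)]
    return FrameParams(hour, fog, distances)


rng = random.Random(42)  # fixed seed so the dataset can be regenerated
frames = [sample_frame(rng) for _ in range(1000)]
print(sum(len(f.cyclist_distances_m) for f in frames), "cyclists in", len(frames), "frames")
```

Fixing the seed makes a generated dataset reproducible, which helps when comparing training runs.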

3. Perception model training with synthetic and real-world data

The above synthetic dataset is then used to improve model performance in the following experiments.

i) Mixed training experiment
Synthetic data and real-world data are combined into one large training dataset. Batches that contain both real-world and synthetic data are randomly sampled from this dataset during training. Two trials vary the ratio of synthetic to real-world data to explore whether shifting the mix toward more synthetic and less real-world data affects the model’s object detection performance:
  • Trial with a 0.5:1 ratio of synthetic to real-world data
  • Trial with a 1:1 ratio of synthetic to real-world data
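A minimal sketch of the mixed-training setup follows. It assumes the ratio refers to dataset sizes (synthetic samples per real sample) and that synthetic samples are drawn at random when more are available than needed; the study specifies neither detail, and the function names are hypothetical. In a PyTorch pipeline, a similar effect is commonly achieved by wrapping both datasets in `torch.utils.data.ConcatDataset` and using a `DataLoader` with `shuffle=True`.

```python
import random


def build_mixed_pool(num_real: int, num_synth: int, ratio: float, seed: int = 0):
    """Combine all real samples with ratio * num_real synthetic samples, shuffled.

    Returns a list of (source, index) pairs, where source is "real" or "synth".
    """
    rng = random.Random(seed)
    n_synth = min(num_synth, round(ratio * num_real))
    synth_ids = rng.sample(range(num_synth), n_synth)  # random subset of synthetic data
    pool = [("real", i) for i in range(num_real)] + [("synth", i) for i in synth_ids]
    rng.shuffle(pool)  # shuffling mixes both sources within each batch
    return pool


def iter_batches(pool, batch_size: int):
    """Yield consecutive batches from the shuffled pool."""
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]


for ratio in (0.5, 1.0):
    pool = build_mixed_pool(num_real=1000, num_synth=2000, ratio=ratio)
    n_synth = sum(src == "synth" for src, _ in pool)
    print(f"ratio {ratio}: {n_synth} synthetic / {len(pool) - n_synth} real")
```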
ii) Fine-tuning experiment
A model is trained to convergence on the synthetic dataset alone, using a small synthetic holdout set for validation. Three trials are then conducted to fine-tune the model on the following amounts of real-world data:
  • Fine-tuning without data ablation: 100% of the nuImages training set
  • Fine-tuning with data ablation: 70% of the nuImages training set
  • Fine-tuning with data ablation: 50% of the nuImages training set
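The fine-tuning protocol involves two data-handling steps: carving a small holdout from the synthetic set for validation, and drawing the 70% and 50% real-world subsets. The sketch below rests on assumptions the study does not state: subsets are drawn uniformly at random, and they are nested (the 50% subset is contained in the 70% subset). The holdout size, dataset sizes, and function names are hypothetical.

```python
import random


def split_holdout(n: int, holdout_frac: float, seed: int = 0):
    """Split indices 0..n-1 into (train, holdout), e.g. for the synthetic set."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    k = round(holdout_frac * n)
    return sorted(order[k:]), sorted(order[:k])


def nested_ablations(n: int, fractions, seed: int = 0):
    """Return {fraction: indices}; smaller subsets are contained in larger ones."""
    order = list(range(n))
    random.Random(seed).shuffle(order)  # one shared permutation for all fractions
    return {f: sorted(order[: round(f * n)]) for f in fractions}


synth_train, synth_holdout = split_holdout(5000, holdout_frac=0.05)
real_subsets = nested_ablations(1000, fractions=(1.0, 0.7, 0.5))
# Stage 1: train on synth_train until metrics on synth_holdout converge.
# Stage 2: fine-tune a copy of that model on each real_subsets[f] (100%, 70%, 50%).
print({f: len(idx) for f, idx in real_subsets.items()})
```

Nesting is a design choice: with independent draws, differences between the 70% and 50% trials could also reflect which images happened to be sampled, rather than only how many.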

Key Results

1. Quantitative results

Compared to the baseline model trained on 100% of the real-world data, mixing synthetic and real-world data (mixed training) improves performance on the cyclist class (Figure 6). Pre-training a model on synthetic data and then fine-tuning it on 100% of the real-world data (fine-tuning without data ablation) yields the largest improvement, consistently outperforming the baseline model on all classes (Figure 6).
Figure 6: Class-wise mAP scores. Mixed training and fine-tuning experiments improve the mAP scores on cyclists compared to the baseline with 100% of the real-world data, while improvements are limited on other classes.
Pre-training a model on synthetic data and then fine-tuning it on only 70% of the real-world data (fine-tuning with data ablation) still improves on the baseline trained with 100% of the real-world data, both on the cyclist class (Figure 6) and overall (Figure 7).
Figure 7: Mean average precision (mAP) scores from the fine-tuning experiment. The mAP score of fine-tuning with 70% of the real-world data (green) outperforms the baseline with 100% of the real-world data (blue).

2. Qualitative results

The case study suggests that synthetic data can help improve object detection performance in difficult cases. The following images show cases in which the baseline model fails to adequately detect a cyclist while the model pre-trained on synthetic data succeeds (Figures 8 a) - 8 c)).
Figure 8 a): The baseline model trained only on real-world data fails to detect a cyclist at a close distance from the ego vehicle (left). The model pre-trained on synthetic data successfully detects that cyclist (right).
Figure 8 b): The baseline model trained only on real-world data fails to detect a cyclist close to the ego vehicle (left). The model pre-trained on synthetic data successfully detects that cyclist (right).
Figure 8 c): The baseline model trained only on real-world data fails to detect a cyclist in the shade (left). The model pre-trained on synthetic data successfully detects that cyclist (right).

Commercial Applications

This case study provides an early indication that synthetic data is a useful supplement to real-world datasets when training AV perception models on both nominal and edge cases. Even though situations such as fallen objects, pedestrians or live animals on a highway, and poor visibility due to dense fog are rare, AVs must be prepared to navigate them safely. Synthetic data can help address class imbalances by improving a perception model’s object detection performance on minority classes such as cyclists. Synthetic data also provides a fast, cost-effective, and ethical way to generate training datasets and target rare or difficult cases when real-world data is too infrequent or dangerous to collect.
Contact our engineers to learn more about Applied’s synthetic datasets.