Training With Synthetic Lidar Data to Improve Perception Model Performance: Takeaways from Automotive LIDAR 2022

September 30, 2022

At this year’s virtual Automotive LIDAR conference, Applied Intuition gave a presentation on training perception models for advanced driver-assistance systems (ADAS) and autonomous vehicles (AVs) with synthetic lidar data. Our presentation explained the challenges of developing and validating lidar-based perception algorithms, and how autonomy programs can generate synthetic training data to use alongside real lidar data, improving model performance while lowering costs and shortening time to market. The following blog post summarizes key takeaways from our presentation for those who could not attend.

Challenges With Lidar Model Development and Validation

When developing and validating AV perception models, a machine learning (ML) model’s performance depends heavily on the quality and quantity of available training data. Unfortunately, real-world data, which most autonomy programs use to train their perception algorithms, can be slow, expensive, and even dangerous to collect. Once collected, real-world data must also be labeled, which is often a slow and error-prone process.

Leveraging Synthetic Data for Training

To solve the challenges that real-world data collection and labeling impose on effective ML training, autonomy programs can leverage synthetic data to train lidar-based perception algorithms. This process follows three steps:

  1. Generate labeled synthetic lidar data (Figure 1) targeting the failure modes of existing lidar-based perception models.
  2. Train lidar-based perception models with the labeled synthetic data in addition to real data.
  3. Test the updated models to identify new areas for improvement, and repeat steps 1 and 2.
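The three steps above form a closed loop: generate data targeting known failure modes, train, and re-test. The sketch below illustrates that loop in Python. The function names and the toy "training" logic are hypothetical stand-ins for a real data generation platform, training pipeline, and test harness, not Applied Intuition's API.

```python
# Hypothetical sketch of the generate -> train -> re-test loop described above.

def generate_synthetic_scenes(failure_modes):
    """Stand-in: return labeled synthetic lidar frames targeting known failure modes."""
    return [{"scene": f"synthetic_{m}", "label": m} for m in failure_modes]

def train(model, data):
    """Stand-in: 'training' just records which labeled cases the model has seen."""
    model["seen"].update(d["label"] for d in data)
    return model

def evaluate(model, known_failure_modes):
    """Stand-in: a failure mode counts as fixed once the model has trained on it."""
    return [m for m in known_failure_modes if m not in model["seen"]]

model = {"seen": set()}
real_data = [{"scene": "real_drive_01", "label": "car"}]
failure_modes = ["pedestrian_at_night", "rain_backscatter"]

for _ in range(3):
    synthetic = generate_synthetic_scenes(failure_modes)  # step 1
    model = train(model, real_data + synthetic)           # step 2: real + synthetic
    failure_modes = evaluate(model, failure_modes)        # step 3: re-test
    if not failure_modes:
        break

print(failure_modes)  # prints []
```

In a real program, `evaluate` would be a regression test suite over held-out real data, and the loop would terminate on a performance target rather than an empty list.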

Figure 1: Synthetic lidar data with per-point semantic material labels for vehicles.

Synthetic Data Generation Platform

Synthetic data generation platforms help autonomy programs carry out step 1 of the process above more easily. Such a platform should be optimized for ML engineers, helping them define and generate physically accurate synthetic data at scale. It should produce synthetic data analogous to the real-world circumstances in which the model will be deployed: modeling the exact sensors used by the system, generating a variety of ground-truth labels, and even procedurally generating 3D worlds that resemble the autonomy program’s operational domain (Figure 2).

Figure 2: Data generation platforms should help ML engineers generate physically accurate synthetic data at scale.
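As a concrete illustration of what such a platform's inputs might look like, the sketch below models a dataset request as Python dataclasses: an exact sensor specification, the ground-truth label types to emit, and parameters for procedural world generation. Every field name here is a hypothetical assumption for illustration, not Applied Intuition's actual configuration schema.

```python
from dataclasses import dataclass, field

@dataclass
class LidarModel:
    name: str                 # the exact sensor model deployed on the vehicle
    channels: int
    rotation_rate_hz: float
    max_range_m: float

@dataclass
class DatasetRequest:
    sensor: LidarModel
    # Ground-truth label types to generate alongside the point clouds
    labels: list = field(default_factory=lambda: ["3d_bbox", "per_point_semantics"])
    # Parameters for procedurally generated 3D worlds matching the target domain
    world: dict = field(default_factory=lambda: {
        "domain": "urban_us",
        "weather": ["clear", "rain"],
    })
    num_frames: int = 10_000

# Example request: a hypothetical 128-channel spinning lidar
request = DatasetRequest(
    sensor=LidarModel(name="example-128", channels=128,
                      rotation_rate_hz=10.0, max_range_m=245.0),
)
print(request.labels)  # prints ['3d_bbox', 'per_point_semantics']
```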

Applied’s Approach

Applied Intuition provides a synthetic data generation platform called Synthetic Datasets, which produces labeled synthetic data for ML training, including labeled synthetic lidar data for training lidar-based perception models. Additionally, our sensor simulation tool Spectral supports software-in-the-loop (SIL) and hardware-in-the-loop (HIL) testing of perception modules and end-to-end autonomy stacks.

To demonstrate the advantages of Synthetic Datasets, Applied has conducted several studies. First, our inference study demonstrates how optimizations to synthetic data generation, such as ML-based sensor model tuning, reduce the simulation-to-real domain gap, i.e., the degradation in model performance caused by differences between the synthetic data used for training and the real-world target domain (Figure 3). Closing this gap makes synthetic data more impactful in both training and testing.

Figure 3: Our inference study has shown that after first optimizing synthetic data with inference results as a guide, an ML model can be trained with synthetic lidar data to improve both aggregate and per-class results.

Second, our training study shows that synthetic lidar data materially improves model performance, enabling developers to rely more on synthetic data and less on real-world data collection and labeling (Figure 4).

Figure 4: Our training study has shown that pre-training a lidar model with joint data (i.e., synthetic and real data) and then fine-tuning (i.e., retraining) with real data improves object detection performance by 6% compared to training on real data exclusively.
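The two-stage recipe in the caption above, pre-training on joint synthetic-plus-real data and then fine-tuning on real data only, can be sketched with a toy linear classifier. The data, model, and numbers below are illustrative only and do not reproduce the study or its 6% result; the `shift` parameter crudely mimics a sim-to-real gap.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, shift=0.0):
    """Toy 2D binary-classification data; `shift` mimics a sim-to-real gap."""
    X = rng.normal(size=(n, 2)) + shift
    y = (X[:, 0] + X[:, 1] > 2 * shift).astype(float)
    return X, y

def train_logreg(X, y, w=None, epochs=200, lr=0.1):
    """Plain logistic regression via gradient descent; `w` warm-starts training."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

X_synth, y_synth = make_data(2000, shift=0.3)  # abundant synthetic data
X_real, y_real = make_data(200, shift=0.0)     # scarce real data

# Stage 1: pre-train on joint (synthetic + real) data.
w = train_logreg(np.vstack([X_synth, X_real]),
                 np.concatenate([y_synth, y_real]))
# Stage 2: fine-tune on real data only, warm-starting from the pre-trained weights.
w = train_logreg(X_real, y_real, w=w, epochs=50)

X_test, y_test = make_data(1000, shift=0.0)
acc = ((X_test @ w > 0) == y_test).mean()
print(f"accuracy: {acc:.2f}")
```

The design point is the warm start: fine-tuning begins from weights shaped by the large joint dataset, so the small real dataset only has to correct the residual domain gap rather than learn the task from scratch.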

Contact our team to learn more about Synthetic Datasets, Spectral, and how they help autonomy programs train their perception algorithms more effectively.