Implementing a Real-time, AI-Based, Face Mask Detector Application for COVID-19 


Businesses are constantly overhauling their existing infrastructure and processes to be more efficient, safe, and usable for employees, customers, and the community. With the ongoing pandemic, it’s even more important to have advanced analytics apps and services in place to mitigate risk. For public safety and health, authorities are recommending the use of face masks and coverings to control the spread of COVID-19. 

NVIDIA developed NVIDIA Clara Guardian, which is an application framework and partner ecosystem that simplifies the development and deployment of smart sensors with multimodal AI in healthcare facilities. Clara Guardian comes with a collection of healthcare-specific, pretrained models and reference applications that are powered by GPU-accelerated application frameworks, toolkits, and SDKs. You can use NVIDIA Transfer Learning Toolkit (TLT) to develop highly accurate, intelligent video analytics (IVA) models with zero coding and use the NVIDIA DeepStream SDK to deploy multi-platform scalable video analytics. 

In this post, we show experiments using TLT to train a face mask detection model and then using the DeepStream SDK to perform efficient, real-time deployment of the trained model. Face mask detection systems are now increasingly important, especially in smart hospitals for effective patient care. They’re also important in stadiums, airports, warehouses, and other crowded spaces where foot traffic is heavy and safety regulations are critical to safeguarding everyone’s health. 

This post only outlines the developer recipe. No trained model or datasets are provided by NVIDIA. You can access the recipe and scripts to build your own app using the NVIDIA-AI-IOT/face-mask-detection GitHub repo. 

 

Overcoming challenges with building an AI-based workflow 

To implement real-time, accurate deep learning applications on embedded systems, you must optimize models effectively during both AI training and inference. The goal here is to train an AI model that is not only accurate but also lightweight and performant for real-time inference at the edge. Pruning reduces the overall size of the model, which results in higher inference performance. This must be done without losing accuracy compared to the original model.
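To make the idea of pruning concrete, here is a minimal sketch in Python with NumPy. It is not TLT's implementation. It assumes one simple criterion (drop convolution filters whose L2 norm, normalized by the largest norm in the layer, falls below a threshold), and the function name `prune_filters` and the toy weights are hypothetical.

```python
import numpy as np

def prune_filters(weights, threshold):
    """Keep only the output filters whose normalized L2 norm >= threshold.

    weights: array of shape (out_channels, in_channels, kh, kw).
    Returns the kept filters and the indices of the kept filters.
    """
    # L2 norm of each output filter.
    norms = np.sqrt((weights.reshape(weights.shape[0], -1) ** 2).sum(axis=1))
    # Normalize by the largest norm so the threshold lies in [0, 1].
    largest = norms.max()
    normalized = norms / largest if largest > 0 else norms
    keep = np.flatnonzero(normalized >= threshold)
    return weights[keep], keep

# Toy layer (illustrative values): four 3x3 filters over three input channels.
scales = np.array([1.0, 0.05, 0.5, 0.01])
weights = np.ones((4, 3, 3, 3)) * scales[:, None, None, None]

pruned, kept = prune_filters(weights, threshold=0.1)
print(kept.tolist(), pruned.shape)
```

In a real network, removing a filter also removes the matching input channel of the next layer, and the pruned model is usually retrained to recover accuracy.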

The next step in model optimization is weight quantization, which transforms floating-point values to integers. Training is typically done at FP32 or FP16 precision, but inference can run at INT8 precision. This is very important for edge devices, where computing resources are limited. Quantization is done either during training or post-training, and TLT provides both options: quantization-aware training (QAT) and post-training quantization (PTQ).
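As an illustration of what moving from floating point to INT8 means, the following sketch applies symmetric per-tensor quantization to a small weight array with NumPy. It is a simplified stand-in, not the calibration procedure that TLT or TensorRT use. The function names and the sample weights are hypothetical.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of a float array to INT8."""
    max_abs = float(np.max(np.abs(x)))
    # One scale for the whole tensor maps [-max_abs, max_abs] to [-127, 127].
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to approximate float values."""
    return q.astype(np.float32) * scale

weights = np.array([-1.27, -0.5, 0.0, 0.25, 1.27], dtype=np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error is bounded by half a quantization step.
max_error = float(np.max(np.abs(weights - restored)))
print(q.tolist(), scale, max_error)
```

The single scale factor is what a calibration step has to choose well: too large and small weights collapse to zero, too small and large values are clipped.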

Finally, to maximize inference throughput, you must process streaming video data efficiently by minimizing memory copies, using all available hardware acceleration, and using TensorRT for inference.

The NVIDIA Transfer Learning Toolkit (TLT) and NVIDIA DeepStream SDK abstract away the complexity associated with building and deploying deep learning models. This end-to-end pipeline helps reduce the overall time to deploy real-time AI/DL applications. TLT and DeepStream are containerized so that you don’t need to install CUDA, cuDNN, deep learning frameworks (TensorFlow, Keras, or PyTorch), or TensorRT for inference. In this post, we discuss how to use the containers on your machine and provide the commands in the open-source NVIDIA-AI-IOT/face-mask-detection GitHub repo.

To use TLT and DeepStream, you do not necessarily have to know all the concepts in depth, such as transfer learning, pruning, quantization, and so on. These toolkits abstract away the complexities, allowing you to focus on your application.

TLT provides a variety of pretrained models: about 13 commonly used image classification models and six object detection models that can use any of the 13 classification models as a backbone. For more information about the available pretrained models, see here. You can choose among these models based on the trade-off between accuracy and complexity (inference FPS). For this experiment, you use DetectNet_v2 with the ResNet-18 backbone.

AI-based face mask detection 

The developer recipe shows the high-level workflow of downloading the pretrained model and downloading and converting datasets to the KITTI format to use with TLT. The quantized TLT model is then deployed using DeepStream SDK to detect masked and no-mask faces. 

 Transfer Learning Toolkit workflow 

The TLT workflow involves downloading the pretrained model, converting the data to the KITTI format, and training and pruning the model.

Download the pretrained model 

TLT provides pretrained models for image classification, instance segmentation, and object detection on NVIDIA NGC. TLT provides a simple and intuitive command line interface to download models of your choice. It also provides purpose-built pruned models such as PeopleNet, TrafficCamNet, DashCamNet, FaceDetect-IR, VehicleTypeNet, and VehicleMakeNet for popular use cases, such as counting people and identifying vehicles at toll booths and traffic intersections, and more.

Convert the dataset to the KITTI format 

For object detection models, input images and labels must be in the KITTI annotation format, and all input images must have the same size (that is, multiple resolutions are not allowed). In the GitHub repo referenced in this post, we provide KITTI format conversion scripts for four publicly available datasets: FDDB, WiderFace, MaFA, and Kaggle Medical Mask Dataset. You can get these scripts from the NVIDIA-AI-IOT/face-mask-detection GitHub repo.
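To show what such a conversion produces, here is a small Python sketch that rescales a bounding box to a common image size and writes it as a 15-field KITTI label line. It is not one of the repo's conversion scripts. The helper names, the class name `mask`, and the image sizes are assumptions for illustration. Fields that 2D detection does not need are filled with zeros.

```python
def scale_box(box, src_size, dst_size):
    """Rescale (xmin, ymin, xmax, ymax) from src (w, h) to dst (w, h)."""
    xmin, ymin, xmax, ymax = box
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    return (xmin * dst_w / src_w, ymin * dst_h / src_h,
            xmax * dst_w / src_w, ymax * dst_h / src_h)

def to_kitti_line(class_name, box):
    """Format one 2D bounding box as a 15-field KITTI label line."""
    xmin, ymin, xmax, ymax = box
    fields = [
        class_name,                      # object class
        "0.00",                          # truncation (unused here)
        "0",                             # occlusion (unused here)
        "0.00",                          # observation angle alpha (unused here)
        f"{xmin:.2f}", f"{ymin:.2f}",    # bbox left, top
        f"{xmax:.2f}", f"{ymax:.2f}",    # bbox right, bottom
        "0.00", "0.00", "0.00",          # 3D dimensions (unused here)
        "0.00", "0.00", "0.00",          # 3D location (unused here)
        "0.00",                          # rotation_y (unused here)
    ]
    return " ".join(fields)

# Hypothetical example: a 1280x720 image resized to 960x544.
box = scale_box((320, 180, 640, 360), (1280, 720), (960, 544))
line = to_kitti_line("mask", box)
print(line)
```

Each image gets one label file with one such line per annotated face, and the boxes must be rescaled whenever the image itself is resized to the common resolution.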

Train and prune the model 

After the input data is processed, you use the downloaded, pretrained model to perform transfer learning. Model pruning not only reduces the number of model parameters but, in some cases, also helps reduce overfitting, which can give better accuracy than the unpruned model while improving inference performance. Pruning is therefore an important step for running large and complex object detection models on embedded platforms. For more information, see Pruning Models with NVIDIA Transfer Learning Toolkit.

Figure 2 shows the effect of pruning on the overall throughput of the face mask detection application. 

Figure 2. Recommended TLT workflow. 

After you achieve satisfactory accuracy for the pruned model, it is ready for deployment. We suggest deployment in FP16 or INT8 format for the best performance. NVIDIA Jetson AGX Xavier and Jetson Xavier NX support INT8 precision on the GPU as well as on the NVIDIA Deep Learning Accelerator (NVDLA). For INT8 precision, tlt-export generates a calibration file that is used to reduce the loss of information due to quantization error when moving from FP32 to INT8.

We also provide a visualization function so that you can visualize the evaluated output and TensorRT deployment output on test images. 

Real-time deployment using DeepStream 

The DeepStream SDK allows you to build and deploy real-time video analytics pipelines with high throughput. After you export the model from TLT to an encoded TLT model file (.etlt), you can convert the model to a TensorRT engine file using tlt-converter and deploy it on the NVIDIA Jetson platform. The generated TensorRT engine can be used as an input to the DeepStream SDK. Alternatively, you can use the .etlt model directly with DeepStream.
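As a rough sketch of how an .etlt model is handed to DeepStream, the following Python snippet generates the `[property]` group of an inference (Gst-nvinfer) configuration file. The file paths, the model key, and the helper name are hypothetical. The group is deliberately incomplete: a working configuration also needs entries such as input dimensions and output layer names, so use the configuration files in the GitHub repo as the reference.

```python
import configparser
import io

def build_nvinfer_config(etlt_path, model_key, labels_path, calib_path=None):
    """Return the [property] group of an nvinfer config as INI text."""
    config = configparser.ConfigParser(interpolation=None)
    config["property"] = {
        "tlt-encoded-model": etlt_path,   # .etlt file exported from TLT
        "tlt-model-key": model_key,       # key used when exporting the model
        "labelfile-path": labels_path,    # one class name per line
        "num-detected-classes": "2",      # assumed: mask and no-mask
        "batch-size": "1",
        # network-mode: 0 = FP32, 1 = INT8, 2 = FP16
        "network-mode": "1" if calib_path else "2",
    }
    if calib_path:
        # INT8 needs the calibration file generated at export time.
        config["property"]["int8-calib-file"] = calib_path
    buffer = io.StringIO()
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()

# Hypothetical paths and key, for illustration only.
int8_text = build_nvinfer_config(
    "models/facemask.etlt", "my_key", "models/labels.txt",
    calib_path="models/calibration.bin")
fp16_text = build_nvinfer_config(
    "models/facemask.etlt", "my_key", "models/labels.txt")
print(int8_text)
```

Leaving out the calibration file falls back to FP16 in this sketch, which mirrors the choice between the two recommended deployment precisions.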

In the GitHub repo, we provide configuration files to set up input for deepstream-app so that you can take advantage of the video analytics pipeline and TensorRT integration for inference. To show how to update the configuration for each input type, we provide two different configuration files for deepstream-app: one for camera input and one for video files stored on Jetson.
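The difference between a camera input and a video file input mostly comes down to the source group of the deepstream-app configuration. The sketch below generates a `[source0]` group for either case. The helper name, device number, resolution, frame rate, and file URI are assumptions for illustration, and the files in the GitHub repo remain the authoritative examples.

```python
import configparser
import io

def build_source_group(camera_device=None, video_uri=None):
    """Return a deepstream-app [source0] group as INI text.

    Pass exactly one of camera_device (V4L2 device number) or video_uri.
    """
    if (camera_device is None) == (video_uri is None):
        raise ValueError("pass exactly one of camera_device or video_uri")
    config = configparser.ConfigParser(interpolation=None)
    if camera_device is not None:
        config["source0"] = {
            "enable": "1",
            "type": "1",  # 1 = V4L2 camera
            "camera-v4l2-dev-node": str(camera_device),
            "camera-width": "640",    # assumed capture resolution
            "camera-height": "480",
            "camera-fps-n": "30",     # assumed frame rate: 30/1
            "camera-fps-d": "1",
        }
    else:
        config["source0"] = {
            "enable": "1",
            "type": "2",  # 2 = URI, such as a local video file
            "uri": video_uri,
            "num-sources": "1",
        }
    buffer = io.StringIO()
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()

camera_text = build_source_group(camera_device=0)
file_text = build_source_group(video_uri="file:///home/nvidia/sample.mp4")
print(camera_text)
print(file_text)
```

Everything downstream of the source group, including the inference settings, can stay the same for both input types.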