Mainstream vision models have typically been trained from scratch for a single task, so tasks cannot learn from one another. Because the data for any single task is limited and biased, performance depends heavily on that task's data distribution, and generalization to new scenes is often poor. Large-scale pre-training, which has developed rapidly in recent years, learns more general knowledge from large amounts of data and then transfers it to downstream tasks. A model pre-trained on massive data carries more complete knowledge, so fine-tuning on a small amount of data can still give good downstream results. However, the pre-training plus downstream fine-tuning pipeline still requires training a separate model for each task, which consumes a lot of resources.
The VIMER-UFO ([*UFO: Unified Feature Optimization*](https://arxiv.org/pdf/2207.10341v1.pdf)) AllInOne multi-task training scheme proposed by Baidu trains a single powerful general-purpose model on data from multiple tasks, and that model can be applied directly to all of them. VIMER-UFO not only improves single-task performance through cross-task information but also removes the fine-tuning step for downstream tasks. The VIMER-UFO AllInOne model can be used in a wide range of multi-task AI systems. In the smart city scenario, for example, a single VIMER-UFO model can achieve SOTA results on multiple tasks such as face recognition and human body and vehicle ReID. The multi-task model also achieves significantly better results than single-task models, which demonstrates the benefit of sharing information across tasks.
This track aims to improve model generalization through multi-task joint training and to resolve conflicts between different tasks. Based on traffic scenarios, it selects three representative tasks (classification, detection, and segmentation) for AllInOne joint training.
Task definition: Given the datasets for the three tasks of classification, detection, and segmentation, train one unified large model with AllInOne joint training so that a single model can perform classification, detection, and segmentation.
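The joint training described above can be sketched as follows. This is a minimal, framework-free illustration, not the VIMER-UFO implementation: each training step draws one batch per task, and the per-task losses are merged into one scalar by a weighted sum. The helper names `joint_batches` and `joint_loss`, the task keys, and the loss weights are all hypothetical; the track description does not specify a sampling strategy or a loss weighting.

```python
from itertools import cycle


def joint_batches(task_batches, steps):
    """Yield one dict per step holding a batch for every task.

    Tasks with fewer batches are cycled so that every step sees all tasks.
    (Assumed sampling scheme; the source does not prescribe one.)
    """
    if not all(len(batches) > 0 for batches in task_batches.values()):
        raise ValueError("every task needs at least one batch")
    iterators = {name: cycle(batches) for name, batches in task_batches.items()}
    for _ in range(steps):
        yield {name: next(it) for name, it in iterators.items()}


def joint_loss(task_losses, weights):
    """Combine per-task losses into one scalar with a weighted sum.

    The weights are hypothetical hyperparameters used to balance the tasks.
    """
    return sum(weights[name] * loss for name, loss in task_losses.items())
```

In a real training loop, each step would pass its three batches through a shared backbone with task-specific heads, compute one loss per task, and backpropagate the value returned by `joint_loss`. Tuning the weights is one simple way to reduce conflicts between tasks.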
High-performance image retrieval in traffic scenes plays a crucial role in traffic law enforcement and public security management. Traditional image retrieval methods usually rely on attribute recognition: they retrieve images by comparing the recognized attributes with the expected ones. With the development of multimodal large models, unified text and image representations and cross-modal conversion have come into wide use. These capabilities can further improve the accuracy and flexibility of image retrieval.
The goal of this track is to improve the accuracy of text-based image retrieval in traffic scenes. To that end, we have annotated text descriptions for images of traffic participants from various public datasets and online sources, constructing many-to-many image-text pairs. Participants can use these pairs to research multimodal techniques that improve the accuracy of text-to-image retrieval.
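To make the retrieval setting concrete, here is a minimal sketch of ranking images against a text query. It assumes that some multimodal model has already mapped the text and the images into a shared embedding space; the embeddings, the function names, and the use of Recall@K as a score are illustrative assumptions, not the track's official pipeline or metric. Because the pairs are many-to-many, a query may have several relevant images, so relevance is a set.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_images(text_emb, image_embs):
    """Return image ids sorted from most to least similar to the text query."""
    scores = {img_id: cosine(text_emb, emb) for img_id, emb in image_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant images that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for img_id in ranked_ids[:k] if img_id in relevant_ids)
    return hits / len(relevant_ids)
```

For example, calling `rank_images(query_embedding, gallery)` with a dictionary that maps image ids to embedding vectors returns the gallery ids in descending order of similarity, and `recall_at_k` then measures how many of the annotated matches were retrieved near the top.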