Competition_foundation model

CVPR24-FMW

Second Workshop on Foundation Models

Competitions last year

Track1 Top 3 Certificates

Track2 Top 3 Certificates

Top 10 Certificates


Until recently, mainstream vision models were typically trained from scratch, one task at a time, so tasks could not learn from one another. Because the data available for any single task is limited and biased, performance depends heavily on the distribution of that task's data, and generalization to new scenes is often poor. Large-scale pre-training, which has advanced rapidly in recent years, learns more general knowledge from large amounts of data and then transfers it to downstream tasks. A model pre-trained on massive data captures more complete knowledge, so fine-tuning on a small amount of data can still achieve good downstream results. However, the pre-train-then-fine-tune pipeline still requires training a separate model for each task, which consumes substantial resources.

The VIMER-UFO ([*UFO: Unified Feature Optimization*](https://arxiv.org/pdf/2207.10341v1.pdf)) AllInOne multi-task training scheme proposed by Baidu trains a single, powerful general-purpose model on data from multiple tasks, and that model can then be applied to those tasks directly. VIMER-UFO not only improves single-task performance through cross-task information but also eliminates the downstream fine-tuning step. The VIMER-UFO AllInOne model can be used in a wide range of multi-task AI systems. In smart-city scenarios, for example, a single VIMER-UFO model can achieve SOTA results on multiple tasks such as face recognition, person ReID, and vehicle ReID. The multi-task model also achieves significantly better results than single-task models, which demonstrates that sharing information across tasks is effective.

This track aims to improve the generalization ability of the model through multi-task joint training and to resolve conflicts between different tasks. Based on traffic scenarios, this track selects three representative tasks (classification, detection, and segmentation) for AllInOne joint training.

Task definition: Given datasets for the three tasks of classification, detection, and segmentation, participants use a unified large model for AllInOne joint training, so that a single model can perform classification, detection, and segmentation.
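As a rough illustration of what joint training over three task datasets can involve, the sketch below interleaves batches from each task in round-robin order. This schedule is an assumption made for illustration, not the competition's or VIMER-UFO's actual recipe; the function name `joint_batches`, its parameters, and the task names are hypothetical.

```python
import itertools
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def joint_batches(
    datasets: Dict[str, Sequence],
    batch_size: int,
    steps: int,
    seed: int = 0,
) -> Iterator[Tuple[str, List]]:
    """Yield (task_name, batch) pairs, visiting tasks in round-robin order.

    Hypothetical schedule: each task keeps its own shuffled sample stream,
    so a small dataset is simply revisited more often than a large one.
    """
    if any(len(data) == 0 for data in datasets.values()):
        raise ValueError("every task dataset must contain at least one sample")

    rng = random.Random(seed)

    def stream(data: Sequence) -> Iterator:
        # Endless stream of samples, reshuffled at the start of every epoch.
        while True:
            order = list(range(len(data)))
            rng.shuffle(order)
            for index in order:
                yield data[index]

    streams = {task: stream(data) for task, data in datasets.items()}
    task_cycle = itertools.cycle(datasets)  # dicts preserve insertion order
    for _ in range(steps):
        task = next(task_cycle)
        batch = list(itertools.islice(streams[task], batch_size))
        # A real training loop would run the shared model with the head
        # and loss for `task` here; this sketch only produces the schedule.
        yield task, batch
```

Calling `joint_batches` with three datasets keyed `"classification"`, `"detection"`, and `"segmentation"` yields one batch per task in turn. Other schedules, such as sampling tasks in proportion to dataset size or summing weighted losses from all tasks in every step, are equally plausible choices.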


Track1: Multi-Task Track

Track 1

Track2: Cross-Modal Image Retrieval Track

High-performance image retrieval in traffic scenes plays a crucial role in traffic law enforcement and public security management. Traditional image retrieval methods usually rely on attribute recognition: they recognize the attributes of each image and retrieve those whose attributes match the expected ones. With the development of multi-modal large models, unified text-image representations and cross-modal conversion have become widely used. Exploiting these capabilities can further improve the accuracy and flexibility of image retrieval.

The goal of this track is to improve the accuracy of text-based image retrieval in traffic scenes. To this end, we have annotated text descriptions for images of traffic participants, drawn from various public datasets and online sources, to construct many-to-many image-text pairs. Participants can use these pairs to research multimodal techniques that improve text-to-image retrieval accuracy.
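To make the text-to-image retrieval setting concrete, here is a minimal sketch that ranks images by cosine similarity to a text embedding and scores the ranking with Recall@K under many-to-many ground truth. The embeddings are assumed to come from some multimodal encoder, which is not shown. The function names are hypothetical, and Recall@K is used only as an example metric; the track's official metric is not stated here.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_images(
    text_vec: Sequence[float],
    image_vecs: Dict[str, Sequence[float]],
) -> List[str]:
    """Return image ids sorted from most to least similar to the text."""
    return sorted(
        image_vecs,
        key=lambda image_id: cosine(text_vec, image_vecs[image_id]),
        reverse=True,
    )


def recall_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of text queries with at least one relevant image in the top k.

    `relevant` maps each query to a set of image ids, so one text may match
    several images and one image may match several texts.
    """
    hits = sum(
        1 for query, ranked in rankings.items()
        if relevant[query] & set(ranked[:k])
    )
    return hits / len(rankings)
```

In use, each text query would be ranked against the full image gallery with `rank_images`, and the resulting rankings passed to `recall_at_k` together with the annotated image-text pairs.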


Track 2 Certificates

Track 3 Certificates

Track 2

Track 3
