Second Workshop on Foundation Models  



    When: June 17, 2024      Where: Seattle, USA

    Program: To be updated 

Time (Seattle) | Time (China)  | Title | Presenter
-------------- | ------------- | ----- | ---------
8:00 - 8:55    | 23:00 - 23:55 |  |
8:55 - 9:10    | 23:55 - 0:10  | Intro for foundation model | Teng Xi
9:10 - 9:20    | 0:10 - 0:20   | Opening remarks | Salman / Fahad
9:20 - 10:00   | 0:20 - 1:00   | Keynote Talk #1 | Ani Kembhavi
10:00 - 10:45  | 1:00 - 1:45   | Poster Session Break |
10:45 - 11:20  | 1:45 - 2:20   | Keynote Talk #2 |
11:20 - 11:55  | 2:20 - 2:55   | Keynote Talk #3 | Irfan Essa
11:55 - 13:30  | 2:55 - 4:30   | Lunch Break |
13:30 - 14:10  | 4:30 - 5:10   | Keynote Talk #4 | Yuliang Liu
14:10 - 14:25  | 5:10 - 5:25   | Oral presentation 1: SILC: Improving Vision Language Pretraining with Self-Distillation | Muhammad Ferjad Naeem
14:25 - 14:40  | 5:25 - 5:40   | Oral presentation 2: Exploring the Benefits of Vision Foundation Models for Unsupervised Domain Adaptation | Gijs Dubbelman
14:40 - 14:55  | 5:40 - 5:55   | Oral presentation 3: Toward a Diffusion-Based Generalist for Dense Vision Tasks | Fan Yue
15:00 - 16:00  | 6:00 - 7:00   | Poster Session Break |
16:00 - 16:40  | 7:00 - 7:40   | Keynote Talk #5 | Li Yuan
16:40 - 17:20  | 7:40 - 8:20   | Keynote Talk #6 | Hao Su
17:20 - 17:30  | 8:20 - 8:30   | Closing Remarks | Workshop organizers
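
The two time columns in the program differ by a constant 15 hours. The short sketch below reproduces that conversion for a few session start times. It assumes fixed UTC offsets for mid-June (Seattle on daylight time at UTC-7, China at UTC+8); the helper name `seattle_to_china` is illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets valid for June 2024 (Seattle on PDT, China on CST).
SEATTLE = timezone(timedelta(hours=-7))
CHINA = timezone(timedelta(hours=8))

def seattle_to_china(hhmm, day=17):
    """Convert a Seattle wall-clock time on June <day>, 2024 to China time."""
    hour, minute = map(int, hhmm.split(":"))
    local = datetime(2024, 6, day, hour, minute, tzinfo=SEATTLE)
    return local.astimezone(CHINA)

for start in ["8:00", "9:20", "13:30", "17:20"]:
    t = seattle_to_china(start)
    print(f"{start} Seattle -> {t.hour}:{t.minute:02d} China (June {t.day})")
```

Sessions that start at 9:00 Seattle time or later fall on June 18 in China.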

    When: June 17, 2024      Where: Seattle, USA, Summit 434


Keynote Speakers

Title: Should we be searching over Generative Neural Architectures?

Abstract: There has been impressive recent progress in unsupervised learning of Deep Nets and 
Neural Architectures. In particular, we now have the ability to learn feature vector representations 
which can exploit the enormous amount of unannotated data. But is this enough to overcome the 
limitations of Deep Nets compared to the human visual system, namely their task-specific and 
domain-specific nature and their difficulty in extending to out-of-distribution data and to tasks 
they have not been trained to perform? We argue that generative models suggest a promising 
mathematical strategy for overcoming these limitations, one which differs from discriminative methods 
like Deep Nets, and we give examples of good performance on out-of-distribution data and on tasks that the 
algorithms were not trained on.

Speaker Profile:
Alan Yuille is a Bloomberg Distinguished Professor of Cognitive Science and Computer Science at Johns Hopkins University. 
He directs the research group on Compositional Cognition, Vision, and Learning. He is affiliated with the Center for Brains, 
Minds and Machines, and the NSF Expedition in Computing, Visual Cortex On Silicon. Alan Yuille received the BA degree in 
mathematics from the University of Cambridge in 1976. His PhD in theoretical physics, supervised by Prof. S.W. Hawking, 
was approved in 1981. He was a research scientist in the Artificial Intelligence Laboratory at MIT and the Division of Applied 
Sciences at Harvard University from 1982 to 1988. He served as an assistant and associate professor at Harvard until 1996. 
He was a senior research scientist at the Smith-Kettlewell Eye Research Institute from 1996 to 2002. He was then a full professor 
of Statistics at the University of California, Los Angeles, with joint appointments in computer science, 
psychiatry, and psychology. He moved to Johns Hopkins University in January 2016. His research interests include computational 
models of vision, mathematical models of cognition, medical image analysis, and artificial intelligence and neural networks.

Title: Neural Architecture Search

Abstract: Deep neural networks have achieved extraordinary success in recent years. However, finding 
appropriate network architectures still involves extensive human effort and experience. As an alternative, 
neural architecture search (NAS) was recently proposed to automatically discover suitable networks by searching 
over a vast architecture space. It has rapidly become a research hotspot and achieved cutting-edge performance 
in various computer vision tasks, ranging from image classification and segmentation to detection. Focusing on the 
effectiveness and efficiency of neural architecture search, this talk briefly introduces the existing NAS methods 
and covers some of the recent work and achievements of Professor Rongrong Ji’s research group.

Speaker Profile:
Rongrong Ji is a distinguished professor at Xiamen University and a recipient of the National Natural Science Fund for 
Distinguished Young Scholars. His research falls in the fields of computer vision, multimedia analysis, and machine learning. 
He has published 100+ papers in ACM/IEEE Transactions, including TPAMI and IJCV, as well as top-tier international conferences, 
such as CVPR and NeurIPS. His publications have received over 10K citations on Google Scholar. He was the recipient of the First Prize of 
Technology Invention of the Ministry of Education in 2016, the First Prize of the Fujian Provincial Science and Technology Award in 
2018, and the Science and Technology Award for Youth of Fujian Province in 2019. He has served as an area chair of top-tier international 
conferences such as IEEE CVPR and ACM Multimedia. He is also the Vice Director of the Academic Working Committee of the Chinese 
Society of Image and Graphics, and a member of the Artificial Intelligence Professional Construction Advisory Committee of the 
Electronic Information Education Commission of the Ministry of Education.

Title: Dynamic Neural Networks

Abstract: In recent years, network architecture innovations have been pushing forward the application of deep 
learning in various areas. This talk will introduce a paradigm that improves the inference efficiency of deep 
networks with dynamic architectures. Compared to the mainstream CNN backbones with static components, 
dynamic models can change their depth/width/parameters at the inference stage, conditioned on each input 
sample, thus leading to substantially improved computational efficiency. The advantages of dynamic models 
and possible future directions will be discussed.
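
As a rough illustration of input-conditioned depth, the sketch below implements confidence-based early exiting: stages run in order and inference stops as soon as an intermediate classifier is confident enough. This is a generic toy example, not the speaker's method; the names `early_exit_predict`, `stages`, and `threshold`, and the toy doubling stage, are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(x, stages, threshold=0.9):
    """Run (transform, classifier) stages in order; exit once confident.

    Returns (predicted_class, number_of_stages_used).
    Assumes `stages` is non-empty.
    """
    features = x
    for depth, (transform, classifier) in enumerate(stages, start=1):
        features = transform(features)
        probs = softmax(classifier(features))
        confidence = max(probs)
        if confidence >= threshold:
            break  # confident enough: skip the remaining stages
    return probs.index(confidence), depth

# Hypothetical toy network: each stage doubles the features,
# and each exit classifier uses the features directly as logits.
def double(features):
    return [2.0 * v for v in features]

def identity(features):
    return features

stages = [(double, identity)] * 3

print(early_exit_predict([3.0, 0.0], stages))  # easy input: exits after stage 1
print(early_exit_predict([0.1, 0.0], stages))  # hard input: uses all 3 stages
```

In this toy setup the compute spent per sample scales with how ambiguous the input is, which is the source of the efficiency gain described above.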

Speaker Profile:
Gao Huang is an Assistant Professor in the Department of Automation at Tsinghua University. Previously, he was a postdoc 
researcher in the Department of Computer Science at Cornell University from 2015 to 2018. His research interests lie in 
machine learning and computer vision, especially deep learning. He has authored about 50 papers in top-tier journals and 
conferences (PAMI, CVPR, ICCV, ECCV, NeurIPS, ICML, ICLR, etc.), which have collected more than 16,000 citations. He is a recipient 
of the CVPR Best Paper Award, for the invention of DenseNet. 

Title: Deep (Convolution) Networks from First Principles

Abstract: In this talk, we offer an entirely “white-box” interpretation of deep (convolution) networks from the 
perspective of data compression (and group invariance). In particular, we show how modern deep layered 
architectures, linear (convolution) operators and nonlinear activations, and even all parameters can be derived 
from the principle of maximizing rate reduction (with group invariance). All layers, operators, and parameters 
of the network are explicitly constructed via forward propagation, instead of learned via back propagation. All 
components of the so-obtained network, called ReduNet, have precise optimization, geometric, and statistical 
interpretations. There are also several nice surprises from this principled approach: it reveals a fundamental 
tradeoff between invariance and sparsity for class separability; it reveals a fundamental connection between 
deep networks and Fourier transform for group invariance – the computational advantage in the spectral 
domain (why spiking neurons?); this approach also clarifies the mathematical role of forward propagation 
(optimization) and backward propagation (variation). In particular, the so-obtained ReduNet is amenable to 
fine-tuning via both forward and backward (stochastic) propagation, both for optimizing the same objective. 
This is joint work with students Yaodong Yu, Ryan Chan, Haozhi Qi of Berkeley, Dr. Chong You now at Google 
Research, and Professor John Wright of Columbia University.

Speaker Profile:
Yi Ma received the first prize of Excellent Student Scholarship from Tsinghua University in 1994 and the Regents Fellowship from 
U.C. Berkeley from 1995 to 1996. His PhD research won the David Marr Best Paper Award with S. Soatto, J. Kosecka, and S. Sastry, 
at the International Conference on Computer Vision (ICCV) in 1999. He also received honorable mention for the Longuet-Higgins 
Best Paper Award with R. Vidal at the European Conference on Computer Vision (ECCV) in 2004, the Sang Uk Lee Best Student 
Paper Award with his students Shankar Rao, Hossein Mobahi, and Allen Yang at the Asian Conference on Computer Vision (ACCV) 
in 2009, and the second prize of the Best Paper Award of the IMA Journal on Information and Inference in 2015. Yi Ma was the 
recipient of the Faculty Early Career Development (CAREER) Award from the National Science Foundation (NSF) in 2003. He was 
also the recipient of the Young Investigator Program (YIP) Award from the Office of Naval Research (ONR) in 2005. He received 
the Gold Star Award from Microsoft Corporation in 2009 and the Best Research Team of the Year Award from Microsoft Research 
Asia in 2012. He has given over two dozen plenary talks at international conferences and workshops. He was on the Incomplete 
List of Teachers Ranked as Excellent of the University of Illinois for Spring'01, Fall'02, and Spring'06. Yi Ma has been an IEEE Fellow since 
2013, an ACM Fellow since 2017, and a SIAM Fellow since 2020. He has been ranked among the World's Highly Cited Researchers since 2016 by 
Clarivate Analytics of Thomson Reuters and was ranked among the Top 50 of the World's Most Influential Authors in Computer Science by 
Semantic Scholar, as reported by Science Magazine in April 2016.

Title: Learning 3D environment representations through intelligent anticipation

Abstract: Embodied agents operating in unfamiliar indoor environments must explore efficiently and build 
useful representations of the environment. We introduce the idea of predicting unobserved content in 3D 
spaces to (1) learn agents that build maps efficiently and (2) learn transferable representations that benefit 
several downstream navigational tasks. First, we propose the idea of occupancy anticipation, where we infer 
spatial occupancy for unobserved regions to map unfamiliar environments rapidly. For example, we can predict 
whether there is a corridor outside the room and whether there is floor space behind the bed. Embodied 
agents equipped with the ability to anticipate occupancy build 30% more accurate maps when compared to 
prior work, and navigate more efficiently in the challenging Gibson and Matterport3D datasets. Our approach 
won the 2020 Habitat PointNav Challenge. Next, we propose the self-supervised approach of environment 
predictive coding to learn effective representations of observation sequences gathered by an embodied 
agent. We learn these representations on video walkthroughs generated by other agents, and transfer the 
representations to various geometric and semantic navigation tasks. Our approach improves the learning 
efficiency of embodied agents by 2-4x compared to methods that only learn image-level representations, 
and leads to better navigation performance.

Speaker Profile:
Kristen Grauman is a Full Professor in the Department of Computer Science at the University of Texas at Austin where she 
leads the UT Computer Vision Group. Her research is in computer vision and machine learning. She is a Fellow of AAAI, an 
Alfred P. Sloan Research Fellow, and a recipient of the Presidential Early Career Award for Scientists and Engineers, the 2013 
Computers and Thought Award, and several best paper awards. Prof. Grauman serves as Associate Editor-in-Chief for the 
IEEE Transactions on Pattern Analysis and Machine Intelligence. She was elected to the Academy of Distinguished Teachers 
in 2017, and received her B.A. from Boston College and her Ph.D. from MIT.  Within computer vision and machine learning, 
Prof. Grauman's primary interests are visual recognition, image and video search, video analysis, first-person vision, embodied 
and multi-modal perception, and interactive machine learning.

Title: Automated Architecture and Training Recipe Search of Computationally Efficient Deep Neural Networks

Speaker Profile:
Peter Vajda is a Research Scientist Manager at Facebook. He leads the Mobile Vision team's effort on efficient deep 
learning for computer vision. From 2012 to 2014, he was a Visiting Assistant Professor in Professor Bernd Girod's group at 
Stanford University, Stanford, USA. From September 2007 to January 2012, he was a research assistant and Ph.D. student in 
Professor Ebrahimi's group at Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland. 

Title: Examining Deep Neural Architectures in Practice

Abstract: The past decade has witnessed significant progress in deep learning. As probably the trickiest 
hyperparameter in deep learning, the neural architecture has become the key to accelerating deep learning 
computation. In this talk, I will introduce our recent work on neural architecture search. Beyond accuracy, 
adversarial robustness is another important factor to be investigated. Further, instead of the toy setting for an 
efficient neural architecture, we need to keep the target of the searched neural architecture in mind. An 
extension to graph neural architecture search is also included to broaden the boundary of the study on neural 
architecture search.

Speaker Profile:
Chang Xu is a Senior Lecturer and ARC DECRA Fellow at the School of Computer Science, University of Sydney. He received the 
Ph.D. degree from Peking University, China. His research interests lie in machine learning algorithms and related applications in 
computer vision. He has published over 100 papers in prestigious journals and top-tier conferences. He has received several paper 
awards, including the Distinguished Paper Award at IJCAI 2018. He regularly serves as a PC member or senior PC member for many 
conferences, e.g. NeurIPS, ICML, ICLR, CVPR, ICCV, IJCAI and AAAI. He was recognized as a Top Ten Distinguished Senior PC 
Member at IJCAI 2017.