Amid the rise of LLMs, is Computer Vision dead?

October 17, 2023 · 5 min read

The field of computer vision has seen incredible progress, but some believe there are signs it is stalling. At the "Quo Vadis, Computer Vision?" workshop at the 2023 International Conference on Computer Vision (ICCV), researchers discussed what's next for the field.

In this post, we bring you the main takeaways from some of the best minds in computer vision, who gathered for this workshop at ICCV 2023 in Paris.

Table of Contents

  1. Quo Vadis, Computer Vision?
  2. The Case Against Foundation Models
  3. Data over Algorithms
  4. Video can describe the world better than Text
  5. After Data-Centric, the User will be the core
  6. Bring back the fundamentals
  7. So, is Computer Vision dead?

Disclaimer: We went undercover into the workshop to bring you the most secret CAMRip-quality insights! 🕵️

1. Quo Vadis, Computer Vision?

Figure 1. Quo Vadis, Computer Vision workshop at ICCV23

Computer vision has reached a critical juncture with the emergence of large generative models. This development is having a dual impact: on one hand, it is opening new research avenues and attracting academics and businesses eager to capitalize on these innovations; on the other, the swift pace of advancement is causing uncertainty among computer vision researchers about where to focus next.

Many researchers feel conflicted, wondering whether they should chase the rapid progress in generative models or keep working on more established computer vision problems. This ICCV 2023 workshop (see Figure 1) brought together experts like David Forsyth, Bill Freeman, and Jitendra Malik to discuss this pivotal moment.

In the following sections we highlight the lively discussions that followed on how computer vision should adapt to and leverage generative models while still tackling core challenges in areas like video and embodied perception. There was consensus that the key is to thoughtfully combine the strengths of computer vision and generative models, rather than treating them as competing approaches.

2. The Case Against Foundation Models

MIT professor Bill Freeman provided three reasons why he doesn't like foundation models:

Reason 1: They don’t tell us how vision works

In short, Freeman argues that although foundation models can solve vision tasks, they bring us no closer to explaining how vision works (i.e., they remain a black box).

Reason 2: They aren't fundamental (and therefore not stable)

As shown in Figure 2, Freeman hints that foundation models are simply a trend.

Figure 2. Bill Freeman's second argument on why foundation models fall short

Reason 3: They separate academia from industry

Finally, Freeman argues that foundation models drive a wedge between academia (i.e., creative teams without resources) and industry (i.e., less imaginative teams with well-organized resources).

3. Data over Algorithms

Berkeley professor Alexei (Alyosha) Efros shared his two ingredients for achieving true AI:

  • Focus on data over algorithms: GigaGAN [1] showed that large datasets enable old architectures such as GANs to scale.
  • Bottom-up emergence: data per se is mostly noise; what is crucial is the right kind of (high-quality) data.

He also argues that LLMs are winning because they are trained on all the available data for just a single epoch (see Figure 3).

Figure 3. According to Efros, LLM training requires only one epoch!
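
To make Efros's point concrete, here is a minimal sketch (our own illustration, not code from the talk) of what single-epoch training looks like in PyTorch: the model streams through the dataset exactly once, so it can never memorize an example through repetition.

```python
import torch
import torch.nn.functional as F

def train_single_epoch(model, data_stream, lr=1e-4):
    """One pass over a (very large) data stream; no sample is revisited."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for inputs, targets in data_stream:   # each batch is seen exactly once
        loss = F.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                  # no outer epoch loop: the stream is the dataset
    return model
```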

4. Video can describe the world better than Text

Berkeley professor Jitendra Malik offered an audacious take: video is a more efficient (and perhaps more effective) way to describe the world than text.

Figure 4. Video (i.e. frames) can represent the same information more efficiently than Language (i.e. tokens)

He supports this view by arguing that any book (see Figure 4 for some examples) can be represented more compactly as video (i.e., frames) than as text (i.e., tokens): the same information is conveyed far more efficiently.

Malik believes video will put computer vision back on the map in the next few years.

5. After Data-Centric, the User will be the core

Figure 5. “Humans” as the next big thing after the current data-driven approaches

Princeton professor Olga Russakovsky provided fascinating insights on what comes after the data-centric approach to machine learning.

She elegantly explained (Figure 5) how the field has evolved from a pure focus on models (circa 2000) to the current motto of "data is king", and argued that an era where the human (i.e., the user) is at the center comes next.

Figure 6. Web data is not enough to achieve a true representative sample

For instance, she made the case for gathering truly representative data from all over the world rather than relying solely on web data (see Figure 6).

6. Bring back the fundamentals

Figure 7. MIT’s professor and researcher Antonio Torralba on stage

Finally, MIT professor Antonio Torralba gave a lighthearted talk in which he candidly shared his views on why curiosity is more important than performance (see Figure 8), especially in today's LLM-driven world.

Torralba argued that computer vision has been here before: (mostly) outsiders confidently declare that the field has stalled, yet time and again someone comes up with a clever idea by focusing on the fundamentals rather than following the crowd.

Figure 8. Creativity as the alternative bet when everyone else is aiming for better performance

7. So, is Computer Vision dead?

The ICCV 2023 workshop makes clear that, rather than being dead, computer vision is evolving. As leading experts argued, promising directions lie in the interplay between vision and language models.

However, other frontiers also hold potential, like exploring when large vision models are needed, or providing granular control over frozen generative architectures, as described by one of the papers awarded the Marr Prize [2] at ICCV 2023.
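
As a concrete illustration of that second direction, the Marr Prize paper [2] (ControlNet) attaches a trainable conditioning branch to a frozen text-to-image diffusion model. Below is a minimal sketch using the Hugging Face diffusers library; the checkpoint names and file paths are assumptions for illustration, not part of the paper.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# A ControlNet trained on Canny edge maps; the base diffusion weights stay frozen.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The edge map gives granular spatial control over what the frozen model generates.
edges = load_image("edge_map.png")  # hypothetical path to a Canny edge image
result = pipe("a red sports car, studio lighting", image=edges).images[0]
result.save("controlled_output.png")
```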

While progress may require integrating the strengths of vision and language, key computer vision challenges remain in areas like texture perception and peripheral vision, where the question of how to throw away information is still open. With an influx of new researchers and industry interest, the field is poised to take on these questions.

References

[1] Kang et al., "Scaling up GANs for Text-to-Image Synthesis," CVPR 2023.

[2] Zhang et al., "Adding Conditional Control to Text-to-Image Diffusion Models," ICCV 2023.

Authors: Jose Gabriel Islas Montero, Dmitry Kazhdan

If you would like to know more about Tenyks, sign up for a sandbox account.
