Amid the rise of LLMs, is Computer Vision dead?

October 17, 2023 · 5 min read

The field of computer vision has seen incredible progress, but some believe there are signs it is stalling. At “Quo Vadis, Computer Vision?”, a workshop at the 2023 International Conference on Computer Vision (ICCV), researchers discussed what’s next for the field.

In this post, we bring you the main takeaways from some of the best minds in computer vision who gathered for this workshop at ICCV 2023 in Paris.

Table of Contents

  1. Quo Vadis, Computer Vision?
  2. The Anti Foundation Models
  3. Data over Algorithms
  4. Video can describe the world better than Text
  5. After Data-Centric, the User will be the core
  6. Bring back the fundamentals
  7. So, is Computer Vision dead?

Disclaimer: We went undercover into the workshop to bring you the most secret CAMRip-quality insights! 🕵️

1. Quo Vadis, Computer Vision?

Figure 1. Quo Vadis, Computer Vision? workshop at ICCV 2023

Computer vision has reached a critical juncture with the emergence of large generative models. This development is having a dual impact: on one hand, it is opening new research avenues and attracting academics and businesses eager to capitalize on these innovations; on the other, the swift pace of advancement is causing uncertainty among computer vision researchers about where to focus next.

Many researchers feel conflicted, wondering whether to chase the rapid progress in generative models or to keep working on more established computer vision problems. This ICCV 2023 workshop (see Figure 1) brought together experts such as David Forsyth, Bill Freeman, and Jitendra Malik to discuss this pivotal moment.

In the following sections we highlight the lively discussions that followed on how computer vision should adapt to and leverage generative models while still tackling core challenges in areas like video and embodied perception. There was consensus that the key is to thoughtfully combine the strengths of computer vision and generative models, rather than to treat them as competing approaches.

2. The Anti Foundation Models

MIT professor Bill Freeman provided three reasons why he doesn’t like foundation models:

Reason 1: They don’t tell us how vision works

In short, Bill Freeman argues that foundation models are capable of solving vision tasks, but despite this achievement nobody can explain how vision works (i.e., they remain a black box).
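
To make this concrete: a model like CLIP classifies images zero-shot remarkably well, yet its success tells us little about how vision works. Below is a minimal sketch using the Hugging Face transformers API; the checkpoint and labels are our illustrative choices, not something from the talk.

```python
# A foundation model solving a vision task it was never explicitly
# trained for, while remaining a black box internally. Checkpoint and
# labels are our illustrative choices.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# A confident zero-shot prediction comes out, but nothing here explains
# *how* the model perceives: the mechanism stays opaque.
probs = outputs.logits_per_image.softmax(dim=-1)
print(labels[probs.argmax().item()])
```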

Reason 2: They aren’t fundamental (and therefore not stable)

As shown in Figure 2, Professor Freeman hints that foundation models are simply a trend.

Figure 2. Bill Freeman’s second argument on why foundation models fall short

Reason 3: They separate academia from industry

Finally, Professor Freeman argues that foundation models create a boundary between academia (i.e., creative teams with no resources) and industry (i.e., less imaginative teams with well-organized resources).

3. Data over Algorithms

Berkeley professor Alexei (Alyosha) Efros shared his two ingredients for achieving true AI:

  • Focus on data over algorithms: GigaGAN [1] showed that large datasets enable older architectures such as GANs to scale.
  • Bottom-up emergence: data per se is mostly noise; what is crucial is the right kind of (high-quality) data (see the curation sketch below).
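
One way this emphasis on the right data shows up in practice is curation by alignment scoring, as in LAION-style pipelines that keep only image-text pairs whose CLIP similarity clears a threshold. A minimal sketch follows; the file names and cutoff are illustrative assumptions, not from the talk.

```python
# Illustrative sketch: curating "the right kind of data" by scoring
# image-text pairs with CLIP and keeping only well-aligned ones, in the
# spirit of LAION-style pipelines. File names and cutoff are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def alignment_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(
        text=[caption], images=image, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

pairs = [
    ("img_001.jpg", "a red bicycle leaning against a brick wall"),
    ("img_002.jpg", "IMG_4521 stock photo buy now"),  # noisy web caption
]
# ~0.28 is the ViT-B/32 cutoff LAION reported; treat it as illustrative.
curated = [
    (path, cap) for path, cap in pairs
    if alignment_score(Image.open(path), cap) > 0.28
]
```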

He also argues that LLMs are winning because they are trained on all the available data for just a single epoch (see Figure 3).

Figure 3. LLM training requires only a single epoch!
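
For contrast with vision’s many-epoch training recipes, here is a toy sketch of that single-pass regime: the corpus is streamed once, so no example is ever revisited. The model and data are stand-ins, not anything shown in the talk.

```python
# Toy sketch of the single-epoch regime: stream the corpus once, update
# as you go, never revisit an example. Model and data are stand-ins.
import torch
from torch import nn

vocab_size, dim = 10_000, 256
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

def stream_batches(num_batches: int = 100):
    """Stand-in for an IterableDataset streaming tokenized web text."""
    for _ in range(num_batches):
        yield torch.randint(0, vocab_size, (8, 64))

# Exactly one pass: each batch is seen once and then discarded, in
# contrast to the multi-epoch recipes common in computer vision.
for tokens in stream_batches():
    logits = model(tokens[:, :-1])  # next-token prediction
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```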

4. Video can describe the world better than Text

Berkeley professor Jitendra Malik offered an audacious take: video is a more efficient (and perhaps more effective) way to describe the world than text.

Figure 4. Video (i.e. frames) can represent the same information more efficiently than Language (i.e. tokens)

He supports this view by arguing that any book (see Figure 4 for some examples) can be represented more compactly in video (i.e., frames) than in text (i.e., tokens): the same information can be conveyed far more efficiently.

Professor Malik believes video will put computer vision back on the map in the next few years.

5. After Data-Centric, the User will be the core

Figure 5. “Humans” as the next big thing after the current data-driven approaches

Princeton professor Olga Russakovsky provided fascinating insights on what comes after the data-centric approach to machine learning.

She elegantly explained (Figure 5) how the field has evolved from a pure focus on models (circa 2000) to the current mantra of “data is king,” and argued that an era where the human (i.e., the user) is at the center comes next.

Figure 6. Web data is not enough to achieve a true representative sample

For instance, she makes the case for gathering truly representative data from all over the world rather than relying solely on web data (see Figure 6).
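
One way to act on that concern is to audit where a dataset’s images actually come from before training on it. A minimal sketch, assuming a hypothetical metadata.csv with a country column:

```python
# Minimal audit sketch: check where a dataset's images come from before
# training. "metadata.csv" and its "country" column are hypothetical.
import csv
from collections import Counter

with open("metadata.csv") as f:
    counts = Counter(row["country"] for row in csv.DictReader(f))

total = sum(counts.values())
for country, n in counts.most_common(10):
    print(f"{country}: {n / total:.1%} of images")
# If a handful of countries dominates this list, the dataset has the
# web-data skew the talk warns about.
```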

6. Bring back the fundamentals

Figure 7. MIT’s professor and researcher Antonio Torralba on stage

Finally, MIT professor Antonio Torralba gave a light talk in which he candidly shared his views on why curiosity matters more than performance (see Figure 8), especially in today’s LLM-driven world.

Professor Torralba argues that computer vision has been in this position before: (mostly) outsiders confidently declare the field has stalled, yet time and again someone comes up with a clever idea by focusing on the fundamentals rather than following the crowd.

Figure 8. Creativity as the alternative bet when everyone else is aiming for better performance

7. So, is Computer Vision dead?

The ICCV 2023 workshop makes clear that, rather than being dead, computer vision is evolving. As leading experts argued, promising directions lie in the interplay between vision and language models.

However, other frontiers also hold potential, such as exploring when large vision models are actually needed, or adding granular control to frozen generative architectures, as described in one of the papers awarded the Marr Prize [2] at ICCV 2023.
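
That Marr Prize paper is ControlNet [2]. As a minimal sketch of the idea, here is how it is commonly run through the Hugging Face diffusers API, steering a frozen Stable Diffusion backbone with an edge map; the checkpoints named are the widely used public ones, and the input image is a placeholder.

```python
# Minimal sketch of granular control over a frozen text-to-image model
# with ControlNet [2], via the Hugging Face diffusers API. Checkpoints
# are the commonly used public ones; the input image is a placeholder.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# The control branch is trained separately; the Stable Diffusion
# backbone it steers stays frozen.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Derive an edge map from any image; the generation must respect its layout.
rgb = np.array(Image.open("input.jpg").convert("RGB"))
edges = cv2.Canny(cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY), 100, 200)
condition = Image.fromarray(np.stack([edges] * 3, axis=-1))

result = pipe("a futuristic city at dusk", image=condition).images[0]
result.save("controlled_output.png")
```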

While progress may require integrating the strengths of vision and language, core computer vision problems remain open in areas like texture perception and peripheral vision, where the question of how to throw away information is still unsolved. With an influx of new researchers and industry interest, the field is poised to take on these questions.

References

[1] Kang, M., et al. “Scaling up GANs for Text-to-Image Synthesis.” CVPR 2023.

[2] Zhang, L., Rao, A., and Agrawala, M. “Adding Conditional Control to Text-to-Image Diffusion Models.” ICCV 2023.

Authors: Jose Gabriel Islas Montero, Dmitry Kazhdan

If you would like to know more about Tenyks, sign up for a sandbox account.
