The History of Computer Vision on the Path to AGI
After our highly sought-after history of LLMs, we continue digging into the annals of machine learning and what is important to learn from them to thrive today
Introduction
If we ever achieve Artificial General Intelligence (AGI), it will not happen solely thanks to Large Language Models (LLMs). While these models have made remarkable progress in natural language processing (NLP) and have given us the impression that machines can finally 'talk' to us, language by itself is simply not enough – and sometimes even unnecessary – for an intelligent creature. Language, in this sense, is not what directly contributes to intelligence; rather, it serves as a sign for us – self-proclaimed intelligent beings – that we are 'speaking the same language' as the models.
If AGI ever becomes possible, language will again play a significant role, allowing us to communicate with the models and understand them better. However, it is worth considering how children and animals learn and develop intelligence. Children initially learn by sight, by ear, and by touch, relying on their senses to understand the world around them. Similarly, animals can understand us through speech recognition and/or visual perception without relying on text, which emphasizes the importance of non-linguistic cues in intelligence.
Without dismissing the significant progress in NLP, exemplified by models like GPT (Generative Pre-trained Transformer) – we even have an amazing series about The History of LLMs – it is crucial to highlight the role of computer vision (CV) in understanding the physical world. In this adventure series about CV, you will learn about the struggles researchers went through from the late 1950s up to the present day; the main discoveries, the major roadblocks to real-life implementation, the dead ends, and the breakthroughs; and how much computer vision has changed our lives. But first, let's explore what computer vision entails.
What is computer vision?
In humans and other animals, vision involves the detection of light patterns from the environment and their interpretation as images. This complex process relies on the integration of sensory inputs by the eyes and their subsequent processing by the brain, interconnected by the nervous system. The resulting intricate network enables humans to filter and prioritize stimuli from a constant influx of sensory information, transforming light patterns into coherent, meaningful perceptions.
We will see in future episodes that researchers tried very different approaches to teach machines to see, perceive, and interpret images. Don't be fooled (or upset) by the anthropomorphizing wording. Despite the field's advancements, current computer vision techniques remain fundamentally different from human vision. They are based on sophisticated algorithms and models that approximate visual processing, rather than replicating the biological complexity of human sight.
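To give a flavor of how different that machine "seeing" is, here is a minimal sketch (our illustration, not from any particular system discussed in the series) using NumPy. It assumes nothing more than that, to a machine, an image is a grid of brightness numbers, and it applies one classic hand-crafted operation – a Sobel edge filter – of the kind early computer vision relied on.

```python
import numpy as np

# To a machine, an "image" is just a grid of numbers: here, a 9x9 grayscale
# frame with a bright 3x3 square in the middle, standing in for the light
# intensities a camera sensor would record.
image = np.zeros((9, 9), dtype=float)
image[3:6, 3:6] = 1.0

# A classic, hand-crafted way to "interpret" those numbers: a Sobel kernel
# that responds to horizontal changes in brightness, i.e. vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def convolve2d(img, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

edges = convolve2d(image, sobel_x)
# Large absolute values mark the left and right edges of the bright square.
print(np.abs(edges).round(1))
```

No biology here at all: just arithmetic over a grid of numbers, which is precisely why calling it "seeing" is a convenient metaphor rather than a description.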
Ironically, even with such advancements in language processing, we still haven't found the right language to avoid anthropomorphizing machines.
How does it work?
