Is AI Getting Out of Control?
As technology continues to advance and mature over time, you might hear doomsayers prophesying that we humans are on the brink of obsolescence, destined to become subservient to our 'Robot Overlords', who will rule the world with an iron fist and force us normies to obey their every command. Well, yes and no. (It's not that draconian… until it is.)
What intrigued me to write this article is the wave of news that has been making headlines around the world regarding the applications of Artificial Intelligence (AI), and how fast it has been progressing over time. Although AI has garnered much attention among the technocrats and technophiles out there, and is believed by some to be overhyped, I stick to my original belief that the masses are underestimating the pace at which AI is advancing and becoming mainstream, and won't realize it until it's too late. I believe that the advent of emerging technologies, especially Artificial Intelligence, Biotechnology, Blockchain, the Internet of Things, and Augmented/Virtual Reality, will bring about massive changes that will revolutionize the world in ways we could never have imagined. As this article focuses mainly on the applications of AI, join me as I delve into some of the wonders of AI while attempting to comprehend its capabilities and functions in today's world using real-world examples.
GPT-3, DALL·E & DALL·E 2
You might have heard these three terms come up frequently in the news, or in blogs and articles, as the world strove to make head or tail of what these buzzwords were and how they were impacting the world on a daily basis.
1. First up is GPT-3, short for third-generation Generative Pre-trained Transformer, known as a "neural network machine learning model trained using internet data to generate any type of text". It was developed by OpenAI (an AI research and deployment company) and utilizes deep learning to turn a small amount of input text into various types of content, such as stories, translations, and huge volumes of coherent machine-generated text. GPT-3 achieves this because it is a neural network trained with 175 billion learning parameters, which, at the time it was developed, gave it the edge over prior models in producing text convincing enough to seem as though it was written by a human.
Input Text → Human-Like Text
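For the curious, here's a minimal sketch of what calling GPT-3 looked like through OpenAI's Python library around the time of writing; the model name, prompt, and settings are illustrative choices of mine, and you'd need your own API key.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; requires your own OpenAI key

# Ask GPT-3 to continue a short piece of input text.
response = openai.Completion.create(
    model="text-davinci-002",  # one of the GPT-3 models available at the time
    prompt="Write a short story about a robot learning to paint:",
    max_tokens=200,            # cap the length of the generated continuation
    temperature=0.7,           # higher values -> more varied, "creative" text
)

print(response["choices"][0]["text"])
```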
2. DALL·E, on the other hand, is a "12 billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs." Released by OpenAI, it can perform a variety of functions, ranging from creating anthropomorphized versions of animals and objects to rendering text and applying transformations to existing images. Being a transformer language model (like GPT-3) designed and trained on large numbers of images and their accompanying captions, it can essentially generate realistic-looking images from scratch using nothing more than text prompts, such as a description of the scene or object. It does this by receiving the "text and image as a stream of data containing up to 1280 tokens", and being "trained using maximum likelihood to generate all of the tokens, one after another" (OpenAI 2021). A token in this sense is an element of a discrete vocabulary; DALL·E's vocabulary has tokens for both text and image concepts, akin to how each English letter is a token from the 26-letter alphabet. OpenAI describes each image caption as being represented by a maximum of 256 Byte Pair Encoding tokens with a vocabulary size of 16384, and each image by 1024 tokens with a vocabulary size of 8192.
Text Prompt → Realistic Looking Images
Source: OpenAI
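Those numbers add up neatly to the 1280-token stream; here's the arithmetic as a tiny sketch (the constants come from OpenAI's figures above, while the helper function is mine, purely for illustration):

```python
# DALL·E's input stream, per OpenAI: up to 1280 tokens in total,
# split between caption tokens and image tokens.
TEXT_TOKENS = 256    # max BPE caption tokens (drawn from a vocabulary of 16384)
IMAGE_TOKENS = 1024  # tokens representing the image (vocabulary of 8192)

assert TEXT_TOKENS + IMAGE_TOKENS == 1280  # matches the stated stream length

def describe_stream() -> str:
    """Illustrative helper: summarize how the single token stream is laid out."""
    return (f"{TEXT_TOKENS} caption tokens followed by {IMAGE_TOKENS} image "
            f"tokens, generated autoregressively, one after another")

print(describe_stream())
```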
3. DALL·E 2, as the name suggests, builds on DALL·E and is stated by OpenAI to generate more realistic and accurate images with 4x greater resolution.
DALL·E 2 runs on a 3.5-billion-parameter model, smaller than DALL·E's 12 billion parameters, yet it can perform extra functions such as expanding images beyond what's in the original canvas, as well as making realistic-looking edits to existing images from a natural language caption, altering various elements while taking into account the reflections, shadows, and textures of the image. Without delving into the details of the underlying mechanisms of how DALL·E 2 operates, the two main building blocks powering it are CLIP (Contrastive Language-Image Pre-training) and diffusion. CLIP is an OpenAI model that DALL·E 2 learns from; it essentially provides the "link between textual semantics and their visual representations", like a bridge between text and images. Rather than predicting a caption given an image, CLIP learns how strongly any given caption relates to an image, all learned from the hundreds of millions of captioned images it was trained on.
Source: OpenAI
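Since OpenAI released CLIP's weights publicly, you can poke at this text-image "bridge" yourself; below is a minimal sketch using Hugging Face's transformers library to score how well a few captions match one image (the model ID is the real public checkpoint, while the image URL and captions are just examples I picked).

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load OpenAI's publicly released CLIP checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any image would do; this one (two cats on a couch) is a common demo image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = ["a photo of two cats", "a photo of a dog", "a painting of a city"]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

# CLIP scores the degree of relation between the image and each caption.
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)

for caption, p in zip(captions, probs[0]):
    print(f"{caption}: {p.item():.2f}")  # the cat caption should score highest
```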
Diffusion is a process that starts from a pattern of random dots, which gradually morphs into an image as specific aspects of that image are recognized. Diffusion models learn to generate data by "reversing a gradual noising process": they ruin images (add noise to them), then try to reconstruct the images to their original form, thereby learning how to generate images from various kinds of data (a toy sketch of the noising half follows below).
Text Prompt → Realistic Looking Images (High Res + Better Caption Matching + Photorealism)
Source: OpenAI
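To give a feel for the "gradual noising" half of that process, here's a toy numerical sketch; the noise schedule and stand-in image below are my own illustrative choices, not DALL·E 2's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))  # stand-in for a real grayscale image

num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)  # a common linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)         # cumulative share of signal kept

def noised(x0, t):
    """Sample the image after t noising steps (closed-form shortcut)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

# Early steps barely touch the image; by the last step it's essentially static.
slightly_noisy = noised(image, 10)
mostly_noise = noised(image, num_steps - 1)
print(f"signal remaining at the final step: {alpha_bar[-1]:.4f}")  # ~0.0000
```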
Stable Diffusion
Stable Diffusion is a text-to-image model that aims to empower billions of people around the world to create their own pieces of art within seconds. Released in 2022 through a collaboration with the teams at CompVis and Runway, Stable Diffusion is a latent diffusion model whose core dataset was trained on LAION-Aesthetics, a subset of LAION-5B (a CLIP*-filtered dataset of 5.85B high-quality image-text pairs). LAION-Aesthetics itself was created by filtering LAION-5B with a new CLIP-based model that scored how "beautiful" each image was, used alongside the ratings provided by the alpha testers of Stable Diffusion.
*CLIP is the neural network released by OpenAI that efficiently learns visual concepts from natural language supervision.
Stable Diffusion ultimately "runs on under 10 GB of VRAM on consumer GPUs, generating images at 512x512 pixels in a few seconds". This allows researchers and the public alike to run it under a wide range of conditions, enabling the democratization of image generation and creation. And because Stable Diffusion is an open-source project whose code has been made public, anyone can run it on most consumer GPUs.
Text Description → Detailed Images
Stable Diffusion Demo on Hugging Face
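Thanks to that openness, running it yourself takes only a few lines of Python; here's a minimal sketch using Hugging Face's diffusers library (the model ID is the original public release, the prompt is mine, and you'll need a GPU with the VRAM mentioned above).

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the original publicly released Stable Diffusion weights.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,  # half precision to fit on consumer GPUs
)
pipe = pipe.to("cuda")  # assumes an NVIDIA GPU is available

prompt = "a lighthouse on a cliff at sunset, oil painting"
image = pipe(prompt).images[0]  # a 512x512 image in a few seconds on a GPU
image.save("lighthouse.png")
```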
Midjourney
Midjourney is an "independent research lab exploring new mediums of thought and expanding the imaginative powers of the human species". It does so through an AI-powered system that creates artwork-style images from users' text prompts. Midjourney is primarily geared toward creating unique pieces of art for its users, and it makes its service accessible to everyone by letting them join its Discord server.
Text Prompt → Art
Community Showcase of Art. Source: Midjourney
Make-A-Video (Meta)
Meta (formerly Facebook) recently announced Make-A-Video, a new AI system that lets individuals turn text prompts into brief, high-quality video clips. Think of it as a DALL·E for video. Building upon Meta AI's progress in generative technology research, Make-A-Video aims to open up a variety of opportunities for creators and artists globally by giving people the tools to quickly and easily create new content. Meta AI claims that "with just a few words or lines of text, Make-A-Video can bring imagination to life and create one-of-a-kind videos full of vivid colors, characters, and landscapes." Apart from being able to create videos from images, it can also take existing videos and create new, similar ones.
Text prompt → Video clips
Text Prompts to Videos. Source: Meta AI
Imagen Video (Google)
Unveiled by Google Research, Imagen is "a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding." It's a text-to-image AI technology comparable to the likes of DALL·E 2 and other latent diffusion models. Only recently, however, did Google announce Imagen Video, a text-to-video generative AI system that uses a cascade of video diffusion models to produce high-definition videos from a text prompt, quite similar to Meta AI's Make-A-Video. Google claims that Imagen Video can generate high-fidelity videos with a high degree of controllability and world knowledge.
Text prompt → Video clips
Imagen Produced Video. Source: Google
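Google's research paper spells out that cascade in detail; the sketch below is only my conceptual outline of the idea, with every function and number invented for illustration: a base model produces a short, low-resolution clip, and spatial and temporal super-resolution stages progressively upscale it.

```python
import numpy as np

# Conceptual stubs only: each function stands in for a trained diffusion model.
def base_video_model(prompt: str) -> np.ndarray:
    """Generate a short, low-res clip from text: (frames, height, width, rgb)."""
    return np.zeros((16, 24, 48, 3))  # e.g. 16 tiny frames

def spatial_superres(video: np.ndarray, scale: int) -> np.ndarray:
    """Upscale each frame's resolution (stand-in for a spatial SR model)."""
    return video.repeat(scale, axis=1).repeat(scale, axis=2)

def temporal_superres(video: np.ndarray, scale: int) -> np.ndarray:
    """Raise the frame count (stand-in for a temporal SR model)."""
    return video.repeat(scale, axis=0)

# The cascade: text -> small clip -> longer, smoother, higher-definition clip.
clip = base_video_model("a teddy bear washing dishes")
clip = temporal_superres(clip, scale=2)  # more frames
clip = spatial_superres(clip, scale=4)   # higher resolution
clip = spatial_superres(clip, scale=2)   # a further upscaling stage
print(clip.shape)  # (32, 192, 384, 3)
```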
GitHub Copilot
Owned by Microsoft and powered by OpenAI Codex (a new AI system created by OpenAI), GitHub Copilot is an "AI pair programmer that offers autocomplete-style suggestions as you code". Programmers can receive suggestions from GitHub Copilot in two ways: by writing the code they want to use, or by writing a natural-language comment describing what they would like the code to do. GitHub Copilot then analyzes the content of the file they are editing, as well as related files, to offer suggestions from within the text editor. You can think of it as a form of advanced autocomplete for computer code that helps programmers become faster and more efficient by handling the more repetitive and mundane parts of their code. GitHub's CEO, Thomas Dohmke, has said that in files where Copilot is enabled, nearly 40% of the code in languages such as Python is being written by GitHub Copilot, a number he believes will only increase as Copilot frees up developers to focus on solving more complex problems and building better software.
Text/Code Prompt → Code
Copilot in use. Source: GitHub
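To picture the comment-driven workflow, here's a representative Python example; the comment is the kind of prompt a programmer would write, and the function body is the sort of completion Copilot tends to suggest (this is my illustration, not a captured suggestion).

```python
import csv

# The programmer writes only this comment (and perhaps the signature):
# read a CSV file and return its rows as a list of dictionaries,
# using the first row as the column names.
def read_csv_as_dicts(path: str) -> list[dict]:
    # A body like this is what Copilot would propose; the programmer can
    # accept it as-is, edit it, or reject it.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

rows = read_csv_as_dicts("data.csv")  # example usage with a local file
```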
Joe Rogan + Steve Jobs podcast
The last example I'll draw from is one I found truly mind-blowing and impressive: a 20-minute deepfake podcast conversation between renowned commentator Joe Rogan and Apple co-founder Steve Jobs. Think of it as a normal podcast between two parties, one asking questions and interviewing while the other answers, except that neither party was actually present and conducting the podcast at any point in time. It was created by Play.ht, a Dubai-based AI voice synthesis and realistic text-to-speech company, which claims that "The Steve Jobs episode was trained on his biography and all recordings of him found online so that the AI could accurately bring him back to life." Likewise, thousands of hours of audio from The Joe Rogan Experience podcast were curated to portray a 'believable' representation of Joe himself, so that the conversation between the two would sound as realistic and listenable as possible. Play.ht explains that its AI voices are "created by machine learning models that process hundreds of hours of voice recordings from real voiceover artists and then learn to speak based on the audio recordings", all while attempting to imitate human tones and emotions. Moreover, it was revealed that the script itself was also created by AI. Listeners have said that despite the first part of the podcast sounding a bit 'clunky', it slowly began to sound believable, with topics they could resonate with and a connection to what both parties were saying. Even the tone of voice and speech cadence made it feel, to some, as though the duo were conducting a real, 'human-like' podcast. To me, this was a stark example of how far AI has come and the implications it will have.
Text → Speech
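Play.ht offers its voices through a web API, though I won't reproduce its exact schema here; the sketch below only illustrates the general text-to-speech request pattern, and the endpoint, payload fields, and voice name are invented placeholders rather than Play.ht's real API.

```python
import requests

# Illustrative only: a generic text-to-speech request pattern.
# The URL, fields, and voice ID are invented placeholders, NOT Play.ht's
# actual API; consult a provider's documentation for the real schema.
payload = {
    "voice": "example-cloned-voice",      # hypothetical cloned-voice identifier
    "text": "Welcome back to the show.",  # the script line to synthesize
    "format": "mp3",
}
headers = {"Authorization": "Bearer YOUR_API_KEY"}  # placeholder credential

resp = requests.post(
    "https://api.example-tts.example/v1/speech",  # placeholder endpoint
    json=payload, headers=headers, timeout=30,
)
resp.raise_for_status()

with open("line.mp3", "wb") as f:
    f.write(resp.content)  # save the synthesized audio clip
```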
Implications
There are many other AI-based examples out there, with new innovations appearing daily, which I have omitted; but the examples above make apparent the breakneck speed at which AI is evolving and maturing. We now have Text → Text, Images, Art, Videos, Code, Speech & more. What's next? Speech → _________?
Think about Translators, Authors/Writers, Artists, Content Creators/Producers, Bloggers, Graphic Designers, Illustrators, Telemarketers, and Call Centre Operators (IVR Systems). These are all jobs/occupations associated with the examples above, where AI has demonstrated the ability to perform similar tasks. Note that this excludes the various other sectors, such as manufacturing, retail, and procurement, on which we've seen AI have a profound impact. Before rushing to point out the negative implications of AI, such as the loss of jobs and the notion of an "AI takeover" scenario, it's worth noting that AI will mostly take the form of supplementing people in their jobs, performing 'mundane' tasks while ultimately freeing up human capacity to innovate more or accomplish other 'high-performing' tasks. Sure, it's inevitable that AI will automate or 'take over' certain roles, rendering them 'obsolete' over time, but it also opens up numerous opportunities for the average Joe, helping them create and build the things they desire. For example, individuals could turn to AI to become a new 'type' of content creator, such as being their own 'movie producer' of short films, or creating their own videobook or e-learning course accompanied by AI-generated voiceovers and imagery. Those who can harness the perks of AI while infusing their work or content with a sense of emotion are bound to excel in the coming technological decade.
Now, to address the original question posed in the title of this article: "Is AI Getting Out of Control?" Well, yes to a certain extent, and no for the time being. No, in the sense that AI is not yet sentient, and according to some, we're still quite far from achieving AGI (Artificial General Intelligence), the representation of generalized human cognitive abilities in software. Yes, in the sense that AI could be used by parties with malicious intent. Imagine what it would be like if voice synthesis software and deepfake tech were misused. As AI grows more realistic, so does the potential for deception of others, who might be incognizant of the fact that they are actually interacting with an AI agent. While most people simply rely on AI to complement their daily activities and improve their standard of living through greater efficiency and convenience, one can easily see the fine line between using AI as a tool and being 'consumed' by it, once the social ramifications are considered. Moreover, it's hard to fathom whether AI will one day become sentient; if it does, I can confidently say that my reasoning for answering "yes" to the question would entail a much longer explanation.