Artificial intelligence-powered image generators have essentially become commodities over the past two years, as the technology has grown widely accessible and the technical barriers around it have fallen.
Nearly all of the big technology companies, including Google and Microsoft, have deployed them, along with countless startups hoping to grab a slice of the lucrative generative AI market.
That isn’t to say their performance has settled; far from it. Image generator quality has improved, but progress has been incremental and at times agonizingly slow.
Meta, however, asserts that it has made progress.
The company claims that its CM3Leon AI model, which stands for “chameleon” in awkward leetspeak, offers state-of-the-art performance for text-to-image generation.
CM3Leon is also notable for being one of the first image generators capable of producing captions for images, laying the groundwork for more capable image-understanding models, according to Meta.
“With CM3Leon’s capabilities, image generation tools can produce more coherent imagery that better follows the input prompts,” Meta stated in a blog post that was shared with TechCrunch earlier this week.
“We believe that CM3Leon’s strong performance across a variety of tasks is a step toward higher-fidelity image generation and understanding.”
The majority of contemporary image generators, including OpenAI’s DALL-E 2, Google’s Imagen, and Stable Diffusion, rely on a technique known as diffusion: a model learns to gradually remove noise from a starting image made entirely of noise, moving it closer to the target image with each successive step.
The results are impressive. But diffusion is computationally intensive and slow, which makes it impractical for most real-time applications.
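To make the idea concrete, here is a minimal sketch of a single reverse-diffusion (denoising) step in Python. It is an illustrative DDPM-style update under assumed names (`model`, `alpha`, `alpha_bar`), not code from Meta, OpenAI, or Stability AI.

```python
import torch

def denoise_step(model, x_t, t, alpha, alpha_bar):
    """One illustrative reverse-diffusion step (DDPM-style, deterministic part only).

    model     -- a network trained to predict the noise present in x_t at step t
    x_t       -- the current noisy image tensor
    t         -- the current timestep index
    alpha     -- per-step noise-schedule values (tensor)
    alpha_bar -- cumulative products of alpha, indexed by timestep (tensor)
    """
    predicted_noise = model(x_t, t)
    # Subtract the predicted noise contribution and rescale, nudging the image
    # one small step closer to a clean sample.
    x_prev = (x_t - (1 - alpha[t]) / torch.sqrt(1 - alpha_bar[t]) * predicted_noise) / torch.sqrt(alpha[t])
    return x_prev

# Sampling starts from pure Gaussian noise and repeats this step many times,
# which is why diffusion is slow compared with single-pass generation.
```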
CM3Leon, in contrast, is a transformer model, which uses a mechanism called “attention” to weigh the relevance of input data such as text or images.
Attention and the other architectural quirks of transformers can speed up model training and make parallelization easier. In other words, ever-larger transformers can be trained with significant, but not unattainable, increases in compute.
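For readers unfamiliar with the mechanism, the sketch below shows the standard scaled dot-product attention computation that transformer models are built on; the tensor names and shapes are generic, not taken from CM3Leon’s implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """Standard attention: each query token scores every key token's relevance,
    then takes a relevance-weighted average of the corresponding values."""
    d_k = query.size(-1)
    # Similarity scores between queries and keys, scaled to keep softmax gradients stable.
    scores = query @ key.transpose(-2, -1) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)  # each query's weights sum to 1
    return weights @ value

# Example: a batch of 2 sequences, 8 tokens each, 64-dimensional embeddings.
q = k = v = torch.randn(2, 8, 64)
out = scaled_dot_product_attention(q, k, v)  # shape (2, 8, 64)
```

Because the scores for every token are computed in a handful of large matrix multiplications, the whole operation parallelizes well on modern accelerators, which is part of what makes training very large transformers feasible.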
Meta asserts that CM3Leon is even more efficient than the majority of transformers, requiring five times less compute and a smaller training dataset.
Interestingly, OpenAI explored the use of transformers for image generation a few years ago with a model dubbed Image GPT, but it ultimately abandoned the idea in favor of diffusion, and may soon move on to a technique called “consistency.”
Meta trained CM3Leon on a dataset of millions of Shutterstock-licensed images. The most capable version of CM3Leon Meta built has 7 billion parameters, more than twice as many as DALL-E 2.
(Parameters are the parts of the model learned from training data; they essentially define the model’s skill at a task, such as generating text or, in this case, images.)
One technique that contributes to CM3Leon’s improved performance is supervised fine-tuning, or SFT for short.
SFT has been successfully used to train text-generating models like OpenAI’s ChatGPT, but Meta hypothesized that it might also be helpful when employed in the image domain.
Indeed, instruction tuning improved CM3Leon’s performance at both generating images and writing image captions, enabling it to answer questions about images and to edit images by following text instructions (such as “change the color of the sky to bright blue”).
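As a rough illustration of what supervised (instruction) fine-tuning involves, here is a minimal sketch in Python. The example records, the `tokenize` helper, and the `model.loss` interface are hypothetical stand-ins for illustration, not Meta’s data or API.

```python
# Hypothetical instruction-tuning records: an instruction, an input, and the
# target output the model should learn to produce.
examples = [
    {"instruction": "Change the color of the sky to bright blue.",
     "input": "<tokens for the source image>",
     "target": "<tokens for the edited image>"},
    {"instruction": "Write a short caption for this image.",
     "input": "<tokens for the source image>",
     "target": "A raccoon wielding a samurai sword, anime style."},
]

def sft_step(model, tokenize, example, optimizer):
    """One supervised fine-tuning step: the model is trained to emit the target
    tokens when conditioned on the instruction and the input."""
    prompt_ids = tokenize(example["instruction"] + "\n" + example["input"])
    target_ids = tokenize(example["target"])
    loss = model.loss(prompt_ids, target_ids)  # assumed interface: next-token loss over the target
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```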
Most image generators struggle with “complex” objects and text prompts that come with too many constraints. CM3Leon, however, trips up far less frequently.
In a few selected examples, Meta had CM3Leon generate images from prompts such as “A small cactus wearing a straw hat and neon sunglasses in the Sahara desert,” “A close-up photo of a human hand, hand model,” “A raccoon main character in an Anime preparing for an epic battle with a samurai sword,” and “A stop sign in a Fantasy style with the text ‘1991.’”
For comparison, I ran the same prompts through DALL-E 2.
The results were sometimes close. But to my eyes, the CM3Leon images were generally more accurate and detailed, with the stop sign the most glaring example. (Until recently, diffusion models handled both text and human anatomy rather poorly.)
CM3Leon can also understand instructions to edit existing images. Given the prompt “Generate high quality image of ‘a room that has a sink and a mirror in it’ with bottle at location (199, 130),” for instance, the model can produce something visually coherent and, as Meta puts it, “contextually appropriate”: room, sink, mirror, and bottle included.
DALL-E 2 entirely misses the nuance of these kinds of prompts, sometimes omitting the very objects the prompt calls out.
And, unlike DALL-E 2, CM3Leon can follow a range of prompts to generate short or long captions and answer questions about an image.
The model outperformed even specialized image-captioning models (such as Flamingo and OpenFlamingo) in these areas, despite seeing less text in its training data, according to Meta.
What about bias, though? Generative AI models like DALL-E 2 have, after all, been found to reinforce societal biases, producing images of positions of authority, such as “CEO” or “director,” that depict mostly white men.
On this question, Meta says only that CM3Leon “can reflect any biases present in the training data.”
As the AI industry evolves, the company claims, generative models like CM3Leon are only becoming more sophisticated, and while the industry is still in the early stages of understanding and addressing the challenges they pose, Meta believes transparency will be key to accelerating progress.
Whether and when Meta plans to release CM3Leon is unclear. Given the controversies swirling around open source art generators, I wouldn’t hold my breath.