Dall-E 2, Stable Diffusion, and the future of art creation

The recent explosion of AI art generators

A pretty impressive new technology has been dominating the machine learning news lately: diffusion models for generative art. The diffusion part of the model is relatively simple: random noise is added to an image, and a neural network is given the noisy image, along with a textual description of what was in the picture, and is asked to reconstruct the original. The network is trained by repeating this process millions of times, with images scraped from all over the internet. Once training is done, you can give the model a text description of what you would like to see (Godzilla eating a pizza, perhaps?), and the model will produce an image that it thinks best represents the text you provided. Like many great ideas, the algorithm for creating a diffusion-based model seems pretty obvious in retrospect.
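
To make that training loop a bit more concrete, here is a toy sketch in plain Python. It is only an illustration under heavy assumptions, not how any of the real systems are built: the "images" are four-number lists, the data is made up, and `ToyDenoiser` is a hypothetical stand-in for the neural network that simply learns one prototype image per caption. The repeated denoising in `generate` is also my own simplification of how an image could be produced from noise once training is done.

```python
import random


def add_noise(image, noise_level, rng):
    """Return a copy of the image with Gaussian noise added to every pixel."""
    return [pixel + rng.gauss(0.0, noise_level) for pixel in image]


class ToyDenoiser:
    """Hypothetical stand-in for the neural network.

    It learns one "prototype" image per caption and predicts a fixed blend
    of the noisy input and that prototype.
    """

    def __init__(self, num_pixels, blend=0.5):
        self.num_pixels = num_pixels
        self.blend = blend  # fraction of the noisy input kept in the prediction
        self.prototypes = {}  # caption -> learned list of pixel values

    def predict(self, noisy_image, caption):
        proto = self.prototypes.setdefault(caption, [0.0] * self.num_pixels)
        return [self.blend * n + (1.0 - self.blend) * p
                for n, p in zip(noisy_image, proto)]

    def train_step(self, noisy_image, caption, original, learning_rate=0.05):
        """One gradient step on the squared reconstruction error."""
        prediction = self.predict(noisy_image, caption)
        proto = self.prototypes[caption]
        for i, (pred, target) in enumerate(zip(prediction, original)):
            error = pred - target
            # d(error**2) / d(proto[i]) = 2 * error * (1 - blend)
            proto[i] -= learning_rate * 2.0 * error * (1.0 - self.blend)


def train_toy_model(dataset, steps=4000, noise_level=0.3, seed=0):
    """Repeat: pick an image, add noise, ask the model to reconstruct it."""
    rng = random.Random(seed)
    model = ToyDenoiser(num_pixels=len(dataset[0][0]))
    for _ in range(steps):
        image, caption = rng.choice(dataset)
        noisy = add_noise(image, noise_level, rng)
        model.train_step(noisy, caption, image)
    return model


def generate(model, caption, rounds=20, seed=1):
    """Start from pure noise and denoise repeatedly, guided by the caption."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in range(model.num_pixels)]
    for _ in range(rounds):
        image = model.predict(image, caption)
    return image


# Two made-up four-pixel "images" with captions (illustrative data only).
DATASET = [
    ([1.0, 1.0, 0.0, 0.0], "top half bright"),
    ([0.0, 0.0, 1.0, 1.0], "bottom half bright"),
]

if __name__ == "__main__":
    model = train_toy_model(DATASET)
    print([round(p, 1) for p in generate(model, "top half bright")])
```

The point of the sketch is just the shape of the process: the model only ever sees a noisy image plus its caption, is scored on how well it recovers the original, and afterwards can be handed pure noise and a caption to produce something that resembles what it saw for that caption during training.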

It currently takes many high-end GPUs (graphics processing units) and a lot of time to train one of these diffusion models to convergence. Because of this, only a few fully trained models are accessible to the public. The first is Dall-E 2, which OpenAI announced in early 2022 and eventually made available to the public through a waitlist in the summer of that year. Midjourney, an independently funded research lab, released its own diffusion model for public use that summer as well. And finally, Stability.ai released Stable Diffusion, its own diffusion-based art generator, in late August. Although all three are diffusion-based, there is a definite difference in style and quality between the models, which is probably driven primarily by budgets and the selection of training data. In general, Dall-E tends to be better at producing more "realistic" images than the other two, but it can often fall into a bit of an uncanny valley, with things looking almost correct but not quite. I personally find the style of Midjourney to be pleasing, while Stable Diffusion is, ironically, the least stable in terms of output quality.

Probably the most interesting thing about these models, however, isn't their stylistic differences, but how the companies who trained them have handled the releases. OpenAI has kept their model behind a waitlist (not very open, eh?) - once you get accepted, you can submit some queries to Dall-E, and it will generate images for you. Midjourney is somewhat similar, but is run in a Discord server. You can sign into the Discord, submit your own descriptions to the model, and watch other people generate their images in the same thread - it's a great way to lose a few hours, and it makes it quite clear that understanding the model and writing good prompts is a skill. The most interesting release method, however, is Stable Diffusion's. Stability.ai released the model completely open source, with both the code and the weights available to everyone, and it can be run on a laptop.

This move kind of blew the doors off of the idea that these models would only be available to larger tech companies and venture-funded start-ups. While it's quite possible that the Googles and Metas of the world will always have the best versions of these expensive machine learning models, thanks to their massive compute infrastructure and ability to curate very large training datasets, it seems possible that we will, through various means, get an open-source version soon after. It first happened with the BigScience collaboration releasing a language model similar to GPT-3, and now we have Stability.ai releasing a completely open-source diffusion art generator as well. While this trend is certainly cause for optimism, it's unclear to me how long it will last. Model sizes are growing faster than our compute power, meaning that each successive generation of state-of-the-art models is getting more expensive to train. If current model size trends continue, then it may not be long until the resources required to train the largest (and best) models are truly only available to the largest corporations, and even copycats like Stable Diffusion won't be viable.

Impact of AI art on art creation

While the advent of large diffusion models is great for AI researchers, hobbyists, and tinkerers, it unfortunately will not be quite so nice for working artists. These tools certainly won't completely displace artists, but their effect on the market will be twofold. First, as these models get better, I suspect they will decimate the low end of the art market. People who use freelance sites like Craigslist or Fiverr to commission small pieces of art, illustration, or commercial work for signage and ads will suddenly have access to options that produce something probably 90-95% as good, but 10-100x cheaper and 1000x faster. Second, for the mid- to high-tier art and design market, these tools are going to make the best artists significantly better. Artists who learn to use these tools will become significantly faster and more creative at their jobs, reducing the total headcount needed in the industry. For completeness, I don't think this will have much of an effect on the world of "fine" art, where it's not just the art itself, but the provenance and interpretation of the piece that collectors and museums are interested in.

This movement in the industry has strong parallels to two other points in history: the invention of the printing press, and the introduction of steam- and eventually combustion-powered tools to farming.

The invention of the printing press democratized the transmission of knowledge. Prior to the press, if you wanted a book to be copied, you had to be rich enough to pay a monk or some other learned person to hand-copy the book, a process that could take the majority of a year for a single copy. The ramifications of the printing press were massive, including the eventual destabilization of the primarily monarchical kingdoms of the time and the nearly complete shattering of the Church's dominance in political power and control of knowledge, religious or otherwise, throughout Europe. While computer-generated art isn't going to overthrow the world order anytime soon, the mechanism of disruption is very similar. Before the printing press, one needed either knowledge or creativity to write a book, depending on the topic, and skill and time to create each copy of the book. The printing press completely removed the skill and time requirements for distribution. In much the same way, creating a painting requires creativity to decide what to paint or draw, and skill and time to execute the artwork. These AI art generators are effectively going to do to the art creation industry what the printing press did to the Church's monasteries.

I think the parallel to the introduction of engine-powered machinery to farming is perhaps even more direct. Before the industrial revolution, over 90% of all Americans lived and worked on farms. Now, in the 2000s, less than 1% of Americans work on farms. What changed? Technologies such as fertilizers, GMOs, pesticides, and selective breeding increased the yield of each plant, which helped, but by far the biggest change was the introduction of steam-powered and eventually combustion-powered tractors and combines to the farm. A single farmer with industrial farming equipment can maintain more productive land than a small town could before the industrial revolution (obviously, only for those crops where the process can be highly mechanized - corn, wheat, etc.). I think in much the same way, these art generation tools are going to make individual artists significantly more efficient. When an artist doesn't need to paint each stroke or draw each line, but can focus primarily on the composition and framing, a work that previously would have taken hours or days could take a fraction of that.

While this may seem bad for artists in the short term, overall this march of progress will be beneficial to society as a whole. At the time of industrialization, there were riots and blood in the streets because farmers and others were losing their jobs. But it led to the rapid increase in labor available for factory and construction jobs that fueled the evolution of farming societies into modern industrial ones. Likewise, globalization caused great unrest and real problems, but was one of the primary drivers of the automation and robotization of America's factories. Our factories are more productive than ever before, with only a fraction of the workforce, which drives down prices on goods and increases the available labor pool for more meaningful, less automatable work. While the art and design industries are not nearly as large or important as manufacturing or farming, I suspect a similar story will play out over the next 20 years as these tools get better.