
Images 2.0 Moves Image Generation From Novelty to Workflow Tool

Andrew Mayne · Adele Li · Kenji Hata · OpenAI · Thursday, May 14, 2026 · 12 min read

OpenAI product lead Adele Li and researcher Kenji Hata argue that Images 2.0 marks a shift from novelty image generation to a working visual layer inside ChatGPT. In a podcast discussion with Andrew Mayne, they point to 1.5 billion images generated weekly, sharper text rendering, stronger photorealism, broader aspect ratios, and more consistent characters as evidence that the model is moving into education, internal communication, marketing assets, software mockups, and other practical creative work.

Image generation is moving from novelty to working surface

OpenAI product lead Adele Li and researcher Kenji Hata describe Images 2.0 as a shift from novelty image generation toward a broader visual creation layer inside ChatGPT. Li treats the release as a shift in kind rather than a routine model upgrade. Her shorthand is deliberately sweeping: the new model is not only more aesthetic, but able to combine “science, art, architecture” in a single image. The claim is that image generation is becoming useful across personal, educational, and professional contexts, not only better at producing attractive pictures.

If DALL-E was the Stone Ages, Image Gen 2.0 is the Renaissance.

Adele Li

The scale of adoption is part of the case. In the two weeks after launch, Li says usage rose by more than 50%, with more than 1.5 billion images generated every week in ChatGPT. She points to viral consumer patterns across regions: color analysis and stickers in Asia, crayon and scribble styles in the United States, and a wider set of experiments that emerged because users could immediately see the model’s higher fidelity.

1.5B+
images generated weekly in ChatGPT, according to Li

The core product claim is that Images 2.0 is no longer mainly an engine for “fun” image generation. Hata says one of the research team’s closest use cases is now infographics and text-heavy images, because text in images has become “so much better.” That improvement, in his view, opens up productivity use cases that previous image models could not reliably serve.

Li breaks the model’s improvement into several dimensions: better text rendering, deliberate multilingual work, and stronger photorealism. Text on a page is higher fidelity, more legible, and more likely to be actual words. Multilingual performance matters because, she says, users in Asia and Europe are responding to those advancements. Photorealism was driven by earlier feedback that generated outputs could look unrealistic or alter people’s faces and bodies. The goal was to make images feel “more like yourself.”

Those changes are tied to a broader ambition: images as a general format for communication. Li says OpenAI approached the project from the premise that “every single output, or visual content that you see today” can be distilled into an image. In that framing, image generation is not a narrow creative feature. It is a way to produce posters, diagrams, decks, social headers, listings, thumbnails, study guides, comic books, game assets, and website and app concepts.

The leap is visible in text, composition, aspect ratio, and object binding

Kenji Hata resists the idea that the progress was mysterious, even if the result feels abrupt to users. He describes a steady internal progression in the model’s ability to bind many requested elements correctly. One informal test asks ChatGPT for a list of 100 random objects and then sends that list to the image generator. According to Hata, DALL-E 3 could produce perhaps five to eight objects in a grid, Images 1 could produce roughly 16, Images 1.5 could consistently produce around 25 to 36, and the current model can “probably do over 100,” usually getting almost all of them correct.

Model stage: approximate grid-object performance described by Hata
DALL-E 3: about 5 to 8 objects
Images 1: about 16 objects
Images 1.5: about 25 to 36 objects, consistently
Images 2.0: probably over 100 objects, with almost all correct in the informal test

Hata’s informal object-grid test illustrates the steady improvement in multi-object binding.
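
The test is easy to approximate with the OpenAI Python SDK. The sketch below is ours, not the team’s harness: the model names (“gpt-4o” for the list step, “gpt-image-1” as a stand-in for the image model discussed here) and the prompt wording are assumptions, since the episode does not name an API surface.

```python
# Rough reproduction of Hata's informal object-grid eval.
# Assumptions: OpenAI Python SDK; "gpt-4o" and "gpt-image-1" are
# stand-in model names, and the prompts are illustrative.
import base64
from openai import OpenAI

client = OpenAI()

# Step 1: ask a chat model for 100 random, visually distinct objects.
objects = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "List 100 random, visually distinct everyday objects, "
                   "comma-separated, with no numbering.",
    }],
).choices[0].message.content

# Step 2: hand the whole list to the image model as one grid request.
image = client.images.generate(
    model="gpt-image-1",
    prompt=f"A 10x10 grid of labeled illustrations, one cell per object: {objects}",
    size="1024x1024",
)

# The image model returns base64-encoded image data.
with open("object_grid.png", "wb") as f:
    f.write(base64.b64decode(image.data[0].b64_json))
```

Scoring remains manual: count how many of the 100 objects appear, and appear correctly, which matches the informal character of the eval Hata describes.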

The same improvement shows up in Andrew Mayne’s examples. He recalls older models struggling even to spell “OpenAI,” and compares that with the current model producing pages of text and finely detailed layouts. He also describes testing whether the model could produce pixel-accurate pixel art: given a 64-by-64 grid and asked to draw inside it, the model was able to place the art within the grid.

Aspect ratio is another major shift. Li says the model’s ability to render images in “any aspect ratio” surfaced as an emergent capability. Users began creating long panoramas and narrow bookmark-like images, and the same capability made 360-style outputs possible. OpenAI then added a way to view those images in a 360 environment on ChatGPT for web and mobile. Mayne says he used it to create a version of “dogs playing poker” that let the viewer sit inside the scene.

The model’s improvements also change the relationship between prompting and result quality. Mayne describes a feeling that almost anything he can “reasonably come up with” now produces a competent first result. Li says many users still come with vague requests — “make it better,” “make me look better,” “make me cuter” — and that the model and its surrounding harness have to translate those requests into likely intent. That translation is part of what she calls the model’s personality.

The speakers do not present better automation as a replacement for human direction. Mayne argues that as models improve, artists and visually fluent users gain more control, not less, because the model understands terms like depth of field and can respond to creative judgment. Li agrees: creative direction, taste, and judgment are the best ways to push the model further. In her view, Images 2.0’s strength is its ability to move across contexts — from an architectural diagram to a children’s book aesthetic — while preserving the user’s intended direction.

Photorealism came from iteration, post-training, and taste

Kenji Hata says the research team could see the model’s photorealistic improvement early in training. As checkpoints were sampled, the team compared outputs with Images 1. At one point, he says, they looked at a generated image — possibly a woman by the seaside — and concluded there was “no question” the new checkpoint was better. The contrast was strong enough that the older output suddenly looked poor by comparison.

We looked at it and we're like, alright, this is better than Images 1.

Kenji Hata

Mayne frames the visual difference as a move away from the “glossy idealized magazine cover” look toward something resembling a strong photograph. He also recalls earlier generations of image systems, including old DALL-E iterations and GAN-based experiments, where outputs could require interpretation or squinting before the object was recognizable. Hata’s description of the current leap is more practical: each release yielded learnings, and the team carried those learnings into the next model.

Speed was one constraint. Mayne asks how the model could become more intelligent without returning to the old DALL-E-era experience of waiting a long time for a result. Hata says one part of the answer was token efficiency: OpenAI did work to make the model produce very good images with fewer tokens.

Li emphasizes post-training. The model had to understand world knowledge — how things look, science concepts, math, and other structured content — while also learning what kinds of images users would find beautiful or realistic. The team had to ask what “taste” would resonate with users and how to make outputs aesthetically strong across professional and personal contexts. The model’s range made training an unusually interesting problem: it had to be useful for both a consulting-style deck and a playful consumer image.

The speakers’ examples of evaluation are informal but revealing. Li has a personal test she calls the “me me me eval”: about 100 photos of herself, friends, and family, used to generate cards, birthday images, and goofy scenarios. Because users know their own faces and close relationships best, she sees this as a good way to test not only raw image capability but whether ChatGPT understands context. She wants to know whether it can remember family relationships and preferences and insert meaningful personalization at the right moments.
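
A rough analogue of that eval can be built on the images edit endpoint, which for newer image models accepts multiple reference photos in one request. This is a sketch under that assumption; the file names, prompt, and model identifier are illustrative, not Li’s actual setup.

```python
# Loose sketch of Li's "me me me eval": personal reference photos in,
# a personalized scene out. Assumptions: "gpt-image-1" as the model,
# multi-image input to images.edit; files and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI()

reference_photos = ["me.png", "sister.png", "dad.png"]  # placeholder files

card = client.images.edit(
    model="gpt-image-1",
    image=[open(path, "rb") for path in reference_photos],
    prompt="A goofy birthday-card illustration of these three people "
           "riding a tandem bicycle together, crayon style.",
)

with open("birthday_card.png", "wb") as f:
    f.write(base64.b64decode(card.data[0].b64_json))
```

What the API version cannot test is the part Li cares most about: whether ChatGPT remembers relationships and preferences without being handed them in the prompt.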

Hata’s recurring evaluations include the 100-object grid and photorealism prompts. He says another researcher had a favorite prompt involving a woman holding a jug of orange juice. Mayne mentions classic stress tests: someone writing with their left hand, wearing a watch on the right wrist, with a clock showing a specified time; or a wine glass filled to the rim. Hata says he could sometimes coax earlier systems into handling such cases, but only with laborious description.

Users are asking a high-capability model to look imperfect

The model’s viral use has not moved only toward polished realism. Adele Li says a major consumer theme is authenticity, imperfection, and nostalgia. Users have taken popular images or photos of people and asked the model to render them in intentionally crude Microsoft Paint-like styles, crayon drawings, scribbles, stickers, and other playful formats.

Li treats that not as a contradiction but as a sign of the model’s intelligence. “It takes a lot of intelligence to actually create something that is imperfect,” she says. In her reading, consumers are not only seeking better-looking images. They are looking for ways to interact with AI that feel informal, goofy, and self-expressive. They want AI to help them look good while still showing imperfection.

The pattern complicates the assumption that image generation progress means ever more polished, synthetic-looking visual output. In the examples Li cites, high capability is used to simulate low capability: a model that can produce photorealistic scenes is also asked to make images that look handmade, childish, nostalgic, or deliberately rough.

Li connects this to OpenAI’s broader mission language around distributing intelligence, but her concrete claim is narrower and more product-specific: self-expression through AI can let people produce versions of themselves they could not easily create before. The appeal lies partly in the surprise. A vague prompt or playful style request can return something that feels personally resonant without requiring the user to know the full visual specification in advance.

Education and internal communication are becoming image-heavy

Kenji Hata says OpenAI has an internal alpha channel for testing models, including a sub-channel focused on educators from elementary school through graduate level. One example that stood out to him came from a biology professor who generated graduate-level, textbook-style renderings of material Hata says he did not himself understand; the professor reportedly said the generated pages were perfectly accurate.

Li’s education claim is that the model can distill complex topics into images that are easier to understand. Students and teachers are using image generation to learn concepts, create study guides, and produce personalized content. Personalized learning, in her framing, means a teacher could create material that each child understands in their own language and preference. OpenAI is thinking about bringing more image-generation elements into ChatGPT generally so that when people are learning concepts, ChatGPT can teach with images.

Mayne compares the effect to classroom posters: visual explanations that can hold attention and reward close inspection. Li says the shift is already visible inside OpenAI. More than 50% of slides in internal presentations are now created with Image Gen, according to her. She sees that as evidence of a broader change in how people communicate ideas: not just by writing, but by generating images that explain, structure, and compose information.

50%+
of internal presentation slides created with Image Gen, according to Li

The technical ingredients here are not only drawing quality. Li points specifically to infographics, text rendering, and composition. The model’s ability to decide not only what to say but how to present it is, in her words, “a superpower.” Future directions she names include improving composition, expanding output types, and making generated images more editable in the product.

Images plus Codex can turn visual concepts into working software

Adele Li says OpenAI is still early in discovering the full set of use cases people will push the model toward. The next stage is a “creative agent”: a system that works alongside the user as a creative assistant, understands how they work, learns their preferences, and helps them reach a desired output. Her examples are practical and service-like: a personal interior designer, architect, or wedding planner inside Image Gen.

Andrew Mayne gives an example from his own workflow. He asked ChatGPT to find his book cover and create correctly sized social media headers for platforms such as X or Facebook. The first result had the right aspect ratio and appropriate context. Li reads that as the model doing the research and returning promotional material in a style and format relevant to the user.
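
That research-then-render loop maps onto the Responses API, where a model can combine web search with a built-in image generation tool in a single request. A minimal sketch, assuming those tools; the book title is a placeholder and the model choice is illustrative.

```python
# Sketch of Mayne's header workflow: research the book, then render a
# header sized for the platform. Assumptions: Responses API with the
# web_search_preview and image_generation built-in tools; the title is
# a placeholder, not the book Mayne used.
import base64
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",
    tools=[
        {"type": "web_search_preview"},  # find the existing cover art
        {"type": "image_generation"},    # render the header
    ],
    input="Find the cover art of my book <BOOK TITLE> and design a "
          "social media header for X that matches its art direction, "
          "in that platform's recommended header aspect ratio.",
)

# Image tool calls come back as output items carrying base64 data.
for item in response.output:
    if item.type == "image_generation_call":
        with open("header.png", "wb") as f:
            f.write(base64.b64decode(item.result))
```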

The professional use cases Li names are already spreading across visual work. She says she has spoken with real estate agents using Image Gen for apartment listings and staging, YouTube creators using it for thumbnails and promotional content, and artists interested in connecting with fans. Her conclusion is broad, but still framed as a product view: for people in visual and creative industries, Image Gen is “such a hack” in the professional toolkit and, in her view, will become part of everyday workflows.

The more consequential claim is about the connection between Images and Codex. Many users begin by using Image Gen to design a website or app, then combine that aesthetic direction with coding agents. Strong image generation plus strong coding capability, Li says, means users can “zero-shot” apps from scratch. In that workflow, the image model is not only producing a mockup or an asset; it can provide visual direction that a coding agent then implements.

Mayne describes asking Codex to take his website and use Image Gen to create several concepts as contact sheets. He selected the upper-right concept and watched Codex implement it. He also mentions Codex using Image Gen to make sprites for a raven in a “Pets” context. Li says sprite sheets are going viral, as are game-design uses where people create new worlds.

Kenji Hata offers a practical sprite-sheet workflow: use a thinking mode or Codex, ask it to generate one initial sprite, then ask it to make the rest. Li adds that multi-image consistency has been strong, citing users creating 10-page comic books with consistent storylines and multi-page slide decks. She calls consistency of characters and aesthetics “completely unique to this model,” while also acknowledging that it is not perfect and remains an area for improvement.
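
Hata’s two-step recipe translates naturally into generate-then-edit: lock in one anchor sprite, then pass it back as a reference when asking for the rest of the sheet. The sketch assumes gpt-image-1 via the images API; the raven subject comes from the discussion, everything else is illustrative.

```python
# Sketch of the sprite-sheet workflow Hata describes: one anchor sprite
# first, then the remaining frames generated against it for consistency.
# Assumptions: "gpt-image-1" via images.generate / images.edit.
import base64
from openai import OpenAI

client = OpenAI()

# Step 1: a single anchor sprite to lock in the character design.
anchor = client.images.generate(
    model="gpt-image-1",
    prompt="A single pixel-art sprite of a raven, idle pose, "
           "flat plain background, retro game style.",
    size="1024x1024",
)
with open("raven_idle.png", "wb") as f:
    f.write(base64.b64decode(anchor.data[0].b64_json))

# Step 2: use the anchor as a reference for the remaining frames.
sheet = client.images.edit(
    model="gpt-image-1",
    image=open("raven_idle.png", "rb"),
    prompt="Using this exact raven design, draw a sprite sheet with "
           "walk, flight, and pecking animation frames.",
)
with open("raven_sheet.png", "wb") as f:
    f.write(base64.b64decode(sheet.data[0].b64_json))
```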

The product ambition is a visual creation layer that can reflect each person’s aesthetic, style, and preferences. OpenAI is still trying to improve the model’s ability to help people get to the desired output “easier and faster.” The target Li describes is not a single final image, but an iterative creative assistant that can preserve identity, style, character, and intent across multiple artifacts — and, when paired with coding tools, carry some of that visual direction into software.

The best prompts give the model room to reason and a style to aim at

The prompt advice is consistent with the broader product direction. Adele Li recommends trying “image gen thinking” through the thinking or pro models. In that experience, users get a more powerful version of Image Gen that can search the web, analyze files, and use tools under the hood, producing better quality and stronger composition.

For that mode, Li recommends being open-ended. Rather than over-constraining the model, users should let it explore, reason, and find the information that matters. At the same time, it helps to give the model an aesthetic direction or grounding style. The useful combination is not vagueness alone, but an open task paired with taste.

Kenji Hata gives a simpler version of the same advice: be particular about style and what you like. His own preference is minimalist infographics, because he finds the model can sometimes be dense. Telling it to produce a clean, simple look helps steer it toward the result he wants.
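
Put together, the advice amounts to one pattern: leave the task open, pin down the taste. The paired prompts below illustrate it; the wording is ours, not quoted from the episode.

```python
# Two versions of the same request, illustrating the open-task-plus-taste
# pattern Li and Hata describe. Wording is illustrative.

# Over-constrained: dictates layout details the model could reason out.
prompt_constrained = (
    "Infographic, six boxes in two rows, blue headers, 14pt body text, "
    "explaining how photosynthesis works."
)

# Open task plus aesthetic direction: states the goal and the style,
# leaving composition and research to the model.
prompt_open = (
    "Explain how photosynthesis works as an infographic a curious "
    "ninth-grader would enjoy. Minimalist and clean, lots of white "
    "space, at most two accent colors."
)
```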

Images 2.0 can infer much more than earlier models, but the speakers still present direction as valuable. It can turn vague requests into plausible outputs, render text, preserve characters across multiple images, work across aspect ratios, and draw on contextual knowledge. But the strongest results, by their account, come when users bring judgment: a desired aesthetic, a sense of what should be simple or dense, and enough creative direction for the model to act as an assistant rather than a slot machine.
