How DALL-E 2 could solve major computer vision challenges

OpenAI recently introduced DALL-E 2, a more advanced version of DALL-E, a multimodal AI system capable of generating images based entirely on text descriptions. DALL-E 2 uses advanced deep learning techniques that improve the quality and resolution of the generated images, and it adds new capabilities such as editing an existing image or creating new versions of it.

Although many AI enthusiasts and researchers have touted how wonderful DALL-E 2 is for creating art and images out of thin air, in this article I want to explore a different application for this powerful text-to-image model: generating datasets to solve computer vision’s biggest challenges.

Caption: DALL-E 2 generated image. “A rabbit spy is sitting on a park bench and reading a newspaper in a Victorian setting.” Source: Twitter

The shortcomings of computer vision

Computer vision applications range from detecting benign tumors in CT scans to enabling self-driving cars. Despite this variety, what they all have in common is the need for abundant data. One of the best-known predictors of a deep learning algorithm’s performance is the size of the underlying dataset it was trained on. For example, the JFT dataset, an internal Google dataset used for training image classification models, contains over 300 million images and more than 375 million labels.

Consider how an image classification model works: a neural network transforms pixel colors into a set of numbers that represent its features, also known as the “embedding” of the input. Those features are then mapped to the output layer, which contains a score for each class of images the model is supposed to detect. During training, the neural network tries to learn the best feature representations that discriminate between the classes, e.g., a Doberman vs. a poodle.
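
To make this concrete, here is a minimal, illustrative PyTorch sketch of that pixels-to-embedding-to-scores flow (the tiny architecture and shapes are my own toy example, not a real production model):

```python
import torch
import torch.nn as nn

# Illustrative only: a tiny network that maps pixels to an embedding,
# then maps that embedding to one score (logit) per class.
class TinyClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(           # pixels -> feature embedding
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(16, num_classes)   # embedding -> per-class scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# e.g., a two-class Doberman-vs-poodle toy model applied to a random image
logits = TinyClassifier(num_classes=2)(torch.randn(1, 3, 224, 224))
```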

Ideally, the machine learning model would learn to generalize across different lighting conditions, angles, and background environments. Yet more often than not, deep learning models learn the wrong representations. For example, a neural network might deduce that blue pixels are a feature of the “frisbee” class because all the images of a frisbee it has seen during training were on the beach.

One promising way of addressing such shortcomings is to increase the size of the training set, e.g., by adding more pictures of frisbees with different backgrounds. Yet this exercise can prove to be a costly and lengthy endeavor.

First, you would need to collect all the required samples, e.g., by searching online or by capturing new images. Then, you would need to ensure each class has enough labels to prevent the model from overfitting or underfitting to some of them. Lastly, you would need to label each image, stating which image corresponds to which class. In a world where more training data translates into a better-performing model, these three steps act as a bottleneck for achieving state-of-the-art performance.

Even then, computer vision models are easily fooled, especially if they are being attacked with adversarial examples. Guess what is another way to mitigate adversarial attacks? You guessed it – more labeled, well-curated, and diverse data.

Caption: OpenAI’s CLIP incorrectly classified an apple as an iPod due to a textual label. Source: OpenAI

Enter DALL-E 2

Let’s take the example of a dog breed classifier and a class for which images are a bit harder to find – Dalmatian dogs. Can we use DALL-E to solve our data scarcity problem?

With DALL-E 2 at hand, consider applying the following techniques (a code sketch follows the list):

  • Vanilla usage. Feed the class name to DALL-E as part of the textual prompt and add the generated images to that class’s labels. For example, “A Dalmatian dog chases a bird in the park.”
  • Different environments and styles. To improve the model’s ability to generalize, use prompts with different environments while keeping the same class. For example, “A Dalmatian dog chases a bird on the beach.” The same applies to the style of the generated image, e.g., “A Dalmatian dog chases a bird in the park in the style of a cartoon.”
  • Adversarial samples. Use the class name to create a dataset of adversarial examples. For instance, a Dalmatian-like car.
  • Variations. One of DALL-E’s new features is its ability to generate multiple variations of an input image. It can also take a second image and fuse the two by combining the most prominent aspects of each. One can then write a script that feeds every existing image in a dataset to generate dozens of variations per class.
  • Inpainting. DALL-E 2 can also make realistic edits to existing images, adding and removing elements while taking shadows, reflections, and textures into account. This can be a powerful data augmentation technique to further train and enhance the underlying model.
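
As a rough illustration of the first two techniques, here is a hedged Python sketch. Note that DALL-E 2 had no public API at the time of writing, so the openai.Image.create call below is a hypothetical stand-in modeled on the shape of OpenAI’s Image endpoint; the class name, environments, and styles are illustrative:

```python
import openai  # assumes API access; DALL-E 2 had no public API at the time of writing

openai.api_key = "YOUR_API_KEY"  # placeholder

CLASS_NAME = "Dalmatian dog"
ENVIRONMENTS = ["in the park", "on the beach", "in the snow"]
STYLES = ["", "in the style of a cartoon"]

# Build prompts that vary environment and style while keeping the class fixed.
prompts = [
    f"A {CLASS_NAME} chases a bird {env} {style}".strip()
    for env in ENVIRONMENTS
    for style in STYLES
]

for prompt in prompts:
    # Hypothetical call, modeled on the shape of OpenAI's Image API.
    response = openai.Image.create(prompt=prompt, n=4, size="1024x1024")
    # Every returned image can be stored under the "Dalmatian" class label.
```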

Aside from generating more training data, the big advantage in all of the above techniques is that the newly generated images are already labeled, removing the need for a human labeling workforce.

While image generation techniques such as generative adversarial networks (GANs) have been around for a while, DALL-E 2 stands out with its high-resolution 1024×1024 generations, its versatility in turning text into images, and its strong semantic consistency, i.e., understanding the relationships between the objects in an image.

Automatic dataset creation using GPT-3 + DALL-E

DALL-E’s input is a textual prompt describing the image we want it to generate. We can leverage GPT-3, a text-generating model, to produce dozens of textual prompts per class, which are then fed into DALL-E, which in turn creates dozens of images that are stored per class.

For example, we can generate prompts that include a variety of environments in which we want DALL-E to create images of dogs.

Caption: GPT-3 generated prompt to use as input in DALL-E. Source: Author

Using this example and a template-like sentence such as “A [class_name] [gpt3_generated_actions]”, we can feed DALL-E the following prompt: “A Dalmatian is sleeping on the ground.” This can be further optimized by fine-tuning GPT-3 to produce dataset captions such as the one in the OpenAI Playground example above. A minimal sketch of the pipeline follows.
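
Here is one way that GPT-3-to-DALL-E pipeline might look, assuming OpenAI API access; the engine choice, the instruction prompt, and the post-processing of GPT-3’s raw text are all illustrative assumptions:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Step 1: ask GPT-3 for short actions to slot into the caption template.
completion = openai.Completion.create(
    engine="text-davinci-002",  # illustrative engine choice
    prompt="List 10 short actions a dog might be doing, one per line:",
    max_tokens=150,
)
actions = [
    line.lstrip("0123456789.- ").strip()
    for line in completion.choices[0].text.splitlines()
    if line.strip()
]

# Step 2: fill the template "A [class_name] [gpt3_generated_actions]".
class_name = "Dalmatian"
dalle_prompts = [f"A {class_name} {action}" for action in actions]
# e.g., "A Dalmatian is sleeping on the ground." -> fed to DALL-E
```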

To increase confidence in the newly added samples, one can set a certainty threshold and select only the generations that pass a certain ranking, since each generated image is ranked by an image-to-text model called CLIP (a sketch of such a filter is below).
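
As a sketch of such a filter, the snippet below scores each generated image against its intended caption using OpenAI’s open-source CLIP package; the threshold value is an arbitrary placeholder to be tuned per dataset:

```python
# pip install git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between an image and its intended caption."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    return (image_features @ text_features.T).item()

# Keep only generations whose score clears an (arbitrary) threshold.
THRESHOLD = 0.25  # illustrative value; tune per dataset
if clip_score("generated.png", "A Dalmatian is sleeping on the ground.") >= THRESHOLD:
    pass  # add the image to the dataset
```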

Limitations and mitigation

If not used carefully, DALL-E can generate inaccurate images or images of a narrow scope, excluding certain ethnic groups or disregarding traits that might lead to bias. A simple example would be a face detector that was only trained on images of men. Moreover, using images generated by DALL-E carries a significant risk in certain domains, such as pathology or self-driving cars, where the cost of a false negative is extreme.

DALL-E 2 itself still has some limitations, with compositionality being one of them. Relying on prompts that, for example, assume the correct positioning of objects can be risky.

Caption: DALL-E still struggles with some prompts. Source: Twitter

Ways to mitigate this include human sampling, where a human expert randomly selects samples to test their validity. To optimize such a process, one can follow an active-learning approach where images that got the lowest CLIP ranking for a given caption are prioritized for review (a minimal sketch follows).
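
A minimal sketch of that prioritization, reusing the clip_score helper from the earlier snippet (the file names and review budget are illustrative):

```python
# Assumes the clip_score(image_path, caption) helper from the previous sketch.
generated = [  # (path, intended caption); file names are illustrative
    ("img_001.png", "A Dalmatian is sleeping on the ground."),
    ("img_002.png", "A Dalmatian dog chases a bird on the beach."),
]

# Lowest CLIP score first: the weakest image-caption matches get reviewed first.
review_queue = sorted(generated, key=lambda pair: clip_score(*pair))

REVIEW_BUDGET = 50  # illustrative; set by available expert time
for path, caption in review_queue[:REVIEW_BUDGET]:
    print(f"Review: {path} (intended caption: {caption!r})")
```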

Final words

DALL-E 2 is yet another exciting research result from OpenAI that opens the door to new kinds of applications. Generating huge datasets to tackle one of computer vision’s biggest bottlenecks – data – is just one example.

OpenAI has signaled it will release DALL-E sometime during this coming summer, most likely in a phased release with pre-screening for interested users. Those who can’t wait, or who are unable to pay for this service, can tinker with open-source alternatives such as DALL-E Mini (interface, playground repository).

While the business case for many DALL-E-based applications will depend on the pricing and policy OpenAI sets for its API users, they are all certain to take a big leap forward in image generation.

Sahar Mor has 13 years of engineering and product management experience focused on AI products. He is currently a product manager at Stripe, leading strategic data initiatives. Previously, he founded AirPaper, a document intelligence API powered by GPT-3, and was the founding product manager at Zeitgold (acq. by Deel), a B2B AI accounting software company, where he built and scaled its human-in-the-loop product, and at Levity.ai, a no-code AutoML platform. He also worked as an engineering manager in early-stage startups and at the elite Israeli intelligence unit 8200.

DataDecisionMakers

Welcome to the VentureBeat community!

DataDecisionMakers is where experts, including tech people working on data, can share data-related insights and innovations.

If you want to read about cutting-edge ideas, up-to-date information, best practices, and the future of data and data tech, join us at DataDecisionMakers.

You might even consider contributing an article of your own!

Read more from DataDecisionMakers
