As mentioned in my previous post, I've recently taken a great interest in artificial intelligence, but that doesn't mean I've lost interest in 3D development. In fact, I've become fascinated with the concept of using 3D software to create photorealistic, synthetic training datasets for image recognition tasks. The question is, can we create images that are realistic enough to fool artificial intelligence?
The Current Data Problem
I've found, from both researching and experimenting, that one of the biggest challenges facing AI researchers today is the lack of correctly annotated data to train their algorithms. In order to learn, artificial intelligence algorithms need to see thousands of examples that are correctly labeled. In image recognition tasks, that means picking out which pixels contain the objects you are looking for. Here's an example:
In the example above, I took a picture of a spider I found on my patio, then manually drew a bounding box and a pixel mask. This is the kind of annotation an AI would need to learn. While it took me zero seconds to spot the spider in my peripheral vision, it took me a minute to get my phone out and capture a candid shot, and another couple of minutes to do a pretty sloppy job creating the annotation back at my computer. Note that the orange color is just for illustration. The AI would make its best guess on the unaltered image, then compare it to the correct bounding box and the shaded pixels defined in an accompanying annotation file.
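One common way to score that comparison for bounding boxes is intersection over union (IoU). I'm not claiming any particular AI uses exactly this; it's a minimal sketch, and the (x, y, width, height) box layout and the `box_iou` name are assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), an assumed layout


def box_iou(guess: Box, truth: Box) -> float:
    """Return overlap area divided by combined area, from 0.0 to 1.0."""
    gx, gy, gw, gh = guess
    tx, ty, tw, th = truth

    # Width and height of the overlapping rectangle (zero if the boxes miss).
    overlap_w = max(0.0, min(gx + gw, tx + tw) - max(gx, tx))
    overlap_h = max(0.0, min(gy + gh, ty + th) - max(gy, ty))
    intersection = overlap_w * overlap_h

    union = gw * gh + tw * th - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical numbers: the guess is shifted by half a box in both directions.
score = box_iou((5, 5, 10, 10), (0, 0, 10, 10))
print(f"IoU: {score:.3f}")
```

A score of 1.0 means the guess matches the annotation exactly, and 0.0 means the two boxes don't overlap at all.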
The Case for Synthetic Datasets
Let's say you have painstakingly collected 1,000 images (actually considered to be a tiny dataset) of spiders you want to classify and it takes you about 2 minutes to properly annotate each one. 1,000 x 2 / 60 = 33.3 hours. Add in breaks and you have a full 40-hour work week, not including the time it takes to find all of those spiders. Not sure about you, but I don't have that kind of time and even if I did, I wouldn't want to spend it on spider pics. In fact, one of the biggest real, annotated image datasets, called COCO (Common Objects in COntext), contains more than 200,000 labeled images and took 70,000 person-hours (on Amazon Mechanical Turk) to fully annotate. 70,000 / 200,000 x 60 = 21 minutes per image. Why such a high number? COCO annotations label everything in complex scenes. For example, let's look at a couple of photos I took that are similar to what you'd find in the COCO dataset. If you wanted to fully annotate the turtle image below, you'd have to draw a separate mask for each of the at least 10 overlapping turtles, plus the logs and sticks. For the restaurant scene, you'd need to draw a separate mask for each person, light bulb, plant, chair, table, window, condiment, etc. Sounds awful.
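If you want to plug in your own numbers, the arithmetic above is easy to script. This is just a small sketch; the function names are made up for illustration.

```python
def annotation_hours(num_images: int, minutes_per_image: float) -> float:
    """Total annotation time in hours for a dataset."""
    return num_images * minutes_per_image / 60


def average_minutes_per_image(total_hours: float, num_images: int) -> float:
    """Average annotation time per image, in minutes."""
    return total_hours * 60 / num_images


# The two cases from the text: my spider dataset and COCO.
spider_hours = annotation_hours(1_000, 2)
coco_minutes = average_minutes_per_image(70_000, 200_000)

print(f"Spider dataset: {spider_hours:.1f} hours")
print(f"COCO: {coco_minutes:.0f} minutes per image")
```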
One thing that AI researchers often do is called "augmentation". Augmentation is the automated process of creating many variations of the same image to cheaply create more labeled images. For example, we might flip the image, rotate it a few degrees, zoom in a bit, etc. You can create a lot of variations this way, multiplying the size of your dataset, and it definitely helps to train the AI. Here is the original plus four examples of augmentation that could help the AI to learn what spiders look like in different orientations and sizes.
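The important detail is that the annotation has to be transformed along with the image. Here's a minimal sketch of a horizontal flip in plain Python, with the image or mask stored as a list of rows. The (x, y, width, height) box layout and the function names are assumptions for illustration; a real pipeline would more likely use an image library.

```python
from typing import List, Tuple

Mask = List[List[int]]            # 1 where the object is, 0 elsewhere
Box = Tuple[int, int, int, int]   # (x, y, width, height), an assumed layout


def flip_rows(grid: List[list]) -> List[list]:
    """Mirror an image or a mask left to right."""
    return [row[::-1] for row in grid]


def flip_box(box: Box, image_width: int) -> Box:
    """Mirror a bounding box so it still surrounds the flipped object."""
    x, y, w, h = box
    return (image_width - x - w, y, w, h)


# Tiny 3x5 example: the object occupies the two leftmost pixels of row 1.
mask: Mask = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
box: Box = (0, 1, 2, 1)

flipped_mask = flip_rows(mask)
flipped_box = flip_box(box, image_width=len(mask[0]))
```

After the flip, the object sits in the two rightmost pixels of row 1 and the box has moved with it, so the new image comes with a correct label for free.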
What if we went a bit deeper? Could we put the same spider on a different background? Here's something I came up with using the GIMP image editor:
It's the exact same spider, cut out and superimposed over a different background, and it uses the exact same bounding box and mask. It's not hard to imagine how this technique, combined with augmentation, could be used to generate tons of variations without any manual work, once the original cutout is done. Pretty cool, right?
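I did this one by hand in GIMP, but the same idea can be sketched in a few lines of code. This is an illustration rather than what GIMP does internally: images are plain lists of rows, the mask is hard-edged (no soft blending), and the `composite` name is made up.

```python
from typing import List

Image = List[List[int]]  # grayscale pixel values, one list per row
Mask = List[List[int]]   # 1 where the cutout is, 0 elsewhere


def composite(background: Image, foreground: Image, mask: Mask) -> Image:
    """Use the foreground pixel where the mask is set, else the background."""
    return [
        [fg if m else bg for bg, fg, m in zip(bg_row, fg_row, m_row)]
        for bg_row, fg_row, m_row in zip(background, foreground, mask)
    ]


# Hypothetical 2x3 images: 7 stands in for spider pixels, 0 for the old
# background, and 5 for the new background.
original = [[0, 7, 7], [0, 0, 7]]
spider_mask = [[0, 1, 1], [0, 0, 1]]
new_background = [[5, 5, 5], [5, 5, 5]]

result = composite(new_background, original, spider_mask)
```

Because the spider pixels land in the same positions, `spider_mask` and its bounding box describe `result` just as well as they described `original`.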
Brace yourself while I take it a step further. Imagine that I had created (or paid for) a realistic 3D model of a spider that could be rendered in tons of different poses. I could then automatically create even more interesting variations on different backgrounds. Since I'm rendering the image, no one needs to manually annotate it because the same 3D software that creates the realistic render can also save a mask image. Even cooler.
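Once you have a mask image, the bounding box comes for free too. Here's a minimal sketch, assuming the mask has already been loaded as a list of rows of 0s and 1s; the function name and the (x, y, width, height) layout are my own choices for illustration.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]            # 1 where the object is, 0 elsewhere
Box = Tuple[int, int, int, int]   # (x, y, width, height), an assumed layout


def box_from_mask(mask: Mask) -> Optional[Box]:
    """Return the tightest box around all set pixels, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None  # nothing was rendered into this mask

    x_min, x_max = min(cols), max(cols)
    y_min, y_max = min(rows), max(rows)
    return (x_min, y_min, x_max - x_min + 1, y_max - y_min + 1)


# Hypothetical 4x5 mask standing in for a rendered mask image.
rendered_mask: Mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = box_from_mask(rendered_mask)
```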
I don't have any 3D rendered versions to show you, so use your imagination. We're done with spiders.
Real World Examples
Synthetic image datasets are not an original or recent idea. I've found many examples of people using synthetically generated datasets to train AI with impressive success.
One of my favorites is Spilly, a startup that superimposed 3D rendered human models on random images in tons of different poses to train an AI that could find people in videos. (Note that since a video is merely a set of images played in rapid succession, the task of finding things in videos is pretty much identical to finding things in photos.) They got very impressive results by training with synthetic images, and then fine-tuning with a smaller set of real images. Here's their blog post about it and here's an absurd video demo they put together:
SynthText in the Wild Dataset
This is a synthetic dataset of 800,000 images that places fake text on top of real images. Check out the website, and an example:
A research paper titled "Towards End-to-end Text Spotting with Convolutional Recurrent Neural Networks" utilized the dataset to train an AI to find text in the wild on street signs, business signs, etc. with impressive success. They started with synthetic images and then moved on to real world images for further training.
AirSim uses Unreal Engine to simulate realistic 3D worlds, then outputs specially annotated images (depth, segmentation, RGB). It's designed to be used by artificial intelligence on drones. Here's their GitHub repo, and here's a video showing what it looks like:
My Own Attempts
Obviously I'm pretty intrigued by this concept, so I've decided to try it out myself. The task I've chosen is to identify cigarette butts in photos. I find discarded cigarette butts extremely irritating, and the fact that smokers nonchalantly litter them EVERYWHERE really pisses me off. If I could teach an AI to recognize cigarette butts, then I could theoretically attach that functionality to a small, autonomous robot that could pick them up. Wouldn't it be wonderful?
First, I had the idea to create the images 100% synthetically. My plan was to learn how to create realistic grass in Blender, then position cigarette butts in the scene and render from lots of different angles. After a couple days making mediocre grass, I realized this was probably overkill, so I decided to superimpose 3D cigarette butts over 2D photos I took down the street from my house (not my weeds!). Here are a couple examples:
They don't look great, but they might be good enough to train an AI. I'm currently working on a pipeline that will generate thousands of them automatically so that I can train an existing, open-source AI like Matterport's Mask R-CNN. I'm sure I can make them look better than this too, since I'm getting better at Blender every day. I'll share my results once I get a chance to test a decent-sized synthetic dataset.
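To give a rough idea of the shape such a pipeline could take, here's a toy sketch. It is not my actual pipeline: the images are plain lists of rows rather than real photos and Blender renders, and every name in it (`paste`, `generate_samples`, the sample dictionary keys) is hypothetical.

```python
import random
from typing import Dict, List

Grid = List[List[int]]  # an image or a mask stored as a list of rows


def paste(background: Grid, cutout: Grid, cutout_mask: Grid,
          left: int, top: int):
    """Paste the masked cutout onto a copy of the background.

    Returns the new image and a full-size mask of the pasted pixels.
    """
    image = [row[:] for row in background]
    mask = [[0] * len(row) for row in background]
    for r, (pixel_row, mask_row) in enumerate(zip(cutout, cutout_mask)):
        for c, (pixel, is_object) in enumerate(zip(pixel_row, mask_row)):
            if is_object:
                image[top + r][left + c] = pixel
                mask[top + r][left + c] = 1
    return image, mask


def generate_samples(backgrounds: List[Grid], cutout: Grid,
                     cutout_mask: Grid, count: int,
                     seed: int = 0) -> List[Dict]:
    """Create `count` images with the cutout at random positions."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    cut_h, cut_w = len(cutout), len(cutout[0])
    samples = []
    for _ in range(count):
        background = rng.choice(backgrounds)
        top = rng.randint(0, len(background) - cut_h)
        left = rng.randint(0, len(background[0]) - cut_w)
        image, mask = paste(background, cutout, cutout_mask, left, top)
        # The box is the cutout's rectangle; a tighter box could be
        # computed from the mask instead.
        samples.append({"image": image, "mask": mask,
                        "bbox": (left, top, cut_w, cut_h)})
    return samples


# Hypothetical data: one blank 5x6 background and a 2x2 cutout.
backgrounds = [[[0] * 6 for _ in range(5)]]
butt = [[9, 9], [9, 9]]
butt_mask = [[1, 0], [1, 1]]
samples = generate_samples(backgrounds, butt, butt_mask, count=10)
```

Every sample comes out with its mask and bounding box already attached, which is the whole point: no manual annotation step.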