ML-Agents Platformer: Visual Coin Collector

unity_ml_agents_camera_vision_coin_collector.png

Intro

In this tutorial, you’ll learn how to give your agent visual input from a camera instead of ray perception.

Prerequisites

If you didn’t come here from the ML-Agents Platformer: Simple Coin Collector tutorial, work through that one first so that you’re starting from the same spot.

Companion YouTube Video

Check out the YouTube video, where I explain the whole thing in more detail and with extra context.

Camera Vision | Unity ML-Agents

Replace Ray Perception with a Camera

  • First, remove the Ray Perception Sensor from the agent.

  • Next, make sure Use Child Sensors is enabled on the Behavior Parameters component.

use_child_sensors.png

Note: if you still have an NNModel hooked up from a training run that didn’t have any Visual Observations, you will get a warning. The warning in the image above is basically saying: “Hey, this neural network wasn’t trained with camera input, so it won’t work the way you expect.”

  • Add a Camera object as a child of the Character object. Set the following parameters:

  • Position: 0, 1.1, -0.75

  • Rotation: 15, 0, 0

  • Field of View: 90

camera_object_setup.png

  • Remove the Audio Listener component.

  • Add a Camera Sensor to the Camera object.

  • Drag the Camera object from the Hierarchy to the Camera field.

  • SensorName: CameraSensor

  • Width/Height: 84/84

  • Grayscale: Disabled

  • Observation Stacks: 2

  • Compression: PNG

This will create an 84 pixel by 84 pixel RGB image that gets passed into your neural net. Setting Observation Stacks to 2 means the network sees both the current frame and the previous frame. This may not be necessary, but in theory it allows the network to detect motion.

camera_sensor_setup.png
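
If you’d rather wire this up from a script instead of the Inspector, here is a minimal sketch of the same setup. The AgentCameraSetup helper below is hypothetical (attach it to the Character object); it uses the CameraSensorComponent and SensorCompressionType types from the Unity.MLAgents.Sensors namespace, and property names like ObservationStacks may vary slightly between ML-Agents versions.

    using UnityEngine;
    using Unity.MLAgents.Sensors;

    // Hypothetical helper: attach to the Character object to recreate the
    // camera + Camera Sensor setup described above from code.
    public class AgentCameraSetup : MonoBehaviour
    {
        void Awake()
        {
            // Create the camera as a child of the Character object.
            var camObject = new GameObject("AgentCamera");
            camObject.transform.SetParent(transform, false);
            camObject.transform.localPosition = new Vector3(0f, 1.1f, -0.75f);
            camObject.transform.localRotation = Quaternion.Euler(15f, 0f, 0f);

            // A camera created from script has no Audio Listener,
            // so there is nothing to remove here.
            var cam = camObject.AddComponent<Camera>();
            cam.fieldOfView = 90f;

            // Configure the Camera Sensor with the same values as the Inspector setup.
            var sensor = camObject.AddComponent<CameraSensorComponent>();
            sensor.Camera = cam;
            sensor.SensorName = "CameraSensor";
            sensor.Width = 84;
            sensor.Height = 84;
            sensor.Grayscale = false;
            sensor.ObservationStacks = 2;
            sensor.CompressionType = SensorCompressionType.PNG;
        }
    }

Because Use Child Sensors is enabled on the Behavior Parameters, the agent picks up this sensor automatically at runtime, just like the Inspector-configured version.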

If you want to see what the neural network sees, set up a new 84x84 aspect ratio for the Game view display.

84x84_display.png
84x84_preview.png
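
If you’d rather not change the Game view aspect ratio, another option is to mirror the agent camera into a small RenderTexture and display it on a UI RawImage. This is purely an optional debugging aid and not part of the tutorial setup; the agentCamera and previewImage fields below are assumptions you would assign in the Inspector.

    using UnityEngine;
    using UnityEngine.UI;

    // Optional debug helper: mirrors the agent camera into an 84x84
    // RenderTexture and shows it on a RawImage, so you can watch roughly
    // what the Camera Sensor captures while the game runs.
    public class AgentViewPreview : MonoBehaviour
    {
        public Camera agentCamera;    // the camera used by the Camera Sensor
        public RawImage previewImage; // a RawImage on a screen-space Canvas

        RenderTexture m_PreviewTexture;

        void Start()
        {
            m_PreviewTexture = new RenderTexture(84, 84, 24);
            agentCamera.targetTexture = m_PreviewTexture;
            previewImage.texture = m_PreviewTexture;
        }

        void OnDestroy()
        {
            if (m_PreviewTexture != null)
            {
                m_PreviewTexture.Release();
            }
        }
    }

The Camera Sensor renders the camera on its own when it collects observations, so redirecting the display output like this shouldn’t change what the agent actually sees.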

Training

In the previous tutorial, we set up our config file to use the simple visual encoder.

    network_settings:
      vis_encode_type: simple

According to the ML-Agents documentation:

(default = simple) Encoder type for encoding visual observations.

simple (default) uses a simple encoder which consists of two convolutional layers, nature_cnn uses the CNN implementation proposed by Mnih et al., consisting of three convolutional layers, and resnet uses the IMPALA Resnet consisting of three stacked layers, each with two residual blocks, making a much larger network than the other two. match3 is a smaller CNN (Gudmundsson et al.) that is optimized for board games, and can be used down to visual observation sizes of 5x5.

So by using a simple encoder, we’re not doing anything fancy. If you’re interested in more advanced visual processing, you may want to experiment with the other encoder options.

The training command is identical to any other training run.

It can take longer to train with visual inputs for a couple of reasons:

  1. The more cameras rendering per scene, the lower the framerate, which slows down the simulation.

  2. The neural network is much larger, due to the large number of pixel inputs as well as the added convolutional layers.

For reference, with 8 simultaneous training areas, mine finished training in about 15 minutes.