How to Filter the COCO Dataset by Category


The COCO dataset has a lot of categories, but you may need to filter it down to just one or a few. You probably already have a JSON annotation file, but in case you don't, check out http://cocodataset.org/#download and get the 2017 Train/Val annotations.

For example, if you want to filter the COCO dataset to only contain people and cars, this guide will help.

Note that this guide covers instance annotations only, not the other types of annotations (e.g. stuff). Let us know if you are interested in those.

Filtering with COCO-Manager

Want to just get it done as fast as possible?

Use the filter.py script from the coco-manager GitHub repo.

This script will:

  • Look through your annotation file (e.g. 'instances_val2017.json')
  • Remove any extra categories
  • Give the categories new ids (counting up from 1)
  • Find any annotations that reference the desired categories
  • Filter out extra annotations
  • Filter out images not referenced by any annotations
  • Save a new json file

For example usage, please see the README.md file in the repo.

Understanding

This should help improve your understanding of the solution.

If you need a video walk-through of the COCO dataset, check this video out.

Instances annotations for the COCO dataset are broken up into the following sections:

  • info
  • licenses
  • images
  • annotations
  • categories
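
To check these sections in your own file, a minimal Python sketch along the following lines may help. The function name `summarize_sections` and the tiny in-memory `sample` are made up for illustration; with a real file you would load it with `json.load` first, as the trailing comment suggests.

```python
import json

def summarize_sections(coco):
    """Return the number of entries in each top-level section of a COCO dict."""
    summary = {}
    for key in ("info", "licenses", "images", "annotations", "categories"):
        value = coco.get(key)
        if isinstance(value, list):
            summary[key] = len(value)   # list sections: count the entries
        else:
            # "info" is a single object; count it as 1 if present, 0 if missing
            summary[key] = 0 if value is None else 1
    return summary

# Tiny stand-in for a real annotation file (hypothetical data).
sample = {
    "info": {"description": "toy example"},
    "licenses": [{"id": 4, "name": "example license"}],
    "images": [{"id": 500663, "file_name": "000000500663.jpg"}],
    "annotations": [{"id": 72296, "image_id": 500663, "category_id": 21}],
    "categories": [{"supercategory": "person", "id": 1, "name": "person"}],
}
print(json.dumps(summarize_sections(sample)))

# With a real file (the path is an assumption about where you saved it):
# with open("instances_val2017.json") as f:
#     print(json.dumps(summarize_sections(json.load(f))))
```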

Info and Licenses

These are purely informational and will likely remain unchanged when you filter. You could remove licenses that don't match up with any images, but that's unnecessary.

Categories

This is the list of all categories of items that appear in the dataset:

"categories": [{"supercategory": "person","id": 1,"name": "person"},{"supercategory": "vehicle","id": 2,"name": "bicycle"},{"supercategory": "vehicle","id": 3,"name": "car"},{"supercategory": "vehicle","id": 4,"name": "motorcycle"},...

Each category has a numerical id, a name, and a supercategory. If we want to filter to just "person" and "motorcycle", we need to find annotations that contain category ids 1 and 4.

Note that if we're going to filter the dataset, it probably makes sense to remove the extra categories and give the remaining ones new ids. This is what I did with filter.py.
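
As a rough illustration of that renumbering idea (this is not the actual code from filter.py), the sketch below keeps the named categories and assigns new ids counting up from 1. The function name `remap_categories` is hypothetical, and keeping the original list order is an assumption.

```python
def remap_categories(categories, keep_names):
    """Keep only the named categories and renumber them starting from 1.

    Returns (new_categories, old_to_new), where old_to_new maps each
    original category id to its replacement id.
    """
    keep = set(keep_names)
    new_categories = []
    old_to_new = {}
    for cat in categories:  # assumption: preserve the original ordering
        if cat["name"] in keep:
            new_id = len(new_categories) + 1
            old_to_new[cat["id"]] = new_id
            # Copy the entry so the input list is left untouched.
            new_categories.append({**cat, "id": new_id})
    missing = keep - {cat["name"] for cat in new_categories}
    if missing:
        raise ValueError(f"unknown category names: {sorted(missing)}")
    return new_categories, old_to_new

categories = [
    {"supercategory": "person", "id": 1, "name": "person"},
    {"supercategory": "vehicle", "id": 2, "name": "bicycle"},
    {"supercategory": "vehicle", "id": 3, "name": "car"},
    {"supercategory": "vehicle", "id": 4, "name": "motorcycle"},
]
new_categories, old_to_new = remap_categories(categories, ["person", "motorcycle"])
print(old_to_new)  # {1: 1, 4: 2}
```

The `old_to_new` mapping is the useful part: any annotation whose "category_id" is not a key in it belongs to a category that was dropped.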

Images

This is a list of images. Here's an example:

{"license": 4,"file_name": "000000500663.jpg","coco_url": "http://images.cocodataset.org/val2017/000000500663.jpg","height": 480,"width": 640,"date_captured": "2013-11-17 18:01:17","flickr_url": "http://farm3.staticflickr.com/2452/4046745441_5a2f435499_z.jpg","id": 500663}

The key elements are the "file_name" and the "id". There's nothing here that says which categories or annotations are contained in the image. That means we have to look at the annotations to figure out which images to remove.
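
Since the image entries carry no category information, one way to find out what each image contains is to build an index from the annotations. The sketch below is only an illustration; `categories_per_image` is a hypothetical helper, and all annotation values other than the first are invented.

```python
from collections import defaultdict

def categories_per_image(annotations):
    """Map each image id to the set of category ids annotated in it."""
    index = defaultdict(set)
    for ann in annotations:
        index[ann["image_id"]].add(ann["category_id"])
    return dict(index)

# Toy annotations with only the fields this sketch needs (hypothetical data).
annotations = [
    {"id": 72296, "image_id": 500663, "category_id": 21},
    {"id": 1001, "image_id": 500663, "category_id": 1},
    {"id": 1002, "image_id": 42, "category_id": 4},
]
index = categories_per_image(annotations)

# Images that contain at least one "person" (category id 1):
images_with_person = [img_id for img_id, cats in index.items() if 1 in cats]
print(images_with_person)  # [500663]
```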

Annotations

This is a list of annotations. Here's an example:

{"segmentation": [[291.11,375.9,291.66,373.18,...]],"area": 505.7744000000001,"iscrowd": 0,"image_id": 500663,"bbox": [288.39,353.81,38.18,24.0],"category_id": 21,"id": 72296}

There can be many annotations corresponding to a single image. The most important pieces are "category_id" and "image_id". We want to keep track of these so that we can collect every annotation with one of our desired category ids, along with the id of each image those annotations reference.
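
Putting those two fields to work, a filtering pass might look roughly like the sketch below. It is not the code from filter.py. The function name `filter_annotations_and_images` is hypothetical, the `old_to_new` mapping (original category id to new id) is assumed to have been built beforehand, and the toy data is invented apart from the first annotation.

```python
def filter_annotations_and_images(coco, old_to_new):
    """Keep annotations in the wanted categories and the images they reference."""
    kept_annotations = []
    used_image_ids = set()
    for ann in coco["annotations"]:
        new_id = old_to_new.get(ann["category_id"])
        if new_id is None:
            continue  # annotation belongs to a category we dropped
        # Copy the annotation with its renumbered category id.
        kept_annotations.append({**ann, "category_id": new_id})
        used_image_ids.add(ann["image_id"])
    # Drop images that no remaining annotation points at.
    kept_images = [img for img in coco["images"] if img["id"] in used_image_ids]
    return kept_annotations, kept_images

# Assumed mapping: person (1) stays 1, motorcycle (4) becomes 2.
old_to_new = {1: 1, 4: 2}

coco = {
    "images": [{"id": 500663}, {"id": 42}, {"id": 7}],
    "annotations": [
        {"id": 72296, "image_id": 500663, "category_id": 21},
        {"id": 1001, "image_id": 500663, "category_id": 1},
        {"id": 1002, "image_id": 42, "category_id": 4},
        {"id": 1003, "image_id": 7, "category_id": 3},
    ],
}
kept_annotations, kept_images = filter_annotations_and_images(coco, old_to_new)
print([ann["id"] for ann in kept_annotations])  # [1001, 1002]
print([img["id"] for img in kept_images])       # [500663, 42]
```

To produce the new file, you would combine these two lists with the filtered categories and the unchanged "info" and "licenses" sections in one dict and write it out with `json.dump`.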

Need More Functionality?

Let us know if you need other utilities for the COCO dataset. Maybe I'll have time to implement them as well.

In the meantime, please follow us on social media and YouTube!