So, this fall, we will make the following datasets publicly available as tools for the technology industry and research community:
- A facial attribute and identity training dataset of over 1 million images, being built by IBM Research scientists to improve the training of facial analysis systems. It will be annotated with attributes and identity, leveraging geo-tags from Flickr images to balance data from multiple countries and active learning tools to reduce sample selection bias. Currently, the largest facial attribute dataset available contains 200,000 images, so this new dataset with a million images will be a monumental improvement. Additionally, datasets available today include only attributes (hair color, facial hair, etc.) or only identity (identifying that 5 images are of the same person), but not both. This new dataset changes that by combining the two, making it possible to match attributes to an individual.
- A dataset of 36,000 facial images, equally distributed across all ethnicities, genders, and ages and annotated by IBM Research, to give people a more diverse dataset to use in evaluating their technologies. This will specifically help algorithm designers identify and address bias in their facial analysis systems. The first step in addressing bias is to know that a bias exists, and that is what this dataset will enable.
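
To make concrete how an evenly distributed evaluation set can surface bias, here is a minimal sketch in Python. It is not IBM's tooling: the record fields (`group`, `predicted`, `actual`), the two function names, and the sample values are all hypothetical, chosen only for illustration. The idea is simply to score a system separately for each demographic group and compare the results.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return {group_value: accuracy} for a list of prediction records.

    Each record is a dict with hypothetical keys "predicted" and "actual",
    plus the demographic field named by group_key.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        total[group] += 1
        if rec["predicted"] == rec["actual"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Difference between the best-served and worst-served group."""
    return max(per_group.values()) - min(per_group.values())


# Made-up results on a tiny balanced sample: four images per group.
records = [
    {"group": "A", "predicted": "smiling", "actual": "smiling"},
    {"group": "A", "predicted": "neutral", "actual": "neutral"},
    {"group": "A", "predicted": "smiling", "actual": "smiling"},
    {"group": "A", "predicted": "neutral", "actual": "neutral"},
    {"group": "B", "predicted": "smiling", "actual": "smiling"},
    {"group": "B", "predicted": "smiling", "actual": "neutral"},
    {"group": "B", "predicted": "neutral", "actual": "smiling"},
    {"group": "B", "predicted": "neutral", "actual": "neutral"},
]

per_group = accuracy_by_group(records, "group")
gap = accuracy_gap(per_group)
print(per_group, gap)
```

Because every group contributes the same number of images, a large gap points at the system under test rather than at a group being under-sampled in the evaluation data.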