About the significance of machine cognition


About the significance of machine cognition

Video Dragnet via AI soon a reality?

A dragnet investigation is an attempt to find a person or thing (such as a car) by defining a certain area and the physical aspects of the person or thing sought and systematically checking every matching person/thing one comes across.

And everyone remembers the 1987 film by the same name, right? Right?

If you've ever been in the UK, you'll know that they are - depending on the city you're in - probably the most videotaped people on the planet. Any public place, even in smaller outskirts of London, for example, have cameras pointing every which way. Makes you wonder just how many sit behind the monitors that all this video signal feeds into.

And while AI systems can do incredible things in identifying people by their faces (see
this article from 2016), it is one thing to identify a stationary face nearly filling the available camera resolution. It is another, to identify a "perp" that walks tangentially to a video camera, maybe wearing a hat that shades part of the face.

Image classification tasks, when performed via neural nets, tend to require a huge amount of training data. This makes sense, after all we're talking about recognition of a matrix that may be 4 Megapixel and in color (at least 16, possibly 24 bits per pixel) - you do the math! And a huge learning set means hundreds or thousands of person-hours of manual tagging.

A method, invented by Ian Goodfellow during a discussion over beers in a bar, makes this huge learning set go away. The method is called "generative adversarial networks". In essence, the setup involves at least two AI systems that "play" against another. One AI (or set of AI) is called the "generative" side and the opposing AI is the "discriminative" side. With a basic training set in place, the generative side is triggered to produce pictures of birds, for example. To get the game going, the discriminator is presented with random images from the generator (bad) and real images that fit the training set (good). I.e. in this case a binary classification.

There is a feedback from the discriminator to the generator wether it classifies the picture as that of a bird. Award points are given, depending on which side "wins" each round. I.e. if the discriminator correctly identifies an images as incorrect, it gets a certain number of points. If the generator fools the discriminator, it gets the points. The goal of the game is, of course, to have the most points.

The method was introduced in 2014 and has taken the AI community like an Australian bushfire (you know - the one with the bushes that have oily bark). It is a simple and cheap method to introduce self-learning to AI systems, with minimal human intervention.

A lot has been done with the concept in the last three years, with one of the more
recent research papers by Hang Zan (et al) introducing stacked GANs, where self-learning of image generation is pushed to incredible new resolutions. Have a look at the paper for some really incredible, 256x256 pixel full-color images of birds that were generated from a text description.

Where am I going with this? Well, one of the tools used by police in dragnet operations - in the case of a person search - may be a facial composite, based on the description of a witness or victim. Putting together one of these images requires experts with long years of experience, despite the availability of software that assists in the process.

What if one could throw the textual description of a perpetrator, ideally from multiple witnesses, into a stacked GAN and have it spit out a selection of composites to use in the dragnet operation? And with many cities - especially in the UK - wired with high coverage for video surveillance, one could then use these images in another stacked GAN that analyzed them in comparison to still images from the video feed. Surely, this will require more
TPU-like power to do properly, but give it another 5 years and with Moore's law, we should be there.