Video Synthesis With Convolutional Autoencoders

This project was my attempt to incorporate a neural network trained to encode images into a video feedback process.

In spring of 2015 I took a seminar in deep learning which got me real excited. Machine learning is the study of general methods for problem solving using optimization and data. Deep learning is a particular approach to ML using models with many differentiable layers of parameters. Especially interesting to me was representation learning: using ML methods to extract meaningful features from "raw" data like the pixels of an image. And I'd been talking a lot to Parag Mital and Andy Sarroff about their respective work with deep learning, sound and video. But what freaked me out the most about deep learning was the similarity between neural networks and the audio/video feedback I'd been using to make noise.

The kind of digital video feedback I'd been playing with was superficially like a recurrent neural network. At each time step, the current frame of video would be computed from the last (and optionally, the current frame of an input video). There would first be some linear function from images to images, like translation or blurring; generally, each pixel would take on a linear combination of pixels in the last frame and input frame. Then, there would be some pixel-wise bounded nonlinearity to keep the process from blowing up, like wrapping around [0, 1] or sigmoid squashing. That's the architecture of an RNN. The only difference was that rather than represent the linear transformation as a big ol' parameter matrix, I would hand-craft it from a few sampling operations in a fragment shader. And instead of training by backpropagation to do some task, I would fiddle with it manually until it had visually interesting dynamics.
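Concretely, one step of that loop looks something like this toy NumPy version (the particular linear map and constants here are placeholders for whatever the fragment shader actually does, not a transcription of it):

```python
import numpy as np

def feedback_step(prev_frame, input_frame, mix=0.7, shift=(1, 2)):
    """One video feedback iteration: a hand-crafted linear map, then a bounded nonlinearity.

    prev_frame, input_frame: float arrays of shape (H, W, 3) with values in [0, 1].
    The "linear map" here is just a pixel translation blended with the input frame.
    """
    shifted = np.roll(prev_frame, shift, axis=(0, 1))    # translate
    pre = mix * shifted + (1.0 - mix) * input_frame      # linear combination of last frame and input
    return 1.0 / (1.0 + np.exp(-8.0 * (pre - 0.5)))      # sigmoid squash back into (0, 1)

frame = np.random.rand(64, 64, 3)
for _ in range(100):
    frame = feedback_step(frame, np.random.rand(64, 64, 3))
```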

I might have stopped there and tried to make my video-RNN parameters trainable. But to do what? It was pretty clear I wouldn't make much headway on synthesis of natural video in two weeks, without experience in deep learning software frameworks, and without even a GPU to run on. I wanted a toy-sized problem which might still result in a cool interactive video process. So I took a different approach: rather than try to train a recurrent network, I would train a feedforward convolutional network, then transplant its parameters into a still partially hand-constructed video process. I came up with a neat way to do that: my CNN would be arranged as an autoencoder. It would have an hourglass shape, moving information out of 2-D image space and into a dense vector representation (which I vaguely hoped would make the network implement a "hierarchy of abstraction"). This would mean I could bolt an "abstraction dimension" onto the temporal and spatial dimensions of a video feedback process. The autoencoder would implement "texture sampling" from the "less abstract" layer below and the "more abstract" layer above. Then I could fiddle with the dynamics by implementing something like "each layer approaches the previous time-step minus the layer above plus the layer below, squashed".
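In rough Python, that update rule would look something like the sketch below. This is just an illustration of the idea, not the shader code: `encode` and `decode` stand in for the autoencoder's learned maps between adjacent layers, and here they're replaced by plain average-pooling and nearest-neighbor upsampling so the snippet runs on its own.

```python
import numpy as np

def squash(x):
    """Bounded pixel-wise nonlinearity to keep activations from blowing up."""
    return 1.0 / (1.0 + np.exp(-x))

def step(layers, encode, decode, rate=0.5):
    """One time step: each layer moves toward
    (itself - layer above + layer below), squashed."""
    new_layers = []
    for i, x in enumerate(layers):
        target = x.copy()
        if i + 1 < len(layers):
            target -= decode[i](layers[i + 1])      # "more abstract" layer above
        if i > 0:
            target += encode[i - 1](layers[i - 1])  # "less abstract" layer below
        new_layers.append((1 - rate) * x + rate * squash(target))
    return new_layers

# Stand-ins for the learned maps: 2x average pooling down, nearest-neighbor upsampling up.
pool = lambda x: x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))
upsample = lambda x: np.kron(x, np.ones((2, 2)))

layers = [np.random.rand(32, 32), np.random.rand(16, 16), np.random.rand(8, 8)]
for _ in range(50):
    layers = step(layers, encode=[pool, pool], decode=[upsample, upsample])
```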

I almost bit off more than I could chew for a seminar project: my approach demanded that I design and train my own neural network with caffe, re-implement the forward pass in OpenGL, and spend time exploring the resultant dynamics. I was able to train my autoencoders on CIFAR with some success, and to make some singular boiling multicolored nonsense. But I didn't get the spectacular emergence of natural image qualities I hoped for.
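For a sense of scale, the model was roughly this shape. The sketch below is a modern PyTorch illustration only: the original was defined in caffe, and these layer sizes are guesses rather than the trained architecture.

```python
import torch
import torch.nn as nn

class HourglassAutoencoder(nn.Module):
    """Convolutions squeeze a 32x32x3 image down to a dense code;
    transposed convolutions expand the code back out to an image."""
    def __init__(self, code_size=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),   # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),  # 16x16 -> 8x8
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, code_size),           # dense "abstract" code
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_size, 64 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),   # 16x16 -> 32x32
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Reconstruction: a batch of CIFAR-sized images in, same-shaped images out.
x = torch.rand(4, 3, 32, 32)
print(HourglassAutoencoder()(x).shape)  # torch.Size([4, 3, 32, 32])
```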

Here's the GitHub, which includes a technical write-up, a Jupyter notebook with the autoencoder experiments, and the (probably very brittle) source code for an openFrameworks app which runs the process interactively, optionally with webcam input. It's based on early 2015 versions of caffe and openFrameworks. I may still try to get the openFrameworks app running again and capture some video, for posterity.

A few months later Deep Dream came out. Deep Dream does a similar thing: it iteratively alters an image using a pre-trained CNN to manifest natural image qualities. The trick is that the mechanism is the same one used to train the network: backpropagation, but optimizing the input pixels instead of the parameters. Vanilla Deep Dream converges, but it's simple to make a dynamic version by incorporating an infinite zoom or similar. Too bad I didn't get into the filter visualization papers for this project -- I failed to realize that backpropagation could do exactly what I wanted!
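The core move is easy to sketch: run an image through a truncated pre-trained network, take the gradient of some activation statistic with respect to the pixels, and step the pixels uphill. Below is a rough PyTorch illustration of that general technique, not Google's implementation; the layer cutoff, objective, and step size are arbitrary choices for the example.

```python
import torch
from torchvision import models

# Truncate a pre-trained network at an intermediate conv layer; freeze its weights.
net = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in net.parameters():
    p.requires_grad_(False)

image = torch.rand(1, 3, 224, 224, requires_grad=True)
for _ in range(20):
    activations = net(image)
    objective = activations.norm()   # "make this layer fire harder"
    objective.backward()
    with torch.no_grad():
        # Normalized gradient ascent on the pixels, not the parameters.
        image += 0.01 * image.grad / (image.grad.abs().mean() + 1e-8)
        image.clamp_(0, 1)
    image.grad.zero_()
```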