Master's thesis defense by Anton Golles

Title: Text-conditioned Reverse Diffusion on a Latent Representation of Humanoid Motion

Abstract: Character animation is an extensive, time-consuming process in developing animated films and video games. We present a method for synthesizing humanoid motion sequences with a neural network that can be trained on a laptop, making it accessible to anyone. We explain the intricacies of learning the time-reversed diffusion process on a latent representation of high-dimensional, sequential data.

By off-loading some of the compression work from a transformer encoder to a linear encoder, we obtain a larger model that is nevertheless easier to train. To explore the latent diffusion model, we first implement the pipeline on the MNIST dataset, a sub-task that gives perspective and guides our later design choices. We find that the forward diffusion process on a latent representation does not cover our entire embedding space, which limits how well the reverse diffusion process can be learned in the latent space.
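
As a rough illustration of what a forward diffusion process on a latent representation can look like, the sketch below assumes the standard DDPM-style closed form with a linear noise schedule. The schedule values, the latent dimensions, and the name forward_diffuse are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def forward_diffuse(z0, t, betas, rng):
    """Sample z_t from q(z_t | z_0) in closed form (assumed DDPM-style process)."""
    # Cumulative product of (1 - beta) gives the fraction of signal kept at step t.
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps
    return z_t, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)            # assumed linear noise schedule
z0 = rng.standard_normal((16, 32)) * 3.0 + 5.0   # hypothetical latents, not real data
z_T, _ = forward_diffuse(z0, len(betas) - 1, betas, rng)
```

Under these assumptions the noised latents only ever lie along paths between the encoded data and an approximately standard Gaussian, so regions of the embedding space away from those paths are not visited by the forward process.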