ALOE

Artificial environment framework for benchmarks and Training

Introduction

Every engineer and scientist have difficulties to find videos or photos for training and testing neural networks. For GDPR and privacy policies, it is difficult to use videos taken from the internet or private videos, especially for non-academic purposes.

These problems can be solved with the ALOE framework. ALOE stands for “ArtificiaL peOplE” and it was created initially as a virtual crowd simulator. Now with ALOE you can generate a virtual scene that represents your use case for machine learning tasks.

The outputs of ALOE can be:

a clean video with the simulated scene (for machine learning inference)
a meta-tagged video with bounding boxes and informations (for benchmarking)
an interactive-visual program that can interact with your own system via ZMQ protocol

Due to the fact that it is a virtual reality simulation, every physical parameter is realistic, like gravity, friction, materials, etc; you can also simulate weather.

Virtual Emulation of a Scene

In the previous chapter we have introduced the fact that with ALOE it is possible to emulate a scene.

The scene is created in Unity, a wide spreaded tool used for simulation and gaming. Unity has its own physics engine for what concerns the dynamic of the scene and also a tool used for graphic and rendering. Unity also allows the user to import models generated from other famous tools like Blender. This feature is important because of the possibility of the physics reality of the single objects in the scene itself.

The reality of the scene is important when it comes both for training and inference of the Neural Networks.

The usage of Unity enhances also the dimension of the dataset. Theoretically we can generate an infinite quantity of simulation where to train the Neural Networks.

Generation of artificial dataset

At the end of the previous chapter we have stated that from a synthetic scene we can generate an enormous dataset. This dataset can be formed for example as an .mp4 video with the meta-data tagged. That is why it is important to have a graphic and a physics closer to an actual situation possible. Moreover the videos can also have different perspectives thanks to the possibility of Unity to change the position and orientation of the camera. Having different videos from different perspectives of a specific scene has two advantages:

Increase the dimension of the datasets. In fact, generally speaking, a bigger dataset results in a better inference from the Neural Network.
Teach to the networks that a particular phenomena is invariant under affine transformations.

The last advantage is the possibility to create GANs. GAN stands for “Generative Adversarial Network”. This network is used to train another network which must perform “better” than its “enemy”.The GAN has been used for example by DeepMind to train a Neural Network which has defeated the world champion of Go.

GANs are also used to train CNNs which must distinguish real images from fake images. An example has been the fight to “Deep Fakes”, which are basically artificial created videos with actual people. Those videos are problematic because of two facts: