# Deep Dive

Take a look under the hood of PYTHIA

PYTHIA uses revolutionary new mathematical ideas and concepts.

These concepts enable PYTHIA to combine the best qualities of different conventional approaches, such as speed, flexibility, transparency, and stability, without adopting their weaknesses, such as overfitting, low generalizability, or extensive training effort.

## Comparison to conventional approaches

In the fields of Data Analytics and Predictive Analytics, PYTHIA brings several advantages in comparison to conventional approaches:

PYTHIA uses a very efficient and target-oriented learning strategy, which allows it to produce results very quickly. Although it can handle vast amounts of data, the amount of data it actually needs is relatively small.

Furthermore, PYTHIA works on unprocessed and unsynchronized raw data since it can do data synchronization and data cleaning on its own.

The number of process parameters can become very large. Conventional approaches can handle at most a few hundred parameters, especially in industry, whereas PYTHIA can easily handle thousands.

Finally, PYTHIA can provide root cause indicators, which are crucial for finding out why something is happening.

## Overview of PYTHIA

PYTHIA covers various applications in the areas of analysis, prediction and detection.

## The journey to PYTHIA

PYTHIA is not the result of a single, isolated approach. Instead, its development, marked by trial and error, has led through many subfields of mathematics and theoretical physics. During this process, many fundamental ideas and concepts were tested for their effectiveness; some were included, some discarded, and some extended with newly invented approaches.

#### Neural Networks and Generalization

Our goal was to create a system that enables people in industry to quickly gain insights from their data and reach data-driven, transparent conclusions. In particular, we aimed at a virtual sensor builder and the automated identification of root causes of disruptions.

Like most people nowadays, we started our journey by exploring the capabilities of neural networks and deep learning. A considerable drawback of these models is that they are usually very sensitive to small changes in the input data. A neural network trained for image recognition might correctly recognize a stop sign in a large number of images. However, if you change only a few pixels, in a way that humans would not even notice, the stop sign suddenly appears to be a right-of-way sign. This sensitivity is closely related to over-fitting and indicates a significant expected generalization error.

But an essential requirement of prediction models is to distinguish which of the incoming data is relevant and which is not. This also means that small, irrelevant data changes must not change anything essential in the prediction.

Digging deeper, we did not find any consistent theory that explains how to design neural networks and select their hyperparameters so that the resulting predictor generalizes maximally. The only applicable theory of generalization we found uses the so-called Tikhonov Regularization. This can also be interpreted in terms of probability theory, as in the Bayesian Interpretation of Kernel Regularization. The terms appearing there are, however, not computable for neural networks. Therefore, we rejected this approach.

Probably the most accessible way to guarantee that these terms are computable is to use so-called kernel methods (Reproducing Kernel Hilbert Spaces). A very well-known theory that combines these methods and can be interpreted in terms of stochastic processes is Gaussian Process Regression, as described in the book Gaussian Processes for Machine Learning [GPML].
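For orientation, this is Tikhonov regularization in its standard textbook form, written in notation introduced here for illustration (the training pairs $(x_i, y_i)$, the kernel $k$ with Gram matrix $K_{ij} = k(x_i, x_j)$, and the regularization weight $\lambda$ are assumptions, not PYTHIA's own symbols):

$$
\hat f = \arg\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 + \lambda \, \lVert f \rVert_{\mathcal{H}}^2
$$

In a Reproducing Kernel Hilbert Space $\mathcal{H}$, the representer theorem restricts the minimizer to $f = \sum_i \alpha_i k(x_i, \cdot)$, so the penalty becomes a finite quadratic form:

$$
\lVert f \rVert_{\mathcal{H}}^2 = \alpha^\top K \alpha, \qquad \alpha = (K + n \lambda I)^{-1} y
$$

This is the sense in which kernel methods make the regularization term computable.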

#### Gaussian Processes and Hyperparameters

Gaussian process regression unites probabilistic models with methods to control the generalization error. Therefore, the predictors generated in this way generalize very well. However, in their standard form Gaussian processes scale like $\mathcal{O}(n^3)$ in the number of training points $n$ and therefore cannot handle the large amounts of data that arise, for example, in an industrial context. Although this scaling problem can be addressed using the Nyström approximation, two issues remain: the choice of the right kernel and the even bigger problem of efficiently choosing the hyperparameters.

A common approach is to maximize the evidence (the marginal log likelihood). This cannot be solved exactly but has to be approximated with a gradient-based method. To calculate the gradient of the evidence with respect to the hyperparameters, the corresponding learning problem for the current hyperparameters must be solved first.

For example, a machine in an industrial environment with 5,000 sensors and 10 operating states already has 50,000 hyperparameters. Optimizing them requires thousands of learning steps. Although meta-optimization can accelerate this procedure considerably, the computational effort remains prohibitively large.
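To make the cubic scaling concrete, here is a minimal sketch of standard Gaussian process regression in plain Python. It is a textbook construction, not PYTHIA's implementation; the squared-exponential kernel, the noise variance, and all function names are assumptions chosen for illustration. The Cholesky factorization of the $n \times n$ kernel matrix is the step that costs $\mathcal{O}(n^3)$.

```python
import math

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel for scalar inputs (an illustrative choice)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A. Three nested loops: the O(n^3) step."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward and backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit(xs, ys, noise_var=1e-2, length_scale=1.0):
    """Return the weights alpha = (K + noise_var * I)^{-1} y."""
    n = len(xs)
    K = [[rbf_kernel(xs[i], xs[j], length_scale) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_cholesky(cholesky(K), ys)

def gp_predict(x_new, xs, alpha, length_scale=1.0):
    """Posterior mean at x_new: a kernel-weighted sum over the training points."""
    return sum(a * rbf_kernel(x_new, x, length_scale) for a, x in zip(alpha, xs))

# Made-up training data: five samples of a sine curve.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.sin(x) for x in xs]
alpha = gp_fit(xs, ys)
print(gp_predict(2.0, xs, alpha))   # close to sin(2.0), since the noise is small
print(gp_predict(50.0, xs, alpha))  # far from the data, the mean falls back to 0
```

Note that `length_scale` and `noise_var` are exactly the kind of hyperparameters that have to be chosen before this fit can run.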
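In the standard Gaussian process setting, the evidence and its gradient take the following well-known form. The notation is introduced here for illustration: $K_\theta$ denotes the kernel matrix including the noise term, $\theta$ the hyperparameters, and $n$ the number of training points.

$$
\log p(y \mid X, \theta) = -\frac{1}{2} y^\top K_\theta^{-1} y - \frac{1}{2} \log \det K_\theta - \frac{n}{2} \log 2\pi
$$

$$
\frac{\partial}{\partial \theta_j} \log p(y \mid X, \theta) = \frac{1}{2} y^\top K_\theta^{-1} \frac{\partial K_\theta}{\partial \theta_j} K_\theta^{-1} y - \frac{1}{2} \operatorname{tr}\!\left( K_\theta^{-1} \frac{\partial K_\theta}{\partial \theta_j} \right)
$$

Every gradient evaluation needs $K_\theta^{-1} y$, which is precisely the solution of the learning problem for the current $\theta$.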

#### Regression Trees and Traceability

Even if one could get the parameter tuning of Gaussian processes under control, in all applications where decisions could have severe consequences, the “what” is worthless without the “why”.

As is well known, neural networks do not explain how they arrive at a statement. And although the hyperparameters of Gaussian processes are interpretable, they do not provide a concrete explanation for the current situation either.

Another class of regressors, the so-called regression trees and their extensions, provides good traceability. They can be trained efficiently and generalize reasonably well. However, they are limited in the number of relations they can represent. In other words, regression trees are more stable than neural networks and have better traceability than Gaussian processes. Still, they generalize less well than Gaussian processes and are less flexible than neural networks.
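The traceability of regression trees can be illustrated with the smallest possible tree, a single split. The sketch below is a generic least-squares stump, not part of PYTHIA; the function name `fit_stump` and the temperature/wear data are hypothetical and made up for the example.

```python
def fit_stump(xs, ys):
    """Find the single threshold split that minimizes the squared error."""
    def sse(vals):
        # Sum of squared deviations from the mean of one side of the split.
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = None
    points = sorted(set(xs))
    # Candidate thresholds lie halfway between neighbouring feature values.
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, sum(left) / len(left), sum(right) / len(right))
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

# Made-up data: a hypothetical temperature signal and a wear measurement.
temps = [20.0, 22.0, 24.0, 60.0, 62.0, 64.0]
wear = [1.0, 1.1, 0.9, 3.0, 3.2, 2.8]
threshold, low, high = fit_stump(temps, wear)
print(f"if temperature <= {threshold}: predict {low:.2f} else predict {high:.2f}")
```

The fitted model is itself the explanation, a readable rule. The price is the limitation mentioned above: a piecewise-constant rule can represent only a small set of relations.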

#### Capabilities of conventional Methods

Comparison of neural networks, Gaussian processes (kernel machines) and regression trees.

###### Therefore, the goal for PYTHIA was clear: to develop a method that combines the flexibility of neural networks with the stability of Gaussian processes and the traceability of regression trees.

We decided to extend kernel machines. First, they have to be able to handle tens of thousands of hyperparameters. Second, we have to guarantee the traceability and interpretability of the results. The former problem was solved using a new interpretation of the hyperparameters in terms of geometrical quantities, which can be computed directly from the training data alone. The latter was solved by developing a method of segmentation in Reproducing Kernel Hilbert Spaces. The resulting segments correspond exactly to the different operating states of a machine, in each of which different dependencies prevail. Detecting the current state and its dependencies explains why the prediction yields its current value. Even more importantly, this makes it possible to define countermeasures against, for example, process disruptions, and to start them in time.

#### Geometrical and physical interpretation of hyperparameters

We found out that all the hyperparameters in the regularization term can be translated directly into specific geometric quantities. Thus, the determination of the hyperparameters can be reinterpreted in such a way that the geometry of the information in the system must be extracted from the given training data. We have succeeded in doing this by considering another problem, which is independent of the actual machine learning problem, and solving this before the actual predictor is learned. The solution provides exactly those hyperparameters which minimize the expected generalization error within this class of regressors.

Finally, we have simply avoided the problem of tuning hyperparameters (learning rate, weight decay factors, …) because our approach has none that need tuning. Instead of performing a grid search or a similar procedure, we managed to calculate them directly. The calculation is closely analogous to the way vacuum expectation values of background fields yield the coupling constants of the fundamental interactions in the universe.

Furthermore, we found that the problem of machine learning a target variable becomes equivalent to a scalar quantum field theory. The resulting method is strongly related to methods of scaling dimensions and analogous to renormalization groups. The vacuum state of this theory is exactly the predictor that ensures maximum generalizability. The probability of measuring a value $\phi_t$ at a time $t$ and a value $\phi_{t+\Delta t}$ at $t + \Delta t$ is then exactly the transition amplitude $\langle \phi_t \phi_{t+ \Delta t} \rangle$.

#### Vacuum state analogy

The vacuum state can be visualized with the following analogy: Imagine a set of points in three-dimensional space (X,Y,Z). The goal is to find a function $\phi$ that predicts Z from X and Y. Thus, the features are located in the X-Y plane while the labels are shown on the Z axis. At each of these points, a spring with a given spring strength is attached, which can only be deflected in the Z-direction. A surface (like a rubber blanket) is suspended from these springs. It has a given elasticity tensor E, which is assumed to be constant for simplicity (in PYTHIA itself, E is no longer constant but varies due to segmentation). The surface also has a given density distribution $\rho$, which is likewise assumed to be constant (in PYTHIA itself, $\rho$ is no longer constant). Far away from all training examples, the blanket lies on the ground (Dirichlet boundary condition). Thus, the energy of the system is the sum of the tension energy of the surface, the tension energy of the springs, and the potential energy of the surface due to its mass. The spring strength is inversely proportional to the expected measurement error of Z.

Vacuum state given by the configuration of minimal total energy. The springs create a force between red and black dots. Red dots correspond to training samples, black dots correspond to the mounting points.

The density distribution and the surface elasticity are given by the hyperparameters of the underlying problem. The function $\phi$, which predicts Z from X and Y, is then given by the shape of the surface that has minimal total energy. High rigidity of the surface can be interpreted as rigidity of the function $\phi$, i.e., penalization of strong variations in small neighborhoods. The hyperparameters that penalize variation of the prediction function are thus, in a sense, the elasticity of the prediction function’s curve. The hyperparameters that penalize large function values can be understood as the mass density of the curve. The variance of the underlying stochastic process determines the spring strength: the larger the variance, the weaker the spring.

Suppose there is a lot of uncertainty in the system. In that case, the low spring strength allows large deviations at the given training points, and the total energy is mainly determined by the potential energy and the tension of the surface. Hence, the solution will be a flat surface lying on the ground. If, on the other hand, there is no uncertainty whatsoever, we have springs of infinite strength, so among all surfaces passing through the training samples we seek the one with minimal overall tension energy and potential energy. In this sense, we search for the surface that passes through all training samples and runs as smoothly as possible between the points.
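Under the simplifying assumptions of the analogy (constant scalar elasticity $E$ and constant density $\rho$), the total energy could be written roughly as follows. The symbols $\mathcal{U}$, $k_i$, and $(x_i, y_i, z_i)$ are introduced here for illustration, and the quadratic form of the mass term is an assumption chosen to match a penalty on large function values:

$$
\mathcal{U}[\phi] = \underbrace{\frac{E}{2} \int \lVert \nabla \phi \rVert^2 \, dx \, dy}_{\text{surface tension}} + \underbrace{\frac{\rho}{2} \int \phi^2 \, dx \, dy}_{\text{mass term}} + \underbrace{\frac{1}{2} \sum_i k_i \bigl( \phi(x_i, y_i) - z_i \bigr)^2}_{\text{springs}}
$$

Here $k_i$ is the strength of the spring at training point $i$, and the vacuum state is the $\phi$ that minimizes $\mathcal{U}$ subject to $\phi \to 0$ far away from the training points.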
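The two limiting cases can be reproduced with a small one-dimensional sketch of the analogy: a string instead of a blanket, discretized on a grid and pinned to the ground at both ends. This is an illustrative toy model under stated assumptions (constant elasticity and density, a quadratic mass term, one shared spring strength), not PYTHIA's algorithm; the function name and the sample data are hypothetical.

```python
def minimal_energy_surface(n_grid, samples, spring,
                           elasticity=1.0, density=1.0, length=10.0):
    """Return heights phi[0..n_grid] that minimize the discretized total energy.

    samples maps a grid index to its measured height z; spring is the spring
    strength shared by all samples. Both ends are pinned to 0 (Dirichlet).
    """
    h = length / n_grid
    m = n_grid - 1                      # number of interior unknowns
    off = -elasticity / h               # coupling between neighbouring grid points
    diag = [2 * elasticity / h + density * h] * m
    rhs = [0.0] * m
    for idx, z in samples.items():
        diag[idx - 1] += spring         # the spring adds stiffness at its grid point
        rhs[idx - 1] += spring * z      # and pulls the surface towards z

    # Setting the energy gradient to zero gives a symmetric tridiagonal
    # linear system, solved here with the Thomas algorithm.
    c_prime = [0.0] * m
    d_prime = [0.0] * m
    c_prime[0] = off / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for j in range(1, m):
        denom = diag[j] - off * c_prime[j - 1]
        c_prime[j] = off / denom
        d_prime[j] = (rhs[j] - off * d_prime[j - 1]) / denom
    phi = [0.0] * m
    phi[-1] = d_prime[-1]
    for j in range(m - 2, -1, -1):
        phi[j] = d_prime[j] - c_prime[j] * phi[j + 1]
    return [0.0] + phi + [0.0]

# Made-up training samples: grid index -> measured height Z.
samples = {10: 1.0, 20: 2.0, 30: 1.5}
stiff = minimal_energy_surface(40, samples, spring=1e6)   # almost no uncertainty
soft = minimal_energy_surface(40, samples, spring=1e-6)   # very high uncertainty
print([round(stiff[i], 3) for i in samples])  # close to the measured heights
print(max(abs(v) for v in soft))              # close to 0: the surface stays on the ground
```

With very strong springs the curve passes almost exactly through the samples and relaxes smoothly in between. With very weak springs the tension and mass terms dominate and the curve stays flat.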

If a given feature is irrelevant, we have maximum rigidity in this direction, so the prediction function does not depend on this variable. Therefore, determining the geometry, or the elasticities, also solves the problem of feature selection.
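A loosely related and widely known idea from conventional kernel methods is an anisotropic kernel with one length scale per feature: an infinite length scale makes the kernel, and hence the predictor, independent of that feature. The sketch below shows this effect only as an analogy. It is not how PYTHIA determines its geometry, and the function name and data points are hypothetical.

```python
import math

def ard_kernel(a, b, length_scales):
    """Squared-exponential kernel with one length scale per feature."""
    sq = sum(((ai - bi) / ls) ** 2 for ai, bi, ls in zip(a, b, length_scales))
    return math.exp(-0.5 * sq)

# Feature 0 is relevant (length scale 1); feature 1 is "maximally rigid".
length_scales = [1.0, math.inf]

p = (0.3, 5.0)
q = (0.3, -100.0)   # differs from p only in the irrelevant feature
r = (1.3, 5.0)      # differs from p only in the relevant feature

print(ard_kernel(p, q, length_scales))  # 1.0: the points are indistinguishable
print(ard_kernel(p, r, length_scales))  # below 1: the relevant feature still matters
```

Any predictor built from this kernel inherits the same invariance, so fixing the rigidity per direction acts as feature selection.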

## PYTHIA vs. Conventional

#### Creating a Virtual Sensor

PYTHIA’s capabilities and advantages become impressively visible when comparing the PYTHIA approach with conventional methods using the example of ‘creating a virtual sensor’.

Creating a software sensor is a very complex task. Usually, this is a large data science project involving data scientists and domain experts. The data has to be cleaned, relevant signals identified, the right model has to be chosen and trained. During training, many hyperparameters have to be tuned and a final model candidate has to be evaluated using live predictions. The findings of the live evaluation have to be incorporated into the model and final adjustments have to be made.

Comparison of PYTHIA with conventional methods when creating a virtual sensor

Using PYTHIA, you only have to feed raw data into the system and start training for the sought-after expression. After a short period (around one hour), the autonomous training is done, and your virtual sensor is ready for use.

## The Team behind PYTHIA

PYTHIA’s core team combines various fields of expertise from mathematics, theoretical physics, and software development. With their skills and enthusiasm, they make sure that PYTHIA is cutting-edge technology in every possible way.

##### Dr. Christian Paleani
###### Co-Founder and Chief Scientist

My name is Christian, and I am a theoretical physicist and mathematician. During my physics studies at the Technical University of Munich, I specialized in theoretical elementary particle physics and the compactification of heterotic string theory on orbifolds.

Afterward, I worked on generalized pseudo-holomorphic pairs bridging the mathematics and physics of topological string theory with H-flux. For this work, I received my doctorate and was an invited researcher at the Mathematical Institute of the University of Oxford.

##### Susanne Moll
###### Senior Scientist

My name is Susi. I studied Maths at the LMU in Munich. Most of my time there, I spent learning about yet another fascinating area of mathematics. Finally, I got my degree in algebraic geometry concerning rational points on singular del Pezzo surfaces.

During my work for PerfectPattern, I learned more about discrete mathematics and various optimization techniques, including programming in general and Java. I have the most fun working on math-y problems as they arise in all kinds of real-world applications.

##### Lukas Lentner
###### Co-Founder and Chief Engineer

I am Lukas. In my studies of Physics at the LMU Munich, I focused on probabilistic methods simulating quantum effects of solid-state particle systems.

At PerfectPattern, I fuse state-of-the-art cloud technology with our best-in-class machine-learning algorithms to realize genuinely innovative and handy products for our customers. As a team lead, I want to empower every member to reach our ambitious goals.

##### Lisa Kraus
###### Software Engineer

I am Lisa, a mathematician specializing in applied topics like optimization, game theory, probability theory, and data analysis. I love it when theorems combine mathematical beauty and real-life benefit, and there is no better example of that than Pythia.

As a software engineer, I complete Pythia with state-of-the-art infrastructure to make it easily accessible and usable. Besides that, I enjoy living out my additional teacher’s degree by explaining Pythia’s more profound concepts understandably.

##### Philipp Kaiser
###### Software Engineer

My name is Philipp, and I’m a thoroughbred software developer. After earning my master’s degree in informatics at the TU Munich, I failed with my own little business, which led me to join a more successful one.

With my full-stack development and high-performance graphics programming experience, I try to keep an eye on the big picture by providing a common architecture for all the different pieces of code and helping colleagues when it’s getting tricky. Always open to new ideas, I’m continually testing various state-of-the-art programming concepts to make Pythia a stable, flexible, maintainable, and performant application.