Thesis of Romin Durand

Building a Data Lake for Event Logs and their Interpretable Inference

Start date: 01/03/2023
End date (estimated): 01/03/2026

Advisor: Angela Bonifati


Nowadays, companies rely on more and more data centers as the cloud computing paradigm spreads. Data centers generate a tremendous amount of event logs, and research increasingly leverages these data to extract useful information about dangerous events recorded in the logs.

First, we want to design a data lake for all kinds of event logs. The first goal is to identify semantic similarities between the value fields of the logs, whatever the log type (Windows, Linux, Apache, ...). The way data are extracted from the data lake must be generic, regardless of how the data are used afterwards. Other projects in the company will also benefit from the design of this data lake.
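As a minimal illustration of what matching field semantics across log types could look like, the sketch below infers a coarse type for each value field of two hypothetical records (one from an Apache access log, one from a Linux auth log) and pairs fields that share a type. All field names, records, and patterns here are illustrative assumptions, not the actual data lake design.

```python
import re

# Coarse pattern for one semantic type; a real system would use many such
# detectors (timestamps, ports, URLs, usernames, ...).
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def field_types(record: dict) -> dict:
    """Map each field name to a coarse semantic type inferred from its value."""
    types = {}
    for name, value in record.items():
        if IPV4.match(value):
            types[name] = "ipv4"
        elif value.isdigit():
            types[name] = "integer"
        else:
            types[name] = "text"
    return types

def similar_fields(rec_a: dict, rec_b: dict) -> list:
    """Pairs of fields from two log records that share a non-trivial type."""
    ta, tb = field_types(rec_a), field_types(rec_b)
    return [(a, b) for a, t in ta.items() for b, u in tb.items()
            if t == u and t != "text"]

# Hypothetical records from two heterogeneous log sources.
apache = {"client": "192.168.0.7", "status": "404", "request": "GET /index.html"}
auth   = {"rhost": "10.0.0.3", "port": "22", "message": "Failed password"}

print(similar_fields(apache, auth))  # → [('client', 'rhost'), ('status', 'port')]
```

Value-based type inference like this is one simple way to align fields across formats without relying on field names, which differ from one log type to another.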

Then, we will use the data lake by extracting data for inference processes. The data will first be normalised, since they come in different formats and from different inputs. Metadata will then be needed to record what has been unified, what has not, and how the unification took place. The idea is to extract patterns of both normal and malicious behaviours through supervised and unsupervised learning. As a result, we will be able to build graphs of the state of our system, explain to the administrator what is happening, and make predictions.
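A toy sketch of the unsupervised side of this step, assuming a very simple common schema: raw lines from two sources are normalised into (source, event) records carrying unification metadata, and event patterns seen only once are flagged as candidate anomalies. The schema, threshold, and sample lines are all hypothetical.

```python
from collections import Counter

def normalise(source: str, line: str) -> dict:
    """Map a raw log line into a common schema, keeping provenance metadata
    so we know what was unified and how."""
    return {"source": source,
            "event": line.split(":", 1)[-1].strip(),
            "unified": True}

# Hypothetical raw inputs from two log sources.
raw = [("linux", "sshd: failed password"),
       ("linux", "sshd: failed password"),
       ("linux", "sshd: failed password"),
       ("apache", "worker: request served"),
       ("apache", "worker: request served"),
       ("linux", "kernel: segfault in module")]

records = [normalise(src, line) for src, line in raw]
counts = Counter(r["event"] for r in records)

# An event pattern seen only once is a candidate malicious behaviour.
anomalies = [e for e, c in counts.items() if c == 1]
print(anomalies)  # → ['segfault in module']
```

A real pipeline would of course use learned models rather than a frequency threshold, but the structure is the same: normalise first, then mine patterns over the unified records.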

Finally, we plan to re-inject the results of the inference processes into the data lake to improve the “wisdom of the data lake”. This aspect is also novel, since the results may contain information relevant to the normalisation of the logs. The produced results will be linked to the input datasets, made persistent without redundancy, and reused afterwards.
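One possible shape for this re-injection, sketched under the assumption that results are small JSON-serialisable records: each result is stored once, deduplicated by content hash, and linked to the input dataset identifiers it was derived from, so it persists without redundancy and can be reused later. The class and identifiers are hypothetical.

```python
import hashlib
import json

class LakeResults:
    """Toy store for inference results re-injected into the data lake."""

    def __init__(self):
        self.store = {}   # content hash -> result payload (stored once)
        self.links = {}   # content hash -> set of input dataset ids

    def inject(self, result: dict, dataset_ids: set) -> str:
        # Content-addressing: identical results share one key, so no redundancy.
        key = hashlib.sha256(
            json.dumps(result, sort_keys=True).encode()).hexdigest()
        self.store.setdefault(key, result)
        self.links.setdefault(key, set()).update(dataset_ids)
        return key

lake = LakeResults()
k1 = lake.inject({"pattern": "failed password burst"}, {"logs-2023-01"})
k2 = lake.inject({"pattern": "failed password burst"}, {"logs-2023-02"})

# The same result from two extractions is stored once, linked to both inputs.
print(k1 == k2, len(lake.store))          # → True 1
print(sorted(lake.links[k1]))             # → ['logs-2023-01', 'logs-2023-02']
```

Keeping the result-to-dataset links explicit is what lets later normalisation steps trace a re-injected result back to the logs it came from.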