Thesis of Mattia Palmiotto


Subject:
Causal Data Integration

Start date: 02/09/2024
End date (estimated): 02/09/2027

Advisor: Andrea Mauri
Coadvisor: Angela Bonifati

Summary:

Causality governs many real-world phenomena. It lies at the core of human intelligence and has become a defining feature of Artificial Intelligence, where it helps explain decision-making processes. Cause-effect relationships and conditional probabilities are essential elements of structural causal models, which concisely represent the data-generating process among variables. Causal reasoning relies on causal graphs, which make it possible to derive interventional probabilities from conditional probabilities in observational data, without additional experimentation. Beyond this role, graphs also serve as the foundation for NoSQL graph data management systems, providing significant expressive and computational capabilities.
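As an illustration of deriving an interventional probability from conditional probabilities, the following minimal sketch applies the standard back-door adjustment formula, P(Y | do(X=x)) = Σ_z P(Y | X=x, Z=z) P(Z=z), to a toy model. The three-variable model (Z → X, Z → Y, X → Y), all probability values, and all function names are assumptions made for this example only; they are not part of the thesis.

```python
from itertools import product

# Hypothetical toy model (assumption): Z -> X, Z -> Y, X -> Y, all binary.
p_z = {0: 0.5, 1: 0.5}                      # P(Z=z)
p_x1_given_z = {0: 0.2, 1: 0.8}             # P(X=1 | Z=z)
p_y1_given_xz = {(0, 0): 0.1, (1, 0): 0.3,  # P(Y=1 | X=x, Z=z), keyed by (x, z)
                 (0, 1): 0.5, (1, 1): 0.7}


def joint(x, y, z):
    """Observational joint P(X=x, Y=y, Z=z) implied by the toy model."""
    px = p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]
    py = p_y1_given_xz[(x, z)] if y == 1 else 1 - p_y1_given_xz[(x, z)]
    return p_z[z] * px * py


def conditional_y1_given_x(x):
    """Plain conditional P(Y=1 | X=x), computed from the joint."""
    numerator = sum(joint(x, 1, z) for z in (0, 1))
    denominator = sum(joint(x, y, z) for y, z in product((0, 1), repeat=2))
    return numerator / denominator


def backdoor_y1_do_x(x):
    """Back-door adjustment over Z: sum_z P(Y=1 | X=x, Z=z) * P(Z=z)."""
    total = 0.0
    for z in (0, 1):
        p_z_marginal = sum(joint(xx, yy, z) for xx, yy in product((0, 1), repeat=2))
        p_xz = sum(joint(x, yy, z) for yy in (0, 1))
        total += (joint(x, 1, z) / p_xz) * p_z_marginal
    return total


if __name__ == "__main__":
    print("P(Y=1 | X=1)     =", conditional_y1_given_x(1))
    print("P(Y=1 | do(X=1)) =", backdoor_y1_do_x(1))
```

In this toy model the plain conditional P(Y=1 | X=1) is about 0.62, while the adjusted P(Y=1 | do(X=1)) is about 0.5. The gap comes from the confounder Z, and it is removed using observational quantities only.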

However, current property graphs and property graph systems struggle to encode causal relationships and to answer causal queries, leaving many relevant questions unanswered within graph databases. Causal directed acyclic graphs (DAGs) are currently curated manually by domain experts, yet they are not adequately stored, integrated, or versioned as data artifacts within graph data systems. This highlights the need for a shift towards causal data intelligence, which requires a solid theoretical foundation and a comprehensive set of tools to support causal graph operations. Developing and implementing such tools would enable databases to incorporate causal knowledge and pave the way for data-driven, personalized decision making in various scientific fields.

Our focus is on designing explorative causal DAGs, which are essential for identifying hidden causes and effects within data. These DAGs serve as a guide for discovering evidence in graph data, using techniques such as graph pattern matching and similarity functions. A simple causal graph pattern involving three variables X, Y, and Z, arranged as a chain/mediation, a confounder, or a collider (X→Y→Z, X←Y→Z, or X→Y←Z, respectively), can be used to frame the obtained DAGs. The obtained DAGs, which may include nested subpaths, serve as refinements or augmentations of the initial causal DAGs used for exploration (e.g. X → (X′ → Y → Z′) → Z). They can also be enriched with provenance information and assessed with the help of users. Additionally, interventions can be encoded through path queries that capture, for example, the back-door and front-door criteria and instrumental variables. The computation of probabilities under these interventions can then be translated into corresponding path queries on the causal DAGs.
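The three-variable patterns mentioned above can be sketched as a simple matching step over a directed edge list. This is only an illustrative sketch: the function name classify_triples, the tuple encoding of edges, and the example graph are assumptions, and a real property graph system would express the same patterns in its own query language.

```python
from itertools import permutations


def classify_triples(edges):
    """Return the set of (kind, x, y, z) patterns found, with y as the middle node.

    kind is "chain" for x -> y -> z, "confounder" for x <- y -> z,
    and "collider" for x -> y <- z. The sketch does not check whether
    x and z are also directly connected.
    """
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    found = set()
    for x, y, z in permutations(sorted(nodes), 3):
        if (x, y) in edge_set and (y, z) in edge_set:
            found.add(("chain", x, y, z))
        # Confounders and colliders are symmetric in x and z: report each once.
        if x < z:
            if (y, x) in edge_set and (y, z) in edge_set:
                found.add(("confounder", x, y, z))
            if (x, y) in edge_set and (z, y) in edge_set:
                found.add(("collider", x, y, z))
    return found


if __name__ == "__main__":
    # Hypothetical example graph: a chain X -> Y -> Z plus a second cause W -> Y.
    example_edges = [("X", "Y"), ("Y", "Z"), ("W", "Y")]
    for pattern in sorted(classify_triples(example_edges)):
        print(pattern)
```

On the example graph, the sketch reports the chains X→Y→Z and W→Y→Z and the collider W→Y←X. Enumerating all ordered triples is cubic in the number of nodes, which is acceptable for small expert-curated DAGs but not intended as an efficient matching strategy.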

This thesis also addresses contingency and responsibility for graph queries: we study the concept of contingency for property graphs and define the notion of subgraph responsibility for the answer to a causal graph query. We will also explore the complexity of contingency and responsibility for graph queries, going beyond traditional relational queries.
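To make the notions of contingency and responsibility more concrete, the sketch below borrows the definition commonly used for relational queries and applies it, as an assumption, to single edges and a reachability query: an edge has responsibility 1/(1+k), where k is the size of the smallest set of other edges (the contingency) whose removal keeps the answer but makes that edge indispensable. The thesis's own definition for subgraphs may differ; the function names and the brute-force search are purely illustrative.

```python
from itertools import combinations


def reachable(edges, source, target):
    """Boolean query: is there a directed path from source to target?"""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for successor in adjacency.get(node, []):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return False


def edge_responsibility(edges, edge, source, target):
    """Responsibility of `edge` for the answer 'target is reachable from source'.

    Assumed definition (adapted from the relational setting): 1 / (1 + k),
    with k the size of a smallest contingency set; 0.0 if the edge is never
    needed for the answer. Exhaustive search, exponential in the number of edges.
    """
    edges = set(edges)
    if edge not in edges or not reachable(edges, source, target):
        return 0.0
    others = sorted(edges - {edge})
    for size in range(len(others) + 1):
        for contingency in combinations(others, size):
            remaining = edges - set(contingency)
            still_true = reachable(remaining, source, target)
            needs_edge = not reachable(remaining - {edge}, source, target)
            if still_true and needs_edge:
                return 1.0 / (1 + size)
    return 0.0


if __name__ == "__main__":
    # Hypothetical graph with two parallel paths from A to D.
    diamond = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D")]
    print(edge_responsibility(diamond, ("A", "B"), "A", "D"))
```

In the diamond-shaped example, removing the edge A→B alone does not change the answer because the path through C remains. Once A→C is placed in the contingency set, A→B becomes indispensable, so its responsibility under this assumed definition is 1/2.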

Furthermore, we will develop methods for causal graph transformations, which will allow us to characterize relationships between pairs of causal DAGs and define dependencies between them. This will involve using formalisms such as GPC (Graph Pattern Calculus) and extending them to handle uncertain graph data and cause-effect relationships.