Thesis of Abderaouf Gacem
Subject:
Start date: 01/11/2020
End date (estimated): 01/11/2023
Advisor: Hamida Seba
Co-advisors: Christophe Garcia, Mohammed Haddad
Summary:
Graph-structured data is ubiquitous across a wide range of domains, from social networks and biological systems to recommendation engines and knowledge graphs. The ability to learn meaningful representations of graphs, known as Graph Representation Learning (GRL), is fundamental to unlocking the value encoded in these structures. However, GRL faces significant challenges, especially related to scalability and generalization. One promising way to address these issues is data augmentation, a long-standing strategy in machine learning whose potential for graph-structured data remains largely untapped.
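To make the notion of graph data augmentation concrete, the sketch below perturbs a toy graph by randomly dropping edges, one of the simplest structural augmentations; the function name, graph, and parameters are illustrative, not part of the thesis:

```python
import random

def drop_edges(edges, p, seed=None):
    """Return a perturbed view of the graph: each edge is dropped
    independently with probability p."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() >= p]

# Toy graph: a 4-cycle with one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
augmented = drop_edges(edges, p=0.4, seed=7)  # a structurally perturbed view
```

Training on many such perturbed views exposes a model to structural variation of the same underlying graph, which is the basic mechanism augmentation exploits.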
This thesis takes a principled and comprehensive look at graph data augmentation as a unifying theme to guide the design of effective and scalable GRL algorithms. We explore how augmentation can be integrated into various stages of the GRL pipeline, from input graph sampling to mini-batch training to latent-space transformations.
Our investigation is structured around three core contributions:
1. Structure-Aware Biasing of Random Walks: We revisit random walk-based methods, which are foundational to many node embedding approaches. We propose a novel biasing strategy that leverages resistance distance, a metric capturing the global structure of the graph, to guide the walk generation process. This yields walks that are more informative and that balance local and global graph exploration. The method enhances the resulting embeddings while preserving the simplicity and scalability of first-order sampling schemes.
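The general shape of a biased first-order walk can be sketched as follows, assuming the resistance-distance-derived edge weights have been precomputed offline; the uniform placeholder weights and the exact biasing rule here are illustrative assumptions, not the thesis's actual scheme:

```python
import random

def biased_walk(adj, bias, start, length, seed=None):
    """First-order walk: the next node is drawn among the neighbours of
    the current node with probability proportional to a per-edge weight,
    here standing in for a resistance-distance-derived bias."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(length - 1):
        u = walk[-1]
        nbrs = adj[u]
        weights = [bias[(u, v)] for v in nbrs]
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Toy graph: a triangle with a pendant node.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
# Placeholder weights; in the thesis these derive from resistance distance.
bias = {(u, v): 1.0 for u in adj for v in adj[u]}
walk = biased_walk(adj, bias, start=0, length=5, seed=3)
```

Because the bias only reweights the neighbour distribution at each step, the walk keeps the cost profile of first-order sampling while the precomputed weights inject global structural information.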
2. Mini-Batch Sampling for GNNs: We then move from random walk methods to graph neural networks (GNNs), where mini-batching is one way to achieve scalability. Motivated by the success of structure-aware augmentation in random walks, we ask: can we rethink mini-batch generation through the lens of augmentation? To this end, we propose FireForest, a novel sampling strategy that creates mini-batches faithful to the global properties of the graph while exposing the model to structural variations. FireForest preserves key topological properties across batches and allows for efficient training without sacrificing representation quality, thus enabling scalable and robust GNN training.
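FireForest itself is defined in the thesis; as a reference point for this family of techniques, the sketch below implements the classic forest-fire sampling heuristic (Leskovec and Faloutsos), which likewise grows connected, structure-preserving subgraphs from which mini-batches can be formed. All names and parameters are illustrative and should not be read as FireForest's algorithm:

```python
import random
from collections import deque

def forest_fire_sample(adj, root, burn_prob, max_size, seed=None):
    """Grow a connected node sample by probabilistically 'burning'
    neighbours outward from a root node, stopping at max_size."""
    rng = random.Random(seed)
    burned = {root}
    frontier = deque([root])
    while frontier and len(burned) < max_size:
        u = frontier.popleft()
        for v in adj[u]:
            if len(burned) >= max_size:
                break
            if v not in burned and rng.random() < burn_prob:
                burned.add(v)
                frontier.append(v)
    return burned

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
batch_nodes = forest_fire_sample(adj, root=0, burn_prob=0.7, max_size=3, seed=0)
```

Each call yields a different connected subsample of the graph, so repeated sampling doubles as a source of structural variation during training.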
3. Latent Graph Augmentation: Finally, we address the limitations of augmenting in the input graph space, which is vast and difficult to navigate. Inspired by the human ability to imagine plausible variations of perceived structures, we introduce an end-to-end learnable augmentation framework that operates in the latent space of GNNs. This approach allows the model to internally simulate graph augmentations, enabling it to extrapolate meaningful variations that improve generalization. It opens the door to an abstract and powerful form of augmentation.
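The core mechanism of latent-space augmentation can be illustrated with a minimal sketch: perturbing a node embedding produced by a GNN encoder rather than the input graph itself. In the thesis the perturbation is learned end-to-end; here the per-dimension scale is fixed, and all names and values are illustrative:

```python
import random

def latent_augment(z, scale, seed=None):
    """Perturb a latent embedding z element-wise with Gaussian noise;
    `scale` controls the magnitude per dimension (learned end-to-end in
    the full framework, fixed here for illustration)."""
    rng = random.Random(seed)
    return [zi + si * rng.gauss(0.0, 1.0) for zi, si in zip(z, scale)]

z = [0.5, -1.2, 0.3, 0.8]      # an embedding produced by a GNN encoder
scale = [0.1, 0.1, 0.0, 0.2]   # a zero scale leaves that dimension untouched
z_aug = latent_augment(z, scale, seed=42)
```

Operating in the latent space sidesteps the combinatorial input graph space: a small, smooth perturbation of the embedding can correspond to a plausible structural variation that would be hard to enumerate explicitly.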
Together, these contributions explore data augmentation as a versatile tool for graph representation learning. By incorporating augmentation in different GRL approaches, we offer a framework that enhances both the scalability and generalization ability of GRL models.