Thesis of Ieva Mitasiunaite

Subject:

Mining string datasets under similarity and soft frequency constraints: application to promoter sequence analysis

Start date:

Advisor: Jean-Francois Boulicaut

Summary:

The Inductive Database (IDB) framework enables to describe and to execute Knowledge Discovery from Data (KDD) scenarios by means of sequences of queries. Among them, the so-called inductive queries declaratively express the constraints that have to be satisfied by the solution patterns. It is challenging not only to identify useful primitive constraints to support subjective interestingness specification but also to be able to design efficient algorithms that can exploit them at evaluation time. It is even harder when complete solvers are desired: every pattern that satisfies a given combination of primitive constraints has to be extracted from the data. We are considering string dataset mining and the design of generic algorithms to solve inductive queries that are Boolean compositions of primitive constraints (e.g., a string must be frequent enough in a dataset and infrequent in another one and not include a given substring). In such a context, state-of-the-art proposals (e.g., the FAVST solver) solve such queries for monotonic and anti-monotonic primitive constraints (e.g., the typical conjunction of maximal and minimal frequency constraints). It is far more complicated and often impossible to design generic algorithms to solve not (anti-)monotonic constraints. We are considering these challenging issues that emerge as soon as we want to support the search for fault-tolerant patterns, i.e., for real-life noisy data analysis and/or degenerated pattern discovery. Therefore, our methodological contribution is twofold. First, we have considered different ways to specify sub-string pattern soft occurrences and soft frequency constraints that exploit similarity measures. Importantly, such constraints cannot be guaranteed to be (anti)-monotonic. Our proposal has been to design useful similarity constraints and related soft frequency constraints that can be specified as a conjunction of a monotonic and an anti-monotonic constraints and thus can be exploited efficiently. This has been implemented into our generic solver called Marguerite. Next, we have been considering the problem of extraction parameters tuning (e.g., providing threshold values for the frequency constraints), i.e., one of the current open problems to support constraint-based mining. We have been considering an original technique to guess the solution size thanks to a pattern sampling approach. We studied how to identify the most stringent constraints that provide solutions and whether one can thrust the extracted patterns as not being false positives thanks to a statistical measure called the Twilight Zone Indicator (TZI). Last but not least, we have used our methods and tools in a challenging application domains, namely gene promoter sequence analysis. In tight collaboration with a group of biologists whose expertise concerns stem cell self-renewal molecular mechanisms, we successfully applied Marguerite and the TZI measure to identify putative binding sites of the transcription factors involved in the process of cell differentiation.