Thesis of Stéphane Gosselin

Subject:

Thumbnail image classification and filtering

Start date: 01/09/2008
End date (estimated): 01/09/2011

Advisor: Christine Solnon
Coadvisor: Guillaume Damiand

Summary:

Many operations necessary for a statistical characterisation of sets of strings, trees, or graphs are intrinsically intractable (NP-difficult). In particular, if there are effective algorithms to compute the distance between two strings, the problem of the computation of the mean of a set of strings, and more generally of the various moments of higher degrees, is NP-hard. When one generalises these various concepts to sets of graphs, the corresponding computations all remain NP-difficult, and even simpler questions like that of computing the distance between two graphs become intractable. Let us note that in a context of classification and information retrieval, these combinatorics are exacerbated by the volume of the data.
Ant Colony Optimization. In order to be able to scale up the proposed techniques, innovative algorithms are necessary. We aim at investigating the capabilities of the Ant Colony Optimization (ACO) meta-heuristics for solving these problems. This meta-heuristic takes inspiration from the collective behaviour of ant colonies to
solve combinatorial optimisation problems. The idea is to represent the problem to be solved as the search for a best path in a graph. Artificial ants circulate in this graph in a random and incomplete way, looking for good paths. They communicate through the environment, by depositing on the edges of the
graph a trace of a volatile hormone called pheromone: this hormone tends to attract the artificial ants in a loop of positive feedback, guiding in an emergent way the colony towards a satisfactory solution, even if may not be optimal.
This kind of heuristic approach cannot guaranty the optimality of solutions found; however, we could observe in practice that it finds good results quickly, often optimal or very close to optimality. Let us add that in our context, the concept of optimality may not be of primary importance insofar as through the modelling of the data a lot of information has been lost.
In the context of our project we aim at studying the capabilities of this meta-heuristic for computing our statistical measures. We shall first focus on the problem of computing the median string of a set of strings. This problem is NP-hard: the search space is composed of all strings that may be formed over the considered alphabet. It can be formulated as a best path search in a rather straightforward way;
the main difficulty comes here from the fact that a same symbol may occur several times in a same string, the number of occurrences being not limited. Hence, we shall adapt the ACO meta-heuristic to this particular context. This preliminary work will then be extended to the more general problem of the computation of a mean graph.
Data reduction for pattern recognition. We aim at using our new algorithms for computing distances in real word applications such as thumbnail filtering. In such a context, the k nearest-neighbors algorithm has the double advantage to be a non-parametric method (no assumption is done about the underlying distribution) and to have an error upper bounded by twice the optimal Bayes error. However, its use requires a large amount of distance calculations. To reduce this constraint, we
aim at using data reduction techniques, i.e. at representing a (large) set of images by a relevant subset of prototypes. A first solution, investigated in Sub-goal 2, consists in representing clusters of images by their corresponding statistical moments. Another possible way would be to search for an optimal subset of instances that does not challenge the underlying probability density.
From a theoretical standpoint, this problem is NP-hard, and requires the use of heuristics. The work we achieved on data reduction by boosting has already provided very promising results on numerical data. Our approach is based on the construction of a neighbourhood graph on the learning data that individually represent weak learners. By studying their neighbourhoods, only the relevant instances are kept. We think that the similarity measures we are going to learn during this project can be used to build such graphs of images (represented in the form of strings, trees or
graphs) and then to extract the relevant prototypes.