Thesis of Guillaume Lefebvre


Subject:
Learning and exploiting semantic representations for hierarchical multi-label classification and learning object retrieval in the field of education and professional training

Start date: 14/12/2020
End date (estimated): 14/12/2023

Advisor: Alexandre Aussem
Coadvisor: Haytham Elghazel

Summary:

Inokufu was born from the convergence of two areas of expertise shared by its co-founders: andragogy and data science. The company aims to combine extensive algorithmic analysis of educational data with a system of human, pedagogical, and business audits. The resulting educational data is of higher quality and can be used to develop Machine Learning and recommendation algorithms tailored to the domain of education and professional training.

The principal objective of this thesis project is to investigate, adapt, and develop techniques for Hierarchical Multi-label Classification and Learning Object retrieval that account for the specific characteristics of the education and professional training domain. In particular, the thesis concentrates on learning semantic representations suited to these tasks, using Natural Language Processing methods adapted to the linguistic and semantic nuances of that domain.

Given the complex nature of this data, the requirements articulated by Inokufu encompass the following aspects:

  1. The capacity to process specialized terminology: texts from the domain of education and professional training employ specific terms that are frequently absent from general corpora. The model must therefore understand and exploit this specialized terminology so that content is correctly represented and searchable, which improves the matching between training offerings and the needs users express.
     
  2. Hierarchical classification of educational and professional data: educational content is often organized in complex hierarchies (skills, certifications, and training paths). To navigate these structures efficiently and help users locate appropriate resources, the classification system must respect and reflect these hierarchies. The solution must classify content in a way that preserves the relationships between hierarchical levels, such as those between general categories and their sub-themes.
     
  3. Effectiveness of semantic search: to improve the user experience, semantic search must capture users' intentions even when they describe their needs with approximate terms or varied expressions.
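
As a rough illustration of the third requirement, semantic search is commonly implemented by embedding queries and documents in the same vector space and ranking documents by cosine similarity. The sketch below assumes this generic setup; it is not taken from the thesis. The function names, the document identifiers, and the three-dimensional vectors are all hypothetical and hand-written, standing in for embeddings that a real encoder would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; a real system would obtain them from a text encoder.
doc_vecs = {
    "intro_to_python": [0.9, 0.1, 0.0],
    "project_management": [0.1, 0.9, 0.2],
    "data_analysis_with_pandas": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # stands in for an embedded query such as "learn to code"

ranking = rank_by_cosine(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Because the ranking depends only on vector proximity, a query phrased in approximate or varied terms can still retrieve relevant content, provided the encoder maps it close to the matching documents.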

In order to address these requirements, this thesis presents two significant contributions:

  1. BERTEPro: a novel semantic representation framework tailored to texts from the domain of education and professional training. By combining a pre-training phase on domain-specific corpora with fine-tuning on general tasks, BERTEPro captures finer semantic nuances and generates precise and relevant representations, thereby improving the classification and retrieval of educational content.
     
  2. HMCCCProbT: a Hierarchical Multi-label Classification framework that efficiently models both local and global dependencies within hierarchical structures. Through a probabilistic transmission mechanism, HMCCCProbT improves accuracy while limiting the errors that arise when erroneous decisions propagate through the levels of the hierarchy.
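
To make the idea of passing probabilities down a hierarchy more concrete, the sketch below shows one generic top-down scheme: each node's probability is its conditional probability given its parent, multiplied by the parent's own probability. This is only an illustration under stated assumptions and should not be read as the HMCCCProbT mechanism itself, whose details are not given in this summary. The function name, the label hierarchy, the conditional probabilities, and the 0.5 threshold are all hypothetical.

```python
def propagate(parent_of, conditional):
    """Turn per-node conditionals P(node | parent) into marginals P(node).

    parent_of maps each label to its parent label (None for top-level labels).
    conditional maps each label to a probability in [0, 1].
    """
    marginal = {}

    def resolve(node):
        # Memoized top-down product along the path from the root to this node.
        if node in marginal:
            return marginal[node]
        parent = parent_of.get(node)
        parent_prob = 1.0 if parent is None else resolve(parent)
        marginal[node] = conditional[node] * parent_prob
        return marginal[node]

    for node in conditional:
        resolve(node)
    return marginal


# Hypothetical label hierarchy and classifier outputs for a single learning object.
parent_of = {
    "computer_science": None,
    "programming": "computer_science",
    "python": "programming",
    "management": None,
}
conditional = {
    "computer_science": 0.9,
    "programming": 0.8,
    "python": 0.5,
    "management": 0.2,
}

marginal = propagate(parent_of, conditional)
# Hypothetical decision threshold applied to the propagated probabilities.
predicted = sorted(label for label, p in marginal.items() if p >= 0.5)
print(marginal)
print(predicted)
```

Since a child's probability can never exceed its parent's, thresholded predictions stay consistent with the hierarchy, and keeping soft probabilities rather than hard per-level decisions avoids committing to an early mistake.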

These two complementary approaches have been validated experimentally on real-world datasets from the domain of education and professional training. The experiments demonstrate that they improve classification quality and Learning Object retrieval in an educational context.