Thesis of Auday Berro

Subject:

Paraphrase generation for automated training of conversational services

Start date: 07/01/2020
Defense date: 25/06/2024

Advisor: Boualem Benatallah
Co-advisor: Khalid Benabdeslem

Summary:

In recent years, the widespread adoption of Dialogue Services (DS) across various industries has been facilitated by advancements in open-source technologies, computational power, and AI techniques. DS, such as virtual assistants and task-oriented bots, support natural language interactions and thus enhance human-computer interaction. Particular attention has been paid to their ability to understand and engage in language interactions. They have transformed the way we interact with devices, websites, and applications. For example, a two-year-old can play their favorite song just by saying “Alexa play the Baby Shark song” before learning how to use a computer.

However, developing and maintaining task-oriented bots remains challenging, despite the availability of services and advances in various research areas. A key challenge is translating user utterances into intents, because human language is rich and unbounded and the same intent can be expressed in many different ways. For example, given the utterance “What's the weather like in Lyon”, the bot must recognize the intent (weather-forecast) and the associated entities (location = Lyon). Another user might express the same intent as “Hey what's the weather update for Lyon”.

Developing a bot requires transforming a user expression into one or more intents, that is, identifying the tasks the user wants to accomplish (e.g., weather forecast). The bot then extracts the relevant entities (e.g., location and forecast date). Finally, it maps the intent and its parameters to back-end services (e.g., API calls) to obtain the results. This is typically done in two stages: (i) training a Natural Language Understanding (NLU) model to map user utterances to predefined intents and extract the associated entities, and (ii) developing webhook functions that map intents to executable forms (e.g., API calls) and fulfill user requests by performing tasks. Training an NLU model therefore requires a large set of utterances for each intent, covering all possible compositions of entities.
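The two stages above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the regular-expression "NLU" stands in for a trained model, and the names `INTENT_PATTERNS`, `LOCATION_PATTERN`, `parse`, `weather_webhook`, and `handle` are hypothetical, introduced only for illustration.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Assumption: keyword patterns stand in for a trained NLU intent classifier.
INTENT_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "weather-forecast": re.compile(r"\bweather\b", re.IGNORECASE),
}

# Assumption: a location is the capitalized word following "in" or "for".
LOCATION_PATTERN = re.compile(r"\b(?:in|for)\s+([A-Z][\w-]*)")


def parse(utterance: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Stage (i): map an utterance to an intent and its entities."""
    intent = next(
        (name for name, pattern in INTENT_PATTERNS.items() if pattern.search(utterance)),
        None,
    )
    entities: Dict[str, str] = {}
    match = LOCATION_PATTERN.search(utterance)
    if match:
        entities["location"] = match.group(1)
    return intent, entities


def weather_webhook(entities: Dict[str, str]) -> str:
    """Stage (ii): placeholder for a call to a back-end weather API."""
    location = entities.get("location", "an unknown location")
    return f"Fetching forecast for {location}"


# Each intent is bound to the webhook function that fulfills it.
WEBHOOKS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "weather-forecast": weather_webhook,
}


def handle(utterance: str) -> str:
    """Run both stages: understand the utterance, then fulfill the request."""
    intent, entities = parse(utterance)
    if intent is None:
        return "Sorry, I did not understand."
    return WEBHOOKS[intent](entities)
```

Both example utterances from the summary resolve to the same intent and entity here, but only because the patterns were written with them in mind. A hand-written rule set of this kind does not scale to unseen phrasings, which is why a trained NLU model, and therefore a large and varied set of training utterances, is needed.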

Utterances that refer to the same intent are known as paraphrases. The richness and ambiguity of human language make paraphrasing essential for building diverse datasets. Paraphrasing is an NLP task that aims to reformulate a given utterance into many possible variations while preserving its meaning. Traditional methods such as hiring experts or crowdsourcing are costly and time-consuming, so automated Paraphrase Generation (PG) is a promising solution. This thesis proposes leveraging existing PG techniques to generate high-quality datasets for training chatbots. The focus is on collecting a large number of utterances while ensuring specific linguistic quality criteria such as semantic relevance and diversity.
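As a rough illustration of the two quality criteria, the following Python sketch filters candidate paraphrases. It is only a sketch under strong assumptions, not the method used in the thesis: semantic relevance is approximated by checking that the entity values survive in the paraphrase, and diversity by token-level Jaccard similarity. The function names and the 0.6 threshold are hypothetical.

```python
import re
from typing import Iterable, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens of an utterance (punctuation is dropped)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, used here as a crude lexical proxy."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def select_diverse(
    seed: str,
    candidates: Iterable[str],
    required_entities: Iterable[str],
    max_similarity: float = 0.6,  # assumption: arbitrary illustrative threshold
) -> List[str]:
    """Keep candidates that preserve the entities and differ from what is kept."""
    required = [entity.lower() for entity in required_entities]
    kept = [seed]  # the seed counts as already collected
    for candidate in candidates:
        # Relevance proxy: every entity value must still appear.
        if not all(entity in candidate.lower() for entity in required):
            continue
        # Diversity proxy: reject near-duplicates of anything already kept.
        if all(jaccard(candidate, other) <= max_similarity for other in kept):
            kept.append(candidate)
    return kept[1:]  # return only the newly accepted paraphrases
```

With the seed “What's the weather like in Lyon”, a candidate that only adds a question mark is rejected as a near-duplicate, and a candidate about Paris is rejected because the location entity is lost. Lexical overlap is a weak stand-in for meaning, so a realistic pipeline would replace both proxies with stronger semantic measures.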

Key contributions include implementing and evaluating a baseline PG pipeline and addressing challenges such as semantic relevance and diversity. A taxonomy of errors in transformer-based paraphrase generation models led to the development of a novel annotated dataset and a multi-label paraphrase annotation model. Inspired by previous crowdsourcing studies, we investigated the potential of leveraging LLMs, such as GPT-3.5, for syntactically diverse paraphrase generation tasks. We replicated a study that proposed a multistage paraphrase pipeline for guiding crowdworkers to produce syntactically diverse paraphrases. We substituted LLMs for the human crowdworkers and performed a comparative analysis to demonstrate their effectiveness in controlled paraphrase generation tasks. Overall, this thesis presents a comprehensive exploration of automated paraphrase generation techniques to address the challenges of acquiring high-quality datasets for building robust and responsive Dialogue Services.

Jury:

Mr. Bellatreche Ladjel	Professor	ENSMA	Reviewer
Ms. Benbernou Salima	Professor	Université Paris Cité	Reviewer
Ms. Ailem Melissa	Researcher	Microsoft CA (USA)	Examiner
Mr. Bounekkar Ahmed	Associate Professor	Université Lyon 1	Examiner
Ms. Rosset Sophie	Research Director	CNRS (LISN)	Examiner
Mr. Zitouni Imed	Research Scientist	Google WA (USA)	Examiner
Mr. Benabdeslem Khalid	Associate Professor	Université Lyon 1	Thesis advisor
Mr. Benatallah Boualem	Professor	Dublin City University	Thesis co-advisor