Thesis of Thomas Veran

Preventing causes of accidents using historical data from highway companies

Defense date: 04/11/2022

Advisor: Jean-Marc Petit
Co-advisor: Pierre-Edouard Portier


Worldwide, highway accidents have major social and financial impacts. To reduce their frequency and severity, crash prediction models (CPMs) are used to identify hazardous roadway segments and to provide actionable clues about the associated risk factors. CPMs are either parametric statistical models, in particular generalized linear models (GLMs), or machine learning models with a large number of parameters and no associated uncertainty estimates (e.g., ensembles of decision trees, support vector machines). Simple parametric models tend to be more interpretable but less effective than highly flexible non-parametric models, which work like black boxes. When weighing high-stakes decisions, such as those in highway safety, field experts expect predictive models to be both effective and glass-box interpretable. The models must assist them in designing and deploying preventive or remedial safety actions.
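To make the notion of glass-box interpretability concrete, a GLM for crash counts can be written as follows. This is only an illustrative sketch: the Poisson family with a log link is a common choice for count data and is an assumption here, as are the symbols y_i (crash count on segment i), x_{ij} (explanatory variable j) and \beta_j (its coefficient).

```latex
y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
```

Under this form, a one-unit increase in x_{ij}, with the other variables held fixed, multiplies the expected crash count by \exp(\beta_j), which is the kind of direct reading a black-box model does not offer.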

We therefore contribute to enhancing the predictive performance of parametric models while maintaining their interpretability. First, a well-chosen hierarchical structure can handle correlations among groups of observations and significantly improve both the quality of the models’ predictions and their interpretation. We propose to learn this structure by leveraging the output of a post-hoc explainability framework (viz., SHAP) applied to a highly flexible black-box model (viz., XGBoost). In our first contribution, this hierarchical structure informs a Bayesian multilevel GLM. Moreover, to further improve the predictive performance of the model without degrading its interpretability, we propose to extend its linear functional form to account for major first-order interactions between explanatory variables. These interactions are learned by analyzing the results of a trained self-organized polynomial network from the Group Method of Data Handling (GMDH) family of supervised algorithms.
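One generic way to write a multilevel GLM extended with an interaction term is sketched below. It is not the exact specification used in the thesis: the Poisson likelihood, the choice of which coefficients vary by cluster, the normal priors, and the symbols c(i) (the cluster of observation i, assumed here to come from the learned hierarchical structure), x_{ia} x_{ib} (one selected interaction) and \gamma (its coefficient) are all assumptions made for illustration.

```latex
y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \alpha_{c(i)} + \sum_{j=1}^{p} \beta_{j,c(i)}\, x_{ij} + \gamma\, x_{ia} x_{ib},
\qquad
\alpha_c \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2), \quad
\beta_{j,c} \sim \mathcal{N}(\mu_{\beta_j}, \sigma_{\beta_j}^2)
```

The shared hyperparameters (\mu_\alpha, \sigma_\alpha, \mu_{\beta_j}, \sigma_{\beta_j}) let clusters with few observations borrow strength from the others, while each coefficient remains readable on its own.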

In our second contribution, we exploit the hierarchical structure further by replacing the GLM with a multi-objective symbolic regression algorithm based on simulated annealing, which automates feature engineering and feature selection. By computing a cluster-specific ranking of expansions of regularized linear models, ordered by increasing complexity, we facilitate a dynamic interpretative process that makes it possible to discover effective, efficient, and interpretable predictive models.
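The general idea of searching over expansions of a regularized linear model with simulated annealing can be illustrated with the minimal Python sketch below. It is not the algorithm of the thesis: it handles a single cluster, replaces the multi-objective search with one scalarized cost (mean squared error plus a penalty per term), and restricts the expansions to a fixed, hypothetical list of candidate terms (single variables and pairwise products). The synthetic data and every name (`fit_ridge`, `design`, `cost`, `anneal`, `candidates`) are assumptions introduced for illustration.

```python
import math
import random


def fit_ridge(rows, y, lam=1e-3):
    """Solve (X'X + lam*I) w = X'y by Gauss-Jordan elimination (intercept is penalized too)."""
    p = len(rows[0])
    # Augmented matrix [X'X + lam*I | X'y].
    a = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


def design(X, terms):
    """Intercept column plus one column per term; a term is a tuple of variable indices."""
    return [[1.0] + [math.prod(x[i] for i in term) for term in terms] for x in X]


def cost(X, y, terms, alpha=0.05):
    """Scalarized objective: training MSE plus alpha per selected term (complexity)."""
    rows = design(X, terms)
    w = fit_ridge(rows, y)
    mse = sum((sum(wi * ri for wi, ri in zip(w, r)) - t) ** 2
              for r, t in zip(rows, y)) / len(y)
    return mse + alpha * len(terms)


def anneal(X, y, candidates, steps=400, t0=1.0, cooling=0.99, seed=0):
    """Simulated annealing over subsets of candidate terms; returns the best subset seen."""
    rng = random.Random(seed)
    current = set()
    cur_cost = cost(X, y, sorted(current))
    best, best_cost = set(current), cur_cost
    temp = t0
    for _ in range(steps):
        # Neighbor move: add or remove one randomly chosen candidate term.
        proposal = current ^ {rng.choice(candidates)}
        new_cost = cost(X, y, sorted(proposal))
        # Always accept improvements; accept worse moves with Metropolis probability.
        if new_cost < cur_cost or rng.random() < math.exp((cur_cost - new_cost) / temp):
            current, cur_cost = proposal, new_cost
            if cur_cost < best_cost:
                best, best_cost = set(current), cur_cost
        temp *= cooling
    return sorted(best), best_cost


# Synthetic data (hypothetical): y depends on x0 and on the interaction x0*x1.
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(60)]
y = [2 * x[0] + 3 * x[0] * x[1] + data_rng.gauss(0, 0.05) for x in X]
candidates = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]

best_terms, best_cost = anneal(X, y, candidates)
print(best_terms, round(best_cost, 3))
```

Because every visited model is a linear model over explicitly named terms, the models encountered during the search can be ordered by their number of terms and inspected one by one, which is the property the ranking by increasing complexity relies on.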

Experiments were conducted on a highway safety dataset and on more than ten public datasets covering classification and regression tasks. The results are promising: both contributions outperform traditional glass-box interpretable models while approaching the performance of the best non-parametric models. Finally, we illustrate the benefits of our approach by presenting, through a realistic case study, an application we designed for highway safety experts.

Mr. Gancarski Pierre, Professor, Université de Strasbourg, Reviewer
Ms. Gianini Gabriele, Professor, University of Milan, Reviewer
Ms. Sedes Florence, Professor, Université Toulouse 3, Examiner
Mr. Jacques Julien, Professor, Université Lumière Lyon 2, Examiner
Mr. Petit Jean-Marc, Professor, LIRIS - INSA Lyon, Thesis advisor
Mr. Portier Pierre-Edouard, Associate Professor, LIRIS - INSA Lyon, Co-advisor
Mr. Fouquet François, Doctor, Data Scientist at Data New Road, Co-advisor