Dr. Michaël Aupetit

Senior Scientist at Qatar Center for Artificial Intelligence

QCRI - HBKU - LinkedIn

Publications

DBLP - Google Scholar - ORCID - ResearchGate

IEEE ICDE 2026

M. S. Ahmad, Z. A. Naeem, M. Aupetit, A. Elmagarmid, M. Eltabakh, X. Ma, M. Ouzzani, C. Ruan, H. Al-Sayeh
HCT-QA: A Benchmark for Question Answering on Human-Centric Tables
arXiv - HF - GitHub

Human-centric tables (HCTs) are everywhere in official industrial, governmental, or institutional reports, but their complex layouts make it difficult, even for recent LLMs, to answer natural-language questions about them.

Figure: HCT complex layouts

HCT-QA is a benchmark of HCTs and related question-answer pairs carefully collected from real sources, manually verified, and enriched with metadata for deep analysis.

Figure: HCT-QA data collection and validation

The HCT-QA benchmark is also enriched with thousands of synthetic HCTs, question-answer pairs, and metadata produced by a generator. The generator exploits two correspondences to obtain valid answers at scale: one between SQL queries and template-based questions, and one between pivoted relational tables and HCTs.

Figure: HCT-QA synthetic data generator
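
The two correspondences mentioned above can be illustrated with a minimal sketch. It is not the HCT-QA generator: the toy table, the `facts` schema, and both templates are hypothetical names chosen for illustration. A SQL template and a question template share the same slots, so the SQL result on the relational table is a valid answer to the question asked about the pivoted layout.

```python
import sqlite3

# Hypothetical toy relational table: one row per (region, year) with a value.
ROWS = [
    ("North", 2022, 120),
    ("North", 2023, 135),
    ("South", 2022, 80),
    ("South", 2023, 95),
]

# Paired templates (illustrative only, not the benchmark's actual templates):
# the SQL template and the question template share the same slots.
SQL_TEMPLATE = "SELECT SUM(sales) FROM facts WHERE region = ? AND year = ?"
QUESTION_TEMPLATE = "What were the total sales in the {region} region in {year}?"


def build_db(rows):
    """Load the toy rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facts (region TEXT, year INTEGER, sales INTEGER)")
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", rows)
    return conn


def pivot(rows):
    """Pivot the relational rows into a region-by-year grid (an HCT-like layout)."""
    years = sorted({year for _, year, _ in rows})
    regions = sorted({region for region, _, _ in rows})
    cells = {(region, year): sales for region, year, sales in rows}
    header = ["Region"] + [str(y) for y in years]
    body = [[r] + [cells.get((r, y), "") for y in years] for r in regions]
    return [header] + body


def make_qa(conn, region, year):
    """Fill both templates with the same slot values; the SQL result is the answer."""
    answer = conn.execute(SQL_TEMPLATE, (region, year)).fetchone()[0]
    question = QUESTION_TEMPLATE.format(region=region, year=year)
    return question, answer


conn = build_db(ROWS)
table = pivot(ROWS)
question, answer = make_qa(conn, "South", 2023)
```

Here `table` plays the role of the human-centric layout shown to a model, while `question` and `answer` come from the relational side, which is what makes the answer checkable without manual annotation.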

IEEE TVCG 2026

R. Cutura, S. Sadler, Q. Quang Ngo, M. Aupetit, and M. Sedlmair
ISilDR: Isometric-Seriation-Based Dimensionality Reduction for Visual Cluster Analysis
PacificVis IEEE TVCG track - Slides

ISilDR is a new family of dimensionality reduction (DR) techniques that never produce false neighbors, in contrast to Orthogonal Linear Projections (OLPs) such as Principal Component Analysis, which never produce missing neighbors. This is the first time that such a family of DR techniques has been identified. All other DR techniques always produce some false neighbors. So what?

Figure: ISilDR principle
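
The OLP half of this statement can be checked numerically with a small sketch. It assumes neighbors are defined by a distance threshold `EPS` shared by both spaces, which is an illustrative choice and not necessarily the definition used in the paper; the random data and the axis-aligned projection are also hypothetical. Because an orthogonal projection can only shrink distances, every pair that is close in the MD space stays close in the layout, so no neighbor goes missing, while pairs that are far apart in the MD space can still land close together as false neighbors.

```python
import random

random.seed(0)

# Hypothetical toy data: 60 points in a 5-dimensional space.
points_md = [[random.random() for _ in range(5)] for _ in range(60)]

# Keeping the first two coordinates is an orthogonal linear projection (OLP).
points_2d = [p[:2] for p in points_md]

EPS = 0.5  # illustrative neighborhood radius, used in both spaces


def sq_dist(a, b):
    """Squared Euclidean distance, accumulated coordinate by coordinate."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
    return total


def neighbor_pairs(points, eps):
    """Return the set of index pairs (i, j), i < j, at distance at most eps."""
    pairs = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if sq_dist(points[i], points[j]) <= eps * eps:
                pairs.add((i, j))
    return pairs


true_pairs = neighbor_pairs(points_md, EPS)    # neighbors in the MD space
layout_pairs = neighbor_pairs(points_2d, EPS)  # neighbors in the 2D layout

# False neighbors: close in the layout but not in the MD space.
false_neighbors = layout_pairs - true_pairs
# Missing neighbors: close in the MD space but not in the layout.
missing_neighbors = true_pairs - layout_pairs

print(len(false_neighbors), len(missing_neighbors))
```

The set `missing_neighbors` is empty by construction for any orthogonal projection, whereas `false_neighbors` is typically large. The sketch does not implement ISilDR itself, which provides the opposite guarantee.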

When a group of data points is identified in an ISilDR layout and then in an OLP layout, we can infer that this group exists in the original multidimensional (MD) data space.

Figure: ISilDR -> OLP Inference

When a group of data points is identified in an OLP layout and then in an ISilDR layout, we can likewise infer that this group exists in the original MD data space.

Figure: OLP -> ISilDR Inference

This is the first time such a strong conclusion about an MD data pattern has been derived from combining two independent DR layouts. In contrast, observing the same group across any combination of two or more OLP layouts, two or more ISilDR layouts, or two or more t-SNE or UMAP layouts, for instance, does not guarantee that this group exists in the MD data space. In general, in DR, two wrongs don't make a right! But here, OLP + ISilDR can tell us some truth about the MD data.

IEEE TVCG 2026

H. Jeon, M. Aupetit, S. Lee, K. Ko, Y. Kim, G. J. Quadri, and J. Seo
Distortion-Aware Brushing for Reliable Cluster Analysis in Multidimensional Projections
IEEE TVCG - Demo

Figure: Interactive Lasso Techniques vs Distortion-Aware Brushing

Lasso techniques for selecting data from dimensionality-reduced (DR) layouts are prone to DR-induced distortions: the lasso can capture false neighbors and miss true neighbors. We propose Distortion-Aware Brushing, a technique that dynamically rearranges the data in the DR layout to minimize local distortions, so that the clusters you build in the layout match the clusters in the multidimensional data space.

Figure: Distortion-Aware Brushing in action

IEEE TPAMI 2025

H. Jeon, M. Aupetit, D. Shin, A. Cho, S. Park, J. Seo
Measuring the Validity of Clustering Validation Datasets
IEEE TPAMI - Ranked Datasets - (Python) Dataset reader - (Python) Adjusted Internal Validation Measures

Figure: Cluster Label Matching and External Validation Measures

Figure: Considering only the best-CLM datasets has an effect on ranking stability

Cluster Label Matching (CLM) is the assumption that class labels in datasets used for benchmarking clustering techniques match their cluster structure. This assumption is essential for evaluating the quality of clustering techniques using External Validation Measures (EVMs) such as the Normalized Mutual Information or the Adjusted Rand Index. EVMs compare the clusters obtained by a clustering technique to the class labels of the data (in multidimensional space). The issue is that no one had checked whether the CLM assumption was valid, which calls into question comparisons of clustering techniques on such benchmark datasets. We propose a way to quantify the CLM of multidimensional datasets by defining across-dataset axioms and deriving adjusted internal validation measures (IVMs) from standard ones such as Silhouette or Davies-Bouldin. These adjusted IVMs enable comparison of datasets with different numbers of data points, dimensionality, and classes.
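
To make the role of an EVM concrete, here is a minimal pure-Python sketch of the Adjusted Rand Index, using its standard pair-counting formula. It is not the code behind the Python links above, and the toy labelings are hypothetical. The point it illustrates is that the score only measures agreement with the class labels, so it says something about clustering quality only if those labels match the cluster structure.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Contingency counts: how many items share each (label_a, label_b) pair.
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Expected number of agreeing pairs under random labelings with these sizes.
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. a single group in both labelings).
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


# Hypothetical class labels and two candidate clustering outputs.
class_labels = [0, 0, 0, 1, 1, 1]
same_partition = [1, 1, 1, 0, 0, 0]   # same groups, different label names
mixed_partition = [0, 1, 0, 1, 0, 1]  # groups cut across the classes

ari_same = adjusted_rand_index(class_labels, same_partition)
ari_mixed = adjusted_rand_index(class_labels, mixed_partition)
```

In this toy case `ari_same` is 1.0 and `ari_mixed` is slightly negative. If the class labels did not reflect the cluster structure of the data, the second partition could still be the better clustering, which is the situation the CLM measure is meant to detect.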

Considering only the datasets with the best CLM to compare clustering techniques has a strong positive effect on ranking stability. Whichever subset of these good-CLM datasets is used for benchmarking clustering techniques yields a similar ranking of the compared techniques. By contrast, using all datasets, ignoring their CLM, or picking only the bad-CLM ones leads to far more unstable rankings, underscoring the importance of measuring CLM and selecting only the best-CLM datasets.

The proposed adjusted IVMs are both computationally efficient and more highly correlated with the generally intractable ground truth than the standard IVMs.

Figure: Speed and correlation with ground truth

Projects

To be announced soon