
HuMA: Human Machine Alignment

Bookmark this to keep an eye on my project updates!

View My GitHub Profile

Dr. Michaël Aupetit


Senior Scientist at Qatar Center for Artificial Intelligence

QCRI - HBKU - LinkedIn

Publications

DBLP - Google Scholar - ORCID - ResearchGate


Human-centric tables (HCTs) are everywhere in official industrial, governmental, and institutional reports, but their complex layouts make it difficult to answer natural questions about them, even for recent LLMs.

[Figure: HCT complex layouts]

HCT-QA is a benchmark of HCTs and related question-answer pairs, carefully collected from real sources, manually verified, and enriched with metadata for deep analysis.

[Figure: HCT-QA data collection and validation]

The HCT-QA benchmark is also enriched with thousands of synthetic HCTs, question-answer pairs, and metadata produced by a generator. The generator relies on the correspondence between SQL queries and template-based questions, and between pivoted relational tables and HCTs, to obtain valid answers at scale.

[Figure: HCT-QA synthetic data generator]
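The correspondence described above can be illustrated with a small sketch. This is not the HCT-QA generator itself: the table name `sales`, its columns, the values, and the question template are all hypothetical, chosen only to show how a template question paired with an SQL query yields an answer that is valid by construction, while the same relational rows are pivoted into a grid resembling a very simple HCT.

```python
import sqlite3

# Hypothetical toy data: table name, columns, and values are invented for illustration.
ROWS = [
    ("North", 2021, 120),
    ("North", 2022, 150),
    ("South", 2021, 80),
    ("South", 2022, 95),
]


def build_db(rows):
    """Load the relational (long-format) rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn


def pivot(rows):
    """Pivot (region, year, amount) rows into a region-by-year grid."""
    years = sorted({year for _, year, _ in rows})
    regions = sorted({region for region, _, _ in rows})
    lookup = {(region, year): amount for region, year, amount in rows}
    header = ["region"] + [str(year) for year in years]
    body = [[region] + [lookup.get((region, year)) for year in years]
            for region in regions]
    return [header] + body


def make_qa(conn, region, year):
    """Fill a question template; the paired SQL query computes the answer."""
    question = f"What was the amount for {region} in {year}?"
    sql = "SELECT amount FROM sales WHERE region = ? AND year = ?"
    answer = conn.execute(sql, (region, year)).fetchone()[0]
    return question, answer


conn = build_db(ROWS)
for line in pivot(ROWS):
    print(line)
print(make_qa(conn, "North", 2022))
```

Because the answer comes from the query over the relational table rather than from reading the pivoted grid, it stays correct however complex the displayed layout becomes.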

ISilDR is a new family of dimensionality reduction (DR) techniques that never produce false neighbors, in contrast to Orthogonal Linear Projections (OLPs) such as Principal Component Analysis, which never produce missing neighbors. This is the first time that such a family of DR techniques has been identified. All other DR techniques always produce some false neighbors... And so what?

[Figure: ISilDR principle]

When a group of data is identified in an ISilDR and then in an OLP, we can infer that this group exists in the original multidimensional (MD) data space.

[Figure: ISilDR -> OLP inference]

When a group of data is identified in an OLP and then in an ISilDR, we can infer that this group exists in the original MD data space.

[Figure: OLP -> ISilDR inference]

This is the first time such a strong conclusion about an MD data pattern has been derived from combining two independent DR layouts. In contrast, observing the same group across any combination of two or more OLP layouts, two or more ISilDR layouts, or two or more t-SNE or UMAP layouts, for instance, does not guarantee that this group exists in the MD data space. In general, in DR, two wrongs don't make a right! But here, OLP + ISilDR can tell us some truth about the MD data.
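The "no missing neighbors" property of OLPs follows from the fact that an orthogonal projection never increases Euclidean distances, so two points that are close in the MD space remain close in the layout. The sketch below checks this on random data. It rests on assumptions not stated above: neighborhoods are defined by a fixed radius, the data are synthetic Gaussian points, and the projection plane is random rather than computed by PCA. ISilDR itself is not implemented here.

```python
import math
import random


def gram_schmidt(vectors):
    """Return an orthonormal basis spanning the given independent vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - dot * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        basis.append([wi / norm for wi in w])
    return basis


def project(point, basis):
    """Coordinates of the orthogonal projection of a point onto the basis."""
    return [sum(p * b for p, b in zip(point, axis)) for axis in basis]


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def neighbor_pairs(points, radius):
    """All index pairs (i, j), i < j, whose points lie within the radius."""
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if dist(points[i], points[j]) <= radius}


rng = random.Random(0)
dim, radius = 5, 1.5  # assumed values, chosen only for the demonstration
data = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(60)]
basis = gram_schmidt([[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)])
layout = [project(p, basis) for p in data]

md_pairs = neighbor_pairs(data, radius)
layout_pairs = neighbor_pairs(layout, radius)

missing = md_pairs - layout_pairs  # neighbors in MD space, separated in the layout
false = layout_pairs - md_pairs    # neighbors in the layout, separated in MD space

print("missing neighbors:", len(missing))
print("false neighbors:", len(false))
```

The projection keeps every true neighbor pair but adds false ones, which is why a group seen in an OLP alone cannot be trusted, and why pairing it with a layout that has the opposite guarantee is informative.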

Lasso techniques for selecting data from dimensionality-reduced (DR) layouts are prone to DR-induced distortions: the lasso can capture false neighbors and miss true neighbors. We propose Distortion-Aware Brushing, a technique that permanently rearranges the data in the DR layout to minimize local distortions. The clusters you build in the layout match the clusters in the multidimensional data space.

[Figure: Interactive Lasso Techniques vs Distortion-Aware Brushing]

[Figure: Distortion-Aware Brushing in action]

Cluster Label Matching (CLM) is the assumption that class labels in datasets used for benchmarking clustering techniques match their cluster structure. This assumption is essential for evaluating the quality of clustering techniques using External Validation Measures (EVMs) such as the Normalized Mutual Information or the Adjusted Rand Index. EVMs compare the clusters obtained by a clustering technique to the class labels of the data (in multidimensional space). The issue is that no one had checked whether the CLM assumption was valid, which calls into question comparisons of clustering techniques on such benchmark datasets. We propose a way to quantify the CLM of multidimensional datasets by defining across-dataset axioms and deriving adjusted versions of internal validation measures (IVMs), such as Silhouette or Davies-Bouldin, that enable comparison of datasets with different numbers of data points, dimensions, and classes.

[Figure: Cluster Label Matching and External Validation Measures]

Considering only the datasets with the best CLM to compare clustering techniques has a strong positive effect on ranking stability. Whichever subset of these good-CLM datasets is used for benchmarking clustering techniques yields a similar ranking of the compared techniques. By contrast, using all datasets regardless of their CLM, or picking only the bad-CLM ones, leads to far more unstable rankings, underscoring the importance of measuring CLM and selecting only the best-CLM datasets.

[Figure: Considering only the CLM-best data has an effect on ranking stability]

The proposed adjusted IVMs are both computationally efficient and more highly correlated with the generally intractable ground truth than the standard IVMs.

[Figure: Speed and correlation with ground truth]
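To make the role of an EVM concrete, here is a minimal pure-Python sketch of the standard Adjusted Rand Index, which scores how well a clustering agrees with class labels. It illustrates only the kind of comparison that depends on the CLM assumption. It is not the adjusted IVMs proposed above, and the function name and toy labels are assumptions made for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard Adjusted Rand Index between class labels and cluster labels."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    if n < 2:
        return 1.0  # a single point: both partitions agree trivially

    # Pairs of points grouped together in both partitions, in the classes
    # only, and in the clusters only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / comb(n, 2)  # agreement expected by chance
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions have one group
    return (together_both - expected) / (maximum - expected)


classes = [0, 0, 1, 1]           # toy class labels (assumed)
same_up_to_names = [1, 1, 0, 0]  # same grouping with swapped cluster names
crossing = [0, 1, 0, 1]          # clusters that cut across the classes

print(adjusted_rand_index(classes, same_up_to_names))
print(adjusted_rand_index(classes, crossing))
```

A high score rewards a technique only to the extent that the class labels themselves follow the cluster structure, which is exactly what CLM measures.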

Projects

TBA soon