About me: I am a final-year Ph.D. candidate at the i-VisionGroup, Department of Automation, Tsinghua University, advised by Prof. Jiwen Lu. I received my B.E. degree from the Department of Automation and a second B.A. degree from the School of Economics and Management at Tsinghua University in 2021. My research philosophy focuses on bridging the gap between theoretical interpretability and practical efficiency in deep learning.
Research: My primary research interests lie at the intersection of Computer Vision and Deep Learning Theory. Currently, I am focused on the following pillars:
Email / Google Scholar / GitHub / Xiaohongshu / CV
* indicates equal contribution
To investigate the relationship between continuous and discrete tokenizers, we propose ReVQ. This method yields a high-performance VQ-VAE requiring only 40 GPU hours of training on a single RTX 4090.
We investigate iterative approaches for constructing discrete tokenizers and propose SFTok. Its iterative procedure is analogous to the discrete diffusion paradigm, which makes SFTok well suited to integration into Multimodal Large Language Models (MLLMs) and facilitates a unified discrete diffusion framework.
This study addresses the training instability of Vector-Quantized Networks (VQNs) by introducing OptVQ, a method that uses the Sinkhorn algorithm for optimal transport. It achieves full (100%) codebook utilization and outperforms current VQNs.
This paper analyzes existing Shapley value estimators and proposes SimSHAP. Experiments validate that SimSHAP significantly accelerates the computation of accurate Shapley values.
We introduce the Concentration Principle and develop SAMP, an efficient model-agnostic interpreter that incorporates an infinitesimal constraint (IC) and a momentum strategy (MS).
This paper proposes Bort, an optimizer for improving model explainability with boundedness and orthogonality constraints, derived from model comprehensibility conditions.
We propose AVSL, which employs a generalized similarity learning paradigm to represent the similarity between images as a graph, yielding a more accurate and explainable similarity measure.
This paper proposes to adaptively learn an ensemble of features that characterizes an image from different aspects, employing a relational module to capture correlations among features.