Abstract
The Transformer is the backbone architecture of most recent prominent language models. In this talk, I will delve into approximation techniques for the attention and MLP modules in Transformers. First, I will discuss the connection between attention mechanisms and kernel estimators, and accordingly adapt Nyström techniques for fast kernel computation to attention approximation. I will then turn to the compression of the MLP layers in Transformers, which preserves their neural tangent kernel (NTK) and accelerates both fine-tuning and inference for large language models. Together, the two aspects showcase the statistical structures behind popular deep learning designs.
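As a rough illustration of the kernel view of attention, the following minimal NumPy sketch shows a Nyström-style approximation of softmax attention: the full n-by-n attention matrix is replaced by three small softmax blocks built from m landmark queries and keys (taken here as segment means, assuming n is divisible by m). The function name, the landmark choice, and the use of a pseudo-inverse are illustrative simplifications, not necessarily the exact construction presented in the talk.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def nystrom_attention(Q, K, V, m=8):
        """Nystrom-style approximation of softmax attention (a sketch).

        Q, K, V: (n, d) arrays; m: number of landmarks.
        Landmarks are segment means of Q and K, a simple common choice;
        assumes n is divisible by m.
        """
        n, d = Q.shape
        scale = 1.0 / np.sqrt(d)
        # Landmarks: average Q and K over m contiguous segments.
        Q_tilde = Q.reshape(m, n // m, d).mean(axis=1)
        K_tilde = K.reshape(m, n // m, d).mean(axis=1)
        # Three small softmax "kernel" blocks replacing the full n x n matrix.
        F = softmax(Q @ K_tilde.T * scale)        # (n, m)
        A = softmax(Q_tilde @ K_tilde.T * scale)  # (m, m)
        B = softmax(Q_tilde @ K.T * scale)        # (m, n)
        # Attention matrix is approximated by F A^+ B, applied to V
        # without ever materializing an n x n matrix.
        return F @ (np.linalg.pinv(A) @ (B @ V))  # (n, d)

    # Toy usage: compare against exact softmax attention on random inputs.
    rng = np.random.default_rng(0)
    n, d = 64, 16
    Q, K, V = rng.normal(size=(3, n, d))
    exact = softmax(Q @ K.T / np.sqrt(d)) @ V
    approx = nystrom_attention(Q, K, V, m=8)
    print(np.abs(exact - approx).mean())

The point of the sketch is the cost structure: the three blocks require O(nm) kernel evaluations rather than O(n^2), which is the same trade-off Nyström methods make for classical kernel estimators.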
About the speaker |
Yifan Chen is an assistant professor in Computer Science and Mathematics at HKBU. He is broadly interested in developing efficient algorithms for machine learning, encompassing both statistical and deep learning models. Before joining HKBU, Yifan earned his Ph.D. in Statistics from UIUC in 2023.
|