thaonguyen19.github.io - Thao Nguyen

Hi! I'm currently a second-year PhD student in Machine Learning at the University of Washington, co-advised by Professors Ludwig Schmidt and Sewoong Oh. My research interests include reliable machine learning, neural network representations, and improving the quality of pretraining datasets. I was an AI Resident at Google Brain from Oct 2019 to Sept 2021. Prior to that, I completed my undergrad at Stanford, majoring in Computer Science, and had the chance to spend a wonderful summer at Two Sigma. From June

Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a participatory benchmark where the training code is fixed and researchers innovate by proposing new training sets. Concretely, we provide a testbed for dataset experiments centered around a ne

Neural network representations contain structure beyond what was present in the training labels. For instance, representations of images that are visually or semantically similar tend to lie closer to each other than to dissimilar images, regardless of their labels. Clustering these representations can thus provide insights into dataset properties as well as neural network internals. In this work, we study how the many design choices involved in neural network training --- training data, loss function, and
