Synthetic data researcher. I lead the synthetic data team at Poolside, building the Laguna large language models. Before that, I worked on the Phi models at Microsoft Research.
My research focuses on expanding what synthetic data can teach models at all stages of training. Rather than treating generation as a way to collect more tokens, I study synthesis as an engineering problem: how generated data can complement organic data, regularize how content is presented, and vary how it is taught in more complex ways.
My work ranges from simple transformations that vary expression while preserving signal to compositional multi-step pipelines that change how concepts are taught. I am also interested in systematic experiments with synthesis methods to understand which forms of generated data contribute useful training signal, preserve diversity, and compound into new capability rather than simply imitating a stronger model.