Marah I Abdin
Synthetic data research for large language models
I talk to LLMs for a living.

Synthetic data researcher. I lead the synthetic data team at Poolside, building the Laguna large language models. Before that, the Phi models at Microsoft Research.

Experience
Poolside AI
2025 – Present
Team Lead, Synthetic Data
Started and now lead the synthetic data team, with scope across pre-training, post-training, RL, and agents. The work spans the many scales, domains, and pipeline complexities that make up Laguna's 5T+ synthetic token corpus. A move from big-lab research to a fast-growing startup, in a role that balances leading the team with remaining a hardcore IC.
Microsoft Frontiers & Microsoft AI (MAI)
2023 – 2025
MTS / Researcher
Worked on the Phi open-source models, from Phi-1.5 through Phi-4-reasoning, across the Physics of AGI and AI Frontiers groups within Microsoft Research, and at Microsoft AI working toward MAI-1. My synthetic data contributions focused on scale, quality, diversity, and high-reasoning capabilities in STEM and code.
Microsoft Research
2019 – 2023
Research & engineering
Across Microsoft Research labs in Redmond and New York, working on NLP, human-centered and applied AI, urban environmental sensing, and statistical and network-science research.
Research interests

My research focuses on expanding what synthetic data can teach models at all stages of training. Rather than treating generation as a way to collect more tokens, I treat synthesis as an engineering problem: how generated data can complement organic data, regularize how content is presented, and, in more complex ways, vary how it is taught.

My work ranges from simple transformations that vary expression while preserving signal, to compositional multi-step pipelines that change how concepts are taught. Moreover, I am interested in systematically experimenting with synthesis methods to understand which forms of generated data contribute useful training signal, preserve diversity, and compound into new capability rather than simply imitate a stronger model.

Figure: spectrum from "Organic / seed-heavy" to "Synthetic / pipeline-heavy", labeled: lexical regularization compositional distillation
Selected work
Laguna M.1 / XS.2 Technical Report. J. Abadji, M. Abdin, C. Adams, et al. 2026
Entropy by Design: Synthetic Data at Scale. M. Abdin · NeurIPS talk. 2025
Phi-4 Technical Report. M. Abdin, J. Aneja, H. Behl, S. Bubeck, et al. 2024
On the Diversity of Synthetic Data and its Impact on Training LLMs. H. Chen, A. Waheed, X. Li, …, M. Abdin. 2024
© Marah Abdin