Marah Abdin

Synthetic data researcher. I lead the synthetic data team at Poolside, building the Laguna large language models. Before that, the Phi models at Microsoft Research.

Experience

Poolside AI

2025 – Present

Team Lead, Synthetic Data

Started and now lead the synthetic data team, with scope across pre-training, post-training, RL, and agents. The work spans many scales, domains, and levels of pipeline complexity, which together compose Laguna's 5T+ synthetic token corpus. A move from big-lab research to a fast-growing startup, in a role that balances leading the team with remaining a hardcore IC.

Microsoft Frontiers & Microsoft AI (MAI)

2023 – 2025

MTS / Researcher

Worked on the Phi open-source models, from Phi-1.5 through Phi-4-reasoning, across the Physics of AGI and AI Frontiers groups within Microsoft Research, and at Microsoft AI working toward MAI-1. My synthetic data contributions focused on scale, quality, diversity, and high-reasoning capabilities in STEM and code.

Microsoft Research

2019 – 2023

Research & engineering

Across Microsoft Research labs in Redmond and New York, working on NLP, human-centered and applied AI, urban environmental sensing, and statistical and network-science research.

Research interests

My research focuses on expanding what synthetic data can teach models at all stages of training. Rather than treating generation as a way to collect more tokens, I study synthesis as an engineering problem: how generated data can complement organic data, regularize how content is presented, and vary, in more complex ways, how it is taught.

My work ranges from simple transformations that vary expression while preserving signal, to compositional multi-step pipelines that change how concepts are taught. Moreover, I am interested in systematically experimenting with synthesis methods to understand which forms of generated data contribute useful training signal, preserve diversity, and compound into new capability rather than simply imitate a stronger model.

Selected work

Marah Abdin in conversation at Poolside: Orchestrating Synthetic Data

Laguna M.1 / XS.2 Technical Report. J. Abadji, M. Abdin, C. Adams, et al., 2026

Synthetic data section · §3.2.2 →

Entropy by Design: Synthetic Data at Scale. M. Abdin · NeurIPS talk, 2025

Phi-4 Technical Report. M. Abdin, J. Aneja, H. Behl, S. Bubeck, et al., 2024

On the Diversity of Synthetic Data and its Impact on Training LLMs. H. Chen, A. Waheed, X. Li, …, M. Abdin, 2024

All publications →