Senior Research Engineer - Voice
Synthesia is the world’s leading AI video platform for business, used by over 90% of the Fortune 100. Founded in 2017, the company is headquartered in London, with offices and teams across Europe and the US.
As AI continues to shape the way we live and work, Synthesia develops products to enhance visual communication and enterprise skill development, helping people work better and stay at the center of successful organizations.
Following our recent Series E funding round, where we raised $200 million, our valuation stands at $4 billion. Our total funding exceeds $530 million from premier investors including Accel, NVentures (Nvidia's VC arm), Kleiner Perkins, GV, and Evantic Capital, alongside the founders and operators of Stripe, Datadog, Miro, and Webflow.
What you'll do at Synthesia
As a Research Engineer, you will join a team of 40+ Researchers and Engineers within the R&D Department working on cutting-edge challenges in the Generative AI space, with a focus on creating high-quality, expressive, real-time synthetic voices. Within the team, you’ll have the opportunity to work on the applied side of our research efforts and directly impact solutions that are used worldwide by over 60,000 businesses.
If you are an expert in ML, LLMs, speech generation, or conversational models, this is your chance to make a global impact. You will join our Audio Post-Training Team, which works on generative speech and voice synthesis, ensuring our in-house voice models reach production-level quality, speed, and robustness. Typical projects include:
Develop and evaluate streaming and speech-to-speech systems, enabling low-latency, interactive voice synthesis.
Adapt models for new conditioning inputs (emotion, speed, prosody, speaker control, etc.).
Implement post-training optimization techniques (quantization, pruning, distillation) to improve efficiency and latency in real-time speech generation.
Integrate and test novel architectures, such as neural codecs, diffusion, or flow-matching models, to enhance realism and responsiveness.
Contribute to defining new evaluation metrics for conversational speech, including latency-aware and online MOS prediction systems.
Stay updated with the latest research in audio diffusion, autoregressive models, neural codecs, and multimodal LLMs.
Apply DPO (Direct Preference Optimization) and distillation to fine-tune large-scale speech models.
What we're looking for:
Strong understanding of generative modeling, ideally applied to sequential or multimodal data.
Hands-on experience with large language models (LLMs) or similar transformer-based architectures.
High proficiency in PyTorch, including experience with distributed training and model optimization.
Solid grasp of time-series modeling and tokenization, preferably in the context of audio or speech.
Demonstrated ability to prototype quickly, test hypotheses, and iterate efficiently.
Proven experience in training deep learning models end-to-end, from data preparation to evaluation.
Strong general software engineering skills, enabling contributions to a large, shared research infrastructure.
Nice to have:
Experience with real-time or streaming architectures is a big plus.
Familiarity with state-of-the-art architectures in audio and speech generation (e.g., diffusion models, neural codecs, flow-matching models, autoregressive decoders).
Experience with speech-to-speech or text-to-speech (TTS) systems.
Evidence of original research contributions, such as publications in top-tier venues (e.g., ICASSP, Interspeech, NeurIPS, ICML) or open-source work.