The Missing Soundtrack: Why Audio is AI's Hardest Frontier
💙 This month we take a deep dive into why audio is the final frontier of generative AI, featuring an interview with the founder of Mirelo AI, the research lab building audio models for videos.
New Renaissance Notes: a monthly newsletter where we share insights about the intersection of culture, creativity and technology. We include our updates from New Renaissance Ventures - Europe’s first venture fund dedicated to the Creative and Cultural Industries.
If you haven’t subscribed yet you can join our journey here:
See here for our previous edition: ‘The Child Who Never Had to Struggle’
💙 NRV News:
🇬🇷 This week found us in Athens for Panathēnea, May 27–29. It was an incredible few days, highlighted by Severin taking the stage to speak on ‘The Solo GP Playbook’ panel alongside an exceptional lineup, sharing the event's stages with teams from ElevenLabs, Runway, Sequoia, Index, and Balderton. Thank you to everyone who came out to find us!

🎧 This month we are speaking on the Opus 1 Foundation podcast with its founder & CEO Christopher Coritsidis. The New York-based nonprofit activates scalable solutions to social and economic challenges through the power of the arts, advancing opportunities for greater equity, inclusion and innovation worldwide. Want to know why we are backing the creative sector? Listen to the full episode here:
🪩 We’ll also be at SXSW London during the first week of June and in Berlin a week later for SuperVenture. Reach out if you’ll be in town!
🇦🇹 We’re excited to share that the next CultTech Summit will take place in October in Vienna. New Renaissance Ventures is a proud partner of the Summit, and we are hosting two days of masterclasses on generative media during the official Summit program. The best way to stay tuned for our speaker lineup is to get your super early bird ticket.
💙 New Renaissance Talks
E7: Building the “Figma for Sound” (And Raising a $41m Seed from Index and a16z) with Carl Johann Simon-Gabriel, CEO and Co-Founder of Mirelo AI
In this episode, Severin Zugmayer interviews Carl Johann Simon-Gabriel (CJ), CEO and Co-Founder of Mirelo AI, maker of a market-leading foundation model for generating synced sound and music for videos. CJ shares his journey from classically trained organist and ML researcher at AWS to building what he describes as "the Figma for sound design," discussing why audio is the hardest modality in generative AI, the evolution from pure music generation to video-synced sound effects, and the unexpected importance of soft skills and co-founders when scaling a startup.
You can listen to the full episode on Spotify here:
How to Win by Not Following the Crowd
CJ Simon-Gabriel was at AWS’s AI research lab when ChatGPT arrived. The directive from top management was clear: pivot to language models. Everyone did. His friends at Google Brain, at Facebook Research, all redirected to LLMs overnight.
CJ went the other way.
Speaking on New Renaissance Talks, he recalled realizing that if transformers could revolutionize text, they could do the same for music.
This was mid-2023. Before Suno and Udio existed. He founded Mirelo in November 2023, but by the time their music model was fully functional in April 2024, the rest of the world had caught up. So, CJ pivoted again. The real problem wasn't music-for-music; it was audio-for-video.
“Very quickly we decided to focus not just on music for the sake of music, but music for videos. And this morphed into sound effects for videos. And maybe one day all the audio tools for videos,” CJ shares.
Today, Mirelo Studio lets you upload any video and get a synced soundtrack in seconds. Each element sits on its own stem, with four versions of each stem. And if you dislike one, you regenerate just that section.
With an open API, native plugins for Adobe Premiere and DaVinci Resolve, and a recent $41 million seed round co-led by Index Ventures and Andreessen Horowitz, Mirelo is proving that the road less traveled was the right one.
Why Audio Is Harder Than It Looks
Here’s what makes audio uniquely brutal as an AI problem:
It’s three distinct modalities pretending to be one. Sound effects operate on milliseconds. Music on bars and phrases. Speech on sentences. They have different temporal structures, different training data, different evaluation benchmarks. A company that nails speech (like ElevenLabs) doesn’t automatically nail footsteps. A company that nails music (like Suno) doesn’t automatically nail door creaks.
The research gap is real.
"If you look at machine learning publications, 50% is large language models, the other 50% is computer vision. And in between you have a bit of audio," CJ explains.
Because everyone raced to build the exact same text and image tools, audio became the road not taken. Mirelo had to do fundamental research where competitors could just copy existing papers.
The audio data is entangled. A movie soundtrack is a mix of dialogue, music, and effects baked together. To train a model that generates clean stems, you need clean stems to train on, and those are rare, expensive, and often locked inside studio archives.
Timing is everything. According to CJ, timing takes absolute priority over lyrics. A gunshot that lands half a second late doesn't feel cinematic, it feels broken. In visual media, sync is the entire product.
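To put a number on that half-second, here is a minimal Python sketch that converts a sync error into video frames. The function names, the 24 fps frame rate and the 40 ms tolerance are our own illustrative assumptions, not figures from CJ or Mirelo.

```python
def sync_offset_frames(event_s, onset_s, fps=24.0):
    """Signed offset in video frames; positive means the sound lands late."""
    return (onset_s - event_s) * fps


def is_in_sync(event_s, onset_s, tolerance_s=0.04):
    """True when the sound lands within the tolerance of the visual event.

    The 40 ms default is an assumed, illustrative threshold.
    """
    return abs(onset_s - event_s) <= tolerance_s


# The gunshot from the text: muzzle flash at 10.0 s, sound at 10.5 s.
print(sync_offset_frames(10.0, 10.5))  # 12.0 (frames late at 24 fps)
print(is_in_sync(10.0, 10.5))          # False
```

At 24 frames per second, half a second is twelve full frames of picture with the wrong sound underneath, which is why the error reads as a bug and not as a stylistic choice.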
The Figma of Sound
"What Figma was for design, we want to be for sound design," - CJ
The comparison is precise. Before Figma, design was a specialist discipline locked inside expensive isolated tools. Figma didn’t replace professional designers, it expanded who could participate. Non-designers could comment, inspect, contribute. Designers got faster. The whole category grew.
Mirelo wants the same for audio. YouTubers, indie game developers, marketing teams creating video content: none of them are sound designers. Most of them treat audio as the annoying, manual final step.
The goal is to make audio stop being the bottleneck that makes creators miserable.
And here’s the counterintuitive part: the professionals benefit most. CJ draws a parallel to coding, explaining that generative AI elevates experienced professionals far more than amateurs. Just as a 10x programmer derives the greatest utility from AI assistants, the creators producing the most compelling AI-generated videos today are almost always individuals who already have deep backgrounds working inside professional video studios.
Why Audio Might Get Absorbed, Not Disrupted
The bear case for dedicated audio startups is simple: video models like Sora and Veo are starting to generate native sound automatically. If your business only exists to unmute silent clips, what happens when videos are no longer born silent?
CJ’s answer is layered:
Quality: Native model sync is still rough—good for quick demos, unusable for production.
Control: Native audio comes as a single, flattened track. You can’t isolate footsteps from ambient noise or the musical score.
Emotion: Film audio is inherently intentional. Just as directors discard on-camera audio to manually craft the soundscape in post-production, creators want independent control over sound to build psychological tension. Audio remains a separate layer because creators want it that way.
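The control point is easy to see in a toy example. The Python sketch below treats a mix as a plain sample-by-sample sum of stems; the stem names and sample values are made up for illustration, and real mixing involves far more than addition.

```python
def mix(stems, gains=None):
    """Sum equal-length stems sample by sample, with an optional gain per stem."""
    gains = gains or {}
    length = len(next(iter(stems.values())))
    out = [0.0] * length
    for name, samples in stems.items():
        gain = gains.get(name, 1.0)
        for i, sample in enumerate(samples):
            out[i] += gain * sample
    return out


# Hypothetical stems, four samples each.
stems = {
    "footsteps": [0.5, 0.0, 0.5, 0.0],
    "ambience": [0.125, 0.125, 0.125, 0.125],
    "score": [0.25, 0.25, 0.25, 0.25],
}

flat = mix(stems)                              # the single flattened track
no_footsteps = mix(stems, {"footsteps": 0.0})  # trivial when you hold the stems

# A completely different set of stems sums to the very same flat track,
# so the flat track alone cannot tell you what the footsteps were.
other = {"footsteps": [0.0] * 4, "everything_else": flat}
print(mix(other) == flat)  # True
```

With stems, muting the footsteps is a single gain change. With only the flattened track, many different stem combinations are consistent with what you hear, so pulling one layer out becomes an estimation problem.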
Want to experience this firsthand? Head to Mirelo AI, sign up for Mirelo Studio, and claim your free credits to start generating synced sound for your own videos.
Looking Forward: 5 Unsolved Gaps in AI Audio
Mirelo is fixing audio for video, but the wider generative music industry still has major problems to solve.
To map out where the next wave of opportunity lies, we draw on the work of NRV VC Scout Valerio Velardo, an AI music consultant bridging machine learning and composition, who identified the critical gaps that current generative systems are failing to close.
If you look under the hood of today's mainstream consumer tools, there are five major gaps waiting for the next generation of founders to solve:
Songs with No Direction
AI models currently sound convincing moment-to-moment, but they lack structural memory. There is no buildup, payoff, or long-term form. The fix requires moving toward hierarchical AI architectures: a structural "conductor" model that dictates the overall composition path, feeding commands down to a local "performer" model handling the immediate sonic details.
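As a rough illustration of the conductor and performer split, here is a toy Python sketch. Everything in it is hypothetical: the section plan is hard-coded, and the "performer" only picks a note density per bar, where a real system would use learned models at both levels.

```python
import random
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    bars: int
    energy: float  # 0.0 (sparse) to 1.0 (full intensity)


def conductor():
    """Long-range plan: decides the form before any sound is generated."""
    return [
        Section("intro", 4, 0.25),
        Section("build", 8, 0.5),
        Section("payoff", 8, 1.0),
        Section("outro", 4, 0.25),
    ]


def performer(section, rng, max_notes=16):
    """Local detail: notes per bar, varied slightly around the section's target."""
    target = max(1, round(section.energy * max_notes))
    return [
        min(max_notes, max(1, target + rng.randint(-1, 1)))
        for _ in range(section.bars)
    ]


def compose(seed=0):
    rng = random.Random(seed)
    return [(s.name, performer(s, rng)) for s in conductor()]


for name, densities in compose():
    print(name, densities)
```

The performer never sees the whole piece. It only receives the current section's instructions, which is the point: buildup and payoff come from the plan, not from hoping local choices add up to a form.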
Great Tone, Zero Grammar
Today's mainstream models copy the surface texture of music but don’t understand its grammar. They replicate statistical patterns without grasping harmony or voice-leading logic. The industry must move toward hybrid AI systems that combine deep learning (for texture) with symbolic, rules-based layers (for harmony and form).
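A minimal Python sketch of the hybrid idea, under our own assumptions: a stand-in score plays the role of the neural model's preference, and a single symbolic rule (no parallel perfect fifths between two voices) acts as the grammar layer. Real systems would need many more rules and a real model; the names here are hypothetical.

```python
def has_parallel_fifths(upper, lower):
    """True if two voices (MIDI note numbers) move in consecutive perfect fifths."""
    for i in range(len(upper) - 1):
        now = (upper[i] - lower[i]) % 12
        nxt = (upper[i + 1] - lower[i + 1]) % 12
        moved = upper[i] != upper[i + 1]
        if now == 7 and nxt == 7 and moved:
            return True
    return False


def pick(candidates):
    """candidates: (score, upper_voice, lower_voice) tuples.

    The score stands in for a neural model's likelihood. The rule layer
    vetoes illegal candidates before the best remaining one is chosen.
    """
    legal = [c for c in candidates if not has_parallel_fifths(c[1], c[2])]
    return max(legal, key=lambda c: c[0]) if legal else None


candidates = [
    (0.9, [67, 69], [60, 62]),  # G over C, then A over D: parallel fifths
    (0.7, [67, 65], [60, 62]),  # G over C, then F over D: fifth to minor third
]
print(pick(candidates))  # (0.7, [67, 65], [60, 62])
```

The higher-scoring candidate loses because it breaks the rule. That is the division of labor the hybrid approach proposes: the learned model supplies plausible texture, and the symbolic layer decides what is grammatical.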
The “Un-editable” Finished File
Most platforms hand you a flat, static audio file. Want to tweak one chord or swap an instrument? You have to regenerate the entire track from scratch. It is the creative equivalent of rewriting a whole book chapter just to fix a single typo.
Solving this requires multi-modal training that aligns raw audio, musical scores, and semantic tags simultaneously so artists can isolate and edit specific elements. This is precisely what Mozart AI (NRV portfolio company) is building with their Generative Audio Workstation (GAW). Backed by a $6M seed round from Balderton Capital, their platform bypasses flat files entirely by offering context-aware stem generation and granular section editing, allowing producers to pull apart tracks, rearrange digital note progressions, and tweak individual elements without breaking their creative flow.

Prompts Aren’t a Musician’s Language
True creators think in motifs, melodies, and rhythms, not text strings. Forcing an artist to describe a chord progression via a text box ruins the natural creative workflow. The next evolution of audio tech must rely on input-agnostic plugins that accept humming, acoustic sketches, and digital instrument data, integrating directly into existing DAWs to meet musicians where they already work.
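To make the humming case concrete: once a pitch tracker has turned a hummed melody into frequencies, mapping them to notes is a one-line formula. The Python sketch below uses the standard equal-temperament conversion with A4 = 440 Hz as MIDI note 69. The pitch tracker itself is assumed and not shown, and the sample frequencies are illustrative.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_midi(freq_hz):
    """Nearest MIDI note number for a frequency, with A4 = 440 Hz = note 69."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


def midi_to_name(note):
    """Note name with octave, using the convention that MIDI 60 is C4."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"


# Frequencies a (hypothetical) pitch tracker might report for a hummed triad.
hummed = [261.63, 329.63, 392.0]
notes = [hz_to_midi(f) for f in hummed]
print(notes)                             # [60, 64, 67]
print([midi_to_name(n) for n in notes])  # ['C4', 'E4', 'G4']
```

The output is ordinary note data, which is what DAWs already speak, so the musician never has to describe the melody in words.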
The Western Blind Spot
Because the largest models are trained almost exclusively on Western mainstream catalogs, they treat rich global musical heritages, like Maqam, Hindustani classical, or microtonal compositions, as statistical rounding errors. This makes current tools functionally useless for global composers looking to create outside Western pop. The fix is a deliberate shift toward training specialized models on global traditions as foundational, first-class systems rather than creative edge cases.
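The "rounding error" is almost literal. The Python sketch below snaps a pitch, measured in cents above a reference note, to the nearest step of an equal-tempered grid. The 350-cent interval is an illustrative neutral third, and a 24-step grid is itself only a rough approximation of how such traditions are actually tuned.

```python
def snap_cents(cents, divisions=12):
    """Snap a pitch (in cents above a reference) to the nearest equal-tempered step."""
    step = 1200.0 / divisions  # one octave is 1200 cents
    return round(cents / step) * step


neutral_third = 350.0  # illustrative interval between a minor and a major third

error_12 = abs(snap_cents(neutral_third, 12) - neutral_third)
error_24 = abs(snap_cents(neutral_third, 24) - neutral_third)
print(error_12)  # 50.0
print(error_24)  # 0.0
```

On a 12-step Western grid the interval is forced a full quarter tone away from where it belongs, which is enough to turn it into a different interval altogether. A finer grid keeps it intact, which is one small reason the pitch representation has to be chosen with these traditions in mind.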
Solving these complex infrastructure gaps requires the brightest minds in tech and culture to get in the same room. If you want to meet the founders and creators building this future in person, here are the rooms you need to be in next month:
💙 NRV Picks: Creative Tech Events (June)
1-6 June, 2026 SXSW 📍London - SXSW London is the global festival for the convergence of business, technology and creativity. Secure your spot and meet us there!
3 June, AI x Games Dinner - #NYTechWeek 📍New York - Greater Zurich Area and nunu.ai (NRV portfolio company) invite you to an exclusive AI x Gaming dinner in the heart of NYC. Register here.
3 June, ElevenCreative Sessions 📍New York - A pop-up creator session during NY Tech Week. Register here.
9 June, NY MusicTech Meetup & Demo Night 📍New York - A special edition of the monthly NY MusicTech Meetup, as part of New York Music Month. Register here.
9 June, fal x VEED Meetup 📍London - An evening of learning, connecting, and chatting about generative media. Register here.
Thank You For Reading
Thanks for subscribing to New Renaissance Notes - please feel free to share this newsletter with anyone you think would appreciate it! We welcome feedback, so let us know your thoughts!
If you haven’t subscribed yet, you can do so here:
For more regular updates, insights and resources, follow us on LinkedIn.
If you missed the past newsletters, you can catch up here.




