Turn any recording into a watchable, captioned video in minutes.
Drop a podcast, voice memo, webinar export, or interview MP3. ngram transcribes the audio, plans visual scenes around the spoken content, and renders a branded video with burned-in captions.
Drop an audio file, or click to browse
MP3, WAV, M4A, AAC, OGG, FLAC · up to 500 MB
How it works
Four steps. About three minutes of waiting.
No Premiere project, no waveform-and-static-image trick, no manual scene-by-scene work. Drop the audio, accept the storyboard, ship a branded video.
Drop your audio in
MP3, WAV, M4A, AAC, OGG, or FLAC up to 500 MB. Paste a hosted podcast link, or a transcript if you don't have the recording yet.
AssemblyAI transcribes the track
Speaker turns, topic shifts, quotable lines and natural section breaks come back as timestamps. The transcript becomes the script the storyboard hangs off.
ngram plans the visuals
The agent maps each section to a scene — AI imagery, motion text, B-roll, or speaker card — and stamps the brand kit on every frame and caption.
Render and publish
Export in 16:9, 1:1, and 9:16 in one render. Push to a /watch/ link, drop to LinkedIn or YouTube, or hand off to the timeline editor.
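For teams scripting uploads, the input limits from step one can be sketched as a pre-flight check. This is a minimal illustration, not ngram's own validation: the function name `check_upload` is hypothetical, and reading "500 MB" as 500 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Formats and size cap as listed on this page. Treating "500 MB" as
# 500 * 1024 * 1024 bytes is an assumption, not a documented limit.
ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac", ".ogg", ".flac"}
MAX_UPLOAD_BYTES = 500 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'no extension'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds 500 MB")
    return problems
```

Running the check locally before an upload catches a wrong export format or an oversized file without waiting on a failed transfer.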
Output controls
Smart defaults for podcasts. Real knobs when you need them.
Transcript-driven scenes
Every scene is bound to a transcript range. Trim the script, the visuals follow — no dragging clips on a timeline to keep things in sync.
Burned-in branded captions
Captions sit on every export by default, styled by the brand kit: font, weight, position, accent color. Switch to .srt export or turn captions off per render.
Scene art per segment
AI imagery, B-roll, lower-thirds and pull-quote cards swap automatically when the topic shifts. No flat waveform-over-headshot trope.
Three ratios per render
16:9 for YouTube, 1:1 for the LinkedIn feed, 9:16 for Reels, Shorts, and TikTok — smart-reframed from one storyboard.
Beds that don't fight the speaker
Licensed background music from the S3 library, auto-ducked beneath voiced segments and brought up under silence and titles.
Clip out the highlights
Pick a quotable 30–90 second chunk and export it as a standalone clip — same visuals, same brand, vertical-ready.
Translate the voiceover
Regenerate the spoken track in any ElevenLabs-supported language, with translated captions and on-screen text re-rendered to match.
Source files gone in 24h
Audio uploads are processed in-region, encrypted at rest, never used to train models, and deleted automatically after 24 hours.
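The idea behind transcript-driven scenes, where trimming the script re-times the visuals, can be sketched with a small data model. This is an illustration under stated assumptions, not ngram's internal representation: the `Scene` class and `trim` function are hypothetical names, and times are plain seconds into the transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    label: str
    start: float  # seconds into the transcript
    end: float


def trim(scenes: list[Scene], cut_start: float, cut_end: float) -> list[Scene]:
    """Remove [cut_start, cut_end) from the transcript and re-time the scenes."""
    removed = cut_end - cut_start
    result = []
    for s in scenes:
        # Length of this scene that falls inside the cut.
        overlap = max(0.0, min(s.end, cut_end) - max(s.start, cut_start))
        duration = (s.end - s.start) - overlap
        if duration <= 0:
            continue  # the scene sat entirely inside the cut
        if s.start >= cut_end:
            new_start = s.start - removed  # after the cut: shift earlier
        elif s.start >= cut_start:
            new_start = cut_start  # began inside the cut: snap to the cut point
        else:
            new_start = s.start  # before the cut: unchanged
        result.append(Scene(s.label, new_start, new_start + duration))
    return result
```

Because each scene carries only a transcript range, cutting a passage shortens or drops the affected scenes and slides the later ones forward, so the storyboard stays contiguous without manual timeline edits.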
The rest of ngram
Audio to video is the front door. These run the rest of the pipeline.
Script Generation
Once your audio is transcribed, the agent tightens the spoken track into a publishable script — hook in the first line, body, closing CTA.
Learn more
AI Visuals
Scene-matched imagery generated from the transcript so each topic in the audio gets a distinct visual treatment instead of a static cover image.
Learn more
Captions
Burned-in branded captions on every render, frame-aligned to the original audio waveform — the key value for muted-feed playback.
Learn more
Brand Kit
Logo, fonts, colors, intro and outro applied across every scene so a podcast feed and a launch video look like the same brand.
Learn more
Multi-format Export
Smart-reframe the same audio-driven storyboard to 16:9 YouTube, 1:1 LinkedIn, and 9:16 Shorts in a single render.
Learn more
Translation
Translate the transcript, regenerate the voiceover, and re-render captions — turn one English podcast into localized video for every key market.
Learn more
Use cases
Where audio-driven video earns its place.
Podcast highlights for LinkedIn and Shorts
Turn each 45-minute episode into 6–10 captioned, branded clips formatted for LinkedIn, Reels, and Shorts — the math that pays for the show.
See use case
Conference audio into a branded recap
A 30-minute talk recording becomes a tight visual recap with quote callouts, captions, and brand-aligned scenes — ready to share before the event ends.
See use case
Voicemail testimonials into visual proof
Take a recorded customer voice memo or call clip, sync it to a branded scene with their company logo, and ship a testimonial card without filming.
See use case
Webinar audio into shareable clips
Pull the audio export from a webinar tool and let ngram cut the strongest 60–90 second moments into vertical-ready videos with captions.
See use case
One webinar audio, a month of marketing
Marketing teams point one audio export at ngram and walk away with a launch teaser, a long-form recap, and twelve social clips on brand.
See use case
Voice memos into demand-gen posts
A sales lead drops a voice memo about a customer win; ngram turns it into a captioned LinkedIn video with brand colors before the standup ends.
See use case
Founder voice notes into LinkedIn posts
Founders dictate a take into their phone, drop the file, and ship a captioned LinkedIn video that reads like a post but earns the algorithm's video boost.
See use case
Recorded SME audio into onboarding video
Recorded subject-matter-expert interviews and SOP audio become structured onboarding videos with captions, callouts, and section dividers.
See use case
Audio newsletters into embeddable video
Convert the audio version of your newsletter into a captioned, branded video readers can watch in the inbox instead of clicking a podcast app.
See use case
Other converters
Coming from somewhere else? There's a converter for that.
Same transcribe-then-storyboard pipeline, different inputs. Audio to video is one of 17 converters that share the brand kit, security model, and render stack.
The reverse trip. Strip a clean MP3 or WAV out of any video for a podcast feed, a transcript, or a translation pass.
Open converter
Closest cousin of audio to video. Long-form recording in, 8–12 standalone short-form clips out, captions and brand applied.
Open converter
If your audio came from a Loom or screen capture, run it through here; the visuals carry weight your podcast art can't.
Open converter
Tools that pair with this converter
Sharpen the source. Edit the output.
Polishing the source audio
Fix the recording before the storyboard runs
Background Noise from Audio
Strip room tone and HVAC hum from podcast and voice-memo uploads so the transcript and the rendered voiceover both stay clean.
Open tool
Audio to Text
Run an audio file through AssemblyAI on its own when you want the transcript first — then drop it back into the converter as the script.
Open tool
AI Voice Dubber
Re-voice a non-English podcast into English (or the other direction) before you convert it to branded video for a new market.
Open tool
AI Voice Generator
If you only have a script, generate the spoken audio first with the brand voice, then feed it back into audio to video.
Open tool
Editing the rendered video
Take the rendered audio-to-video further
Video Editor
Open the audio-to-video render on a real timeline: trim scenes, shift captions, and swap visuals before publishing.
Open tool
Video Cutter
Trim by transcript, not timecode. Pick the 60-second quote, export it as a standalone short.
Open tool
Add Subtitles to Video
Burn in or export .srt subtitles in any language for renders headed to muted-autoplay feeds or international audiences.
Open tool
Add Music to Video
Swap the background bed under the spoken track — pick a different mood or upload a licensed track of your own.
Open tool
Generating from scratch
If you don't have audio yet
Text to Speech Video
No recording? Type the script and ngram generates the voiceover and the video together, with the same pipeline downstream.
Open tool
AI Avatar Video Generator
Pair the generated voiceover with an avatar host so the result feels like a hosted segment instead of a faceless narration.
Open tool
Video Script Generator
Draft the spoken script before you record, so the audio you hand to the converter already has structure and a CTA.
Open tool
Text to Video
Skip recording entirely. Type the talking points and let ngram script, voice, and visualize — same look as audio-to-video.
Open tool
Built for teams
Who reaches for audio to video in your company?
Content Creators
Podcasters and indie creators turning each episode into the social-video stream the algorithm actually rewards.
See workflows
Product Marketing
Convert recorded customer calls, founder voice memos, and webinar audio into branded video for launches and lifecycle campaigns.
See workflows
Developer Relations
Take conference talk audio, podcast appearances, and meetup recordings and ship branded recaps before the event hashtag cools down.
See workflows
Growth Marketing
Push paid social with voiced creative pulled from existing audio assets — testimonial calls, founder takes, internal interviews.
See workflows
Customer Success
Turn customer-call recordings into testimonial videos, QBR moments, and onboarding clips without a production loop.
See workflows
Founders
Dictate a take on the way to the office and ship a captioned LinkedIn video before the first standup of the day.
See workflows
Sales Enablement
Convert win-call audio and SME interviews into objection-handling videos that reps can actually drop into a deal cycle.
See workflows
Agencies
Spin up branded videos for every client from their own podcast feed, founder interviews, and recorded discovery sessions.
See workflows
Integrations
Triggers, not logos. Wire audio to video into the tools you already run.
Every integration ships with a working template tuned for audio-driven workflows. Start from one, or build your own with the REST API and webhooks.
When: A new podcast episode lands in your RSS feed or hosting tool
Then: Run audio to video and post the social clips in #marketing
When: Claude or ChatGPT is handed an MP3 of a customer call
Then: Convert the audio to a captioned testimonial video and return the share link
When: A self-hosted workflow lands a finished podcast WAV on S3
Then: Trigger an audio-to-video render without the recording leaving your VPC
When: Riverside or Descript finishes exporting a podcast episode
Then: Build an audio-to-video render and attach the share link in HubSpot
When: You hit 'Convert to video' on a Spotify or Apple Podcasts episode page
Then: Get back a captioned, branded video version in a new tab
When: An audio-to-video render finishes for an episode
Then: Push the 16:9 export and 9:16 Shorts cut straight to your YouTube channel
When: A founder voice memo finishes converting
Then: Schedule the 1:1 captioned video to the LinkedIn page on your cadence
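For teams building their own trigger with webhooks, the routing in the last two templates might look like the sketch below. The payload shape is an assumption: the `event`, `exports`, `ratio`, and `url` fields and the `render.finished` event name are hypothetical, since the webhook schema is not documented on this page. Only the ratio-to-destination mapping comes from the templates above.

```python
# Hypothetical payload fields; ngram's real webhook schema is not shown
# on this page. The ratio-to-destination mapping follows the templates.
DESTINATIONS = {
    "16:9": "youtube",
    "9:16": "youtube_shorts",
    "1:1": "linkedin",
}


def route_exports(payload: dict) -> list[tuple[str, str]]:
    """Pair each finished export URL with a destination based on its ratio."""
    if payload.get("event") != "render.finished":
        return []  # ignore events this handler does not care about
    routed = []
    for export in payload.get("exports", []):
        destination = DESTINATIONS.get(export.get("ratio"))
        if destination:
            routed.append((destination, export["url"]))
    return routed
```

Keeping the mapping in one dictionary means adding a destination for a new ratio is a one-line change rather than another branch in the handler.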
How it compares
If you've been using something else to turn audio into video.
Headliner and Wavve put a waveform over a still image. Descript edits the transcript but leaves the visuals to you. ngram plans the scenes, applies the brand, and renders the captioned video in one pass.
| Feature | ngram | Headliner | Descript | Wavve |
|---|---|---|---|---|
| Visual treatment per segment | Scene-matched art, B-roll, lower-thirds, quote cards | Waveform + still image | Manual scene work | Waveform + still image |
| Transcription engine | AssemblyAI with topic and speaker breaks | In-house transcription | In-house transcription | In-house transcription |
| Brand kit applied automatically | Logo, fonts, colors, intro and outro on every render | Template-level only | Manual per project | Template-level only |
| Multi-format export in one render | 16:9, 1:1, 9:16 from one storyboard | One ratio per export | One ratio per export | One ratio per export |
| Translation and re-voice | Translate transcript, regenerate voiceover, re-render captions | No | Translation as separate flow | No |
| Max input file size | 500 MB per file | Around 200 MB | Higher on paid | Around 100 MB |
| API and webhooks | REST API, MCP, n8n, Zapier, webhooks | — | API on enterprise | — |
| Source files retained | Auto-deleted in 24h | Variable | Project-bound | Variable |
FAQ
Common questions about audio to video
Still curious?
Audio → Video
Ready to turn your audio into a video your audience will actually watch?
Drop the MP3, watch the storyboard land in under a minute, ship the captioned video to LinkedIn, YouTube, or Shorts.