Turn any recording into a watchable, captioned video in minutes.

Drop a podcast, voice memo, webinar export, or interview MP3. ngram transcribes the audio, plans visual scenes around the spoken content, and renders a branded video with burned-in captions.

Input: Audio to Video
Upload audio · Paste a URL · Paste a transcript

Drop an audio file, or click to browse

MP3, WAV, M4A, AAC, OGG, FLAC · up to 500 MB


Trusted by teams at

Salesforce
HubSpot
PayPal
Snap Inc.
Rocket Mortgage
Tektronix
Diligent
Times Internet
Fivetran
Demandbase
Eightfold AI
PingCAP
Quizizz
Apryse
Sandbox VR
Improvado
Taggbox
Matrixport
Glasswall
ContractSafe

How it works

Four steps. About three minutes of waiting.

No Premiere project, no waveform-and-static-image trick, no manual scene-by-scene work. Drop the audio, accept the storyboard, ship a branded video.

01

Drop your audio in

MP3, WAV, M4A, AAC, OGG, or FLAC up to 500 MB. Paste a hosted podcast link, or a transcript if you don't have the recording yet.
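
If you script uploads, a pre-flight check along these lines can catch rejected files before they leave your machine. This is a minimal sketch, not ngram's own validation: the function name is hypothetical, and it assumes "500 MB" means 500 × 1024 × 1024 bytes and that the format is judged by file extension alone.

```python
from pathlib import Path

# Formats listed on this page; extension matching is an assumption.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac", ".ogg", ".flac"}
# Assumed interpretation of the 500 MB limit (binary megabytes).
MAX_BYTES = 500 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append("file is larger than 500 MB")
    return problems
```

An empty list means the file passes both checks; anything else is a human-readable reason to show the user.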

02

AssemblyAI transcribes the track

Speaker turns, topic shifts, quotable lines and natural section breaks come back as timestamps. The transcript becomes the script the storyboard hangs off.

03

ngram plans the visuals

The agent maps each section to a scene — AI imagery, motion text, B-roll, or speaker card — and stamps the brand kit on every frame and caption.

04

Render and publish

Export in 16:9, 1:1, and 9:16 in one render. Push to a /watch/ link, drop to LinkedIn or YouTube, or hand off to the timeline editor.

Output controls

Smart defaults for podcasts. Real knobs when you need them.

Transcript-driven scenes

Every scene is bound to a transcript range. Trim the script, the visuals follow — no dragging clips on a timeline to keep things in sync.
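
One way to picture the binding: each scene stores the start and end of its transcript range, so cutting a span of the script is enough to recompute every scene. The sketch below is illustrative only. `Scene` and `trim` are hypothetical names, times are in seconds, and ngram's actual data model is not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    label: str
    start: float  # seconds into the transcript
    end: float


def trim(scenes: list[Scene], cut_start: float, cut_end: float) -> list[Scene]:
    """Remove [cut_start, cut_end) from the script and re-time every scene."""
    removed = cut_end - cut_start

    def remap(t: float) -> float:
        if t <= cut_start:
            return t              # before the cut: unchanged
        if t >= cut_end:
            return t - removed    # after the cut: shifted earlier
        return cut_start          # inside the cut: collapses onto the cut point

    result = []
    for scene in scenes:
        start, end = remap(scene.start), remap(scene.end)
        if end > start:           # drop scenes that were cut away entirely
            result.append(Scene(scene.label, start, end))
    return result
```

Cutting 15 s to 25 s from a 50-second storyboard shortens the scene that spans the cut and slides the later scenes earlier, with no clip dragging.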

Burned-in branded captions

Captions sit on every export by default, styled by the brand kit: font, weight, position, accent color. Toggle to .srt or off per render.
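
For reference, an .srt file is a plain-text list of numbered cues, each with a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line. The sketch below shows how timestamped caption lines could be written in that layout. The function names and the `(start, end, text)` cue tuples are assumptions for illustration, not ngram's export code.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) cues as SRT blocks separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)
```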

Scene art per segment

AI imagery, B-roll, lower-thirds and pull-quote cards swap automatically when the topic shifts. No flat waveform-over-headshot trope.

Three ratios per render

16:9 for YouTube, 1:1 for the LinkedIn feed, 9:16 for Reels, Shorts, and TikTok — smart-reframed from one storyboard.
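
To see what reframing involves geometrically, here is the simplest possible version: a centered crop of one master frame to each target ratio. This is a stand-in, not ngram's smart reframe, which presumably follows the subject rather than the frame center. The function name and the even-dimension rounding (a common encoder requirement) are assumptions.

```python
def center_crop(src_w: int, src_h: int, ratio_w: int, ratio_h: int) -> tuple[int, int, int, int]:
    """Return (x, y, width, height) of the largest centered crop with the given ratio."""
    if src_w * ratio_h > src_h * ratio_w:
        # Source is wider than the target: keep full height, crop the sides.
        height = src_h
        width = src_h * ratio_w // ratio_h
    else:
        # Source is as narrow or narrower: keep full width, crop top and bottom.
        width = src_w
        height = src_w * ratio_h // ratio_w
    # Round down to even dimensions, which many video encoders expect.
    width -= width % 2
    height -= height % 2
    return ((src_w - width) // 2, (src_h - height) // 2, width, height)
```

From a 1920×1080 master, the 1:1 crop is a 1080×1080 square starting at x = 420, and the 9:16 crop is a 606×1080 strip starting at x = 657.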

Beds that don't fight the speaker

Licensed background music from the S3 library, auto-ducked beneath voiced segments and brought up under silence and titles.
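
Ducking comes down to a gain envelope for the music bed: low while someone is speaking, full level otherwise, with a short ramp in between. The sketch below assumes the voiced segments come from the transcript timestamps. The -18 dB depth, the 0.5-second fade, and the linear-in-dB ramp are illustrative assumptions, not ngram's settings.

```python
def music_gain_db(
    t: float,
    voiced: list[tuple[float, float]],
    duck_db: float = -18.0,  # assumed ducking depth
    fade: float = 0.5,       # assumed ramp length in seconds
) -> float:
    """Gain (in dB) to apply to the music bed at time t."""
    depth = 0.0  # 0 = full level, 1 = fully ducked
    for start, end in voiced:
        if start <= t <= end:
            return duck_db  # someone is speaking: fully ducked
        # Distance from t to the nearest edge of this voiced segment.
        gap = start - t if t < start else t - end
        if gap < fade:
            depth = max(depth, 1.0 - gap / fade)
    return duck_db * depth if depth else 0.0
```

With speech from 2 s to 5 s, the bed plays at full level at 0 s, sits at -9 dB a quarter second before the speaker starts, and holds at -18 dB while they talk.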

Clip out the highlights

Pick a quotable 30–90 second chunk and export it as a standalone clip — same visuals, same brand, vertical-ready.

Translate the voiceover

Regenerate the spoken track in any ElevenLabs-supported language, with translated captions and on-screen text re-rendered to match.

Source files gone in 24h

Audio uploads are processed in-region, encrypted at rest, and never used to train models. Source files are auto-deleted after 24 hours.

Use cases

Where audio-driven video earns its place.

Creator social clips

Podcast highlights for LinkedIn and Shorts

Turn each 45-minute episode into 6–10 captioned, branded clips formatted for LinkedIn, Reels, and Shorts — the math that pays for the show.

DevRel conference talk

Conference audio into a branded recap

A 30-minute talk recording becomes a tight visual recap with quote callouts, captions, and brand-aligned scenes — ready to share before the event ends.

Customer testimonial

Voicemail testimonials into visual proof

Take a recorded customer voice memo or call clip, sync it to a branded scene with their company logo, and ship a testimonial card without filming.

Webinar clips

Webinar audio into shareable clips

Pull the audio export from a webinar tool and let ngram cut the strongest 60–90 second moments into vertical-ready videos with captions.

Marketing webinar clips

One webinar audio, a month of marketing

Marketing teams point one audio export at ngram and walk away with a launch teaser, a long-form recap, and twelve social clips on brand.

Marketing social clips

Voice memos into demand-gen posts

A sales lead drops a voice memo about a customer win; ngram turns it into a captioned LinkedIn video with brand colors before the standup ends.

LinkedIn video

Founder voice notes into LinkedIn posts

Founders dictate a take into their phone, drop the file, and ship a captioned LinkedIn video that reads like a post but earns the algorithm's video boost.

Training video

Recorded SME audio into onboarding video

Recorded subject-matter-expert interviews and SOP audio become structured onboarding videos with captions, callouts, and section dividers.

Newsletter video

Audio newsletters into embeddable video

Convert the audio version of your newsletter into a captioned, branded video readers can watch in the inbox instead of clicking a podcast app.


Tools that pair with this converter

Sharpen the source. Edit the output.


How it compares

If you've been using something else to turn audio into video.

Headliner and Wavve put a waveform over a still image. Descript edits the transcript but leaves the visuals to you. ngram plans the scenes, applies the brand, and renders the captioned video in one pass.

| Feature | ngram | Headliner | Descript | Wavve |
| --- | --- | --- | --- | --- |
| Visual treatment per segment | Scene-matched art, B-roll, lower-thirds, quote cards | Waveform + still image | Manual scene work | Waveform + still image |
| Transcription engine | AssemblyAI with topic and speaker breaks | In-house transcription | In-house transcription | In-house transcription |
| Brand kit applied automatically | Logo, fonts, colors, intro and outro on every render | Template-level only | Manual per project | Template-level only |
| Multi-format export in one render | 16:9, 1:1, 9:16 from one storyboard | One ratio per export | One ratio per export | One ratio per export |
| Translation and re-voice | Translate transcript, regenerate voiceover, re-render captions | No | Translation as separate flow | No |
| Max input file size | 500 MB per file | Around 200 MB | Higher on paid | Around 100 MB |
| API and webhooks | REST API, MCP, n8n, Zapier, webhooks | API on enterprise | | |
| Source files retained | Auto-deleted in 24h | Variable | Project-bound | Variable |

FAQ

Common questions about audio to video

MP3, WAV, M4A, AAC, OGG, and FLAC, plus most other browser-playable audio formats. Up to 500 MB per file. You can also paste a hosted podcast link from Spotify, Anchor, Apple Podcasts, Riverside, Drive, Dropbox, or S3, or hand over a transcript instead of the recording.

Still curious?

Audio → Video

Ready to turn your audio into a video your audience will actually watch?

Drop the MP3, watch the storyboard land in under a minute, ship the captioned video to LinkedIn, YouTube, or Shorts.