Narration-first DSL + audio pipeline for Remotion videos

Babulus (Voiceover → Video Timing)

Babulus compiles a narration-first DSL into a timed script JSON. Remotion uses that JSON as the source of truth for scene/cue timing by converting seconds → frames at runtime.

The One-Sentence Mental Model

Your .babulus.yml defines IDs + times, Babulus outputs JSON with startSec/endSec, and your Remotion code does two explicit mappings:

  • scene.id → which React scene component to render
  • cue.id → which element/animation to start/show at that time

That’s “the connection”.

Data Shape (JSON)

script.json contains:

  • scenes[]: { id, title, startSec, endSec, cues[] }
  • cues[]: { id, label, startSec, endSec, text, bullets? }
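As a sketch of how that shape might be typed on the Remotion side (these interface names are illustrative, not exported by Babulus; the field names follow the lists above):

```typescript
// Type sketch of the documented script.json shape. `bullets` is optional,
// matching `bullets?` above.
interface Cue {
  id: string;
  label: string;
  startSec: number;
  endSec: number;
  text: string;
  bullets?: string[];
}

interface Scene {
  id: string;
  title: string;
  startSec: number;
  endSec: number;
  cues: Cue[];
}

interface Script {
  scenes: Scene[];
}

// Example: derive a scene's duration in seconds from its bounds.
const sceneDurationSec = (scene: Scene): number => scene.endSec - scene.startSec;
```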

The DSL (YAML)

A .babulus.yml file is a YAML document with a top-level scenes: list.

audio:
  # Optional: default provider for `kind: sfx` clips.
  sfx_provider: elevenlabs

scenes:
  - id: intro
    title: "Intro"
    time: "0s-8s"
    cues:
      - id: hook
        label: "Hook"
        time: "0s-3s"
        voice: "In this video, we'll build an agent."
      - id: bullets
        label: "Bullets"
        time: "3s-8s"
        voice: "We'll cover three things."
        bullets:
          - "Tools"
          - "Memory"
          - "Errors"
      - id: whoosh-demo
        label: "Transition"
        voice: "Now let's transition."
        audio:
          - kind: sfx
            id: whoosh
            at: "+0.0s"    # relative to this cue's start time
            volume: 25%    # accepts 0..1, 0..100, or "80%"
            prompt: "Fast cinematic whoosh transition, clean, no voice"
            duration_seconds: 3
            variants: 8
            pick: 2

Time formats

time may be either:

  • A range string: "12.5s-18.3s"
  • A relative range string inside a timed scene: "+0.5s-+1.2s" (adds the scene’s startSec)

If you omit id for a scene/cue, Babulus derives one from title/label (slugified). It’s optional, but for real projects you usually want explicit IDs so you can rename titles/labels without breaking the Remotion mapping.
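To make the two formats concrete, here is an illustrative parser for the documented range strings (`parseTimeRange` is not Babulus's actual API, just a sketch of the semantics described above):

```typescript
// Illustrative parser for the documented `time` formats:
//   "12.5s-18.3s"   → absolute seconds
//   "+0.5s-+1.2s"   → relative to the enclosing scene's startSec
const parseSec = (tok: string): number =>
  parseFloat(tok.replace(/^\+/, "").replace(/s$/, ""));

function parseTimeRange(
  time: string,
  sceneStartSec = 0,
): { startSec: number; endSec: number } {
  const [a, b] = time.split("-");
  const relative = a.startsWith("+"); // relative ranges add the scene's startSec
  const base = relative ? sceneStartSec : 0;
  return { startSec: base + parseSec(a), endSec: base + parseSec(b) };
}
```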

Compile to JSON (CLI)

Install from PyPI:

pip install babulus

Or, for local development from a clone of this repo:

python -m pip install -e . -U

You can then run either babulus ... (recommended) or python -m babulus ....

Manual timing compile

babulus compile \
  --dsl path/to/video.babulus.yml \
  --out path/to/script.json \
  --pretty

Transcript-driven alignment is supported if you pass --transcript path/to/words.json, where the JSON contains:

{ "words": [{ "word": "Hello", "start": 0.0, "end": 0.2 }] }

Audio-driven generation (the “real” pipeline)

This mode is for when you want cue timing to come from the actual generated audio (plus explicit pauses), rather than hard-coded time: ranges.

babulus generate --dsl path/to/video.babulus.yml

Defaults (derived from the DSL filename <video>.babulus.yml):

  • script-out: src/videos/<video>/<video>.script.json
  • timeline-out: src/videos/<video>/<video>.timeline.json
  • audio-out: public/babulus/<video>.wav
  • out-dir: .babulus/out/<video>

If you have exactly one DSL under ./content/, you can omit --dsl entirely:

babulus generate

Idempotence / caching:

  • By default, generate reuses cached audio segments when the inputs are unchanged (so changing one word only regenerates the affected clip).
  • Use --fresh to force regeneration of everything.

Watch mode

Regenerate automatically when you edit the DSL (and ./.babulus/config.yml if present):

babulus generate --watch --dsl path/to/video.babulus.yml

Clean

Remove generated artifacts (script/timeline/audio, .babulus/out/, staged public/babulus/ files).

Dry-run (prints what would be deleted):

babulus clean

Actually delete:

babulus clean --yes

Config & Credentials

Babulus loads API credentials from config files in this order (unless BABULUS_PATH is set):

  1. ./.babulus/config.yml
  2. ~/.babulus/config.yml

If BABULUS_PATH is set, it will use:

  • $BABULUS_PATH if it points to a file
  • $BABULUS_PATH/config.yml if it points to a directory
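The lookup order above can be sketched as a small function (illustrative only, not Babulus's internal implementation; `exists` and `isDir` stand in for filesystem checks):

```typescript
// Sketch of the documented config lookup order. $BABULUS_PATH wins when set:
// a file is used directly, a directory gets "/config.yml" appended. Otherwise
// the project-local config is preferred over the home-directory one.
function resolveConfigPath(
  babulusPath: string | undefined,
  exists: (p: string) => boolean,
  isDir: (p: string) => boolean,
  home = "~",
): string | undefined {
  if (babulusPath) {
    return isDir(babulusPath) ? `${babulusPath}/config.yml` : babulusPath;
  }
  const candidates = ["./.babulus/config.yml", `${home}/.babulus/config.yml`];
  return candidates.find(exists); // first existing candidate wins
}
```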

Example config.yml shape:

providers:
  elevenlabs:
    api_key: "..."
    voice_id: "..."
  openai:
    api_key: "..."
  azure_speech:
    api_key: "..."
    region: "eastus"
  aws_polly:
    region: "us-east-1"
    voice_id: "Joanna"

Providers (TTS)

Set voiceover.provider in your .babulus.yml to one of:

  • dry-run (silent WAVs with estimated durations)
  • elevenlabs (TTS via ElevenLabs; segments are stored as MP3 and concatenated to your requested --audio-out)
  • openai (TTS via OpenAI, writes WAV)
  • aws-polly (TTS via AWS Polly, writes WAV by wrapping PCM)
  • azure-speech (TTS via Azure Cognitive Services Speech, writes WAV)

Credentials/config live in ./.babulus/config.yml or ~/.babulus/config.yml:

  • ElevenLabs: providers.elevenlabs.api_key, plus providers.elevenlabs.voice_id for TTS
  • OpenAI: providers.openai.api_key
  • Azure: providers.azure_speech.api_key + providers.azure_speech.region
  • AWS Polly: uses the standard AWS credential chain (env vars like AWS_ACCESS_KEY_ID, ~/.aws/credentials, SSO, etc.). Region/voice go in providers.aws_polly.

ElevenLabs pronunciation dictionaries

To fix pronunciation of project-specific words (like “Tactus”), you have two options:

Option A: Define lexemes in the DSL (recommended)

Put lexemes directly in the DSL, and Babulus will transparently create/update an ElevenLabs pronunciation dictionary in your workspace and attach it to every TTS request:

voiceover:
  provider: elevenlabs
  pronunciation_dictionary:
    name: tactus
  pronunciations:
    - lexeme:
        grapheme: "Tactus"
        alias: "tack-tus"

Notes:

  • The cloud dictionary is cached/tracked in .babulus/out/<video>/manifest.json so it only updates when lexemes change.
  • Babulus prepends the auto-managed dictionary to any explicitly listed dictionaries (max 3 total).

Option B: Reference an existing dictionary ID

Add a pronunciation dictionary in ElevenLabs yourself and reference it from the DSL:

voiceover:
  provider: elevenlabs
  pronunciation_dictionaries:
    - id: "pd_your_dictionary_id"
      version_id: null

This maps to ElevenLabs pronunciation_dictionary_locators on each TTS request (max 3 per request).

Pauses & Segments (Voiceover Authoring)

In generate mode, cue timing is computed from audio segment durations. You can also insert explicit pauses.

Delaying the start of a cue’s narration

If you want the voice to start later (while the scene is already on screen), put pause_seconds on the voice: mapping (or make the first segments[] item a pause).

scenes:
  - id: problem
    title: "Problem"
    cues:
      - id: problem
        label: "Problem"
        voice:
          pause_seconds: 2
          segments:
            - voice: "This line will start 2 seconds after the cue begins."

Important: voice.segments runs in order. A pause_seconds segment after a voice segment is a pause after speaking, not a delay before it.

You can also delay an individual voice segment by putting pause_seconds on that segment:

voice:
  segments:
    - voice: "First sentence."
    - voice: "Second sentence after a beat."
      pause_seconds: 0.5

Per-cue segments

Instead of a single voice: field, a cue can use segments: to split narration into smaller chunks and insert pauses:

scenes:
  - title: "Example"
    cues:
      - id: hook
        label: "Hook"
        voice:
          segments:
            - voice: "Tool-using agents are useful."
            - pause_seconds: 0.25
            - voice: "And dangerous."
              trim_end_sec: 0.12

Trimming breaths / tails

Some TTS voices add a little breath or tail at the end of a segment. You can trim that off:

voiceover:
  trim_end_seconds: 0.12

Or override per segment with trim_end_sec (legacy key) or trim_end_seconds (preferred).

Default pause between cues (with optional jitter)

You can set a default pause between cue items, optionally randomized (deterministically via seed):

voiceover:
  seed: 1337
  pause_between_items_seconds: 0.1
  pause_between_items_gaussian:
    mean_seconds: 0.12
    std_seconds: 0.05
    min_seconds: 0.02
    max_seconds: 0.35
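One way such a deterministic, clamped jitter could be computed is a seeded PRNG feeding a normal sample that is then clamped to the configured bounds. This is purely illustrative; Babulus's actual RNG and sampling may differ:

```typescript
// Tiny seeded PRNG (mulberry32) + Box-Muller transform, clamped to
// [minSec, maxSec]. Same seed → same pause sequence (deterministic jitter).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function gaussianPause(
  rand: () => number,
  meanSec: number,
  stdSec: number,
  minSec: number,
  maxSec: number,
): number {
  // Box-Muller: two uniform samples → one normally distributed sample
  const u1 = Math.max(rand(), Number.MIN_VALUE);
  const u2 = rand();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return Math.min(maxSec, Math.max(minSec, meanSec + stdSec * z));
}
```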

Multi-Track Audio (SFX / Music / Files)

Declare audio clips next to the cue or scene where they should play.

audio:
  sfx_provider: elevenlabs
  music_provider: elevenlabs
  library:
    whoosh:
      kind: sfx
      prompt: "Quick whoosh transition"
      duration_seconds: 3
      variants: 5

scenes:
  - id: problem
    title: "Problem"
    cues:
      - id: problem
        label: "Problem"
        voice: "..."
        audio:
          - use: whoosh
            at: "+0.0s"     # relative to this cue's start
            volume: 35%
            pick: 2         # per-use: choose variant

  - id: intro
    title: "Intro"
    # Optional: scene-level audio (relative to scene start)
    audio:
      # Generated background music (default duration: this scene’s duration)
      - kind: music
        id: bed
        prompt: "Warm ambient background music, minimal percussion, no vocals"
        volume: 20%
        # play_through: true     # extend to end of video
        # duration_seconds: 30   # override default duration
      # Or, reference an existing file under `public/`:
      - kind: file
        id: bed-file
        src: "music/bed.mp3"
        volume: 20%
    cues:
      - id: hook
        label: "Hook"
        voice: "..."

Key ideas:

  • audio: under a cue defaults to playing at the cue start; use at: "+0.2s" to offset.
  • Use audio.library + use: to reuse the same generated clip in multiple places (with independent pick, volume, pause_seconds).
  • Use explicit anchors if needed: at: "cue:<cueId>+0.2s" or at: "scene:<sceneId>+0.2s".
  • SFX supports variants + pick for auditioning options.
  • Music clips default to the current scene duration; set play_through: true to extend to the end of the video.
  • Any clip can fade its volume over time using fade_to / fade_out (default fade_duration_seconds: 2).
  • src for kind: file should be a path under Remotion’s public/ directory (so staticFile(src) works).
  • volume accepts either 0..1 (Remotion gain) or 0..100 / "80%" (percent).
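A normalizer for the accepted volume forms might look like this (illustrative sketch, not Babulus's internal code; edge-case handling may differ):

```typescript
// Normalize the documented volume forms to a Remotion gain in 0..1:
//   "80%" → 0.8 (percent string), 35 → 0.35 (number > 1 treated as percent),
//   0.2 → 0.2 (already a gain).
function normalizeVolume(v: number | string): number {
  if (typeof v === "string") {
    return parseFloat(v.replace(/%\s*$/, "")) / 100;
  }
  return v > 1 ? v / 100 : v;
}
```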

Volume fades example (clip-local seconds):

audio:
  music_provider: elevenlabs

scenes:
  - id: title
    title: "Title"
    audio:
      - kind: music
        id: bed
        prompt: "Ambient background music, no vocals"
        volume: 92%
        fade_to:
          volume: 50%
          after_seconds: 4
          # fade_duration_seconds: 4   # optional (default 2)
        fade_out:
          volume: 92%
          before_end_seconds: 4
          # fade_duration_seconds: 4   # optional (default 2)

What Babulus generates:

  • --timeline-out JSON includes audio.tracks[].clips[] with computed startSec.
  • For SFX variants, Babulus caches all candidates under --out-dir and (when --audio-out points into public/) stages the chosen SFX into public/babulus/sfx/<clipId>.wav and writes src: "babulus/sfx/<clipId>.wav" into the timeline so Remotion can play it.
  • For narration, when --audio-out points into public/, Babulus also stages each generated TTS segment under public/babulus/<video>/segments/ and emits them as separate kind: file clips (so you can see each utterance as its own audio item in Remotion).

ElevenLabs SFX integration:

  • Set audio.sfx_provider: elevenlabs in your .babulus.yml (or set audio.default_sfx_provider in ./.babulus/config.yml), and use kind: sfx clips with variants + pick.
  • Babulus caches variants under --out-dir and stages the chosen file under public/babulus/sfx/ so Remotion can play it.

Auditioning SFX variants (workflow)

SFX clips can generate multiple variants. Babulus keeps all variants cached under .babulus/out/<video>/sfx/.

To audition different variants without editing the DSL, use the selection file under .babulus/out/<video>/selections.json via the CLI:

babulus sfx next --clip whoosh --variants 8
babulus sfx prev --clip whoosh --variants 8
babulus sfx set --clip whoosh --pick 3

With babulus generate --watch, changing the pick will trigger a re-generate so Remotion updates the staged public/babulus/sfx/<clipId>.* file.

If you’re not using --watch, you can also apply the change immediately:

babulus sfx next --clip whoosh --variants 8 --apply

Archiving options you don’t want to see right now:

babulus sfx archive --clip whoosh --keep-pick
babulus sfx restore --clip whoosh
babulus sfx clear --clip whoosh

Remotion: The Two Mappings

1) scene.id → React scene component

You render a Sequence per scene using scene.startSec/endSec, then route by scene.id:

import React from "react";
import { Sequence, useVideoConfig } from "remotion";
import scriptJson from "./script.json";

const secondsToFrames = (sec: number, fps: number) => Math.round(sec * fps);

const SceneRouter: React.FC<{ scene: any }> = ({ scene }) => {
  switch (scene.id) {
    case "intro":
      return <IntroScene scene={scene} />;
    default:
      return null;
  }
};

export const MyVideo: React.FC = () => {
  const { fps } = useVideoConfig();
  return (
    <>
      {scriptJson.scenes.map((scene) => {
        const from = secondsToFrames(scene.startSec, fps);
        const to = secondsToFrames(scene.endSec, fps);
        return (
          <Sequence key={scene.id} from={from} durationInFrames={to - from}>
            <SceneRouter scene={scene} />
          </Sequence>
        );
      })}
    </>
  );
};

2) cue.id → element/animation timing

Inside a scene component, find the cue you care about and convert cue.startSec to a frame.

const cue = scene.cues.find((c) => c.id === "hook"); // <- from the DSL
if (!cue) return null;
const cueStartFrame = secondsToFrames(cue.startSec, fps);
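One wrinkle worth noting: since each scene renders inside its own <Sequence>, the frame counter inside the scene component restarts at 0, so cue frames need to be computed relative to the scene's startSec. A helper like this (illustrative, not part of Babulus) captures that:

```typescript
// Convert a cue's absolute seconds into frames relative to its scene,
// because inside a parent <Sequence> the current frame restarts at 0.
const secondsToFrames = (sec: number, fps: number) => Math.round(sec * fps);

function cueFrames(
  cue: { startSec: number; endSec: number },
  sceneStartSec: number,
  fps: number,
): { from: number; durationInFrames: number } {
  return {
    from: secondsToFrames(cue.startSec - sceneStartSec, fps),
    durationInFrames: secondsToFrames(cue.endSec - cue.startSec, fps),
  };
}
```

You can then pass the result straight to a nested <Sequence from={from} durationInFrames={durationInFrames}> around the element the cue controls.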

Audio (Typical)

If you generate a voiceover audio file, play it at the top-level:

import { Audio, staticFile } from "remotion";

<Audio src={staticFile("voiceover.mp3")} />;

Your script’s startSec/endSec should reference absolute seconds from the start of that audio track.

Audio cueing in Remotion (layered tracks)

Babulus generate writes an additional timeline.json which includes audio.tracks[] events (SFX/music/file clips).

In this repo, you can render them using src/babulus/AudioTimeline.tsx (it creates <Sequence><Audio/></Sequence> per clip).

Concrete Example (This Repo)

  • DSL (project-owned): content/intro.babulus.yml
  • Compiled JSON (generated): src/videos/intro/intro.script.json
  • Scene mapping (scene.id → React): src/videos/intro/IntroVideo.tsx
  • Cue timing usage (Solution cards): src/videos/intro/IntroVideo.tsx

Note: the YAML snippet above uses intro/hook as simple examples. In the actual Intro video DSL, the scene IDs are title, problem, solution, code, cta.
