Audio-Video Synchronization — frame-accurate clap-based sync for corpus recording.
-------------------------------------------------------------------------
Audio-Video Synchronization in Python
Copyright (C) 2026 Brigitte Bigi, CNRS
Laboratoire Parole et Langage, Aix-en-Provence, France
-------------------------------------------------------------------------
AViSS description
Overview
Use cases
You recorded a speaker with one or two cameras and one or two separate audio recorders. You used a clap to mark a synchronization point. Now you need all your media files trimmed and aligned to the exact same frame boundary — ready for phonetic analysis or corpus annotation.
AViSS is the tool you need.
Features
AViSS performs frame-accurate, clap-based synchronization of audio and video files for speech corpus recordings. It is designed for researchers who need reproducible, high-quality media preparation without manual editing.
Among other things, it provides:
- Frame-accurate video trimming via OpenCV / SPPAS
- Clap-based audio alignment (trim or pad to match the video frame boundary)
- Support for 1 or 2 audio files and 1 or 2 video files per session
- Optional video crop (x, y, w, h per video)
- Optional copyright overlay on video
- Optional video rotation (portrait mode)
- Optional mono 16 kHz WAV export
- Optional MP4 montage (H.264/AAC) for distribution
- Optional WebM montage (libvpx-vp9, two-pass) for web distribution
- Batch processing from a CSV file
- Fully configurable column names and output filename structure
How it works
AViSS is a faithful Python migration of the original montage scripts
(montage_step1.py / montage.py, B. Bigi, CNRS/LPL 2021-2024)
distributed with the CLeLfPC corpus (https://hdl.handle.net/11403/clelfpc).
The algorithm below is reproduced verbatim from those scripts.
Notation
| Symbol | Meaning |
|---|---|
| vc | video_clap + delay: effective clap time in the video (seconds) |
| fps | frame rate of the video (frames/second) |
| dur | expected output duration (seconds) |
Step 1 — clap frame (primary / reference video)
clap_frame_index = int(vc * fps) # floor, 0-based
clap_frame_time = clap_frame_index / fps
clap_delta = vc - clap_frame_time # sub-frame offset, in [0, 1/fps)
Step 2 — end frame (first excluded frame)
end_frame_index = 1 + int((vc + dur) * fps)
end_frame_time = end_frame_index / fps
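Steps 1 and 2 can be tried out in isolation with the minimal sketch below. The function name `clap_and_end_frames` and the returned dictionary are illustrative assumptions, not part of the AViSS API; only the formulas come from the text above.

```python
def clap_and_end_frames(vc: float, fps: float, dur: float) -> dict:
    """Apply the Step 1 and Step 2 formulas to one video."""
    # Step 1: clap frame (floor, 0-based) and its sub-frame offset.
    clap_frame_index = int(vc * fps)
    clap_frame_time = clap_frame_index / fps
    clap_delta = vc - clap_frame_time  # in [0, 1/fps)

    # Step 2: first excluded frame.
    end_frame_index = 1 + int((vc + dur) * fps)
    end_frame_time = end_frame_index / fps

    return {
        "clap_frame_index": clap_frame_index,
        "clap_frame_time": clap_frame_time,
        "clap_delta": clap_delta,
        "end_frame_index": end_frame_index,
        "end_frame_time": end_frame_time,
    }


# video_clap = 6.410 s, delay = 0.200 s, 50 fps, 10 s of output (made-up duration)
frames = clap_and_end_frames(vc=6.41 + 0.2, fps=50.0, dur=10.0)
print(frames)
```

With these values the clap falls in frame 330 (6.6 s), about 10 ms before the effective clap time, and frame 831 is the first one excluded.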
Step 3 — cross-sync (secondary video, fps2 ≠ fps_ref)
When two cameras have different frame rates, the reference camera is the
one with the lowest fps. Its clap_delta is propagated to the secondary
camera so both outputs share the same sub-frame offset at the clap.
shift_frames = int(reference_delta * fps2)
clap_frame_index = int(vc2 * fps2) - shift_frames
end_frame_index = 1 + int((vc2 + dur) * fps2) + shift_frames
Note: the formula uses int(A*fps) - int(d*fps), not int((A-d)*fps).
These can differ by 1 frame when frac(A*fps) < frac(d*fps).
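The one-frame difference mentioned in the note is easy to reproduce. The sketch below applies the Step 3 formulas as written; the function name `cross_sync_frames` and the sample values are assumptions chosen so that frac(vc2 * fps2) = 0.25 is smaller than frac(reference_delta * fps2) = 0.5.

```python
def cross_sync_frames(vc2: float, fps2: float, dur: float, reference_delta: float):
    """Apply the Step 3 formulas to the secondary video."""
    shift_frames = int(reference_delta * fps2)
    clap_frame_index = int(vc2 * fps2) - shift_frames
    end_frame_index = 1 + int((vc2 + dur) * fps2) + shift_frames
    return shift_frames, clap_frame_index, end_frame_index


vc2, fps2, reference_delta = 3.205, 50.0, 0.03
shift, clap_idx, end_idx = cross_sync_frames(vc2, fps2, 10.0, reference_delta)

# The alternative formula that the note warns against.
naive_clap_idx = int((vc2 - reference_delta) * fps2)

print(shift, clap_idx, end_idx, naive_clap_idx)
```

Here int(160.25) - int(1.5) gives frame 159, while int(158.75) gives frame 158.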
Audio alignment (per audio file)
Pass 1: shift audio so its effective clap (audio_clap + delay)
matches vc — trim from the start or prepend silence.
Pass 2: pad with silence or trim the end to reach end_frame_time.
Pass 3: trim clap_frame_time from the start.
The output audio starts at the clap frame boundary, preserving the
clap_delta sub-frame offset between the clap and the first sample.
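The three passes can be illustrated on a plain list of samples. This is only a sketch under stated assumptions: `align_audio` is a hypothetical helper, times are converted to sample counts with `round()`, and silence is represented by zeros. The actual AViSS implementation relies on external tools and may round differently.

```python
def align_audio(samples, sample_rate, audio_clap, vc,
                clap_frame_time, end_frame_time):
    """Align one audio track to the video timeline, then cut at the clap frame.

    audio_clap is the effective clap time in the audio (audio_clap + delay).
    """
    # Pass 1: make the audio clap coincide with vc.
    offset = round((vc - audio_clap) * sample_rate)
    if offset >= 0:
        aligned = [0] * offset + list(samples)   # prepend silence
    else:
        aligned = list(samples[-offset:])        # trim from the start

    # Pass 2: pad with silence or trim the end to reach end_frame_time.
    target = round(end_frame_time * sample_rate)
    if len(aligned) < target:
        aligned += [0] * (target - len(aligned))
    else:
        aligned = aligned[:target]

    # Pass 3: trim clap_frame_time from the start.
    return aligned[round(clap_frame_time * sample_rate):]


# Toy signal: 3 s at 100 Hz, each sample holds its own index.
samples = list(range(300))
padded = align_audio(samples, 100, audio_clap=0.5, vc=1.21,
                     clap_frame_time=1.2, end_frame_time=2.22)
trimmed = align_audio(samples, 100, audio_clap=2.0, vc=1.21,
                      clap_frame_time=1.2, end_frame_time=2.22)
print(len(padded), padded[:3], len(trimmed), trimmed[:3])
```

In both cases the clap sample (index 50, then index 200) ends up one sample after the start of the output, that is 10 ms at 100 Hz, which equals clap_delta for vc = 1.21 s at 50 fps.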
Audio output files
Two files are produced per audio input:
- <stem>-audio.wav: synchronized, original format (sample rate and channel count preserved). Used for montage.
- <stem>.wav: mono 16 kHz WAV. If the input has more than one channel, all channels are mixed down to mono (average).
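The mono mixdown by averaging can be pictured as follows. This is a sketch only: `mix_to_mono` is a hypothetical helper working on integer samples, it uses floor division for the average, and it leaves out the resampling to 16 kHz.

```python
def mix_to_mono(frames):
    """Average the channels of each frame.

    frames: list of tuples, one integer sample per channel.
    """
    return [sum(frame) // len(frame) for frame in frames]


stereo = [(100, 300), (-200, 0), (7, 7)]
mono = mix_to_mono(stereo)
print(mono)
```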
Scientific context
AViSS was developed at the Laboratoire Parole et Langage (LPL), CNRS, Aix-en-Provence, France, for the preparation of speech corpora used in phonetic research, including cued speech and read speech corpora.
Install AViSS
Requirements
The following external programs must be installed and available in the PATH:
- ffmpeg: video and audio processing
- sox: audio processing
From PyPI
> python -m pip install aviss
From its wheel package
Download the wheel file (aviss-xxx.whl) and install it with:
> python -m pip install aviss-xxx.whl
From the repository
Download or clone the repository, then install in editable mode:
> git clone https://github.com/brigitte-bigi/AViSS.git
> cd AViSS
> python -m pip install -e .
AViSS content
The AViSS package includes the following folders and files:
- aviss/: the source code of the API
- aviss/core/: pipeline, synchronization logic, audio and video operations
- scripts/: ready-to-use scripts for common workflows
- tests/: unit tests
- docs/: code documentation
- pyproject.toml: package configuration
Quick start
Prepare the CSV file
The input CSV file describes one recording session per row. The first row
is the header. Columns are separated by ; (or ,).
The following columns are required (names are configurable via settings_user.toml):
| Column | Description |
|---|---|
| audio_file | relative path to the audio file |
| audio_clap | clap time in the audio (MM:SS.mmm) |
| video_file | relative path to the video file |
| video_clap | clap time in the video (MM:SS.mmm) |
| delay | offset after the clap before cutting (seconds) |
| duration | expected output duration (MM:SS.mmm) |
Optional columns for crop, other media files, and output filename metadata
are described in the sync section of Customizing settings.
Example:
ID;avSession;Serie;audio_file;video_file;audio_clap;video_clap;delay;duration
spk1;9;2;audio/RME_0038.wav;video/MVI_0038.MP4;00:03.843;00:06.410;0.200;04:08.250
spk2;8;1;audio/RME_0035.wav;video/MVI_0035.MP4;00:04.787;00:09.995;6.230;02:57.000
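To check a CSV file before running AViSS, the example above can be read with the standard library. The helper `clock_to_seconds` is an assumption made for this sketch; it is not the AViSS parser, which is provided by avCsvReader.

```python
import csv
import io


def clock_to_seconds(text: str) -> float:
    """Convert a MM:SS.mmm string into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


EXAMPLE = """\
ID;avSession;Serie;audio_file;video_file;audio_clap;video_clap;delay;duration
spk1;9;2;audio/RME_0038.wav;video/MVI_0038.MP4;00:03.843;00:06.410;0.200;04:08.250
spk2;8;1;audio/RME_0035.wav;video/MVI_0035.MP4;00:04.787;00:09.995;6.230;02:57.000
"""

rows = list(csv.DictReader(io.StringIO(EXAMPLE), delimiter=";"))
first = rows[0]

# Effective clap time in the video (vc) and expected duration, in seconds.
vc = clock_to_seconds(first["video_clap"]) + float(first["delay"])
dur = clock_to_seconds(first["duration"])
print(first["ID"], vc, dur)
```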
Command-line usage
Synchronize one row (-l N = Nth data row, header excluded):
> aviss sync -c corpus/sessions.csv -l 1
Synchronize all rows:
> aviss sync -c corpus/sessions.csv
Synchronize and produce a distribution MP4:
> aviss sync -c corpus/sessions.csv -l 1 --montage
Synchronize and produce a WebM for web distribution:
> aviss sync -c corpus/sessions.csv -l 1 --webm
Print the full processing report:
> aviss sync -c corpus/sessions.csv -l 1 --verbose
Python API usage
from aviss import avCsvReader, avPipeline, avExporter
# Parse one row from the CSV
reader = avCsvReader("corpus/sessions.csv")
session = reader.read_row(1)
# Run the synchronization pipeline
pipeline = avPipeline(session)
result = pipeline.run()
if result.success is True:
    exporter = avExporter(result,
                          stem="spk1_S09_s2",
                          work_dir="spk1_S09_s2")
    exporter.montage()
# Process all rows
sessions = avCsvReader("corpus/sessions.csv").read()
for session in sessions:
    result = avPipeline(session).run()
    if result.success is False:
        print(session, result.report)
Customizing settings
Place a settings_user.toml file in the same directory as your CSV file,
then override only what you need:
[output]
crf = 14
video_fps = 25.0
copyright = "Copyright (C) 2026 CNRS | LPL"
# Rotation — one integer per video in order.
# -1 = no rotation · 0 = CCW+vflip · 1 = CW · 2 = CCW portrait · 3 = CW+vflip
rotate = [2] # single video, portrait CCW
# rotate = [-1, 2] # two videos: front=none, side=CCW portrait
[[output.name_cols]]
col = "ID"
prefix = ""
fmt = ""
[[output.name_cols]]
col = "avSession"
prefix = "S"
fmt = "02d"
[[output.name_cols]]
col = "SerieLabel"
prefix = ""
fmt = ""
[sync]
col_audio_file = "my_audio"
settings_user.toml is loaded automatically from the CSV directory at sync time.
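One way to picture "override only what you need" is a recursive merge of the user file over the defaults. The sketch below is an assumption about that behaviour, not the AViSS loader itself; `merge_settings` is a hypothetical helper, and the `user` dictionary stands for what `tomllib.load()` (Python 3.11+) would return for a small settings_user.toml.

```python
from copy import deepcopy


def merge_settings(defaults: dict, user: dict) -> dict:
    """Return defaults updated with user values, table by table."""
    merged = deepcopy(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)  # nested table
        else:
            merged[key] = deepcopy(value)  # scalar or list: replaced as a whole
    return merged


defaults = {
    "output": {"crf": 18, "video_fps": 50.0, "output_sep": "_"},
    "sync": {"col_audio_file": "audio_file", "col_delay": "delay"},
}
user = {
    "output": {"crf": 14, "video_fps": 25.0},
    "sync": {"col_audio_file": "my_audio"},
}
settings = merge_settings(defaults, user)
print(settings)
```

Keys absent from the user file, such as output_sep or col_delay, keep their default values.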
output keys
| Key | Default | Description |
|---|---|---|
| crf | 18 | Video encoding quality (H.264 CRF). Lower = better quality, larger file. Range: 0–51. |
| video_fps | 50.0 | Native frame rate of the recording camera (frames per second). |
| copyright | (none) | Text overlaid on the video (bottom-left). Use \\: to escape colons (ffmpeg). |
| rotate | (none) | Per-video transpose list. See values below. |
| output_sep | "_" | Separator between tokens in the output filename. |
| work_dir_suffix | "" | Suffix appended to the working directory name. |
Rotate values (one integer per video, in order — -1 = no rotation):
| Value | Effect |
|---|---|
| -1 | No rotation |
| 0 | 90° counter-clockwise + vertical flip |
| 1 | 90° clockwise |
| 2 | 90° counter-clockwise (portrait mode) |
| 3 | 90° clockwise + vertical flip |
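The values 0 to 3 coincide with the direction codes of the ffmpeg `transpose` video filter. Assuming AViSS passes them through to that filter (this is not confirmed by the text above), a rotate list could be turned into filter strings as in this sketch; `transpose_filters` is a hypothetical helper.

```python
def transpose_filters(rotate):
    """Map each rotate value to an ffmpeg filter string, or None for -1."""
    filters = []
    for value in rotate:
        if value == -1:
            filters.append(None)                    # no rotation
        elif value in (0, 1, 2, 3):
            filters.append(f"transpose={value}")    # ffmpeg transpose direction
        else:
            raise ValueError(f"unsupported rotate value: {value}")
    return filters


# Two videos: front camera unchanged, side camera rotated to portrait.
print(transpose_filters([-1, 2]))
```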
sync keys
| Key | Default | Description |
|---|---|---|
| col_audio_file | "audio_file" | CSV column name for the audio file path. |
| col_audio_clap | "audio_clap" | CSV column name for the audio clap time. |
| col_video_file | "video_file" | CSV column name for the video file path. |
| col_video_clap | "video_clap" | CSV column name for the video clap time. |
| col_video_name | "video_name" | CSV column name for the optional video label (used in output filename suffix). |
| col_video_crop_x | "video_crop_x" | CSV column name for the crop left edge (pixels). |
| col_video_crop_y | "video_crop_y" | CSV column name for the crop top edge (pixels). |
| col_video_crop_w | "video_crop_w" | CSV column name for the crop width (pixels). |
| col_video_crop_h | "video_crop_h" | CSV column name for the crop height (pixels). |
| col_delay | "delay" | CSV column name for the delay after the clap (seconds). |
| col_duration | "duration" | CSV column name for the expected output duration. |
output.name_cols format
Each [[output.name_cols]] entry defines one token in the output filename:
| Key | Type | Description |
|---|---|---|
| col | str | CSV column header whose value is used |
| prefix | str | String prepended to the value ("S", "T", "" for none) |
| fmt | str | "" → raw string · "02d" → zero-padded integer · "d" → plain integer |
Tokens are joined with output_sep (default "_").
A column whose cell is empty in the CSV is silently skipped.
Example: with col = "avSession", prefix = "S", fmt = "02d" and cell value 9, the token is S09.
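The token rules above can be sketched in a few lines. `build_stem` is a hypothetical helper written for illustration; it assumes that any non-empty fmt is an integer format specification, as in the examples of the table.

```python
def build_stem(row: dict, name_cols: list, sep: str = "_") -> str:
    """Build an output filename stem from one CSV row."""
    tokens = []
    for spec in name_cols:
        cell = (row.get(spec["col"]) or "").strip()
        if not cell:
            continue  # empty cell: the token is skipped
        fmt = spec.get("fmt", "")
        value = format(int(cell), fmt) if fmt else cell
        tokens.append(spec.get("prefix", "") + value)
    return sep.join(tokens)


name_cols = [
    {"col": "ID", "prefix": "", "fmt": ""},
    {"col": "avSession", "prefix": "S", "fmt": "02d"},
    {"col": "SerieLabel", "prefix": "", "fmt": ""},
]
print(build_stem({"ID": "spk1", "avSession": "9", "SerieLabel": "s2"}, name_cols))
print(build_stem({"ID": "spk1", "avSession": "9", "SerieLabel": ""}, name_cols))
```

The first call gives spk1_S09_s2, the second gives spk1_S09 because the empty SerieLabel cell is skipped.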
Test the source code
Install the optional test dependencies:
> python -m pip install ".[dev]"
Unit tests
Run the unit test suite with coverage (requires coverage, included in the
virtual environment):
> .venv/bin/python -m coverage run -m unittest discover -s tests -p "test_*.py" \
&& .venv/bin/python -m coverage report -m
Expected overall coverage: ≥ 68 %.
If coverage is not installed, run the tests without it:
> .venv/bin/python -m unittest discover -s tests -p "test_*.py"
Integration test
The integration test uses synthetic media files built from the demo files
shipped in tests/demo/.
Generate test data
bash make_test_data.sh [demo_dir] [output_dir] [n_videos] [n_audios]
| Argument | Default | Description |
|---|---|---|
| demo_dir | demo | Directory containing demo.mp4 and demo.wav |
| output_dir | data | Directory where test files are written |
| n_videos | 1 | Number of video files to generate |
| n_audios | 1 | Number of audio files to generate |
Each generated video/audio file contains random silence/black before and after the content so that every run exercises a different synchronization offset.
Single video + single audio (default):
> cd tests && bash make_test_data.sh && cd ..
Writes tests/data/test_audio.wav, tests/data/test_video.mp4 and
tests/data/test.csv.
Two videos + one audio:
> cd tests && bash make_test_data.sh demo data 2 1 && cd ..
Writes test_video.mp4, test_video2.mp4, test_audio.wav and a CSV
with columns video_file, video_file2.
Two videos + two audios:
> cd tests && bash make_test_data.sh demo data 2 2 && cd ..
Then run the pipeline on the first CSV row:
> .venv/bin/python main.py sync -c tests/data/test.csv -l 1 --verbose
Expected output — audio (ffprobe tests/data/demo_S01/demo_S01.wav):
Duration: 00:00:10.47, bitrate: 256 kb/s
Stream #0:0: Audio: pcm_s16le, 16000 Hz, 1 channels, s16, 256 kb/s
Expected output — video (ffprobe tests/data/demo_S01/demo_S01.mkv):
Duration: 00:00:10.47, start: 0.000000, bitrate: 3288 kb/s
Both files must have the same duration as tests/demo/demo.mp4.
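The duration check can be automated by parsing the Duration field printed by ffprobe. The sketch below works on the text shown above rather than calling ffprobe; `ffprobe_duration` is a hypothetical helper, and the 10 ms tolerance is an arbitrary choice that matches the two-decimal precision of the output.

```python
import re

_DURATION = re.compile(r"Duration:\s*(\d+):(\d{2}):(\d{2}(?:\.\d+)?)")


def ffprobe_duration(text: str) -> float:
    """Return the duration in seconds found in ffprobe's text output."""
    match = _DURATION.search(text)
    if match is None:
        raise ValueError("no Duration field found")
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


audio = ffprobe_duration("  Duration: 00:00:10.47, bitrate: 256 kb/s")
video = ffprobe_duration("  Duration: 00:00:10.47, start: 0.000000, bitrate: 3288 kb/s")
same = abs(audio - video) < 0.01
print(audio, video, same)
```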
Scripts
mix_mono.py — mix two mono audio files
Combines two mono WAV files into a single mono WAV by averaging both channels. Useful when two microphones recorded the same speaker and the result must be a single audio file before synchronization.
> python scripts/mix_mono.py audio1.wav audio2.wav output.wav
| Argument | Description |
|---|---|
| audio1 | First mono WAV file |
| audio2 | Second mono WAV file |
| output | Output mono WAV file (must not already exist) |
Requires sox. Both input files must be mono WAV at the same sample rate.
extract_audio.py — extract audio from a video
Extracts the audio track of a video file and converts it to mono WAV at 48 kHz, 16-bit PCM.
> python scripts/extract_audio.py video.mp4
| Argument | Description |
|---|---|
| video | Input video file |
The output file is written next to the input video with a .wav extension.
Requires ffmpeg.
mp4_to_webm.py — convert MP4 to WebM
Converts an MP4 video to WebM (libvpx-vp9, two-pass encoding, CRF 16). An optional audio file can replace the video's audio track in the output.
> python scripts/mp4_to_webm.py video.mp4 [audio.wav]
| Argument | Description |
|---|---|
| video | Input MP4 file |
| audio | Optional audio file to mux into the output |
The output file is written next to the input video with a .webm extension.
Requires ffmpeg.
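For readers who want to see what a two-pass libvpx-vp9 encoding looks like, the sketch below builds the two ffmpeg command lines without running them. It follows the usual constant-quality recipe (`-b:v 0 -crf N`, then `-pass 1` and `-pass 2`); the function name, the Opus audio codec and the stream mapping are assumptions, and the actual script may use different options.

```python
import os


def webm_two_pass_commands(video: str, audio: str = None, crf: int = 16):
    """Return the (first pass, second pass) ffmpeg argument lists."""
    output = os.path.splitext(video)[0] + ".webm"
    encode = ["-c:v", "libvpx-vp9", "-b:v", "0", "-crf", str(crf)]

    # Pass 1: analysis only, no audio, output discarded.
    first = ["ffmpeg", "-y", "-i", video] + encode
    first += ["-pass", "1", "-an", "-f", "null", os.devnull]

    # Pass 2: real encoding, optionally with a replacement audio track.
    second = ["ffmpeg", "-y", "-i", video]
    if audio is not None:
        second += ["-i", audio, "-map", "0:v:0", "-map", "1:a:0"]
    second += encode + ["-pass", "2", "-c:a", "libopus", output]
    return first, second


first, second = webm_two_pass_commands("video.mp4")
_, second_with_audio = webm_two_pass_commands("video.mp4", "audio.wav")
print(" ".join(first))
print(" ".join(second_with_audio))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once ffmpeg is available in the PATH.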
Projects using AViSS
AViSS was developed at LPL, CNRS, to prepare the CLeLfPC corpus (Corpus de Lecture en Langue Française Parlée Complétée). This work is carried out in the framework of the AutoCuedSpeech project, which partially funded AViSS development.
Contact the author if you want to add a project here.
Help / How to contribute
If you want to report a bug or suggest a feature, please send an e-mail to the author. Any and all constructive comments are welcome.
If you plan to contribute to the code, please read and agree to both the code of conduct and the code style guide.
AViSS Documentation
Documentation is generated from the source code using ClammingPy: https://github.com/brigitte-bigi/ClammingPy
To generate the documentation locally:
> python -m pip install ClammingPy
> python makedoc.py
License/Copyright
See the accompanying LICENSE file for the license terms and AUTHORS.md for the full list of contributors.
Copyright (C) 2026 Brigitte Bigi, CNRS Laboratoire Parole et Langage, Aix-en-Provence, France
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/.
Changes
- Version 1.0:
    - Initial version. Faithful Python migration of the original montage scripts (B. Bigi, CNRS/LPL 2021-2024) distributed with CLeLfPC.
    - Frame-accurate, clap-based synchronization of audio and video files.
    - Support for any number of audio and video files per session.
    - Optional video crop, copyright overlay, rotation (portrait mode).
    - Mono 16 kHz WAV output (all channels mixed down).
    - Optional MP4 montage (H.264/AAC) and WebM montage (libvpx-vp9, two-pass).
    - Batch processing from a CSV file.
    - Fully configurable column names and output filename structure.