Skip to main content

Music lyrics transcription tool using AI

Project description

🎵 Kamasi

Kamasi is a local automated music lyric transcription tool. It leverages state-of-the-art AI models to isolate vocals, transcribe them with high precision, and refine the final text using local LLMs.


🚀 Features

  • Vocal Separation: Isolate vocals from background music using Demucs to ensure the highest transcription accuracy.
  • High-Speed Transcription: Powered by Faster-Whisper for efficient and precise speech-to-text conversion.
  • LLM Refinement: Automatically fix punctuation, formatting, and transcription hallucinations using Ollama.
  • Privacy-First (100% Local): All processing happens on your machine. No data or audio files are ever uploaded to the cloud.
  • YAML Driven: Fully customizable workflow via a simple config.yaml file.

🛠 Installation

This project uses uv for lightning-fast Python dependency management.

  1. Clone the repository:
git clone https://codeberg.org/ley0x/kamasi.git
cd kamasi
  1. Install dependencies:
uv sync
  1. External Requirements:
  • FFmpeg: Required for audio processing and conversion.
  • Ollama: Required if you enable the LLM refinement stage.

⚙️ Configuration

Create or edit the config.yaml file in the root directory to set your preferences:

# Note: The input file is now specified as a command-line argument
# Usage: uv run kamasi <audio_file.mp3>

audio:
  separate_vocals: true  # Activer/Désactiver Demucs
  device: "cuda"         # "cuda" pour GPU NVIDIA ou "cpu"
  model: "htdemucs"      # htdemucs, htdemucs_ft, mdx_extra

transcription:
  model_size: "tiny"     # tiny, base, small, medium, large-v3
  language: "fr"         # Code langue (fr, en, etc.) ou null pour auto-detect
  compute_type: "int8"   # float16 pour GPU, int8 pour CPU

refinement:
  enabled: true
  ollama_url: "http://localhost:11434"
  model_name: "mistral-nemo:12b"
  prompt: |
    Write here your instruction for the LLM to follow.

📖 Usage

Basic Usage

Transcribe an audio file using the default config:

uv run kamasi audio.mp3
uv run kamasi "path/to/song.mp3"

CLI Options

Specify a custom configuration file:

uv run kamasi --config custom.yaml audio.mp3
uv run kamasi -c custom.yaml audio.mp3

Enable verbose logging (show detailed progress):

uv run kamasi --verbose audio.mp3
uv run kamasi -v audio.mp3

Show version information:

uv run kamasi --version

List available Ollama models:

uv run kamasi --list-models
uv run kamasi --list-models -c custom.yaml

Combine options:

uv run kamasi -v -c my-config.yaml audio.mp3

View help:

uv run kamasi --help

Batch Processing

Process multiple files:

for file in input/*.mp3; do
  uv run kamasi "$file"
done

The final lyrics will be saved as a .txt file in the current directory.

Note: By default, only errors are shown. Use --verbose / -v to see detailed progress including INFO and DEBUG messages.


📝 Recommendations

Some audio separation models recommandations:

  1. htdemucs (current choice) ✅ - Best overall choice
    • Fast processing
    • Excellent vocal separation quality
    • Well-balanced
  2. htdemucs_ft - If you want maximum quality and don't mind 4x slower processing
  3. mdx_extra - Alternative if htdemucs doesn't work well for your music style

Some ollama models recommandations:

  • Llama 3.1 (8B): All types of standard corrections.
  • Mistral-Nemo (12B): French language, French songs, rap (slang).
  • Qwen 2.5 (14B): Long texts, strict formatting required.
  • Gemma 2 (9B): Creativity, poetic or abstract texts.

Some prompt examples:

  • French songs:
**Rôle** : Tu es un expert en édition musicale et correction de paroles. Voici un texte brut issu d'une transcription automatique. Il contient des erreurs phonétiques (homophones) et de ponctuation.
Tes instructions :
    - Corrige l'orthographe et la grammaire.
    - Déduis les mots mal transcrits en te basant sur le contexte de la phrase et les rimes probables.
    - Ajoute la ponctuation et respecte les sauts de ligne (format paroles de chanson).
    - **IMPORTANT** : Ne réécris pas le style et n'invente pas de nouvelles phrases. Reste fidèle à l'audio original supposé.

Ne donne aucune explication, sors uniquement le texte corrigé.
Texte à corriger :
  • French songs (rap / urban):
**Rôle** :Tu es un expert de la culture Hip-hop, de l'argot urbain (français/anglais) et un éditeur de paroles professionnel.
**Contexte** : Voici une transcription brute d'un morceau de rap générée par une IA. Elle contient des erreurs phonétiques, rate souvent les mots d'argot ou les noms propres, et la mise en page est inexistante.

Tes instructions :
    - **Correction Phonétique Intelligente** : Corrige les mots mal entendus en te basant sur le contexte "Street" et la rime. (Exemple : Si l'audio transcrit "le ter-ter", ne corrige pas en "la terre", garde "le ter-ter").
    - **Respect de la Langue** : NE CORRIGE PAS la grammaire si c'est une faute volontaire de style (ex: "J'ai pas" au lieu de "Je n'ai pas", "C'est nous les meilleurs" au lieu de "Ce sont nous..."). Laisse le verlan et l'argot tels quels.
    - **Structure et Flow** : Formate le texte pour refléter le rythme. Fais des sauts de ligne courts. Essaie d'identifier et de marquer les sections : [Couplet], [Refrain], [Pont], [Outro].
    - **Ad-libs** : Si tu repères des interjections d'ambiance (Yeah, Han, Skrt), mets-les entre parenthèses ou en fin de ligne, ou supprime-les si elles nuisent à la lecture.
    - **Noms Propres** : Sois vigilant sur les noms de rappeurs, de marques, de villes ou de quartiers souvent cités dans le rap.

Contraintes :
    - Ne donne aucune explication avant ou après le texte.
    - Ne censure pas les vulgarités.
    - Affiche uniquement les paroles finales formatées.

Texte brut à traiter :
  • English songs:
Role: You are an expert music editor and lyrics corrector.

Context: Below is raw text generated by an automatic audio transcription tool. It contains phonetic errors (homophones), missing punctuation, and lacks formatting.

Your Instructions:

    Correction: Fix spelling and grammar errors caused by the transcription software.

    Contextual Deduction: Correct mistranscribed words based on the context of the sentence and probable rhymes.

    Formatting: Add proper punctuation and capitalization. Structure the text as song lyrics (short lines, stanzas, spacing between verses).

    Fidelity: IMPORTANT: Do not rewrite the style or change the meaning. Do not invent new lines. Stick as close to the phonetic audio as possible while making it readable.

    Output: Provide only the corrected lyrics. Do not add any conversational text or explanations.

Raw Text to Process:
  • English songs (rap / urban):
**Role**: You are an expert in Hip-Hop culture, AAVE (African American Vernacular English), urban slang, and a professional lyrics editor.
**Context**: Below is a raw transcription of a Rap song. The audio tool struggled with the speed, slang, and flow.

Your Instructions:
    - **Smart Phonetic Correction**: Fix words that were misheard based on "Street" context and rhyme schemes. (e.g., if the text says "trap house," do not change it to "trap mouse").
    - **Respect the Dialect**: DO NOT "fix" the grammar if it is intentional slang or AAVE (e.g., keep "I ain't got no" instead of changing it to "I do not have any"). Preserve contractions and street vernacular.
    - **Structure & Flow**: Format the text to reflect the rhythm (bars). Use short line breaks. Try to identify and label sections if clear: [Verse], [Chorus], [Bridge], [Outro].
    **Ad-libs**: If you detect background ad-libs (Yeah, Uh, Skrt), place them in parentheses (Yeah) or at the end of lines.
    - **Cultural Accuracy**: Be vigilant with proper nouns—names of rappers, luxury brands, cities, or specific neighborhoods often mentioned in Rap.

Constraints:
    - Do not provide any introductory text or summary.
    - Do not censor explicit language or profanity.
    - Output only the final formatted lyrics.

Raw Text to Process:

Recommanded AI temperatures: between 0.2 and 0.3.


🛠 Project Structure

The project follows a functional programming approach for clarity and modularity:

  • audio_processing.py: Vocal separation and audio cleaning.
  • transcription.py: Whisper engine logic.
  • llm_refinement.py: Ollama API integration.
  • config_loader.py: YAML settings management.

🧑‍💻 Development

Code Quality Tools

This project uses Ruff for linting and code formatting to maintain consistent code quality.

Install pre-commit hooks:

uvx pre-commit install

Run linting with auto-fix:

uv run ruff check --fix .

Format code:

uv run ruff format .

The pre-commit hook will automatically run ruff checks before each commit. If you need to bypass the hook temporarily, use git commit --no-verify (not recommended).


⚖️ License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kamasi-0.1.1.tar.gz (60.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

kamasi-0.1.1-py3-none-any.whl (14.1 kB view details)

Uploaded Python 3

File details

Details for the file kamasi-0.1.1.tar.gz.

File metadata

  • Download URL: kamasi-0.1.1.tar.gz
  • Upload date:
  • Size: 60.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.3

File hashes

Hashes for kamasi-0.1.1.tar.gz
Algorithm Hash digest
SHA256 fb29d469088894e7af2c022ad0067cbe9a3789e406c162e43b0a3d0c3ceb332d
MD5 5c8555090b86f639ba50b4b056285d59
BLAKE2b-256 9c4a8f518386fc6d180dbb144cf46a5efe7ca75728dee3d10275d6a9a2ab82d9

See more details on using hashes here.

File details

Details for the file kamasi-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: kamasi-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 14.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.3

File hashes

Hashes for kamasi-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 12ca4c17d61e29e2099608a18afaac444cbc7182b4459fd5e5af29d2ad5b71db
MD5 d7228aecbae1d55bb28252a589b322ec
BLAKE2b-256 69c8a170b81c0e56e802023e0324b97b1f487de9dbe56f544e2923e9742d2a0f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page