Library and CLI for text anonymization plus audio/video transcription with diarization

Anonim Video Text Library

Standalone project for two separate workflows:

  • text anonymization for JSON/JSONL/CSV/Markdown/TXT with a persistent people.json dictionary
  • audio/video transcription with diarization

Project layout

  • src/anonim_video_text_library/ - importable Python package
  • text_anonim/ - default runtime workspace for the anonymizer
  • examples/Anonimizez_example/ - self-contained anonymizer example
  • examples/Transcibator_example/ - self-contained transcription example
  • main.py - local wrapper for the transcription CLI
  • gpu_backends/ - helper scripts for GPU transcription backends
  • whisper.cpp/ - local checkout of whisper.cpp

Installation

cd Anonim_video_text_Library
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

This installs the dependencies for both workflows, including torch, transformers, faster-whisper, pyannote.audio, and imageio-ffmpeg.

Run the anonymizer separately

CLI entrypoints:

python3 -m anonim_video_text_library --help
python3 text_anonim/anonimizer.py --help

Self-contained example:

cd examples/Anonimizez_example
python3 run_anonymizer_example.py

What is inside examples/Anonimizez_example/:

  • .env and .env.example for settings
  • input/ for source json/jsonl/csv/md/txt files
  • output/ for anonymized copies
  • runtime_root/files/pii/ for people.json and blocklists
  • run_anonymizer_example.py as the runner

The example writes the anonymized files to output/ instead of only printing the anonymized content to the terminal.

Run the transcriber separately

CLI entrypoints:

anonim-video-text-transcribe --help
python3 main.py --help

Self-contained example:

cd examples/Transcibator_example
python3 run_transcriber_example.py

What is inside examples/Transcibator_example/:

  • .env and .env.example for settings
  • input/ for media files
  • output/ for generated transcripts
  • run_transcriber_example.py as the runner

If you need diarization, set HF_TOKEN in .env or in your shell.
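A minimal sketch of checking for the token before starting a diarization run. It assumes the .env file uses plain KEY=VALUE lines and that a value set in the shell takes precedence over the file; both are assumptions, and the helper names read_env_file and find_hf_token are hypothetical, not part of the library.

```python
from __future__ import annotations

import os
from pathlib import Path


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    if not path.is_file():
        return values
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_hf_token(env_path: Path = Path(".env")) -> str | None:
    """Return HF_TOKEN from the shell environment, else from the .env file."""
    # Assumption: the shell value wins over the .env value.
    token = os.environ.get("HF_TOKEN") or read_env_file(env_path).get("HF_TOKEN")
    return token or None


if __name__ == "__main__":
    if find_hf_token() is None:
        print("HF_TOKEN is not set in the shell or in .env")
```

Running a check like this from the example folder before the runner gives an early, readable message instead of a failure partway through a job.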

Generate runtime examples

To generate the same two example folders inside any runtime workspace:

python3 -m anonim_video_text_library \
  --runtime-root /path/to/runtime \
  --example

To rebuild the generated README and example folders:

python3 -m anonim_video_text_library \
  --runtime-root /path/to/runtime \
  --example \
  --force-example

The generated runtime examples live under:

  • examples/Anonimizez_example/
  • examples/Transcibator_example/

Each example is isolated. The demo scripts no longer create another nested examples/ tree inside their own runtime data.
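The two commands above can also be driven from a script. The sketch below only assembles the documented flags (--runtime-root, --example, --force-example) and runs them through the current interpreter; the function names build_example_command and generate_examples are hypothetical helpers, not part of the package.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path


def build_example_command(runtime_root: Path, force: bool = False) -> list[str]:
    """Assemble the module invocation that generates the example folders."""
    cmd = [
        sys.executable,
        "-m",
        "anonim_video_text_library",
        "--runtime-root",
        str(runtime_root),
        "--example",
    ]
    if force:
        # Rebuild the generated README and example folders.
        cmd.append("--force-example")
    return cmd


def generate_examples(runtime_root: Path, force: bool = False) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_example_command(runtime_root, force), check=True)
```

Using sys.executable keeps the call inside the active virtual environment, so the package installed with pip install -e . is the one that runs.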

Default runtime workspace

By default the anonymizer uses:

text_anonim

That workspace contains:

  • files/pii/ for input files, people.json, and blocklists
  • files/pii_anonymized/ for anonymized output
  • README.md with generated workspace instructions
  • examples/ with the two generated example folders
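As a quick sanity check, the layout listed above can be verified with pathlib. This is a sketch only: the entry names come from the list above, while the helper name missing_workspace_entries is hypothetical and not provided by the library.

```python
from __future__ import annotations

from pathlib import Path

# Entries the default workspace is described as containing.
EXPECTED_ENTRIES = (
    "files/pii",
    "files/pii_anonymized",
    "README.md",
    "examples",
)


def missing_workspace_entries(runtime_root: Path) -> list[str]:
    """Return the expected entries that do not exist under runtime_root."""
    return [entry for entry in EXPECTED_ENTRIES if not (runtime_root / entry).exists()]


if __name__ == "__main__":
    missing = missing_workspace_entries(Path("text_anonim"))
    print("missing:", missing if missing else "nothing")
```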

Python API

The main public API is TextAnonymizationSession.

from pathlib import Path
from anonim_video_text_library import TextAnonymizationSession

# Create a session bound to a runtime workspace.
session = TextAnonymizationSession.from_defaults(
    runtime_root=Path("/path/to/runtime"),
    device="auto",
    ner_batch_size=16,
)

# Anonymize a single string; returns the anonymized text and its statistics.
text, stats = session.anonymize_text(
    "Jordan Miller from Northwind Labs wrote to contact@example.com",
    file_id="demo.txt",
)
print(text)
print(stats)

# Anonymize a structured value (here a dict, as loaded from JSON).
payload, stats = session.anonymize_value(
    {
        "title": "Jordan Miller",
        "body": "Northwind Labs contact: contact@example.com",
    },
    file_id="demo.json",
)
print(payload)
print(stats)

# Anonymize the files under input_root and write the results to output_root.
directory_stats = session.anonymize_directory(
    input_root=Path("/path/to/input"),
    output_root=Path("/path/to/output"),
    skip_existing=True,
)
print(directory_stats)
print(session.people_file)
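When records arrive outside the directory workflow, a per-record call can be wrapped in a small loop. The sketch below is an illustration under assumptions: it takes any callable that returns a (payload, stats) pair, which is the shape anonymize_value returns in the example above, and the helper name anonymize_jsonl is hypothetical rather than part of the library.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Callable, Tuple

# Any callable mapping a decoded JSON value to (anonymized_value, stats).
AnonymizeFn = Callable[[Any], Tuple[Any, Any]]


def anonymize_jsonl(src: Path, dst: Path, anonymize: AnonymizeFn) -> int:
    """Anonymize each non-empty JSONL record in src and write it to dst.

    Returns the number of records written.
    """
    count = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines between records
            payload, _stats = anonymize(json.loads(line))
            fout.write(json.dumps(payload, ensure_ascii=False) + "\n")
            count += 1
    return count
```

With a real session, the callable could be lambda value: session.anonymize_value(value, file_id=src.name), mirroring the keyword used in the example above. Passing the callable in keeps the loop testable without loading any models.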

Notes

  • text_anonim/files may contain large working datasets
  • whisper.cpp/models/*.bin are not copied automatically with the project
  • fairseq_env was intentionally not moved with the standalone package; recreate a local environment if you still need it
