A pipeline-based file processing library.

filerohr

filerohr is a pipeline-based file processing library and CLI tool.

Users can configure a custom processing pipeline to suit their needs using freely interchangeable tasks.

filerohr comes with a number of built-in tasks that specialize in audio processing, all based on ffmpeg and ffprobe. Adding new task definitions is relatively easy and only requires some knowledge of Python.

NOTE: filerohr is currently a proof of concept and doesn’t have any tests yet.

filerohr’s name is a play on the German word Fallrohr (literally: downpipe).

CLI

filerohr comes with a CLI that can be executed with uv run python -m filerohr.

Quick start:

# Validate a pipeline config
uv run python -m filerohr validate-config my-pipeline.yaml

# Show available jobs
uv run python -m filerohr list-jobs

# Import a file
uv run python -m filerohr import-file --config my-pipeline.yaml ~/Music/song.mp3

Configuration

filerohr itself is mostly configured through environment variables. The pipeline configuration is YAML and is documented below.

Environment variables

Environment variable configuration options include:

TZ : The local timezone (e.g. Europe/Vienna)

FILEROHR_FFMPEG_BIN : Path to the ffmpeg binary

FILEROHR_FFPROBE_BIN : Path to the ffprobe binary

FILEROHR_DATA_DIR : Path to the data directory

FILEROHR_TMP_DIR : Path to the temporary directory

FILEROHR_TASK_MODULES : Comma-separated list of Python modules to import as task definitions

FILEROHR_PIPELINE_CONFIG_DIR : Path to the pipeline configuration file directory

Most of these variables will be set to sensible defaults in the upcoming container image.
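For example, the variables above might be set in a shell profile like this (all paths and module names here are illustrative, not defaults):

```shell
# Illustrative values only -- adjust paths to your system.
export TZ=Europe/Vienna
export FILEROHR_FFMPEG_BIN=/usr/bin/ffmpeg
export FILEROHR_FFPROBE_BIN=/usr/bin/ffprobe
export FILEROHR_DATA_DIR=/var/lib/filerohr
export FILEROHR_TMP_DIR=/tmp/filerohr
export FILEROHR_TASK_MODULES=my_tasks,more_tasks
export FILEROHR_PIPELINE_CONFIG_DIR=/etc/filerohr/pipelines
```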

Pipeline configuration

The following pipeline configuration is based on filerohr’s built-in tasks:

# You can give pipelines a name.
# That makes it easy to reference them in an API or with the CLI.
name: audio
# Pipelines can be marked as default (but only one).
use_as_default: true
jobs:
  # Downloads the file if a remote URL was provided.
  # Skips them otherwise.
  - job: download_file
    match_content_type: ["audio/*", "video/*", "pass_unset"]
    max_download_size_mib: 25
  # Logs the file size.
  - job: log_file_size
  # Sanitizes the file as audio/video.
  # Automatically skips the file if it is not audio/video.
  - job: sanitize_av
  # Extracts all audio streams as separate files.
  - job: extract_audio
  # Extract metadata from the audio files.
  - job: extract_audio_metadata
  # Normalize the audio file with the podcast preset.
  - job: normalize_audio
    preset: podcast
  # Converts the audio file (if necessary).
  # Only allow opus and flac and convert to flac as needed.
  - job: convert_audio
    allowed_formats: ["opus", "flac"]
    fallback_format: flac
  # Now that the last job that could have changed the audio format is finished,
  # we can extract the mime type from the audio file.
  - job: extract_mime_type
  # Log the file size again.
  # This time it will include the reduction in file size in percent.
  - job: log_file_size
  # Hash the file content with SHA3 512.
  # This will be re-used in store_by_content_hash.
  - job: hash_file
    alg: sha3_512
  # Store the files by their content hash.
  - job: store_by_content_hash
    storage_dir: /home/you/data/audio/by-hash
    # Emit the stored file in the hash-based directory as a pipeline result.
    emit: true
  # Additionally, store by upload date, but only symlink to files in hash storage.
  - job: store_by_creation_date
    storage_dir: /home/you/data/audio/by-date
    symlink: true
    # Do not emit but keep the file in the upload-date-based directory.
    keep: true

Note that you can include jobs multiple times. This can be helpful, e.g., if you want to log file sizes before and after converting audio streams.

Built-in tasks

convert_audio

Ensures an audio file matches the allowed formats.

Audio files that don’t match the formats will be converted to the fallback format.

Note: This job must be placed after an extract_audio job.

Configuration options:

skip: boolean, default: False
allowed_formats: string[] | 'any', required

List of allowed audio formats. Use 'any' to allow any audio format.

Examples:

  • ['vorbis', 'flac', 'opus']
  • 'any'
fallback_format: string, default: 'flac'

The format to fall back to if the detected format is not allowed. ffmpeg often has different encoders for the same audio format. Some of these encoders are experimental; e.g., you probably want to use libopus over opus. If you select an experimental encoder, you might have to add ['-strict', '-2'] to fallback_format_encoder_args to avoid errors.

fallback_format_ext: string | null, default: null

The file extension of the fallback format. This is automatically inferred for the most common audio formats.

fallback_format_encoder_args: string[], required

Additional arguments passed to ffmpeg when encoding to the fallback format.
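Putting these options together, a job entry using the libopus encoder might look like the following sketch (assuming, as the description above suggests, that fallback_format names the ffmpeg encoder):

```
- job: convert_audio
  allowed_formats: ["opus", "flac"]
  # libopus is usually preferred over ffmpeg's experimental native opus encoder
  fallback_format: libopus
  # the extension is inferred for common formats; shown here for clarity
  fallback_format_ext: ".opus"
  # only needed for experimental encoders, e.g. ["-strict", "-2"]
  fallback_format_encoder_args: []
```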

discard_remote

Stops the pipeline if the current file is not a local file.

Configuration options:

skip: boolean, default: False

download_file

Downloads files from remote sources.

When called with a local file path, the job will simply be skipped.

Configuration options:

skip: boolean, default: False
storage_dir: string, default: '$FILEROHR_DATA_DIR/tmp'

Base directory to store downloaded files in.

max_download_size_mib: number, default: 1024

Maximum size allowed for downloaded files in MiB (mebibytes). Use 0 for no limit.

allow_streams: boolean, default: False

Whether to download files from sources that are streaming responses and do not have a fixed file size.

allowed_protocols: ('http' | 'https')[], default: ['http', 'https']

List of allowed protocols to download files from.

timeout_seconds: integer, default: 600

Timeout in seconds for downloading.

follow_redirects: boolean, default: True

Follow redirects when downloading.

match_content_type: (string | 'pass_unset')[] | null, default: null

List of mime types to check. Supports glob patterns like audio/*. Add pass_unset to the list if you want this check to pass in case no Content-Type was given in the server’s response headers. Setting match_content_type to null will allow all content types.
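The glob patterns follow the usual shell-style matching. A rough sketch of the check using Python’s fnmatch (the helper name is illustrative; filerohr’s actual matching logic may differ):

```python
from fnmatch import fnmatch

def content_type_matches(content_type, patterns):
    """Return True if the content type matches any glob pattern.

    A missing Content-Type header (None) only passes when the special
    'pass_unset' entry is present; a patterns value of None allows everything.
    """
    if patterns is None:
        return True
    if content_type is None:
        return "pass_unset" in patterns
    return any(fnmatch(content_type, p) for p in patterns if p != "pass_unset")

print(content_type_matches("audio/mpeg", ["audio/*", "video/*"]))  # True
print(content_type_matches(None, ["audio/*", "pass_unset"]))       # True
print(content_type_matches(None, ["audio/*"]))                     # False
```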

download_chunk_size_mib: number, default: 1

Size of chunks to download in MiB (mebibytes).

extract_audio

Extracts all audio streams in the job’s file and saves them as separate files.

In case more than one stream is found, this job will spawn a subsequent job for each stream file.

If you only want to extract a single stream, set count: 1 in the job configuration.

Configuration options:

skip: boolean, default: False
count: integer, default: 0

Maximum number of audio streams to extract. This is useful for files that contain multiple audio streams. Extracts all audio streams if set to 0.

extract_audio_metadata

Extracts metadata (such as duration, artist, title, and album) from the job’s file.

Configuration options:

skip: boolean, default: False

extract_mime_type

Extracts the mime type of the job’s file.

Configuration options:

skip: boolean, default: False

ffmpeg

Run a custom ffmpeg command with args specified in the job configuration. Ensure that you include {input_file} and {output_file} placeholders.

Configuration options:

skip: boolean, default: False
args: string[], required

ffmpeg command-line arguments. Must contain {input_file} and {output_file} as literal placeholder strings for the actual file paths.

Examples:

  • ['-i', '{input_file}', '-vn', '{output_file}']
output_format: string, required

Output file format. Must be a valid file extension understood by ffmpeg.

Examples:

  • '.mp3'
  • '.flac'

hash_file

Hash the file content.

Configuration options:

skip: boolean, default: False
alg: 'blake2b' | 'blake2s' | 'md5' | 'sha1' | 'sha224' | 'sha256' | 'sha384' | 'sha3_224' | 'sha3_256' | 'sha3_384' | 'sha3_512' | 'sha512' | 'shake_128' | 'shake_256', default: 'sha256'

Hash algorithm to use.
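The algorithm names above are Python hashlib algorithms. A minimal sketch of chunked file hashing (the helper is illustrative, not filerohr’s API):

```python
import hashlib

def hash_file_content(path, alg="sha256", chunk_size=1024 * 1024):
    """Hash a file's content in chunks to avoid loading it all into memory."""
    h = hashlib.new(alg)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

With alg 'sha3_512', this yields a 128-character hex digest, which a store_by_content_hash job can later split into directory levels.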

keep_file

Keep the current job file and do not delete it after the pipeline finishes.

This can be helpful to debug intermediate job output.

Configuration options:

skip: boolean, default: False

log_file_size

Logs the file size in MiB.

If used multiple times in the same pipeline config, it will also display the change in file size compared to the reference.

Configuration options:

skip: boolean, default: False
force_update_reference: boolean, default: False

If set to true, the reference used to calculate changes in the file size between jobs will be updated. Implicitly true on first call.

normalize_audio

Normalizes the audio stream with ffmpeg-normalize.

If no additional options are given, the podcast preset will be used.

Note: This job must be placed after an extract_audio job.

Configuration options:

skip: boolean, default: False
preset: 'music' | 'podcast' | 'streaming-video' | null, default: null

The audio normalization preset to use. See https://slhck.info/ffmpeg-normalize/usage/presets/ for available presets. Mutually exclusive with args.

args: string[], required

ffmpeg-normalize command line arguments. See https://slhck.info/ffmpeg-normalize/usage/cli-options/ for available options. Mutually exclusive with preset.

sanitize_av

Sanitizes the job file as an audio/video stream. This performs a simple copy operation for all streams in the job’s file and ensures that the resulting file is playable.

If the file is not an audio file, it will be skipped by default.

Configuration options:

skip: boolean, default: False
error_handling: 'ignore' | 'ignore_minor', default: 'ignore'

How to handle errors during file sanitization of broken files. ignore will ignore all but critical errors in the audio and corresponds to ffmpeg’s -err_detect ignore_err. The resulting file will play, but may not be pleasant to listen to. ignore_minor will let minor errors pass and corresponds to ffmpeg’s -err_detect careful.

Examples:

  • 'ignore'
  • 'ignore_minor'
skip_invalid: boolean, default: True

Silently skip files that do not contain audio.

store_by_content_hash

Stores the job file in a directory tree based on the file’s content hash.

The generated directory structure looks like this:

audio_dir/
  8d/
    a9/
      8da9bce68a6aebdcba325cf21402c78c1628c9da1278b817a600cdd92b720653.flac
      8da9fc4939da378a720ba1ba310d3d7a1a85e44b79cd4c68bfc4bd3081f01062.flac
  e0/
    0a/
      e00afc4939da378a720ba1ba310d3d7a1a85e44b79cd4c68bfc4bd3081f01062.flac

Configuration options:

skip: boolean, default: False
storage_dir: string, default: '$FILEROHR_DATA_DIR'

Base directory to store files in.

symlink: boolean, default: False

If set to true, a symlink is created to the file from the previous job. In that case, the previous job must create the file in a persistent storage location.

keep: boolean, default: False

If set to true, the file is kept after the pipeline finishes.

emit: boolean, default: False

If set to true, the file is emitted as a pipeline result. Implies keep.

alg: 'blake2b' | 'blake2s' | 'md5' | 'sha1' | 'sha224' | 'sha256' | 'sha384' | 'sha3_224' | 'sha3_256' | 'sha3_384' | 'sha3_512' | 'sha512' | 'shake_128' | 'shake_256' | null, default: null

Hash algorithm to use. If unset the job will try to re-use an existing file hash calculated in a hash_file job. Otherwise, a new hash will be calculated and stored along with the file. In case the file already has a hash but uses a different algorithm, a new hash will be calculated for storage but the hash on the file will be kept.

levels: integer, default: 2

Number of directory levels that should be created for stored files.

chars_per_level: integer, default: 2

Number of hash-characters per level.
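With the defaults (levels: 2, chars_per_level: 2), the storage path is derived from the hash as in this sketch of the layout shown above (illustrative, not filerohr’s code):

```python
def hashed_path(digest, ext, levels=2, chars_per_level=2):
    """Build the nested storage path for a content hash."""
    # One path component per level, each taking chars_per_level characters.
    parts = [digest[i * chars_per_level:(i + 1) * chars_per_level]
             for i in range(levels)]
    return "/".join(parts + [digest + ext])

print(hashed_path("8da9bce68a6aebdcba325cf2", ".flac"))
# 8d/a9/8da9bce68a6aebdcba325cf2.flac
```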

store_by_creation_date

Stores the job file in a directory tree based on the creation date.

The generated directory structure looks like this:

audio_dir/
  2025/
    11/
      20/
        filename2.flac
        filename3.flac
    10/
      16/
        filename1.flac

Configuration options:

skip: boolean, default: False
storage_dir: string, default: '$FILEROHR_DATA_DIR'

Base directory to store files in.

symlink: boolean, default: False

If set to true, a symlink is created to the file from the previous job. In that case, the previous job must create the file in a persistent storage location.

keep: boolean, default: False

If set to true, the file is kept after the pipeline finishes.

emit: boolean, default: False

If set to true, the file is emitted as a pipeline result. Implies keep.

month: boolean, default: True

Include month in storage subdirectories.

day: boolean, default: True

Include day in storage subdirectories.
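The date-based layout above can be sketched like this (illustrative, not filerohr’s code; day is only nested inside month):

```python
from datetime import date

def dated_path(created, filename, month=True, day=True):
    """Build the YYYY[/MM[/DD]]/filename storage path."""
    parts = [f"{created.year:04d}"]
    if month:
        parts.append(f"{created.month:02d}")
        if day:
            parts.append(f"{created.day:02d}")
    return "/".join(parts + [filename])

print(dated_path(date(2025, 11, 20), "filename2.flac"))
# 2025/11/20/filename2.flac
```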

Custom tasks

You can define custom tasks by creating a Python module/file. After that, set the FILEROHR_TASK_MODULES environment variable to the name of your module. If you have created multiple modules, separate them with commas.

You can find examples for tasks in the my-custom-tasks.py file or in filerohr’s own filerohr/tasks/ directory.

When implementing a custom task, there are three important guidelines:

  1. DO NOT block the event loop.

    All code must run asynchronously (cooperative multitasking). filerohr includes the aiofiles and pebble libraries and some additional helpers in filerohr.utils. Use these for blocking IO or CPU-intensive work.

  2. Call job.next() when you are done with the task.

    That is, if you want the pipeline to move on. If you don’t call job.next(), the pipeline will stop after your job finishes. Sometimes a job might do that intentionally, but most of the time you don’t want that.

    You may also call job.next() multiple times if you want to enqueue multiple follow-up jobs. This is useful when your job creates multiple new files (like filerohr.tasks.av.extract_audio does).

  3. Don’t modify job.file directly.

    Instead, call job.next(file.clone(path=...)).
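Guideline 1 boils down to pushing blocking work off the event loop. filerohr ships aiofiles and pebble for this; the stdlib sketch below shows the same pattern with asyncio.to_thread (the task body is hypothetical, not filerohr’s API):

```python
import asyncio
import hashlib

def expensive_hash(data: bytes) -> str:
    # CPU-bound work that would stall the event loop if run inline.
    return hashlib.sha3_512(data).hexdigest()

async def my_task(data: bytes) -> str:
    # Run the blocking call in a worker thread so other jobs keep running.
    digest = await asyncio.to_thread(expensive_hash, data)
    # In a real filerohr task you would now call job.next(...) to continue
    # the pipeline; here we just return the result.
    return digest

print(asyncio.run(my_task(b"example")))
```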
