A pipeline-based file processing library.

filerohr

filerohr is a pipeline-based file processing library and CLI tool.

Users can configure a custom processing pipeline to suit their needs using freely interchangeable tasks.

filerohr comes with a number of built-in tasks that specialize in audio processing, all based on ffmpeg and ffprobe. Adding new task definitions is relatively easy and only requires some knowledge of Python.

NOTE: filerohr is currently a proof of concept and doesn’t have any tests yet.

filerohr’s name is a play on the German word Fallrohr (literally “drop pipe”).

CLI

filerohr comes with a CLI that can be executed with uv run python -m filerohr.

Quick start:

# Validate a pipeline config
uv run python -m filerohr validate-config my-pipeline.yaml

# Show available jobs
uv run python -m filerohr list-jobs

# Import a file
uv run python -m filerohr import-file --config my-pipeline.yaml ~/Music/song.mp3
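
If you want to drive the CLI from a Python script, the commands above can be assembled as argument lists. This is only a minimal sketch: the helper names filerohr_argv and run_filerohr are made up for illustration, and the subcommands are the ones shown in the quick start.

```python
import subprocess

# Base command taken from the quick start above.
BASE_CMD = ["uv", "run", "python", "-m", "filerohr"]


def filerohr_argv(subcommand: str, *args: str) -> list[str]:
    """Build the argument vector for one filerohr CLI call."""
    return [*BASE_CMD, subcommand, *args]


def run_filerohr(subcommand: str, *args: str) -> int:
    """Run the CLI and return its exit code (requires uv and filerohr)."""
    result = subprocess.run(filerohr_argv(subcommand, *args), check=False)
    return result.returncode
```

For example, run_filerohr("validate-config", "my-pipeline.yaml") would execute the first quick-start command and return its exit code.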

Configuration

filerohr itself is mostly configured through environment variables. The pipeline configuration is YAML and is documented below.

Environment variables

Environment variable configuration options include:

TZ : The local timezone (e.g. Europe/Vienna)

FILEROHR_FFMPEG_BIN : Path to the ffmpeg binary

FILEROHR_FFPROBE_BIN : Path to the ffprobe binary

FILEROHR_DATA_DIR : Path to the data directory

FILEROHR_TMP_DIR : Path to the temporary directory

FILEROHR_TASK_MODULES : comma-separated list of Python modules to import as task definitions

FILEROHR_PIPELINE_CONFIG_DIR : Path to the pipeline configuration file directory

Most of these variables will be set to sensible defaults in the upcoming container image.
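
Until the container image exists, the variables have to be set by hand. The following sketch builds such an environment in Python, for example to pass it to subprocess.run. The function name build_filerohr_env and the idea of locating ffmpeg and ffprobe on the PATH are assumptions for illustration. Only the variable names come from the list above.

```python
import os
import shutil


def build_filerohr_env(data_dir, tmp_dir, task_modules=(), base=None):
    """Return a copy of the environment with filerohr variables filled in."""
    env = dict(os.environ if base is None else base)
    env["FILEROHR_DATA_DIR"] = str(data_dir)
    env["FILEROHR_TMP_DIR"] = str(tmp_dir)
    if task_modules:
        # Multiple task modules are passed as a comma-separated list.
        env["FILEROHR_TASK_MODULES"] = ",".join(task_modules)
    # Assumption: point filerohr at whatever ffmpeg/ffprobe is on the PATH,
    # unless the variable has already been set explicitly.
    for var, binary in (
        ("FILEROHR_FFMPEG_BIN", "ffmpeg"),
        ("FILEROHR_FFPROBE_BIN", "ffprobe"),
    ):
        found = shutil.which(binary)
        if found and var not in env:
            env[var] = found
    return env
```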

Pipeline configuration

The following pipeline configuration is based on filerohr’s built-in tasks:

# You can give pipelines a name.
# That makes it easy to reference them in an API or with the CLI.
name: audio
# Pipelines can be marked as default (but only one).
use_as_default: true
jobs:
  # Downloads the file if a remote URL was provided.
  # Skips the job otherwise.
  - job: download_file
    match_content_type: ["audio/*", "video/*", "pass_unset"]
    max_download_size_mib: 25
  # Logs the file size.
  - job: log_file_size
  # Sanitizes the file as audio/video.
  # Automatically skips the file if it is not audio/video.
  - job: sanitize_av
  # Extracts all audio streams as separate files.
  - job: extract_audio
  # Extract metadata from the audio files.
  - job: extract_audio_metadata
  # Normalize the audio file with the podcast preset.
  - job: normalize_audio
    preset: podcast
  # Converts the audio file (if necessary).
  # Only allow opus and flac and convert to flac as needed.
  - job: convert_audio
    allowed_formats: ["opus", "flac"]
    fallback_format: flac
  # Now that the last job that could have changed the audio format is finished,
  # we can extract the mime type from the audio file.
  - job: extract_mime_type
  # Log the file size again.
  # This time it will include the reduction in file size in percent.
  - job: log_file_size
  # Hash the file content with SHA3-512.
  # This will be re-used in store_by_content_hash.
  - job: hash_file
    alg: sha3_512
  # Store the files by their content hash.
  - job: store_by_content_hash
    storage_dir: /home/you/data/audio/by-hash
    # Emit the stored file in the hash-based directory as a pipeline result.
    emit: true
  # Additionally, store by creation date, but only symlink to files in hash storage.
  - job: store_by_creation_date
    storage_dir: /home/you/data/audio/by-date
    symlink: true
    # Do not emit but keep the file in the creation-date-based directory.
    keep: true

Note that you can include jobs multiple times. This can be helpful, e.g. if you log the file size before and after converting audio streams.

Built-in tasks

convert_audio

Ensures an audio file matches the allowed formats.

Audio files that don’t match the formats will be converted to the fallback format.

Note: This job must be placed after an extract_audio job.

Configuration options:

skip: boolean, default: False
allowed_formats: string[], required

List of allowed audio formats. Use 'any' to allow any audio format.

Examples:

  • ['vorbis', 'flac', 'opus']

fallback_format: string, default: 'flac'

The format to fall back to if the detected format is not allowed. ffmpeg often has different encoders for the same audio format. Some of these encoders are experimental, e.g. you probably want to use libopus over opus. If you select an experimental encoder, you might have to add ['-strict', '-2'] to fallback_format_encoder_args to avoid errors.

fallback_format_ext: string | null, default: null

The file extension of the fallback format. This is automatically inferred for the most common audio formats.

fallback_format_encoder_args: string[], required

Additional arguments passed to ffmpeg when encoding to the fallback format.
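
Because convert_audio (and, per its own note, normalize_audio) must come after an extract_audio job, it can be handy to check the job order of a pipeline before running it. The sketch below works on the jobs list of a pipeline config that has already been loaded into Python dictionaries. The function name find_ordering_problems is hypothetical, and this is not how filerohr itself validates configs.

```python
# Jobs that, according to their task notes, must follow an extract_audio job.
NEEDS_EXTRACT_AUDIO = {"convert_audio", "normalize_audio"}


def find_ordering_problems(jobs):
    """Return a message for every job that precedes the first extract_audio."""
    problems = []
    seen_extract_audio = False
    for index, entry in enumerate(jobs):
        name = entry["job"]
        if name == "extract_audio":
            seen_extract_audio = True
        elif name in NEEDS_EXTRACT_AUDIO and not seen_extract_audio:
            problems.append(
                f"job #{index} ({name}) appears before any extract_audio job"
            )
    return problems
```

An empty result means the documented ordering constraint is satisfied. It says nothing about whether the individual job options are valid.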

discard_remote

Stops the pipeline if the current file is not a local file.

Configuration options:

skip: boolean, default: False

download_file

Downloads files from remote sources.

When called with a local file path, the job will simply be skipped.

Configuration options:

skip: boolean, default: False
storage_dir: string | null, default: null

Base directory to store downloaded files in.

max_download_size_mib: number, default: 1024

Maximum size allowed for downloaded files in MiB (mebibytes). Use 0 for no limit.

allow_streams: boolean, default: False

Whether to download files from sources that are streaming responses and do not have a fixed file size.

allowed_protocols: ('http' | 'https')[], default: ['http', 'https']

List of allowed protocols to download files from.

timeout_seconds: integer, default: 600

Timeout in seconds for downloading.

follow_redirects: boolean, default: True

Follow redirects when downloading.

match_content_type: (string | 'pass_unset')[] | null, default: null

List of mime types to check. Supports glob patterns like audio/*. Add pass_unset to the list if you want this check to pass in case no Content-Type was given in the server’s response headers. Setting match_content_type to null will allow all content types.

download_chunk_size_mib: number, default: 1

Size of chunks to download in MiB (mebibytes).
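
The match_content_type rules can be read as a small decision function. The following is a sketch of the behaviour described above, not filerohr's actual implementation: the function name content_type_allowed is made up, glob matching is done with the standard library's fnmatch, and stripping parameters such as "; charset=..." from the header is an assumption.

```python
from fnmatch import fnmatchcase


def content_type_allowed(content_type, patterns):
    """Decide whether a response Content-Type passes match_content_type."""
    if patterns is None:
        # null allows all content types.
        return True
    if content_type is None:
        # No Content-Type header: only passes if pass_unset is listed.
        return "pass_unset" in patterns
    # Assumption: compare only the mime type, without header parameters.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return any(
        fnmatchcase(mime_type, pattern)
        for pattern in patterns
        if pattern != "pass_unset"
    )
```

With the list from the example pipeline, ["audio/*", "video/*", "pass_unset"], an audio/mpeg response and a response without a Content-Type header both pass, while text/html is rejected.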

extract_audio

Extracts all audio streams in the job’s file and saves them as separate files.

In case more than one stream is found, this job will spawn a subsequent job for each stream file.

If you only want to extract a single stream, set count: 1 in the job configuration.

Configuration options:

skip: boolean, default: False
count: integer, default: 0

Maximum number of audio streams to extract. This is useful for files that contain multiple audio streams. Extracts all audio streams if count is 0.

extract_audio_metadata

Extracts metadata such as duration, artist, title, and album from the job’s file.

Configuration options:

skip: boolean, default: False

extract_mime_type

Extracts the mime type of the job’s file.

Configuration options:

skip: boolean, default: False

ffmpeg

Run a custom ffmpeg command with args specified in the job configuration. Ensure that you include {input_file} and {output_file} placeholders.

Configuration options:

skip: boolean, default: False
args: string[], required

ffmpeg command line arguments. Must contain {input_file} and {output_file} as literal placeholder strings for actual file paths.

Examples:

  • ['-i', '{input_file}', '-vn', '{output_file}']
output_format: string, required

Output file format. Must be a valid file extension understood by ffmpeg.

Examples:

  • '.mp3'
  • '.flac'
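
To illustrate how the two placeholders in args are meant to be used, here is a minimal sketch of the substitution step. It is an assumption about the general idea, not filerohr's code, and the function name render_ffmpeg_args is hypothetical.

```python
import re

# The two literal placeholders that args must contain.
PLACEHOLDER = re.compile(r"\{(input_file|output_file)\}")


def render_ffmpeg_args(args, input_file, output_file):
    """Replace the placeholders in args and fail if one of them is missing."""
    values = {"input_file": input_file, "output_file": output_file}
    used = set()

    def substitute(match):
        used.add(match.group(1))
        return values[match.group(1)]

    rendered = [PLACEHOLDER.sub(substitute, arg) for arg in args]
    missing = sorted(set(values) - used)
    if missing:
        raise ValueError(f"args is missing placeholder(s): {', '.join(missing)}")
    return rendered
```

Applied to the example above, ['-i', '{input_file}', '-vn', '{output_file}'] with in.mkv and out.flac becomes ['-i', 'in.mkv', '-vn', 'out.flac'].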

hash_file

Hash the file content.

Configuration options:

skip: boolean, default: False
alg: 'blake2b' | 'blake2s' | 'md5' | 'sha1' | 'sha224' | 'sha256' | 'sha384' | 'sha3_224' | 'sha3_256' | 'sha3_384' | 'sha3_512' | 'sha512' | 'shake_128' | 'shake_256', default: 'sha256'

Hash algorithm to use.
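
All listed algorithm names are also valid names for Python's hashlib. The following sketch shows how a file can be hashed in chunks with any of them. It is an illustration, not filerohr's implementation. The chunk size and the shake_length parameter (the shake_* algorithms need an explicit digest length) are assumptions.

```python
import hashlib
from pathlib import Path


def hash_file(path, alg="sha256", chunk_size=1024 * 1024, shake_length=32):
    """Return the hex digest of a file, reading it in chunks."""
    hasher = hashlib.new(alg)
    with Path(path).open("rb") as handle:
        while chunk := handle.read(chunk_size):
            hasher.update(chunk)
    if alg.startswith("shake_"):
        # Variable-length digests need the number of output bytes.
        return hasher.hexdigest(shake_length)
    return hasher.hexdigest()
```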

keep_file

Keep the current job file and do not delete it after the pipeline finishes.

This can be helpful to debug intermediate job output.

Configuration options:

skip: boolean, default: False

log_file_size

Logs the file size in MiB.

If used multiple times in the same pipeline config, it will also display the change in file size compared to the reference.

Configuration options:

skip: boolean, default: False
force_update_reference: boolean, default: False

If set to true, the reference used to calculate changes in the file size between jobs will be updated. Implicitly true on first call.
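
The reported numbers boil down to a conversion to MiB and a percentage relative to the reference size. A small sketch of that arithmetic follows. The function name describe_size and the exact message format are made up and will differ from filerohr's log output.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def describe_size(size_bytes, reference_bytes=None):
    """Format a size in MiB, plus the change relative to a reference size."""
    text = f"{size_bytes / MIB:.2f} MiB"
    if reference_bytes:
        change = (size_bytes - reference_bytes) / reference_bytes * 100
        text += f" ({change:+.1f}% vs. reference)"
    return text
```

A file that shrinks from 4 MiB to 3 MiB is reported with a change of -25.0%.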

normalize_audio

Normalizes the audio stream with ffmpeg-normalize.

If no additional options are given, the podcast preset will be used.

Note: This job must be placed after an extract_audio job.

Configuration options:

skip: boolean, default: False
preset: 'music' | 'podcast' | 'streaming-video' | null, default: null

The audio normalization preset to use. See https://slhck.info/ffmpeg-normalize/usage/presets/ for available presets. Mutually exclusive with args.

args: string[], required

ffmpeg-normalize command line arguments. See https://slhck.info/ffmpeg-normalize/usage/cli-options/ for available options. Mutually exclusive with preset.
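
The interplay between preset and args can be summarized in a few lines. This sketch only mirrors the rules stated above (mutually exclusive, podcast preset when nothing is given). The function name resolve_normalize_options is hypothetical and not part of filerohr.

```python
def resolve_normalize_options(preset=None, args=None):
    """Pick the effective normalization options for a job config."""
    if preset is not None and args:
        raise ValueError("preset and args are mutually exclusive")
    if args:
        return {"args": list(args)}
    # Neither option given: fall back to the podcast preset.
    return {"preset": preset or "podcast"}
```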

sanitize_av

Sanitizes the job file as an audio/video stream. This performs a simple copy operation for all streams in the job’s file and ensures that the resulting file is playable.

If the file is not an audio file, it will be skipped by default.

Configuration options:

skip: boolean, default: False
error_handling: 'ignore' | 'ignore_minor', default: 'ignore'

How to handle errors during file sanitization of broken files. ignore will ignore all but critical errors in the audio and corresponds to ffmpeg’s -err_detect ignore_err. The resulting file will play, but may not be pleasant to listen to. ignore_minor will let minor errors pass and corresponds to ffmpeg’s -err_detect careful.

Examples:

  • 'ignore'
  • 'ignore_minor'
skip_invalid: boolean, default: True

Silently skip files that do not contain audio.
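
The correspondence between error_handling and ffmpeg's -err_detect option can be written down as a lookup table. The sketch below only covers that mapping as described above. The function name err_detect_args is made up, and where filerohr places these arguments in the full ffmpeg command is not shown here.

```python
# error_handling value -> ffmpeg -err_detect flag, as documented above.
ERR_DETECT_FLAGS = {
    "ignore": "ignore_err",
    "ignore_minor": "careful",
}


def err_detect_args(error_handling="ignore"):
    """Return the ffmpeg arguments for an error_handling setting."""
    try:
        flag = ERR_DETECT_FLAGS[error_handling]
    except KeyError:
        raise ValueError(f"unknown error_handling: {error_handling!r}") from None
    return ["-err_detect", flag]
```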

store_by_content_hash

Stores the job file in a directory tree based on the file’s content hash.

The generated directory structure looks like this:

audio_dir/
  8d/
    a9/
      8da9bce68a6aebdcba325cf21402c78c1628c9da1278b817a600cdd92b720653.flac
      8da9fc4939da378a720ba1ba310d3d7a1a85e44b79cd4c68bfc4bd3081f01062.flac
  e0/
    0a/
      e00afc4939da378a720ba1ba310d3d7a1a85e44b79cd4c68bfc4bd3081f01062.flac

Configuration options:

skip: boolean, default: False
storage_dir: string | null, default: null

Base directory to store files in.

symlink: boolean, default: False

If set to true, a symlink is created to the file from the previous job. In that case, the previous job must create the file in a persistent storage location.

relative: boolean, default: True

If set to true, the symlink will be relative to the storage path.

keep: boolean, default: False

If set to true, the file is kept after the pipeline finishes.

emit: boolean, default: False

If set to true, the file is emitted as a pipeline result. Implies keep.

alg: 'blake2b' | 'blake2s' | 'md5' | 'sha1' | 'sha224' | 'sha256' | 'sha384' | 'sha3_224' | 'sha3_256' | 'sha3_384' | 'sha3_512' | 'sha512' | 'shake_128' | 'shake_256' | null, default: null

Hash algorithm to use. If unset, the job will try to re-use an existing file hash calculated in a hash_file job. Otherwise, a new hash will be calculated and stored along with the file. In case the file already has a hash but uses a different algorithm, a new hash will be calculated for storage but the hash on the file will be kept.

levels: integer, default: 2

Number of directory levels that should be created for stored files.

chars_per_level: integer, default: 2

Number of hash-characters per level.
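
The directory layout follows directly from levels and chars_per_level. As a sketch of how such a path could be derived (the function name hashed_path is hypothetical and this is not filerohr's code):

```python
from pathlib import PurePosixPath


def hashed_path(storage_dir, digest, ext, levels=2, chars_per_level=2):
    """Build the storage path for a file from its content hash."""
    # Each directory level consumes the next chars_per_level hash characters.
    parts = [
        digest[i * chars_per_level:(i + 1) * chars_per_level]
        for i in range(levels)
    ]
    return PurePosixPath(storage_dir, *parts, f"{digest}{ext}")
```

With the defaults, a hash starting with 8da9 ends up in audio_dir/8d/a9/, which matches the tree shown above.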

store_by_creation_date

Stores the job file in a directory tree based on the creation date.

The generated directory structure looks like this:

audio_dir/
  2025/
    11/
      20/
        filename2.flac
        filename3.flac
    10/
      16/
        filename1.flac

Configuration options:

skip: boolean, default: False
storage_dir: string | null, default: null

Base directory to store files in.

symlink: boolean, default: False

If set to true, a symlink is created to the file from the previous job. In that case, the previous job must create the file in a persistent storage location.

relative: boolean, default: True

If set to true, the symlink will be relative to the storage path.

keep: boolean, default: False

If set to true, the file is kept after the pipeline finishes.

emit: boolean, default: False

If set to true, the file is emitted as a pipeline result. Implies keep.

month: boolean, default: True

Include month in storage subdirectories.

day: boolean, default: True

Include day in storage subdirectories.
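
The month and day options control how deep the date-based tree gets. A minimal sketch of the path construction follows. The function name dated_path is made up, and the zero-padded directory names are an assumption based on the example tree.

```python
from datetime import date
from pathlib import PurePosixPath


def dated_path(storage_dir, created, filename, month=True, day=True):
    """Build the storage path for a file from its creation date."""
    parts = [f"{created.year:04d}"]
    if month:
        parts.append(f"{created.month:02d}")
    if day:
        parts.append(f"{created.day:02d}")
    return PurePosixPath(storage_dir, *parts, filename)
```

For a file created on 2025-11-20, the defaults yield audio_dir/2025/11/20/filename2.flac. With month and day both disabled, the file is placed directly under audio_dir/2025/.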

Custom tasks

You can define custom tasks by creating a Python module/file. After that, set the FILEROHR_TASK_MODULES environment variable to the name of your module. If you have created multiple modules, separate them with commas.

You can find examples for tasks in the my-custom-tasks.py file or in filerohr’s own filerohr/tasks/ directory.

When implementing a custom task, there are three important guidelines:

  1. DO NOT block the loop.

    All code must run asynchronously, also known as cooperative multitasking. filerohr includes the aiofiles and pebble libraries and some additional helpers in filerohr.utils. Use these to do blocking IO or CPU-intensive work.

  2. Call job.next() when you are done with the task.

    Do this whenever you want the pipeline to move on. If you don’t call job.next(), the pipeline stops after your job has finished. Sometimes a job does that intentionally, but most of the time you don’t want that.

    You may also call job.next() multiple times if you want to enqueue multiple follow-up jobs, for example when your job creates multiple new files (like filerohr.tasks.av.extract_audio does).

  3. Don’t modify job.file directly.

    Instead, call job.next(file.clone(path=...)).
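
The first guideline is the one that is easiest to violate by accident. The sketch below shows the general pattern of moving blocking or CPU-intensive work off the event loop. It deliberately uses only the standard library's asyncio.to_thread (Python 3.9+) as a stand-in, because the helpers in filerohr.utils and the task registration API are not documented here. The function names are made up for illustration.

```python
import asyncio
import hashlib


def blocking_digest(data: bytes) -> str:
    """CPU-bound work that would stall the loop if called directly."""
    return hashlib.sha256(data).hexdigest()


async def digest_in_thread(data: bytes) -> str:
    """Run the blocking function in a worker thread and await the result."""
    return await asyncio.to_thread(blocking_digest, data)


async def main():
    # Both digests are computed in worker threads; the event loop stays
    # free to run other coroutines in the meantime.
    return await asyncio.gather(
        digest_in_thread(b"first file"),
        digest_in_thread(b"second file"),
    )
```

In a real task you would apply the same pattern inside your task function and then hand the result on with job.next(), as described in the guidelines above.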
