Python classes that represent different file formats, for use in type hinting in data workflows

FileFormats

FileFormats provides a library of file-format types implemented as Python classes for validation, detection and typing, and provides hooks for extra functionality and format conversions. Formats are typically validated/identified by a combination of file extension and "magic numbers" where applicable. Unlike other file-type packages, FileFormats supports multi-file data formats ("file sets"), which are often found in scientific workflows, e.g. with separate header/data files.

FileFormats provides a flexible extension framework to add custom identification routines for exotic file formats, e.g. formats that require inspection of headers to locate data files, directories containing certain file types, or peeking at metadata fields to define specific sub-types (e.g. a functional MRI DICOM file set). These file sets with auxiliary files can be moved, copied and hashed as if they were a single file object.
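
To illustrate what hashing a file set "as a single object" can mean, here is a minimal standard-library sketch. It does not use FileFormats and is not how the library implements hashing; the function name hash_file_set and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Union


def hash_file_set(paths: Iterable[Union[str, Path]]) -> str:
    """Return one SHA-256 hex digest covering every file in the set.

    Hypothetical helper for illustration only (not part of FileFormats).
    """
    digest = hashlib.sha256()
    # Sort by file name so the digest does not depend on argument order
    for path in sorted((Path(p) for p in paths), key=lambda p: p.name):
        # Mix in the file name so that renaming a member changes the digest
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")
        with path.open("rb") as f:
            # Read in chunks to keep memory use flat for large data files
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Passing a header/data pair in either order gives the same digest, while changing the contents of either file changes it.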

See the extension template for instructions on how to design FileFormats extension modules that augment the standard file types implemented in the main repository with custom domain/vendor-specific file-format types (e.g. fileformats-medimage).

Notes on MIME-type coverage

Support for all non-vendor standard MIME types (i.e. ones not matching */vnd.* or */x-*) has been added to FileFormats by semi-automatically scraping the IANA MIME types website for file extensions and magic numbers. As such, many of the formats in the library have not been properly tested on real data and so should be treated with some caution. If you encounter any issues with an implemented file type, please raise an issue in the GitHub tracker.
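
The "non-vendor standard" rule above can be expressed with the two glob patterns quoted in the text. The following is a small sketch using only the standard library; the name is_standard_mime is hypothetical and is not a FileFormats function.

```python
from fnmatch import fnmatchcase

# Patterns quoted in the text: vendor trees and unregistered "x-" types
NON_STANDARD_PATTERNS = ("*/vnd.*", "*/x-*")


def is_standard_mime(mime: str) -> bool:
    """Return True if the MIME string matches neither excluded pattern.

    Hypothetical helper for illustration only (not part of FileFormats).
    """
    mime = mime.lower()
    return not any(fnmatchcase(mime, pattern) for pattern in NON_STANDARD_PATTERNS)
```

Under this rule image/png counts as standard, whereas application/vnd.ms-excel and application/x-tar do not.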

A small selection of vendor-specific types can be found under fileformats.vendor.*. Support for additional vendor-specific formats can be added via plugin (see the extension template).

Installation

FileFormats can be installed for Python >= 3.8 from PyPI with

    python3 -m pip install fileformats

Some methods and converters between select formats rely on external dependencies. These are implemented in the corresponding "extras" package, which must be installed separately, e.g.

    python3 -m pip install fileformats-extras

Extension packages exist for formats not covered by the IANA MIME types (e.g. NIfTI, R files, MATLAB files) and can be installed along with their "extras" packages in the same way

    python3 -m pip install \
      fileformats-medimage \
      fileformats-medimage-extras \
      fileformats-datascience \
      fileformats-datascience-extras

Examples

Using the WithMagicNumber mixin class, the Png format can be defined concisely as

    from fileformats.generic import File
    from fileformats.core.mixin import WithMagicNumber

    class Png(WithMagicNumber, File):
        binary = True
        ext = ".png"
        iana_mime = "image/png"
        magic_number = b"\x89PNG"  # the PNG signature starts with byte 0x89, not "."

Files can then be checked to see whether they are of PNG format by

    png = Png("/path/to/image/file.png")  # Checks the extension and magic number

which will raise a FormatMismatchError if initialisation or validation fails. For a boolean check that does not raise, use the matches method

    if Png.matches(a_path_to_a_file):
        ...  # handle the case where the file is a PNG
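
To make the idea of "extension plus magic number" concrete, the following is a minimal standard-library sketch of such a check for PNG. It is not the FileFormats implementation; looks_like_png is a hypothetical helper, and comparing the extension case-insensitively is a choice made for this sketch only.

```python
from pathlib import Path
from typing import Union

# The eight-byte signature that begins every PNG file
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(path: Union[str, Path]) -> bool:
    """Check the file extension first, then the leading bytes of the file.

    Hypothetical helper for illustration only (not part of FileFormats).
    """
    path = Path(path)
    if path.suffix.lower() != ".png":
        return False
    with path.open("rb") as f:
        # Only the first few bytes are read, so the check is cheap
        return f.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE
```

Checking the extension before opening the file avoids any I/O for paths that cannot match.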

Format Identification

There are 2 main functions that can be used for format identification

  • fileformats.core.from_mime
  • fileformats.core.find_matching

from_mime

As the name suggests, this function returns the FileFormats class corresponding to a given MIME string (see the IANA media types registry, https://www.iana.org/assignments/media-types/media-types.xhtml). All official non-vendor MIME types are supported. Unofficial types can be loaded using the application/x-name-of-type form, as long as the name of the type is unique among all installed format types. To avoid name clashes between types from different extensions, a "MIME-like" string can be used instead, in which the fileformats extension namespace serves as an informal registry, e.g. medimage/nifti-gz or datascience/hdf5.
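
The lookup rules described above can be sketched with a toy registry. This is not how from_mime is implemented; the REGISTRY dictionary, the placeholder classes and the lookup function are all hypothetical and only model the behaviour the paragraph describes.

```python
from typing import Dict, Tuple


# Placeholder classes standing in for installed format types (hypothetical)
class Png: ...
class NiftiGz: ...
class Hdf5: ...


# (registry or namespace, type name) -> class; a toy stand-in for installed types
REGISTRY: Dict[Tuple[str, str], type] = {
    ("image", "png"): Png,
    ("medimage", "nifti-gz"): NiftiGz,
    ("datascience", "hdf5"): Hdf5,
}


def lookup(mime: str) -> type:
    """Resolve a MIME or "MIME-like" string against the toy registry."""
    registry, _, name = mime.partition("/")
    if registry == "application" and name.startswith("x-"):
        # The x- form carries no namespace, so the bare name must be unique
        bare = name[len("x-"):]
        hits = [cls for (_, n), cls in REGISTRY.items() if n == bare]
        if len(hits) != 1:
            raise LookupError(f"{len(hits)} installed types are named {bare!r}")
        return hits[0]
    try:
        return REGISTRY[(registry, name)]
    except KeyError:
        raise LookupError(f"no installed type for {mime!r}") from None
```

In this model, medimage/nifti-gz and application/x-nifti-gz resolve to the same class, but only the first form stays unambiguous if another namespace later adds a type with the same name.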

find_matching

Given a set of file-system paths, by default, find_matching will iterate through all installed fileformats classes and return all that validate successfully (formats without any specific constraints are excluded by default). The potential candidate classes can be restricted by using the candidates keyword argument.
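
The search described above amounts to filtering a pool of candidate classes by whether they validate. The sketch below models that with toy single-file formats and the standard library only. It is not the FileFormats implementation: the Format base class, the find_matching_sketch function and its include_generic parameter are hypothetical, and real file sets are handled more carefully than the simple "every path must match" rule used here.

```python
from pathlib import Path
from typing import Iterable, List, Optional, Type, Union


class Format:
    """Toy format: constrained by an extension and/or leading magic bytes."""
    ext: Optional[str] = None
    magic: Optional[bytes] = None

    @classmethod
    def is_constrained(cls) -> bool:
        return cls.ext is not None or cls.magic is not None

    @classmethod
    def matches(cls, path: Path) -> bool:
        if cls.ext is not None and path.suffix != cls.ext:
            return False
        if cls.magic is not None:
            with path.open("rb") as f:
                if f.read(len(cls.magic)) != cls.magic:
                    return False
        return True


class AnyFile(Format):
    pass  # no constraints, so it would match every file


class Png(Format):
    ext = ".png"
    magic = b"\x89PNG\r\n\x1a\n"


class Gif(Format):
    ext = ".gif"
    magic = b"GIF8"


ALL_FORMATS: List[Type[Format]] = [AnyFile, Png, Gif]


def find_matching_sketch(
    paths: Iterable[Union[str, Path]],
    candidates: Optional[Iterable[Type[Format]]] = None,
    include_generic: bool = False,  # hypothetical flag, for this sketch only
) -> List[Type[Format]]:
    """Return every candidate format that validates against all paths."""
    pool = ALL_FORMATS if candidates is None else list(candidates)
    path_list = [Path(p) for p in paths]
    return [
        fmt
        for fmt in pool
        # Unconstrained formats match anything, so skip them unless requested
        if (include_generic or fmt.is_constrained())
        and all(fmt.matches(p) for p in path_list)
    ]
```

Restricting the pool through candidates narrows the search in the same spirit as the candidates keyword argument mentioned above.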

Format Conversion

While format conversion is not implemented in the main FileFormats package itself, FileFormats provides hooks for other packages to implement extra behaviour of this kind. The fileformats-extras package (https://github.com/ArcanaFramework/fileformats-extras) implements a number of converters between standard file-format types, e.g. archive types to/from generic files/directories, which, if installed, can be called using the convert() method.

    from fileformats.application import Zip
    from fileformats.generic import Directory

    zip_file = Zip.convert(Directory("/path/to/a/directory"))
    extracted = Directory.convert(zip_file)
    copied = extracted.copy_to("/path/to/output")

The converters are implemented in the Pydra dataflow framework, and can be linked into wider Pydra workflows by creating a converter task

    import pydra
    from pydra.tasks.mypackage import MyTask
    from fileformats.application import Json, Yaml

    wf = pydra.Workflow(name="a_workflow", input_spec=["in_json"])
    wf.add(
        Yaml.get_converter(Json, name="json2yaml", in_file=wf.lzin.in_json)
    )
    wf.add(
        MyTask(
            name="my_task",
            in_file=wf.json2yaml.lzout.out_file,
        )
    )
    ...

Alternatively, the conversion can be executed outside of a Pydra workflow with

    json_file = Json("/path/to/file.json")
    yaml_file = Yaml.convert(json_file)

License

This work is licensed under a Creative Commons Attribution 4.0 International License


Acknowledgements

The authors acknowledge the facilities and scientific and technical assistance of the National Imaging Facility, a National Collaborative Research Infrastructure Strategy (NCRIS) capability.
