
CLI to inspect and normalize messy data files into clean tables

filelens

Turn messy files into clean tables in one command.

filelens is a CLI that helps you understand and clean messy data files.

Ever opened a file where:

  • headers start on row 6
  • metadata is mixed with data
  • columns are inconsistent

filelens lets you:

  • inspect structure and issues
  • infer a schema
  • convert to a clean table (Parquet)

No config. No guessing. Deterministic output.
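Deterministic header detection is a rule-based idea: find the first row whose width matches the data below it. Here is a rough, hypothetical Python sketch of that idea (filelens itself is implemented in Rust, and its actual heuristics are not shown here):

```python
import csv
import io

def find_header_row(text, min_cols=2):
    """Return the 0-based index of the first row whose width (>= min_cols)
    matches every non-empty row below it. Rule-based, so deterministic."""
    rows = list(csv.reader(io.StringIO(text)))
    widths = [len(r) for r in rows]
    for i, w in enumerate(widths):
        if w >= min_cols and all(x == w for x in widths[i:] if x > 0):
            return i
    return 0

messy = (
    "Metadata: Device=LabX\n"
    "Date: 2024-01-01\n"
    "\n"
    "Sample ID,Value,Unit\n"
    "S1,0.45,mg/mL\n"
)
print(find_header_row(messy))  # -> 3 (the 4th physical line)
```

The metadata lines have only one field each, so the first row that matches the stable three-column width below it wins.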

Quick start

filelens inspect file.csv
filelens convert file.csv --out file.parquet

Install (pip)

Build and install locally:

pip install .

pip install . builds from source and requires Rust/Cargo on your machine.

Build a distributable wheel:

maturin build -b bin --release -o dist
pip install dist/filelens-*.whl

Example

Before

Metadata + mixed rows + unclear structure:

Metadata: Device=LabX
Date: 2024-01-01

Sample ID,Value,Unit
S1,0.45,mg/mL
S2,0.50,mg/mL

Inspect:

filelens inspect sample.csv

Output:

Detected:
- header row: 4
- metadata rows: 1-3
- columns: 3

Warnings:
- none

One command

filelens convert sample.csv --out sample.parquet

What it does:

  • detects structure
  • skips metadata
  • infers schema
  • writes sample.parquet
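Schema inference here means picking the narrowest type that fits every value in a column. A minimal illustrative Python sketch of that idea (not filelens's actual Rust implementation):

```python
def infer_type(values):
    """Infer the narrowest type (int -> float -> str) that fits every value."""
    for cast, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"

# Columns from the sample above: ids, numeric values, units.
rows = [("S1", "0.45", "mg/mL"), ("S2", "0.50", "mg/mL")]
schema = [infer_type(col) for col in zip(*rows)]
print(schema)  # -> ['str', 'float', 'str']
```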

After

Clean table:

sample_id | value | unit
S1        | 0.45  | mg/mL
S2        | 0.50  | mg/mL

Supported inputs

Supports common messy data formats used in analytics and healthcare.

  • Excel / CSV (messy tabular files): .xlsx, .xlsm, .xls, .csv, .tsv, .psv, .txt
  • JSON (nested data): .json, .ndjson
  • XML (including cXML / CDA / NAACCR): .xml, .cxml, .xcml
  • HL7 (basic extraction): .hl7, .msg
  • Compressed text variants: .gz wrappers for supported text formats
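The .gz handling can be thought of as a transparent decompression wrapper in front of the normal text parsers. A stdlib Python sketch of that pattern (illustrative only; the file name is made up):

```python
import gzip
import io
import os

def open_text(path):
    """Open a text file, transparently decompressing it if it ends in .gz."""
    if path.endswith(".gz"):
        return io.TextIOWrapper(gzip.open(path, "rb"), encoding="utf-8")
    return open(path, encoding="utf-8")

# Round-trip demo: write a gzipped CSV, then read it back through the wrapper.
with gzip.open("sample.csv.gz", "wt", encoding="utf-8") as f:
    f.write("Sample ID,Value,Unit\nS1,0.45,mg/mL\n")
with open_text("sample.csv.gz") as f:
    first = f.readline().strip()
print(first)  # -> Sample ID,Value,Unit
os.remove("sample.csv.gz")
```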

Design

  • deterministic (no AI guessing)
  • no config required
  • optimized for messy real-world files

When to use filelens

  • You opened a file and do not understand its structure
  • Your Excel export has metadata rows and broken headers
  • You need to convert XML/JSON into a table quickly
  • You want clean input for dbt or a data warehouse

Command reference

Inspect:

filelens inspect data/file.xlsx
filelens inspect data/order.cxml
filelens inspect data/patient-example.json
filelens inspect data/oru_r01.msg
filelens inspect data/clinical.xml
filelens inspect data/patient-example.ttl
filelens inspect data/patient-example.ttl.html

Schema:

filelens schema data/file.xlsx
filelens schema data/patient-example.json --parser fhir

Convert:

filelens convert data/file.xlsx --out data/file.parquet
filelens convert data/order.cxml --out data/order.parquet
filelens convert data/nested_lab_result.json --out data/nested_lab_result.parquet
filelens convert data/oru_r01.msg --out data/oru_r01.parquet
filelens convert data/patient-example.ttl --out data/patient-example.ttl.parquet

Optional parser override:

filelens inspect data/file.xml --parser cda
filelens inspect data/file.json --parser json
filelens inspect data/file.json --parser fhir
filelens inspect data/file.msg --parser hl7
filelens inspect data/file.ttl --parser rdf

CXML extraction mode:

# curated canonical fields only
filelens schema data/order.cxml --parser cxml --cxml-mode mapped

# path-based auto-captured fields only (x_* columns)
filelens schema data/order.cxml --parser cxml --cxml-mode auto

# both canonical + path-based fields
filelens convert data/order.cxml --parser cxml --cxml-mode both --out data/order.parquet

If running from source, use ./target/release/filelens instead of filelens.

Works with dbt

filelens outputs Parquet files that can be loaded into warehouses and modeled with dbt.

Use it in this order:

  1. Convert files to parquet.
  2. Load parquet into Postgres raw.filelens_lines.
  3. Run dbt models.
  4. Query typed marts.
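To make step 2 concrete: the pattern is to land every extracted record in one generic raw table, then let dbt models normalize it. A self-contained Python sketch using sqlite3 as a stand-in (the real pipeline loads into Postgres raw.filelens_lines via scripts/auto_load_and_run_dbt.sh; the table and column names below are illustrative):

```python
import sqlite3

# Stand-in for step 2: land extracted rows in a generic "raw" table that
# downstream dbt models normalize into typed marts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table raw_filelens_lines (source_file text, record_key text, payload text)"
)
rows = [
    ("sample.parquet", "S1", '{"value": 0.45, "unit": "mg/mL"}'),
    ("sample.parquet", "S2", '{"value": 0.50, "unit": "mg/mL"}'),
]
conn.executemany("insert into raw_filelens_lines values (?, ?, ?)", rows)
count = conn.execute("select count(*) from raw_filelens_lines").fetchone()[0]
print(count)  # -> 2
```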

Setup env vars:

export PGHOST=localhost
export PGPORT=5432
export PGUSER=...
export PGPASSWORD=...
export PGDATABASE=postgres
export DBT_PROFILES_DIR=dbt

One-command local pipeline (public examples only):

scripts/auto_load_and_run_dbt.sh --parquet-glob "$PWD/output/public/**/*.parquet" --full-refresh

What this command does:

  • loads parquet into raw.filelens_lines
  • syncs raw into raw_procurement and raw_clinical
  • runs staging models
  • runs marts (including typed marts)
  • runs tests
  • prints row counts and next query hints

Which tables to query:

  • analytics_marts.fct_procurement_lines for procurement analytics
  • analytics_marts.fct_fhir_resources for FHIR analytics
  • analytics_marts.fct_naaccr_cases for NAACCR analytics
  • analytics_marts.fct_record_attributes for generic key/value search across all extracted attributes

analytics_registry.idx_filelens_records is a cross-format registry/index table (lineage + canonical fields). It is not the primary end-user analytics table.

Why keep raw -> internal -> marts:

  • raw: ingestion/debug layer (what got loaded)
  • analytics_internal: normalization layer (map parser-specific columns into stable canonical fields)
  • marts: consumption layer (deduped and typed tables for analysts/apps)

Example consumer queries:

select * from analytics_marts.fct_procurement_lines limit 20;
select * from analytics_marts.fct_fhir_resources limit 20;
select * from analytics_marts.fct_naaccr_cases limit 20;
select * from analytics_marts.fct_record_attributes limit 20;

Trace NAACCR attributes back to original source ids:

select
  source_file,
  record_key,
  attribute_scope,
  attribute_source_id,
  attribute_name,
  attribute_value
from analytics_marts.fct_record_attributes
where source_kind = 'naaccr'
  and attribute_source_id in ('grade', 'patientidnumber', 'tumorrecordnumber')
limit 20;

Examples

See examples/ for real sample inputs:

  • procurement (cXML / xCML)
  • healthcare (FHIR, HL7, CDA, NAACCR)
  • RDF/Turtle (.ttl, .ttl.html)
  • messy CSV/TSV/PSV/TXT

Build

cargo build --release

Binary path:

./target/release/filelens

Release (GitHub Actions)

Tag-based release:

git tag v0.1.0
git push origin v0.1.0

What happens on tag push (v*):

  • builds platform wheels (Linux, macOS Intel/ARM, Windows)
  • builds source distribution on Linux
  • creates a GitHub Release and uploads dist/* artifacts

Optional PyPI publish:

  • configure PyPI Trusted Publisher for this repo (recommended)
  • on tag push, distributions are published to PyPI via GitHub OIDC
  • or run the Release workflow manually with publish_pypi=true

PyPI Trusted Publisher settings:

  1. PyPI project -> Manage -> Publishing -> Add a new publisher -> GitHub.
  2. Set:
    • Owner: <your-github-owner>
    • Repository name: filelens
    • Workflow name: release.yml
    • Environment name: pypi
  3. Save. No API token secret is required.

Workflow

What this workflow does:

  • builds the filelens binary
  • converts only examples/public files into parquet under output/public
  • loads only output/public/**/*.parquet into Postgres raw tables
  • runs dbt staging + marts with --full-refresh (and tests)
  • does not include non-public example paths unless you change the command

Why --full-refresh in this demo workflow:

  • it rebuilds marts from scratch so the demo is deterministic after parser/model changes
  • it avoids stale incremental state while iterating locally
  • for recurring production loads, omit --full-refresh and use incremental dbt runs
Run it end to end:

cargo build --release
scripts/convert_inputs.sh --input-dir examples/public --output-dir output/public

export PGHOST=localhost
export PGPORT=5432
export PGUSER=...
export PGPASSWORD=...
export PGDATABASE=postgres
export DBT_PROFILES_DIR=dbt

scripts/auto_load_and_run_dbt.sh --parquet-glob "$PWD/output/public/**/*.parquet" --full-refresh

scripts/convert_inputs.sh is non-strict by default (skips failures and continues). Add --strict to fail on first conversion error.

Advanced formats

  • RDF/Turtle (.ttl, .rdf) — experimental support
  • HTML pages containing RDF/Turtle <pre> blocks (for example *.ttl.html)

Why not pandas?

pandas can read files, but it does not:

  • detect likely header/metadata layout
  • explain quality issues up front
  • normalize mixed file families with one deterministic CLI pass

filelens is focused on that first cleanup step before your pipeline.


Download files

Download the file for your platform.

Source Distribution

filelens-0.1.2.tar.gz (88.0 kB)

Uploaded: Source

Built Distributions


filelens-0.1.2-py3-none-win_amd64.whl (6.6 MB)

Uploaded: Python 3, Windows x86-64

filelens-0.1.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.4 MB)

Uploaded: Python 3, manylinux: glibc 2.17+ x86-64

filelens-0.1.2-py3-none-macosx_11_0_arm64.whl (5.7 MB)

Uploaded: Python 3, macOS 11.0+ ARM64

File details

Details for the file filelens-0.1.2.tar.gz.

File metadata

  • Download URL: filelens-0.1.2.tar.gz
  • Size: 88.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for filelens-0.1.2.tar.gz:

  • SHA256: 47141c33d85fffb4fe1a0fde4de4e4bdca5d3da51f62aa0f47318b331da28a08
  • MD5: 326ca1a3c27b815b7a34f1b17024dbfd
  • BLAKE2b-256: ffcc9aaf238c03d799ececdb3d7de82c60ddad91c725637a3cfc08a3f5c48d99


Provenance

The following attestation bundles were made for filelens-0.1.2.tar.gz:

Publisher: release.yml on kraftaa/filelens

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file filelens-0.1.2-py3-none-win_amd64.whl.

File metadata

  • Download URL: filelens-0.1.2-py3-none-win_amd64.whl
  • Size: 6.6 MB
  • Tags: Python 3, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for filelens-0.1.2-py3-none-win_amd64.whl:

  • SHA256: 6d35f1c76c0ed4ce39c449b63339970c8ca33963b68f59916939fb57be55fd40
  • MD5: 547f3dd71e65b872a362b2c10f39f9d9
  • BLAKE2b-256: 080c79b4cc0dc2968fc5bbc2f64814dfa75c9a3e16b21c02a96bafbb902a276c


Provenance

The following attestation bundles were made for filelens-0.1.2-py3-none-win_amd64.whl:

Publisher: release.yml on kraftaa/filelens


File details

Details for the file filelens-0.1.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for filelens-0.1.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

  • SHA256: cf75f6576f2089e0014a9d7444e0160728d44377f58461bb65130a53636d3cac
  • MD5: e515ebad956b0400c485c851db848a0d
  • BLAKE2b-256: f825693e0166dae3d36cbed663e5f3f0b6d7fbadf1db347f2ae47707aa651acf


Provenance

The following attestation bundles were made for filelens-0.1.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on kraftaa/filelens


File details

Details for the file filelens-0.1.2-py3-none-macosx_11_0_arm64.whl.

File hashes

Hashes for filelens-0.1.2-py3-none-macosx_11_0_arm64.whl:

  • SHA256: 6ad134dc9892c08170fa1d8689ce78a91052dfd59a435523ea6e7082d5c89cde
  • MD5: bd45411e058b32f284b6b8de1bb4ad63
  • BLAKE2b-256: 8f2513184aed9f802207bf29ddb224c62a5615b9dfb65c95bad6c4cfb398d75f


Provenance

The following attestation bundles were made for filelens-0.1.2-py3-none-macosx_11_0_arm64.whl:

Publisher: release.yml on kraftaa/filelens

