
CLI to inspect and normalize messy data files into clean tables

filelens

Turn messy files into clean tables in one command.

filelens is a CLI that helps you understand and clean messy data files.

Ever opened a file where:

  • headers start on row 6
  • metadata is mixed with data
  • columns are inconsistent

filelens lets you:

  • inspect structure and issues
  • infer a schema
  • convert to a clean table (Parquet)

No config. No guessing. Deterministic output.
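Deterministic header detection is a rule-based idea: find the first row whose width matches the data below it. Here is a rough, hypothetical Python sketch of that idea (filelens itself is implemented in Rust, and its actual heuristics are not shown here):

```python
import csv
import io

def find_header_row(text, min_cols=2):
    """Return the 0-based index of the first row whose width (>= min_cols)
    matches every non-empty row below it. Rule-based, so deterministic."""
    rows = list(csv.reader(io.StringIO(text)))
    widths = [len(r) for r in rows]
    for i, w in enumerate(widths):
        if w >= min_cols and all(x == w for x in widths[i:] if x > 0):
            return i
    return 0

messy = (
    "Metadata: Device=LabX\n"
    "Date: 2024-01-01\n"
    "\n"
    "Sample ID,Value,Unit\n"
    "S1,0.45,mg/mL\n"
)
print(find_header_row(messy))  # -> 3 (the 4th physical line)
```

The metadata lines have only one field each, so the first row that matches the stable three-column width below it wins.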

Quick start

filelens inspect file.csv
filelens convert file.csv --out file.parquet

Install (pip)

Build and install locally:

pip install .

pip install . builds from source and requires Rust/Cargo on your machine.

Build a distributable wheel:

maturin build -b bin --release -o dist
pip install dist/filelens-*.whl

Example

Before

Metadata + mixed rows + unclear structure:

Metadata: Device=LabX
Date: 2024-01-01

Sample ID,Value,Unit
S1,0.45,mg/mL
S2,0.50,mg/mL

Inspect:

filelens inspect sample.csv

Output:

Detected:
- header row: 4
- metadata rows: 1-3
- columns: 3

Warnings:
- none

One command

filelens convert sample.csv --out sample.parquet

What it does:

  • detects structure
  • skips metadata
  • infers schema
  • writes sample.parquet
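Schema inference here means picking the narrowest type that fits every value in a column. A minimal illustrative Python sketch of that idea (not filelens's actual Rust implementation):

```python
def infer_type(values):
    """Infer the narrowest type (int -> float -> str) that fits every value."""
    for cast, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"

# Columns from the sample above: ids, numeric values, units.
rows = [("S1", "0.45", "mg/mL"), ("S2", "0.50", "mg/mL")]
schema = [infer_type(col) for col in zip(*rows)]
print(schema)  # -> ['str', 'float', 'str']
```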

After

Clean table:

sample_id | value | unit
S1        | 0.45  | mg/mL
S2        | 0.50  | mg/mL

Supported inputs

Supports common messy data formats used in analytics and healthcare.

  • Excel / CSV (messy tabular files): .xlsx, .xlsm, .xls, .csv, .tsv, .psv, .txt
  • JSON (nested data): .json, .ndjson
  • XML (including cXML / CDA / NAACCR): .xml, .cxml, .xcml
  • HL7 (basic extraction): .hl7, .msg
  • Compressed text variants: .gz wrappers for supported text formats
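The .gz handling can be thought of as a transparent decompression wrapper in front of the normal text parsers. A stdlib Python sketch of that pattern (illustrative only; the file name is made up):

```python
import gzip
import io
import os

def open_text(path):
    """Open a text file, transparently decompressing it if it ends in .gz."""
    if path.endswith(".gz"):
        return io.TextIOWrapper(gzip.open(path, "rb"), encoding="utf-8")
    return open(path, encoding="utf-8")

# Round-trip demo: write a gzipped CSV, then read it back through the wrapper.
with gzip.open("sample.csv.gz", "wt", encoding="utf-8") as f:
    f.write("Sample ID,Value,Unit\nS1,0.45,mg/mL\n")
with open_text("sample.csv.gz") as f:
    first = f.readline().strip()
print(first)  # -> Sample ID,Value,Unit
os.remove("sample.csv.gz")
```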

Design

  • deterministic (no AI guessing)
  • no config required
  • optimized for messy real-world files

When to use filelens

  • You opened a file and do not understand its structure
  • Your Excel export has metadata rows and broken headers
  • You need to convert XML/JSON into a table quickly
  • You want clean input for dbt or a data warehouse

Command reference

Inspect:

filelens inspect data/file.xlsx
filelens inspect data/order.cxml
filelens inspect data/patient-example.json
filelens inspect data/oru_r01.msg
filelens inspect data/clinical.xml
filelens inspect data/patient-example.ttl
filelens inspect data/patient-example.ttl.html

Schema:

filelens schema data/file.xlsx
filelens schema data/patient-example.json --parser fhir

Convert:

filelens convert data/file.xlsx --out data/file.parquet
filelens convert data/order.cxml --out data/order.parquet
filelens convert data/nested_lab_result.json --out data/nested_lab_result.parquet
filelens convert data/oru_r01.msg --out data/oru_r01.parquet
filelens convert data/patient-example.ttl --out data/patient-example.ttl.parquet

Optional parser override:

filelens inspect data/file.xml --parser cda
filelens inspect data/file.json --parser json
filelens inspect data/file.json --parser fhir
filelens inspect data/file.msg --parser hl7
filelens inspect data/file.ttl --parser rdf

CXML extraction mode:

# curated canonical fields only
filelens schema data/order.cxml --parser cxml --cxml-mode mapped

# path-based auto-captured fields only (x_* columns)
filelens schema data/order.cxml --parser cxml --cxml-mode auto

# both canonical + path-based fields
filelens convert data/order.cxml --parser cxml --cxml-mode both --out data/order.parquet

If running from source, use ./target/release/filelens instead of filelens.

Works with dbt

filelens outputs Parquet files that can be loaded into warehouses and modeled with dbt.

Use it in this order:

  1. Convert files to parquet.
  2. Load parquet into Postgres raw.filelens_lines.
  3. Run dbt models.
  4. Query typed marts.
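To make step 2 concrete: the pattern is to land every extracted record in one generic raw table, then let dbt models normalize it. A self-contained Python sketch using sqlite3 as a stand-in (the real pipeline loads into Postgres raw.filelens_lines via scripts/auto_load_and_run_dbt.sh; the table and column names below are illustrative):

```python
import sqlite3

# Stand-in for step 2: land extracted rows in a generic "raw" table that
# downstream dbt models normalize into typed marts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table raw_filelens_lines (source_file text, record_key text, payload text)"
)
rows = [
    ("sample.parquet", "S1", '{"value": 0.45, "unit": "mg/mL"}'),
    ("sample.parquet", "S2", '{"value": 0.50, "unit": "mg/mL"}'),
]
conn.executemany("insert into raw_filelens_lines values (?, ?, ?)", rows)
count = conn.execute("select count(*) from raw_filelens_lines").fetchone()[0]
print(count)  # -> 2
```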

Setup env vars:

export PGHOST=localhost
export PGPORT=5432
export PGUSER=...
export PGPASSWORD=...
export PGDATABASE=postgres
export DBT_PROFILES_DIR=dbt

One-command local pipeline (public examples only):

scripts/auto_load_and_run_dbt.sh --parquet-glob "$PWD/output/public/**/*.parquet" --full-refresh

What this command does:

  • loads parquet into raw.filelens_lines
  • syncs raw into raw_procurement and raw_clinical
  • runs staging models
  • runs marts (including typed marts)
  • runs tests
  • prints row counts and next query hints

Which tables to query:

  • analytics_marts.fct_procurement_lines for procurement analytics
  • analytics_marts.fct_fhir_resources for FHIR analytics
  • analytics_marts.fct_naaccr_cases for NAACCR analytics
  • analytics_marts.fct_record_attributes for generic key/value search across all extracted attributes

analytics_registry.idx_filelens_records is a cross-format registry/index table (lineage + canonical fields). It is not the primary end-user analytics table.

Why keep raw -> internal -> marts:

  • raw: ingestion/debug layer (what got loaded)
  • analytics_internal: normalization layer (map parser-specific columns into stable canonical fields)
  • marts: consumption layer (deduped and typed tables for analysts/apps)

Example consumer queries:

select * from analytics_marts.fct_procurement_lines limit 20;
select * from analytics_marts.fct_fhir_resources limit 20;
select * from analytics_marts.fct_naaccr_cases limit 20;
select * from analytics_marts.fct_record_attributes limit 20;

Trace NAACCR attributes back to original source ids:

select
  source_file,
  record_key,
  attribute_scope,
  attribute_source_id,
  attribute_name,
  attribute_value
from analytics_marts.fct_record_attributes
where source_kind = 'naaccr'
  and attribute_source_id in ('grade', 'patientidnumber', 'tumorrecordnumber')
limit 20;

Examples

See examples/ for real sample inputs:

  • procurement (cXML / xCML)
  • healthcare (FHIR, HL7, CDA, NAACCR)
  • RDF/Turtle (.ttl, .ttl.html)
  • messy CSV/TSV/PSV/TXT

Build

cargo build --release

Binary path:

./target/release/filelens

Release (GitHub Actions)

Tag-based release:

git tag v0.1.0
git push origin v0.1.0

What happens on tag push (v*):

  • builds platform wheels (Linux, macOS Intel/ARM, Windows)
  • builds source distribution on Linux
  • creates a GitHub Release and uploads dist/* artifacts

Optional PyPI publish:

  • configure PyPI Trusted Publisher for this repo (recommended)
  • on tag push, distributions are published to PyPI via GitHub OIDC
  • or run the Release workflow manually with publish_pypi=true

PyPI Trusted Publisher settings:

  1. PyPI project -> Manage -> Publishing -> Add a new publisher -> GitHub.
  2. Set:
    • Owner: <your-github-owner>
    • Repository name: filelens
    • Workflow name: release.yml
    • Environment name: pypi
  3. Save. No API token secret is required.

Workflow

What this workflow does:

  • builds the filelens binary
  • converts only examples/public files into parquet under output/public
  • loads only output/public/**/*.parquet into Postgres raw tables
  • runs dbt staging + marts with --full-refresh (and tests)
  • does not include non-public example paths unless you change the command

Why --full-refresh in this demo workflow:

  • it rebuilds marts from scratch so the demo is deterministic after parser/model changes
  • it avoids stale incremental state while iterating locally
  • for recurring production loads, omit --full-refresh and use incremental dbt runs
Run it end to end:

cargo build --release
scripts/convert_inputs.sh --input-dir examples/public --output-dir output/public

export PGHOST=localhost
export PGPORT=5432
export PGUSER=...
export PGPASSWORD=...
export PGDATABASE=postgres
export DBT_PROFILES_DIR=dbt

scripts/auto_load_and_run_dbt.sh --parquet-glob "$PWD/output/public/**/*.parquet" --full-refresh

scripts/convert_inputs.sh is non-strict by default (skips failures and continues). Add --strict to fail on first conversion error.

Advanced formats

  • RDF/Turtle (.ttl, .rdf) — experimental support
  • HTML pages containing RDF/Turtle <pre> blocks (for example *.ttl.html)

Why not pandas?

pandas can read files, but it does not:

  • detect likely header/metadata layout
  • explain quality issues up front
  • normalize mixed file families with one deterministic CLI pass

filelens is focused on that first cleanup step before your pipeline.


Download files

Download the file for your platform.

Source Distribution

filelens-0.1.2.tar.gz (88.0 kB)

Uploaded: Source

Built Distributions


filelens-0.1.2-py3-none-win_amd64.whl (6.6 MB)

Uploaded: Python 3, Windows x86-64

filelens-0.1.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.4 MB)

Uploaded: Python 3, manylinux: glibc 2.17+ x86-64

filelens-0.1.2-py3-none-macosx_11_0_arm64.whl (5.7 MB)

Uploaded: Python 3, macOS 11.0+ ARM64

File details

Details for the file filelens-0.1.2.tar.gz.

File metadata

  • Download URL: filelens-0.1.2.tar.gz
  • Size: 88.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for filelens-0.1.2.tar.gz:

  • SHA256: 47141c33d85fffb4fe1a0fde4de4e4bdca5d3da51f62aa0f47318b331da28a08
  • MD5: 326ca1a3c27b815b7a34f1b17024dbfd
  • BLAKE2b-256: ffcc9aaf238c03d799ececdb3d7de82c60ddad91c725637a3cfc08a3f5c48d99


Provenance

The following attestation bundles were made for filelens-0.1.2.tar.gz:

Publisher: release.yml on kraftaa/filelens

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file filelens-0.1.2-py3-none-win_amd64.whl.

File metadata

  • Download URL: filelens-0.1.2-py3-none-win_amd64.whl
  • Size: 6.6 MB
  • Tags: Python 3, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for filelens-0.1.2-py3-none-win_amd64.whl:

  • SHA256: 6d35f1c76c0ed4ce39c449b63339970c8ca33963b68f59916939fb57be55fd40
  • MD5: 547f3dd71e65b872a362b2c10f39f9d9
  • BLAKE2b-256: 080c79b4cc0dc2968fc5bbc2f64814dfa75c9a3e16b21c02a96bafbb902a276c


Provenance

The following attestation bundles were made for filelens-0.1.2-py3-none-win_amd64.whl:

Publisher: release.yml on kraftaa/filelens


File details

Details for the file filelens-0.1.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for filelens-0.1.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

  • SHA256: cf75f6576f2089e0014a9d7444e0160728d44377f58461bb65130a53636d3cac
  • MD5: e515ebad956b0400c485c851db848a0d
  • BLAKE2b-256: f825693e0166dae3d36cbed663e5f3f0b6d7fbadf1db347f2ae47707aa651acf


Provenance

The following attestation bundles were made for filelens-0.1.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on kraftaa/filelens


File details

Details for the file filelens-0.1.2-py3-none-macosx_11_0_arm64.whl.

File hashes

Hashes for filelens-0.1.2-py3-none-macosx_11_0_arm64.whl:

  • SHA256: 6ad134dc9892c08170fa1d8689ce78a91052dfd59a435523ea6e7082d5c89cde
  • MD5: bd45411e058b32f284b6b8de1bb4ad63
  • BLAKE2b-256: 8f2513184aed9f802207bf29ddb224c62a5615b9dfb65c95bad6c4cfb398d75f


Provenance

The following attestation bundles were made for filelens-0.1.2-py3-none-macosx_11_0_arm64.whl:

Publisher: release.yml on kraftaa/filelens

