
filelens

Turn messy files into clean tables in one command.

filelens is a CLI that helps you understand and clean messy data files.

Ever opened a file where:

  • headers start on row 6
  • metadata is mixed with data
  • columns are inconsistent

filelens lets you:

  • inspect structure and issues
  • infer a schema
  • convert to a clean table (Parquet)

No config. No guessing. Deterministic output.

Quick start

filelens inspect file.csv
filelens convert file.csv --out file.parquet

Install (pip)

Build and install locally:

pip install .

pip install . builds from source and requires Rust/Cargo on your machine.

Build a distributable wheel:

maturin build -b bin --release -o dist
pip install dist/filelens-*.whl

Example

Before

Metadata + mixed rows + unclear structure:

Metadata: Device=LabX
Date: 2024-01-01

Sample ID,Value,Unit
S1,0.45,mg/mL
S2,0.50,mg/mL

Inspect:

filelens inspect sample.csv

Output:

Detected:
- header row: 4
- metadata rows: 1-3
- columns: 3

Warnings:
- none

One command

filelens convert sample.csv --out sample.parquet

What it does:

  • detects structure
  • skips metadata
  • infers schema
  • writes sample.parquet
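The header-detection step above can be illustrated with a simple heuristic: metadata preambles tend to have fewer delimited fields than the data block, so the header is likely the first line whose field count matches the file's most common field count. This is a sketch for intuition only, not filelens's actual detection logic.

```python
import csv
import io
from collections import Counter

def detect_header_row(text: str) -> int:
    """Return the 1-based line number of the likely header row.

    Illustrative heuristic: the header is the first line whose field
    count matches the most common field count across the file.
    """
    rows = list(csv.reader(io.StringIO(text)))
    widths = [len(r) for r in rows if r]          # ignore blank lines
    modal_width, _ = Counter(widths).most_common(1)[0]
    for i, row in enumerate(rows, start=1):
        if row and len(row) == modal_width:
            return i
    raise ValueError("no tabular block found")

sample = (
    "Metadata: Device=LabX\n"
    "Date: 2024-01-01\n"
    "\n"
    "Sample ID,Value,Unit\n"
    "S1,0.45,mg/mL\n"
    "S2,0.50,mg/mL\n"
)
print(detect_header_row(sample))  # -> 4, matching the inspect output above
```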

After

Clean table:

sample_id | value | unit
S1        | 0.45  | mg/mL
S2        | 0.50  | mg/mL

Supported inputs

Supports common messy data formats used in analytics and healthcare.

  • Excel / CSV (messy tabular files): .xlsx, .xlsm, .xls, .csv, .tsv, .psv, .txt
  • JSON (nested data): .json, .ndjson
  • XML (including cXML / CDA / NAACCR): .xml, .cxml, .xcml
  • HL7 (basic extraction): .hl7, .msg
  • Compressed text variants: .gz wrappers for supported text formats
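Conceptually, input handling amounts to mapping a file extension to a parser family, peeling off a `.gz` wrapper first. The table below mirrors the list above; it is a hypothetical sketch, and filelens's real dispatch may differ.

```python
from pathlib import Path

# Hypothetical extension -> parser-family table mirroring the list above.
PARSER_FAMILIES = {
    ".xlsx": "excel", ".xlsm": "excel", ".xls": "excel",
    ".csv": "delimited", ".tsv": "delimited", ".psv": "delimited", ".txt": "delimited",
    ".json": "json", ".ndjson": "json",
    ".xml": "xml", ".cxml": "xml", ".xcml": "xml",
    ".hl7": "hl7", ".msg": "hl7",
}

def parser_family(path: str) -> str:
    p = Path(path)
    # Peel a .gz wrapper off supported text formats first.
    if p.suffix == ".gz":
        p = p.with_suffix("")
    return PARSER_FAMILIES.get(p.suffix.lower(), "unknown")

print(parser_family("data/labs.csv.gz"))   # -> delimited
print(parser_family("data/oru_r01.msg"))   # -> hl7
```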

Design

  • deterministic (no AI guessing)
  • no config required
  • optimized for messy real-world files

When to use filelens

  • You opened a file and do not understand its structure
  • Your Excel export has metadata rows and broken headers
  • You need to convert XML/JSON into a table quickly
  • You want clean input for dbt or a data warehouse

Command reference

Inspect:

filelens inspect data/file.xlsx
filelens inspect data/order.cxml
filelens inspect data/patient-example.json
filelens inspect data/oru_r01.msg
filelens inspect data/clinical.xml
filelens inspect data/patient-example.ttl
filelens inspect data/patient-example.ttl.html

Schema:

filelens schema data/file.xlsx
filelens schema data/patient-example.json --parser fhir

Convert:

filelens convert data/file.xlsx --out data/file.parquet
filelens convert data/order.cxml --out data/order.parquet
filelens convert data/nested_lab_result.json --out data/nested_lab_result.parquet
filelens convert data/oru_r01.msg --out data/oru_r01.parquet
filelens convert data/patient-example.ttl --out data/patient-example.ttl.parquet

Optional parser override:

filelens inspect data/file.xml --parser cda
filelens inspect data/file.json --parser json
filelens inspect data/file.json --parser fhir
filelens inspect data/file.msg --parser hl7
filelens inspect data/file.ttl --parser rdf

If running from source, use ./target/release/filelens instead of filelens.

Works with dbt

filelens outputs Parquet files that can be loaded into warehouses and modeled with dbt.

Use it in this order:

  1. Convert files to parquet.
  2. Load parquet into Postgres raw.filelens_lines.
  3. Run dbt models.
  4. Query typed marts.

Set up environment variables:

export PGHOST=localhost
export PGPORT=5432
export PGUSER=...
export PGPASSWORD=...
export PGDATABASE=postgres
export DBT_PROFILES_DIR=dbt

One-command local pipeline (public examples only):

scripts/auto_load_and_run_dbt.sh --parquet-glob "$PWD/output/public/**/*.parquet" --full-refresh

What this command does:

  • loads parquet into raw.filelens_lines
  • syncs raw into raw_procurement and raw_clinical
  • runs staging models
  • runs marts (including typed marts)
  • runs tests
  • prints row counts and next query hints

Which tables to query:

  • analytics_marts.fct_procurement_lines for procurement analytics
  • analytics_marts.fct_fhir_resources for FHIR analytics
  • analytics_marts.fct_naaccr_cases for NAACCR analytics
  • analytics_marts.fct_record_attributes for generic key/value search across all extracted attributes

analytics_registry.idx_filelens_records is a cross-format registry/index table (lineage + canonical fields). It is not the primary end-user analytics table.

Why keep raw -> internal -> marts:

  • raw: ingestion/debug layer (what got loaded)
  • analytics_internal: normalization layer (map parser-specific columns into stable canonical fields)
  • marts: consumption layer (deduped and typed tables for analysts/apps)
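The analytics_internal layer's job, mapping parser-specific columns onto stable canonical fields, can be sketched as a lookup-driven rename. The column names below are assumptions for illustration, not filelens's actual schema.

```python
# Illustrative normalization step: map parser-specific column names
# onto stable canonical fields. Column names here are assumptions.
CANONICAL_MAP = {
    "hl7":    {"pid_3": "record_key", "obx_5": "attribute_value"},
    "naaccr": {"patientIdNumber": "record_key", "itemValue": "attribute_value"},
}

def normalize(source_kind: str, row: dict) -> dict:
    """Rename known parser-specific keys; pass unknown keys through."""
    mapping = CANONICAL_MAP.get(source_kind, {})
    return {mapping.get(k, k): v for k, v in row.items()}

print(normalize("naaccr", {"patientIdNumber": "P-001", "itemValue": "3"}))
# -> {'record_key': 'P-001', 'attribute_value': '3'}
```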

Example consumer queries:

select * from analytics_marts.fct_procurement_lines limit 20;
select * from analytics_marts.fct_fhir_resources limit 20;
select * from analytics_marts.fct_naaccr_cases limit 20;
select * from analytics_marts.fct_record_attributes limit 20;

Trace NAACCR attributes back to original source ids:

select
  source_file,
  record_key,
  attribute_scope,
  attribute_source_id,
  attribute_name,
  attribute_value
from analytics_marts.fct_record_attributes
where source_kind = 'naaccr'
  and attribute_source_id in ('grade', 'patientidnumber', 'tumorrecordnumber')
limit 20;

Examples

See examples/ for real sample inputs:

  • procurement (cXML / xCML)
  • healthcare (FHIR, HL7, CDA, NAACCR)
  • RDF/Turtle (.ttl, .ttl.html)
  • messy CSV/TSV/PSV/TXT

Build

cargo build --release

Binary path:

./target/release/filelens

Release (GitHub Actions)

Tag-based release:

git tag v0.1.0
git push origin v0.1.0

What happens on tag push (v*):

  • builds platform wheels (Linux, macOS Intel/ARM, Windows)
  • builds source distribution on Linux
  • creates a GitHub Release and uploads dist/* artifacts

Optional PyPI publish:

  • configure PyPI Trusted Publisher for this repo (recommended)
  • on tag push, distributions are published to PyPI via GitHub OIDC
  • or run the Release workflow manually with publish_pypi=true

PyPI Trusted Publisher settings:

  1. PyPI project -> Manage -> Publishing -> Add a new publisher -> GitHub.
  2. Set:
    • Owner: <your-github-owner>
    • Repository name: filelens
    • Workflow name: release.yml
    • Environment name: pypi
  3. Save. No API token secret is required.

Workflow

What this workflow does:

  • builds the filelens binary
  • converts only examples/public files into parquet under output/public
  • loads only output/public/**/*.parquet into Postgres raw tables
  • runs dbt staging + marts with --full-refresh (and tests)
  • does not include non-public example paths unless you change the command

Why --full-refresh in this demo workflow:

  • it rebuilds marts from scratch so the demo is deterministic after parser/model changes
  • it avoids stale incremental state while iterating locally
  • for recurring production loads, omit --full-refresh and use incremental dbt runs
Run the same pipeline manually:

cargo build --release
scripts/convert_inputs.sh --input-dir examples/public --output-dir output/public

export PGHOST=localhost
export PGPORT=5432
export PGUSER=...
export PGPASSWORD=...
export PGDATABASE=postgres
export DBT_PROFILES_DIR=dbt

scripts/auto_load_and_run_dbt.sh --parquet-glob "$PWD/output/public/**/*.parquet" --full-refresh

scripts/convert_inputs.sh is non-strict by default (skips failures and continues). Add --strict to fail on first conversion error.
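The non-strict vs --strict behavior described above boils down to a skip-on-failure loop. This sketch uses a stand-in convert() function; the real script shells out to the filelens binary.

```python
def convert_all(paths, convert, strict=False):
    """Convert each path; collect failures unless strict mode raises."""
    failures = []
    for path in paths:
        try:
            convert(path)
        except Exception as exc:
            if strict:
                raise                      # --strict: fail on first error
            failures.append((path, exc))   # default: skip and continue
    return failures

def convert(path):
    # Stand-in converter: pretend files ending in .bad are unparseable.
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")

failed = convert_all(["a.csv", "b.bad", "c.csv"], convert)
print([p for p, _ in failed])  # -> ['b.bad']
```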

Advanced formats

  • RDF/Turtle (.ttl, .rdf) — experimental support
  • HTML pages containing RDF/Turtle <pre> blocks (for example *.ttl.html)

Why not pandas?

pandas can read files, but it does not:

  • detect likely header/metadata layout
  • explain quality issues up front
  • normalize mixed file families with one deterministic CLI pass

filelens is focused on that first cleanup step before your pipeline.
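To make the contrast concrete: with only the stdlib csv module, skipping a metadata preamble means already knowing how long it is, which is exactly what filelens inspect reports. A minimal sketch of the manual route:

```python
import csv
import io

raw = (
    "Metadata: Device=LabX\n"
    "Date: 2024-01-01\n"
    "\n"
    "Sample ID,Value,Unit\n"
    "S1,0.45,mg/mL\n"
)

# Manual route: you must already know the preamble is 3 lines long.
lines = raw.splitlines()[3:]
reader = csv.DictReader(io.StringIO("\n".join(lines)))
print(reader.fieldnames)        # -> ['Sample ID', 'Value', 'Unit']
row = next(reader)
print(row["Value"])             # -> 0.45
```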
