
CLI to inspect and normalize messy data files into clean tables


filelens

Parse cXML, XML, JSON, and messy data files into clean tables — in one command.

Most tools give you an XML tree. filelens gives you rows you can actually use.

filelens is a CLI that helps you understand and clean messy data files. Most of the time, the hardest part is just understanding the file.

Ever opened a file where:

  • headers start on row 6
  • metadata is mixed with data
  • columns are inconsistent

filelens lets you:

  • inspect structure and issues
  • infer a schema
  • convert to a clean table (Parquet)

No config. No guessing. Deterministic output. Built for real-world data engineering workflows.

Quick start

Install with pip:

pip install filelens

Install with Homebrew:

brew tap kraftaa/filelens
brew install filelens

Use it:

filelens inspect file.csv
filelens convert file-or-folder --out-dir output/

Example

Before

Metadata + mixed rows + unclear structure:

Metadata: Device=LabX
Date: 2024-01-01

Sample ID,Value,Unit
S1,0.45,mg/mL
S2,0.50,mg/mL

Inspect:

filelens inspect sample.csv


Output:

Detected:
- header row: 4
- metadata rows: 1-3
- columns: 3

Warnings:
- none

One command

filelens convert sample.csv --out sample.parquet

What it does:

  • detects structure
  • skips metadata
  • infers schema
  • writes sample.parquet

After

Clean table:

sample_id | value | unit
S1        | 0.45  | mg/mL
S2        | 0.50  | mg/mL

Read in pandas:

import pandas as pd
df = pd.read_parquet("sample.parquet")
df.head()

That's it

For most use cases, you only need:

filelens inspect file.csv
filelens inspect file.cxml
filelens convert file-or-folder --out-dir output/

Everything below is optional (advanced formats, dbt integration, pipelines).

Default workflow

Golden path:

filelens convert <file-or-folder> --out-dir output/

What it does:

  • auto-detects parser from extension/content
  • uses the cXML default mode mapped (canonical columns)
  • writes parquet output files
  • appends _source_file, _source_kind, _record_id columns
  • writes sidecar report: output/_filelens_report.json (schema + warnings + status per file)
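The sidecar report can also be consumed programmatically, e.g. to flag files that converted with warnings. A minimal sketch, assuming a per-file list with file, status, and warnings fields (the exact JSON layout of _filelens_report.json is not documented here, so treat these field names as illustrative):

```python
import json

def files_with_warnings(report: dict) -> list:
    """Return source files whose conversion produced warnings."""
    return [
        entry["file"]
        for entry in report.get("files", [])
        if entry.get("warnings")
    ]

# Hypothetical report shaped like output/_filelens_report.json;
# in practice: report = json.load(open("output/_filelens_report.json"))
report = {
    "files": [
        {"file": "sample.csv", "status": "ok", "warnings": []},
        {"file": "po.cxml", "status": "ok", "warnings": ["unmapped field: x_comment"]},
    ]
}
flagged = files_with_warnings(report)
```

Adjust the key names to whatever the real report emits.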

If your cXML needs extra nested fields, use:

filelens convert po.cxml --parser cxml --cxml-mode both --out-dir output/

Folder-specific command:

filelens batch examples/public/trade --out-dir output/trade

batch prints:

  • files processed / succeeded / failed
  • total rows written
  • top warnings
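A summary like the one batch prints can be reproduced from per-file results. A sketch with a hypothetical result list (the field names are illustrative, not filelens's API):

```python
from collections import Counter

def summarize(results):
    """Aggregate per-file conversion results into a batch-style summary."""
    succeeded = [r for r in results if r["status"] == "ok"]
    failed = [r for r in results if r["status"] != "ok"]
    warnings = Counter(w for r in results for w in r.get("warnings", []))
    return {
        "processed": len(results),
        "succeeded": len(succeeded),
        "failed": len(failed),
        "total_rows": sum(r.get("rows", 0) for r in succeeded),
        "top_warnings": warnings.most_common(3),
    }

results = [
    {"status": "ok", "rows": 120, "warnings": []},
    {"status": "ok", "rows": 80, "warnings": ["ragged row padded"]},
    {"status": "error", "rows": 0, "warnings": ["unreadable header"]},
]
summary = summarize(results)
```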

Supported inputs

Supports common messy data formats used in analytics and healthcare.

  • Excel / CSV (messy tabular files): .xlsx, .xlsm, .xls, .csv, .tsv, .psv, .txt
  • JSON (nested data): .json, .ndjson
  • XML (including cXML / CDA / NAACCR): .xml, .cxml, .xcml
  • HL7 (basic extraction): .hl7, .msg
  • Compressed text variants: .gz wrappers for supported text formats
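The extension side of auto-detection can be sketched as a lookup table. This is an illustrative map built from the list above, not filelens's actual dispatch logic (which also sniffs file content, omitted here):

```python
from pathlib import Path

# Illustrative extension -> parser map derived from the supported-inputs list.
PARSERS = {
    ".xlsx": "excel", ".xlsm": "excel", ".xls": "excel",
    ".csv": "table", ".tsv": "table", ".psv": "table", ".txt": "table",
    ".json": "json", ".ndjson": "json",
    ".xml": "xml", ".cxml": "cxml", ".xcml": "cxml",
    ".hl7": "hl7", ".msg": "hl7",
    ".ttl": "rdf",
}

def guess_parser(path: str) -> str:
    """Guess a parser from the file extension, unwrapping a .gz suffix."""
    suffixes = Path(path).suffixes
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]          # .csv.gz -> treat as .csv
    ext = suffixes[-1].lower() if suffixes else ""
    return PARSERS.get(ext, "unknown")
```

For ambiguous cases (e.g. a .xml file that is really CDA), the --parser override shown below is the reliable route.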

Design

  • deterministic
  • no config required
  • optimized for messy real-world files

When to use filelens

  • You opened a file and do not understand its structure
  • Your Excel export has metadata rows and broken headers
  • You need to convert XML/JSON into a table quickly
  • You want clean input for dbt or a data warehouse

More examples

Inspect:

filelens inspect data/file.xlsx
filelens inspect data/order.cxml
filelens inspect data/patient-example.json
filelens inspect data/oru_r01.msg
filelens inspect data/clinical.xml
filelens inspect data/patient-example.ttl
filelens inspect data/patient-example.ttl.html

Schema:

filelens schema data/file.xlsx
filelens schema data/patient-example.json --parser fhir

Convert:

filelens convert data/file.xlsx --out data/file.parquet
filelens convert data/order.cxml --out data/order.parquet
filelens convert data/nested_lab_result.json --out data/nested_lab_result.parquet
filelens convert data/oru_r01.msg --out data/oru_r01.parquet
filelens convert data/patient-example.ttl --out data/patient-example.ttl.parquet

Optional parser override:

filelens inspect data/file.xml --parser cda
filelens inspect data/file.json --parser json
filelens inspect data/file.json --parser fhir
filelens inspect data/file.msg --parser hl7
filelens inspect data/file.ttl --parser rdf

CXML extraction mode:

cXML mode controls which columns are emitted:

  • mapped (default): canonical columns only (order_id, line_number, quantity, ...)
  • auto: extracted path-based columns only (x_*)
  • both: union of mapped + auto

If you do not pass --cxml-mode, filelens uses mapped.

# curated canonical fields only
filelens schema data/order.cxml --parser cxml --cxml-mode mapped

# path-based auto-captured fields only (x_* columns)
filelens schema data/order.cxml --parser cxml --cxml-mode auto

# both canonical + path-based fields
filelens convert data/order.cxml --parser cxml --cxml-mode both --out data/order.parquet
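After a both conversion you can separate canonical from auto-captured fields using the x_ prefix convention described above. A sketch with an illustrative column list (in practice, read the real columns from your parquet file, e.g. pd.read_parquet("data/order.parquet").columns):

```python
def split_columns(columns):
    """Partition columns into canonical (mapped) vs auto-captured (x_*) fields."""
    mapped = [c for c in columns if not c.startswith("x_")]
    auto = [c for c in columns if c.startswith("x_")]
    return mapped, auto

# Illustrative column names; x_* names here are hypothetical.
cols = ["order_id", "line_number", "quantity", "x_comments_text", "x_shipto_name"]
mapped, auto = split_columns(cols)
```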

If running from source, use ./target/release/filelens instead of filelens.

Optional: use with dbt

filelens outputs Parquet files that can be loaded into warehouses and modeled with dbt.

Use it in this order:

  1. Convert files to parquet.
  2. Load parquet into Postgres raw.filelens_lines.
  3. Run dbt models.
  4. Query typed marts.

Setup env vars:

export PGHOST=localhost
export PGPORT=5432
export PGUSER=...
export PGPASSWORD=...
export PGDATABASE=postgres
export DBT_PROFILES_DIR=dbt

One-command local pipeline (public examples only):

scripts/auto_load_and_run_dbt.sh --parquet-glob "$PWD/output/public/**/*.parquet" --full-refresh

What this command does:

  • loads parquet into raw.filelens_lines
  • syncs raw into raw_procurement and raw_clinical
  • runs staging models
  • runs marts (including typed marts)
  • runs tests
  • prints row counts and next query hints

Which tables to query:

  • analytics_marts.fct_procurement_lines for procurement analytics
  • analytics_marts.fct_fhir_resources for FHIR analytics
  • analytics_marts.fct_naaccr_cases for NAACCR analytics
  • analytics_marts.fct_record_attributes for generic key/value search across all extracted attributes

analytics_registry.idx_filelens_records is a cross-format registry/index table (lineage + canonical fields). It is not the primary end-user analytics table.

Why keep raw -> internal -> marts:

  • raw: ingestion/debug layer (what got loaded)
  • analytics_internal: normalization layer (map parser-specific columns into stable canonical fields)
  • marts: consumption layer (deduped and typed tables for analysts/apps)

Example consumer queries:

select * from analytics_marts.fct_procurement_lines limit 20;
select * from analytics_marts.fct_fhir_resources limit 20;
select * from analytics_marts.fct_naaccr_cases limit 20;
select * from analytics_marts.fct_record_attributes limit 20;

Trace NAACCR attributes back to original source ids:

select
  source_file,
  record_key,
  attribute_scope,
  attribute_source_id,
  attribute_name,
  attribute_value
from analytics_marts.fct_record_attributes
where source_kind = 'naaccr'
  and attribute_source_id in ('grade', 'patientidnumber', 'tumorrecordnumber')
limit 20;

Examples

See examples/ for real sample inputs:

  • procurement (cXML / xCML)
  • healthcare (FHIR, HL7, CDA, NAACCR)
  • RDF/Turtle (.ttl, .ttl.html)
  • messy CSV/TSV/PSV/TXT
  • hard cXML edge-case fixtures for parser testing: examples/hard/cxml

Workflow

What this workflow does:

  • converts only examples/public/** into Parquet under output/public/**
  • loads only output/public/**/*.parquet into Postgres raw.filelens_lines
  • runs dbt staging + marts with --full-refresh (and tests)
  • does not include non-public example paths unless you change the command

Why --full-refresh in this demo workflow:

  • it rebuilds marts from scratch so the demo is deterministic after parser/model changes
  • it avoids stale incremental state while iterating locally
  • for recurring production loads, omit --full-refresh and use incremental dbt runs

Run it:

scripts/convert_inputs.sh --input-dir examples/public --output-dir output/public

export PGHOST=localhost
export PGPORT=5432
export PGUSER=...
export PGPASSWORD=...
export PGDATABASE=postgres
export DBT_PROFILES_DIR=dbt

scripts/auto_load_and_run_dbt.sh --parquet-glob "$PWD/output/public/**/*.parquet" --full-refresh

scripts/convert_inputs.sh uses ./target/release/filelens by default. Use --bin to point to another binary path.

scripts/convert_inputs.sh is non-strict by default (skips failures and continues). Add --strict to fail on first conversion error.
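The non-strict behavior amounts to a catch-and-continue loop. A Python sketch of the same idea (convert here is a stand-in for invoking the filelens binary, not a real API):

```python
def convert_all(paths, convert, strict=False):
    """Convert each input; in non-strict mode, record failures and keep going."""
    failures = []
    for path in paths:
        try:
            convert(path)
        except Exception as exc:
            if strict:
                raise                      # --strict: fail on first error
            failures.append((path, str(exc)))
    return failures

def fake_convert(path):
    # Stand-in for `filelens convert <path>`; fails on one input for the demo.
    if path.endswith("bad.xml"):
        raise ValueError("unparseable root element")

failures = convert_all(["a.csv", "bad.xml", "c.json"], fake_convert)
```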

Advanced formats

  • RDF/Turtle (.ttl, .rdf) — experimental support
  • HTML pages containing RDF/Turtle <pre> blocks (for example *.ttl.html)

Why not pandas?

pandas can read files, but it does not:

  • detect likely header/metadata layout
  • explain quality issues up front
  • normalize mixed file families with one deterministic CLI pass

filelens is focused on that first cleanup step before your pipeline.
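For contrast, here is the hand-written skip logic the sample file from the Example section needs without filelens, using only the stdlib csv module. The header offset is hard-coded after eyeballing the file, which is exactly the guessing filelens automates:

```python
import csv
import io

# The messy sample from the Example section above.
RAW = """Metadata: Device=LabX
Date: 2024-01-01

Sample ID,Value,Unit
S1,0.45,mg/mL
S2,0.50,mg/mL
"""

rows = list(csv.reader(io.StringIO(RAW)))
header_row = 3                      # manually determined: headers start on row 4
header = rows[header_row]
records = [dict(zip(header, r)) for r in rows[header_row + 1:] if r]
```

Every new file layout means re-deriving header_row by hand; filelens inspect detects it for you.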
