A Python script to index a large folder into a parquet file, along with metadata

folder-indexer-py

Description

This script is useful for searching for files from backups stored on a reasonably slow disk, especially when you aren't sure exactly which files you are searching for.

Use tools like DBeaver and DuckDB to query and explore the generated index.

Usage

pip install folder_indexer

python3 -m folder_indexer -i /path/to/input/folder -o /path/to/output/folder
# --or--
folder_indexer -i /path/to/input/folder -o /path/to/output/folder

Metadata Indexed and Output

The output parquet files have the following columns:

* file_path
* folder_path
* file_name
* file_size_bytes
* md5_hash_hex
* sha256_base64
* date_created
* date_modified
* date_accessed
* magic_file_type_1
* first_100_bytes
* last_100_bytes
* timestamp_crawled
* indexing_start_timestamp
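
As a rough illustration of what most of these columns hold, the sketch below gathers comparable values for a single file using only the Python standard library. It is not the package's actual implementation: the function name `describe_file`, the chunk size, and the choice of `st_ctime` for `date_created` are assumptions. `magic_file_type_1` and `indexing_start_timestamp` are left out, since the first presumably needs a libmagic-style library and the second belongs to the whole run rather than to one file.

```python
import base64
import hashlib
import os
from datetime import datetime, timezone


def describe_file(path, chunk_size=1024 * 1024):
    """Return a dict of index-style metadata for one file (illustrative only)."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)

    stat = os.stat(path)
    with open(path, "rb") as handle:
        first_bytes = handle.read(100)
        # Seek to 100 bytes before the end, or to the start for short files.
        handle.seek(max(stat.st_size - 100, 0))
        last_bytes = handle.read(100)

    def to_utc(epoch_seconds):
        return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

    absolute = os.path.abspath(path)
    return {
        "file_path": absolute,
        "folder_path": os.path.dirname(absolute),
        "file_name": os.path.basename(absolute),
        "file_size_bytes": stat.st_size,
        "md5_hash_hex": md5.hexdigest(),
        "sha256_base64": base64.b64encode(sha256.digest()).decode("ascii"),
        # Assumption: st_ctime is creation time on Windows, but
        # metadata-change time on most Unix systems.
        "date_created": to_utc(stat.st_ctime),
        "date_modified": to_utc(stat.st_mtime),
        "date_accessed": to_utc(stat.st_atime),
        "first_100_bytes": first_bytes,
        "last_100_bytes": last_bytes,
        "timestamp_crawled": datetime.now(timezone.utc),
    }
```

Hashing in chunks keeps memory use flat, which matters on the large backup files this tool targets. Reading the file twice (once to hash, once for the head and tail bytes) keeps the sketch simple at the cost of extra I/O.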

The generated parquet files are written to the output folder using the following naming convention: partial_file_index_{datetime}.parquet

At the end of the execution, the individual parquet files are unioned into a single parquet file, with the following name: 00_complete_file_index.parquet
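
Once the combined file exists, it can be queried directly. The sketch below builds a DuckDB SQL query that lists content hashes appearing more than once, which is one way to spot duplicate files. The helper names `build_duplicates_query` and `find_duplicates` are hypothetical, and running the query assumes the third-party `duckdb` package is installed separately (`pip install duckdb`). Only the column names and the output file name come from the description above.

```python
def build_duplicates_query(index_path):
    """Build a DuckDB SQL query listing content hashes that occur more than once."""
    # Escape single quotes so the path stays a valid SQL string literal.
    safe_path = index_path.replace("'", "''")
    return (
        "SELECT sha256_base64, COUNT(*) AS copies, "
        "SUM(file_size_bytes) AS total_bytes "
        f"FROM read_parquet('{safe_path}') "
        "GROUP BY sha256_base64 "
        "HAVING COUNT(*) > 1 "
        "ORDER BY total_bytes DESC"
    )


def find_duplicates(index_path):
    """Run the query against the index; needs the third-party duckdb package."""
    import duckdb  # imported lazily so the query builder works without it

    # An in-memory connection is enough: the data is read from the parquet file.
    return duckdb.connect().execute(build_duplicates_query(index_path)).fetchall()
```

For example, `find_duplicates("/path/to/output/folder/00_complete_file_index.parquet")` would return one row per duplicated hash, largest total size first. The same SQL can be pasted into DBeaver or the DuckDB CLI.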

Download files

Source Distribution

folder_indexer-0.1.1.tar.gz (8.3 kB)

Built Distribution

folder_indexer-0.1.1-py3-none-any.whl (8.1 kB, Python 3)

File details

Details for the file folder_indexer-0.1.1.tar.gz.

File metadata

  • Download URL: folder_indexer-0.1.1.tar.gz
  • Size: 8.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for folder_indexer-0.1.1.tar.gz
Algorithm Hash digest
SHA256 003d64f31bec655abe0be1176d791be85ee972aa7fb10abaca28c8a7e50bc9b0
MD5 c7d52987f7bb4605f7a2060de1c85ec2
BLAKE2b-256 5aa54b27d3c7cac919884a79d4f034b8d0cc7149c2eca50a8a5451636db53bbd

File details

Details for the file folder_indexer-0.1.1-py3-none-any.whl.

File hashes

Hashes for folder_indexer-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 e8b67dacd2ca55122b69e1e9341868d819ea450fc6b6e3347f0c5ab1fde6d2a2
MD5 449ad65a357263a1720cdba88e252f88
BLAKE2b-256 f9c1d3b811be8cda40fb376348cbf2567823d55c4d3bebe5b88991fe9427efa1
