A Python script to index a large folder into a parquet file, along with metadata

Project description

file-indexer-py

Description

This script is useful for finding files stored on a reasonably slow disk, such as a backup drive, especially when you aren't sure which files you are searching for.

Use tools like DBeaver and DuckDB to query and explore the generated index.
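As a rough illustration of the kind of question the index can answer, the sketch below builds a DuckDB SQL query that lists content hashes shared by more than one file. The helper name duplicate_files_query is hypothetical and not part of this package; the query relies only on DuckDB's read_parquet table function and on the md5_hash_hex and file_size_bytes columns of the index.

```python
def duplicate_files_query(index_path):
    """Build DuckDB SQL listing content hashes that occur more than once.

    Hypothetical helper for illustration; it is not part of file_indexer.
    """
    # Double any single quotes so the path is safe inside a SQL string literal.
    quoted = str(index_path).replace("'", "''")
    return (
        "SELECT md5_hash_hex, count(*) AS copies, "
        "min(file_size_bytes) AS file_size_bytes "
        f"FROM read_parquet('{quoted}') "
        "GROUP BY md5_hash_hex "
        "HAVING count(*) > 1 "
        "ORDER BY copies DESC"
    )


if __name__ == "__main__":
    # Placeholder path; point it at a parquet file written by the indexer.
    print(duplicate_files_query("/path/to/output/folder/index.parquet"))
```

The resulting string can be pasted into the DuckDB CLI or a DBeaver SQL editor. If the duckdb Python package happens to be installed, it could also be passed to duckdb.sql().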

Usage

pip install file_indexer

python3 -m file_indexer -i /path/to/input/folder -o /path/to/output/folder
# --or--
file_indexer -i /path/to/input/folder -o /path/to/output/folder

Metadata Indexed and Output

The output parquet files have the following columns:

* file_path
* folder_path
* file_name
* file_size_bytes
* md5_hash_hex
* sha256_base64
* date_created
* date_modified
* date_accessed
* magic_file_type_1
* first_100_bytes
* last_100_bytes
* timestamp_crawled
* indexing_start_timestamp
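
To make the column names more concrete, here is a minimal sketch of how most of these values could be derived for a single file using only the Python standard library. It is not the package's actual implementation: the helper describe_file is hypothetical, mapping date_created to st_ctime is an assumption (on Unix that field is the metadata change time, not the creation time), and magic_file_type_1 and indexing_start_timestamp are left out because the source does not say how they are produced.

```python
import base64
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path, edge_bytes=100, chunk_size=1 << 20):
    """Return one index-like row as a dict (hypothetical helper, not the package API)."""
    p = Path(path)
    st = p.stat()
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()

    with p.open("rb") as fh:
        # The first bytes are both stored and fed into the hashes.
        first = fh.read(edge_bytes)
        md5.update(first)
        sha256.update(first)
        # Hash the rest in chunks so large files are never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
        # Jump back to read the trailing bytes (the whole file if it is short).
        fh.seek(max(st.st_size - edge_bytes, 0))
        last = fh.read(edge_bytes)

    def as_utc(ts):
        return datetime.fromtimestamp(ts, tz=timezone.utc)

    return {
        "file_path": str(p),
        "folder_path": str(p.parent),
        "file_name": p.name,
        "file_size_bytes": st.st_size,
        "md5_hash_hex": md5.hexdigest(),
        "sha256_base64": base64.b64encode(sha256.digest()).decode("ascii"),
        "date_created": as_utc(st.st_ctime),  # assumption: see note above
        "date_modified": as_utc(st.st_mtime),
        "date_accessed": as_utc(st.st_atime),
        "first_100_bytes": first,
        "last_100_bytes": last,
        "timestamp_crawled": datetime.now(timezone.utc),
    }
```

Reading the file once for both hashes matters on a slow disk, where a second pass would roughly double the crawl time.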

The parquet files are stored to the output folder with the following naming convention: partial_file_index_{datetime}.parquet

At the end of the execution, the individual parquet files are unioned into a single parquet file, with the following name: 00_complete_file_index.parquet
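
When a crawl of a large folder is interrupted, it can help to check what has been written so far. The sketch below relies only on the two file names given above; the helper summarize_output_folder is hypothetical and not part of the package.

```python
from pathlib import Path

COMPLETE_NAME = "00_complete_file_index.parquet"
PARTIAL_PATTERN = "partial_file_index_*.parquet"


def summarize_output_folder(output_folder):
    """Report which index files exist in the output folder (hypothetical helper)."""
    out = Path(output_folder)
    # Sort by name so the listing is stable between runs.
    partials = sorted(p.name for p in out.glob(PARTIAL_PATTERN))
    return {
        "partials": partials,
        "complete": (out / COMPLETE_NAME).is_file(),
    }
```

If "complete" is False but "partials" is non-empty, the partial files can still be queried directly, for example with a glob pattern in DuckDB's read_parquet.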

Download files

Source Distribution

folder_indexer-0.1.0.tar.gz (8.3 kB)

Uploaded Source

Built Distribution

folder_indexer-0.1.0-py3-none-any.whl (8.1 kB)

Uploaded Python 3

File details

Details for the file folder_indexer-0.1.0.tar.gz.

File metadata

  • Download URL: folder_indexer-0.1.0.tar.gz
  • Size: 8.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for folder_indexer-0.1.0.tar.gz
Algorithm Hash digest
SHA256 af5a83267bc16c17023f8757735a22abe63d8baf86aad8c464a222745266595c
MD5 965276d9d5db32bad7bae86763244fc7
BLAKE2b-256 b9ea5724846abae363e2e3b840630de5286d6e85996b038f48b219e5498cfb94

File details

Details for the file folder_indexer-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: folder_indexer-0.1.0-py3-none-any.whl
  • Size: 8.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for folder_indexer-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6e59b78cfcd9315269cf29b6b0b3335086a112a74e88520731d92c8c7cd72964
MD5 cf44eab65d7ba97b3a985deb3745fe2c
BLAKE2b-256 350da4287228de6d0fa7a9c62db94a1a3e256227bfe20e273a2d90ff46b0b492
