
File system crawler for OpenSearch / Elasticsearch — Python rewrite of FSCrawler


FSCrawler — Python Edition


Disclaimer: This is a prototype intended for local development and experimentation only. It is not production-ready and should not be used in production environments.

A Python 3.12 rewrite of FSCrawler, a file system crawler that indexes binary documents (PDF, MS Office, plain text, and more) into OpenSearch or Elasticsearch.

Migrating from the Java version? fs.filename_as_id defaults to true here but false in Java. If you are pointing this at an existing index, set fs.filename_as_id: false in your _settings.yaml explicitly — otherwise documents will be re-indexed under new IDs and you will end up with duplicates.
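To see why the default matters, here is a minimal sketch of how the same file can end up under two different IDs. The `document_id` helper is hypothetical, and the path-hash scheme used when `filename_as_id` is false (MD5 of the full path) is an assumption for illustration, not a statement of how this project derives IDs.

```python
import hashlib
from pathlib import PurePosixPath


def document_id(path: str, filename_as_id: bool) -> str:
    """Return an illustrative document ID for a crawled file (hypothetical helper)."""
    if filename_as_id:
        # The ID is the bare file name, e.g. "report.pdf".
        return PurePosixPath(path).name
    # Assumption: otherwise the ID is a hash of the full path (MD5 here).
    return hashlib.md5(path.encode("utf-8"), usedforsecurity=False).hexdigest()


path = "/data/docs/report.pdf"
existing_id = document_id(path, filename_as_id=False)  # ID already in the index
new_id = document_id(path, filename_as_id=True)        # ID under the new default

# The IDs differ, so a re-crawl would add a second document for the same
# file instead of updating the existing one.
print(existing_id != new_id)
```

Setting fs.filename_as_id: false keeps the crawler on the scheme the existing index was built with, so re-crawled files update their documents in place.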

Features

  • Backwards-compatible _settings.yaml format (drop-in for the Java version, except for the fs.filename_as_id default noted above)
  • Event-driven crawling — watches the filesystem for changes in real time using OS-native events; no polling required
  • Apache Tika integration — connects to a running Tika server over HTTP (no bundled JVM)
  • Bulk indexing — buffers documents and flushes on document count or byte-size thresholds
  • Template management — creates OpenSearch component and index templates automatically
  • Multi-arch Docker image — Dockerfile supports linux/amd64 and linux/arm64 (make build)
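The bulk indexing behaviour listed above (buffer documents, flush on a document count or byte-size threshold) can be sketched roughly as follows. This is not the project's indexer.py: the `BulkBuffer` class, its `flush` callback, and both threshold values are assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BulkBuffer:
    """Buffer documents and flush on a count or byte-size threshold (sketch)."""

    flush: Callable[[list[dict]], None]  # hypothetical sink, e.g. a bulk API call
    max_docs: int = 100                  # assumed value, not the project's default
    max_bytes: int = 10 * 1024 * 1024    # assumed value (10 MiB)
    _docs: list[dict] = field(default_factory=list, init=False)
    _size: int = field(default=0, init=False)

    def add(self, doc: dict) -> None:
        """Buffer one document and flush if either threshold is reached."""
        self._docs.append(doc)
        self._size += len(json.dumps(doc).encode("utf-8"))
        if len(self._docs) >= self.max_docs or self._size >= self.max_bytes:
            self.close()

    def close(self) -> None:
        """Flush whatever is buffered, then reset the buffer."""
        if self._docs:
            self.flush(self._docs)
        self._docs, self._size = [], 0


batches: list[list[dict]] = []
buffer = BulkBuffer(flush=batches.append, max_docs=2)
for i in range(5):
    buffer.add({"id": i})
buffer.close()  # flush the final partial batch
print([len(batch) for batch in batches])
```

Measuring size on the serialized JSON keeps the byte threshold close to what would actually go over the wire, at the cost of serializing each document once just to measure it.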

Docker image

Pre-built multi-arch images (linux/amd64, linux/arm64) are published to GitHub Container Registry on every release:

ghcr.io/p6rguvyrst/opensearch-fscrawler:latest
ghcr.io/p6rguvyrst/opensearch-fscrawler:1.2.3   # pin to a specific version
ghcr.io/p6rguvyrst/opensearch-fscrawler:1.2     # major.minor

docker pull ghcr.io/p6rguvyrst/opensearch-fscrawler:latest

In a Kubernetes manifest or Compose file:

image: ghcr.io/p6rguvyrst/opensearch-fscrawler:latest

Quick start

With Docker Compose

# Start OpenSearch, Tika, Dashboards, and FSCrawler
docker compose up -d

# Watch the logs
docker compose logs -f fscrawler-markdown fscrawler-pdf fscrawler-catchall

Locally (development)

# One command: install deps, wire git hooks
make develop

# Create a job config
fscrawler --setup myfiles
# Edit ~/.fscrawler/myfiles/_settings.yaml

# Run once
fscrawler myfiles

# Run continuously (watches for filesystem changes)
fscrawler --loop myfiles

Requirements

  • Python 3.12+
  • A running Apache Tika server (docker run -p 9998:9998 apache/tika:latest-full)
  • A running OpenSearch or Elasticsearch cluster
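
For orientation, talking to a Tika server over HTTP can look like the sketch below. It uses only the standard library and Tika Server's PUT /tika endpoint with an Accept: text/plain header. The function names, the default URL, and the timeout are assumptions; this is not the project's parser.py.

```python
import urllib.request


def tika_request(data: bytes, base_url: str = "http://localhost:9998") -> urllib.request.Request:
    """Build a PUT /tika request asking for plain-text extraction."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(data: bytes, base_url: str = "http://localhost:9998", timeout: float = 30.0) -> str:
    """Send raw file bytes to Tika and return the extracted text.

    Needs a running Tika server at base_url.
    """
    with urllib.request.urlopen(tika_request(data, base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the Tika container from the list above running, extract_text(open("report.pdf", "rb").read()) should return the document's text. The file name is only an example.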

Development only:

  • uv — package manager (brew install uv)
  • Trivy — vulnerability scanner, required by the pre-push git hook (brew install trivy)

Configuration

See docs/configuration.md for the full settings reference.

Development

make develop      # first-time setup: install deps + activate git hooks
make test         # run unit tests
make lint         # ruff check
make typecheck    # mypy
make test-all     # unit + integration (needs OPENSEARCH_URL)

Integration tests

# Start services
docker compose up -d opensearch tika

# Run integration tests
OPENSEARCH_URL=http://localhost:9200 TIKA_URL=http://localhost:9998 make test-integration

Architecture

src/fscrawler/
├── cli.py        CLI entry point (Click)
├── settings.py   YAML config loader with duration/byte parsing
├── models.py     Document, FileInfo, PathInfo, Meta dataclasses
├── templates.py  OpenSearch component and index template definitions
├── client.py     opensearch-py wrapper
├── crawler.py    Local filesystem walker with checkpoint tracking
├── watcher.py    Watchdog-based filesystem event handler
├── parser.py     Apache Tika HTTP client
└── indexer.py    Bulk buffering/flushing processor
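
As an illustration of the duration/byte parsing that settings.py is described as doing, here is a small sketch. The accepted unit suffixes, the function names, and the choice of 1024-based byte units are all assumptions; the real loader may accept a different set.

```python
import re

# Assumed unit tables; the project's parser may accept a different set.
_DURATION_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}
_BYTE_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}

# A number followed by a unit suffix, e.g. "15m" or "10 mb".
_VALUE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*$", re.IGNORECASE)


def _parse(text: str, units: dict) -> float:
    """Split text into number and unit, then scale by the unit's factor."""
    match = _VALUE.match(text)
    if not match or match.group(2).lower() not in units:
        raise ValueError(f"cannot parse {text!r}")
    return float(match.group(1)) * units[match.group(2).lower()]


def parse_duration(text: str) -> float:
    """Return seconds for strings such as '15m' or '2h'."""
    return _parse(text, _DURATION_UNITS)


def parse_bytes(text: str) -> int:
    """Return a byte count for strings such as '10mb'."""
    return int(_parse(text, _BYTE_UNITS))
```

Under these assumptions, parse_duration("15m") returns 900.0 and parse_bytes("10mb") returns 10485760. Unknown suffixes raise ValueError rather than being silently treated as seconds or bytes.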

Security

This prototype has known security issues — including no REST authentication, unbounded upload size, and unvalidated index names — that make it unsuitable for production or internet-facing deployments. See SECURITY.md for the full list.

Credits

This project (opensearch-fscrawler) is a Python rewrite of FSCrawler, originally created by David Pilato in 2012. The configuration format, REST API design, crawl workflow, and checkpoint mechanism are all derived from his work.

If you need the full-featured Java version, with Elasticsearch/OpenSearch 7–9 support, SSH/FTP crawling, bundled Apache Tika, and a plugin system, use the original: https://github.com/dadoonet/fscrawler

License

Apache License 2.0 — same as the original FSCrawler project. See LICENSE and NOTICE for full attribution details.
