File system crawler for OpenSearch / Elasticsearch — Python rewrite of FSCrawler
FSCrawler — Python Edition
Disclaimer: This is a prototype intended for local development and experimentation only. It is not production-ready and should not be used in production environments.
A Python 3.12 rewrite of FSCrawler, a file system crawler that indexes binary documents (PDF, MS Office, plain text, and more) into OpenSearch or Elasticsearch.
Migrating from the Java version?
fs.filename_as_id defaults to true here but false in Java. If you are pointing this at an existing index, set fs.filename_as_id: false in your _settings.yaml explicitly; otherwise documents will be re-indexed under new IDs and you will end up with duplicates.
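To illustrate why a changed default produces duplicates, here is a minimal sketch. The `doc_id` helper and its hashing scheme are hypothetical (this description does not state how either implementation derives IDs when `filename_as_id` is off); the only point is that the same file maps to two different `_id` values, so the index ends up with two documents for it.

```python
import hashlib


def doc_id(path: str, filename_as_id: bool) -> str:
    """Hypothetical ID scheme: the bare file name when enabled,
    otherwise a hash of the full path (assumption, for illustration only)."""
    if filename_as_id:
        return path.rsplit("/", 1)[-1]
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


path = "/data/docs/report.pdf"
old_id = doc_id(path, filename_as_id=False)  # ID under the old setting
new_id = doc_id(path, filename_as_id=True)   # ID under the new default

print(new_id)            # report.pdf
print(old_id == new_id)  # False: the same file would be indexed twice
```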
Features
- Backwards-compatible _settings.yaml format: drop-in replacement for the Java version
- Event-driven crawling: watches the filesystem for changes in real time using OS-native events; no polling required
- Apache Tika integration: connects to a running Tika server over HTTP (no bundled JVM)
- Bulk indexing: buffers documents and flushes on document count or byte-size thresholds
- Template management: creates OpenSearch component and index templates automatically
- Multi-arch Docker image: Dockerfile supports linux/amd64 and linux/arm64 (make build)
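The bulk indexing behaviour listed above (flush on document count or byte size) can be sketched roughly as follows. This is not the project's indexer.py; the `BulkBuffer` class, its parameter names, and the default thresholds are assumptions chosen for illustration, and `flush_fn` stands in for whatever actually sends the batch to the cluster.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BulkBuffer:
    """Hypothetical buffer that flushes when either threshold is reached."""

    flush_fn: Callable[[list[dict]], None]  # receives each completed batch
    max_docs: int = 100                     # assumed default, not from the project
    max_bytes: int = 10 * 1024 * 1024       # assumed default, not from the project
    _docs: list[dict] = field(default_factory=list)
    _size: int = 0

    def add(self, doc: dict) -> None:
        # Track the serialized size so the byte threshold reflects payload size.
        self._docs.append(doc)
        self._size += len(json.dumps(doc).encode("utf-8"))
        if len(self._docs) >= self.max_docs or self._size >= self.max_bytes:
            self.flush()

    def flush(self) -> None:
        if not self._docs:
            return
        # Reset state before calling out, so a failing flush_fn cannot
        # cause the same batch to be counted twice.
        batch, self._docs, self._size = self._docs, [], 0
        self.flush_fn(batch)


batches: list[list[dict]] = []
buf = BulkBuffer(flush_fn=batches.append, max_docs=2)
for i in range(5):
    buf.add({"n": i})
buf.flush()  # flush the remainder at shutdown
print([len(b) for b in batches])  # [2, 2, 1]
```

Calling `flush()` once more at shutdown matters: without it, the last partial batch would never be sent.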
Docker image
Pre-built multi-arch images (linux/amd64, linux/arm64) are published to GitHub Container Registry on every release:
ghcr.io/p6rguvyrst/opensearch-fscrawler:latest
ghcr.io/p6rguvyrst/opensearch-fscrawler:1.2.3 # pin to a specific version
ghcr.io/p6rguvyrst/opensearch-fscrawler:1.2 # major.minor
docker pull ghcr.io/p6rguvyrst/opensearch-fscrawler:latest
In a Kubernetes manifest or Compose file:
image: ghcr.io/p6rguvyrst/opensearch-fscrawler:latest
Quick start
With Docker Compose
# Start OpenSearch, Tika, Dashboards, and FSCrawler (rebuilds images automatically)
make up
Locally (development)
# One command: install deps, wire git hooks
make develop
# Create a job config
fscrawler --setup myfiles
# Edit ~/.fscrawler/myfiles/_settings.yaml
# Run once
fscrawler myfiles
# Run continuously (watches for filesystem changes)
fscrawler --loop myfiles
Requirements
- Python 3.12+
- A running Apache Tika server (docker run -p 9998:9998 apache/tika:latest-full)
- A running OpenSearch or Elasticsearch cluster
Development only:
- uv: package manager (brew install uv)
- Trivy: vulnerability scanner, required by the pre-push git hook (brew install trivy)
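To check that the Tika server from the requirements is reachable, a text-extraction call can be sketched with the standard library alone. This is not the project's parser.py: the helper names are hypothetical, and the sketch assumes the server listens on http://localhost:9998 (the port from the docker run command above) and uses Tika Server's PUT /tika endpoint with an Accept: text/plain header.

```python
import urllib.request


def build_tika_request(
    data: bytes, base_url: str = "http://localhost:9998"
) -> urllib.request.Request:
    """Build a PUT /tika request asking for plain-text output (hypothetical helper)."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(
    data: bytes, base_url: str = "http://localhost:9998", timeout: float = 30.0
) -> str:
    """Send the document bytes to Tika and return the extracted text."""
    request = build_tika_request(data, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Building the request needs no network access; extract_text() does.
req = build_tika_request(b"hello")
print(req.get_method(), req.full_url)
```

With the server running, `extract_text(open("sample.pdf", "rb").read())` would return the text Tika extracts from that file; `sample.pdf` is a placeholder name.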
Configuration
See docs/configuration.md for the full settings reference.
Development
make develop # first-time setup: install deps + activate git hooks
make test # run unit tests
make lint # ruff check
make typecheck # mypy
make test-all # unit + integration (needs OPENSEARCH_URL)
Integration tests
# Start services
docker compose up -d opensearch tika
# Run integration tests
OPENSEARCH_URL=http://localhost:9200 TIKA_URL=http://localhost:9998 make test-integration
Architecture
src/fscrawler/
├── cli.py CLI entry point (Click)
├── settings.py YAML config loader with duration/byte parsing
├── models.py Document, FileInfo, PathInfo, Meta dataclasses
├── templates.py OpenSearch component and index template definitions
├── client.py opensearch-py wrapper
├── crawler.py Local filesystem walker with checkpoint tracking
├── watcher.py Watchdog-based filesystem event handler
├── parser.py Apache Tika HTTP client
└── indexer.py Bulk buffering/flushing processor
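The duration/byte parsing mentioned for settings.py can be pictured with a small sketch. The function names, the accepted units, and the 1024-based byte multipliers are assumptions for illustration; the actual loader may accept a different set of suffixes.

```python
import re

# Assumed unit tables; not taken from the project's settings.py.
_DURATION_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}
_BYTE_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def _split(value: str) -> tuple[float, str]:
    """Split a string such as '15m' into (15.0, 'm')."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*", value)
    if match is None:
        raise ValueError(f"cannot parse {value!r}")
    return float(match.group(1)), match.group(2).lower()


def parse_duration(value: str) -> float:
    """Return the duration in seconds."""
    number, unit = _split(value)
    if unit not in _DURATION_UNITS:
        raise ValueError(f"unknown duration unit in {value!r}")
    return number * _DURATION_UNITS[unit]


def parse_bytes(value: str) -> int:
    """Return the size in bytes."""
    number, unit = _split(value)
    if unit not in _BYTE_UNITS:
        raise ValueError(f"unknown size unit in {value!r}")
    return int(number * _BYTE_UNITS[unit])


print(parse_duration("15m"))  # 900.0
print(parse_bytes("10mb"))    # 10485760
```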
Roadmap
See ROADMAP.md for planned improvements and known limitations.
Security
This prototype has known security issues — including no REST authentication, unbounded upload size, and unvalidated index names — that make it unsuitable for production or internet-facing deployments. See SECURITY.md for the full list.
Credits
This project (opensearch-fscrawler) is a Python rewrite of FSCrawler, originally created by David Pilato in 2012. The configuration format, REST API design, crawl workflow, and checkpoint mechanism are all derived from his work.
If you need the full-featured Java version with Elasticsearch/OpenSearch 7–9 support, SSH/FTP crawling, Apache Tika bundled, and a plugin system, use the original: https://github.com/dadoonet/fscrawler
License
Apache License 2.0 — same as the original FSCrawler project. See LICENSE and NOTICE for full attribution details.