Harvest codebases into portable JSON + chunks for RAG and tooling

harvest-code

Extract codebases into portable JSON for RAG, tooling, and analysis.

Requires Python 3.9+. License: MIT.

Quick Start

pip install harvest-code
harvest serve  # Watch + serve current directory at http://localhost:8787

Features

  • 🔍 Zero dependencies - Pure Python, no external packages required
  • 📊 Smart chunking - Extracts functions, classes, exports with stable IDs
  • 🌐 Interactive UI - Web interface with search, filtering, and syntax highlighting
  • 🎯 Intelligent filtering - Automatically skips build artifacts, binaries, and test files
  • Live updates - Auto-refreshes when files change (watch mode)
  • 🚀 Scales - Handles 50k+ files with progressive loading

Commands

harvest reap - Extract codebase to JSON

# Basic usage - output named after directory
harvest reap .                    # → ./current-dir-name.harvest.json
harvest reap /path/to/project     # → ./project.harvest.json

# Custom output
harvest reap . -o analysis.json

# Control what's included
harvest reap . --include metadata      # File inventory only
harvest reap . --include data          # File contents only
harvest reap . --exclude chunks        # Skip code parsing
harvest reap . --format jsonl          # Line-delimited JSON

# Override filters
harvest reap . --no-default-excludes   # Include hidden files, tests, etc.

harvest serve - Interactive web UI

# Watch + serve (default)
harvest serve                    # Current directory on port 8787
harvest serve /path/to/project   # Specific directory
harvest serve --port 8080        # Custom port

# Control options
harvest serve --no-watch         # Disable auto-refresh
harvest serve --only-ext py,ts   # Watch specific file types

Web UI Features:

  • Real-time search and filtering
  • Syntax highlighting with toggle
  • Auto-refresh on file changes
  • Infinite scroll for large codebases
  • Deep linking to specific files/lines

harvest query - Search harvest data

# Find specific code elements
harvest query data.json --entity chunks --language python --public true
harvest query data.json --entity files --export-named "MyComponent"
harvest query data.json --path-glob "src/**" --fields path,symbol,kind
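The same kind of filtering can also be done in-process once the JSON is loaded. The sketch below is not the implementation behind harvest query; select_chunks is a hypothetical helper, and it assumes the chunk and file keys shown in the Output Structure example (file_path, symbol, kind, public on chunks; path and language on files).

```python
from fnmatch import fnmatchcase


def select_chunks(harvest, path_glob="*", language=None, public=None, fields=None):
    """Filter harvest['chunks'] by path glob, language, and visibility."""
    # The sample chunk carries no language key, so look it up via the file entry.
    lang_by_path = {f["path"]: f.get("language") for f in harvest.get("data", [])}
    rows = []
    for chunk in harvest.get("chunks", []):
        path = chunk.get("file_path", "")
        # fnmatchcase lets "*" match "/" too, so "src/**" covers nested paths.
        if not fnmatchcase(path, path_glob):
            continue
        if language is not None and lang_by_path.get(path) != language:
            continue
        if public is not None and chunk.get("public") != public:
            continue
        # Project onto the requested keys, or keep the whole chunk.
        rows.append({k: chunk.get(k) for k in fields} if fields else dict(chunk))
    return rows


sample = {
    "data": [{"path": "src/utils.py", "language": "python"}],
    "chunks": [
        {"file_path": "src/utils.py", "kind": "function",
         "symbol": "helper", "start_line": 1, "end_line": 2, "public": True},
    ],
}
matches = select_chunks(sample, path_glob="src/**", language="python",
                        public=True, fields=["file_path", "symbol", "kind"])
```

This is mainly useful when the harvest is already in memory, for example inside a RAG indexing script, and spawning the CLI would be overkill.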

harvest watch - Monitor directory changes

# Continuous harvesting
harvest watch .                        # → ./<dirname>.harvest.json
harvest watch /path/to/src -o out.json # Custom output
harvest watch . --only-ext py,js,ts    # Specific extensions

harvest sow - Generate artifacts

# Create React barrel exports
harvest sow data.json --react src/index.ts

Output Structure

harvest generates JSON with three sections:

{
  "metadata": {
    "schema": "harvest/v1.2",
    "source": {"type": "local", "root": "/path/to/code"},
    "counts": {"total_files": 150, "total_bytes": 524288}
  },
  "data": [
    {
      "path": "src/utils.py",
      "language": "python",
      "content": "def helper():\n    pass",
      "exports": null,
      "py_symbols": {"functions": ["helper"]}
    }
  ],
  "chunks": [
    {
      "id": "abc123...",
      "file_path": "src/utils.py",
      "kind": "function",
      "symbol": "helper",
      "start_line": 1,
      "end_line": 2,
      "public": true
    }
  ]
}

Output Sections

  • metadata - File inventory, counts, timestamps
  • data - File contents and language metadata
  • chunks - Parsed symbols (functions, classes, exports)

Control output with --include and --exclude:

harvest reap . --include metadata       # Inventory only (fast)
harvest reap . --exclude chunks         # Skip parsing (smaller)
harvest reap . --include chunks         # Symbols only (for analysis)
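Since --include and --exclude change which sections are written, code that reads a harvest file may want to treat every section as optional. A minimal sketch follows; summarize is a hypothetical helper, and it assumes an omitted section shows up as a missing key or a null value, which the examples above do not confirm.

```python
def summarize(harvest):
    """Return per-section counts, treating omitted sections as empty."""
    meta = harvest.get("metadata") or {}
    files = harvest.get("data") or []
    chunks = harvest.get("chunks") or []

    # Tally chunks by kind ("function", "class", ...).
    kinds = {}
    for chunk in chunks:
        kind = chunk.get("kind", "unknown")
        kinds[kind] = kinds.get(kind, 0) + 1

    return {
        "schema": meta.get("schema"),
        "files": len(files),
        "chunks": len(chunks),
        "kinds": kinds,
    }


# A harvest produced with only the metadata section still summarizes cleanly.
metadata_only = {"metadata": {"schema": "harvest/v1.2"}}
summary = summarize(metadata_only)
```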

File Handling

Smart Filtering

harvest uses a three-tier filtering system:

Completely Skipped:

  • Hidden files and directories (.git/, .env)
  • Test directories (tests/, __tests__/)
  • Build artifacts (dist/, build/, node_modules/)
  • Binaries and media (.exe, .mp3, .db)
  • Logs and temp files (.log, .tmp)

Path-Only (no content):

  • Images (.jpg, .png, .svg)
  • Fonts (.ttf, .woff)
  • Documents (.pdf, .doc)

Fully Processed:

  • Source code (.py, .js, .ts, etc.)
  • Config files (.json, .yaml, .toml)
  • Documentation (.md, .txt)

Override with --no-default-excludes to include everything.
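The three tiers can be pictured as a small classifier over paths. The sketch below only illustrates the rules listed above and is not harvest's actual filter: the sets contain just the examples named in this section, classify is a hypothetical function, and falling back to "full" for unknown extensions is an assumption.

```python
from pathlib import PurePosixPath

# Only the examples named in this README; the real lists are likely longer.
SKIP_DIRS = {"tests", "__tests__", "dist", "build", "node_modules"}
SKIP_EXTS = {".exe", ".mp3", ".db", ".log", ".tmp"}
PATH_ONLY_EXTS = {".jpg", ".png", ".svg", ".ttf", ".woff", ".pdf", ".doc"}


def classify(path):
    """Return "skip", "path_only", or "full" for a relative POSIX-style path."""
    p = PurePosixPath(path)
    # Hidden files and directories: any component starting with a dot.
    if any(part.startswith(".") for part in p.parts):
        return "skip"
    # Test and build directories anywhere above the file.
    if any(part in SKIP_DIRS for part in p.parts[:-1]):
        return "skip"
    suffix = p.suffix.lower()
    if suffix in SKIP_EXTS:
        return "skip"
    if suffix in PATH_ONLY_EXTS:
        return "path_only"
    # Assumption: everything else is read and parsed.
    return "full"


tiers = {p: classify(p) for p in
         [".git/config", "tests/test_a.py", "assets/logo.png", "src/utils.py"]}
```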

Language Support

Full parsing (chunks + symbols):

  • Python - Functions, classes via AST
  • JavaScript/TypeScript - Functions, classes, React components, ES6/CommonJS exports
  • JSON/YAML/TOML - One chunk per file
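To give a feel for AST-based extraction of Python symbols, here is a minimal sketch built on the standard library's ast module. It is not harvest's parser: python_symbols is a hypothetical function, it looks at top-level definitions only, and treating names without a leading underscore as public is an assumption.

```python
import ast


def python_symbols(source):
    """Return chunk-like dicts for top-level functions and classes."""
    symbols = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue  # Imports, assignments, and so on are not symbols here.
        symbols.append({
            "kind": kind,
            "symbol": node.name,
            "start_line": node.lineno,    # 1-indexed line of the def/class keyword
            "end_line": node.end_lineno,  # Inclusive last line (Python 3.8+)
            # Assumption: a leading underscore marks a private symbol.
            "public": not node.name.startswith("_"),
        })
    return symbols


symbols = python_symbols("def helper():\n    pass")
```

For the two-line helper function this yields start_line 1 and end_line 2, the same line range as the chunk in the Output Structure example.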

Syntax highlighting in web UI: Python, JavaScript, TypeScript, JSON, YAML, TOML, Markdown, Shell, Go, Rust

API Endpoints

The web server exposes REST APIs:

# Search chunks
curl "http://localhost:8787/api/search?entity=chunks&language=python"

# Get metadata
curl http://localhost:8787/api/meta

# Download full harvest
curl http://localhost:8787/api/harvest
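The same endpoints can be called from Python with only the standard library. This is a minimal sketch: search_url and fetch_json are hypothetical helpers, the query parameters are the ones from the curl example, and it assumes the server responds with JSON, which this README does not state explicitly.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8787"  # Default address used by harvest serve


def search_url(base=BASE_URL, **params):
    """Build an /api/search URL with properly encoded query parameters."""
    query = urlencode(params)
    return f"{base}/api/search?{query}" if query else f"{base}/api/search"


def fetch_json(url, timeout=5):
    """GET a URL and decode the body as JSON (assumed response format)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


url = search_url(entity="chunks", language="python")
# With `harvest serve` running in another terminal:
# results = fetch_json(url)
# meta = fetch_json(f"{BASE_URL}/api/meta")
```

Building the query with urlencode avoids the shell-quoting problem that an unquoted & causes on the command line.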

Use Cases

RAG/LLM Context

import json

with open('project.harvest.json', encoding='utf-8') as f:
    harvest = json.load(f)

# Extract public functions for context
api_functions = [
    chunk for chunk in harvest['chunks'] 
    if chunk['public'] and chunk['kind'] == 'function'
]
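For RAG, the chunk metadata is usually paired with the source text it points to. A minimal sketch of that join follows; chunk_source is a hypothetical helper, and it assumes start_line and end_line are 1-indexed and inclusive, which fits the two-line helper example in Output Structure but is not spelled out in this README.

```python
def chunk_source(harvest, chunk):
    """Return the source text a chunk spans, or None if content is unavailable."""
    # Index file contents by path so each chunk lookup is a dict access.
    contents = {f["path"]: f.get("content") for f in harvest.get("data", [])}
    text = contents.get(chunk["file_path"])
    if text is None:
        return None  # For example, a harvest written without the data section
    lines = text.splitlines()
    # Assumption: line numbers are 1-indexed and end_line is inclusive.
    return "\n".join(lines[chunk["start_line"] - 1:chunk["end_line"]])


sample = {
    "data": [{"path": "src/utils.py", "language": "python",
              "content": "def helper():\n    pass"}],
    "chunks": [{"file_path": "src/utils.py", "kind": "function",
                "symbol": "helper", "start_line": 1, "end_line": 2,
                "public": True}],
}
snippet = chunk_source(sample, sample["chunks"][0])
```

Each snippet can then be embedded or added to a prompt alongside its symbol and file_path.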

Code Analysis

# Find all React components
harvest query app.json --entity chunks --kind export_default --language typescript

# Find large classes
harvest query app.json --entity chunks --kind class --min-lines 100

Documentation Generation

# Extract all exports
# (this query lists only files that have a default export)
harvest query app.json --entity files --has-default-export --fields path,exports

File Naming

  • Output files use pattern: <directory-name>.harvest.json
  • Created in current working directory (where command is run)
  • Use -o flag for custom names/paths
  • All *.harvest.json files are automatically ignored to prevent recursion
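The naming rules above can be written out with pathlib. This only illustrates the convention and is not harvest's own code; default_output_path and is_harvest_output are hypothetical names.

```python
from pathlib import Path


def default_output_path(target, cwd=None):
    """Return <cwd>/<directory-name>.harvest.json for a target directory."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    # resolve() turns "." into an absolute path so it has a usable name.
    name = Path(target).resolve().name
    return cwd / f"{name}.harvest.json"


def is_harvest_output(path):
    """True for files matching *.harvest.json, which harvest ignores."""
    return Path(path).name.endswith(".harvest.json")


out = default_output_path("/path/to/project", cwd="/tmp")
```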

Development

git clone https://github.com/veyorokon/code-harvest.git
cd code-harvest
pip install -e .
harvest reap . -o self.harvest.json

License

MIT - see LICENSE file
