Skip to main content

Harvest codebases into portable JSON + chunks for RAG and tooling

Project description

harvest-code

Extract codebases into portable JSON for RAG, tooling, and analysis.

PyPI Python 3.9+ License: MIT

Quick Start

pip install harvest-code
harvest serve  # Watch + serve current directory at http://localhost:8787

harvest Web UI

Features

  • 🔍 Zero dependencies - Pure Python, no external packages required
  • 📊 Smart chunking - Extracts functions, classes, exports with stable IDs
  • 🌐 Interactive UI - Web interface with search, filtering, and syntax highlighting
  • 🎯 Intelligent filtering - Automatically skips build artifacts, binaries, and test files
  • Live updates - Auto-refreshes when files change (watch mode)
  • 🚀 Scales - Handles 50k+ files with progressive loading

Commands

harvest reap - Extract codebase to JSON

# Basic usage - output named after directory
harvest reap .                    # → ./current-dir-name.harvest.json
harvest reap /path/to/project     # → ./project.harvest.json

# Custom output
harvest reap . -o analysis.json

# Control what's included
harvest reap . --include metadata       # File inventory only
harvest reap . --include data          # File contents only  
harvest reap . --exclude chunks        # Skip code parsing
harvest reap . --format jsonl          # Line-delimited JSON

# Override filters
harvest reap . --no-default-excludes   # Include hidden files, tests, etc.

harvest serve - Interactive web UI

# Watch + serve (default)
harvest serve                    # Current directory on port 8787
harvest serve /path/to/project   # Specific directory
harvest serve --port 8080        # Custom port

# Control options
harvest serve --no-watch         # Disable auto-refresh
harvest serve --only-ext py,ts   # Watch specific file types

Web UI Features:

  • Real-time search and filtering
  • Syntax highlighting with toggle
  • Skeleton view - Show only function/class signatures
  • Auto-refresh on file changes
  • Infinite scroll for large codebases
  • Deep linking to specific files/lines

harvest query - Search harvest data

# Find specific code elements
harvest query data.json --entity chunks --language python --public true
harvest query data.json --entity files --export-named "MyComponent"
harvest query data.json --path-glob "src/**" --fields path,symbol,kind

harvest watch - Monitor directory changes

# Continuous harvesting
harvest watch .                        # → ./<dirname>.harvest.json
harvest watch /path/to/src -o out.json # Custom output
harvest watch . --only-ext py,js,ts    # Specific extensions

harvest sow - Generate artifacts

# Create React barrel exports
harvest sow data.json --react src/index.ts

harvest winnow - Extract code skeletons

# Extract function/class signatures without bodies
harvest winnow data.harvest.json --skeleton                    # → data.skeleton.harvest.jsonl
harvest winnow data.harvest.json --skeleton --language python  # Python only
harvest winnow data.harvest.json --skeleton --out custom.jsonl # Custom output

# Output formats
harvest winnow data.harvest.json --skeleton --format json      # → data.skeleton.harvest.json
harvest winnow data.harvest.json --skeleton --format jsonl     # → data.skeleton.harvest.jsonl (default)
harvest winnow data.harvest.json --skeleton --out -            # Output to stdout

Output Structure

harvest generates JSON with three sections:

{
  "metadata": {
    "schema": "harvest/v1.2",
    "source": {"type": "local", "root": "/path/to/code"},
    "counts": {"total_files": 150, "total_bytes": 524288}
  },
  "data": [
    {
      "path": "src/utils.py",
      "language": "python",
      "content": "def helper():\n    pass",
      "exports": null,
      "py_symbols": {"functions": ["helper"]}
    }
  ],
  "chunks": [
    {
      "id": "abc123...",
      "file_path": "src/utils.py",
      "kind": "function",
      "symbol": "helper",
      "start_line": 1,
      "end_line": 2,
      "public": true
    }
  ]
}

Output Sections

  • metadata - File inventory, counts, timestamps
  • data - File contents and language metadata
  • chunks - Parsed symbols (functions, classes, exports)

Control output with --include and --exclude:

harvest reap . --include metadata       # Inventory only (fast)
harvest reap . --exclude chunks         # Skip parsing (smaller)
harvest reap . --include chunks         # Symbols only (for analysis)

File Handling

Smart Filtering

harvest uses a three-tier filtering system:

Completely Skipped:

  • Hidden files and directories (.git/, .env)
  • Test directories (tests/, __tests__/)
  • Build artifacts (dist/, build/, node_modules/)
  • Binaries and media (.exe, .mp3, .db)
  • Logs and temp files (.log, .tmp)

Path-Only (no content):

  • Images (.jpg, .png, .svg)
  • Fonts (.ttf, .woff)
  • Documents (.pdf, .doc)

Fully Processed:

  • Source code (.py, .js, .ts, etc.)
  • Config files (.json, .yaml, .toml)
  • Documentation (.md, .txt)

Override with --no-default-excludes to include everything.

Language Support

Full parsing (chunks + symbols):

  • Python - Functions, classes via AST
  • JavaScript/TypeScript - Functions, classes, React components, ES6/CommonJS exports
  • JSON/YAML/TOML - Single file chunks

Syntax highlighting in web UI: Python, JavaScript, TypeScript, JSON, YAML, TOML, Markdown, Shell, Go, Rust

API Endpoints

The web server exposes REST APIs:

# Search chunks
curl http://localhost:8787/api/search?entity=chunks&language=python

# Get metadata
curl http://localhost:8787/api/meta

# Download full harvest
curl http://localhost:8787/api/harvest

Use Cases

RAG/LLM Context

import json

with open('project.harvest.json') as f:
    harvest = json.load(f)

# Extract public functions for context
api_functions = [
    chunk for chunk in harvest['chunks'] 
    if chunk['public'] and chunk['kind'] == 'function'
]

Code Analysis

# Find all React components
harvest query app.json --entity chunks --kind export_default --language typescript

# Find large classes
harvest query app.json --entity chunks --kind class --min-lines 100

Documentation Generation

# Extract all exports
harvest query app.json --entity files --has-default-export --fields path,exports

File Naming

  • Output files use pattern: <directory-name>.harvest.json
  • Created in current working directory (where command is run)
  • Use -o flag for custom names/paths
  • All *.harvest.json files are automatically ignored to prevent recursion

Development

git clone https://github.com/veyorokon/code-harvest.git
cd code-harvest
pip install -e .
harvest reap . -o self.harvest.json

Links

License

MIT - see LICENSE file

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

harvest_code-1.7.0.tar.gz (35.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

harvest_code-1.7.0-py3-none-any.whl (35.1 kB view details)

Uploaded Python 3

File details

Details for the file harvest_code-1.7.0.tar.gz.

File metadata

  • Download URL: harvest_code-1.7.0.tar.gz
  • Upload date:
  • Size: 35.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for harvest_code-1.7.0.tar.gz
Algorithm Hash digest
SHA256 0830b50524aa453b897d0d1b78b4471575aea178f643d4f766d438f8a1aa0807
MD5 2e787633a70af58a736401fab0dc4407
BLAKE2b-256 cda833c4051e4e94bd66d55cf769bbbd7761350756fdffbd5ca3015681e24115

See more details on using hashes here.

File details

Details for the file harvest_code-1.7.0-py3-none-any.whl.

File metadata

  • Download URL: harvest_code-1.7.0-py3-none-any.whl
  • Upload date:
  • Size: 35.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for harvest_code-1.7.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f35ef9707624e37339a77e109d0eefab37c7f3d42b8210240b76f93ea184c4bc
MD5 531f91af1b8d7e9b38d6c1293058ce87
BLAKE2b-256 8c258d3bcec39bb1190316e2342c0a33002252bbfc2e4600146632553de96073

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page