harvest-code

Harvest codebases into portable JSON + chunks for RAG, tooling, and analysis.

Available on PyPI. Python 3.9+. License: MIT.

Quick Start

pip install harvest-code
harvest reap . -o my-codebase.json
harvest serve my-codebase.json  # Web UI at http://localhost:8787

The interactive web interface offers search, filtering, syntax highlighting, and progressive loading.

Key Features

  • 🔍 Zero-dependency codebase harvesting from local directories or GitHub URLs
  • 📊 Smart chunking by functions, classes, exports with content-agnostic IDs
  • 🌐 Interactive Web UI with search, filtering, syntax highlighting, and deep linking
  • 🚀 Progressive loading - handles 50k+ files with infinite scroll pagination
  • 📝 Schema v1.2 with stable chunk identifiers for incremental updates
  • 🎯 Intelligent defaults - automatically skips binaries, caches, build artifacts, and lockfiles
  • Fast filtering - by language, visibility, file patterns, and symbols
  • 💾 Saved views - bookmark and share specific filtered states

Use Cases

  • RAG/AI Systems: Feed structured code context to LLMs with precise chunk boundaries
  • Documentation: Auto-generate API references from exported symbols and functions
  • Code Analysis: Extract dependencies, exports, and architectural patterns
  • Legacy Migration: Understand large codebases through progressive exploration
  • Code Search: Find functions, classes, and patterns across entire repositories

Installation

pip install harvest-code

Requirements: Python 3.9+ (no dependencies)

Core Commands

Harvest Local Directory

# Basic harvesting
harvest reap . -o codebase.json
harvest reap /path/to/project -o output.json

# With size limits
harvest reap . --max-files 1000 --max-bytes 512000

# Include everything (override smart defaults)
harvest reap . --no-default-excludes -o full-harvest.json

Harvest from GitHub

harvest reap https://github.com/user/repo -o repo.json
harvest reap https://github.com/user/repo/tree/main/src -o src-only.json

Query & Filter

# Find all public Python functions
harvest query output.json --entity chunks --language python --public true

# Search specific paths with custom fields
harvest query output.json --path-glob "src/**" --fields path,symbol,kind,start_line

# Find exports and components
harvest query output.json --entity files --export-named "MyComponent" 
harvest query output.json --entity chunks --kind export_default --language typescript

Interactive Web Interface

harvest serve output.json --port 8080
# Browse at http://localhost:8080

Web UI Features:

  • Search & Filter: Real-time filtering by language, path, symbols
  • Syntax Highlighting: Toggle-able code highlighting for 10+ languages
  • Progressive Loading: Infinite scroll for large datasets
  • Deep Linking: Share URLs to specific files and line ranges
  • Saved Views: Bookmark filtered states for quick access
  • Mobile Responsive: Works on desktop, tablet, and mobile

Output Schema

harvest-code produces JSON with three main sections:

{
  "metadata": {
    "schema": "harvest/v1.2",
    "source": {"type": "local", "root": "/path/to/code"},
    "counts": {"total_files": 150, "total_bytes": 524288},
    "created_at": "2025-08-15T12:00:00Z"
  },
  "data": [
    {
      "path": "src/utils.py",
      "language": "python", 
      "size": 1024,
      "exports": null,
      "py_symbols": {"functions": ["helper"], "classes": ["Util"]},
      "content": "def helper():\n    pass\n\nclass Util:\n    pass"
    }
  ],
  "chunks": [
    {
      "id": "abc123...",
      "file_path": "src/utils.py",
      "language": "python",
      "kind": "function",
      "symbol": "helper", 
      "start_line": 1,
      "end_line": 2,
      "public": true
    }
  ]
}
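
As a minimal sketch of consuming this structure, the snippet below tallies chunks by language and kind. It relies only on the field names shown in the example above; the `summarize` helper is hypothetical and not part of harvest-code.

```python
from collections import Counter

def summarize(harvest):
    """Count chunks per (language, kind) pair in a loaded harvest dict."""
    counts = Counter()
    for chunk in harvest.get("chunks", []):
        counts[(chunk.get("language"), chunk.get("kind"))] += 1
    return counts

# Tiny hand-written harvest that mirrors the schema shown above.
example = {
    "metadata": {"schema": "harvest/v1.2"},
    "data": [],
    "chunks": [
        {"language": "python", "kind": "function", "symbol": "helper"},
        {"language": "python", "kind": "class", "symbol": "Util"},
        {"language": "python", "kind": "function", "symbol": "other"},
    ],
}
summary = summarize(example)
```

In practice the `example` dict would be replaced by the result of `json.load` on a harvest file.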

Advanced Usage

Custom Exclusions

# Skip additional file types
harvest reap . --skip-ext .log,.tmp --skip-folder cache,temp

# Include only specific languages  
harvest reap . --only-ext .py,.js,.ts

Incremental Updates

# Reuse unchanged files for faster re-harvesting
harvest reap . --prev previous-harvest.json -o updated.json
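
Because chunk identifiers are described as stable, two harvests can in principle be compared by ID. The sketch below assumes each chunk carries an `id` field as in the Output Schema; `diff_chunk_ids` is a hypothetical helper, and this is not a description of how `--prev` works internally.

```python
def diff_chunk_ids(previous, current):
    """Compare two loaded harvest dicts by chunk ID."""
    prev_ids = {c["id"] for c in previous.get("chunks", [])}
    curr_ids = {c["id"] for c in current.get("chunks", [])}
    return {
        "added": sorted(curr_ids - prev_ids),    # only in the new harvest
        "removed": sorted(prev_ids - curr_ids),  # only in the old harvest
        "kept": sorted(prev_ids & curr_ids),     # present in both
    }

# Hand-written stand-ins for two harvest files.
old = {"chunks": [{"id": "a1"}, {"id": "b2"}]}
new = {"chunks": [{"id": "b2"}, {"id": "c3"}]}
delta = diff_chunk_ids(old, new)
```

A downstream index could then re-embed only the `added` chunks and drop the `removed` ones.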

Large Codebases

The web UI automatically handles large datasets with:

  • Pagination: Loads 50 items at a time by default
  • Infinite Scroll: Seamlessly loads more results
  • Cursor-based Navigation: Efficient browsing of 10k+ items
  • Smart Filtering: Client-side filtering for responsive UX
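
To illustrate the general idea of cursor-based pagination (not the web UI's actual implementation, which is not shown here), the following sketch pages through a list 50 items at a time. The `paginate` function and its integer cursor are assumptions made for the example.

```python
def paginate(items, cursor=0, limit=50):
    """Return one page of items plus the cursor for the next page (or None)."""
    page = items[cursor:cursor + limit]
    next_cursor = cursor + limit if cursor + limit < len(items) else None
    return page, next_cursor

items = list(range(120))
first_page, cursor_after_first = paginate(items)
last_page, cursor_after_last = paginate(items, cursor=100)
```

A client keeps requesting pages until the returned cursor is `None`, which is what an infinite-scroll view does as the user reaches the end of the list.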

Default Exclusions

harvest-code intelligently skips common non-source files:

File Extensions: .log, .tmp, .db, .sqlite, .mp4, .jpg, .png, .zip, .exe, .pyc, .min.js, .woff, .pb, .tflite, etc.

Directories: node_modules, .git, dist, build, __pycache__, .pytest_cache, .mypy_cache, .vscode, .idea, .next, Pods, DerivedData, etc.

Lockfiles: yarn.lock, package-lock.json, poetry.lock, Cargo.lock, go.sum, etc.

Use --no-default-excludes to include everything.
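
The kind of filtering described above can be sketched as a simple path check. The sets below contain only a few of the entries listed in this section, and `should_skip` is a hypothetical helper, not harvest-code's own logic.

```python
from pathlib import PurePosixPath

# Small illustrative subsets of the defaults listed above.
SKIP_EXTS = {".log", ".tmp", ".pyc", ".png", ".min.js"}
SKIP_DIRS = {"node_modules", ".git", "__pycache__", "dist"}
SKIP_NAMES = {"yarn.lock", "package-lock.json", "poetry.lock"}

def should_skip(path):
    """Return True if a relative POSIX-style path matches an exclusion rule."""
    p = PurePosixPath(path)
    if p.name in SKIP_NAMES:
        return True
    # Any excluded directory among the parent components excludes the file.
    if any(part in SKIP_DIRS for part in p.parts[:-1]):
        return True
    # endswith() lets compound suffixes such as ".min.js" match.
    return any(p.name.endswith(ext) for ext in SKIP_EXTS)
```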

Language Support

Chunking & Symbol Extraction:

  • Python: Functions, classes, exports via AST parsing
  • JavaScript/TypeScript: Functions, classes, React components, ES6/CommonJS exports
  • JSON/YAML/TOML: Single file chunks with metadata
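
For Python, AST-based chunking can be approximated with the standard `ast` module. This is a minimal sketch under stated assumptions: it only looks at top-level definitions, and it treats names without a leading underscore as public. The `python_chunks` function is hypothetical and may differ from what harvest-code does.

```python
import ast

def python_chunks(source):
    """Return one dict per top-level function or class in the source text."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue  # ignore imports, assignments, and other statements
        chunks.append({
            "kind": kind,
            "symbol": node.name,
            "start_line": node.lineno,    # 1-based, as reported by ast
            "end_line": node.end_lineno,  # inclusive
            "public": not node.name.startswith("_"),  # assumed convention
        })
    return chunks

sample = "def helper():\n    pass\n\nclass Util:\n    pass"
result = python_chunks(sample)
```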

Syntax Highlighting: Python, JavaScript, TypeScript, JSON, YAML, TOML, Markdown, Shell, Go, Rust

API Integration

The web server exposes REST endpoints for programmatic access:

# Quote the URL so the shell does not treat "&" as a background operator
curl "http://localhost:8787/api/search?entity=chunks&language=python&limit=10"
curl "http://localhost:8787/api/meta"  # Get harvest metadata
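
The same search request can be issued from Python using only the standard library. The sketch below uses the `/api/search` path and query parameters from the curl example; it assumes the server responds with JSON, and the helper names `build_search_url` and `search` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_search_url(base_url, **params):
    """Build an /api/search URL with properly encoded query parameters."""
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"

def search(base_url, **params):
    """Call the search endpoint and parse the body (assumed to be JSON)."""
    with urlopen(build_search_url(base_url, **params)) as resp:
        return json.load(resp)

url = build_search_url(
    "http://localhost:8787", entity="chunks", language="python", limit=10
)
```

Calling `search("http://localhost:8787", entity="chunks", language="python", limit=10)` requires `harvest serve` to be running on that port.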

Examples

RAG Pipeline Integration

import json

# Load harvest data
with open('codebase.json') as f:
    harvest = json.load(f)

# Extract public API functions for LLM context
api_functions = [
    chunk for chunk in harvest['chunks'] 
    if chunk['public'] and chunk['kind'] in ['function', 'export_named']
]
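
To pass the selected chunks to an LLM, their source text has to be recovered from the file entries. The sketch below assumes that `start_line` and `end_line` are 1-based and inclusive, and that a chunk's `file_path` matches the `path` of an entry in `data`, as the Output Schema example suggests. `chunk_text` is a hypothetical helper.

```python
def chunk_text(harvest, chunk):
    """Return the source lines covered by a chunk, or None if the file is missing."""
    for entry in harvest["data"]:
        if entry["path"] == chunk["file_path"]:
            lines = entry["content"].splitlines()
            # Convert the 1-based inclusive range into a Python slice.
            return "\n".join(lines[chunk["start_line"] - 1:chunk["end_line"]])
    return None

# Data copied from the Output Schema example.
sample_harvest = {
    "data": [{
        "path": "src/utils.py",
        "content": "def helper():\n    pass\n\nclass Util:\n    pass",
    }],
    "chunks": [{
        "file_path": "src/utils.py", "symbol": "helper",
        "start_line": 1, "end_line": 2,
    }],
}
text = chunk_text(sample_harvest, sample_harvest["chunks"][0])
```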

Documentation Generation

# Extract all exported React components
harvest query react-app.json --entity chunks --kind export_default --language typescript --fields symbol,file_path > components.txt

Migration Analysis

# Find all Python classes for refactoring
harvest query legacy-app.json --entity chunks --kind class --language python --min-lines 10

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup:

git clone https://github.com/veyorokon/code-harvest.git
cd code-harvest  
pip install -e .
harvest reap . -o self-harvest.json  # Harvest itself!

License

MIT License - see LICENSE file for details.
