Code Ingestion Service
A production-ready Python service for intelligently chunking source code and ingesting it into RAG (Retrieval-Augmented Generation) pipelines. Features optimized performance, Pinecone vector database integration, and a powerful CLI for both public and private repositories.
Author: Sandeep G
Copyright: © 2025 Sandeep G
License: Apache License 2.0 - see LICENSE file for details
🚀 Features
Core Capabilities
- Smart Code Chunking: Single-pass CST (Concrete Syntax Tree) analysis for structure-aware chunk boundaries
- Pluggable Architecture: Swap embedding providers (Nomic, OpenAI) and vector stores (Pinecone, Weaviate)
- High Performance: Optimized batching for ultra-fast embedding generation
- Production-Ready CLI: Unified interface with provider selection and verbose logging
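The batching optimization above can be pictured with a minimal sketch. This is illustrative only: the service's real batch size and embedding call are internal, and `batched` here is a hypothetical helper showing why texts are embedded per batch rather than one at a time.

```python
from typing import Iterator

def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield fixed-size batches so embeddings are requested per batch, not per item."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

chunks = [f"chunk-{i}" for i in range(7)]
batches = list(batched(chunks, 3))  # 7 chunks at batch size 3 -> batches of 3, 3, 1
```

Sending one request per batch instead of one per chunk is what keeps embedding generation fast for large repositories.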
Language & Platform Support
- Java: Full CST parsing with REST API detection and method extraction
- Git Integration: Shallow clone support for public/private repositories
- Multiple Providers: Nomic (default), OpenAI, HuggingFace embedding support
- Vector Stores: Pinecone (default), with extensible architecture
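The shallow-clone support amounts to passing `--depth 1` to git so only the latest commit is fetched. A minimal sketch of how the clone command might be assembled (the function name and `branch` default are assumptions, not the service's actual code):

```python
def shallow_clone_cmd(repo_url: str, dest: str, branch: str = "main") -> list[str]:
    """Build a git command for a shallow (depth-1) clone of a single branch."""
    return ["git", "clone", "--depth", "1", "--branch", branch, repo_url, dest]

cmd = shallow_clone_cmd("https://github.com/spring-projects/spring-boot", "/tmp/repo")
```

A depth-1 clone avoids downloading the full history, which matters for large repositories like spring-boot.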
Performance & Reliability
- Optimized Processing: Single-pass CST traversal, method-level context caching
- Smart Filtering: Include/exclude patterns for selective ingestion
- Error Handling: Robust processing with cleanup and validation
- Test Coverage: Comprehensive test suite for all components
🛠️ Core Components
CodeChunker
The main orchestrator for code chunking operations, responsible for:
- Parsing source code
- Analyzing Concrete Syntax Trees (CST)
- Applying chunking strategies
- Generating structured code chunks with metadata
Key Features
- Package & Import Handling: Preserves context by maintaining package and import statements
- Class-Level Chunking: Creates complete class chunks when appropriate
- Method-Level Chunking: Breaks down classes into method-level chunks when needed
- Intelligent ID Generation: Creates unique identifiers for each code chunk
- Metadata Management: Tracks comprehensive metadata for each chunk
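The method-level chunking and ID generation described above can be sketched as follows. The real chunker walks a Java CST; this sketch uses Python's stdlib `ast` module as a stand-in to show the same idea: one chunk per method, each with metadata and a deterministic ID derived from its location.

```python
import ast
import hashlib

def chunk_methods(source: str, module_name: str) -> list[dict]:
    """Split a module into function-level chunks with stable, unique IDs.

    Illustrative only: the service parses Java via CST, but the output shape
    (code + metadata + deterministic chunk ID) follows the same pattern.
    """
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Hash module + method name so re-ingesting yields the same ID.
            chunk_id = hashlib.sha256(f"{module_name}:{node.name}".encode()).hexdigest()[:12]
            chunks.append({
                "id": chunk_id,
                "name": node.name,
                "start_line": node.lineno,
                "code": ast.get_source_segment(source, node),
            })
    return chunks

src = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
result = chunk_methods(src, "math_utils")  # two chunks: add, sub
```

Deterministic IDs mean re-running ingestion upserts the same vectors instead of duplicating them.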
🏗️ Architecture
The service follows a pluggable architecture with these main components:
- Orchestration: Coordinates the complete ingestion pipeline
- Chunkers: Handle intelligent code splitting with CST analysis
- Embedding Providers: Generate embeddings (Nomic, OpenAI, HuggingFace)
- Vector Stores: Store embeddings (Pinecone, Weaviate, Qdrant)
- Data Models: Define structured representations for chunks and metadata
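The pluggable architecture boils down to small interfaces that providers and stores implement. A minimal sketch, assuming hypothetical interface names (`EmbeddingProvider`, `VectorStore`) that may differ from the service's actual classes, with in-memory fakes standing in for Nomic and Pinecone:

```python
from typing import Protocol

class EmbeddingProvider(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class VectorStore(Protocol):
    def upsert(self, ids: list[str], vectors: list[list[float]], metadata: list[dict]) -> None: ...

class FakeProvider:
    """Stand-in for Nomic/OpenAI: embeds a text as its character count."""
    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]

class InMemoryStore:
    """Stand-in for Pinecone/Weaviate: keeps vectors in a dict."""
    def __init__(self):
        self.records: dict[str, tuple[list[float], dict]] = {}
    def upsert(self, ids, vectors, metadata):
        for i, v, m in zip(ids, vectors, metadata):
            self.records[i] = (v, m)

def ingest(chunks: list[dict], provider: EmbeddingProvider, store: VectorStore) -> None:
    """Orchestration in miniature: embed every chunk, then upsert the results."""
    vectors = provider.embed([c["code"] for c in chunks])
    store.upsert([c["id"] for c in chunks], vectors, [{"id": c["id"]} for c in chunks])

store = InMemoryStore()
ingest([{"id": "c1", "code": "int x;"}], FakeProvider(), store)
```

Because the pipeline depends only on the two protocols, swapping OpenAI for Nomic or Weaviate for Pinecone is a constructor change, not a pipeline change.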
💻 Usage
CLI Usage (Recommended)
The CLI provides a simple interface for ingesting repositories into your RAG pipeline:
Basic Usage
# Ingest a local repository
code-ingestion /path/to/your/repo
# Ingest a public GitHub repository
code-ingestion https://github.com/spring-projects/spring-boot
# Ingest with file filtering
code-ingestion https://github.com/kdn251/interviews \
--include "**/*.java" \
--exclude "**/test/**" \
--max-files 50
Advanced Filtering
# Include specific folders and file types
code-ingestion https://github.com/kdn251/interviews \
--include "company/**/*.java" \
--include "leetcode/**/*.java" \
--include "cracking-the-coding-interview/**/*.java" \
--max-files 30
# Exclude unwanted directories
code-ingestion https://github.com/spring-projects/spring-boot \
--include "**/*.java" \
--exclude "**/test/**" \
--exclude "**/build/**" \
--exclude "**/target/**"
Provider Selection
# Use different embedding providers
code-ingestion /path/to/repo --embedding-provider nomic # Default
code-ingestion /path/to/repo --embedding-provider openai
code-ingestion /path/to/repo --embedding-provider huggingface
# Use different vector stores
code-ingestion /path/to/repo --vector-store pinecone # Default
code-ingestion /path/to/repo --vector-store weaviate
# Enable detailed logging and progress
code-ingestion /path/to/repo --verbose
CLI Options
- --embedding-provider: Choose the embedding provider (nomic, openai, huggingface)
- --vector-store: Choose the vector store (pinecone, weaviate, qdrant)
- --verbose: Enable detailed logging and progress reports
- --include: File patterns to include (supports glob patterns like **/*.java)
- --exclude: File patterns to exclude (defaults exclude test, build, node_modules, etc.)
- --max-files: Limit the number of files processed (useful for large repos)
- --cleanup/--no-cleanup: Control temporary file cleanup (default: cleanup enabled)
GitHub Actions Integration (Recommended)
Create automated ingestion workflows for your repositories:
# .github/workflows/ingest-code.yml
name: Ingest Codebase to RAG

on:
  workflow_dispatch:       # Manual trigger
  push:
    branches: [ main ]     # Auto-trigger on main branch updates

jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Code Ingestion Service
        run: |
          pip install git+https://github.com/sandeepgovi/code-ingestion-service.git@main
      - name: Ingest Current Repository
        run: |
          code-ingestion ./ \
            --include "**/*.java" \
            --include "**/*.py" \
            --exclude "**/test/**" \
            --verbose
        env:
          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
          PINECONE_INDEX_NAME: ${{ secrets.PINECONE_INDEX_NAME }}

  # Optional: ingest external repositories
  ingest-external:
    runs-on: ubuntu-latest
    steps:
      - name: Install Code Ingestion Service
        run: |
          pip install git+https://github.com/sandeepgovi/code-ingestion-service.git@main
      - name: Ingest Spring Boot (example)
        run: |
          code-ingestion https://github.com/spring-projects/spring-boot \
            --include "**/*.java" \
            --max-files 200 \
            --verbose
        env:
          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
          PINECONE_INDEX_NAME: ${{ secrets.PINECONE_INDEX_NAME }}
Multi-Repository Workflow
# Ingest multiple repositories in one workflow
name: Build Knowledge Base

on:
  workflow_dispatch:
  schedule:
    - cron: '0 2 * * 0'    # Weekly on Sunday at 2 AM

jobs:
  ingest-repositories:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        repo:
          - 'https://github.com/spring-projects/spring-boot'
          - 'https://github.com/apache/kafka'
          - 'your-org/internal-repo'
    steps:
      - name: Install Code Ingestion Service
        run: |
          pip install git+https://github.com/sandeepgovi/code-ingestion-service.git@main
      - name: Ingest Repository
        run: |
          code-ingestion ${{ matrix.repo }} \
            --include "**/*.java" \
            --exclude "**/test/**" \
            --max-files 500 \
            --verbose
        env:
          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
          PINECONE_INDEX_NAME: ${{ secrets.PINECONE_INDEX_NAME }}
Local Development & Testing
You can test the service locally before setting up CI/CD:
# Test with a local repository
code-ingestion /path/to/your/local/repo --verbose
# Test with a public repository
code-ingestion https://github.com/spring-projects/spring-boot \
--include "**/*.java" \
--max-files 50 \
--verbose
# Test different providers
code-ingestion /path/to/repo \
--embedding-provider openai \
--vector-store pinecone \
--verbose
⚙️ Setup
Prerequisites
- Python 3.13+
- Pinecone account and API key
- Git (for repository cloning)
Installation
Option 1: Direct Installation (Recommended)
# Install directly from GitHub
pip install git+https://github.com/sandeepgovi/code-ingestion-service.git@main
# Test the installation
code-ingestion --help
Option 2: Development Installation
# Clone and install in development mode
git clone https://github.com/sandeepgovi/code-ingestion-service
cd code-ingestion-service
pip install -e .
# Test the installation
code-ingestion --help
Setup Environment Variables
# Set up required environment variables
export PINECONE_API_KEY=your_api_key_here
export PINECONE_INDEX_NAME=your_index_name
# Or create .env file
echo "PINECONE_API_KEY=your_api_key_here" > .env
echo "PINECONE_INDEX_NAME=your_index_name" >> .env
🔒 Security
Environment Variables
- Never commit .env files or hardcoded secrets to version control
- Use the provided .env.example as a template
- Store API keys and sensitive configuration in environment variables only
Recommended Security Practices
- Review Dependencies: Regularly audit dependencies for vulnerabilities
- Access Control: Limit repository access when processing private repositories
- API Keys: Use read-only API keys when possible, rotate keys regularly
- Local Processing: Sensitive code processing happens locally before embedding
Supported Environment Variables
# Pinecone (default vector store)
PINECONE_API_KEY=your_pinecone_api_key # Required for Pinecone integration
PINECONE_INDEX_NAME=your_index_name # Required for Pinecone integration
PINECONE_BATCH_SIZE=100 # Optional: batch size for uploads
# Embedding providers
EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5 # Default model
OPENAI_API_KEY=your_openai_key # Required for OpenAI embeddings
# Other providers (as you add them)
WEAVIATE_URL=http://localhost:8080 # For Weaviate vector store
QDRANT_URL=http://localhost:6333 # For Qdrant vector store
Security Audit
Run dependency vulnerability checks:
# Install security audit tool
pip install pip-audit
# Check for vulnerabilities
pip-audit
# Or using pipenv
pipenv check
🤝 Contributing
Contributions are welcome! Feel free to submit pull requests or open issues for:
- Bug fixes
- New features
- Documentation improvements
- Additional language support
📄 License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Attribution Requirements
When using this software, you must:
- Include the copyright notice and license in any copy or substantial portion of the software
- State any significant changes made to the original code
- Include attribution to the original author (Sandeep G) in derivative works
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file code_ingestion_service-0.1.0.tar.gz.
File metadata
- Download URL: code_ingestion_service-0.1.0.tar.gz
- Upload date:
- Size: 39.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fa19497a2cb8e61237219eadbf6ca4cfebc790e3d546c1decdd68ab53413346f |
| MD5 | e73afee04fe3bdc218c5333d256d3e42 |
| BLAKE2b-256 | 40b360f50655ec99533b764dca7b780b6abf51d7aa108e53ac710d8173ada201 |
File details
Details for the file code_ingestion_service-0.1.0-py3-none-any.whl.
File metadata
- Download URL: code_ingestion_service-0.1.0-py3-none-any.whl
- Upload date:
- Size: 32.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fac916ad50ddd1bfdae70cdebb2da443994d0eb61e4b1783767868d84b4089a0 |
| MD5 | 64d25ae4061dbce523046ae9763ffec2 |
| BLAKE2b-256 | a495c4ef70dc502fa85c725fecb194e01ab99d72938dec1cfb9ab2254d0e52bc |