
Code Ingestion Service

A production-ready Python service for intelligently chunking source code and ingesting it into RAG (Retrieval-Augmented Generation) pipelines. Features optimized performance, Pinecone vector database integration, and a powerful CLI for both public and private repositories.

Author: Sandeep G
Copyright: © 2025 Sandeep G
License: Apache License 2.0 - see LICENSE file for details

🚀 Features

Core Capabilities

  • Smart Code Chunking: Single-pass CST (Concrete Syntax Tree) analysis for context-aware code splitting
  • Pluggable Architecture: Swap embedding providers (Nomic, OpenAI) and vector stores (Pinecone, Weaviate)
  • High Performance: Optimized batching for fast embedding generation
  • Production-Ready CLI: Unified interface with provider selection and verbose logging

Language & Platform Support

  • Java: Full CST parsing with REST API detection and method extraction
  • Git Integration: Shallow clone support for public/private repositories
  • Multiple Providers: Nomic (default), OpenAI, HuggingFace embedding support
  • Vector Stores: Pinecone (default), with extensible architecture

Performance & Reliability

  • Optimized Processing: Single-pass CST traversal, method-level context caching
  • Smart Filtering: Include/exclude patterns for selective ingestion
  • Error Handling: Robust processing with cleanup and validation
  • Test Coverage: Comprehensive test suite for all components

🛠️ Core Components

CodeChunker

The main orchestrator for code chunking operations, responsible for:

  • Parsing source code
  • Analyzing Concrete Syntax Trees (CST)
  • Applying chunking strategies
  • Generating structured code chunks with metadata

Key Features

  • Package & Import Handling: Preserves context by maintaining package and import statements
  • Class-Level Chunking: Creates complete class chunks when appropriate
  • Method-Level Chunking: Breaks down classes into method-level chunks when needed
  • Intelligent ID Generation: Creates unique identifiers for each code chunk
  • Metadata Management: Tracks comprehensive metadata for each chunk
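The package's internal API is not shown on this page, but the class-vs-method chunking decision and stable ID generation described above can be sketched with Python's built-in ast module. This is an illustrative sketch only: the service parses Java via CST, and the `MAX_CLASS_LINES` threshold and function names here are assumptions, not the actual implementation.

```python
import ast
import hashlib

MAX_CLASS_LINES = 40  # illustrative threshold, not the service's real setting


def chunk_id(repo: str, path: str, name: str) -> str:
    """Derive a stable, unique identifier for a chunk from its location."""
    return hashlib.sha256(f"{repo}:{path}:{name}".encode()).hexdigest()[:16]


def chunk_source(source: str, repo: str = "demo", path: str = "app.py"):
    """Emit one class-level chunk for small classes, method-level chunks otherwise."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:  # single pass over top-level definitions
        if isinstance(node, ast.ClassDef):
            size = node.end_lineno - node.lineno + 1
            if size <= MAX_CLASS_LINES:
                chunks.append({"id": chunk_id(repo, path, node.name),
                               "kind": "class",
                               "text": ast.get_source_segment(source, node)})
            else:
                for item in node.body:
                    if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                        chunks.append({"id": chunk_id(repo, path, f"{node.name}.{item.name}"),
                                       "kind": "method",
                                       "text": ast.get_source_segment(source, item)})
    return chunks


sample = "class Greeter:\n    def hello(self):\n        return 'hi'\n"
print([c["kind"] for c in chunk_source(sample)])  # → ['class']
```

A small class stays whole; a class over the threshold is broken into per-method chunks, each with its own deterministic ID.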

🏗️ Architecture

The service follows a pluggable architecture with these main components:

  1. Orchestration: Coordinates the complete ingestion pipeline
  2. Chunkers: Handle intelligent code splitting with CST analysis
  3. Embedding Providers: Generate embeddings (Nomic, OpenAI, HuggingFace)
  4. Vector Stores: Store embeddings (Pinecone, Weaviate, Qdrant)
  5. Data Models: Define structured representations for chunks and metadata
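The pluggable design can be pictured as a pair of small interfaces that concrete providers implement, with the orchestrator depending only on the interfaces. The class and method names below are illustrative assumptions, not the package's actual API:

```python
from typing import Protocol


class EmbeddingProvider(Protocol):
    """Anything that turns text chunks into vectors (e.g. Nomic, OpenAI)."""
    def embed(self, texts: list[str]) -> list[list[float]]: ...


class VectorStore(Protocol):
    """Anything that persists vectors with metadata (e.g. Pinecone, Weaviate)."""
    def upsert(self, ids: list[str], vectors: list[list[float]],
               metadata: list[dict]) -> None: ...


class FakeEmbedder:
    """Toy provider: embeds each text as its length, for demonstration only."""
    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]


class InMemoryStore:
    """Toy store backed by a dict, standing in for a real vector database."""
    def __init__(self) -> None:
        self.data: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, ids, vectors, metadata) -> None:
        for i, v, m in zip(ids, vectors, metadata):
            self.data[i] = (v, m)


def ingest(chunks: list[dict], embedder: EmbeddingProvider, store: VectorStore) -> None:
    """Orchestration step: embed every chunk, then upsert in one batch."""
    vectors = embedder.embed([c["text"] for c in chunks])
    store.upsert([c["id"] for c in chunks], vectors,
                 [{"kind": c["kind"]} for c in chunks])


store = InMemoryStore()
ingest([{"id": "c1", "kind": "class", "text": "class A: pass"}],
       FakeEmbedder(), store)
print(len(store.data))  # → 1
```

Because `ingest` only sees the protocols, swapping Nomic for OpenAI or Pinecone for Weaviate means supplying a different implementation, with no change to the pipeline itself.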

💻 Usage

CLI Usage (Recommended)

The CLI provides a simple interface for ingesting repositories into your RAG pipeline:

Basic Usage

# Ingest a local repository
code-ingestion /path/to/your/repo

# Ingest a public GitHub repository
code-ingestion https://github.com/spring-projects/spring-boot

# Ingest with file filtering
code-ingestion https://github.com/kdn251/interviews \
  --include "**/*.java" \
  --exclude "**/test/**" \
  --max-files 50

Advanced Filtering

Include Pattern Examples
# Include specific file types
code-ingestion /path/to/repo --include "**/*.java"
code-ingestion /path/to/repo --include "**/*.py" --include "**/*.js"

# Include specific directories
code-ingestion /path/to/repo --include "src/**/*.java"
code-ingestion /path/to/repo --include "app/models/**/*.py"

# Include multiple patterns for interview prep repos
code-ingestion https://github.com/kdn251/interviews \
  --include "company/**/*.java" \
  --include "leetcode/**/*.java" \
  --include "cracking-the-coding-interview/**/*.java" \
  --max-files 30

# Include only specific file patterns
code-ingestion /path/to/repo \
  --include "src/main/java/com/mycompany/web/*.java" \
  --include "src/main/java/com/mycompany/service/*.java"

Exclude Pattern Examples

# Exclude test directories (common pattern)
code-ingestion /path/to/repo \
  --include "**/*.java" \
  --exclude "**/test/**" \
  --exclude "**/tests/**"

# Exclude build artifacts and dependencies
code-ingestion /path/to/repo \
  --include "**/*.java" \
  --exclude "**/build/**" \
  --exclude "**/target/**" \
  --exclude "**/node_modules/**"

# Exclude specific file types
code-ingestion /path/to/repo \
  --include "**/*.java" \
  --exclude "**/*.class" \
  --exclude "**/*.jar"

# Complex filtering for Spring Boot projects
code-ingestion https://github.com/spring-projects/spring-boot \
  --include "**/*.java" \
  --exclude "**/test/**" \
  --exclude "**/build/**" \
  --exclude "**/target/**" \
  --exclude "**/generated/**"

Pattern Syntax

  • ** - Matches any number of directories
  • * - Matches any number of characters (except path separator)
  • ? - Matches a single character
  • [abc] - Matches any character in the set
  • {a,b} - Matches either pattern a or b
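The semantics above can be reproduced with a small glob-to-regex translator. This is a simplified sketch for illustration; the service's real matcher may differ in edge cases:

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate the glob syntax above into an anchored regular expression."""
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern[i:i + 3] == "**/":
            out.append(r"(?:[^/]+/)*")    # **/ : zero or more directories
            i += 3
        elif pattern[i:i + 2] == "**":
            out.append(r".*")             # bare ** : anything, across separators
            i += 2
        elif ch == "*":
            out.append(r"[^/]*"); i += 1  # * : within one path segment
        elif ch == "?":
            out.append(r"[^/]"); i += 1   # ? : a single non-separator character
        elif ch == "{":
            j = pattern.index("}", i)
            alts = pattern[i + 1:j].split(",")
            out.append("(?:" + "|".join(map(re.escape, alts)) + ")")
            i = j + 1
        elif ch == "[":
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1]); i = j + 1  # character set passes through
        else:
            out.append(re.escape(ch)); i += 1
    return re.compile("".join(out) + r"\Z")


print(bool(glob_to_regex("**/*.java").match("src/main/App.java")))      # → True
print(bool(glob_to_regex("**/test/**").match("src/test/AppTest.java"))) # → True
print(bool(glob_to_regex("src/*.java").match("src/main/App.java")))     # → False
```

Note the key distinction: `*` stops at a path separator, so `src/*.java` does not reach into subdirectories, while `**/*.java` matches at any depth.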

Common Use Cases

# Only process main source code (Java projects)
code-ingestion /path/to/repo \
  --include "src/main/**/*.java" \
  --exclude "**/test/**"

# Multiple languages in one repo
code-ingestion /path/to/repo \
  --include "**/*.java" \
  --include "**/*.py" \
  --include "**/*.js" \
  --exclude "**/test/**" \
  --exclude "**/node_modules/**"

# Focus on specific packages
code-ingestion /path/to/repo \
  --include "src/main/java/com/mycompany/core/**/*.java" \
  --include "src/main/java/com/mycompany/api/**/*.java"

Provider Selection

# Use different embedding providers
code-ingestion /path/to/repo --embedding-provider nomic     # Default
code-ingestion /path/to/repo --embedding-provider openai
code-ingestion /path/to/repo --embedding-provider huggingface

# Use different vector stores  
code-ingestion /path/to/repo --vector-store pinecone        # Default
code-ingestion /path/to/repo --vector-store weaviate

# Enable detailed logging and progress
code-ingestion /path/to/repo --verbose

CLI Options

  • --embedding-provider: Choose embedding provider (nomic, openai, huggingface)
  • --vector-store: Choose vector store (pinecone, weaviate, qdrant)
  • --verbose: Enable detailed logging and progress reports
  • --include: File patterns to include (supports glob patterns like **/*.java)
  • --exclude: File patterns to exclude (default excludes test, build, node_modules, etc.)
  • --max-files: Limit number of files processed (useful for large repos)
  • --cleanup/--no-cleanup: Control temporary file cleanup (default: cleanup enabled)

GitHub Actions Integration (Recommended)

Create automated ingestion workflows for your repositories:

# .github/workflows/ingest-code.yml
name: Ingest Codebase to RAG
on:
  workflow_dispatch:  # Manual trigger
  push:
    branches: [ main ]  # Auto-trigger on main branch updates
  
jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Install Code Ingestion Service
        run: |
          pip install code-ingestion-service
      
      - name: Ingest Current Repository
        run: |
          code-ingestion ./ \
            --include "**/*.java" \
            --include "**/*.py" \
            --exclude "**/test/**" \
            --verbose
        env:
          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
          PINECONE_INDEX_NAME: ${{ secrets.PINECONE_INDEX_NAME }}

  # Optional: Ingest external repositories
  ingest-external:
    runs-on: ubuntu-latest
    steps:
      - name: Install Code Ingestion Service
        run: |
          pip install code-ingestion-service
          
      - name: Ingest Spring Boot (example)
        run: |
          code-ingestion https://github.com/spring-projects/spring-boot \
            --include "**/*.java" \
            --max-files 200 \
            --verbose
        env:
          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
          PINECONE_INDEX_NAME: ${{ secrets.PINECONE_INDEX_NAME }}

Multi-Repository Workflow

# Ingest multiple repositories in one workflow
name: Build Knowledge Base
on:
  workflow_dispatch:
  schedule:
    - cron: '0 2 * * 0'  # Weekly on Sunday at 2 AM

jobs:
  ingest-repositories:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        repo: 
          - 'https://github.com/spring-projects/spring-boot'
          - 'https://github.com/apache/kafka' 
          - 'your-org/internal-repo'
    steps:
      - name: Install Code Ingestion Service
        run: |
          pip install code-ingestion-service
          
      - name: Ingest Repository
        run: |
          code-ingestion ${{ matrix.repo }} \
            --include "**/*.java" \
            --exclude "**/test/**" \
            --max-files 500 \
            --verbose
        env:
          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
          PINECONE_INDEX_NAME: ${{ secrets.PINECONE_INDEX_NAME }}

Local Development & Testing

You can test the service locally before setting up CI/CD:

# Test with a local repository
code-ingestion /path/to/your/local/repo --verbose

# Test with a public repository
code-ingestion https://github.com/spring-projects/spring-boot \
  --include "**/*.java" \
  --max-files 50 \
  --verbose

# Test different providers
code-ingestion /path/to/repo \
  --embedding-provider openai \
  --vector-store pinecone \
  --verbose

⚙️ Setup

Prerequisites

  • Python 3.13+
  • Pinecone account and API key
  • Git (for repository cloning)

Installation

Option 1: PyPI Installation (Recommended)

# Install from PyPI (simplest method)
pip install code-ingestion-service

# Test the installation
code-ingestion --help

Option 2: GitHub Installation

# Install directly from GitHub (latest development version)
pip install git+https://github.com/sandeepgovi/code-ingestion-service.git@main

# Test the installation
code-ingestion --help

Option 3: Development Installation

# Clone and install in development mode
git clone https://github.com/sandeepgovi/code-ingestion-service
cd code-ingestion-service
pip install -e .

# Test the installation  
code-ingestion --help

Setup Environment Variables

# Set up required environment variables
export PINECONE_API_KEY=your_api_key_here
export PINECONE_INDEX_NAME=your_index_name

# Or create .env file
echo "PINECONE_API_KEY=your_api_key_here" > .env
echo "PINECONE_INDEX_NAME=your_index_name" >> .env

🔒 Security

Environment Variables

  • Never commit .env files or hardcoded secrets to version control
  • Use the provided .env.example as a template
  • Store API keys and sensitive configuration in environment variables only

Recommended Security Practices

  • Review Dependencies: Regularly audit dependencies for vulnerabilities
  • Access Control: Limit repository access when processing private repositories
  • API Keys: Use read-only API keys when possible, rotate keys regularly
  • Local Processing: Sensitive code processing happens locally before embedding

Supported Environment Variables

# Pinecone (default vector store)
PINECONE_API_KEY=your_pinecone_api_key        # Required for Pinecone integration
PINECONE_INDEX_NAME=your_index_name           # Required for Pinecone integration
PINECONE_BATCH_SIZE=100                       # Optional: batch size for uploads

# Embedding providers
EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5  # Default model
OPENAI_API_KEY=your_openai_key                # Required for OpenAI embeddings

# Other providers (as you add them)
WEAVIATE_URL=http://localhost:8080            # For Weaviate vector store
QDRANT_URL=http://localhost:6333              # For Qdrant vector store
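A minimal sketch of how a service like this might validate its configuration at startup. The variable names match the table above; the function itself and its defaults are illustrative, not the package's actual startup code:

```python
import os

REQUIRED = ["PINECONE_API_KEY", "PINECONE_INDEX_NAME"]


def load_config(env=os.environ) -> dict:
    """Read settings from the environment, failing fast on missing keys."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}")
    return {
        "api_key": env["PINECONE_API_KEY"],
        "index_name": env["PINECONE_INDEX_NAME"],
        "batch_size": int(env.get("PINECONE_BATCH_SIZE", "100")),
        "embedding_model": env.get("EMBEDDING_MODEL",
                                   "nomic-ai/nomic-embed-text-v1.5"),
    }


cfg = load_config({"PINECONE_API_KEY": "k", "PINECONE_INDEX_NAME": "idx"})
print(cfg["batch_size"])  # → 100
```

Failing fast on missing required keys gives a clear error at startup instead of a cryptic failure mid-ingestion.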

Security Audit

Run dependency vulnerability checks:

# Install security audit tool
pip install pip-audit

# Check for vulnerabilities
pip-audit

# Or using pipenv
pipenv check

🤝 Contributing

Contributions are welcome! Feel free to submit pull requests or open issues for:

  • Bug fixes
  • New features
  • Documentation improvements
  • Additional language support

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Attribution Requirements

When using this software, you must:

  • Include the copyright notice and license in any copy or substantial portion of the software
  • State any significant changes made to the original code
  • Include attribution to the original author (Sandeep G) in derivative works
