gitsource

GitHub repository reader with document chunking for RAG/LLM applications.

Features

  • Download repositories directly from GitHub using codeload.github.com (no git required)
  • Filter files by extension and path patterns
  • Parse YAML frontmatter from markdown files
  • Chunk documents using sliding windows (preserves metadata)
  • Lightweight Jupyter notebook parser

Installation

pip install gitsource
# or
uv add gitsource

Usage

Read GitHub Repository

from gitsource import GithubRepositoryDataReader

reader = GithubRepositoryDataReader(
    repo_owner="evidentlyai",
    repo_name="docs",
    allowed_extensions={"md", "mdx"},
)

files = reader.read()
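
For readers curious how a download without git can work, here is a minimal sketch, not the library's actual implementation. It assumes the branch archive is served at the `codeload.github.com/<owner>/<repo>/zip/refs/heads/<branch>` URL and that every entry sits under a single top-level folder. The names `codeload_zip_url` and `read_zip_files` are hypothetical helpers invented for this illustration.

```python
import io
import zipfile


def codeload_zip_url(owner: str, repo: str, branch: str = "main") -> str:
    """Build the codeload URL that serves one branch as a zip archive."""
    return f"https://codeload.github.com/{owner}/{repo}/zip/refs/heads/{branch}"


def read_zip_files(zip_bytes: bytes, allowed_extensions: set[str]) -> dict[str, str]:
    """Return {path: text} for archive members whose extension is allowed.

    Hypothetical helper: it mirrors the idea of `allowed_extensions`,
    not gitsource's real code.
    """
    files: dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Assumption: the archive wraps everything in one top-level
            # folder, which is dropped to get a repository-relative path.
            _, _, path = info.filename.partition("/")
            extension = path.rsplit(".", 1)[-1].lower() if "." in path else ""
            if extension not in allowed_extensions:
                continue
            files[path] = archive.read(info).decode("utf-8", errors="replace")
    return files
```

The archive bytes could be fetched with `urllib.request.urlopen(codeload_zip_url("evidentlyai", "docs")).read()` and passed to `read_zip_files` together with `{"md", "mdx"}`.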

Parse Frontmatter

from gitsource import GithubRepositoryDataReader

reader = GithubRepositoryDataReader(
    repo_owner="alexeygrigorev",
    repo_name="gitsource",
    allowed_extensions={"md"},
)
files = reader.read()

# Parse YAML frontmatter from markdown files
for file in files:
    data = file.parse()
    print(f"{data['filename']}: {data.get('title', 'No title')}")
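
To make the frontmatter step concrete, here is a simplified sketch of the idea, under stated assumptions. It only handles flat `key: value` lines between two `---` delimiters, whereas real YAML frontmatter needs a proper YAML parser. `split_frontmatter` and `parse_document` are hypothetical names, and the `content` key is an assumption borrowed from the chunking example.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a markdown document into (metadata, body).

    Simplification: only flat `key: value` pairs are understood.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at the top

    # Look for the closing delimiter after the opening one.
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        return {}, text  # opening delimiter without a closing one

    metadata: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            # Drop surrounding whitespace and simple quotes.
            metadata[key.strip()] = value.strip().strip("'\"")

    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return metadata, body


def parse_document(filename: str, text: str) -> dict[str, str]:
    """Merge frontmatter fields with the body and the filename."""
    metadata, body = split_frontmatter(text)
    return {**metadata, "content": body, "filename": filename}
```

With this shape, a lookup such as `data.get("title", "No title")` falls back to the default for files that have no frontmatter.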

Process Jupyter Notebooks

from gitsource import GithubRepositoryDataReader, notebook_processor

reader = GithubRepositoryDataReader(
    repo_owner="alexeygrigorev",
    repo_name="gitsource",
    branch="master",
    allowed_extensions={"md", "ipynb"},
    filename_filter=lambda fp: fp.startswith("fixtures/"),
    processors={"ipynb": notebook_processor},  # Convert .ipynb to text
)

files = reader.read()
for file in files:
    print(f"{file.filename}: {file.content[:50]}...")
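
A processor that converts `.ipynb` files to text could look roughly like the sketch below. This is not the code behind `notebook_processor`. It assumes the standard notebook JSON layout (a top-level `cells` list whose entries have `cell_type` and `source`), and `notebook_to_text` is a hypothetical name.

```python
import json


def notebook_to_text(raw: str) -> str:
    """Concatenate the markdown and code cells of a notebook into plain text."""
    notebook = json.loads(raw)
    parts: list[str] = []
    for cell in notebook.get("cells", []):
        # Keep only human-readable cells; skip raw cells and unknown types.
        if cell.get("cell_type") not in {"markdown", "code"}:
            continue
        source = cell.get("source", "")
        # `source` may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            parts.append(source.strip())
    return "\n\n".join(parts)
```

Cell outputs are ignored here on purpose, since they are often large and add little for retrieval; whether the library does the same is not stated in this README.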

Chunk Documents

from gitsource import chunk_documents

documents = [
    {"content": "Long text here...", "filename": "doc.txt"}
]

chunks = chunk_documents(
    documents,
    size=2000,  # length of each window
    step=1000,  # offset between window starts; windows overlap by size - step
)
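
The sliding-window idea behind chunking can be sketched as follows. This is an illustration under assumptions, not the library's implementation: it measures `size` and `step` in characters, and the `start` key as well as the names `sliding_window` and `chunk_documents_sketch` are hypothetical.

```python
def sliding_window(text: str, size: int, step: int) -> list[dict]:
    """Cut `text` into windows of `size` characters, advancing by `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start": start, "content": text[start:start + size]})
        # Stop once a window reaches the end of the text.
        if start + size >= len(text):
            break
    return chunks


def chunk_documents_sketch(documents: list[dict], size: int = 2000, step: int = 1000) -> list[dict]:
    """Chunk each document's `content` and copy its other fields onto every chunk."""
    results = []
    for doc in documents:
        # Everything except the text itself is treated as metadata.
        metadata = {key: value for key, value in doc.items() if key != "content"}
        for chunk in sliding_window(doc["content"], size, step):
            results.append({**metadata, **chunk})
    return results
```

With `size=2000` and `step=1000`, each window shares its second half with the next one, so a sentence that straddles a boundary still appears whole in at least one chunk.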

License

WTFPL
