gitsource

GitHub repository reader with document chunking for RAG/LLM applications.

Features

  • Download repositories directly from GitHub using codeload.github.com (no git required)
  • Filter files by extension and path patterns
  • Parse frontmatter from markdown files
  • Chunk documents using sliding windows (preserves metadata)
  • Lightweight Jupyter notebook parser
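The download-and-filter approach listed above can be sketched with the standard library alone. This is an illustration, not gitsource's actual implementation: the helper names `codeload_zip_url` and `iter_matching_files` are hypothetical, and the URL pattern assumes GitHub's `codeload.github.com/<owner>/<repo>/zip/refs/heads/<branch>` archive endpoint.

```python
import io
import zipfile


def codeload_zip_url(owner, repo, branch="main"):
    """Build the zip archive URL for a branch (hypothetical helper)."""
    return f"https://codeload.github.com/{owner}/{repo}/zip/refs/heads/{branch}"


def iter_matching_files(zip_bytes, allowed_extensions, filename_filter=None):
    """Yield (path, text) for archive members that pass both filters."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # GitHub archives wrap everything in a "<repo>-<branch>/" folder; strip it.
            _, _, path = info.filename.partition("/")
            extension = path.rsplit(".", 1)[-1].lower() if "." in path else ""
            if extension not in allowed_extensions:
                continue
            if filename_filter is not None and not filename_filter(path):
                continue
            yield path, archive.read(info).decode("utf-8", errors="replace")
```

The archive bytes would come from an HTTP GET on the URL (for example with `urllib.request.urlopen`); because the zip is read from memory, no local git checkout is needed.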

Installation

pip install gitsource
# or
uv add gitsource

Usage

Read GitHub Repository

from gitsource import GithubRepositoryDataReader

reader = GithubRepositoryDataReader(
    repo_owner="evidentlyai",
    repo_name="docs",
    allowed_extensions={"md", "mdx"},
)

files = reader.read()
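Markdown files in documentation repositories often begin with a frontmatter block delimited by `---` lines. As a rough idea of what frontmatter parsing involves, here is a minimal sketch that handles only flat `key: value` pairs. The function `split_frontmatter` is hypothetical and is not part of gitsource's API; real frontmatter is YAML and needs a proper YAML parser for nested values.

```python
def split_frontmatter(text):
    """Return (metadata, body) for a document with optional '---' frontmatter."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        return {}, text  # no closing delimiter: treat everything as body
    metadata = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            # Drop surrounding whitespace and simple quotes around the value.
            metadata[key.strip()] = value.strip().strip("\"'")
    body = "\n".join(lines[end + 1:])
    return metadata, body


doc = "---\ntitle: Quickstart\ndescription: 'First steps'\n---\n# Hello\nBody text."
metadata, body = split_frontmatter(doc)
```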

Process Jupyter Notebooks

from gitsource import GithubRepositoryDataReader, notebook_processor

reader = GithubRepositoryDataReader(
    repo_owner="alexeygrigorev",
    repo_name="gitsource",
    branch="master",
    allowed_extensions={"md", "ipynb"},
    filename_filter=lambda fp: fp.startswith("fixtures/"),
    processors={"ipynb": notebook_processor},  # Convert .ipynb to text
)

files = reader.read()
for file in files:
    print(f"{file.filename}: {file.content[:50]}...")

Chunk Documents

from gitsource import chunk_documents

documents = [
    {"content": "Long text here...", "filename": "doc.txt"}
]

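chunks = chunk_documents(
    documents,
    size=2000,  # length of each sliding window
    step=1000,  # offset between window starts; step < size gives overlapping chunks
)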
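To show how `size` and `step` interact, here is a minimal sliding-window sketch. It assumes windows are measured in characters and that each chunk copies the document's other fields plus a `start` offset. The function `sliding_window_chunks` is hypothetical, and the exact output of `chunk_documents` may differ.

```python
def sliding_window_chunks(documents, size, step, content_key="content"):
    """Split each document's text into overlapping windows, copying its metadata."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    chunks = []
    for doc in documents:
        text = doc[content_key]
        metadata = {k: v for k, v in doc.items() if k != content_key}
        for start in range(0, len(text), step):
            chunk = {**metadata, "start": start, content_key: text[start:start + size]}
            chunks.append(chunk)
            if start + size >= len(text):
                break  # this window already reached the end of the text
    return chunks


docs = [{"content": "abcdefghij", "filename": "doc.txt"}]
chunks = sliding_window_chunks(docs, size=4, step=2)
```

Under these assumptions, `size=2000` with `step=1000` makes each chunk share its second half with the start of the next one, so text near a window boundary appears in two chunks.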

Parse Jupyter Notebooks

from gitsource import load_notebook

notebook = load_notebook("notebook.ipynb")
cells = notebook.cells  # List of cell dictionaries
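For context, an `.ipynb` file is a JSON document whose top-level `cells` list holds dictionaries with a `cell_type` and a `source` field. The sketch below flattens such a file into plain text using only the `json` module. The function `notebook_to_text` is hypothetical and only approximates what a notebook-to-text processor might do; it is not gitsource's `notebook_processor`.

```python
import json


def notebook_to_text(raw_json):
    """Join markdown and code cells of an nbformat-4 notebook into plain text."""
    notebook = json.loads(raw_json)
    parts = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # "source" may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if cell.get("cell_type") in ("markdown", "code") and source.strip():
            parts.append(source)
    return "\n\n".join(parts)


raw = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Intro"]},
        {"cell_type": "code", "source": "print('hi')"},
    ],
})
text = notebook_to_text(raw)
```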

License

WTFPL
