gitsource
GitHub repository reader with document chunking for RAG/LLM applications.
Features
- Download repositories directly from GitHub using codeload.github.com (no git required)
- Filter files by extension and path patterns
- Parse YAML frontmatter from markdown files
- Chunk documents using sliding windows (preserves metadata)
- Lightweight Jupyter notebook parser
Installation
pip install gitsource
# or
uv add gitsource
Usage
Read GitHub Repository
from gitsource import GithubRepositoryDataReader
reader = GithubRepositoryDataReader(
    repo_owner="evidentlyai",
    repo_name="docs",
    allowed_extensions={"md", "mdx"},
)
files = reader.read()
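The features list says repositories are fetched from codeload.github.com rather than through git. As a rough illustration of that idea, and not gitsource's actual implementation, the sketch below builds an archive URL and reads matching files from a zip archive held in memory. The helper names `codeload_url` and `read_zip_files` are hypothetical, and the default branch name `main` is an assumption.

```python
import io
import zipfile
from pathlib import PurePosixPath


def codeload_url(owner: str, repo: str, branch: str = "main") -> str:
    """Build the zip archive URL for a branch (hypothetical helper)."""
    return f"https://codeload.github.com/{owner}/{repo}/zip/refs/heads/{branch}"


def read_zip_files(archive_bytes: bytes, allowed_extensions: set) -> dict:
    """Return {path: text} for files whose extension is in allowed_extensions.

    GitHub archives wrap everything in one top-level directory,
    so the first path component is dropped.
    """
    files = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Strip the top-level directory that the archive adds.
            _, _, path = info.filename.partition("/")
            ext = PurePosixPath(path).suffix.lstrip(".")
            if ext in allowed_extensions:
                files[path] = zf.read(info).decode("utf-8", errors="replace")
    return files


# Downloading is left to the caller, for example:
#   import urllib.request
#   data = urllib.request.urlopen(codeload_url("evidentlyai", "docs")).read()
#   files = read_zip_files(data, {"md", "mdx"})
```

Keeping the archive in memory avoids a temporary checkout, which is presumably why no local git installation is needed.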
Parse Frontmatter
from gitsource import GithubRepositoryDataReader
reader = GithubRepositoryDataReader(
    repo_owner="alexeygrigorev",
    repo_name="gitsource",
    allowed_extensions={"md"},
)
files = reader.read()
# Parse YAML frontmatter from markdown files
for file in files:
    data = file.parse()
    print(f"{data['filename']}: {data.get('title', 'No title')}")
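To make the frontmatter step more concrete, here is a minimal sketch of splitting a markdown document into metadata and body. It is an illustration only: `split_frontmatter` is a hypothetical helper, not part of gitsource, and it handles only flat `key: value` lines. Real YAML (lists, nesting, quoting rules) needs a proper YAML parser.

```python
def split_frontmatter(text: str):
    """Split a markdown document into (metadata, body).

    Simplification: only flat `key: value` lines are understood.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block

    metadata = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return metadata, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip().strip("\"'")
    return {}, text  # opening delimiter was never closed


doc = "---\ntitle: Quickstart\norder: 1\n---\n# Hello\nBody text."
meta, body = split_frontmatter(doc)
print(meta)  # {'title': 'Quickstart', 'order': '1'}
print(body)
```

Note that this simplified version keeps every value as a string, whereas a YAML parser would return `order` as an integer.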
Process Jupyter Notebooks
from gitsource import GithubRepositoryDataReader, notebook_processor
reader = GithubRepositoryDataReader(
    repo_owner="alexeygrigorev",
    repo_name="gitsource",
    branch="master",
    allowed_extensions={"md", "ipynb"},
    filename_filter=lambda fp: fp.startswith("fixtures/"),
    processors={"ipynb": notebook_processor},  # Convert .ipynb to text
)
files = reader.read()
for file in files:
    print(f"{file.filename}: {file.content[:50]}...")
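For a sense of what converting a notebook to text can involve, the sketch below flattens an `.ipynb` file using only the standard library. It is an assumption about the general approach, not the actual behavior of `notebook_processor`: the helper name `notebook_to_text` is hypothetical, and the choice to keep markdown and code cells while ignoring outputs is made for illustration.

```python
import json


def notebook_to_text(raw: str) -> str:
    """Flatten an .ipynb JSON string into plain text (hypothetical helper)."""
    notebook = json.loads(raw)
    parts = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if not source.strip():
            continue  # skip empty cells
        if cell.get("cell_type") in ("markdown", "code"):
            parts.append(source)
    # Separate cells with a blank line so they stay distinguishable.
    return "\n\n".join(parts)


raw = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Intro"]},
        {"cell_type": "code", "source": "print(1)", "outputs": []},
        {"cell_type": "code", "source": "", "outputs": []},
    ]
})
print(notebook_to_text(raw))
```

Dropping cell outputs keeps the text small, at the cost of losing printed results that might be useful for retrieval.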
Chunk Documents
from gitsource import chunk_documents
documents = [
    {"content": "Long text here...", "filename": "doc.txt"}
]
chunks = chunk_documents(
    documents,
    size=2000,
    step=1000,
)
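The sliding-window idea behind `size` and `step` can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: `sliding_window_chunks` is a hypothetical function, windows are measured in characters, and the added `start` key is an invented field that records each chunk's offset.

```python
def sliding_window_chunks(documents, size, step, content_key="content"):
    """Split each document's text into overlapping chunks, copying metadata."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    chunks = []
    for doc in documents:
        text = doc[content_key]
        # Everything except the text itself is treated as metadata.
        metadata = {k: v for k, v in doc.items() if k != content_key}
        for start in range(0, len(text), step):
            chunk = dict(metadata)
            chunk["start"] = start
            chunk[content_key] = text[start:start + size]
            chunks.append(chunk)
            if start + size >= len(text):
                break  # the window already reached the end of the text
    return chunks


docs = [{"content": "abcdefghij", "filename": "doc.txt"}]
for chunk in sliding_window_chunks(docs, size=4, step=2):
    print(chunk["start"], chunk["content"], chunk["filename"])
# 0 abcd doc.txt
# 2 cdef doc.txt
# 4 efgh doc.txt
# 6 ghij doc.txt
```

In this sketch, consecutive chunks overlap by `size - step` characters, so `size=2000` with `step=1000` would give a 1000-character overlap.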
License
WTFPL
Download files
Source Distribution
gitsource-0.0.5.tar.gz (73.8 kB)
Built Distribution
gitsource-0.0.5-py3-none-any.whl (9.0 kB)
File details
Details for the file gitsource-0.0.5.tar.gz.
File metadata
- Download URL: gitsource-0.0.5.tar.gz
- Upload date:
- Size: 73.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 98e1bd7c4da6abe477071f876945c1203b35dc9ef10dc1f842cad4d5953ad3cb |
| MD5 | 39ec12acf4aa78a4db65e062d1e33367 |
| BLAKE2b-256 | b93b52da10f861beeec427600010de4506d89f6da9126e0b9499650fad5c0c22 |
File details
Details for the file gitsource-0.0.5-py3-none-any.whl.
File metadata
- Download URL: gitsource-0.0.5-py3-none-any.whl
- Upload date:
- Size: 9.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a6030f61e264628606f8438b8cb633046ae158757aa242504a4c0fc4630042c |
| MD5 | d3d0a5f81508a9ebac5a20c784441dbc |
| BLAKE2b-256 | f17809a6508d8cd50d730c15368af4f1950236f3adce29c38502f79cc8710600 |