SDK for building custom Glean indexing integrations

These details have not been verified by PyPI

Project links

Source Code

Project description

Glean Indexing SDK

A Python SDK for building custom Glean indexing connectors. It provides base classes and utilities for creating connectors that fetch data from external systems and upload it to Glean's indexing APIs.

Requirements

Python >= 3.10
A Glean instance and an indexing API token

Installation

pip install glean-indexing-sdk

Key Concepts

Every connector has two parts:

DataClient — fetches raw data from your external system (API, database, files)
Connector — transforms that data into Glean's format and uploads it

The workflow is: fetch → transform → upload. You implement get_source_data() on your data client and transform() on your connector; the SDK handles batching and upload.

See Architecture overview for a data flow diagram and the full class hierarchy.
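The fetch → transform → upload flow can be pictured with a small, SDK-free sketch. Everything here is a hypothetical stand-in: `run_pipeline`, `batched`, and the lambdas are illustrative names, not part of the SDK, and the batch size of 2 is arbitrary. It only shows the shape of the work the SDK takes over once `get_source_data()` and `transform()` are implemented.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence


def batched(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def run_pipeline(
    fetch: Callable[[], Sequence[dict]],
    transform: Callable[[Sequence[dict]], List[dict]],
    upload: Callable[[List[dict]], None],
    batch_size: int = 2,
) -> int:
    """Fetch raw records, transform them, and upload the results in batches."""
    documents = transform(fetch())
    for batch in batched(documents, batch_size):
        upload(batch)
    return len(documents)


# Hypothetical stand-ins for a data client, a connector, and the upload call.
raw_pages = [{"id": f"page_{n}", "title": f"Page {n}"} for n in range(5)]
uploaded_batches: List[List[dict]] = []

total = run_pipeline(
    fetch=lambda: raw_pages,
    transform=lambda pages: [{"id": p["id"], "title": p["title"].upper()} for p in pages],
    upload=uploaded_batches.append,
)
print(total, [len(batch) for batch in uploaded_batches])  # 5 [2, 2, 1]
```

In a real connector only the fetch and transform steps are yours to write; the batching and upload steps shown here are handled by the SDK.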

Quickstart

1. Set up credentials

export GLEAN_SERVER_URL="https://your-company-be.glean.com"
export GLEAN_INDEXING_API_TOKEN="your-indexing-api-token"

# Deprecated: GLEAN_INSTANCE is still accepted as a legacy fallback
# export GLEAN_INSTANCE="acme"
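Before running a connector it can help to fail fast when a credential is missing. The following is a minimal sketch that checks only the two variables named above; `missing_glean_env` is a hypothetical helper, not an SDK function, and it deliberately ignores the deprecated `GLEAN_INSTANCE` fallback.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("GLEAN_SERVER_URL", "GLEAN_INDEXING_API_TOKEN")


def missing_glean_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {"GLEAN_SERVER_URL": "https://your-company-be.glean.com"}
print(missing_glean_env(example_env))  # ['GLEAN_INDEXING_API_TOKEN']
```

Calling `missing_glean_env()` with no argument checks the current process environment.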

2. Build a connector

This complete example defines a data type, a data client, and a connector, then indexes everything into Glean:

from typing import List, Sequence, TypedDict

from glean.indexing.connectors import BaseConnectorDataClient, BaseDatasourceConnector
from glean.indexing.models import (
    ContentDefinition,
    CustomDatasourceConfig,
    DocumentDefinition,
    IndexingMode,
    UserReferenceDefinition,
)


class WikiPageData(TypedDict):
    id: str
    title: str
    content: str
    author: str
    created_at: str
    updated_at: str
    url: str
    tags: List[str]


class WikiDataClient(BaseConnectorDataClient[WikiPageData]):
    def __init__(self, wiki_base_url: str, api_token: str):
        self.wiki_base_url = wiki_base_url
        self.api_token = api_token

    def get_source_data(self, since=None) -> Sequence[WikiPageData]:
        # Static sample data; a real client would fetch pages from the wiki's API
        return [
            {
                "id": "page_123",
                "title": "Engineering Onboarding Guide",
                "content": "Welcome to the engineering team...",
                "author": "jane.smith@company.com",
                "created_at": "2024-01-15T10:00:00Z",
                "updated_at": "2024-02-01T14:30:00Z",
                "url": f"{self.wiki_base_url}/pages/123",
                "tags": ["onboarding", "engineering"],
            },
            {
                "id": "page_124",
                "title": "API Documentation Standards",
                "content": "Our standards for API documentation...",
                "author": "john.doe@company.com",
                "created_at": "2024-01-20T09:15:00Z",
                "updated_at": "2024-01-25T16:45:00Z",
                "url": f"{self.wiki_base_url}/pages/124",
                "tags": ["api", "documentation", "standards"],
            },
        ]


class CompanyWikiConnector(BaseDatasourceConnector[WikiPageData]):
    configuration: CustomDatasourceConfig = CustomDatasourceConfig(
        name="company_wiki",
        display_name="Company Wiki",
        url_regex=r"https://wiki\.company\.com/.*",
        trust_url_regex_for_view_activity=True,
        is_user_referenced_by_email=True,
    )

    def transform(self, data: Sequence[WikiPageData]) -> List[DocumentDefinition]:
        documents = []
        for page in data:
            documents.append(
                DocumentDefinition(
                    id=page["id"],
                    title=page["title"],
                    datasource=self.name,
                    view_url=page["url"],
                    body=ContentDefinition(mime_type="text/plain", text_content=page["content"]),
                    author=UserReferenceDefinition(email=page["author"]),
                    created_at=self._parse_timestamp(page["created_at"]),
                    updated_at=self._parse_timestamp(page["updated_at"]),
                    tags=page["tags"],
                )
            )
        return documents

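    def _parse_timestamp(self, timestamp_str: str) -> int:
        """Convert an ISO 8601 timestamp string to epoch seconds."""
        from datetime import datetime

        # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
        # so rewrite it as an explicit UTC offset first.
        dt = datetime.fromisoformat(timestamp_str.replace("Z", "+00:00"))
        return int(dt.timestamp())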


data_client = WikiDataClient(wiki_base_url="https://wiki.company.com", api_token="your-wiki-token")
connector = CompanyWikiConnector(name="company_wiki", data_client=data_client)
connector.configure_datasource()
connector.index_data(mode=IndexingMode.FULL)
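Two details of the example above are easy to check in isolation: the timestamp conversion and the URL pattern. The sketch below repeats the `_parse_timestamp` logic as a standalone function and tests the sample URLs against the `url_regex` value with `re.fullmatch`. Using `re.fullmatch` is an assumption made for illustration; this README does not say how Glean applies the pattern.

```python
import re
from datetime import datetime, timezone


def parse_timestamp(timestamp_str: str) -> int:
    """Convert an ISO 8601 string with a trailing 'Z' to epoch seconds."""
    dt = datetime.fromisoformat(timestamp_str.replace("Z", "+00:00"))
    return int(dt.timestamp())


created = parse_timestamp("2024-01-15T10:00:00Z")
print(created)  # 1705312800
print(datetime.fromtimestamp(created, tz=timezone.utc).isoformat())  # 2024-01-15T10:00:00+00:00

# Same pattern as the url_regex in the connector configuration above.
url_regex = re.compile(r"https://wiki\.company\.com/.*")
print(bool(url_regex.fullmatch("https://wiki.company.com/pages/123")))  # True
print(bool(url_regex.fullmatch("https://wiki.example.org/pages/123")))  # False
```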

Connector Types

Connector	Data Client	Best For
`BaseDatasourceConnector`	`BaseDataClient`	Small-to-medium datasets that fit in memory. Wikis, knowledge bases, file systems.
`BaseStreamingDatasourceConnector`	`BaseStreamingDataClient`	Large or paginated datasets where you need to limit memory usage. Uses sync generators.
`BaseAsyncStreamingDatasourceConnector`	`BaseAsyncStreamingDataClient`	Large datasets with async APIs (aiohttp, httpx async). Non-blocking I/O.
`BasePeopleConnector`	—	Employee and identity data indexing.

For detailed guidance on choosing between these, see the decision matrix.
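To illustrate why generators help with large or paginated sources, here is a minimal, SDK-free sketch of lazy pagination. `FAKE_PAGES`, `fetch_page`, and `iter_records` are hypothetical names; the method a streaming data client must actually implement is not shown in this README, so treat this only as the general pattern.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical in-memory "API": each token maps to (records, next_page_token).
FAKE_PAGES: Dict[Optional[str], Tuple[List[dict], Optional[str]]] = {
    None: ([{"id": "page_1"}, {"id": "page_2"}], "t1"),
    "t1": ([{"id": "page_3"}, {"id": "page_4"}], "t2"),
    "t2": ([{"id": "page_5"}], None),
}


def fetch_page(token: Optional[str]) -> Tuple[List[dict], Optional[str]]:
    """Stand-in for one paginated API call."""
    return FAKE_PAGES[token]


def iter_records() -> Iterator[dict]:
    """Yield records one at a time, holding only a single page in memory."""
    token: Optional[str] = None
    while True:
        records, token = fetch_page(token)
        yield from records
        if token is None:
            break


ids = [record["id"] for record in iter_records()]
print(ids)  # ['page_1', 'page_2', 'page_3', 'page_4', 'page_5']
```

Because the consumer pulls records as it needs them, memory use is bounded by the page size rather than the size of the whole dataset.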

Indexing Modes

IndexingMode.FULL — Re-indexes all documents. Use for initial loads or when you need a complete refresh.
IndexingMode.INCREMENTAL — Only indexes documents modified since the last crawl. Use for scheduled updates to minimize API calls.

connector.index_data(mode=IndexingMode.FULL)         # full re-index
connector.index_data(mode=IndexingMode.INCREMENTAL)   # only changes since last run
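For incremental runs, the data client is the natural place to skip unchanged records, since `get_source_data()` in the quickstart accepts a `since` argument. The sketch below shows one way such a filter could look. It assumes `since` is an ISO 8601 string, which this README does not confirm, and `PageStub` and `filter_modified_since` are hypothetical names.

```python
from datetime import datetime
from typing import List, Optional, Sequence, TypedDict


class PageStub(TypedDict):
    id: str
    updated_at: str


def _to_datetime(value: str) -> datetime:
    """Parse an ISO 8601 string, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def filter_modified_since(pages: Sequence[PageStub], since: Optional[str]) -> List[PageStub]:
    """Return all pages when `since` is None, else only those updated after it."""
    if since is None:
        return list(pages)
    cutoff = _to_datetime(since)
    return [page for page in pages if _to_datetime(page["updated_at"]) > cutoff]


pages: List[PageStub] = [
    {"id": "page_123", "updated_at": "2024-02-01T14:30:00Z"},
    {"id": "page_124", "updated_at": "2024-01-25T16:45:00Z"},
]
changed = filter_modified_since(pages, since="2024-01-31T00:00:00Z")
print([page["id"] for page in changed])  # ['page_123']
```

Where the source system supports it, passing the cutoff to the API as a query filter is usually cheaper than fetching everything and filtering locally.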

Testing

The SDK includes a ConnectorTestHarness that lets you validate your connector without making real API calls. It intercepts uploads and captures the documents your connector produces so you can assert on them.

from glean.indexing.connectors import ConnectorTestHarness

harness = ConnectorTestHarness(connector)
harness.run()

validator = harness.get_validator()
validator.assert_documents_posted(count=2)

# Inspect individual documents
for doc in validator.documents_posted:
    print(doc.title)

Contributing

This project uses mise for toolchain management and uv for Python dependencies.

mise run setup              # create venv and install dependencies
mise run test               # run all tests
mise run lint               # run all linters (ruff, pyright, markdown-code)
mise run lint:fix           # auto-fix lint issues and format code

Next Steps

Architecture overview — data flow diagram and component hierarchy
Streaming connectors — sync and async streaming walkthroughs
Advanced usage — connector selection guide, forced restart uploads

Release history

Version	Released	Notes
1.0.0b2	Apr 23, 2026	pre-release
1.0.0b1	Mar 5, 2026	pre-release; this version
1.0.0b0	Feb 23, 2026	pre-release
0.3.1	Feb 5, 2026
0.3.0	Feb 4, 2026
0.2.0	Jul 24, 2025
0.1.0	Jul 23, 2025
0.0.3	Jun 20, 2025

Download files

Source Distribution

glean_indexing_sdk-1.0.0b1.tar.gz (134.7 kB)

Uploaded Mar 5, 2026 Source

Built Distribution

glean_indexing_sdk-1.0.0b1-py3-none-any.whl (52.7 kB)

Uploaded Mar 5, 2026 Python 3

File details

Details for the file glean_indexing_sdk-1.0.0b1.tar.gz.

File metadata

Download URL: glean_indexing_sdk-1.0.0b1.tar.gz
Upload date: Mar 5, 2026
Size: 134.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for glean_indexing_sdk-1.0.0b1.tar.gz
Algorithm	Hash digest
SHA256	`98486e8e5e90abdcc367c06f1a8178d5ac28a2b495f1b10097a4f694e5a11f63`
MD5	`22808b975995e49690d873c8b353fefc`
BLAKE2b-256	`907c1c14bda3b2e56b30651d9ed1d9b00823422039bccb823c561853445ed329`

See more details on using hashes here.
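A downloaded archive can be compared against the published SHA256 digest with the standard library alone. This is a minimal sketch: `sha256_of` and `matches_published_hash` are hypothetical helper names, and the archive path assumes the file sits in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published value, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 value from the table above for the source distribution.
EXPECTED_SDIST_SHA256 = "98486e8e5e90abdcc367c06f1a8178d5ac28a2b495f1b10097a4f694e5a11f63"

archive = Path("glean_indexing_sdk-1.0.0b1.tar.gz")  # assumed local download location
if archive.exists():
    print(matches_published_hash(archive, EXPECTED_SDIST_SHA256))
```

pip can also enforce hashes at install time through a requirements file used with `--require-hashes`.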

Provenance

The following attestation bundles were made for glean_indexing_sdk-1.0.0b1.tar.gz:

Publisher: publish.yml on gleanwork/glean-indexing-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: glean_indexing_sdk-1.0.0b1.tar.gz
- Subject digest: 98486e8e5e90abdcc367c06f1a8178d5ac28a2b495f1b10097a4f694e5a11f63
- Sigstore transparency entry: 1044639013
- Sigstore integration time: Mar 5, 2026
Source repository:
- Permalink: gleanwork/glean-indexing-sdk@eb80e24ee637591cf77d1436e1f45ee7116518ef
- Branch / Tag: refs/tags/v1.0.0b1
- Owner: https://github.com/gleanwork
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@eb80e24ee637591cf77d1436e1f45ee7116518ef
- Trigger Event: push

File details

Details for the file glean_indexing_sdk-1.0.0b1-py3-none-any.whl.

File metadata

Download URL: glean_indexing_sdk-1.0.0b1-py3-none-any.whl
Upload date: Mar 5, 2026
Size: 52.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for glean_indexing_sdk-1.0.0b1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dce71f11f87867e539e21c72845fea18126ef04cd1e8abdbfbdbfb6e1152118e`
MD5	`41796b0673b5b0fffeddd68b8eb0099b`
BLAKE2b-256	`0238b5a10ac934361aa113faef716d79680c36cc0fb8d80e771dffedfeeaa882`

See more details on using hashes here.

Provenance

The following attestation bundles were made for glean_indexing_sdk-1.0.0b1-py3-none-any.whl:

Publisher: publish.yml on gleanwork/glean-indexing-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: glean_indexing_sdk-1.0.0b1-py3-none-any.whl
- Subject digest: dce71f11f87867e539e21c72845fea18126ef04cd1e8abdbfbdbfb6e1152118e
- Sigstore transparency entry: 1044639054
- Sigstore integration time: Mar 5, 2026
Source repository:
- Permalink: gleanwork/glean-indexing-sdk@eb80e24ee637591cf77d1436e1f45ee7116518ef
- Branch / Tag: refs/tags/v1.0.0b1
- Owner: https://github.com/gleanwork
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@eb80e24ee637591cf77d1436e1f45ee7116518ef
- Trigger Event: push

glean-indexing-sdk 1.0.0b1
