
Connector components for the Sayou Data Platform


sayou-connector


sayou-connector is the universal data ingestion engine for the Sayou Data Platform. It provides a unified interface to fetch data from diverse sources—Local Files, Web URLs, APIs, and Databases.

Unlike simple HTTP clients or file readers, sayou-connector is designed as a Recursive Crawling Engine. It separates the logic of "What to fetch" (Navigation) from "How to fetch" (Transport), allowing for complex, stateful data collection strategies like web crawling or database pagination.

Philosophy

"Navigate First, Fetch Later." Data collection is not a one-off task; it's a discovery process. We define two distinct roles:

  1. Generator (Navigator): Determines the next target (e.g., finds the next page URL, calculates DB offset).
  2. Fetcher (Driver): Executes the retrieval (e.g., sends HTTP GET, executes SQL).

This separation makes the pipeline broadly extensible, from a simple file walker to an AI-powered web crawler.
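As a rough sketch of this split (all class and method names here are illustrative, not the library's actual API), a Generator proposes the next tasks while a Fetcher retrieves one task at a time, and a small loop drives them:

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Task:
    uri: str
    depth: int = 0


class Generator:
    """Navigator: decides what to fetch next."""
    def next_tasks(self, completed: Task) -> List[Task]:
        raise NotImplementedError


class Fetcher:
    """Driver: executes the retrieval for a single task."""
    def fetch(self, task: Task) -> str:
        raise NotImplementedError


class PageNumberGenerator(Generator):
    """Toy navigator that paginates up to a fixed last page."""
    def __init__(self, base: str, last_page: int):
        self.base, self.last_page = base, last_page

    def next_tasks(self, completed: Task) -> List[Task]:
        nxt = completed.depth + 1
        if nxt > self.last_page:
            return []
        return [Task(uri=f"{self.base}?page={nxt}", depth=nxt)]


class EchoFetcher(Fetcher):
    """Toy driver that 'fetches' by echoing the URI."""
    def fetch(self, task: Task) -> str:
        return f"content of {task.uri}"


def run(gen: Generator, fetcher: Fetcher, seed: Task) -> Iterator[Tuple[Task, str]]:
    frontier = [seed]
    while frontier:
        task = frontier.pop(0)  # FIFO order -> BFS traversal
        yield task, fetcher.fetch(task)
        frontier.extend(gen.next_tasks(task))


results = list(run(PageNumberGenerator("https://example.com/items", 2),
                   EchoFetcher(),
                   Task("https://example.com/items?page=0")))
# yields the seed page, then pages 1 and 2
```

Swapping in a different Generator (link extraction, SQL offsets) or Fetcher (HTTP, SQL) changes the strategy without touching the loop.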

🚀 Key Features

  • Strategy-Based Execution: Switch between local_scan, web_crawl, or sql_scan with a single parameter.
  • Recursive & Stateful: Supports BFS/DFS crawling for websites and directories with depth control.
  • Smart Filtering: Built-in support for regex-based URL filtering and file extension filtering.
  • AI-Ready: Designed to integrate with LLMs (Tier 3 Plugin) to intelligently identify CSS selectors or generate SQL queries dynamically.
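The filtering idea can be pictured with a standalone helper (this is an illustration, not the library's configuration syntax): a discovered URI is kept only if it matches a regex and carries an allowed extension.

```python
import re
from typing import Optional, Tuple


def keep(uri: str,
         link_pattern: Optional[str] = None,
         extensions: Tuple[str, ...] = ()) -> bool:
    """Return True if a discovered URI passes both filters."""
    if link_pattern and not re.search(link_pattern, uri):
        return False  # regex filter: URI must match the pattern
    if extensions and not uri.lower().endswith(extensions):
        return False  # extension filter: URI must end with an allowed suffix
    return True


links = [
    "https://example.com/blog/post-1.html",
    "https://example.com/assets/logo.png",
    "https://other.site/blog/post-2.html",
]
blog_html = [u for u in links
             if keep(u, link_pattern=r"example\.com/blog/",
                     extensions=(".html",))]
```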

📦 Installation

pip install sayou-connector

⚡ Quickstart

The ConnectorPipeline manages the loop between Generators and Fetchers.

from sayou.connector.pipeline import ConnectorPipeline

def run_demo():
    # 1. Initialize Pipeline
    pipeline = ConnectorPipeline()
    pipeline.initialize()

    # 2. Run (Example: Web Crawling)
    print("Starting Web Crawl...")
    results = pipeline.run(
        source="BASE_URL",            # placeholder: seed URL to crawl
        strategy="web_crawl",
        # Generator Options
        link_pattern="BASE_PATTERN",  # placeholder: regex for links to follow
        max_depth=1
    )

    # 3. Process Results (Stream)
    for res in results:
        print(f"[Fetched] {res.task.uri}")
        # res.data contains extracted content or raw HTML
        # res.task contains metadata

if __name__ == "__main__":
    run_demo()

🤝 Contributing

We welcome contributions for new Fetchers (e.g., S3Fetcher, KafkaFetcher) or Generators (e.g., SitemapGenerator).

📜 License

Apache 2.0 License © 2025 Sayouzone


