
sayou-connector


sayou-connector is the universal data ingestion engine for the Sayou Data Platform. It provides a unified interface for fetching data from diverse sources: local files, web URLs, APIs, and databases.

Unlike simple HTTP clients or file readers, sayou-connector is designed as a Recursive Crawling Engine. It separates the logic of "What to fetch" (Navigation) from "How to fetch" (Transport), allowing for complex, stateful data collection strategies like web crawling or database pagination.

Philosophy

"Navigate First, Fetch Later." Data collection is not a one-off task; it's a discovery process. We define two distinct roles:

  1. Generator (Navigator): Determines the next target (e.g., finds the next page URL, calculates DB offset).
  2. Fetcher (Driver): Executes the retrieval (e.g., sends HTTP GET, executes SQL).

This separation makes the pipeline highly extensible, from a simple file walker to an AI-powered web crawler.
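As an illustration, the two roles can be sketched as plain Python objects. Note that the class and method names below are hypothetical stand-ins, not the sayou-connector API:

```python
from dataclasses import dataclass
from typing import Iterator

# Hypothetical sketch of the Generator/Fetcher split.
# None of these names come from sayou-connector itself.

@dataclass
class Task:
    uri: str
    depth: int

class CountdownGenerator:
    """Navigator: decides the next targets (here, simple numbered pages)."""
    def __init__(self, base: str, max_depth: int):
        self.base, self.max_depth = base, max_depth

    def next_tasks(self) -> Iterator[Task]:
        for depth in range(self.max_depth + 1):
            yield Task(uri=f"{self.base}/page/{depth}", depth=depth)

class EchoFetcher:
    """Driver: executes retrieval (here, it just echoes the URI)."""
    def fetch(self, task: Task) -> str:
        return f"content of {task.uri}"

def run_pipeline(generator: CountdownGenerator, fetcher: EchoFetcher) -> list[str]:
    # The pipeline loop: ask the Navigator for targets, hand each to the Driver.
    return [fetcher.fetch(t) for t in generator.next_tasks()]
```

Swapping `EchoFetcher` for an HTTP client, or `CountdownGenerator` for a link extractor, changes *how* or *what* without touching the loop itself.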

🚀 Key Features

  • Strategy-Based Execution: Switch between local_scan, web_crawl, or sql_scan with a single parameter.
  • Recursive & Stateful: Supports BFS/DFS crawling for websites and directories with depth control.
  • Smart Filtering: Built-in support for regex-based URL filtering and file extension filtering.
  • AI-Ready: Designed to integrate with LLMs (Tier 3 Plugin) to intelligently identify CSS selectors or generate SQL queries dynamically.
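The recursive, depth-controlled crawling with regex filtering described above can be sketched as a toy BFS loop. This is a stand-in for illustration only, not the library's internals; the fake `LINKS` graph replaces real page fetching:

```python
import re
from collections import deque

# Fake link graph standing in for real fetched pages.
LINKS = {
    "/": ["/a", "/b", "/skip.css"],
    "/a": ["/a/1"],
    "/b": [],
    "/a/1": [],
}

def bfs_crawl(start: str, max_depth: int, pattern: str) -> list[str]:
    """Breadth-first crawl with depth control and a regex link filter."""
    allow = re.compile(pattern)
    seen, order = {start}, []
    frontier = deque([(start, 0)])
    while frontier:
        uri, depth = frontier.popleft()
        order.append(uri)
        if depth >= max_depth:
            continue  # depth limit reached: visit, but do not expand
        for link in LINKS.get(uri, []):
            if link not in seen and allow.search(link):
                seen.add(link)
                frontier.append((link, depth + 1))
    return order
```

With `max_depth=1` and pattern `r"^/[ab]"`, the crawl visits `/`, `/a`, and `/b`, skipping `/skip.css` via the filter and `/a/1` via the depth limit.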

📦 Installation

pip install sayou-connector

⚡ Quickstart

The ConnectorPipeline manages the loop between Generators and Fetchers.

from sayou.connector.pipeline import ConnectorPipeline

def run_demo():
    # 1. Initialize the pipeline
    pipeline = ConnectorPipeline()
    pipeline.initialize()

    # 2. Run (example: web crawling)
    # Replace "BASE_URL" with your start URL and "BASE_PATTERN"
    # with a regex matching the links you want to follow.
    print("Starting Web Crawl...")
    results = pipeline.run(
        source="BASE_URL",
        strategy="web_crawl",
        # Generator options
        link_pattern="BASE_PATTERN",
        max_depth=1,
    )

    # 3. Process results as a stream
    for res in results:
        print(f"[Fetched] {res.task.uri}")
        # res.data contains the extracted content or raw HTML
        # res.task carries the task metadata (URI, depth, etc.)

if __name__ == "__main__":
    run_demo()

🤝 Contributing

We welcome contributions of new Fetchers (e.g., S3Fetcher, KafkaFetcher) and Generators (e.g., SitemapGenerator).

📜 License

Apache 2.0 License © 2025 Sayouzone
