
Connector components for the Sayou Data Platform


sayou-connector


The Universal Data Ingestion Engine for Sayou Fabric.

sayou-connector provides a unified interface for fetching data from diverse sources (local files, web URLs, and databases), normalizing everything into a standard format called SayouPacket.

It separates the logic of Navigation (Generator) from Retrieval (Fetcher), enabling complex recursive crawling and pagination strategies out of the box.

💡 Core Philosophy

"Navigate First, Fetch Later."

Data collection is not just about downloading; it's about discovery. We decouple the responsibility into two roles:

  1. Generator (Navigator): The "Brain". It decides what to fetch next (e.g., calculates DB offsets, finds next page links) and yields a Task.
  2. Fetcher (Driver): The "Muscle". It executes the actual retrieval (e.g., HTTP GET, SQL Query) and returns a Packet.

This separation enables the Feedback Loop, where the result of a fetch (e.g., found links) feeds back into the Generator to discover more targets.
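The loop can be sketched in a few lines of plain Python. The Task and Packet shapes below are simplified stand-ins for the real classes, and the toy "site" replaces an actual HTTP fetch:

```python
from dataclasses import dataclass, field

# Hypothetical minimal shapes; the real Task/SayouPacket carry more fields.
@dataclass
class Task:
    uri: str
    depth: int = 0

@dataclass
class Packet:
    task: Task
    data: str
    links: list = field(default_factory=list)
    success: bool = True

# A toy "site": each URI maps to (content, outgoing links).
SITE = {
    "root": ("index", ["a", "b"]),
    "a": ("page-a", []),
    "b": ("page-b", []),
}

def fetch(task: Task) -> Packet:
    """The Fetcher ("Muscle"): retrieve content for a single task."""
    data, links = SITE.get(task.uri, ("", []))
    return Packet(task=task, data=data, links=links)

def crawl(start: str, max_depth: int = 1):
    """The Generator ("Brain"): decide what to fetch next, feeding
    links found by the Fetcher back into the frontier."""
    frontier = [Task(start, depth=0)]
    seen = {start}
    while frontier:
        task = frontier.pop(0)          # BFS frontier
        packet = fetch(task)            # hand off to the Fetcher
        yield packet
        if task.depth < max_depth:      # feedback loop: new targets
            for link in packet.links:
                if link not in seen:
                    seen.add(link)
                    frontier.append(Task(link, depth=task.depth + 1))

packets = list(crawl("root", max_depth=1))
```

Note how the Fetcher never decides what comes next; it only reports what it found, and the Generator turns those findings into new Tasks.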

📦 Installation

pip install sayou-connector

⚡ Quick Start

The ConnectorPipeline manages the feedback loop between Generators and Fetchers.

from sayou.connector.pipeline import ConnectorPipeline

def run_demo():
    # 1. Initialize Pipeline
    pipeline = ConnectorPipeline()
    pipeline.initialize()

    # 2. Run (Example: Web Crawling)
    print("Starting Web Crawl...")
    
    # Returns an iterator of 'SayouPacket' objects
    packets = pipeline.run(
        source="https://news.daum.net/tech",
        strategy="requests",
        link_pattern=r"https://v\.daum\.net/v/\d+",
        max_depth=1
    )

    # 3. Process Results (Stream)
    for packet in packets:
        if packet.success:
            print(f"[Fetched] {packet.task.uri}")
            # packet.data contains the extracted content (dict, bytes, etc.)
            print(f"   Data: {str(packet.data)[:50]}...")
        else:
            print(f"[Error] {packet.error}")

if __name__ == "__main__":
    run_demo()

🔑 Key Concepts

Generators

  • FileGenerator: Recursively scans directories to find files matching extensions or patterns.
  • SqlGenerator: Generates paginated SQL queries (LIMIT/OFFSET) to fetch large tables in batches.
  • WebCrawlGenerator: Manages a URL frontier queue for BFS/DFS web crawling with depth control.
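For instance, the pagination idea behind SqlGenerator boils down to emitting one LIMIT/OFFSET window per batch until the table is covered. This is an illustrative sketch, not the library's actual API:

```python
def paginated_queries(table: str, total_rows: int, batch_size: int = 100):
    """Illustrative: yield one LIMIT/OFFSET query per batch so a large
    table can be fetched in fixed-size chunks."""
    offset = 0
    while offset < total_rows:
        yield f"SELECT * FROM {table} LIMIT {batch_size} OFFSET {offset}"
        offset += batch_size

# 250 rows in batches of 100 -> three windows: OFFSET 0, 100, 200.
queries = list(paginated_queries("events", total_rows=250, batch_size=100))
```

Each yielded query is a Task for a Fetcher to execute; the Generator never touches the database itself.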

Fetchers

  • FileFetcher: Reads binary or text content from the local file system.
  • SqliteFetcher: Executes SQL queries against SQLite databases securely.
  • SimpleWebFetcher: Fetches HTML pages and extracts data/links using BeautifulSoup.
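A Fetcher's job is narrow: given one target, return the data plus a success flag and any error. The sketch below shows that contract for a local-file read; the function shape is an assumption, not the actual FileFetcher interface:

```python
import os
import tempfile

def fetch_file(path: str, binary: bool = False):
    """Illustrative local-file fetch: return (data, success, error),
    in the spirit of a SayouPacket."""
    try:
        if binary:
            with open(path, "rb") as f:
                return f.read(), True, None
        with open(path, "r", encoding="utf-8") as f:
            return f.read(), True, None
    except OSError as exc:
        return None, False, str(exc)

# Demo against a throwaway temp file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello")
    path = tmp.name
data, ok, err = fetch_file(path)
os.unlink(path)
```

Because errors come back inside the result rather than as exceptions, the pipeline can keep streaming packets and report failures per-task, as in the Quick Start loop above.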

🤝 Contributing

We welcome contributions for new Fetchers (e.g., S3Fetcher, KafkaFetcher) or Generators (e.g., SitemapGenerator)!
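A new Fetcher mostly needs two things: a way to claim the URIs it handles and a fetch method that returns a packet-like result. The class and method names below are illustrative assumptions, not the actual sayou-connector base interface; check the project source before writing a real contribution:

```python
class S3FetcherSketch:
    """Skeleton of a hypothetical S3 fetcher (names assumed)."""
    scheme = "s3"  # which task URIs this fetcher claims

    def can_handle(self, uri: str) -> bool:
        return uri.startswith(f"{self.scheme}://")

    def fetch(self, uri: str) -> dict:
        # Real code would call boto3 here; we just echo the object key.
        bucket_key = uri[len(f"{self.scheme}://"):]
        return {"uri": uri, "data": f"<object {bucket_key}>", "success": True}

fetcher = S3FetcherSketch()
packet = fetcher.fetch("s3://my-bucket/report.csv")
```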

📜 License

Apache 2.0 License © 2025 Sayouzone

