
Baybin Sentinel: OpenSearch writer


Baybin Sentinel

baybin_sentinel is a Python utility package designed for the Baybin Sentiment Analysis System. It provides specialized writers to streamline the ingestion of social media data into OpenSearch.

Currently supported platforms: Facebook, Threads, PTT, News (RSS / Scrapy), Google Trends.

Installation

(For Crawler Developers) Install Package

pip install -U baybin_sentinel

(For Package Developers) Create Virtual Environment

conda update -n base -c conda-forge conda
conda create -n sentinel python=3.13 pip -y
conda activate sentinel
cd baybin_sentinel
pip install -r requirements.txt
pip install -e .

Configuration

Each writer accepts credentials as direct parameters, via a config file passed by path, or via a config file located through an environment variable.

Option A — direct parameters:

writer = PttWriter(
    host="192.168.x.x",
    port=9200,
    user="your_username",
    password="your_password",
    verify_certs=False,
)

Option B — config file (recommended for development):

writer = PttWriter(config_path="/absolute/path/to/config.yaml")

# config.yaml
opensearch:
  host: "your_opensearch_ip"
  port: 9200
  user: "your_username"
  password: "your_password"
  verify_certs: false

Option C — environment variable (recommended for Celery workers / containers):

Set BAYBIN_SENTINEL_CONFIG to the absolute path of your config file. When set, it takes priority over the config_path argument.

export BAYBIN_SENTINEL_CONFIG=/absolute/path/to/config.yaml

Config resolution order: direct params → BAYBIN_SENTINEL_CONFIG env var → config_path argument → default "config.yaml" (relative to CWD).
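
The resolution order above can be sketched as a small function. This is an illustration only, not the package's actual implementation: resolve_config_source, its parameters, and its return shape are hypothetical names introduced here.

```python
import os
from typing import Mapping, Optional, Tuple

ENV_VAR = "BAYBIN_SENTINEL_CONFIG"
DEFAULT_CONFIG = "config.yaml"  # resolved relative to the current working directory


def resolve_config_source(
    direct: Optional[Mapping[str, object]] = None,
    config_path: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Tuple[str, object]:
    """Return (source_kind, value) following the documented priority.

    Hypothetical helper: the real writers may resolve configuration differently.
    """
    env = os.environ if environ is None else environ
    if direct:  # 1. direct parameters win
        return ("direct", dict(direct))
    if env.get(ENV_VAR):  # 2. environment variable
        return ("env", env[ENV_VAR])
    if config_path:  # 3. explicit config_path argument
        return ("config_path", config_path)
    return ("default", DEFAULT_CONFIG)  # 4. fallback
```

Passing environ explicitly keeps the function easy to test without touching the real process environment.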

Index naming convention

Each writer targets a dedicated OpenSearch index following the pattern raw_{platform}_{content_type}s:

Writer              Post index               Comment index
FacebookWriter      raw_facebook_posts       raw_facebook_comments
ThreadsWriter       raw_threads_posts        raw_threads_comments
PttWriter           raw_ptt_posts            -
NewsWriter          raw_news_posts           -
GoogleTrendsWriter  raw_google_trends_posts  -
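
As a quick illustration of the naming pattern, the index name can be derived from a platform and content type. The index_name helper below is a hypothetical name used only for this sketch; the package may build index names differently.

```python
def index_name(platform: str, content_type: str) -> str:
    """Build an index name following raw_{platform}_{content_type}s.

    Hypothetical helper for illustration; content_type is singular
    ("post" or "comment") and the trailing "s" pluralizes it.
    """
    return f"raw_{platform}_{content_type}s"


print(index_name("facebook", "comment"))
print(index_name("google_trends", "post"))
```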

Field normalization

Each writer accepts pre-normalized data and routes fields to root vs metadata before writing to OpenSearch.

Canonical root-level fields (posts): post_id, platform, client_id, source_name, url, content, author_name, language, timestamp, crawled_at, s3_path

Canonical root-level fields (comments): comment_id, legacy_comment_id, post_id, post_url, platform, client_id, author_id, author_name, content, content_hash, timestamp, crawled_at, created_at, depth, s3_path

Any field not in the canonical set is automatically moved into a nested metadata object.
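
A minimal sketch of the root vs metadata routing described above, using the canonical post fields listed in this section. route_fields is a hypothetical function written for illustration; the writers' real routing logic is not shown in this document and may differ, for example in how an incoming metadata key is handled.

```python
from typing import Any, Dict, FrozenSet

# Canonical root-level post fields, copied from the list above.
POST_ROOT_FIELDS: FrozenSet[str] = frozenset({
    "post_id", "platform", "client_id", "source_name", "url", "content",
    "author_name", "language", "timestamp", "crawled_at", "s3_path",
})


def route_fields(doc: Dict[str, Any],
                 root_fields: FrozenSet[str] = POST_ROOT_FIELDS) -> Dict[str, Any]:
    """Keep canonical fields at the root and nest everything else under "metadata"."""
    routed: Dict[str, Any] = {}
    metadata: Dict[str, Any] = {}
    for key, value in doc.items():
        if key in root_fields:
            routed[key] = value
        else:
            metadata[key] = value
    if metadata:  # only add the nested object when there is something to put in it
        routed["metadata"] = metadata
    return routed
```

For example, a PTT post carrying an extra board field would end up with board under metadata while post_id and platform stay at the root.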

Validation

Every writer validates the document before writing to OpenSearch. A ValueError is raised immediately if any required field is missing or empty, so malformed documents are never written silently.

Required post fields: post_id, platform, client_id, timestamp, crawled_at

Required comment fields: comment_id (or legacy_comment_id), post_id, platform, client_id, content, timestamp, crawled_at

This means:

  • You must call normalize_post() / normalize_comment() before passing data to the writer; passing a raw API response directly will raise a ValueError.
  • client_id must always be present, which enforces multi-tenancy at the write layer.
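
The required-field rules above can be sketched as follows. The function names and the definition of "empty" (None or a blank string) are assumptions made for this example; the package's own validation code is not shown here and may apply different checks or error messages.

```python
from typing import Any, Dict, List

REQUIRED_POST_FIELDS = ("post_id", "platform", "client_id", "timestamp", "crawled_at")
REQUIRED_COMMENT_FIELDS = ("post_id", "platform", "client_id", "content",
                           "timestamp", "crawled_at")


def _is_empty(value: Any) -> bool:
    # Assumption: "empty" means None or a string with no non-whitespace characters.
    return value is None or (isinstance(value, str) and not value.strip())


def validate_post(doc: Dict[str, Any]) -> None:
    """Raise ValueError if any required post field is missing or empty."""
    missing = [f for f in REQUIRED_POST_FIELDS if _is_empty(doc.get(f))]
    if missing:
        raise ValueError("missing or empty post fields: " + ", ".join(missing))


def validate_comment(doc: Dict[str, Any]) -> None:
    """Raise ValueError if any required comment field is missing or empty."""
    missing: List[str] = [f for f in REQUIRED_COMMENT_FIELDS if _is_empty(doc.get(f))]
    # Either comment_id or legacy_comment_id satisfies the identifier requirement.
    if _is_empty(doc.get("comment_id")) and _is_empty(doc.get("legacy_comment_id")):
        missing.insert(0, "comment_id (or legacy_comment_id)")
    if missing:
        raise ValueError("missing or empty comment fields: " + ", ".join(missing))
```

Running a check like this in the crawler before calling a writer surfaces a missing client_id earlier than the write step would.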

Platform field maps

ThreadsWriter — accepts raw output from the internal Threads scraper:

Raw field        Canonical field
text             content (posts and comments)
post_url         url (posts)
author           author_name (posts)
reply_author     author_name (comments)
reply_author_id  author_id (comments)
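
The Threads field map can be read as a simple key rename. The sketch below expresses it as two dictionaries and a hypothetical rename_fields helper; it only illustrates the mapping and is not the ThreadsWriter implementation.

```python
from typing import Any, Dict, Mapping

# Raw -> canonical names, taken from the table above.
THREADS_POST_MAP = {"text": "content", "post_url": "url", "author": "author_name"}
THREADS_COMMENT_MAP = {
    "text": "content",
    "reply_author": "author_name",
    "reply_author_id": "author_id",
}


def rename_fields(raw: Mapping[str, Any], field_map: Mapping[str, str]) -> Dict[str, Any]:
    """Return a copy of raw with mapped keys renamed; unmapped keys pass through.

    If both a raw key and its canonical name are present, whichever appears
    later in the input overwrites the earlier one.
    """
    return {field_map.get(key, key): value for key, value in raw.items()}
```

Keys without an entry in the map, such as a like count, are left unchanged.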

FacebookWriter — expects pre-normalized post data (output of normalize_post()). Comment field map:

Raw field        Canonical field
reply_author     author_name
reply_author_id  author_id

PttWriter, NewsWriter, GoogleTrendsWriter — expect pre-normalized data with canonical field names already set.

Example (Facebook)

from baybin_sentinel.platforms.facebook import FacebookWriter

writer = FacebookWriter(
    host="192.168.x.x",
    port=9200,
    user="your_username",
    password="your_password",
    verify_certs=False,
)

# Single post with its comments
writer.save(post, comments)

# Bulk posts only
writer.save_bulk_posts(posts)

# Bulk comments only
writer.save_bulk_comments(comments)
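
The post and comments variables in the example above are not defined there. A minimal sketch of inputs that would satisfy the required fields listed under Validation might look like this. All values are illustrative, and the ISO 8601 string format for timestamp and crawled_at is an assumption, since the expected timestamp type is not stated in this document.

```python
from datetime import datetime, timezone

# Assumption: timestamps are passed as ISO 8601 strings in UTC.
now = datetime.now(timezone.utc).isoformat()

post = {
    "post_id": "fb_123",        # illustrative identifier
    "platform": "facebook",
    "client_id": "client_a",    # required for multi-tenancy
    "content": "Example post text",
    "timestamp": now,
    "crawled_at": now,
}

comments = [
    {
        "comment_id": "fb_123_c1",
        "post_id": post["post_id"],  # links the comment to its post
        "platform": "facebook",
        "client_id": "client_a",
        "content": "Example comment text",
        "timestamp": now,
        "crawled_at": now,
    },
]
```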

Example (Threads)

from baybin_sentinel.platforms.threads import ThreadsWriter

writer = ThreadsWriter(config_path="/path/to/config.yaml")

# Single post with its replies (extracted from post["replies_detail"])
writer.save(post)

# Single post with explicit comments
writer.save(post, comments)

# Bulk posts only
writer.save_bulk_posts(posts)

# Bulk comments for one post
writer.save_bulk_comments(replies, post_url="https://threads.net/...")

Example (PTT)

from baybin_sentinel.platforms.ptt import PttWriter

writer = PttWriter(config_path="/path/to/config.yaml")

writer.save_post(post)
writer.save_bulk_posts(posts)

Example (News)

from baybin_sentinel.platforms.news import NewsWriter

writer = NewsWriter(config_path="/path/to/config.yaml")

writer.save_post(post)
writer.save_bulk_posts(posts)

Example (Google Trends)

from baybin_sentinel.platforms.google_trends import GoogleTrendsWriter

writer = GoogleTrendsWriter(config_path="/path/to/config.yaml")

writer.save_post(post)
writer.save_bulk_posts(posts)

Publishing to PyPI

If you are the maintainer, follow these steps to publish a new version:

  1. Update version in pyproject.toml (e.g., 0.2.0).
  2. Install build tools:
    pip install build twine
    
  3. Build the package (the cleanup command below is for Windows cmd; on macOS/Linux use rm -rf dist build instead):
    rmdir /s /q dist build 2>nul
    python -m build
    
  4. Upload to PyPI:
    python -m twine upload dist/*
    
  5. Authentication:
    • Username: __token__
    • Password: pypi-your-api-token-here (including the pypi- prefix)
