Browser-based Instagram research tooling: Selenium scraper with optional GCS and PostgreSQL.

Slug-Ig-Crawler


What it is: A Python tool that drives a real browser (Selenium) to collect public Instagram profile data, post metadata, comments, and media, with optional Google Cloud Storage uploads and PostgreSQL enqueue rows for downstream pipelines. Configuration is TOML + Pydantic; orchestration is CLI → Pipeline → Selenium backend.

This document is organized so you can understand the repo, skim flags, and run a first pass in 5–10 minutes, then jump to the deeper sections below when you need them.


Table of contents

Start here

# Section
1 What this repository is
2 Objectives & scope
3 Features
4 Key configuration flags
5 Installation
6 Quickstart (5-10 minutes)
7 Documentation map
8 Open source, research use, and acceptable use

Reference (deep dive)

# Section
9 Architecture Overview
10 Entry Point: CLI
11 VS Code debugging (launch.json)
12 Core Components
13 End-to-End Workflow
14 Execution Flow
15 Sequence Diagram
16 Configuration
17 External services and infrastructure
18 Data Models and Parsing
19 Authentication
20 Docker and Docker Compose
21 Data Persistence
22 Key Design Patterns
23 Dependencies
24 Security Considerations
25 Troubleshooting
26 Performance Timing & Observability
27 Conclusion

What this repository is

  • Stack: Python 3, Selenium (+ selenium-wire for captured network traffic), Pydantic config, optional GCS and Postgres (psycopg) for artifact handoff.
  • Entry point: Slug-Ig-Crawler → Pipeline → SeleniumBackend → page objects and utilities. Pass --config /path/to/config.toml, or omit it when ~/.slug/config.toml exists (e.g. after Slug-Ig-Crawler bootstrap).
  • Outputs: JSONL and related files under configurable paths; when push_to_gcs = 1, batches can be uploaded and enqueued (crawled_posts / crawled_comments). See scripts/postgres_setup.sql for the DB schema.
  • Operations note: Job orchestrators (e.g. Thor) may generate configs from their own templates and run the same CLI inside Docker; this README does not replace Thor’s own docs.

Objectives & scope

  • Research, education, and careful automation against public pages.
  • Transparency in how data is collected (browser + captured requests).
  • Traceability via thor_worker_id, structured logs, and optional DB rows.

You are responsible for compliance with Instagram / Meta terms, applicable law, and your own risk tolerance.


Features

  • Profile mode — scrape by handle from [main].target_profiles.
  • URL file mode — scrape from a list file when [data].urls_filepath exists on disk (overrides profile mode).
  • Captured GraphQL — optional scrape_using_captured_requests path for comment/post data via performance logs.
  • Local media + optional full-video download — in-process script when not using captured-requests path for some media flows.
  • GCS + Postgres handoff — upload JSONL and enqueue gs:// URIs (or local paths when push_to_gcs = 0).
  • Screenshots → MP4 — optional enable_screenshots with shutdown upload (respects push_to_gcs).
  • Docker or local Chrome — use_docker, headless, env overrides CHROME_BIN / CHROMEDRIVER_BIN.
  • Observability — JSON timing events (pipeline_total_time, pipeline_active_time) and structured fields including thor_worker_id.

Key configuration flags

These are the knobs people usually need first. Full TOML lives in config.example.toml.

Flag / section Role
[main].target_profiles Profile mode: list of { name, num_posts }.
[data].urls_filepath If this path exists, URL-file mode wins; otherwise profile mode.
[main].scrape_using_captured_requests Prefer GraphQL capture from network logs vs. heavier DOM-only flows where applicable.
[main].push_to_gcs 1 = upload JSONL to GCS and store gs://... in DB; 0 = no GCS, enqueue absolute local paths; also affects screenshot video upload/cleanup.
[main].gcs_bucket_name Target bucket when push_to_gcs = 1 and upload paths run.
[main].use_docker / headless Browser environment: container vs. local; visible vs. headless.
[main].enable_screenshots Capture WebP frames and generate/upload MP4 on shutdown (see push_to_gcs).
[trace].thor_worker_id Required for Pipeline; used in logs, enqueue, and naming.
PUGSY_PG_* env vars Postgres connection for FileEnqueuer (see enqueue_client.py).
GOOGLE_APPLICATION_CREDENTIALS Typical GCP auth for GCS when uploading.

Environment overrides for binaries: CHROME_BIN and CHROMEDRIVER_BIN take precedence over the optional [main].chrome_binary_path / chromedriver_binary_path. On macOS, if neither env nor config nor the ~/.slug/browser cache supplies Chrome, the pipeline falls back to /Applications/Google Chrome.app/Contents/MacOS/Google Chrome when that file exists. IGSCRAPER_OMIT_CHROME_USER_DATA_DIR=1 (save-cookie and run) skips --user-data-dir for debugging corrupted profiles.
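
A minimal sketch of that resolution order (illustrative only; the helper name and the cache-glob layout are assumptions, not the project’s actual code):

import os
from pathlib import Path

def resolve_chrome_binary(config_value: str | None = None) -> str | None:
    """Illustrative order: env override, then config, then bootstrap cache, then macOS default."""
    # 1. Environment override always wins.
    if os.environ.get("CHROME_BIN"):
        return os.environ["CHROME_BIN"]
    # 2. Optional [main].chrome_binary_path from the TOML config.
    if config_value and Path(config_value).exists():
        return config_value
    # 3. Bootstrap cache under ~/.slug/browser/<platform>/ (exact layout assumed here).
    cache = Path.home() / ".slug" / "browser"
    candidates = sorted(cache.glob("*/chrome*/chrome*")) if cache.exists() else []
    if candidates:
        return str(candidates[0])
    # 4. macOS system Chrome fallback, only when that file exists.
    mac_default = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
    return mac_default if Path(mac_default).exists() else None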


Installation

Slug-Ig-Crawler is the project name. PyPI package: slug-ig-crawler. CLI: Slug-Ig-Crawler (import package remains igscraper).

Install Command
Latest release from PyPI pip install slug-ig-crawler
With screenshot → MP4 helpers (imageio) pip install "slug-ig-crawler[video]"
Optional JSON5 parsing in the sorter pip install "slug-ig-crawler[json5]"
Video + JSON5 together pip install "slug-ig-crawler[all]"

After install, the Slug-Ig-Crawler console script is on your PATH (legacy alias igscraper is still provided for compatibility). Dependencies are declared in pyproject.toml.

Chrome / ChromeDriver (macOS and Linux): pip does not download browsers. After pip install "slug-ig-crawler[all]" (or any install), run Slug-Ig-Crawler bootstrap once to fetch Chrome for Testing + matching ChromeDriver for pinned full version 143.0.7499.169 (from Google’s known-good index) into ~/.slug/browser/<platform>/, and install a sample ~/.slug/config.toml if missing. Override the build with IGSCRAPER_CFT_FULL_VERSION (must be a version listed in Google’s JSON). Until binaries exist, the first pipeline run prints a stderr warning suggesting bootstrap (silence with IGSCRAPER_SILENT_BROWSER_CACHE_WARN=1). Inspect templates with Slug-Ig-Crawler show-config.

Publishing to PyPI (maintainers): see docs/PYPI_RELEASE.md (Trusted Publishing + release checklist; canonical org repo Pugsyfy/Slug-IG-Crawler). Release notes are tracked in CHANGELOG.md.


Quickstart (5-10 minutes)

Goal: install dependencies, apply the Postgres schema, drop in a minimal config.toml, and run the CLI once.

Step Action
1 Create and activate a virtualenv: python3 -m venv .venv && source .venv/bin/activate (Windows: .venv\Scripts\activate).
2 Install from PyPI: pip install "slug-ig-crawler[all]". Then run Slug-Ig-Crawler bootstrap to cache stable Chrome + ChromeDriver, seed ~/.slug/config.toml, and (by default) apply the bundled Postgres schema using local defaults (localhost:5432, database postgres; on macOS Homebrew the default DB user is your login name, elsewhere postgres; set PUGSY_PG_PORT=5433 if your DB is on a Docker-mapped port). On success, ~/.slug/.env is written with the effective PUGSY_PG_* values.
3 Postgres (required if you use enqueue): ensure Postgres is reachable at those defaults, or set PUGSY_PG_* in your shell, a project .env, or edit ~/.slug/.env. If you do not have psql yet, from a git clone run ./scripts/install_postgres_local.sh (macOS Homebrew; Linux apt/dnf/yum; starts the Postgres service via brew services or systemctl when possible). You can also run psql manually: psql "$YOUR_DATABASE_URL" -f scripts/postgres_setup.sql. Use Slug-Ig-Crawler bootstrap --no-setup-postgres to skip schema setup.
4 Run Slug-Ig-Crawler save-cookie --username <instagram_username> once, then set [data].cookie_file in ~/.slug/config.toml (recommended: ~/.slug/cookies/latest.json) and set [trace].thor_worker_id (any non-empty string, e.g. local-dev). Set push_to_gcs to 0 for a local-only trial without GCP.
5 Profile mode: keep [main].target_profiles populated and ensure [data].urls_filepath is missing or points to a file that does not exist. URL mode: one URL per line in a file; set [data].urls_filepath to that real path.
6 Docker vs local: [main].use_docker = true for Docker/Compose flows; false with headless = false for a visible local browser. See Docker and Docker Compose.
7 Run: Slug-Ig-Crawler (autoloads ~/.slug/config.toml), or Slug-Ig-Crawler --config /path/to/config.toml.

Debug in the IDE: see VS Code debugging (launch.json). For debugpy, start Slug-Ig-Crawler: CLI (listen for debugger), then Slug-Ig-Crawler: Attach to debugpy so execution continues past debugpy.wait_for_client().


Documentation map

After the quickstart, use the Reference table of contents above for:

  • Architecture & flow — diagrams and sequence for how a run is structured.
  • Configuration — full TOML sections, placeholders, [trace].
  • External services — GCS, Postgres, path rules (/outputs/), push_to_gcs behavior.
  • Docker — compose layout and Chrome in containers.
  • Operations — timing logs, troubleshooting, dependencies, security notes.
  • PyPI releases — docs/PYPI_RELEASE.md.
  • Changelog — CHANGELOG.md.

Development from source (git clone)

Use this only when you want to hack on code, run tests, or make local edits.

git clone https://github.com/Pugsyfy/Slug-IG-Crawler.git
cd Slug-IG-Crawler
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

This installs the package in editable mode with the dev/video/json5 extras via requirements.txt (-e .[dev,video,json5]).


This repository is open source (see the project license in the repo root). It is shared for transparency and research, not as an official product or service.

Open source, research use, and acceptable use

Research and education only (recommended). This software is intended for research, education, and responsible personal experimentation (for example, understanding browser automation or studying publicly visible page structure). It is not presented as a tool for high-volume production scraping, commercial data harvesting, or any use that conflicts with platform rules. You decide how you use it; you are responsible for that use.

Compliance with Instagram / Meta policies (mandatory). Instagram and Meta impose Terms of Use, Community Guidelines, and other rules that apply to access, automation, and data. Automated or scripted access may be restricted or prohibited depending on context. You must read, understand, and follow the terms, policies, and technical limits that apply to your jurisdiction and use case—including any future updates Meta publishes. Do not use this project to circumvent security, rate limits, login walls, or other protections.

Responsible use. Use conservative rate limits, respect people’s privacy and intellectual property, collect and retain only what you are permitted to, and stop immediately if the platform signals that access is unwelcome. Nothing in this documentation authorizes scraping in violation of applicable law or platform terms.

No affiliation. This project is not affiliated with, endorsed by, or sponsored by Instagram, Meta, or their brands.

Disclaimer. The software is provided as-is without warranty. The authors and contributors assume no liability for misuse, account actions (including suspension), legal claims, or damages arising from use of this repository. You are solely responsible for ensuring your use is lawful and compliant.


Architecture Overview

The application follows a layered architecture with clear separation of concerns:

┌─────────────────────────────────────────────────────────────┐
│                    CLI Layer (cli.py)                        │
│              Command-line argument parsing                   │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│              Pipeline Layer (pipeline.py)                   │
│         Orchestrates scraping workflow                      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│         Configuration Layer (config.py)                      │
│    Loads and validates TOML configuration                   │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│         Backend Layer (backends/selenium_backend.py)        │
│    Manages WebDriver lifecycle and browser automation       │
└──────────────────────┬──────────────────────────────────────┘
                       │
        ┌──────────────┴──────────────┐
        ▼                              ▼
┌──────────────────┐         ┌──────────────────────┐
│  Page Objects    │         │  Data Extraction     │
│  (pages/)        │         │  (utils.py)          │
└──────────────────┘         └──────────────────────┘
        │                              │
        └──────────────┬──────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│         Data Persistence Layer                               │
│    Local files, GCS upload, database enqueueing             │
└─────────────────────────────────────────────────────────────┘

Runtime mode selection

At Pipeline.run(), the effective mode is chosen after config load (the [main].mode value in TOML may be overwritten):

  1. URL file mode (mode 2) — if [data].urls_filepath is set and that path exists on disk.
  2. Profile mode (mode 1) — else if [main].target_profiles is non-empty.
  3. Otherwise the run logs a warning and does nothing.
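
A minimal sketch of that selection order (illustrative; parameter names follow the documented TOML keys rather than the actual Pipeline code):

import os

def select_mode(urls_filepath: str | None, target_profiles: list[dict]) -> int | None:
    # URL file mode wins when the configured path actually exists on disk.
    if urls_filepath and os.path.isfile(urls_filepath):
        return 2  # URL file mode
    # Otherwise fall back to profile mode when at least one profile is configured.
    if target_profiles:
        return 1  # Profile mode
    # Neither source is usable: the run logs a warning and does nothing.
    return None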

Config template and Thor

  • This repo: use config.example.toml as a starting point (copy to config.toml and edit). It includes a [trace] section required by Pipeline.
  • Thor does not read this README; it generates job configs from its own template (e.g. thor/assets/base_config.toml) and invokes Docker with DOCKER_COMPOSE_FILE pointing at its compose file. The service name Slug-Ig-Crawler and the usual entrypoint Slug-Ig-Crawler --config /job/config.toml should stay compatible with that flow.

Entry Point: CLI

cli.py

The cli.py module serves as the single entry point for the application. It handles command-line argument parsing and initializes the scraping pipeline.

Commands:

Command Purpose
run (default) Load config and run the pipeline.
bootstrap Download Chrome for Testing + ChromeDriver for 143.0.7499.169 (override with IGSCRAPER_CFT_FULL_VERSION) into ~/.slug/browser/… and copy sample config to ~/.slug/config.toml if absent (--force / --force-config available). Re-downloads if the cache is not the pinned full version (see ~/.slug/browser/<platform>/.cft-pinned-version).
show-config Print the bundled sample TOML plus discovered cache config/cookie paths.
save-cookie Open Instagram login flow and save JSON cookies to ~/.slug/cookies/<browserVersion>_<username>_<timestamp>.json (also updates ~/.slug/cookies/latest.json). Uses the same Chrome + ChromeDriver pair as bootstrap (~/.slug/browser/...) unless you set both CHROME_BIN and CHROMEDRIVER_BIN; major versions must match (checked before launch). If bootstrap is missing on macOS, the tool falls back to /Applications/Google Chrome.app/Contents/MacOS/Google Chrome plus chromedriver on PATH (e.g. Homebrew) when both exist. Default: ephemeral Chrome profile (no --user-data-dir, matching the stable Linux-UA + --remote-debugging-pipe + CDP navigator.platform flow). Set IGSCRAPER_COOKIE_USE_USER_DATA_DIR=1 or CHROME_USER_DATA_DIR for a persistent profile under ~/.slug/chrome-user-data/save-cookie/<username>/. IGSCRAPER_OMIT_CHROME_USER_DATA_DIR=1 forces ephemeral. Runs in a fresh Python subprocess by default; IGSCRAPER_COOKIE_NO_SUBPROCESS=1 forces in-process (debug only). macOS: If Chrome for Testing crashes with multi-threaded process forked / fork pre-exec, run from Terminal.app instead of an IDE terminal; the CLI sets OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES before Selenium loads. bootstrap strips com.apple.quarantine from cached ChromeDriver when needed; you can also run xattr -d com.apple.quarantine $(which chromedriver) manually.
list-cookies Print only cached cookie JSON paths from ~/.slug/cookies.
version Print installed package version.

Key behavior:

  • main() resolves the config path: explicit --config, else ~/.slug/config.toml if present, else exits with a hint to pass --config or run bootstrap.
  • Then instantiates Pipeline and calls pipeline.run().
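
A minimal sketch of that resolution, assuming a hypothetical helper (not the actual cli.py code):

import sys
from pathlib import Path

def resolve_config_path(explicit: str | None) -> Path:
    # Explicit --config always wins.
    if explicit:
        return Path(explicit)
    # Otherwise fall back to the bootstrap-installed config when it exists.
    default = Path.home() / ".slug" / "config.toml"
    if default.exists():
        return default
    sys.exit("No config found: pass --config /path/to/config.toml "
             "or run 'Slug-Ig-Crawler bootstrap' to create ~/.slug/config.toml")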

Usage:

Slug-Ig-Crawler --config config.toml
Slug-Ig-Crawler bootstrap
Slug-Ig-Crawler show-config
Slug-Ig-Crawler save-cookie --username your_instagram_username
Slug-Ig-Crawler list-cookies
Slug-Ig-Crawler version
Slug-Ig-Crawler   # same as run; uses ~/.slug/config.toml when present

Arguments by command:

  • run (default)

    • --config <path> (optional): TOML path; if omitted, CLI autoloads ~/.slug/config.toml when present.
    • Example:
      • Slug-Ig-Crawler --config ./config.toml
      • Slug-Ig-Crawler run --config /abs/path/config.toml
  • bootstrap

    • --force (optional): re-download Chrome + ChromeDriver even if cache already exists.
    • --force-config (optional): overwrite ~/.slug/config.toml with bundled sample.
    • --setup-postgres / --no-setup-postgres: Postgres setup is enabled by default; use --no-setup-postgres to skip.
    • --postgres-sql-file <path> (optional): override SQL file path used by --setup-postgres.
    • Example:
      • Slug-Ig-Crawler bootstrap --force
      • Slug-Ig-Crawler bootstrap --force-config
      • Slug-Ig-Crawler bootstrap # runs browser + config + postgres setup by default
      • Slug-Ig-Crawler bootstrap --no-setup-postgres
      • Slug-Ig-Crawler bootstrap --setup-postgres --postgres-sql-file ./scripts/postgres_setup.sql
  • show-config

    • No arguments.
    • Example:
      • Slug-Ig-Crawler show-config
  • save-cookie

    • --username <instagram_username> (required): used in cookie filename and session labeling.
    • Example:
      • Slug-Ig-Crawler save-cookie --username your_instagram_username
  • list-cookies

    • No arguments.
    • Example:
      • Slug-Ig-Crawler list-cookies
  • version

    • No arguments.
    • Example:
      • Slug-Ig-Crawler version

This document also includes a VS Code debugging (launch.json) section below with a ready-to-paste debugger configuration for the same entry point.


VS Code debugging (launch.json)

Use this when you open the repository root (the folder that contains src/) in VS Code or Cursor. Create .vscode/launch.json and paste the following. It sets PYTHONPATH to src/ so python -m igscraper resolves the same way as in a shell where you exported PYTHONPATH, runs from ${workspaceFolder} so relative paths in config.toml work, and uses the Python extension’s debugpy adapter ("type": "debugpy"). If your tooling only recognizes the older launch type, change every "type": "debugpy" to "type": "python".

Adjust the --config argument if your TOML file is not named config.toml or does not live in the repo root. Select your virtual environment in the IDE before starting the debugger so breakpoints bind to the right interpreter.

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Slug-Ig-Crawler: CLI",
      "type": "debugpy",
      "request": "launch",
      "module": "igscraper",
      "cwd": "${workspaceFolder}",
      "args": ["--config", "config.toml"],
      "env": {
        "PYTHONPATH": "${workspaceFolder}/src"
      },
      "console": "integratedTerminal",
      "justMyCode": false
    },
    {
      "name": "Slug-Ig-Crawler: CLI (listen for debugger)",
      "type": "debugpy",
      "request": "launch",
      "module": "igscraper",
      "cwd": "${workspaceFolder}",
      "args": ["--config", "config.toml"],
      "env": {
        "PYTHONPATH": "${workspaceFolder}/src",
        "DEBUG_ATTACH": "1"
      },
      "console": "integratedTerminal",
      "justMyCode": false
    },
    {
      "name": "Slug-Ig-Crawler: Attach to debugpy",
      "type": "debugpy",
      "request": "attach",
      "connect": {
        "host": "localhost",
        "port": 5678
      },
      "pathMappings": [
        {
          "localRoot": "${workspaceFolder}",
          "remoteRoot": "${workspaceFolder}"
        }
      ],
      "justMyCode": false
    }
  ]
}

Optional attach flow: Pipeline can call debugpy.listen when DEBUG_ATTACH=1 (see pipeline.py). Start Slug-Ig-Crawler: CLI (listen for debugger) first, then start Slug-Ig-Crawler: Attach to debugpy so the process unblocks and you can hit breakpoints.


Core Components

1. Configuration Layer (config.py)

The configuration layer loads, validates, and processes settings from TOML files using Pydantic models.

Key Classes:

  • Config: Main configuration container that aggregates:

    • MainConfig: Scraping behavior settings (mode, batch size, retries, push_to_gcs, gcs_bucket_name, etc.)
    • DataConfig: File paths and data storage settings
    • LoggingConfig: Logging configuration
    • TraceConfig: thor_worker_id and related trace fields
  • ProfileTarget: Represents a single profile to scrape with name and num_posts fields

Key Functions:

  • load_config(path: str) -> Config:

    • Loads TOML file
    • Configures root logger
    • Returns validated Config object
  • expand_paths(section, substitutions, depth):

    • Expands path placeholders (e.g., {target_profile}, {date}, {datetime})
    • Resolves relative paths to absolute paths
    • Recursively processes nested configuration sections
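
A simplified stand-in for the placeholder expansion (illustrative; the real expand_paths also walks nested configuration sections):

from pathlib import Path

def expand_template(template: str, substitutions: dict[str, str]) -> str:
    # Replace {output_dir}, {target_profile}, {date}, {datetime} style tokens.
    for key, value in substitutions.items():
        template = template.replace("{" + key + "}", value)
    # Resolve relative paths to absolute paths.
    return str(Path(template).expanduser().resolve())

# expand_template("{output_dir}/{date}/{target_profile}/metadata_{target_profile}.jsonl",
#                 {"output_dir": "outputs", "date": "20240101", "target_profile": "username"})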

Configuration Structure:

[main]
mode = 1
target_profiles = [{ name = "username", num_posts = 10 }]
headless = false
batch_size = 2
fetch_comments = true

[data]
output_dir = "outputs"
cookie_file = "~/.slug/cookies/latest.json"
posts_path = "{output_dir}/{date}/{target_profile}/posts_{target_profile}_{datetime}.txt"
metadata_path = "{output_dir}/{date}/{target_profile}/metadata_{target_profile}.jsonl"

2. Pipeline Layer (pipeline.py)

The Pipeline class orchestrates the entire scraping workflow, managing the browser lifecycle and coordinating profile scraping.

Key Methods:

  • __init__(config_path: str):

    • Loads master configuration
    • Validates [trace].thor_worker_id (required for Pipeline)
    • Initializes SeleniumBackend
    • Creates GraphQLModelRegistry for parsing network responses
  • run() -> dict:

    • Starts the browser via backend.start()
    • Determines scraping mode (URL file if data.urls_filepath exists, else profile list; see Runtime mode selection)
    • Iterates through target profiles, calling _scrape_single_profile() for each
    • Ensures browser cleanup in finally block
  • _scrape_single_profile(profile_target: ProfileTarget) -> dict:

    • Creates profile-specific configuration copy
    • Expands path placeholders with profile name and datetime
    • Opens profile page via backend.open_profile()
    • Collects post URLs via backend.get_post_elements()
    • Scrapes posts in batches via backend.scrape_posts_in_batches()
    • Returns results dictionary with scraped_posts and skipped_posts
  • _scrape_from_url_file() -> dict:

    • Reads URLs from configured file
    • Filters out already processed URLs
    • Scrapes remaining URLs in batches

3. Backend Layer (backends/selenium_backend.py)

The SeleniumBackend class implements the Backend abstract interface, managing WebDriver lifecycle and browser automation.

Key Methods:

  • start():

    • Configures Chrome options (anti-detection, performance logging)
    • Environment-aware initialization:
      • Always: if CHROME_BIN / CHROMEDRIVER_BIN are set, those paths are used.
      • If use_docker=True: when the env vars are unset, falls back to the image’s pinned Linux paths; adds Docker-specific flags (--no-sandbox, --disable-dev-shm-usage, …)
      • If use_docker=False: when the env vars are unset, uses optional [main].chrome_binary_path / [main].chromedriver_binary_path, then built-in macOS defaults
    • Validates Chrome and ChromeDriver version compatibility
    • Initializes Chrome WebDriver with appropriate binary locations
    • Patches driver with patch_driver() for security monitoring
    • Sets up network tracking via CDP commands
    • Authenticates using cookies via _login_with_cookies()
    • Initializes ProfilePage object and HumanScroller
  • stop():

    • Stops screenshot worker thread
    • Quits WebDriver and closes all browser windows
    • Finalizes screenshots (if enabled): generates video, uploads to GCS, cleans up local files
  • _login_with_cookies():

    • Navigates to https://www.instagram.com/
    • Loads cookies from the file specified in [data].cookie_file
    • Adds cookies to WebDriver session
    • Refreshes page to apply authentication
  • open_profile(profile_handle: str):

    • Delegates to profile_page.navigate_to_profile()
  • get_post_elements(limit: int) -> Iterator[str]:

    • Attempts to load cached post URLs from posts_path
    • If no cache exists, calls profile_page.scroll_and_collect_() to scrape fresh URLs
    • Saves collected URLs to cache file
    • Filters out already processed URLs by loading from metadata_path
    • Returns iterator of post URL strings
  • scrape_posts_in_batches(post_elements, batch_size, save_every, ...):

    • Opens posts in batches using open_href_in_new_tab()
    • For each post, calls _scrape_and_close_tab() to extract data
    • Saves intermediate results via save_intermediate()
    • Periodically saves final results via save_scrape_results()
    • Implements rate limiting with random delays between batches (see the batching sketch after this list)
  • _scrape_and_close_tab(post_index, post_url, tab_handle, main_window_handle, debug):

    • Switches to post's tab
    • Extracts post metadata:
      • Title/header data via get_post_title_data()
      • Media (images/videos) via media_from_post_gpt() - handles carousel posts with improved robustness
      • Likes via get_section_with_highest_likes()
      • Comments via scrape_comments_with_gif() or _extract_comments_from_captured_requests()
    • Handles errors gracefully, returning error dictionaries
    • Ensures tab closure and window switching in finally block
  • _finalize_screenshots():

    • Shutdown-time artifact finalization (runs after browser shutdown, before process exit)
    • Generates MP4 video from all .webp screenshots in shot_dir (2.5 FPS, 640p height)
    • Uploads video to GCS bucket at gs://{bucket}/vid_log/{video_name}.mp4
    • Deletes all local screenshots and video file after successful upload
    • Works for both PROFILE (mode 1) and POST (mode 2) jobs
    • Errors are logged but don't block shutdown
  • _extract_comments_from_captured_requests(driver, config, batch_scrolls):

    • Uses ReplyExpander to expand comment threads
    • Captures GraphQL network requests via capture_instagram_requests()
    • Parses responses using GraphQLModelRegistry
    • Handles rate limiting with exponential backoff
    • Saves parsed comment data to post_entity_path
  • open_href_in_new_tab(href, tab_open_retries):

    • Executes JavaScript to open URL in new tab
    • Waits for new window handle to appear
    • Returns the new window handle
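
A minimal sketch of the batching-and-delay idea behind scrape_posts_in_batches (referenced from the rate-limiting bullet above; illustrative only, since the real method also manages tabs, retries, and intermediate saves):

import random
import time
from itertools import islice
from typing import Iterable, Iterator

def batched(urls: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    it = iter(urls)
    while batch := list(islice(it, batch_size)):
        yield batch

def scrape_in_batches(urls, scrape_one, batch_size=2, delay_min=2.0, delay_max=4.0):
    for batch in batched(urls, batch_size):
        for url in batch:
            scrape_one(url)  # open tab, extract data, close tab
        # Rate-limit between batches (rate_limit_seconds_min / rate_limit_seconds_max).
        time.sleep(random.uniform(delay_min, delay_max))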

4. Page Objects (pages/)

Page objects encapsulate page-specific interactions using the Page Object Model pattern.

base_page.py

Base class providing common WebDriver operations:

  • find(locator): Waits for and returns a single element
  • find_all(locator): Waits for and returns all matching elements
  • click(element): Clicks element using JavaScript
  • scroll_into_view(element): Scrolls element into viewport

profile_page.py

Handles Instagram profile page interactions:

  • navigate_to_profile(handle: str):

    • Constructs profile URL: https://www.instagram.com/{handle}/
    • Navigates to URL
    • Waits for page sections to load
  • get_visible_post_elements() -> List[WebElement]:

    • Finds post container elements using XPath
    • Extracts all <a> tags containing post links
    • Returns list of WebElement objects
  • scroll_and_collect_(limit: int) -> tuple[bool, List[str]]:

    • Scrolls profile page using HumanScroller
    • Collects unique post URLs from visible elements
    • Periodically captures GraphQL data via registry.get_posts_data()
    • Stops when limit reached or no new posts loaded
    • Returns tuple: (is_data_saved, list_of_urls)
  • extract_comments(steps):

    • Delegates to scrape_comments_with_gif() utility function

5. Data Extraction and Parsing

GraphQL Model Registry (models/registry_parser.py)

The GraphQLModelRegistry class parses GraphQL API responses captured from network requests.

Key Methods:

  • __init__(registry, schema_path):

    • Initializes model registry mapping patterns to Pydantic models
    • Loads flatten schema from YAML file
  • get_posts_data(config, data_keys, data_type):

    • Captures network requests via capture_instagram_requests()
    • Filters GraphQL responses matching data_keys
    • Parses responses using registered models
    • Flattens data according to schema rules
    • Saves parsed results to configured paths
    • Returns boolean indicating if data was saved
  • parse_responses(extracted, selected_data_keys, driver):

    • Parses list of captured network responses
    • Matches data keys to registered models
    • Validates and structures data using Pydantic models
    • Returns list of parsed results with flattened data

Utility Functions (utils.py)

Key extraction utilities:

  • capture_instagram_requests(driver, limit):

    • Retrieves Chrome performance logs
    • Filters requests containing api/v1 or graphql/query
    • Fetches response bodies via CDP Network.getResponseBody
    • Returns list of {requestId, url, request, response} dictionaries (a sketch of this capture pattern follows this list)
  • scrape_comments_with_gif(driver, config):

    • Scrolls comment section
    • Extracts comment text, author, likes, timestamps
    • Captures GIF/image URLs from comments
    • Returns list of comment dictionaries
  • get_section_with_highest_likes(driver):

    • Finds like count element using DOM traversal
    • Returns dictionary with likesNumber and likesText
  • media_from_post_gpt(driver):

    • Robust media extraction function that handles carousel posts
    • Returns tuple: (images_list, videos_list, img_vid_map)
    • Uses improved selectors that don't rely on fragile Instagram class names
    • Includes fallback mechanisms for single-image posts
    • Handles video extraction with proper curl command generation
    • Includes safety caps to prevent infinite loops in carousel navigation
  • save_intermediate(post_data, tmp_file):

    • Appends post data as JSON line to temporary file
  • save_scrape_results(results, output_dir, config):

    • Writes scraped posts to metadata_path as JSONL
    • Writes skipped posts to skipped_path
    • Clears temporary file
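
The sketch below (referenced from the capture_instagram_requests bullet above) shows the performance-log capture pattern; it is illustrative and omits the request payloads and most of the real filtering and error handling:

import json

def capture_graphql_responses(driver, markers=("api/v1", "graphql/query")):
    captured = []
    for entry in driver.get_log("performance"):
        message = json.loads(entry["message"])["message"]
        if message.get("method") != "Network.responseReceived":
            continue
        url = message["params"]["response"]["url"]
        if not any(marker in url for marker in markers):
            continue
        request_id = message["params"]["requestId"]
        try:
            body = driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id})
        except Exception:
            continue  # the body may no longer be available for this request
        captured.append({"requestId": request_id, "url": url, "response": body.get("body")})
    return captured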

6. Data Persistence

Local File Storage

Data is saved to local files in JSONL format:

  • metadata_path: Main output file with scraped post data
  • skipped_path: Log of posts that failed to scrape
  • tmp_path: Temporary file for intermediate results
  • post_entity_path: Parsed GraphQL entities (comments, posts)
  • profile_path: Profile page GraphQL data

Cloud Storage and Enqueueing (services/upload_enqueue.py)

The UploadAndEnqueue class handles cloud storage and database integration:

  • upload_and_enqueue(local_path, kind, ...):
    • Optionally sorts JSONL file by timestamp
    • Uploads file to Google Cloud Storage (GCS)
    • Enqueues GCS URI to PostgreSQL database via FileEnqueuer
    • Returns GCS URI string

Integration Points:

  • on_posts_batch_ready(local_jsonl_path): Called when profile data is ready
  • on_comments_batch_ready(local_jsonl_path): Called when comment data is ready
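
A minimal sketch of the upload-then-enqueue handoff (illustrative; the enqueue column names beyond the file URI and thor_worker_id are assumptions, and the real class also handles sorting and the push_to_gcs = 0 case):

import os
import psycopg
from google.cloud import storage

def upload_and_enqueue(local_path: str, bucket_name: str, object_name: str,
                       table: str, thor_worker_id: str) -> str:
    # Upload the JSONL artifact to GCS (uses GOOGLE_APPLICATION_CREDENTIALS / ADC).
    blob = storage.Client().bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)
    gcs_uri = f"gs://{bucket_name}/{object_name}"

    # Enqueue the URI for downstream processing (crawled_posts / crawled_comments).
    with psycopg.connect(
        host=os.environ.get("PUGSY_PG_HOST", "localhost"),
        port=os.environ.get("PUGSY_PG_PORT", "5432"),
        user=os.environ.get("PUGSY_PG_USER", "postgres"),
        password=os.environ.get("PUGSY_PG_PASSWORD", ""),
        dbname=os.environ.get("PUGSY_PG_DATABASE", "postgres"),
    ) as conn:
        conn.execute(  # column names here are assumed, not the actual schema
            f"INSERT INTO {table} (file_uri, thor_worker_id) VALUES (%s, %s)",
            (gcs_uri, thor_worker_id),
        )
    return gcs_uri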

7. Authentication (login_Save_cookie.py)

Standalone script for generating authentication cookies:

  • Opens Chrome browser to Instagram login page
  • Waits for user to manually log in
  • Saves cookies to pickle file: cookies_{timestamp}.pkl
  • Cookie file is referenced in config.toml for subsequent runs

End-to-End Workflow

High-Level Flow

  1. CLI Invocation: User runs Slug-Ig-Crawler --config config.toml
  2. Configuration Loading: Pipeline loads and validates TOML configuration
  3. Browser Initialization: SeleniumBackend.start() initializes Chrome WebDriver
  4. Authentication: Cookies are loaded and applied to browser session
  5. Profile Iteration: For each target profile:
    • Profile page is opened
    • Post URLs are collected (from cache or fresh scrape)
    • Posts are scraped in batches
  6. Data Extraction: For each post:
    • Post metadata is extracted (title, media, likes)
    • Comments are collected (via DOM scraping or GraphQL capture)
    • Data is saved to local files
  7. Cloud Upload: Completed data files are uploaded to GCS and enqueued
  8. Browser Shutdown: WebDriver is closed in finally block

Detailed Step-by-Step Execution

Phase 1: Initialization

  1. CLI (cli.py)

    • main() parses --config argument
    • Instantiates Pipeline(config_path)
  2. Pipeline (pipeline.py)

    • __init__() calls load_config(config_path) and validates [trace].thor_worker_id
    • Creates SeleniumBackend(self.master_config)
    • Initializes GraphQLModelRegistry with model registry and schema path
  3. Configuration (config.py)

    • load_config() reads TOML file
    • Configures root logger with level and directory
    • Returns Config object with nested Pydantic models
  4. Backend Initialization (selenium_backend.py)

    • Pipeline.run() calls backend.start()
    • Chrome options configured (headless, anti-detection, performance logging)
    • WebDriver binaries resolved with env overrides, then Docker image paths or local config/defaults
    • Driver patched with patch_driver() for security monitoring
    • Network tracking enabled via CDP commands
    • _login_with_cookies() loads and applies authentication cookies
    • ProfilePage object created

Phase 2: Profile Scraping

  1. Profile Navigation

    • Pipeline._scrape_single_profile() creates profile-specific config
    • Paths expanded with {target_profile}, {date}, {datetime} placeholders
    • backend.open_profile(profile_name) navigates to profile page
    • ProfilePage.navigate_to_profile() constructs URL and waits for sections
  2. Post URL Collection

    • backend.get_post_elements(limit) called
    • Attempts to load cached URLs from posts_path
    • If no cache: profile_page.scroll_and_collect_(limit):
      • Scrolls page using HumanScroller
      • Collects visible post elements
      • Extracts href attributes
      • Periodically captures GraphQL data via registry.get_posts_data()
      • Saves URLs to cache file
    • Filters out processed URLs by loading from metadata_path
    • Returns iterator of post URL strings
  3. Batch Scraping

    • backend.scrape_posts_in_batches() called with post URLs
    • For each batch:
      • Opens posts in new tabs via open_href_in_new_tab()
      • For each post tab:
        • Switches to tab
        • Calls _scrape_and_close_tab():
          • Extracts title via get_post_title_data()
          • Extracts media via media_from_post_gpt()
          • Extracts likes via get_section_with_highest_likes()
          • Extracts comments:
            • If scrape_using_captured_requests=True: _extract_comments_from_captured_requests()
            • Otherwise: scrape_comments_with_gif()
        • Saves intermediate result to tmp_path
        • Closes tab and switches back
      • After save_every posts: save_scrape_results() writes to metadata_path
      • Random delay between batches for rate limiting

Phase 3: Comment Extraction (GraphQL Mode)

  1. Comment Thread Expansion (if fetch_replies=True)

    • ReplyExpander clicks "View replies" buttons
    • Scrolls comment section to load more comments
    • Detects rate limiting via _handle_comment_load_error()
  2. Network Request Capture

    • capture_instagram_requests() retrieves Chrome performance logs
    • Filters GraphQL requests matching post_page_data_key
    • Fetches response bodies via CDP
  3. Data Parsing

    • registry.get_posts_data() calls parse_responses()
    • Matches data keys to registered Pydantic models
    • Validates and structures data
    • Flattens according to schema rules
    • Saves to post_entity_path as JSONL
  4. Cloud Upload

    • on_comments_batch_ready() called with post_entity_path
    • UploadAndEnqueue.upload_and_enqueue():
      • Sorts JSONL file by timestamp
      • Uploads to GCS bucket
      • Enqueues GCS URI to PostgreSQL

Phase 4: Cleanup

  1. Browser Shutdown
    • Pipeline.run() finally block calls backend.stop()
    • SeleniumBackend.stop():
      • Stops screenshot worker thread
      • Calls driver.quit() to close browser
      • If enable_screenshots=True: calls _finalize_screenshots():
        • Generates MP4 video from all screenshots (2.5 FPS, 640p height)
        • Uploads video to GCS at gs://{bucket}/vid_log/{video_name}.mp4
        • Deletes all local screenshots and video file
    • All browser windows closed

Sequence Diagram

The following Mermaid diagram illustrates the runtime interaction between major components:

sequenceDiagram
    participant User
    participant CLI as cli.py
    participant Pipeline as pipeline.py
    participant Config as config.py
    participant Backend as selenium_backend.py
    participant WebDriver as Selenium WebDriver
    participant ProfilePage as profile_page.py
    participant Registry as registry_parser.py
    participant Utils as utils.py
    participant Uploader as upload_enqueue.py

    User->>CLI: Slug-Ig-Crawler --config config.toml
    CLI->>Pipeline: Pipeline(config_path)
    Pipeline->>Config: load_config(config_path)
    Config-->>Pipeline: Config object
    Pipeline->>Backend: SeleniumBackend(config)
    Pipeline->>Registry: GraphQLModelRegistry(registry, schema_path)
    
    Pipeline->>Backend: start()
    Backend->>WebDriver: Initialize Chrome WebDriver
    Backend->>WebDriver: Configure options (anti-detection, logging)
    Backend->>WebDriver: patch_driver() for security
    Backend->>WebDriver: setup_network() via CDP
    Backend->>WebDriver: Navigate to instagram.com
    Backend->>Backend: _login_with_cookies()
    Backend->>WebDriver: Load cookies from file
    Backend->>WebDriver: Refresh page
    Backend->>ProfilePage: ProfilePage(driver, config)
    Backend-->>Pipeline: Browser ready

    loop For each target profile
        Pipeline->>Pipeline: _scrape_single_profile(profile_target)
        Pipeline->>Pipeline: Create profile-specific config
        Pipeline->>Pipeline: expand_paths() with substitutions
        Pipeline->>Backend: open_profile(profile_name)
        Backend->>ProfilePage: navigate_to_profile(handle)
        ProfilePage->>WebDriver: Navigate to /{handle}/
        ProfilePage->>WebDriver: wait_for_sections()
        
        Pipeline->>Backend: get_post_elements(limit)
        alt Cache exists
            Backend->>Backend: _load_cached_urls(posts_path)
        else No cache
            Backend->>ProfilePage: scroll_and_collect_(limit)
            loop Scroll and collect
                ProfilePage->>WebDriver: get_visible_post_elements()
                ProfilePage->>WebDriver: Scroll page
                ProfilePage->>Registry: get_posts_data(profile_page_data_key)
                Registry->>Utils: capture_instagram_requests()
                Utils->>WebDriver: get_log("performance")
                Utils-->>Registry: Network requests
                Registry->>Registry: parse_responses()
                Registry-->>ProfilePage: Data saved
            end
            Backend->>Backend: _save_urls(profile, urls, posts_path)
        end
        Backend->>Backend: _load_processed_urls(metadata_path)
        Backend-->>Pipeline: Iterator of post URLs

        Pipeline->>Backend: scrape_posts_in_batches(post_urls, batch_size)
        
        loop For each batch
            loop For each post in batch
                Backend->>Backend: open_href_in_new_tab(href)
                Backend->>WebDriver: execute_script("window.open(...)")
                Backend->>WebDriver: Wait for new window handle
                
                Backend->>Backend: _scrape_and_close_tab(...)
                Backend->>WebDriver: switch_to.window(tab_handle)
                Backend->>WebDriver: Refresh page
                
                alt scrape_using_captured_requests=True
                    Backend->>Backend: _extract_comments_from_captured_requests()
                    Backend->>Utils: find_comment_container(driver)
                    Backend->>Backend: ReplyExpander.expand_replies()
                    Backend->>Registry: get_posts_data(post_page_data_key)
                    Registry->>Utils: capture_instagram_requests()
                    Utils->>WebDriver: get_log("performance")
                    Utils-->>Registry: GraphQL responses
                    Registry->>Registry: parse_responses()
                    Registry->>Registry: Save to post_entity_path
                    Registry-->>Backend: Comments extracted
                    Backend->>Uploader: on_comments_batch_ready(post_entity_path)
                    Uploader->>Uploader: upload_and_enqueue(kind="comment")
                    Uploader->>Uploader: Sort JSONL file
                    Uploader->>Uploader: Upload to GCS
                    Uploader->>Uploader: Enqueue to PostgreSQL
                else Traditional scraping
                    Backend->>Utils: get_post_title_data(handle_slug)
                    Backend->>Utils: media_from_post_gpt(driver)
                    Backend->>Utils: get_section_with_highest_likes(driver)
                    Backend->>Utils: scrape_comments_with_gif(driver, config)
                end
                
                Backend->>Utils: save_intermediate(post_data, tmp_path)
                Backend->>WebDriver: close() tab
                Backend->>WebDriver: switch_to.window(main_handle)
            end
            
            alt save_every posts reached
                Backend->>Utils: save_scrape_results(results, output_dir, config)
                Utils->>Utils: Write to metadata_path
                Utils->>Utils: Write to skipped_path
                Utils->>Utils: clear_tmp_file(tmp_path)
            end
            
            Backend->>Backend: random_delay() between batches
        end
    end

    Pipeline->>Backend: stop()
    Backend->>Backend: Stop screenshot worker
    Backend->>WebDriver: quit()
    WebDriver-->>Backend: Browser closed
    alt enable_screenshots = true
        Backend->>Backend: _finalize_screenshots()
        Backend->>Backend: Generate video from screenshots
        Backend->>Uploader: Upload video to GCS (vid_log/)
        Backend->>Backend: Cleanup local screenshots and video
    end
    Backend-->>Pipeline: Cleanup complete
    Pipeline-->>CLI: Results dictionary
    CLI-->>User: Scraping complete

Configuration

Trace ([trace])

Pipeline requires a non-empty [trace].thor_worker_id in the config file used for a full run. It is used for structured logs, enqueue metadata, and Chrome profile suffixing. Orchestrators typically inject a job-specific id.

Configuration File Structure

The application uses TOML configuration files with the following structure:

[main]
mode = 1  # May be overwritten at runtime; see "Runtime mode selection"
target_profiles = [
    { name = "username1", num_posts = 10 },
    { name = "username2", num_posts = 5 }
]
headless = false
enable_screenshots = false  # Set to true to enable screenshot capture and video generation
use_docker = false  # Set to true when running in Docker
batch_size = 2
fetch_comments = true
fetch_replies = true
max_comments = 130
scrape_using_captured_requests = true
rate_limit_seconds_min = 2
rate_limit_seconds_max = 4
max_retries = 3
save_every = 2
gcs_bucket_name = "pugsy_ai_crawled_data"  # GCS bucket for video uploads (automatically sanitized if path-like)
consumer_id = "default_consumer"  # Consumer ID for video naming (automatically sanitized)

[data]
output_dir = "outputs"
shot_dir = "{output_dir}/{date}/screens"  # Screenshot directory (used for video generation)
cookie_file = "~/.slug/cookies/latest.json"
posts_path = "{output_dir}/{date}/{target_profile}/posts_{target_profile}_{datetime}.txt"
metadata_path = "{output_dir}/{date}/{target_profile}/metadata_{target_profile}.jsonl"
post_entity_path = "{output_dir}/{date}/{target_profile}/post_entity_{target_profile}_{datetime}.jsonl"
profile_path = "{output_dir}/{date}/{target_profile}/profile_data_{target_profile}_{datetime}.jsonl"
schema_path = "src/igscraper/flatten_schema.yaml"
post_page_data_key = [
    "xdt_api__v1__media__media_id__comments__connection",
    "xdt_api__v1__media__media_id__comments__parent_comment_id__child_comments__connection"
]
profile_page_data_key = ["xdt_api__v1__feed__user_timeline_graphql_connection"]

[logging]
level = "DEBUG"
log_dir = "outputs/logs"
log_format = "%(asctime)s [%(levelname)s/%(processName)s] %(name)s: %(message)s"
date_format = "%Y-%m-%d %H:%M:%S"

[trace]
thor_worker_id = "your-worker-or-job-id"

A full sanitized template is config.example.toml in the repository root.

Path Placeholders

Path strings support the following placeholders that are automatically expanded:

  • {output_dir}: Base output directory
  • {target_profile}: Current profile name
  • {date}: Current date in YYYYMMDD format
  • {datetime}: Current datetime in YYYYMMDD_HHMM format
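
A short sketch of how a substitutions mapping with these formats might be built (illustrative only):

from datetime import datetime

now = datetime.now()
substitutions = {
    "output_dir": "outputs",
    "target_profile": "username",
    "date": now.strftime("%Y%m%d"),           # YYYYMMDD
    "datetime": now.strftime("%Y%m%d_%H%M"),  # YYYYMMDD_HHMM
}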

External services and infrastructure

This section lists outbound integrations (cloud, database, HTTP) and what is required by the config schema vs required only when a code path runs.

Required TOML sections

load_config validates a Config with [main], [data], [logging], and [trace] only. There is no message queue or broker section.

Instagram and the browser (always for scraping)

Item Purpose
HTTPS to instagram.com (and related CDN domains) Selenium drives a real browser; there is no separate Instagram API key. Session auth uses [data].cookie_file (JSON cookies on disk).
GraphQL / XHR data Parsed from Chrome performance logs (captured requests), not from a standalone HTTP client to a documented public API.

Google Cloud Storage (when upload paths run)

SeleniumBackend constructs google.cloud.storage.Client() and uses [main].gcs_bucket_name for:

  • UploadAndEnqueue.upload_and_enqueue — uploads JSONL artifacts and enqueues (see PostgreSQL below). Triggered from on_posts_batch_ready / on_comments_batch_ready when those batches complete.
  • upload_video_to_gcs — when enable_screenshots is true, uploads the shutdown MP4 to the same bucket under vid_log/.

Setup: Application Default Credentials, or GOOGLE_APPLICATION_CREDENTIALS pointing to a service account JSON with write access to the configured bucket. Without valid credentials, these steps fail when executed.

Path rule: services/upload_enqueue.py builds object names from local paths that contain the marker /outputs/ (default GcsUploadConfig.outputs_marker). Typical layouts use something like .../outputs/<date>/... so uploads resolve correctly.
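
One plausible sketch of that rule (the exact object-name derivation is an assumption based on the marker description above):

def object_name_from_local_path(local_path: str, marker: str = "/outputs/") -> str:
    # Everything after the /outputs/ marker becomes the GCS object name.
    idx = local_path.find(marker)
    if idx == -1:
        raise ValueError(f"local path does not contain the {marker!r} marker: {local_path}")
    return local_path[idx + len(marker):]

# object_name_from_local_path("/job/outputs/20240101/username/metadata_username.jsonl")
# -> "20240101/username/metadata_username.jsonl"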

PostgreSQL (when enqueue runs)

igscraper/services/enqueue_client.py FileEnqueuer inserts rows after a successful GCS upload, using psycopg with DSN from environment. Env files are loaded in order: ~/.slug/.env (if present), then ENV_FILE or .env in the current working directory (project file overrides the cache file for duplicate keys).

Variable Role (defaults in code)
PUGSY_PG_HOST Host (localhost)
PUGSY_PG_PORT Port (5432 default; use 5433 if Postgres listens on a Docker-mapped port)
PUGSY_PG_USER User (postgres when unset on Linux; on macOS defaults to your login — Homebrew often has no postgres role)
PUGSY_PG_PASSWORD Password (empty default)
PUGSY_PG_DATABASE Database name (postgres when unset — typical local default; override for production)

Tables: crawled_posts and crawled_comments (see docstring in enqueue_client.py for expected columns, including thor_worker_id).
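
A simplified sketch of the documented env-file precedence (~/.slug/.env first, then ENV_FILE or .env in the working directory, with the project file winning on duplicate keys); the parsing here is deliberately minimal and not the real loader:

import os
from pathlib import Path

def load_env_files() -> dict[str, str]:
    merged: dict[str, str] = {}
    cache_env = Path.home() / ".slug" / ".env"
    project_env = Path(os.environ.get("ENV_FILE", ".env"))
    for env_file in (cache_env, project_env):  # later files override earlier ones
        if not env_file.is_file():
            continue
        for line in env_file.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                merged[key.strip()] = value.strip()
    return merged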

Full-video download script (in-process)

When scrape_using_captured_requests is false and DOM media extraction yields videos, write_and_run_full_download_script in services/full_media_download_script.py runs in the same process as the pipeline (it writes a bash script under the media path and optionally executes it). No Redis, Celery, or separate worker is used.

Other HTTP (requests)

Helpers in utils.py / downloader.py may use requests for ancillary downloads (e.g. media URLs). Those are not separate “API accounts”; they use normal HTTPS when those code paths run.


Data Models and Parsing

GraphQL Model Registry

The application uses a registry-based approach to parse GraphQL API responses:

  1. Model Registration: Pydantic models are registered with regex patterns matching GraphQL data keys
  2. Network Capture: Chrome performance logs are captured to extract GraphQL responses
  3. Pattern Matching: Data keys are matched against registered patterns
  4. Validation: Responses are validated and structured using Pydantic models
  5. Flattening: Data is flattened according to schema rules defined in flatten_schema.yaml
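
A minimal sketch of the registry idea (the model, pattern, and key names are illustrative, not the project’s actual registry):

import re
from pydantic import BaseModel

class CommentsConnection(BaseModel):
    edges: list[dict] = []

REGISTRY = {
    r".*comments__connection$": CommentsConnection,
}

def parse_payload(data_key: str, payload: dict) -> BaseModel | None:
    for pattern, model in REGISTRY.items():
        if re.match(pattern, data_key):
            return model.model_validate(payload)  # Pydantic v2 validation
    return None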

Flatten Schema

The flatten_schema.yaml file defines rules for extracting and flattening nested GraphQL data structures. It specifies:

  • Which keys to extract from responses
  • How to flatten nested objects
  • Field mappings and transformations

Authentication

Cookie Generation

Before running the scraper, authentication cookies must be generated. The recommended flow is the Slug-Ig-Crawler save-cookie command (see Entry Point: CLI); the legacy standalone script works as follows:

  1. Run python src/igscraper/login_Save_cookie.py
  2. A Chrome browser window opens to Instagram login page
  3. Manually log in to your Instagram account
  4. Press Enter in the terminal
  5. Cookies are saved to src/igscraper/cookies_{timestamp}.pkl

Cookie Usage

During scraping:

  1. SeleniumBackend.start() calls _login_with_cookies()
  2. Browser navigates to https://www.instagram.com/
  3. Cookies are loaded from the configured cookie file ([data].cookie_file)
  4. Cookies are added to the WebDriver session
  5. Page is refreshed to apply authentication
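
A minimal sketch of that flow using standard Selenium calls (the JSON cookie layout, a list of cookie dicts, is an assumption based on the save-cookie output described earlier):

import json

def login_with_cookies(driver, cookie_file: str) -> None:
    driver.get("https://www.instagram.com/")  # must be on the domain before adding cookies
    with open(cookie_file) as fh:
        cookies = json.load(fh)
    for cookie in cookies:
        cookie.pop("sameSite", None)  # some exported fields can be rejected by add_cookie
        driver.add_cookie(cookie)
    driver.refresh()  # apply the authenticated session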

Docker and Docker Compose

The scraper supports running in Docker containers, providing a consistent environment across different platforms and simplifying deployment.

Docker Support

Docker support is controlled via the use_docker configuration option in config.toml:

[main]
use_docker = true  # Set to true when running in Docker

When use_docker=True, the backend:

  • Uses CHROME_BIN / CHROMEDRIVER_BIN when set; otherwise the same pinned paths as the Dockerfile (/opt/chrome-linux64/chrome, /opt/chromedriver-linux64/chromedriver)
  • Applies Docker-specific Chrome flags: --no-sandbox, --disable-dev-shm-usage, --disable-gpu
  • Uses /tmp/chrome-profile as the Chrome user data directory (can be overridden via IGSCRAPER_CHROME_PROFILE env var)
  • Configures platform identity as "Linux x86_64"
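
A minimal sketch of that setup using the standard Selenium options API (binary paths mirror the documented Dockerfile defaults; the real backend adds more flags and version checks):

import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.binary_location = os.environ.get("CHROME_BIN", "/opt/chrome-linux64/chrome")
for flag in ("--no-sandbox", "--disable-dev-shm-usage", "--disable-gpu"):
    options.add_argument(flag)
options.add_argument("--user-data-dir=" + os.environ.get("IGSCRAPER_CHROME_PROFILE", "/tmp/chrome-profile"))

service = Service(os.environ.get("CHROMEDRIVER_BIN", "/opt/chromedriver-linux64/chromedriver"))
driver = webdriver.Chrome(service=service, options=options)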

Dockerfile

The project includes a Dockerfile that:

  • Uses Python 3.10 slim base image
  • Installs Chrome for Testing (version-locked to 143.0.7499.170) and matching ChromeDriver
  • Installs all Python dependencies from requirements.txt
  • Sets up environment variables for Chrome binaries
  • Includes version validation to ensure Chrome and ChromeDriver major versions match

Key Dockerfile Features:

  • Version-locked Chrome installation for reproducibility
  • Hard assertion that Chrome and ChromeDriver major versions match
  • Proper Chrome runtime dependencies installed
  • Optimized for Linux x86_64 platform

Docker Compose

The repository includes a canonical docker-compose.yml (service name Slug-Ig-Crawler, image built from this Dockerfile) for local and manual runs. Thor and other orchestrators do not ship this file; they use whatever path is in DOCKER_COMPOSE_FILE and typically run one-off jobs like:

docker compose -f /path/to/compose.yml run --rm -v "$WORKSPACE:/job" Slug-Ig-Crawler \
  Slug-Ig-Crawler --config /job/config.toml

The compose file in this repo sets PYTHONPATH, CHROME_BIN, CHROMEDRIVER_BIN, and shm_size: 2gb to match the image. Optional host-specific variables (GCS credentials, etc.) can be passed with -e or via a local .env (see .env.example; use e.g. docker compose --env-file .env … if you add one).

Usage (examples):

docker compose build
docker compose run --rm Slug-Ig-Crawler Slug-Ig-Crawler --config config.toml

Prerequisites:

  • Docker and Docker Compose installed
  • Valid config.toml with use_docker = true for in-container runs
  • For GCS upload from the container, mount credentials and set GOOGLE_APPLICATION_CREDENTIALS as appropriate

Important Notes:

  • The Chrome profile directory defaults to /tmp/chrome-profile (RAM-mounted on remote servers) and is automatically created if it doesn't exist. Can be overridden via IGSCRAPER_CHROME_PROFILE environment variable.
  • Shared memory size (shm_size) in compose is set to reduce Chrome crashes in containers

Data Persistence

Local Storage

Data is persisted to local files in JSONL (JSON Lines) format:

  • Metadata File: Contains complete post data including title, media, likes, comments
  • Skipped File: Logs posts that failed to scrape with error reasons
  • Post Entity File: Parsed GraphQL entities (comments, replies) with flattened structure
  • Profile File: Profile page GraphQL data

Cloud Storage Integration

Completed data files are automatically:

  1. Sorted: JSONL files are sorted by timestamp (optional)
  2. Uploaded: Files are uploaded to Google Cloud Storage (GCS)
  3. Enqueued: GCS URIs are enqueued to PostgreSQL database for downstream processing

Screenshot Video Finalization

When enable_screenshots = true in configuration, the scraper automatically:

  1. Captures Screenshots: Takes periodic screenshots (every 7 seconds) during scraping, saved as .webp files in shot_dir
  2. Generates Video: At shutdown, converts all screenshots into a single MP4 video:
    • FPS: 2.5 frames per second
    • Resolution: 640p height (width auto-scaled to preserve aspect ratio)
    • Format: MP4 (H.264 codec)
    • Location: Generated in-place in the screenshot directory
  3. Uploads to GCS: Video is uploaded to gs://{bucket}/vid_log/{video_name}.mp4
    • PROFILE mode: profile_{consumer_id}_{profile_name}_{timestamp}.mp4
    • POST mode: post_{consumer_id}_{run_name}_{timestamp}.mp4
    • Bucket Name Validation: The bucket name is automatically sanitized and validated:
      • Handles path-like bucket names (e.g., /app/pugsy_ai_crawled_data → pugsy_ai_crawled_data)
      • Removes gs:// prefix if present
      • Validates GCS bucket name format (must start/end with letter/number, 3-63 chars)
      • Works correctly in both local and Docker environments
  4. Cleans Up: After successful upload (or on failure), all local screenshots and the video file are deleted
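
A minimal sketch of the screenshots-to-MP4 step using imageio (requires the [video] extra; frame resizing and output naming are simplified relative to the real finalizer):

from pathlib import Path
import imageio.v2 as imageio

def screenshots_to_mp4(shot_dir: str, out_path: str, fps: float = 2.5) -> bool:
    frames = sorted(Path(shot_dir).glob("*.webp"))
    if len(frames) < 2:  # fewer than two screenshots: video generation is skipped
        return False
    with imageio.get_writer(out_path, fps=fps) as writer:
        for frame in frames:
            writer.append_data(imageio.imread(frame))
    return True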

Requirements:

  • At least 2 screenshots must exist (otherwise video generation is skipped)
  • gcs_bucket_name must be configured in config.toml
  • Video finalization runs automatically during shutdown (no manual intervention needed)

Configuration:

  • Bucket Name: Can be specified as:
    • Simple name: gcs_bucket_name = "pugsy_ai_crawled_data"
    • With gs:// prefix: gcs_bucket_name = "gs://pugsy_ai_crawled_data" (prefix is automatically removed)
    • Path-like values are handled: If the config value looks like a path (e.g., /app/pugsy_ai_crawled_data), the basename is extracted automatically
  • Consumer ID: Used in video filename for identification
  • Profile/Run Names: Automatically sanitized to remove invalid filename characters

Error Handling:

  • Video generation failures are logged but don't block shutdown
  • GCS upload failures are logged but cleanup still runs
  • Missing configuration fields result in skipped video generation (with warnings)
  • Invalid bucket names are validated and sanitized automatically, with clear error messages if sanitization fails

File Formats

Metadata JSONL Format:

{
  "post_url": "https://www.instagram.com/p/ABC123/",
  "post_id": "post_0",
  "post_title": {
    "aHref": "/username/",
    "timeDatetime": "2024-01-01T12:00:00.000Z",
    "siblingTexts": ["Post caption text"]
  },
  "post_media": [...],
  "likes": {
    "likesNumber": 1000,
    "likesText": "1,000 likes"
  },
  "post_comments_gif": [...]
}

Key Design Patterns

  1. Page Object Model: Page interactions are encapsulated in BasePage and ProfilePage classes
  2. Backend Abstraction: Backend abstract base class allows for different browser automation backends
  3. Configuration Management: Pydantic models provide type-safe configuration with validation
  4. Registry Pattern: GraphQL models are registered and matched dynamically
  5. Batch Processing: Posts are processed in configurable batches to manage memory and rate limiting
  6. Error Handling: Comprehensive try-except blocks with logging ensure graceful failure handling
  7. Resource Cleanup: finally blocks ensure browser cleanup even on errors
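
An illustrative-only sketch of patterns 2 and 4 above; the class names, decorator, and query key are hypothetical and are not the repo's actual API:

from abc import ABC, abstractmethod

class Backend(ABC):
    """Pattern 2: any browser automation backend exposes the same interface."""
    @abstractmethod
    def scrape_profile(self, handle):
        ...

class SeleniumBackend(Backend):
    def scrape_profile(self, handle):
        # The real backend drives the browser; this stub only shows the shape.
        return {"handle": handle, "posts": []}

# Pattern 4: GraphQL response models register themselves and are matched dynamically.
GRAPHQL_MODELS = {}

def register_model(query_key):
    def decorator(cls):
        GRAPHQL_MODELS[query_key] = cls
        return cls
    return decorator

@register_model("comments_query")  # hypothetical key
class CommentsModel:
    def __init__(self, payload):
        self.payload = payload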

Dependencies

Key external dependencies:

  • selenium: WebDriver automation
  • seleniumwire: Network request interception
  • pydantic: Configuration validation
  • google-cloud-storage: GCS upload functionality
  • psycopg2: PostgreSQL database connectivity
  • imageio and imageio-ffmpeg: Video generation from screenshots
  • Pillow: Image processing for screenshot resizing

Security Considerations

  1. Platform policy and law: Technical mitigations below do not replace compliance with Instagram / Meta terms or applicable law—see Open source, research use, and acceptable use.
  2. URL Validation: chrome.py patches WebDriver methods to monitor for suspicious navigation
  3. Cookie Security: Cookies are stored locally and never exposed in logs
  4. Rate Limiting: Random delays and batch processing reduce detection risk (a minimal delay sketch follows this list)
  5. Anti-Detection: Chrome options configured to evade bot detection
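
A minimal sketch of the randomized delay in point 4, assuming the rate_limit_seconds_min / rate_limit_seconds_max keys referenced under Troubleshooting; the helper name is illustrative:

import random
import time

def rate_limit_sleep(rate_limit_seconds_min, rate_limit_seconds_max):
    """Sleep a random amount between the configured bounds and return the delay used."""
    delay = random.uniform(rate_limit_seconds_min, rate_limit_seconds_max)
    time.sleep(delay)
    return delay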

Troubleshooting

Common Issues

  1. ChromeDriver not found:
    • Set CHROME_BIN and CHROMEDRIVER_BIN to override in any mode; otherwise, local runs fall back to optional TOML paths or macOS defaults, while Docker runs use the image’s pinned paths (a minimal resolution sketch follows this list)
  2. Version mismatch: Chrome and ChromeDriver major versions must match. The Dockerfile validates this automatically.
  3. Cookie authentication fails: Regenerate cookies using login_Save_cookie.py. In Docker, ensure cookies are in the mounted Chrome profile directory.
  4. Rate limiting: Increase rate_limit_seconds_min and rate_limit_seconds_max in config
  5. Memory issues: Reduce batch_size to process fewer posts simultaneously. In Docker, adjust mem_limit and mem_reservation in docker-compose.yml if needed.
  6. Docker-specific issues:
    • Ensure use_docker = true in config when running in Docker
    • Check that shared memory size is sufficient (default is 2GB)
    • Verify all required volumes are properly mounted
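
A minimal sketch of the binary-resolution order from item 1 (environment override, then optional TOML paths, then platform defaults); the fallback paths and function name are illustrative:

import os

def resolve_chrome_binaries(toml_chrome=None, toml_driver=None):
    chrome = (os.environ.get("CHROME_BIN")
              or toml_chrome
              or "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome")
    driver = (os.environ.get("CHROMEDRIVER_BIN")
              or toml_driver
              or "/usr/local/bin/chromedriver")
    return chrome, driver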

Debugging

  • Set headless = false to observe browser behavior
  • Set logging.level = "DEBUG" for verbose logging
  • Validate TOML (including [trace].thor_worker_id) against config.example.toml before long runs

Performance Timing & Observability

The scraper emits structured timing logs for performance analysis, bottleneck detection, and cost modeling. These logs are designed to be Prometheus/Loki-friendly and provide insights into both active processing time and total wall-clock time.

Timing Metrics

The scraper tracks two distinct timing metrics:

Total Time

Measures end-to-end wall time from when a unit of work becomes eligible for processing until it completes (success or error). This includes:

  • Scrolling operations
  • Clicking actions
  • DOM waits
  • Network waits
  • Explicit rate-limit sleeps
  • Retries and backoff delays

Total time reflects real-world scraping latency as experienced by the system.

Active Time

Measures intentional work time spent on actual scraping operations. This includes:

  • Scrolling actions
  • Clicking
  • DOM extraction
  • Media extraction
  • GraphQL capture and parsing

Active time excludes:

  • Explicit sleep() calls
  • Idle polling loops
  • Backoff delays

Active time answers: "How expensive is this profile or post to scrape?"

Required Invariant: active_time_ms <= total_time_ms (always enforced)
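
A minimal sketch of how the two timers could be kept, using time.perf_counter() as noted under Implementation Details; the class and method names are illustrative, not the repo's actual API:

import time

class PipelineTimer:
    """Tracks total wall time plus accumulated active time for one unit of work."""

    def __init__(self):
        self._start = time.perf_counter()
        self._active = 0.0
        self._active_start = None

    def begin_active(self):
        self._active_start = time.perf_counter()

    def end_active(self):
        if self._active_start is not None:
            self._active += time.perf_counter() - self._active_start
            self._active_start = None

    def durations_ms(self):
        total_ms = int((time.perf_counter() - self._start) * 1000)
        active_ms = min(int(self._active * 1000), total_ms)  # enforce active <= total
        return total_ms, active_ms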

Log Events

Two separate structured log events are emitted for each tracked operation:

  1. pipeline_total_time - Total wall-clock time
  2. pipeline_active_time - Active processing time

These events are never combined into a single log entry. Each event is emitted independently with the same schema.

Log Schema

Both timing events use the following structured schema (emitted as JSON):

Field Value
event pipeline_active_time OR pipeline_total_time
category creator_profile OR creator_content
creator_handle Instagram profile handle
content_id Post/Reel ID or URL slug, or null for profile
pipeline Fixed value: "Slug-Ig-Crawler"
duration_ms Integer milliseconds
status "success" or "error"
error_type Exception class name or null
consumer_id Consumer ID from config (or null if not set)
thor_worker_id Value from [trace].thor_worker_id in config
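
A minimal sketch of emitting both events with this schema through the standard logging module; the logger name and helper signature are illustrative:

import json
import logging

logger = logging.getLogger("slug_ig_crawler.timing")

def emit_timing(event, category, creator_handle, content_id, duration_ms,
                status, error_type=None, consumer_id=None, thor_worker_id=None):
    logger.info(json.dumps({
        "event": event,                    # pipeline_total_time or pipeline_active_time
        "category": category,              # creator_profile or creator_content
        "creator_handle": creator_handle,
        "content_id": content_id,          # null for profile-level events
        "pipeline": "Slug-Ig-Crawler",
        "duration_ms": duration_ms,
        "status": status,
        "error_type": error_type,
        "consumer_id": consumer_id,
        "thor_worker_id": thor_worker_id,
    }))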

Timing Levels

Profile-Level Timing

Location: pipeline.py - _scrape_single_profile()

Scope: Wraps the entire profile scraping execution, including:

  • Profile navigation
  • Post URL collection
  • Batch post scraping

Category: creator_profile

Example Log Entries:

{"event": "pipeline_total_time", "category": "creator_profile", "creator_handle": "example_user", "content_id": null, "pipeline": "Slug-Ig-Crawler", "duration_ms": 125000, "status": "success", "error_type": null, "consumer_id": "default_consumer"}
{"event": "pipeline_active_time", "category": "creator_profile", "creator_handle": "example_user", "content_id": null, "pipeline": "Slug-Ig-Crawler", "duration_ms": 95000, "status": "success", "error_type": null, "consumer_id": "default_consumer"}

Post/Reel-Level Timing

Location: selenium_backend.py - _scrape_and_close_tab()

Scope: Wraps the full lifecycle of scraping one post/reel, including:

  • Tab switching
  • Title/metadata extraction
  • Media extraction
  • Likes extraction
  • Comments extraction

Category: creator_content

Content ID: Uses post shortcode if available, otherwise falls back to post URL slug.

Example Log Entries:

{"event": "pipeline_total_time", "category": "creator_content", "creator_handle": "example_user", "content_id": "ABC123xyz", "pipeline": "Slug-Ig-Crawler", "duration_ms": 8500, "status": "success", "error_type": null, "consumer_id": "default_consumer"}
{"event": "pipeline_active_time", "category": "creator_content", "creator_handle": "example_user", "content_id": "ABC123xyz", "pipeline": "Slug-Ig-Crawler", "duration_ms": 6200, "status": "success", "error_type": null, "consumer_id": "default_consumer"}

Error Handling

Timing logs always emit, even on failure:

  • On exception: status = "error", error_type = <exception class name>
  • After logging: Exception is re-raised (never swallowed)
  • Both total and active time are recorded up to the point of failure

Implementation Details

  • Clock: Uses time.perf_counter() (monotonic clock) for precise measurements
  • Precision: Durations converted to integer milliseconds
  • Independence: Active and total time are measured independently with separate timers
  • Placement: Timers wrap existing function boundaries, avoiding tight inner loops

Use Cases

These timing logs enable:

  1. Latency Analysis: Understand real-world scraping performance across profiles and posts
  2. Bottleneck Detection: Identify slow operations by comparing active vs total time
  3. Cost Modeling: Estimate resource costs based on active processing time
  4. Prometheus/Loki Ingestion: Structured JSON format is ready for log aggregation systems
  5. Performance Regression Detection: Track timing trends over time

Example Analysis

# Extract profile-level timings
grep "pipeline_total_time.*creator_profile" scraper_log_*.log | jq '.duration_ms'

# Compare active vs total time for posts
grep "pipeline_.*_time.*creator_content" scraper_log_*.log | jq '{event, duration_ms}' | jq -s 'group_by(.event)'

Project details


Download files


Source Distribution

slug_ig_crawler-2.2.28.tar.gz (225.2 kB)

Uploaded Source

Built Distribution


slug_ig_crawler-2.2.28-py3-none-any.whl (196.8 kB)

Uploaded Python 3

File details

Details for the file slug_ig_crawler-2.2.28.tar.gz.

File metadata

  • Download URL: slug_ig_crawler-2.2.28.tar.gz
  • Upload date:
  • Size: 225.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for slug_ig_crawler-2.2.28.tar.gz
Algorithm Hash digest
SHA256 da86eadba1cfbf7effe5a0345367e7c103fce020e5424c59d969fcdf70e2bed9
MD5 5041b0e2b71beabb25a4e438266dbbba
BLAKE2b-256 62d5fba12f1b4ee0efeb357e08be4538b0901d1c94a19346d550a1a2e5d7a58f


Provenance

The following attestation bundles were made for slug_ig_crawler-2.2.28.tar.gz:

Publisher: publish-pypi.yml on Pugsyfy/Slug-IG-Crawler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file slug_ig_crawler-2.2.28-py3-none-any.whl.

File metadata

File hashes

Hashes for slug_ig_crawler-2.2.28-py3-none-any.whl
Algorithm Hash digest
SHA256 7bb52dc5c4031466690171340eaff0b30d835527bc22ab8f36f7d74a50ccc0aa
MD5 4f26f8551868d5306ee1b1e877b19382
BLAKE2b-256 351887f59d41d4466004e82d9f966b1b7b1658084c0e1dbccfefd8937b00bbd1


Provenance

The following attestation bundles were made for slug_ig_crawler-2.2.28-py3-none-any.whl:

Publisher: publish-pypi.yml on Pugsyfy/Slug-IG-Crawler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
