High-performance semantic search with intelligent company grouping and parallel execution

Smart Batching Search

A high-performance semantic search system that reduces API queries by 67-99% (varies by topic specificity) through intelligent company grouping and parallel execution.

This module provides a two-step system for efficient semantic search:

1. Planning: organize the search using smart batching and return a chunk upper bound estimate
2. Execution: perform the search with proportional sampling to preserve the distribution of results

Key Benefits

67-99% Query Reduction: Search 4,732 companies with only 17-3,699 queries (varies by topic)
Parallel Execution: Rate-limited concurrent requests with semaphore control
Proportional Sampling: Retrieve a chosen percentage of results while preserving their distribution
Production Ready: Comprehensive error handling, retries, and logging
Scalable: Efficiently handles universes with 10,000+ companies

Installation

Install the package from PyPI (Python 3.11+):

pip install bigdata-smart-batching

With uv:

uv add bigdata-smart-batching

Development

To work on this repository locally, from the project root:

uv sync

Environment Setup

Set up environment variables:

export BIGDATA_API_KEY="your_api_key_here"
export BIGDATA_API_BASE_URL="https://api.bigdata.com"  # Optional, defaults to this

Or create a .env file:

BIGDATA_API_KEY=your_api_key_here
BIGDATA_API_BASE_URL=https://api.bigdata.com

Universe CSV file

plan_search() loads company entity IDs from a UTF-8 CSV. IDs must match the identifiers used by the Bigdata API for your dataset. Two layouts are supported:

1. Header row with an id column (optional extra columns such as name are ignored):

id,name
B8EF97,Example Corp A
BB07E4,Example Corp B
3461CF,Example Corp C
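
To check a universe file before planning, the header-row layout above can be parsed with the standard library. This is a sketch only: `read_entity_ids` is a hypothetical stand-in, not the package's `load_universe_from_csv()`, and dropping blank and repeated IDs is an assumption about what a loader would want.

```python
import csv
import io


def read_entity_ids(handle) -> list[str]:
    """Return the non-empty values of the `id` column, in file order, without repeats."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "id" not in reader.fieldnames:
        raise ValueError("expected a header row containing an 'id' column")
    seen: set[str] = set()
    ids: list[str] = []
    for row in reader:
        entity_id = (row["id"] or "").strip()
        if entity_id and entity_id not in seen:
            seen.add(entity_id)
            ids.append(entity_id)
    return ids


sample = io.StringIO(
    "id,name\n"
    "B8EF97,Example Corp A\n"
    "BB07E4,Example Corp B\n"
    "3461CF,Example Corp C\n"
)
print(read_entity_ids(sample))  # ['B8EF97', 'BB07E4', '3461CF']
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` as the handle.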

Quick Start

from bigdata_smart_batching import (
    plan_search,
    execute_search,
    deduplicate_documents,
    convert_to_dataframe,
)

# Step 1: Plan the search
plan = plan_search(
    text="earnings revenue profit",
    universe_csv_path="id_name_mapping_us_top_3000.csv",
    start_date="2023-01-01",
    end_date="2023-12-31",
    api_key="your_api_key",  # or set BIGDATA_API_KEY env var
)

print(f"Chunk upper bound estimate: {plan['chunk_upper_bound_estimate']:,}")

# Step 2: Execute search with 10% of total chunks (preserves distribution)
results_raw = execute_search(
    search_plan=plan,
    chunk_percentage=0.1,
    requests_per_minute=100,
)

# Step 3: Deduplicate and convert to DataFrame
results = deduplicate_documents(results_raw)
print(f"Retrieved {len(results)} documents (deduplicated)")

df = convert_to_dataframe(results)  # one row per chunk

Save and Load Plans

from bigdata_smart_batching import plan_search, execute_search, save_plan, load_plan

# Create and save a plan
plan = plan_search(
    text="merger acquisition",
    universe_csv_path="id_name_mapping_us_top_3000.csv",
    start_date="2023-01-01",
    end_date="2023-12-31",
)
save_plan(plan, "my_search_plan.json")

# Later: reload and run with different sampling
plan = load_plan("my_search_plan.json")
raw_10 = execute_search(plan, chunk_percentage=0.1)
raw_50 = execute_search(plan, chunk_percentage=0.5)

How It Works

Architecture Overview

Step 1: PLANNING
  Universe CSV  -->  Co-mention API Query  -->  Basket Creation  -->  Search Plan

Step 2: EXECUTION
  Proportional Sampling  -->  Parallel Search (Rate Limited)  -->  Collect & Aggregate

Planning (`plan_search()`)

Loads the universe of companies from the CSV
Queries the co-mention endpoint to get chunk volumes per company
Splits a company's date range by volume when the company exceeds the chunk limit
Creates optimized baskets of companies grouped by volume
Returns a plan containing the chunk upper bound estimate and the basket configurations
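
The basket-creation step can be illustrated with a simple greedy packing. This is a sketch under assumptions, not the package's actual algorithm: `build_baskets` is a hypothetical name, the cap of 1000 mirrors the `max_chunks_per_basket` default listed under Default Settings, and the IDs `AAAAAA` and `BBBBBB` are made up.

```python
def build_baskets(
    volumes: dict[str, int],
    max_chunks_per_basket: int = 1000,
) -> list[list[str]]:
    """Visit companies from highest to lowest volume and start a new basket
    whenever adding the next company would push the basket past the cap."""
    baskets: list[list[str]] = []
    current: list[str] = []
    current_total = 0
    ordered = sorted(volumes.items(), key=lambda item: item[1], reverse=True)
    for entity_id, volume in ordered:
        if volume == 0:
            continue  # nothing to retrieve for this company
        if current and current_total + volume > max_chunks_per_basket:
            baskets.append(current)
            current, current_total = [], 0
        current.append(entity_id)
        current_total += volume
    if current:
        baskets.append(current)
    return baskets


volumes = {"B8EF97": 700, "BB07E4": 400, "3461CF": 350, "AAAAAA": 200, "BBBBBB": 0}
print(build_baskets(volumes))
# [['B8EF97'], ['BB07E4', '3461CF', 'AAAAAA']]
```

In this sketch, four companies with results collapse into two baskets, which is the kind of grouping that lets one request cover several companies. A company whose volume alone exceeds the cap still ends up in a basket of its own here; the README handles that case by splitting the date range instead.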

Execution (`execute_search()`)

Calculates proportional chunks per basket
Guarantees at least 1 chunk for every basket whose expected volume is greater than 0
Executes searches in parallel, with rate limiting and a semaphore capping concurrency
Collects and returns document results
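
The first two steps amount to a small allocation rule, sketched below. The function name `allocate_chunks` and the choice to round down are assumptions; the README only states that the allocation is proportional and that any basket with expected results gets at least one chunk.

```python
import math


def allocate_chunks(expected: list[int], chunk_percentage: float) -> list[int]:
    """Give each basket its proportional share, with a floor of 1 for
    any basket that is expected to contain results."""
    if not 0.0 <= chunk_percentage <= 1.0:
        raise ValueError("chunk_percentage must be between 0.0 and 1.0")
    allocation = []
    for volume in expected:
        if volume <= 0:
            allocation.append(0)  # empty baskets stay empty
        else:
            allocation.append(max(1, math.floor(volume * chunk_percentage)))
    return allocation


print(allocate_chunks([1000, 400, 5, 0], 0.1))  # [100, 40, 1, 0]
```

The third basket shows why the minimum matters: 10% of 5 chunks rounds down to 0, which would drop that basket's companies from the sample entirely.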

API Reference

`plan_search()`

Parameter	Type	Default	Description
`text`	`str`	required	Search query text
`universe_csv_path`	`str`	required	Path to CSV with entity IDs
`start_date`	`str`	required	Start date (YYYY-MM-DD)
`end_date`	`str`	required	End date (YYYY-MM-DD)
`api_key`	`str`	`BIGDATA_API_KEY` env var	Bigdata API key
`api_base_url`	`str`	`BIGDATA_API_BASE_URL` env var	API base URL (https://api.bigdata.com if neither is set)
`volume_query_mode`	`str`	`"three_pass"`	`"three_pass"` or `"iterative"`
`apply_volume_splits`	`bool`	`True`	Use volume time series for period splitting
`min_period_days`	`int`	`30`	Minimum days per sub-period
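
How `apply_volume_splits` and `min_period_days` might interact can be sketched with a day-by-day walk over a volume time series. Everything here is an assumption for illustration: `split_period` is a hypothetical function, the per-day volume dictionary is an invented input format, and the cap of 1000 mirrors the `max_chunks_per_basket` default.

```python
from datetime import date, timedelta


def split_period(
    start: date,
    end: date,
    daily_volume: dict[date, int],
    max_chunks: int = 1000,
    min_period_days: int = 30,
) -> list[tuple[date, date]]:
    """Close a sub-period once it spans at least `min_period_days` days
    and the next day would push its volume past `max_chunks`."""
    periods: list[tuple[date, date]] = []
    period_start = start
    running = 0
    day = start
    while day <= end:
        volume = daily_volume.get(day, 0)
        days_so_far = (day - period_start).days  # days already in this sub-period
        if days_so_far >= min_period_days and running + volume > max_chunks:
            periods.append((period_start, day - timedelta(days=1)))
            period_start, running = day, 0
        running += volume
        day += timedelta(days=1)
    periods.append((period_start, end))  # the final sub-period may be shorter
    return periods


# Made-up series: 20 chunks on each of 120 consecutive days.
start, end = date(2023, 1, 1), date(2023, 4, 30)
flat = {start + timedelta(days=i): 20 for i in range(120)}
for period_start, period_end in split_period(start, end, flat):
    print(period_start, period_end)
```

In this sketch the minimum length takes priority over the cap, so a very busy 30-day window can still exceed `max_chunks`; whether the package makes the same trade-off is not stated here.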

`execute_search()`

Parameter	Type	Default	Description
`search_plan`	`Dict`	required	Plan from `plan_search()`
`chunk_percentage`	`float`	required	Fraction of the estimated chunks to retrieve, between 0.0 and 1.0
`requests_per_minute`	`int`	`100`	Rate limit
`api_key`	`str`	`BIGDATA_API_KEY` env var	Bigdata API key
`max_workers`	`int`	`40`	Parallel workers
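
The roles of `requests_per_minute` and `max_workers` can be pictured as two independent limits: one spaces out request starts, the other caps how many requests are in flight. The sketch below uses `asyncio` for brevity; the package's own concurrency model is not described in this README, and `MinuteRateLimiter`, `run_all`, and `fake_fetch` are hypothetical names.

```python
import asyncio
import time


class MinuteRateLimiter:
    """Spaces request starts evenly so that no more than
    `requests_per_minute` begin within any 60 seconds."""

    def __init__(self, requests_per_minute: int) -> None:
        self._interval = 60.0 / requests_per_minute
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def wait(self) -> None:
        async with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)  # reserve the next free start time
            self._next_slot = slot + self._interval
        await asyncio.sleep(max(0.0, slot - now))


async def run_all(baskets, fetch, requests_per_minute=100, max_workers=40):
    limiter = MinuteRateLimiter(requests_per_minute)
    semaphore = asyncio.Semaphore(max_workers)  # cap on in-flight requests

    async def worker(basket):
        async with semaphore:
            await limiter.wait()
            return await fetch(basket)

    # gather() returns results in the same order as the input baskets.
    return await asyncio.gather(*(worker(b) for b in baskets))


async def fake_fetch(basket):
    """Stand-in for a real search request."""
    await asyncio.sleep(0.01)
    return {"basket": basket, "documents": []}


results = asyncio.run(
    run_all([["B8EF97"], ["BB07E4", "3461CF"]], fake_fetch, requests_per_minute=6000)
)
print(len(results))  # 2
```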

Helper Functions

deduplicate_documents(documents) -- Merges duplicate documents by id
load_universe_from_csv(csv_path) -- Loads entity IDs from CSV
convert_to_dataframe(raw_results) -- Converts documents to DataFrame (one row per chunk)
save_plan(plan, path) / load_plan(path) -- Persist plans as JSON
portfolio_backtesting_pipeline(...) -- Long-short portfolio backtesting
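
The same document can come back from more than one basket, which is why results are merged by `id` before conversion. The sketch below shows one way to do that and then flatten to one row per chunk. It is not the package's implementation: `merge_by_id` is a hypothetical name, and the document shape (an `id` plus a `chunks` list with `cnum` and `text` fields) is assumed for illustration.

```python
def merge_by_id(documents: list[dict]) -> list[dict]:
    """Keep one entry per document id, combining chunks and skipping exact repeats."""
    merged: dict[str, dict] = {}
    for doc in documents:
        existing = merged.get(doc["id"])
        if existing is None:
            # Copy the chunk list so the caller's input is left untouched.
            merged[doc["id"]] = {**doc, "chunks": list(doc.get("chunks", []))}
            continue
        for chunk in doc.get("chunks", []):
            if chunk not in existing["chunks"]:
                existing["chunks"].append(chunk)
    return list(merged.values())


docs = [
    {"id": "doc-1", "chunks": [{"cnum": 0, "text": "Revenue rose."}]},
    {"id": "doc-2", "chunks": [{"cnum": 3, "text": "Merger announced."}]},
    {"id": "doc-1", "chunks": [{"cnum": 0, "text": "Revenue rose."},
                               {"cnum": 4, "text": "Profit fell."}]},
]
merged = merge_by_id(docs)
print([(d["id"], len(d["chunks"])) for d in merged])  # [('doc-1', 2), ('doc-2', 1)]

# One row per chunk, the layout the README describes for the DataFrame output.
rows = [{"document_id": d["id"], **chunk} for d in merged for chunk in d["chunks"]]
print(len(rows))  # 3
```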

Testing

# Run all tests
uv run pytest

# With coverage
uv run pytest --cov=bigdata_smart_batching --cov-report=term-missing

# Specific test file
uv run pytest tests/test_validation.py -v

Project Structure

bigdata-smart-batching/
├── pyproject.toml
├── README.md
├── .python-version
├── src/
│   └── bigdata_smart_batching/
│       ├── __init__.py
│       ├── smart_batching.py
│       ├── smart_batching_config.py
│       ├── search_function.py
│       ├── output_converter.py
│       └── portfolio_backtesting.py
└── tests/
    ├── __init__.py
    ├── test_config.py
    ├── test_output_converter.py
    ├── test_validation.py
    └── test_rate_limiter.py

Configuration

Environment Variables

BIGDATA_API_KEY: Required -- Your Bigdata API key
BIGDATA_API_BASE_URL: Optional -- API base URL (default: https://api.bigdata.com)

Default Settings

requests_per_minute: 100
max_workers: 40
max_chunks_per_basket: 1000
volume_query_mode: "three_pass"

License

This project is part of the Bigdata.com and WorldQuant Challenge.

Disclaimer: This software is provided "as is" without warranty of any kind, express or implied. The authors and contributors assume no responsibility for the accuracy, completeness, or usefulness of any information, results, or processes provided. This software is for educational and research purposes only and is not intended to be used as financial advice.

Project details

Release history

1.1.0 -- Apr 20, 2026
1.0.2 -- Apr 17, 2026
1.0.1 -- Apr 17, 2026
0.2.0 -- Apr 14, 2026 (this version)
0.1.0 -- Apr 14, 2026

Download files

Source Distribution

bigdata_smart_batching-0.2.0.tar.gz (114.5 kB)

Uploaded Apr 14, 2026 Source

Built Distribution

bigdata_smart_batching-0.2.0-py3-none-any.whl (31.8 kB)

Uploaded Apr 14, 2026 Python 3

File details

Details for the file bigdata_smart_batching-0.2.0.tar.gz.

File metadata

Download URL: bigdata_smart_batching-0.2.0.tar.gz
Upload date: Apr 14, 2026
Size: 114.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.15

File hashes

Hashes for bigdata_smart_batching-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`6ed1bf8bb730fcdb90f637f56410eb046644c1acc2fd4fa87e9f04a49807c86c`
MD5	`6cd231183a45c3731e678b0ed822dd2a`
BLAKE2b-256	`089ffd4d5977f3ac7d9a369bf6e152e59439dc3ede67f4afec71bca7277140e7`

File details

Details for the file bigdata_smart_batching-0.2.0-py3-none-any.whl.

File metadata

Download URL: bigdata_smart_batching-0.2.0-py3-none-any.whl
Upload date: Apr 14, 2026
Size: 31.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.15

File hashes

Hashes for bigdata_smart_batching-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`95718f998cee39ee44af50e6a38053c3414b813bae6e4473ed3bd1e0156739cb`
MD5	`d181d54e434a83190742be1bc8d1b1a7`
BLAKE2b-256	`66a65e767cfa29a3c0931a52e1ebdda2a9461063615790df596273d5b6d1a90c`
