Smart Batching Search
A high-performance semantic search system that reduces API queries by 67-99% (varies by topic specificity) through intelligent company grouping and parallel execution.
This module provides a two-step system for efficient semantic search:
- Planning: organize the search into smart baskets and report the total expected chunk count
- Execution: run the search with proportional sampling to preserve the volume distribution
Key Benefits
- 67-99% Query Reduction: Search 4,732 companies with only 17-3,699 queries (varies by topic)
- Parallel Execution: Rate-limited concurrent requests with semaphore control
- Proportional Sampling: Retrieve percentage of results while preserving distribution
- Production Ready: Comprehensive error handling, retries, and logging
- Scalable: Efficiently handles universes with 10,000+ companies
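The headline reduction follows directly from grouping: querying each of N companies individually costs N requests, while querying baskets costs one request per basket. A back-of-the-envelope sketch (the basket size of 280 below is a hypothetical value for illustration, not a package constant):

```python
import math

def batched_query_count(num_companies: int, basket_size: int) -> int:
    """Number of queries needed when companies are grouped into fixed-size baskets."""
    return math.ceil(num_companies / basket_size)

naive = 4732                              # one query per company
batched = batched_query_count(4732, 280)  # hypothetical basket size
reduction = 1 - batched / naive           # fraction of queries saved
print(f"{batched} queries, {reduction:.1%} reduction")
```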
Installation
Using uv (Recommended)
```bash
# Clone or navigate to the project
cd bigdata-smart-batching

# Install in development mode
uv sync

# Or with dev dependencies
uv sync --all-extras
```
Environment Setup
Set up environment variables:
```bash
export BIGDATA_API_KEY="your_api_key_here"
export BIGDATA_API_BASE_URL="https://api.bigdata.com"  # Optional, defaults to this
```
Or create a .env file:
```
BIGDATA_API_KEY=your_api_key_here
BIGDATA_API_BASE_URL=https://api.bigdata.com
```
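For illustration, the environment lookup with its documented default can be done with a small helper like this (`resolve_config` is a hypothetical name, not part of the package API):

```python
import os

def resolve_config(env=None):
    """Read API settings from the environment, falling back to the default base URL."""
    env = os.environ if env is None else env
    api_key = env.get("BIGDATA_API_KEY")
    if not api_key:
        raise RuntimeError("BIGDATA_API_KEY is not set")
    return api_key, env.get("BIGDATA_API_BASE_URL", "https://api.bigdata.com")
```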
Quick Start
```python
from bigdata_smart_batching import (
    plan_search,
    execute_search,
    deduplicate_documents,
    convert_to_dataframe,
)

# Step 1: Plan the search
plan = plan_search(
    text="earnings revenue profit",
    universe_csv_path="id_name_mapping_us_top_3000.csv",
    start_date="2023-01-01",
    end_date="2023-12-31",
    api_key="your_api_key",  # or set BIGDATA_API_KEY env var
)
print(f"Total expected chunks: {plan['total_expected_chunks']:,}")

# Step 2: Execute search with 10% of total chunks (preserves distribution)
results_raw = execute_search(
    search_plan=plan,
    chunk_percentage=0.1,
    requests_per_minute=100,
)

# Step 3: Deduplicate and convert to DataFrame
results = deduplicate_documents(results_raw)
print(f"Retrieved {len(results)} documents (deduplicated)")
df = convert_to_dataframe(results)  # one row per chunk
```
Save and Load Plans
```python
from bigdata_smart_batching import plan_search, execute_search, save_plan, load_plan

# Create and save a plan
plan = plan_search(
    text="merger acquisition",
    universe_csv_path="id_name_mapping_us_top_3000.csv",
    start_date="2023-01-01",
    end_date="2023-12-31",
)
save_plan(plan, "my_search_plan.json")

# Later: reload and run with different sampling
plan = load_plan("my_search_plan.json")
raw_10 = execute_search(plan, chunk_percentage=0.1)
raw_50 = execute_search(plan, chunk_percentage=0.5)
```
How It Works
Architecture Overview
```
Step 1: PLANNING
Universe CSV --> Co-mention API Query --> Basket Creation --> Search Plan

Step 2: EXECUTION
Proportional Sampling --> Parallel Search (Rate Limited) --> Collect & Aggregate
```
Planning (plan_search())
- Loads the universe of companies from CSV
- Queries the co-mention endpoint to get chunk volumes per company
- Splits date ranges by volume when a company exceeds the chunk limit
- Creates optimized baskets grouped by volume
- Returns a plan with total expected chunks and basket configurations
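The basket-creation step can be pictured as greedy packing over per-company chunk volumes. This is an illustrative sketch of the idea, not the package's actual implementation:

```python
def make_baskets(volumes: dict[str, int], max_chunks: int = 1000) -> list[list[str]]:
    """Pack companies into baskets so that no basket's expected chunk
    count exceeds max_chunks (greedy, largest volumes first)."""
    baskets: list[list[str]] = []
    current: list[str] = []
    total = 0
    for company, vol in sorted(volumes.items(), key=lambda kv: kv[1], reverse=True):
        if current and total + vol > max_chunks:
            baskets.append(current)   # close the full basket
            current, total = [], 0
        current.append(company)
        total += vol
    if current:
        baskets.append(current)
    return baskets
```

In this sketch a single company whose volume alone exceeds `max_chunks` still gets its own basket; in the real planner that case is what triggers the date-range splitting described above.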
Execution (execute_search())
- Calculates the proportional chunk count for each basket
- Ensures a minimum of one chunk per basket (when expected chunks > 0)
- Executes searches in parallel with rate limiting and semaphore-based concurrency control
- Collects and returns document results
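The proportional-sampling rule above can be sketched in a few lines (illustrative only):

```python
import math

def allocate_chunks(expected: list[int], percentage: float) -> list[int]:
    """Per-basket chunk budget: proportional share, floored,
    with a minimum of one chunk for any non-empty basket."""
    return [max(1, math.floor(e * percentage)) if e > 0 else 0 for e in expected]
```

The floor of one keeps low-volume baskets represented in the sample rather than rounding them away entirely.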
API Reference
plan_search()
| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str` | required | Search query text |
| `universe_csv_path` | `str` | required | Path to CSV with entity IDs |
| `start_date` | `str` | required | Start date (YYYY-MM-DD) |
| `end_date` | `str` | required | End date (YYYY-MM-DD) |
| `api_key` | `str` | env var | API key |
| `api_base_url` | `str` | env var | API base URL |
| `volume_query_mode` | `str` | `"three_pass"` | `"three_pass"` or `"iterative"` |
| `apply_volume_splits` | `bool` | `True` | Use volume time series for period splitting |
| `min_period_days` | `int` | `30` | Minimum days per sub-period |
execute_search()
| Parameter | Type | Default | Description |
|---|---|---|---|
| `search_plan` | `Dict` | required | Plan from `plan_search()` |
| `chunk_percentage` | `float` | required | 0.0 to 1.0 sampling ratio |
| `requests_per_minute` | `int` | `100` | Rate limit |
| `api_key` | `str` | env var | API key |
| `max_workers` | `int` | `40` | Parallel workers |
Helper Functions
- `deduplicate_documents(documents)` -- Merges duplicate documents by `id`
- `load_universe_from_csv(csv_path)` -- Loads entity IDs from CSV
- `convert_to_dataframe(raw_results)` -- Converts documents to DataFrame (one row per chunk)
- `save_plan(plan, path)` / `load_plan(path)` -- Persist plans as JSON
- `portfolio_backtesting_pipeline(...)` -- Long-short portfolio backtesting
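As an example of what merging duplicates by id means conceptually, a sketch might look like this (the field names `id` and `chunks` are assumptions for illustration, not the package's actual schema):

```python
def merge_by_id(documents: list[dict]) -> list[dict]:
    """Collapse documents sharing an id, concatenating their chunk lists."""
    merged: dict[str, dict] = {}
    for doc in documents:
        seen = merged.get(doc["id"])
        if seen is None:
            # first occurrence: copy so we don't mutate the input
            merged[doc["id"]] = {**doc, "chunks": list(doc.get("chunks", []))}
        else:
            seen["chunks"].extend(doc.get("chunks", []))
    return list(merged.values())
```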
Testing
```bash
# Run all tests
uv run pytest

# With coverage
uv run pytest --cov=bigdata_smart_batching --cov-report=term-missing

# Specific test file
uv run pytest tests/test_validation.py -v
```
Project Structure
```
bigdata-smart-batching/
├── pyproject.toml
├── README.md
├── .python-version
├── src/
│   └── bigdata_smart_batching/
│       ├── __init__.py
│       ├── smart_batching.py
│       ├── smart_batching_config.py
│       ├── search_function.py
│       ├── output_converter.py
│       └── portfolio_backtesting.py
└── tests/
    ├── __init__.py
    ├── test_config.py
    ├── test_output_converter.py
    ├── test_validation.py
    └── test_rate_limiter.py
```
Configuration
Environment Variables
- `BIGDATA_API_KEY`: Required -- Your Bigdata API key
- `BIGDATA_API_BASE_URL`: Optional -- API base URL (default: `https://api.bigdata.com`)
Default Settings
- `requests_per_minute`: 100
- `max_workers`: 40
- `max_chunks_per_basket`: 1000
- `volume_query_mode`: `"three_pass"`
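The requests_per_minute limit can be enforced with a small thread-safe pacer like this sketch (not the package's actual rate limiter):

```python
import threading
import time

class MinuteRateLimiter:
    """Space out acquisitions so at most `rpm` happen per minute across threads."""
    def __init__(self, rpm: int = 100):
        self.interval = 60.0 / rpm  # seconds between permitted requests
        self.lock = threading.Lock()
        self.next_ok = 0.0
    def acquire(self) -> None:
        with self.lock:
            now = time.monotonic()
            wait = max(0.0, self.next_ok - now)
            self.next_ok = max(now, self.next_ok) + self.interval
        if wait > 0:
            time.sleep(wait)  # sleep outside the lock so other threads can queue
```

At the default of 100 requests per minute this paces one request every 0.6 s, while max_workers bounds how many requests are in flight at once.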
License
This project is part of the Bigdata.com and WorldQuant Challenge.
Disclaimer: This software is provided "as is" without warranty of any kind, express or implied. The authors and contributors assume no responsibility for the accuracy, completeness, or usefulness of any information, results, or processes provided. This software is for educational and research purposes only and is not intended to be used as financial advice.
File details
Details for the file bigdata_smart_batching-0.1.0.tar.gz.
File metadata
- Download URL: bigdata_smart_batching-0.1.0.tar.gz
- Upload date:
- Size: 173.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `5701c047df597192e7b6516b3cefdf2163ed00fff91d0a0ee8fb2ae9d2915ac8` |
| MD5 | `84340aa9e46398df4895aebb1869ede8` |
| BLAKE2b-256 | `c5eb66a113eecc88ffd4b88ff9dcf9bfcb4709a07955cb9c7a0f8f8e77e86dc6` |
File details
Details for the file bigdata_smart_batching-0.1.0-py3-none-any.whl.
File metadata
- Download URL: bigdata_smart_batching-0.1.0-py3-none-any.whl
- Upload date:
- Size: 30.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `24a7fda28ca0fdb623ee058e5131a2bad53909f22a9723ab9db7577f459fc383` |
| MD5 | `e6a81362379d94f38cdac1f5ef80bc86` |
| BLAKE2b-256 | `4f367e0356b3d7625a195e35705b74427204ebdb0b89060863d25aca5ca2c716` |