FactorDBMS
A comprehensive factor library management system for quantitative trading research.
Features
- Unified Operator Mapping: Integrates operators from different factor mining frameworks (gpfactor, masfactorMiner, and legacy mas1) into a common set of operations
- Expression Processing: Parse and calculate factor expressions across frameworks; normalization auto-detects the source framework by default and can be disabled when expressions already use the unified operator names
- Factor Evaluation: 30+ evaluation metrics including IC, monotonicity, returns, and distribution statistics
- Orthogonality Analysis: Comprehensive factor redundancy analysis with correlation, clustering, and selection algorithms
- Database Storage: ClickHouse-based storage for factor values and metadata
- Automated Factor Management: Factor registration, calculation, and lifecycle management
Installation
# Clone the repository
git clone https://github.com/ElenYoung/FactorDBMS.git
cd FactorDBMS
# Install dependencies
pip install -r requirements.txt
# Install the package
pip install -e .
Configuration
Create a .env file in the project root with your ClickHouse database credentials:
DB_HOST=localhost
DB_PORT=9000
DB_USER=default
DB_PASSWORD=your_password
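FactorDBMS reads this .env itself, but it can help to sanity-check the credentials before running anything. A minimal sketch, assuming the python-dotenv and clickhouse-driver packages are available (this check is illustrative and not part of FactorDBMS):
import os
from dotenv import load_dotenv          # python-dotenv
from clickhouse_driver import Client    # clickhouse-driver
load_dotenv()  # pull DB_HOST / DB_PORT / DB_USER / DB_PASSWORD from .env
client = Client(
    host=os.getenv('DB_HOST', 'localhost'),
    port=int(os.getenv('DB_PORT', '9000')),
    user=os.getenv('DB_USER', 'default'),
    password=os.getenv('DB_PASSWORD', ''),
)
print(client.execute('SELECT 1'))  # [(1,)] means the credentials work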
Quick Start
1. Expression Processing
from factordb import ExpressionNormalizer, ExpressionCalculator
# Normalize expressions from different frameworks (optional)
normalizer = ExpressionNormalizer()
# gpfactor (C++ style)
expr = normalizer.normalize_expression('sub(close)(low)', 'gpfactor')
# Result: 'sub(close, low)'
# mas1 (CamelCase style)
expr = normalizer.normalize_expression('Mul(Rank(close), EMA(volume, 5))', 'mas1')
# Result: 'mul(cs_rank(close), ts_ema(volume, 5))'
# Calculate factor values (normalization is enabled by default, pass normalize=False to skip)
calculator = ExpressionCalculator()
factor_values = calculator.calculate('ts_mean(close, 20)', market_data) # auto-detect framework
factor_values_no_norm = calculator.calculate('ts_mean(close, 20)', market_data, normalize=False)
2. Parse Mining Results
from factordb.parsers import parse_factor_file
# Auto-detect framework and parse (supports gpfactor, masfactorMiner; mas1 legacy mapping)
factors = parse_factor_file('path/to/mining_results.json')
for factor in factors:
    print(f"Expression: {factor.normalized_expression}")
    print(f"IC Mean: {factor.get_metric('ic_mean')}")
3. Standalone Factor Analysis (for out-of-DB or US-equity data)
If you just want to compute/analyze factor expressions on arbitrary data (CSV/Parquet/ClickHouse) without using the full FactorDB pipeline:
from factor_analysis import FactorAnalysisConfig, FactorAnalysis, FactorAnalyzer
# Load config (YAML) defining data source/columns
cfg = FactorAnalysisConfig.from_yaml('path/to/factor_analysis.yaml')
# Calculate factor values
fa = FactorAnalysis(cfg)
values = fa.calculate('ts_mean(close, 20)') # returns Series with MultiIndex (code, date)
# Quick single-factor analysis
report = FactorAnalyzer().analyze(values)
print(report)
Example factor_analysis.yaml:
data_source: clickhouse # csv | parquet | clickhouse
data_path: null # required when data_source is csv/parquet
date_column: date
code_column: code
clickhouse:
  database: us_market
  table: daily
  where: "date >= '2024-01-01'"
  order_by: "code, date"
  limit: 200000
Notes for factor_analysis ClickHouse mode:
- factor_analysis is standalone and does not route through stk_factors/etf_factors.
- Use either clickhouse.database + clickhouse.table, a fully qualified clickhouse.table such as us_market.daily, or a raw clickhouse.query.
- asset_type/factor_type are not used by factor_analysis and should not appear in its config.
4. Create Custom Factors
from factordb import ExpressionFactor, CustomFactor
# Method 1: Using expression
factor = ExpressionFactor(
    expression='ts_zscore(close, 20)',
    name='price_zscore',
    explanation='20-day price z-score'
)
# Method 2: Custom calculation
class MomentumFactor(CustomFactor):
    def __init__(self, window=20):
        super().__init__()
        self.window = window
        self._name = f"momentum_{window}d"

    def _compute(self, group_data):
        return group_data['close'].pct_change(self.window)
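For orientation, _compute receives the rows of a single asset, so the same momentum value can be reproduced with plain pandas outside the pipeline. A sketch assuming a long-format DataFrame with code and close columns (the column names here are assumptions for illustration):
import pandas as pd
df = pd.DataFrame({
    'code':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'close': [10.0, 11.0, 12.1, 20.0, 19.0, 19.5],
})
# Equivalent of MomentumFactor(window=1)._compute applied per asset
momentum = df.groupby('code')['close'].pct_change(1)
print(momentum)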
5. Factor Evaluation
from factordb.evaluators import FactorEvaluator
evaluator = FactorEvaluator()
metrics = evaluator.evaluate(factor_values, returns_data)
print(f"IC Mean: {metrics['ic_mean']:.4f}")
print(f"IC IR: {metrics['ic_ir']:.4f}")
print(f"Direction: {metrics['direction']}")
6. Factor Management (with Database)
from factordb import FactorManager, AssetType, FactorType
manager = FactorManager()
# Register a factor
factor_id = manager.register_from_expression(
    expression='ts_zscore(close, 20)',
    name='price_zscore',
    asset_type=AssetType.STOCK,
    factor_type=FactorType.DAILY
)
# Register factors from mining results
factor_ids = manager.register_mined_factors(
    file_path='mining_results.json',
    max_factors=100,
    min_ic=0.03
)
# Calculate and save factor values
manager.calculate_and_save(factor, market_data)
# Evaluate and save results
metrics = manager.evaluate_factor(factor_id, returns_data, save_results=True)
# Search for good factors
good_factors = manager.search_factors(min_ic=0.03, min_icir=0.5)
7. Orthogonality Analysis
Analyze factor redundancy and select non-redundant factors using a three-phase framework:
from factordb.orthogonality import OrthogonalityAnalyzer
# Initialize analyzer
analyzer = OrthogonalityAnalyzer(
    correlation_threshold=0.7,    # High correlation pair threshold
    vif_threshold=5.0,            # VIF collinearity threshold
    marginal_ic_threshold=0.015   # Minimum marginal IC for selection
)
# Prepare factor matrix (wide format: rows=observations, cols=factors)
factor_matrix = analyzer.prepare_factor_matrix(factor_data)
# Run full analysis
report = analyzer.analyze(
    factor_matrix=factor_matrix,
    returns=returns_data,
    ic_values=ic_dict  # {factor_id: ic_value}
)
# Phase 1: Global Correlation Results
print(f"Total factors: {report.effective_n.total_factors}")
print(f"Effective N (90%): {report.effective_n.effective_n_90}")
print(f"Mean correlation: {report.correlation_stats.mean_correlation:.3f}")
print(f"High correlation pairs: {len(report.correlation_stats.high_correlation_pairs)}")
# Phase 2: Clustering Results
print(f"Number of clusters: {report.clustering_result.n_clusters}")
print(f"Central factors: {report.mst_result.central_factors}")
print(f"Peripheral factors (unique alpha): {report.mst_result.peripheral_factors}")
# Phase 3: Selection Results
print(f"Selected factors: {len(report.final_selected_factors)}")
print(f"Removed factors: {len(report.removed_factors)}")
# Get final non-redundant factor set
selected_factors = report.final_selected_factors
Quick Analysis (No Returns Required)
For initial exploration without return data:
# Run Phase 1 & 2 only
results = analyzer.quick_analysis(factor_matrix, ic_values)
print(f"Redundancy ratio: {results['summary']['redundancy_ratio_90']:.1%}")
print(f"Suggested clusters: {analyzer.suggest_optimal_cluster_count(results['phase1']['effective_n'])}")
Interactive CLI (main.py)
main.py provides an interactive menu for managing the full factor pipeline. It reads parameters from a YAML config file and prompts for additional inputs at runtime.
Quick Start
# Run with default config (pipeline_config.yaml)
python main.py
# Run with custom config
python main.py -c path/to/config.yaml
Interactive Menu
==================================================
FactorDB Pipeline (STOCK / DAILY)
==================================================
1. Upload factors from file
2. Update factor values (incremental)
3. Evaluate factors
4. Show database status
5. Switch asset type (STOCK <-> ETF)
0. Exit
==================================================
Select option:
1. Upload Factors
Parses factors from a mining results file, registers new factors, calculates values, and evaluates them. Supports incremental execution -- if a previous run was interrupted, it detects which factors are already registered/calculated/evaluated and only runs the remaining steps.
Prompts:
- File path: Path to mining results JSON file (default from config)
- Skip evaluation: Whether to skip the evaluation step after upload
2. Update Factor Values (Incremental)
Incrementally updates all registered factor values to the latest date. Uses the A3_factor_upgrade table to determine what data is already present, then only calculates and saves the new portion.
Prompts:
- Re-evaluate: Whether to re-evaluate factors that were updated
3. Evaluate Factors
Calculates evaluation metrics (IC, ICIR, monotonicity, returns, etc.) for factors.
Prompts:
- Re-evaluate ALL: By default only evaluates factors missing metrics. Choose yes to re-evaluate all factors.
4. Show Database Status
Displays a summary of the factor database: total registered factors, how many have values, how many have evaluation metrics, and the latest/earliest update dates.
Prompts:
- Detailed list: Whether to show the full factor list
5. Switch Asset Type
Switch between STOCK and ETF factor databases. The current selection is shown in the menu header. Each asset type has its own separate database:
- STOCK: stk_factors database, factor IDs like F_stk_000001
- ETF: etf_factors database, factor IDs like F_etf_000001
Note: HIGH_FREQ (intraday) factors are not yet supported in the interactive CLI due to different evaluation logic.
Configuration File (pipeline_config.yaml)
# Factor classification
asset_type: "STOCK" # STOCK | ETF
factor_type: "DAILY" # DAILY | HIGH_FREQ (HIGH_FREQ not yet supported)
# Parallelism
n_jobs: 16
# Market data source
market_data:
  database: "stocks"
  price_table: "daily_adj_tushare"
  basic_table: "daily_basic_tushare"
  start_date: "2000-01-01"
  end_date: null # null = today
# Upload command defaults
upload:
  file_path: "mined_factors_demo/stocks/new_128.json"
  max_factors: 140
  min_score: 30.0
# Update command defaults
update:
  evaluate_after_update: false
# Evaluate command defaults
evaluate:
  return_column: "pct_chg"
  cap_column: "circ_mv"
  n_jobs: 4 # Evaluation threads (lower to avoid memory issues)
Note for ETF factors: If processing ETF factors with different market data tables, either:
- Modify the market_data section in pipeline_config.yaml before switching to ETF
- Or use a separate config file: python main.py -c etf_config.yaml
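One way to produce that separate ETF config is to copy the stock config and override just the classification and market-data keys. A sketch using PyYAML; the ETF table names below are placeholders, not shipped defaults:
import yaml

with open('pipeline_config.yaml') as f:
    cfg = yaml.safe_load(f)
cfg['asset_type'] = 'ETF'
cfg['market_data']['price_table'] = 'etf_daily_adj'    # placeholder table name
cfg['market_data']['basic_table'] = 'etf_daily_basic'  # placeholder table name
with open('etf_config.yaml', 'w') as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
# Then: python main.py -c etf_config.yaml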
Web Dashboard (dashboard.py)
A Streamlit-based web interface for viewing factor information.
Quick Start
# Activate virtual environment and run (Windows)
.venv\Scripts\python -m streamlit run dashboard.py
# Or using uv (if project has pyproject.toml)
uv run streamlit run dashboard.py
# Or activate venv first, then run
.venv\Scripts\activate
streamlit run dashboard.py
The dashboard will open in your browser at http://localhost:8501.
Features
- Summary Statistics: Total factors, factors with values, factors with evaluation
- Factor List: Sortable table with key metrics (IC, RankIC, ICIR, etc.)
- Filtering: Filter by minimum IC, ICIR, or monotonicity
- Factor Detail: Detailed view of selected factor with expression and all metrics
- Asset Type Switch: Toggle between STOCK and ETF databases
Displayed Metrics
| Metric | Description |
|---|---|
| IC Mean | Pearson correlation with future returns |
| Rank IC | Spearman correlation (more robust) |
| IC IR | Information ratio (IC mean / IC std) |
| Rank IC IR | Rank IC information ratio |
| Mono (10g) | Monotonicity of 10-group returns |
| Top-Bottom Return | Long-short portfolio return |
| Top-Bottom Sharpe | Long-short Sharpe ratio |
All metrics are shown for Full period, 5-year, and 1-year windows.
Project Structure
FactorDB/
├── main.py # Interactive CLI entry point
├── dashboard.py # Streamlit web dashboard
├── pipeline_config.yaml # Pipeline configuration
├── src/factordb/
│ ├── core/
│ │ ├── config.py # Configuration management
│ │ ├── expression.py # Expression processing
│ │ ├── factor.py # Factor base classes
│ │ └── factor_manager.py # Factor lifecycle management
│ ├── evaluators/
│ │ ├── factor_evaluator.py # Main evaluator
│ │ ├── ic_calculator.py # IC metrics
│ │ ├── return_calculator.py # Return metrics
│ │ └── monotonicity_calculator.py
│ ├── orthogonality/ # Factor orthogonality analysis
│ │ ├── orthogonality_analyzer.py # Main orchestrator
│ │ ├── correlation_analyzer.py # Phase 1: Correlation analysis
│ │ ├── clustering_analyzer.py # Phase 2: Clustering & MST
│ │ └── selection_analyzer.py # Phase 3: VIF & Marginal IC
│ ├── operators/
│ │ └── unified_operators.py # Operator registry (45+ operators)
│ ├── parsers/
│ │ ├── gpfactor_parser.py # gpfactor results parser
│   │   ├── masfactor_miner_parser.py # masfactorMiner results parser (mas2 legacy)
│ │ └── parser_factory.py # Auto-detection & parsing
│ └── storage/
│ ├── clickhouse_storage.py # Database operations
│ └── schema.py # Table schemas
├── src/factor_analysis/ # Standalone expression compute & analysis
│ ├── config.py
│ ├── calculator.py
│ └── analyzer.py
├── examples/
│ └── stock_factor_pipeline.py # Example pipeline script
├── mined_factors_demo/ # Sample mining results
├── requirements.txt
└── README.md
Supported Operators
Unary Operators
abs, sign, log, neg, inv, sqrt, square, sigmoid, tanh
Binary Operators
add, sub, mul, div, max, min, power
Time-Series Operators
ts_mean, ts_std, ts_var, ts_max, ts_min, ts_sum, ts_median, ts_delta, ts_delay, ts_return, ts_slope, ts_corr, ts_cov, ts_ema, ts_wma, ts_skew, ts_kurt, ts_rank, ts_zscore, ts_argmax, ts_argmin, ts_prod, ts_quantile
Cross-Sectional Operators
cs_rank, cs_zscore, cs_demean, cs_scale
Conditional Operators
if_else, greater, less
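These operator families compose freely inside a single expression string. For example, a short-over-long moving-average ratio ranked cross-sectionally, evaluated with the ExpressionCalculator from the Quick Start (market_data is your own price DataFrame, as in the earlier examples):
from factordb import ExpressionCalculator

expr = 'cs_rank(div(ts_mean(close, 5), ts_mean(close, 20)))'
factor_values = ExpressionCalculator().calculate(expr, market_data)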
Evaluation Metrics
| Metric | Description |
|---|---|
| IC Mean | Pearson correlation with future returns |
| Rank IC Mean | Spearman correlation (more robust) |
| IC IR | Information ratio (IC mean / IC std) |
| IC t-stat | Statistical significance |
| Direction | Trading direction (1=long high, -1=short high) |
| isMono 5/10/15 | Monotonicity of group returns |
| Top-Bottom Return | Long-short portfolio return |
| Top-Bottom SR | Long-short Sharpe ratio |
| Factor Std/Skew/Kurt | Distribution statistics |
All metrics are calculated for full period, recent 5 years, and recent 1 year.
Orthogonality Analysis
The orthogonality module provides a three-phase framework to analyze factor redundancy and select non-redundant factors:
Phase 1: Global Correlation Check
| Method | Description |
|---|---|
| Correlation Matrix | Spearman rank correlation between all factor pairs |
| Effective N | PCA-based eigenvalue analysis to measure true dimensionality |
Key Metrics:
- effective_n_90: Number of factors explaining 90% of variance
- redundancy_ratio: 1 - (effective_n / total_factors)
- high_correlation_pairs: Factor pairs with |corr| > threshold
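For intuition, the metrics above can be reproduced on any factor correlation matrix with a few lines of numpy: take its eigenvalues and count how many are needed to explain 90% of total variance. This is a sketch of the general PCA idea, not the module's exact implementation:
import numpy as np

def effective_n_90(corr: np.ndarray) -> int:
    """Smallest number of eigenvalues explaining >= 90% of total variance."""
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    explained = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(explained, 0.90) + 1)

corr = factor_matrix.corr(method='spearman').to_numpy()
n_eff = effective_n_90(corr)
redundancy_ratio = 1 - n_eff / corr.shape[0]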
Phase 2: Clustering & Structure
| Method | Description |
|---|---|
| Hierarchical Clustering | Ward/Average linkage to group similar factors |
| Minimum Spanning Tree | Graph-based structure to find central/peripheral factors |
Key Outputs:
- clusters: Factor groupings with intra-cluster correlation
- central_factors: Proxy factors representing each style
- peripheral_factors: Unique alpha factors (most orthogonal)
- representative_factors: Best factor per cluster (by IC)
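The MST outputs can be pictured with a generic graph sketch: build a complete graph weighted by correlation distance, take its minimum spanning tree, and read high-degree nodes as central and leaves as peripheral. Illustrative only, assuming networkx; the analyzer's own construction and thresholds may differ:
import networkx as nx

corr = factor_matrix.corr(method='spearman')
G = nx.Graph()
for i in corr.columns:
    for j in corr.columns:
        if i < j:
            G.add_edge(i, j, weight=1 - abs(corr.loc[i, j]))  # distance = 1 - |corr|

mst = nx.minimum_spanning_tree(G)
degrees = dict(mst.degree())
central = [n for n, d in degrees.items() if d >= 3]      # hub-like "style" proxies
peripheral = [n for n, d in degrees.items() if d == 1]   # leaves: most orthogonal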
Phase 3: Selection & Pruning
| Method | Description |
|---|---|
| VIF (Variance Inflation Factor) | Detect multicollinearity within clusters |
| Marginal IC Analysis | Stepwise selection based on incremental IC contribution |
Selection Logic:
- Start with highest IC factor
- For each remaining factor, orthogonalize against selected set
- If residual IC > threshold (default 0.015), add to selection
- Repeat until no significant marginal contribution
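A compact sketch of that stepwise loop, using ordinary least squares to orthogonalize each candidate against the already selected factors and a plain correlation as the residual IC (illustrative; the selection_analyzer may differ in details):
import numpy as np

def stepwise_select(X: np.ndarray, ic: np.ndarray, fwd_ret: np.ndarray,
                    threshold: float = 0.015) -> list[int]:
    """X: (n_obs, n_factors), ic: mined IC per factor, fwd_ret: (n_obs,) returns."""
    order = list(np.argsort(-np.abs(ic)))       # start with the highest-|IC| factor
    selected = [order.pop(0)]
    for j in order:
        basis = X[:, selected]
        beta, *_ = np.linalg.lstsq(basis, X[:, j], rcond=None)
        resid = X[:, j] - basis @ beta          # orthogonalize against selected set
        marginal_ic = np.corrcoef(resid, fwd_ret)[0, 1]
        if abs(marginal_ic) > threshold:        # keep only significant marginal IC
            selected.append(j)
    return selected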
Database Schema
A1_factor_basic
Factor registration and metadata.
| Column | Type | Description |
|---|---|---|
| factor_id | String | Unique ID (e.g., F_stk_000001) |
| name | Nullable(String) | Factor name |
| expression | Nullable(String) | Factor expression |
| explanation | Nullable(String) | Factor explanation |
| register_datetime | DateTime | Registration time |
A2_factor_evaluate
Factor evaluation metrics.
| Column | Type | Description |
|---|---|---|
| factor_id | String | Factor ID |
| upgrade_datetime | DateTime | Evaluation time |
| ic_mean, rank_ic_mean, ... | Float32 | Evaluation metrics |
A3_factor_upgrade
Factor value update tracking.
| Column | Type | Description |
|---|---|---|
| factor_id | String | Factor ID |
| latest_date | Date | Latest factor value date |
| upgrade_datetime | DateTime | Last update time |
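The incremental update in the CLI (option 2) relies on this table: it looks up each factor's latest_date and only computes values after it. A rough sketch of that lookup with clickhouse-driver, reusing the client from the Configuration section and assuming the table lives in the stk_factors database (this is not the storage layer's actual code):
# latest stored date per factor, used to decide the incremental window
rows = client.execute(
    "SELECT factor_id, max(latest_date) FROM stk_factors.A3_factor_upgrade GROUP BY factor_id"
)
latest_by_factor = dict(rows)  # {factor_id: date of last stored value}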
License
MIT License