FactorDBMS

A comprehensive factor library management system for quantitative trading research.

Features

  • Unified Operator Mapping: Integrates operators from different factor mining frameworks (gpfactor, masfactorMiner, and the legacy mas1 format) into a common set of operations
  • Expression Processing: Parses and calculates factor expressions across frameworks; normalization auto-detects the source framework by default and can be disabled when expressions already use unified names
  • Factor Evaluation: 30+ evaluation metrics including IC, monotonicity, returns, and distribution statistics
  • Orthogonality Analysis: Comprehensive factor redundancy analysis with correlation, clustering, and selection algorithms
  • Database Storage: ClickHouse-based storage for factor values and metadata
  • Automated Factor Management: Factor registration, calculation, and lifecycle management

Installation

# Clone the repository
git clone https://github.com/ElenYoung/FactorDBMS.git
cd FactorDBMS

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .

Configuration

Create a .env file in the project root with your ClickHouse database credentials:

DB_HOST=localhost
DB_PORT=9000
DB_USER=default
DB_PASSWORD=your_password
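
FactorDBMS reads these credentials at runtime. To sanity-check them yourself before running the pipeline, here is a minimal sketch using python-dotenv and clickhouse-driver (both third-party packages, not part of FactorDBMS):

import os
from dotenv import load_dotenv          # pip install python-dotenv
from clickhouse_driver import Client    # pip install clickhouse-driver

load_dotenv()  # reads .env from the current working directory

client = Client(
    host=os.getenv('DB_HOST', 'localhost'),
    port=int(os.getenv('DB_PORT', '9000')),
    user=os.getenv('DB_USER', 'default'),
    password=os.getenv('DB_PASSWORD', ''),
)
print(client.execute('SELECT version()'))  # should print the server version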

Quick Start

1. Expression Processing

from factordb import ExpressionNormalizer, ExpressionCalculator

# Normalize expressions from different frameworks (optional)
normalizer = ExpressionNormalizer()

# gpfactor (C++ style)
expr = normalizer.normalize_expression('sub(close)(low)', 'gpfactor')
# Result: 'sub(close, low)'

# mas1 (CamelCase style)
expr = normalizer.normalize_expression('Mul(Rank(close), EMA(volume, 5))', 'mas1')
# Result: 'mul(cs_rank(close), ts_ema(volume, 5))'

# Calculate factor values (normalization is enabled by default, pass normalize=False to skip)
calculator = ExpressionCalculator()
factor_values = calculator.calculate('ts_mean(close, 20)', market_data)          # auto-detect framework
factor_values_no_norm = calculator.calculate('ts_mean(close, 20)', market_data, normalize=False)

2. Parse Mining Results

from factordb.parsers import parse_factor_file

# Auto-detect framework and parse (supports gpfactor, masfactorMiner, and legacy mas1 mappings)
factors = parse_factor_file('path/to/mining_results.json')

for factor in factors:
    print(f"Expression: {factor.normalized_expression}")
    print(f"IC Mean: {factor.get_metric('ic_mean')}")

3. Standalone Factor Analysis (for out-of-DB or US-equity data)

If you just want to compute/analyze factor expressions on arbitrary data (CSV/Parquet/ClickHouse) without using the full FactorDB pipeline:

from factor_analysis import FactorAnalysisConfig, FactorAnalysis, FactorAnalyzer

# Load config (YAML) defining data source/columns
cfg = FactorAnalysisConfig.from_yaml('path/to/factor_analysis.yaml')

# Calculate factor values
fa = FactorAnalysis(cfg)
values = fa.calculate('ts_mean(close, 20)')  # returns Series with MultiIndex (code, date)

# Quick single-factor analysis
report = FactorAnalyzer().analyze(values)
print(report)

Example factor_analysis.yaml:

data_source: clickhouse       # csv | parquet | clickhouse
data_path: null               # required when data_source is csv/parquet
date_column: date
code_column: code
clickhouse:
  database: us_market
  table: daily
  where: "date >= '2024-01-01'"
  order_by: "code, date"
  limit: 200000

Notes for factor_analysis ClickHouse mode:

  • factor_analysis is standalone and does not route through stk_factors / etf_factors.
  • Use either clickhouse.database + clickhouse.table, a fully qualified clickhouse.table such as us_market.daily, or a raw clickhouse.query.
  • asset_type / factor_type are not used by factor_analysis and should not appear in its config.
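
For reference, the raw-query variant mentioned above might look like this (the query text is illustrative):

data_source: clickhouse
date_column: date
code_column: code
clickhouse:
  query: "SELECT code, date, close, volume FROM us_market.daily WHERE date >= '2024-01-01'"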

4. Create Custom Factors

from factordb import ExpressionFactor, CustomFactor

# Method 1: Using expression
factor = ExpressionFactor(
    expression='ts_zscore(close, 20)',
    name='price_zscore',
    explanation='20-day price z-score'
)

# Method 2: Custom calculation
class MomentumFactor(CustomFactor):
    def __init__(self, window=20):
        super().__init__()
        self.window = window
        self._name = f"momentum_{window}d"

    def _compute(self, group_data):
        return group_data['close'].pct_change(self.window)
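
For intuition, here is the same momentum logic in plain pandas, outside any FactorDBMS class (df is a hypothetical frame with a (code, date) MultiIndex and a close column, matching the data shape used elsewhere in this README):

import pandas as pd

# Per-instrument 20-day percentage change, mirroring MomentumFactor._compute
momentum = df.groupby(level='code')['close'].pct_change(20).rename('momentum_20d')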

5. Factor Evaluation

from factordb.evaluators import FactorEvaluator

evaluator = FactorEvaluator()
metrics = evaluator.evaluate(factor_values, returns_data)

print(f"IC Mean: {metrics['ic_mean']:.4f}")
print(f"IC IR: {metrics['ic_ir']:.4f}")
print(f"Direction: {metrics['direction']}")

6. Factor Management (with Database)

from factordb import FactorManager, AssetType, FactorType

manager = FactorManager()

# Register a factor
factor_id = manager.register_from_expression(
    expression='ts_zscore(close, 20)',
    name='price_zscore',
    asset_type=AssetType.STOCK,
    factor_type=FactorType.DAILY
)

# Register factors from mining results
factor_ids = manager.register_mined_factors(
    file_path='mining_results.json',
    max_factors=100,
    min_ic=0.03
)

# Calculate and save factor values
manager.calculate_and_save(factor, market_data)

# Evaluate and save results
metrics = manager.evaluate_factor(factor_id, returns_data, save_results=True)

# Search for good factors
good_factors = manager.search_factors(min_ic=0.03, min_icir=0.5)

7. Orthogonality Analysis

Analyze factor redundancy and select non-redundant factors using a three-phase framework:

from factordb.orthogonality import OrthogonalityAnalyzer

# Initialize analyzer
analyzer = OrthogonalityAnalyzer(
    correlation_threshold=0.7,   # High correlation pair threshold
    vif_threshold=5.0,           # VIF collinearity threshold
    marginal_ic_threshold=0.015  # Minimum marginal IC for selection
)

# Prepare factor matrix (wide format: rows=observations, cols=factors)
factor_matrix = analyzer.prepare_factor_matrix(factor_data)

# Run full analysis
report = analyzer.analyze(
    factor_matrix=factor_matrix,
    returns=returns_data,
    ic_values=ic_dict  # {factor_id: ic_value}
)

# Phase 1: Global Correlation Results
print(f"Total factors: {report.effective_n.total_factors}")
print(f"Effective N (90%): {report.effective_n.effective_n_90}")
print(f"Mean correlation: {report.correlation_stats.mean_correlation:.3f}")
print(f"High correlation pairs: {len(report.correlation_stats.high_correlation_pairs)}")

# Phase 2: Clustering Results
print(f"Number of clusters: {report.clustering_result.n_clusters}")
print(f"Central factors: {report.mst_result.central_factors}")
print(f"Peripheral factors (unique alpha): {report.mst_result.peripheral_factors}")

# Phase 3: Selection Results
print(f"Selected factors: {len(report.final_selected_factors)}")
print(f"Removed factors: {len(report.removed_factors)}")

# Get final non-redundant factor set
selected_factors = report.final_selected_factors

Quick Analysis (No Returns Required)

For initial exploration without return data:

# Run Phase 1 & 2 only
results = analyzer.quick_analysis(factor_matrix, ic_values)

print(f"Redundancy ratio: {results['summary']['redundancy_ratio_90']:.1%}")
print(f"Suggested clusters: {analyzer.suggest_optimal_cluster_count(results['phase1']['effective_n'])}")

Interactive CLI (main.py)

main.py provides an interactive menu for managing the full factor pipeline. It reads parameters from a YAML config file and prompts for additional inputs at runtime.

Quick Start

# Run with default config (pipeline_config.yaml)
python main.py

# Run with custom config
python main.py -c path/to/config.yaml

Interactive Menu

==================================================
  FactorDB Pipeline (STOCK / DAILY)
==================================================
  1. Upload factors from file
  2. Update factor values (incremental)
  3. Evaluate factors
  4. Show database status
  5. Switch asset type (STOCK <-> ETF)
  0. Exit
==================================================
Select option:

1. Upload Factors

Parses factors from a mining results file, registers new factors, calculates values, and evaluates them. Supports incremental execution -- if a previous run was interrupted, it detects which factors are already registered/calculated/evaluated and only runs the remaining steps.

Prompts:

  • File path: Path to mining results JSON file (default from config)
  • Skip evaluation: Whether to skip the evaluation step after upload

2. Update Factor Values (Incremental)

Incrementally updates all registered factor values to the latest date. Uses the A3_factor_upgrade table to determine what data is already present, then only calculates and saves the new portion.

Prompts:

  • Re-evaluate: Whether to re-evaluate factors that were updated
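
The incremental logic presumably amounts to a high-water-mark lookup. A sketch of that pattern with clickhouse-driver (illustrative, not the CLI's actual code; table and column names follow the A3_factor_upgrade schema documented below):

from clickhouse_driver import Client

client = Client(host='localhost')  # credentials as in your .env

# A3_factor_upgrade stores each factor's latest stored value date
rows = client.execute(
    'SELECT factor_id, latest_date FROM stk_factors.A3_factor_upgrade'
)
for factor_id, latest_date in rows:
    # Only the window after latest_date needs recalculating and saving
    print(f'{factor_id}: recompute from {latest_date} onward')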

3. Evaluate Factors

Calculates evaluation metrics (IC, ICIR, monotonicity, returns, etc.) for factors.

Prompts:

  • Re-evaluate ALL: By default only evaluates factors missing metrics. Choose yes to re-evaluate all factors.

4. Show Database Status

Displays a summary of the factor database: total registered factors, how many have values, how many have evaluation metrics, and the latest/earliest update dates.

Prompts:

  • Detailed list: Whether to show the full factor list

5. Switch Asset Type

Switch between STOCK and ETF factor databases. The current selection is shown in the menu header. Each asset type has its own separate database:

  • STOCK: stk_factors database, factor IDs like F_stk_000001
  • ETF: etf_factors database, factor IDs like F_etf_000001

Note: HIGH_FREQ (intraday) factors are not yet supported in the interactive CLI due to different evaluation logic.

Configuration File (pipeline_config.yaml)

# Factor classification
asset_type: "STOCK"          # STOCK | ETF
factor_type: "DAILY"         # DAILY | HIGH_FREQ (HIGH_FREQ not yet supported)

# Parallelism
n_jobs: 16

# Market data source
market_data:
  database: "stocks"
  price_table: "daily_adj_tushare"
  basic_table: "daily_basic_tushare"
  start_date: "2000-01-01"
  end_date: null             # null = today

# Upload command defaults
upload:
  file_path: "mined_factors_demo/stocks/new_128.json"
  max_factors: 140
  min_score: 30.0

# Update command defaults
update:
  evaluate_after_update: false

# Evaluate command defaults
evaluate:
  return_column: "pct_chg"
  cap_column: "circ_mv"
  n_jobs: 4                  # Evaluation threads (lower to avoid memory issues)

Note for ETF factors: If processing ETF factors with different market data tables, either:

  1. Modify the market_data section in pipeline_config.yaml before switching to ETF
  2. Or use a separate config file: python main.py -c etf_config.yaml
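
A hypothetical etf_config.yaml along these lines (the ETF database and table names are placeholders for your own data):

asset_type: "ETF"
factor_type: "DAILY"
n_jobs: 16

market_data:
  database: "etfs"            # placeholder
  price_table: "etf_daily"    # placeholder
  basic_table: "etf_basic"    # placeholder
  start_date: "2010-01-01"
  end_date: null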

Web Dashboard (dashboard.py)

A Streamlit-based web interface for viewing factor information.

Quick Start

# Activate virtual environment and run (Windows)
.venv\Scripts\python -m streamlit run dashboard.py

# Or using uv (if project has pyproject.toml)
uv run streamlit run dashboard.py

# Or activate venv first, then run
.venv\Scripts\activate
streamlit run dashboard.py

The dashboard will open in your browser at http://localhost:8501.

Features

  • Summary Statistics: Total factors, factors with values, factors with evaluation
  • Factor List: Sortable table with key metrics (IC, RankIC, ICIR, etc.)
  • Filtering: Filter by minimum IC, ICIR, or monotonicity
  • Factor Detail: Detailed view of selected factor with expression and all metrics
  • Asset Type Switch: Toggle between STOCK and ETF databases

Displayed Metrics

Metric             Description
-----------------  ----------------------------------------
IC Mean            Pearson correlation with future returns
Rank IC            Spearman correlation (more robust)
IC IR              Information ratio (IC mean / IC std)
Rank IC IR         Rank IC information ratio
Mono (10g)         Monotonicity of 10-group returns
Top-Bottom Return  Long-short portfolio return
Top-Bottom Sharpe  Long-short Sharpe ratio

All metrics are shown for Full period, 5-year, and 1-year windows.

Project Structure

FactorDB/
├── main.py                          # Interactive CLI entry point
├── dashboard.py                     # Streamlit web dashboard
├── pipeline_config.yaml             # Pipeline configuration
├── src/factordb/
│   ├── core/
│   │   ├── config.py          # Configuration management
│   │   ├── expression.py      # Expression processing
│   │   ├── factor.py          # Factor base classes
│   │   └── factor_manager.py  # Factor lifecycle management
│   ├── evaluators/
│   │   ├── factor_evaluator.py      # Main evaluator
│   │   ├── ic_calculator.py         # IC metrics
│   │   ├── return_calculator.py     # Return metrics
│   │   └── monotonicity_calculator.py
│   ├── orthogonality/               # Factor orthogonality analysis
│   │   ├── orthogonality_analyzer.py  # Main orchestrator
│   │   ├── correlation_analyzer.py    # Phase 1: Correlation analysis
│   │   ├── clustering_analyzer.py     # Phase 2: Clustering & MST
│   │   └── selection_analyzer.py      # Phase 3: VIF & Marginal IC
│   ├── operators/
│   │   └── unified_operators.py     # Operator registry (45+ operators)
│   ├── parsers/
│   │   ├── gpfactor_parser.py       # gpfactor results parser
│   │   ├── masfactor_miner_parser.py  # masfactorMiner results parser (mas2 legacy)
│   │   └── parser_factory.py        # Auto-detection & parsing
│   └── storage/
│       ├── clickhouse_storage.py    # Database operations
│       └── schema.py                # Table schemas
├── src/factor_analysis/             # Standalone expression compute & analysis
│   ├── config.py
│   ├── calculator.py
│   └── analyzer.py
├── examples/
│   └── stock_factor_pipeline.py     # Example pipeline script
├── mined_factors_demo/              # Sample mining results
├── requirements.txt
└── README.md

Supported Operators

Unary Operators

abs, sign, log, neg, inv, sqrt, square, sigmoid, tanh

Binary Operators

add, sub, mul, div, max, min, power

Time-Series Operators

ts_mean, ts_std, ts_var, ts_max, ts_min, ts_sum, ts_median, ts_delta, ts_delay, ts_return, ts_slope, ts_corr, ts_cov, ts_ema, ts_wma, ts_skew, ts_kurt, ts_rank, ts_zscore, ts_argmax, ts_argmin, ts_prod, ts_quantile

Cross-Sectional Operators

cs_rank, cs_zscore, cs_demean, cs_scale

Conditional Operators

if_else, greater, less
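
These operators compose into arbitrary expressions. For example, a volume-gated momentum factor built from the operators above and evaluated with the calculator from the Quick Start (the expression itself is just an illustration):

from factordb import ExpressionCalculator

calculator = ExpressionCalculator()
# Cross-sectional rank of 20-day price slope, kept only where volume exceeds its 20-day mean
expr = 'mul(cs_rank(ts_slope(close, 20)), greater(volume, ts_mean(volume, 20)))'
factor_values = calculator.calculate(expr, market_data)  # market_data as in Quick Start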

Evaluation Metrics

Metric                Description
--------------------  --------------------------------------------------
IC Mean               Pearson correlation with future returns
Rank IC Mean          Spearman correlation (more robust)
IC IR                 Information ratio (IC mean / IC std)
IC t-stat             Statistical significance
Direction             Trading direction (1 = long high, -1 = short high)
isMono 5/10/15        Monotonicity of group returns (5/10/15 groups)
Top-Bottom Return     Long-short portfolio return
Top-Bottom SR         Long-short Sharpe ratio
Factor Std/Skew/Kurt  Distribution statistics

All metrics are calculated for full period, recent 5 years, and recent 1 year.
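
As a standalone illustration of the headline metrics (plain pandas, not FactorDBMS internals), here are daily rank IC and its information ratio, assuming factor values and forward returns share a (code, date) MultiIndex:

import pandas as pd

def daily_rank_ic(factor: pd.Series, fwd_ret: pd.Series) -> pd.Series:
    # Cross-sectional Spearman correlation, one value per date
    df = pd.concat({'factor': factor, 'ret': fwd_ret}, axis=1).dropna()
    return df.groupby(level='date').apply(
        lambda g: g['factor'].corr(g['ret'], method='spearman')
    )

ic = daily_rank_ic(factor_values, returns_data)
print(f'Rank IC Mean: {ic.mean():.4f}')
print(f'Rank IC IR:   {ic.mean() / ic.std():.4f}')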

Orthogonality Analysis

The orthogonality module provides a three-phase framework to analyze factor redundancy and select non-redundant factors:

Phase 1: Global Correlation Check

Method              Description
------------------  ------------------------------------------------------------
Correlation Matrix  Spearman rank correlation between all factor pairs
Effective N         PCA-based eigenvalue analysis to measure true dimensionality

Key Metrics (a computation sketch follows this list):

  • effective_n_90: Number of factors explaining 90% of variance
  • redundancy_ratio: 1 - (effective_n / total_factors)
  • high_correlation_pairs: Factor pairs with |corr| > threshold
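
A minimal numpy sketch of how effective_n_90 and redundancy_ratio could be derived from the correlation matrix's eigenvalues (an illustration of the approach, not the analyzer's actual code; factor_matrix is the wide matrix prepared earlier):

import numpy as np

def effective_n(corr: np.ndarray, level: float = 0.90) -> int:
    # Fewest eigen-directions whose eigenvalues explain `level` of total variance
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    explained = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(explained, level) + 1)

corr = factor_matrix.corr(method='spearman').to_numpy()
n90 = effective_n(corr)
print(f'Effective N (90%): {n90}, redundancy ratio: {1 - n90 / corr.shape[0]:.1%}')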

Phase 2: Clustering & Structure

Method                   Description
-----------------------  --------------------------------------------------------
Hierarchical Clustering  Ward/average linkage to group similar factors
Minimum Spanning Tree    Graph-based structure to find central/peripheral factors

Key Outputs:

  • clusters: Factor groupings with intra-cluster correlation
  • central_factors: Proxy factors representing each style
  • peripheral_factors: Unique alpha factors (most orthogonal)
  • representative_factors: Best factor per cluster (by IC)
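
In outline, the clustering step can be reproduced with scipy, using 1 - |correlation| as the distance so that orthogonal factors sit far apart (a sketch under that assumption, not the library's implementation; the cut height is illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

corr = factor_matrix.corr(method='spearman').to_numpy()
dist = 1 - np.abs(corr)                 # correlated factors are close together
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method='average')
clusters = fcluster(Z, t=0.3, criterion='distance')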

Phase 3: Selection & Pruning

Method                           Description
-------------------------------  -------------------------------------------------------
VIF (Variance Inflation Factor)  Detects multicollinearity within clusters
Marginal IC Analysis             Stepwise selection based on incremental IC contribution

Selection Logic (sketched in code after these steps):

  1. Start with highest IC factor
  2. For each remaining factor, orthogonalize against selected set
  3. If residual IC > threshold (default 0.015), add to selection
  4. Repeat until no significant marginal contribution
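
A greedy sketch of these four steps in numpy/pandas (illustrative only; assumes factor_matrix rows align with the returns Series on the same index):

import numpy as np
import pandas as pd

def marginal_ic_selection(factor_matrix: pd.DataFrame, ic: dict,
                          returns: pd.Series, threshold: float = 0.015) -> list:
    selected = [max(ic, key=ic.get)]                  # step 1: highest-IC seed
    for name in sorted(ic, key=ic.get, reverse=True):
        if name in selected:
            continue
        X = factor_matrix[selected].to_numpy()
        y = factor_matrix[name].to_numpy()
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # step 2: orthogonalize against selected set
        residual = pd.Series(y - X @ beta, index=factor_matrix.index)
        if abs(residual.corr(returns, method='spearman')) > threshold:
            selected.append(name)                     # step 3: marginal IC clears the threshold
    return selected                                   # step 4: loop ends when no factor clears it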

Database Schema

A1_factor_basic

Factor registration and metadata.

Column             Type              Description
-----------------  ----------------  -------------------------------
factor_id          String            Unique ID (e.g., F_stk_000001)
name               Nullable(String)  Factor name
expression         Nullable(String)  Factor expression
explanation        Nullable(String)  Factor explanation
register_datetime  DateTime          Registration time
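
For orientation, a registration row can be queried directly with clickhouse-driver (assuming the table lives in the stk_factors database described above; the factor_id follows the documented format):

from clickhouse_driver import Client

client = Client(host='localhost')  # credentials as in your .env
rows = client.execute(
    "SELECT factor_id, name, expression, register_datetime "
    "FROM stk_factors.A1_factor_basic WHERE factor_id = 'F_stk_000001'"
)
print(rows)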

A2_factor_evaluate

Factor evaluation metrics.

Column                      Type      Description
--------------------------  --------  -------------------
factor_id                   String    Factor ID
upgrade_datetime            DateTime  Evaluation time
ic_mean, rank_ic_mean, ...  Float32   Evaluation metrics

A3_factor_upgrade

Factor value update tracking.

Column            Type      Description
----------------  --------  -------------------------
factor_id         String    Factor ID
latest_date       Date      Latest factor value date
upgrade_datetime  DateTime  Last update time

License

MIT License
