FactorDBMS

A comprehensive factor library management system for quantitative trading research.

Features

  • Unified Operator Mapping: Integrates operators from different factor mining frameworks (gpfactor, masfactorMiner, and the legacy mas1 format) into a common set of operations
  • Expression Processing: Parses and calculates factor expressions across frameworks; normalization auto-detects the source framework by default and can be disabled when expressions already use unified names
  • Factor Evaluation: 30+ evaluation metrics including IC, monotonicity, returns, and distribution statistics
  • Orthogonality Analysis: Comprehensive factor redundancy analysis with correlation, clustering, and selection algorithms
  • Database Storage: ClickHouse-based storage for factor values and metadata
  • Automated Factor Management: Factor registration, calculation, and lifecycle management

Installation

# Clone the repository
git clone https://github.com/ElenYoung/FactorDBMS.git
cd FactorDBMS

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .

Configuration

Create a .env file in the project root with your ClickHouse database credentials:

DB_HOST=localhost
DB_PORT=9000
DB_USER=default
DB_PASSWORD=your_password
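
FactorDBMS reads these credentials at runtime. To sanity-check them yourself before running the pipeline, here is a minimal sketch using python-dotenv and clickhouse-driver (both third-party packages, not part of FactorDBMS):

import os
from dotenv import load_dotenv          # pip install python-dotenv
from clickhouse_driver import Client    # pip install clickhouse-driver

load_dotenv()  # reads .env from the current working directory

client = Client(
    host=os.getenv('DB_HOST', 'localhost'),
    port=int(os.getenv('DB_PORT', '9000')),
    user=os.getenv('DB_USER', 'default'),
    password=os.getenv('DB_PASSWORD', ''),
)
print(client.execute('SELECT version()'))  # should print the server version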

Quick Start

1. Expression Processing

from factordb import ExpressionNormalizer, ExpressionCalculator

# Normalize expressions from different frameworks (optional)
normalizer = ExpressionNormalizer()

# gpfactor (C++ style)
expr = normalizer.normalize_expression('sub(close)(low)', 'gpfactor')
# Result: 'sub(close, low)'

# mas1 (CamelCase style)
expr = normalizer.normalize_expression('Mul(Rank(close), EMA(volume, 5))', 'mas1')
# Result: 'mul(cs_rank(close), ts_ema(volume, 5))'

# Calculate factor values (normalization is enabled by default, pass normalize=False to skip)
calculator = ExpressionCalculator()
factor_values = calculator.calculate('ts_mean(close, 20)', market_data)          # auto-detect framework
factor_values_no_norm = calculator.calculate('ts_mean(close, 20)', market_data, normalize=False)

2. Parse Mining Results

from factordb.parsers import parse_factor_file

# Auto-detect framework and parse (supports gpfactor, masfactorMiner, and legacy mas1 mappings)
factors = parse_factor_file('path/to/mining_results.json')

for factor in factors:
    print(f"Expression: {factor.normalized_expression}")
    print(f"IC Mean: {factor.get_metric('ic_mean')}")

3. Standalone Factor Analysis (for out-of-DB or US-equity data)

If you just want to compute/analyze factor expressions on arbitrary data (CSV/Parquet/ClickHouse) without using the full FactorDB pipeline:

from factor_analysis import FactorAnalysisConfig, FactorAnalysis, FactorAnalyzer

# Load config (YAML) defining data source/columns
cfg = FactorAnalysisConfig.from_yaml('path/to/factor_analysis.yaml')

# Calculate factor values
fa = FactorAnalysis(cfg)
values = fa.calculate('ts_mean(close, 20)')  # returns Series with MultiIndex (code, date)

# Quick single-factor analysis
report = FactorAnalyzer().analyze(values)
print(report)

Example factor_analysis.yaml:

data_source: clickhouse       # csv | parquet | clickhouse
data_path: null               # required when data_source is csv/parquet
date_column: date
code_column: code
clickhouse:
  database: us_market
  table: daily
  where: "date >= '2024-01-01'"
  order_by: "code, date"
  limit: 200000

Notes for factor_analysis ClickHouse mode:

  • factor_analysis is standalone and does not route through stk_factors / etf_factors.
  • Use either clickhouse.database + clickhouse.table, a fully qualified clickhouse.table such as us_market.daily, or a raw clickhouse.query.
  • asset_type / factor_type are not used by factor_analysis and should not appear in its config.
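
For reference, the raw-query variant mentioned above might look like this (the query text is illustrative):

data_source: clickhouse
date_column: date
code_column: code
clickhouse:
  query: "SELECT code, date, close, volume FROM us_market.daily WHERE date >= '2024-01-01'"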

4. Create Custom Factors

from factordb import ExpressionFactor, CustomFactor

# Method 1: Using expression
factor = ExpressionFactor(
    expression='ts_zscore(close, 20)',
    name='price_zscore',
    explanation='20-day price z-score'
)

# Method 2: Custom calculation
class MomentumFactor(CustomFactor):
    def __init__(self, window=20):
        super().__init__()
        self.window = window
        self._name = f"momentum_{window}d"

    def _compute(self, group_data):
        return group_data['close'].pct_change(self.window)
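
For intuition, here is the same momentum logic in plain pandas, outside any FactorDBMS class (df is a hypothetical frame with a (code, date) MultiIndex and a close column, matching the data shape used elsewhere in this README):

import pandas as pd

# Per-instrument 20-day percentage change, mirroring MomentumFactor._compute
momentum = df.groupby(level='code')['close'].pct_change(20).rename('momentum_20d')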

5. Factor Evaluation

from factordb.evaluators import FactorEvaluator

evaluator = FactorEvaluator()
metrics = evaluator.evaluate(factor_values, returns_data)

print(f"IC Mean: {metrics['ic_mean']:.4f}")
print(f"IC IR: {metrics['ic_ir']:.4f}")
print(f"Direction: {metrics['direction']}")

6. Factor Management (with Database)

from factordb import FactorManager, AssetType, FactorType

manager = FactorManager()

# Register a factor
factor_id = manager.register_from_expression(
    expression='ts_zscore(close, 20)',
    name='price_zscore',
    asset_type=AssetType.STOCK,
    factor_type=FactorType.DAILY
)

# Register factors from mining results
factor_ids = manager.register_mined_factors(
    file_path='mining_results.json',
    max_factors=100,
    min_ic=0.03
)

# Calculate and save factor values
manager.calculate_and_save(factor, market_data)

# Evaluate and save results
metrics = manager.evaluate_factor(factor_id, returns_data, save_results=True)

# Search for good factors
good_factors = manager.search_factors(min_ic=0.03, min_icir=0.5)

7. Orthogonality Analysis

Analyze factor redundancy and select non-redundant factors using a three-phase framework:

from factordb.orthogonality import OrthogonalityAnalyzer

# Initialize analyzer
analyzer = OrthogonalityAnalyzer(
    correlation_threshold=0.7,   # High correlation pair threshold
    vif_threshold=5.0,           # VIF collinearity threshold
    marginal_ic_threshold=0.015  # Minimum marginal IC for selection
)

# Prepare factor matrix (wide format: rows=observations, cols=factors)
factor_matrix = analyzer.prepare_factor_matrix(factor_data)

# Run full analysis
report = analyzer.analyze(
    factor_matrix=factor_matrix,
    returns=returns_data,
    ic_values=ic_dict  # {factor_id: ic_value}
)

# Phase 1: Global Correlation Results
print(f"Total factors: {report.effective_n.total_factors}")
print(f"Effective N (90%): {report.effective_n.effective_n_90}")
print(f"Mean correlation: {report.correlation_stats.mean_correlation:.3f}")
print(f"High correlation pairs: {len(report.correlation_stats.high_correlation_pairs)}")

# Phase 2: Clustering Results
print(f"Number of clusters: {report.clustering_result.n_clusters}")
print(f"Central factors: {report.mst_result.central_factors}")
print(f"Peripheral factors (unique alpha): {report.mst_result.peripheral_factors}")

# Phase 3: Selection Results
print(f"Selected factors: {len(report.final_selected_factors)}")
print(f"Removed factors: {len(report.removed_factors)}")

# Get final non-redundant factor set
selected_factors = report.final_selected_factors

Quick Analysis (No Returns Required)

For initial exploration without return data:

# Run Phase 1 & 2 only
results = analyzer.quick_analysis(factor_matrix, ic_values)

print(f"Redundancy ratio: {results['summary']['redundancy_ratio_90']:.1%}")
print(f"Suggested clusters: {analyzer.suggest_optimal_cluster_count(results['phase1']['effective_n'])}")

Interactive CLI (main.py)

main.py provides an interactive menu for managing the full factor pipeline. It reads parameters from a YAML config file and prompts for additional inputs at runtime.

Quick Start

# Run with default config (pipeline_config.yaml)
python main.py

# Run with custom config
python main.py -c path/to/config.yaml

Interactive Menu

==================================================
  FactorDB Pipeline (STOCK / DAILY)
==================================================
  1. Upload factors from file
  2. Update factor values (incremental)
  3. Evaluate factors
  4. Show database status
  5. Switch asset type (STOCK <-> ETF)
  0. Exit
==================================================
Select option:

1. Upload Factors

Parses factors from a mining results file, registers new factors, calculates values, and evaluates them. Supports incremental execution -- if a previous run was interrupted, it detects which factors are already registered/calculated/evaluated and only runs the remaining steps.

Prompts:

  • File path: Path to mining results JSON file (default from config)
  • Skip evaluation: Whether to skip the evaluation step after upload

2. Update Factor Values (Incremental)

Incrementally updates all registered factor values to the latest date. Uses the A3_factor_upgrade table to determine what data is already present, then only calculates and saves the new portion.

Prompts:

  • Re-evaluate: Whether to re-evaluate factors that were updated
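
The incremental logic presumably amounts to a high-water-mark lookup. A sketch of that pattern with clickhouse-driver (illustrative, not the CLI's actual code; table and column names follow the A3_factor_upgrade schema documented below):

from clickhouse_driver import Client

client = Client(host='localhost')  # credentials as in your .env

# A3_factor_upgrade stores each factor's latest stored value date
rows = client.execute(
    'SELECT factor_id, latest_date FROM stk_factors.A3_factor_upgrade'
)
for factor_id, latest_date in rows:
    # Only the window after latest_date needs recalculating and saving
    print(f'{factor_id}: recompute from {latest_date} onward')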

3. Evaluate Factors

Calculates evaluation metrics (IC, ICIR, monotonicity, returns, etc.) for factors.

Prompts:

  • Re-evaluate ALL: By default only evaluates factors missing metrics. Choose yes to re-evaluate all factors.

4. Show Database Status

Displays a summary of the factor database: total registered factors, how many have values, how many have evaluation metrics, and the latest/earliest update dates.

Prompts:

  • Detailed list: Whether to show the full factor list

5. Switch Asset Type

Switch between STOCK and ETF factor databases. The current selection is shown in the menu header. Each asset type has its own separate database:

  • STOCK: stk_factors database, factor IDs like F_stk_000001
  • ETF: etf_factors database, factor IDs like F_etf_000001

Note: HIGH_FREQ (intraday) factors are not yet supported in the interactive CLI due to different evaluation logic.

Configuration File (pipeline_config.yaml)

# Factor classification
asset_type: "STOCK"          # STOCK | ETF
factor_type: "DAILY"         # DAILY | HIGH_FREQ (HIGH_FREQ not yet supported)

# Parallelism
n_jobs: 16

# Market data source
market_data:
  database: "stocks"
  price_table: "daily_adj_tushare"
  basic_table: "daily_basic_tushare"
  start_date: "2000-01-01"
  end_date: null             # null = today

# Upload command defaults
upload:
  file_path: "mined_factors_demo/stocks/new_128.json"
  max_factors: 140
  min_score: 30.0

# Update command defaults
update:
  evaluate_after_update: false

# Evaluate command defaults
evaluate:
  return_column: "pct_chg"
  cap_column: "circ_mv"
  n_jobs: 4                  # Evaluation threads (lower to avoid memory issues)

Note for ETF factors: If processing ETF factors with different market data tables, either:

  1. Modify the market_data section in pipeline_config.yaml before switching to ETF
  2. Or use a separate config file: python main.py -c etf_config.yaml
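
A hypothetical etf_config.yaml along these lines (the ETF database and table names are placeholders for your own data):

asset_type: "ETF"
factor_type: "DAILY"
n_jobs: 16

market_data:
  database: "etfs"            # placeholder
  price_table: "etf_daily"    # placeholder
  basic_table: "etf_basic"    # placeholder
  start_date: "2010-01-01"
  end_date: null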

Web Dashboard (dashboard.py)

A Streamlit-based web interface for viewing factor information.

Quick Start

# Activate virtual environment and run (Windows)
.venv\Scripts\python -m streamlit run dashboard.py

# Or using uv (if project has pyproject.toml)
uv run streamlit run dashboard.py

# Or activate venv first, then run
.venv\Scripts\activate
streamlit run dashboard.py

The dashboard will open in your browser at http://localhost:8501.

Features

  • Summary Statistics: Total factors, factors with values, factors with evaluation
  • Factor List: Sortable table with key metrics (IC, RankIC, ICIR, etc.)
  • Filtering: Filter by minimum IC, ICIR, or monotonicity
  • Factor Detail: Detailed view of selected factor with expression and all metrics
  • Asset Type Switch: Toggle between STOCK and ETF databases

Displayed Metrics

Metric             Description
-----------------  ----------------------------------------
IC Mean            Pearson correlation with future returns
Rank IC            Spearman correlation (more robust)
IC IR              Information ratio (IC mean / IC std)
Rank IC IR         Rank IC information ratio
Mono (10g)         Monotonicity of 10-group returns
Top-Bottom Return  Long-short portfolio return
Top-Bottom Sharpe  Long-short Sharpe ratio

All metrics are shown for Full period, 5-year, and 1-year windows.

Project Structure

FactorDB/
├── main.py                          # Interactive CLI entry point
├── dashboard.py                     # Streamlit web dashboard
├── pipeline_config.yaml             # Pipeline configuration
├── src/factordb/
│   ├── core/
│   │   ├── config.py          # Configuration management
│   │   ├── expression.py      # Expression processing
│   │   ├── factor.py          # Factor base classes
│   │   └── factor_manager.py  # Factor lifecycle management
│   ├── evaluators/
│   │   ├── factor_evaluator.py      # Main evaluator
│   │   ├── ic_calculator.py         # IC metrics
│   │   ├── return_calculator.py     # Return metrics
│   │   └── monotonicity_calculator.py
│   ├── orthogonality/               # Factor orthogonality analysis
│   │   ├── orthogonality_analyzer.py  # Main orchestrator
│   │   ├── correlation_analyzer.py    # Phase 1: Correlation analysis
│   │   ├── clustering_analyzer.py     # Phase 2: Clustering & MST
│   │   └── selection_analyzer.py      # Phase 3: VIF & Marginal IC
│   ├── operators/
│   │   └── unified_operators.py     # Operator registry (45+ operators)
│   ├── parsers/
│   │   ├── gpfactor_parser.py       # gpfactor results parser
│   │   ├── masfactor_miner_parser.py  # masfactorMiner results parser (mas2 legacy)
│   │   └── parser_factory.py        # Auto-detection & parsing
│   └── storage/
│       ├── clickhouse_storage.py    # Database operations
│       └── schema.py                # Table schemas
├── src/factor_analysis/             # Standalone expression compute & analysis
│   ├── config.py
│   ├── calculator.py
│   └── analyzer.py
├── examples/
│   └── stock_factor_pipeline.py     # Example pipeline script
├── mined_factors_demo/              # Sample mining results
├── requirements.txt
└── README.md

Supported Operators

Unary Operators

abs, sign, log, neg, inv, sqrt, square, sigmoid, tanh

Binary Operators

add, sub, mul, div, max, min, power

Time-Series Operators

ts_mean, ts_std, ts_var, ts_max, ts_min, ts_sum, ts_median, ts_delta, ts_delay, ts_return, ts_slope, ts_corr, ts_cov, ts_ema, ts_wma, ts_skew, ts_kurt, ts_rank, ts_zscore, ts_argmax, ts_argmin, ts_prod, ts_quantile

Cross-Sectional Operators

cs_rank, cs_zscore, cs_demean, cs_scale

Conditional Operators

if_else, greater, less
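
These operators compose into arbitrary expressions. For example, a volume-gated momentum factor built from the operators above and evaluated with the calculator from the Quick Start (the expression itself is just an illustration):

from factordb import ExpressionCalculator

calculator = ExpressionCalculator()
# Cross-sectional rank of 20-day price slope, kept only where volume exceeds its 20-day mean
expr = 'mul(cs_rank(ts_slope(close, 20)), greater(volume, ts_mean(volume, 20)))'
factor_values = calculator.calculate(expr, market_data)  # market_data as in Quick Start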

Evaluation Metrics

Metric                Description
--------------------  --------------------------------------------------
IC Mean               Pearson correlation with future returns
Rank IC Mean          Spearman correlation (more robust)
IC IR                 Information ratio (IC mean / IC std)
IC t-stat             Statistical significance
Direction             Trading direction (1 = long high, -1 = short high)
isMono 5/10/15        Monotonicity of group returns (5/10/15 groups)
Top-Bottom Return     Long-short portfolio return
Top-Bottom SR         Long-short Sharpe ratio
Factor Std/Skew/Kurt  Distribution statistics

All metrics are calculated for full period, recent 5 years, and recent 1 year.
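
As a standalone illustration of the headline metrics (plain pandas, not FactorDBMS internals), here are daily rank IC and its information ratio, assuming factor values and forward returns share a (code, date) MultiIndex:

import pandas as pd

def daily_rank_ic(factor: pd.Series, fwd_ret: pd.Series) -> pd.Series:
    # Cross-sectional Spearman correlation, one value per date
    df = pd.concat({'factor': factor, 'ret': fwd_ret}, axis=1).dropna()
    return df.groupby(level='date').apply(
        lambda g: g['factor'].corr(g['ret'], method='spearman')
    )

ic = daily_rank_ic(factor_values, returns_data)
print(f'Rank IC Mean: {ic.mean():.4f}')
print(f'Rank IC IR:   {ic.mean() / ic.std():.4f}')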

Orthogonality Analysis

The orthogonality module provides a three-phase framework to analyze factor redundancy and select non-redundant factors:

Phase 1: Global Correlation Check

Method              Description
------------------  ------------------------------------------------------------
Correlation Matrix  Spearman rank correlation between all factor pairs
Effective N         PCA-based eigenvalue analysis to measure true dimensionality

Key Metrics (a computation sketch follows this list):

  • effective_n_90: Number of factors explaining 90% of variance
  • redundancy_ratio: 1 - (effective_n / total_factors)
  • high_correlation_pairs: Factor pairs with |corr| > threshold
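
A minimal numpy sketch of how effective_n_90 and redundancy_ratio could be derived from the correlation matrix's eigenvalues (an illustration of the approach, not the analyzer's actual code; factor_matrix is the wide matrix prepared earlier):

import numpy as np

def effective_n(corr: np.ndarray, level: float = 0.90) -> int:
    # Fewest eigen-directions whose eigenvalues explain `level` of total variance
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    explained = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(explained, level) + 1)

corr = factor_matrix.corr(method='spearman').to_numpy()
n90 = effective_n(corr)
print(f'Effective N (90%): {n90}, redundancy ratio: {1 - n90 / corr.shape[0]:.1%}')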

Phase 2: Clustering & Structure

Method                   Description
-----------------------  --------------------------------------------------------
Hierarchical Clustering  Ward/average linkage to group similar factors
Minimum Spanning Tree    Graph-based structure to find central/peripheral factors

Key Outputs:

  • clusters: Factor groupings with intra-cluster correlation
  • central_factors: Proxy factors representing each style
  • peripheral_factors: Unique alpha factors (most orthogonal)
  • representative_factors: Best factor per cluster (by IC)
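
In outline, the clustering step can be reproduced with scipy, using 1 - |correlation| as the distance so that orthogonal factors sit far apart (a sketch under that assumption, not the library's implementation; the cut height is illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

corr = factor_matrix.corr(method='spearman').to_numpy()
dist = 1 - np.abs(corr)                 # correlated factors are close together
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method='average')
clusters = fcluster(Z, t=0.3, criterion='distance')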

Phase 3: Selection & Pruning

Method                           Description
-------------------------------  -------------------------------------------------------
VIF (Variance Inflation Factor)  Detects multicollinearity within clusters
Marginal IC Analysis             Stepwise selection based on incremental IC contribution

Selection Logic (sketched in code after these steps):

  1. Start with highest IC factor
  2. For each remaining factor, orthogonalize against selected set
  3. If residual IC > threshold (default 0.015), add to selection
  4. Repeat until no significant marginal contribution
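
A greedy sketch of these four steps in numpy/pandas (illustrative only; assumes factor_matrix rows align with the returns Series on the same index):

import numpy as np
import pandas as pd

def marginal_ic_selection(factor_matrix: pd.DataFrame, ic: dict,
                          returns: pd.Series, threshold: float = 0.015) -> list:
    selected = [max(ic, key=ic.get)]                  # step 1: highest-IC seed
    for name in sorted(ic, key=ic.get, reverse=True):
        if name in selected:
            continue
        X = factor_matrix[selected].to_numpy()
        y = factor_matrix[name].to_numpy()
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # step 2: orthogonalize against selected set
        residual = pd.Series(y - X @ beta, index=factor_matrix.index)
        if abs(residual.corr(returns, method='spearman')) > threshold:
            selected.append(name)                     # step 3: marginal IC clears the threshold
    return selected                                   # step 4: loop ends when no factor clears it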

Database Schema

A1_factor_basic

Factor registration and metadata.

Column             Type              Description
-----------------  ----------------  -------------------------------
factor_id          String            Unique ID (e.g., F_stk_000001)
name               Nullable(String)  Factor name
expression         Nullable(String)  Factor expression
explanation        Nullable(String)  Factor explanation
register_datetime  DateTime          Registration time
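
For orientation, a registration row can be queried directly with clickhouse-driver (assuming the table lives in the stk_factors database described above; the factor_id follows the documented format):

from clickhouse_driver import Client

client = Client(host='localhost')  # credentials as in your .env
rows = client.execute(
    "SELECT factor_id, name, expression, register_datetime "
    "FROM stk_factors.A1_factor_basic WHERE factor_id = 'F_stk_000001'"
)
print(rows)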

A2_factor_evaluate

Factor evaluation metrics.

Column                      Type      Description
--------------------------  --------  -------------------
factor_id                   String    Factor ID
upgrade_datetime            DateTime  Evaluation time
ic_mean, rank_ic_mean, ...  Float32   Evaluation metrics

A3_factor_upgrade

Factor value update tracking.

Column            Type      Description
----------------  --------  -------------------------
factor_id         String    Factor ID
latest_date       Date      Latest factor value date
upgrade_datetime  DateTime  Last update time

License

MIT License
