
Factor Expr

(LogReturn 30 :close) + 2019-12-27~2020-01-14.pq = [0.01, 0.035, ...]

An extremely fast factor expression & computation library for quantitative trading in Python.

On a server with an E7-4830 CPU (16 cores, 2000 MHz), computing 48 factors over a dataset of 24.5M rows × 683 columns (12 GB) takes 150 seconds.

Join [Discussions] for Q&A and feature proposals!

Features

  • Express factors in S-Expressions.
  • Compute multiple factors over multiple datasets in parallel.

Usage

There are three steps to use this library.

  1. Prepare the dataset files. Currently, only the Parquet format is supported.
  2. Define factors using S-Expression.
  3. Run replay to compute the factors on the dataset.

1. Prepare the dataset

A dataset is a table of float64 columns with arbitrary column names. Each row in the dataset represents a tick, e.g. for a daily dataset, each row is one day. For example, here is an OHLC candle dataset representing 2 ticks:

import pandas as pd

df = pd.DataFrame({
    "open": [3.1, 5.8],
    "high": [8.8, 7.7],
    "low": [1.1, 2.1],
    "close": [4.4, 3.4],
})

You can use the following code to store the DataFrame into a Parquet file:

df.to_parquet("data.pq")

2. Define your factors

Factor Expr uses S-Expressions to describe factors. For example, on a daily OHLC dataset, the 30-day log return on the column close is expressed as:

from factor_expr import Factor

Factor("(LogReturn 30 :close)")

Note that in Factor Expr, column names are referred to by the :column-name syntax.
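As a point of reference, the log-return semantics can be sketched in plain Python. This is an illustrative approximation, not the library's implementation; it assumes (LogReturn n :close) computes log(close[t] / close[t - n]), with NaN during the warm-up period:

```python
import math

def log_return(series, window):
    # Sketch of (LogReturn <window> <column>) under the assumption
    # that it computes log(x[t] / x[t - window]); the first `window`
    # ticks fall in the warm-up period and yield NaN.
    out = []
    for t, x in enumerate(series):
        if t < window:
            out.append(float("nan"))
        else:
            out.append(math.log(x / series[t - window]))
    return out

print(log_return([100.0, 101.0, 103.0, 102.0], 2))
```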

3. Compute the factors on the prepared dataset

Following steps 1 and 2, you can now compute the factors using the replay function:

from factor_expr import Factor, replay

result = await replay(
    ["data.pq"],
    [Factor("(LogReturn 30 :close)")]
)

The first parameter of replay is a list of dataset files and the second is a list of Factors, which lets you compute multiple factors on multiple datasets. Don't worry about performance: Factor Expr allows you to parallelize the computation over the factors as well as the datasets by setting n_factor_jobs and n_data_jobs in the replay function.

The returned result is a pandas DataFrame with factors as the column names and time as the index. If multiple datasets are passed in, the results are concatenated in the exact order of the datasets. This is useful if your dataset is scattered across multiple files, e.g. one file per year.

For example, the code above will give you a DataFrame similar to this:

index   (LogReturn 30 :close)
0       0.23
...     ...

Check out the docstring of replay for more information!

Installation

pip install factor-expr

Supported Functions

Notations:

  • <const> means a constant, e.g. 3.
  • <expr> means a constant, an S-Expression, or a column name, e.g. 3, (+ :close 3), or :open.

Here's the full list of supported functions. If you don't find the one you need, consider asking on Discussions or creating a PR!

Arithmetic

  • Addition: (+ <expr> <expr>)
  • Subtraction: (- <expr> <expr>)
  • Multiplication: (* <expr> <expr>)
  • Division: (/ <expr> <expr>)
  • Power: (^ <const> <expr>) - computes <expr> ^ <const>
  • Negation: (Neg <expr>)
  • Signed Power: (SPow <const> <expr>) - computes sign(<expr>) * abs(<expr>) ^ <const>
  • Natural Logarithm of the Absolute Value: (LogAbs <expr>)
  • Sign: (Sign <expr>)
  • Abs: (Abs <expr>)
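The less common operators above can be illustrated with a small pure-Python sketch (scalar versions for clarity; the library itself operates on whole columns):

```python
import math

def spow(c, x):
    # (SPow <const> <expr>): sign(x) * abs(x) ** c, which keeps the
    # sign while applying the power -- useful for even roots of
    # negative values.
    return math.copysign(abs(x) ** c, x)

def log_abs(x):
    # (LogAbs <expr>): natural logarithm of the absolute value.
    return math.log(abs(x))

print(spow(0.5, -4.0))   # -2.0: a signed square root
print(log_abs(-math.e))  # 1.0
```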

Logics

Any <expr> larger than 0 is treated as true.

  • If: (If <expr> <expr> <expr>) - if the first <expr> is larger than 0, return the second <expr>, otherwise return the third <expr>
  • And: (And <expr> <expr>)
  • Or: (Or <expr> <expr>)
  • Less Than: (< <expr> <expr>)
  • Less Than or Equal: (<= <expr> <expr>)
  • Greater Than: (> <expr> <expr>)
  • Greater Than or Equal: (>= <expr> <expr>)
  • Equal: (== <expr> <expr>)
  • Not: (! <expr>)
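The truthiness rule above determines how If behaves; a scalar sketch (illustrative only, the library evaluates these elementwise over series):

```python
def truthy(x):
    # Any value larger than 0 is treated as true.
    return x > 0

def if_(cond, then, otherwise):
    # Scalar sketch of (If <expr> <expr> <expr>).
    return then if truthy(cond) else otherwise

print(if_(0.3, 1.0, -1.0))   # 1.0
print(if_(0.0, 1.0, -1.0))   # -1.0: zero is not larger than 0
print(if_(-2.0, 1.0, -1.0))  # -1.0
```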

Window Functions

All window functions take a window size <const> as the first argument. The computation is done over a look-back window of that size.

  • Sum of the window elements: (Sum <const> <expr>)
  • Mean of the window elements: (Mean <const> <expr>)
  • Min of the window elements: (Min <const> <expr>)
  • Max of the window elements: (Max <const> <expr>)
  • The index of the min of the window elements: (ArgMin <const> <expr>)
  • The index of the max of the window elements: (ArgMax <const> <expr>)
  • Stdev of the window elements: (Std <const> <expr>)
  • Skew of the window elements: (Skew <const> <expr>)
  • The rank (ascending) of the current element in the window: (Rank <const> <expr>)
  • The value <const> ticks back: (Delay <const> <expr>)
  • The log return of the value <const> ticks back to current value: (LogReturn <const> <expr>)
  • Rolling correlation between two series: (Correlation <const> <expr> <expr>)
  • Rolling quantile of a series: (Quantile <const> <const> <expr>), e.g. (Quantile 100 0.5 <expr>) computes the median of a window sized 100.
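The look-back semantics can be sketched in pure Python. This is illustrative, not the library's (much faster) implementation; it assumes the window yields NaN until it is full:

```python
from collections import deque

def rolling_mean(series, window):
    # Sketch of (Mean <window> <expr>): average over a look-back
    # window of `window` elements; NaN until the window is full.
    buf = deque(maxlen=window)
    out = []
    for x in series:
        buf.append(x)
        out.append(sum(buf) / window if len(buf) == window else float("nan"))
    return out

print(rolling_mean([1.0, 2.0, 3.0, 4.0], 2))  # [nan, 1.5, 2.5, 3.5]
```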

Warm-up Period for Window Functions

Factors containing window functions require a warm-up period. For example, (Sum 10 :close) will not produce data until the 10th tick is replayed. During the warm-up period, replay writes NaN into the result, so the length of the factor output is the same as the length of the input dataset. You can use the trim parameter to make replay trim off the warm-up period before it returns.
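The effect of trim can be sketched as dropping the leading run of NaNs (an assumption about its behavior based on the description above, not the library's code):

```python
import math

def trim_warmup(values):
    # Drop the leading NaNs produced during the warm-up period,
    # mimicking what trim=True is described to do.
    i = 0
    while i < len(values) and math.isnan(values[i]):
        i += 1
    return values[i:]

print(trim_warmup([float("nan"), float("nan"), 0.012, -0.003]))  # [0.012, -0.003]
```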

Factors Failed to Compute

Factor Expr guarantees that no inf, -inf, or NaN appears in the result, except during the warm-up period. However, a factor can sometimes fail due to numerical issues. For example, (^ 3 (^ 3 (^ 3 :volume))) might overflow to inf, and 1 / inf becomes NaN. Factor Expr detects these situations and marks such factors as failed. Failed factors are still returned in the replay result, but all values in those columns will be NaN. You can easily remove these failed factors from the result using pd.DataFrame.dropna(axis=1, how="all").
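The suggested cleanup is a pandas one-liner; its effect can be sketched without pandas as dropping all-NaN columns (the column names below are only examples):

```python
import math

def drop_failed(columns):
    # Pure-Python analogue of pd.DataFrame.dropna(axis=1, how="all"):
    # remove columns whose values are all NaN (failed factors).
    return {
        name: values
        for name, values in columns.items()
        if not all(math.isnan(v) for v in values)
    }

result = {
    "(LogReturn 30 :close)": [0.01, 0.02],
    "(^ 3 (^ 3 :volume))": [float("nan"), float("nan")],  # failed
}
print(list(drop_failed(result)))  # ['(LogReturn 30 :close)']
```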

I Want to Have a Time Index for the Result

The replay function optionally accepts an index_col parameter. If you want to set a column from the dataset as the index of the returned result, you can do the following:

from datetime import datetime

import pandas as pd

from factor_expr import Factor, replay

pd.DataFrame({
    "time": [datetime(2021, 4, 23), datetime(2021, 4, 24)],
    "open": [3.1, 5.8],
    "high": [8.8, 7.7],
    "low": [1.1, 2.1],
    "close": [4.4, 3.4],
}).to_parquet("data.pq")

result = await replay(
    ["data.pq"],
    [Factor("(LogReturn 30 :close)")],
    index_col="time",
)

Note that accessing the time column from factor expressions will cause an error: factor expressions can only read float64 columns.

API

There are two components in Factor Expr: a Factor class and a replay function.

Factor

The Factor class is constructed from an S-Expression. It has the following signature:

class Factor:
    def __init__(self, sexpr: str) -> None:
        """Construct a Factor using an S-Expression"""

    def ready_offset(self) -> int:
        """Returns the first index after the warm-up period. 
        For non-window functions, this will always return 0."""

    def __len__(self) -> int:
        """Returns how many subtrees contained in this factor tree.

        Example
        -------
        `(+ (/ :close :open) :high)` has 5 subtrees, namely:
        1. (+ (/ :close :open) :high)
        2. (/ :close :open)
        3. :close
        4. :open
        5. :high
        """

    def __getitem__(self, i:int) -> Factor:
        """Get the i-th subtree of the sequence from the pre-order traversal of the factor tree.

        Example
        -------
        `(+ (/ :close :open) :high)` is traversed as:
        0. (+ (/ :close :open) :high)
        1. (/ :close :open)
        2. :close
        3. :open
        4. :high

        Consequently, f[2] will give you `Factor(":close")`.
        """

    def depth(self) -> int:
        """How deep is this factor tree.

        Example
        -------
        `(+ (/ :close :open) :high)` has a depth of 2, namely:
        1. (+ (/ :close :open) :high)
        2. (/ :close :open)
        """

    def child_indices(self) -> List[int]:
        """The indices for the children of this factor tree.

        Example
        -------
        The child_indices result of `(+ (/ :close :open) :high)` is [1, 4]
        """
        
    def replace(self, i: int, other: Factor) -> Factor:
        """Replace the i-th node with another subtree.

        Example
        -------
        `Factor("+ (/ :close :open) :high").replace(4, Factor("(- :high :low)")) == Factor("+ (/ :close :open) (- :high :low)")`
        """

    def columns(self) -> List[str]:
        """Return all the columns that are used by this factor.

        Example
        -------
        `(+ (/ :close :open) :high)` uses [:close, :open, :high].
        """
    
    def clone(self) -> Factor:
        """Create a copy of itself."""

replay

Replay has the following signature:

async def replay(
    files: Iterable[str | pa.Table],
    factors: List[Factor],
    *,
    reset: bool = True,
    batch_size: int = 40960,
    n_data_jobs: int = 1,
    n_factor_jobs: int = 1,
    pbar: bool = True,
    verbose: bool = False,
    output: Literal["pandas", "pyarrow", "raw"] = "pandas",
) -> Union[pd.DataFrame, pa.Table]:
    """
    Replay a list of factors on a bunch of data.

    Parameters
    ----------
    files: Iterable[str | pa.Table]
        Paths to the datasets, or already-loaded pyarrow Tables.
    factors: List[Factor]
        A list of Factors to replay.
    reset: bool = True
        Whether to reset the factors. Factors carry memory of the data they have already replayed. If you call
        replay multiple times and the factors should not start fresh, set this to False.
    batch_size: int = 40960
        How many rows to replay at one time. Default is 40960 rows.
    n_data_jobs: int = 1
        How many datasets to run in parallel. Note that the factor level parallelism is controlled by n_factor_jobs.
    n_factor_jobs: int = 1
        How many factors to run in parallel for **each** dataset.
        e.g. if `n_data_jobs=3` and `n_factor_jobs=5`, you will have 3 * 5 threads running concurrently.
    pbar: bool = True
        Whether to show the progress bar using tqdm.
    verbose: bool = False
        If True, failed factors will be printed out in stderr.
    output: Literal["pyarrow" | "raw"] = "pyarrow"
        The return format, can be pyarrow Table ("pyarrow") or un-concatenated pyarrow Tables ("raw").
    """
