Machine Learning API

ML selection, feature engineering, and model wrappers.


FeaturePipeline

Feature engineering pipeline — computes technical-indicator features from OHLCV data.

Constructor

FeaturePipeline(
    features: list[str] | None = None,
    custom_features: dict[str, Callable] | None = None,
)

| Parameter | Type | Description |
| --- | --- | --- |
| features | list[str] \| None | Feature names to compute. If None, uses the default set. |
| custom_features | dict[str, Callable] \| None | Custom feature functions: {name: func(close, high, low, volume) -> float} |
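
A custom feature is any callable with the signature func(close, high, low, volume) -> float. The sketch below is a minimal illustration, assuming each argument is an array-like lookback window ordered oldest first (the reference does not state the exact type). The function name close_location and the toy data are hypothetical. The name is also listed in features, since the MLSelector section states that a custom feature is only invoked when its name appears there.

```python
from typing import Sequence


def close_location(close: Sequence[float], high: Sequence[float],
                   low: Sequence[float], volume: Sequence[float]) -> float:
    """Position of the latest close inside the window's high-low range (0 to 1)."""
    # list(...) keeps the sketch agnostic to the container type (assumption).
    highs, lows, closes = list(high), list(low), list(close)
    window_high, window_low = max(highs), min(lows)
    if window_high == window_low:
        return 0.5  # flat window: there is no range to locate the close in
    return (closes[-1] - window_low) / (window_high - window_low)


# Hypothetical registration: the dict key is the feature name.
custom_features = {"close_location": close_location}
features = ["rsi", "atr", "close_location"]

# Stand-alone check on toy data.
value = close_location(
    close=[10.0, 10.5, 11.0],
    high=[10.2, 10.8, 11.5],
    low=[9.5, 10.1, 10.6],
    volume=[1000, 1200, 900],
)
```

The volume argument is unused here but stays in the signature so that the callable matches the documented shape.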

Methods

compute

compute(securities, context, lookback=60) -> pd.DataFrame

Compute the feature matrix for the given securities.

| Parameter | Type | Description |
| --- | --- | --- |
| securities | list[str] | List of security codes |
| context | Context | Current backtest context |
| lookback | int | Lookback window in days |

Returns: pd.DataFrame — index is security codes, columns are feature names


BaseMLModel

ML model wrapper — unifies the sklearn model interface.

Constructor

BaseMLModel(
    model_type: str = 'random_forest',
    **kwargs,
)

| Parameter | Type | Description |
| --- | --- | --- |
| model_type | str | Model type: random_forest, logistic_regression, gradient_boosting, xgboost |
| **kwargs | | Passed to the underlying model |

Methods

fit

fit(X: pd.DataFrame, y: pd.Series) -> None

Train the model.

predict

predict(X: pd.DataFrame) -> np.ndarray

Return predictions. For classifiers, the result is the positive-class probability, a value between 0 and 1.

predict_proba

predict_proba(X: pd.DataFrame) -> np.ndarray

Return probabilities for all classes.
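
The relationship between the two methods can be pictured with plain lists. This is an illustration only, assuming binary labels ordered [0, 1] so that the second column holds the positive class; the reference does not confirm the column order, and the numbers are made up.

```python
# Toy predict_proba-style output: one row per sample, one column per class.
# Assumption: column 0 is the negative class, column 1 the positive class.
proba = [
    [0.80, 0.20],
    [0.35, 0.65],
    [0.10, 0.90],
]

# For a classifier, predict is described as returning only the positive-class
# probability, which under the assumption above is column 1 of each row.
positive_scores = [row[1] for row in proba]
```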

feature_importances

feature_importances() -> pd.Series

Return feature importances (sorted).

save / load

model.save(path)                  # path: str
loaded = BaseMLModel.load(path)   # path: str

Serialize / deserialize the model.


MLSelector

ML-based stock selector, inherits from StockSelector.

Single-cross-section training

By default, the model is fitted on a single day's cross-section of data. With a small universe (fewer than 50 stocks) this yields too few samples for the model to learn meaningful patterns. For robust training, pass label_data as a panel DataFrame with historical features and labels across many dates.

Constructor

MLSelector(
    model: str = 'random_forest',
    features: list[str] | None = None,
    target: str = 'past_return_5d',
    top_n: int = 5,
    train_start: str | None = None,
    train_end: str | None = None,
    lookback: int = 60,
    label_data: pd.DataFrame | None = None,
    custom_features: dict[str, Callable] | None = None,
    **model_kwargs,
)

| Parameter | Type | Description |
| --- | --- | --- |
| model | str | Model type or BaseMLModel instance |
| features | list[str] | Feature list |
| target | str | Target variable: past_return_5d, past_return_10d, will_rise_5d. forward_return_5d raises NotImplementedError; use label_data for true forward-return prediction. |
| top_n | int | Number of stocks to select |
| train_start | str | Training start date (YYYY-MM-DD) |
| train_end | str | Training end date (YYYY-MM-DD) |
| lookback | int | Historical lookback in days |
| label_data | pd.DataFrame \| None | Pre-computed label panel. Must contain the columns ['security', 'date', 'label']. When provided, training uses this panel instead of computing labels from target. This is the recommended path for true forward-return prediction. |
| custom_features | dict[str, Callable] \| None | Custom feature functions: {name: func(close, high, low, volume) -> float}. A custom feature is only invoked if its name also appears in features. |
| **model_kwargs | | Extra model parameters |
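
A label panel for forward-return prediction could be assembled as sketched below. This is a minimal example under stated assumptions: the helper forward_return_rows, the security code "AAA" and the prices are hypothetical, the label is defined as the simple return over the next `horizon` trading days, and the expected dtype of the date column is not specified in this reference. Only the three column names come from the parameter table.

```python
from datetime import date


def forward_return_rows(prices, horizon=5):
    """Build label rows: the label at day t is close[t + horizon] / close[t] - 1.

    `prices` maps security -> list of (date, close), sorted oldest first.
    The last `horizon` days of each security get no row, because their
    future price is not known yet.
    """
    rows = []
    for security, series in prices.items():
        for i in range(len(series) - horizon):
            day, close_now = series[i]
            close_later = series[i + horizon][1]
            rows.append({
                "security": security,
                "date": day,
                "label": close_later / close_now - 1.0,
            })
    return rows


# Toy data: 8 trading days for one hypothetical security code.
toy_prices = {
    "AAA": [(date(2024, 1, d), 100.0 + d) for d in range(1, 9)],
}
rows = forward_return_rows(toy_prices, horizon=5)

# With pandas installed, the panel would then be built as:
#   label_data = pd.DataFrame(rows, columns=["security", "date", "label"])
```

Dropping the trailing rows matters: filling them with a placeholder would feed the model labels it could never have observed at training time.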

Methods

train

train(securities: list[str], context) -> None

Train the model on historical data.

rank

rank(securities: list[str], context) -> list[str]

Return the top-N stocks ranked by model prediction score.

Returns: list[str] — stock codes (best first)
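
Conceptually, ranking reduces to sorting securities by their prediction score and keeping the first top_n. The sketch below shows only that step, with a hypothetical helper and made-up scores; how MLSelector breaks ties or handles missing predictions is not documented here.

```python
def top_n_by_score(scores: dict[str, float], top_n: int = 5) -> list[str]:
    """Return the `top_n` security codes with the highest scores, best first."""
    ranked = sorted(scores, key=lambda code: scores[code], reverse=True)
    return ranked[:top_n]


# Hypothetical model scores per security code.
scores = {"AAA": 0.61, "BBB": 0.47, "CCC": 0.83, "DDD": 0.52}
picks = top_n_by_score(scores, top_n=2)
```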


optimize_hyperparams

Hyperparameter optimization with time-series-aware cross-validation.

from eqlib.ml.tuning import optimize_hyperparams

best_params = optimize_hyperparams(
    pipeline,
    model_type='random_forest',
    X=X_train,
    y=y_train,
    param_grid={'n_estimators': [50, 100, 200]},
    cv_method='time_series_split',
    n_splits=5,
    scoring='roc_auc',
)

| Parameter | Type | Description |
| --- | --- | --- |
| pipeline | FeaturePipeline | Feature pipeline instance |
| model_type | str | Model type |
| X | pd.DataFrame | Feature matrix |
| y | pd.Series | Target variable |
| param_grid | dict | Parameter grid |
| cv_method | str | time_series_split or walk_forward |
| n_splits | int | Number of CV folds |
| scoring | str | Scoring metric: roc_auc, accuracy, neg_log_loss |
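
Time-series-aware cross-validation means every fold trains on earlier rows and tests on later ones, so no future information leaks into training. The hand-rolled generator below sketches an expanding-window split, similar in spirit to scikit-learn's TimeSeriesSplit. It is not the library's implementation, and it assumes the rows of X and y are already sorted chronologically.

```python
def expanding_window_splits(n_samples: int, n_splits: int = 5):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of folds")
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        # Train on everything before `start`, test on the next block.
        yield list(range(start)), list(range(start, start + test_size))


folds = list(expanding_window_splits(n_samples=12, n_splits=3))
for train_idx, test_idx in folds:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Each fold's training set grows while the test block moves forward in time, which is why a plain shuffled K-fold is unsuitable for this kind of data.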

validate_ml_strategy

Validate an ML strategy from a backtest result and a trained model.

from eqlib.ml.validation import validate_ml_strategy

report = validate_ml_strategy(
    backtest_result,
    model,
    feature_importance_threshold=0.01,
)

Return fields:
- feature_importance: per-feature importance
- concentration_risk: whether importance is too concentrated
- model_stability: model stability
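
This reference does not say how concentration_risk is measured. As a purely illustrative proxy, and not the library's method, importance concentration can be summarised by the largest single share and by a Herfindahl-style sum of squared shares. The helper name and the importance values below are hypothetical.

```python
def importance_concentration(importances: dict[str, float]) -> dict:
    """Summarise how concentrated a set of non-negative importances is."""
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importances must sum to a positive number")
    shares = {name: value / total for name, value in importances.items()}
    return {
        # Weight carried by the single most important feature.
        "top_share": max(shares.values()),
        # Equals 1/n for uniform importances and 1.0 when one feature has all the weight.
        "herfindahl": sum(s * s for s in shares.values()),
    }


summary = importance_concentration({"rsi": 0.5, "atr": 0.3, "momentum": 0.2})
```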


check_feature_drift

Detect feature-distribution drift between train and live data via the Kolmogorov-Smirnov statistic.

from eqlib.ml.validation import check_feature_drift

drift = check_feature_drift(X_train, X_live, threshold=0.1)

| Parameter | Type | Description |
| --- | --- | --- |
| X_train | pd.DataFrame | Training feature matrix |
| X_test | pd.DataFrame | Live / test feature matrix |
| threshold | float | KS-statistic threshold above which a feature is flagged as drifted |

Return fields:
- drift_scores: per-feature {ks_stat, p_value} dict
- drifted_features: list of feature names that exceeded the threshold
- drift_detected: boolean, whether any drift was found
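
The two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical CDFs of two samples. The pure-Python sketch below shows that computation and a report shaped like the fields listed above. It is an illustration under stated assumptions, not the library's code: p-values are omitted, features are passed as plain dicts of lists, and the helper names and toy values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:  # the gap can only change at observed values
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def drift_report(train: dict, live: dict, threshold: float = 0.1) -> dict:
    """Flag every feature whose KS statistic exceeds `threshold`."""
    scores = {name: {"ks_stat": ks_statistic(train[name], live[name])} for name in train}
    drifted = [name for name, s in scores.items() if s["ks_stat"] > threshold]
    return {
        "drift_scores": scores,
        "drifted_features": drifted,
        "drift_detected": bool(drifted),
    }


# Toy data: rsi is unchanged, atr has shifted upwards.
train = {"rsi": [30, 40, 50, 60], "atr": [1.0, 1.1, 1.2, 1.3]}
live = {"rsi": [30, 40, 50, 60], "atr": [2.0, 2.1, 2.2, 2.3]}
report = drift_report(train, live, threshold=0.1)
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all.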

When to use

Call before each daily or weekly live run to compare the current feature distribution against the training set. A drifted feature is a signal to retrain the model or to raise a monitoring alert.


auto_tune_selector

Auto-tune hyperparameters for an MLSelector instance, using time-series-aware cross-validation.

from eqlib.ml.tuning import auto_tune_selector

best_params = auto_tune_selector(
    selector,
    context,
    param_grid=None,             # default grid by model_type
    cv_method='time_series_split',
    n_splits=3,
    scoring='roc_auc',
)

| Parameter | Type | Description |
| --- | --- | --- |
| selector | MLSelector | Configured selector instance |
| context | Context | Current backtest context (used to read universe and compute features) |
| param_grid | dict \| None | Parameter grid; None selects a default grid by model_type |
| cv_method | str | 'time_series_split' or 'walk_forward' (both use TimeSeriesSplit underneath) |
| n_splits | int | Number of CV folds |
| scoring | str | Scoring metric: roc_auc, accuracy, neg_log_loss |

Returns: dict — best parameters. Returns an empty dict when data is insufficient or no universe is available.
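
A parameter grid maps each parameter name to its candidate values. Assuming the tuner performs an exhaustive grid search (the reference does not spell out the search strategy), the cost of a call can be estimated by expanding the grid and multiplying by the number of folds. The helper below is a hypothetical sketch; n_estimators appears in this reference, while max_depth is added here purely for illustration.

```python
from itertools import product


def expand_grid(param_grid: dict) -> list[dict]:
    """List every parameter combination in a grid, in a stable order."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5]}
candidates = expand_grid(grid)

# One model fit per candidate per fold (n_splits=3 in the call above).
n_fits = len(candidates) * 3
```

The number of fits grows multiplicatively with each added parameter, so small grids are advisable when tuning runs inside a backtest.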

Difference from optimize_hyperparams

optimize_hyperparams requires the caller to prepare X and y. auto_tune_selector pulls its data directly from selector.pipeline.compute(...) and selector._compute_target(...), so it can be invoked with a single call inside a strategy's initialize.


Built-in features

| Feature | Computation |
| --- | --- |
| rsi | RSI(14) |
| macd_dif | MACD difference |
| macd_dea | MACD signal line |
| macd_hist | MACD histogram |
| atr | ATR(14) |
| boll_upper | Bollinger upper band |
| boll_mid | Bollinger middle band |
| boll_lower | Bollinger lower band |
| donchian_upper | Donchian upper |
| donchian_mid | Donchian middle |
| donchian_lower | Donchian lower |
| cci | CCI(14) |
| obv | OBV |
| volume_ratio | 5-day avg volume / 20-day avg volume |
| momentum | 20-day momentum |
| volatility | 20-day return std |
| roc | 12-period rate of change |
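
Two of the entries above are simple enough to sketch directly. The functions below are illustrations rather than the library's implementations. They assume array-like inputs ordered oldest first, simple (not log) daily returns, and the sample standard deviation; the library's exact conventions are not stated in this reference.

```python
from statistics import mean, stdev


def volume_ratio(volume) -> float:
    """5-day average volume divided by 20-day average volume."""
    v = list(volume)
    return mean(v[-5:]) / mean(v[-20:])


def volatility(close, window: int = 20) -> float:
    """Sample standard deviation of the last `window` simple daily returns."""
    c = list(close)[-(window + 1):]  # `window` returns need window + 1 prices
    returns = [c[i] / c[i - 1] - 1.0 for i in range(1, len(c))]
    return stdev(returns)


# Toy data: volume doubles in the last five days, price rises 1% per day.
volumes = [100.0] * 15 + [200.0] * 5
closes = [100.0 * 1.01 ** i for i in range(21)]

ratio = volume_ratio(volumes)
vol = volatility(closes)
```

A ratio above 1 indicates that recent trading activity is higher than the 20-day norm, and a constant daily return gives a volatility of effectively zero.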