Machine Learning Stock Selection

Overview

| Item | Description |
|------|-------------|
| Goal | Replace hand-tuned factor weights with MLSelector and let the model learn factor weights and interactions automatically |
| Prerequisites | Run a backtest, Selection strategy framework |

1. Why ML-based stock selection?

Traditional multi-factor selection uses hand-tuned weights:

from eqlib import MultiFactorSelector

selector = MultiFactorSelector(
    factors={"pe": -0.4, "pb": -0.2, "pct_change": 0.4},
    top_n=5,
)

Problems:

- Weights are tuned by the author's experience and may only fit a specific market regime
- Interactions between factors are ignored (e.g., "momentum + high vol" behaves differently across regimes)
- Non-linear relationships cannot be captured

ML selection learns automatically:

- Per-factor importance (data-driven, not fixed)
- Interactions between factors
- Time-varying weights
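
For contrast, a fixed-weight scheme amounts to a weighted sum of standardized factor values. The sketch below illustrates that idea in plain Python; it is not `MultiFactorSelector`'s actual implementation, and the helper names (`zscores`, `fixed_weight_rank`), stock codes, and factor values are made up for the example.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize one factor's cross-section to mean 0 and std 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def fixed_weight_rank(factor_table, weights, top_n):
    """Rank stocks by a weighted sum of z-scored factors, highest score first."""
    codes = list(factor_table)
    scores = {code: 0.0 for code in codes}
    for name, weight in weights.items():
        column = zscores([factor_table[code][name] for code in codes])
        for code, z in zip(codes, column):
            scores[code] += weight * z
    return sorted(codes, key=lambda c: scores[c], reverse=True)[:top_n]

# Hypothetical factor values, for illustration only
factor_table = {
    "AAA": {"pe": 10.0, "pb": 1.0, "pct_change": 0.05},
    "BBB": {"pe": 30.0, "pb": 4.0, "pct_change": 0.01},
    "CCC": {"pe": 20.0, "pb": 2.0, "pct_change": -0.02},
}
weights = {"pe": -0.4, "pb": -0.2, "pct_change": 0.4}
top = fixed_weight_rank(factor_table, weights, top_n=2)
print(top)
```

Because the score is linear in each factor, no choice of weights can express a rule such as "momentum only helps when volatility is low". Tree-based models can represent that kind of interaction.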

2. A minimal ML selection strategy

from eqlib import *
from eqlib.ml import MLSelector

def initialize(context):
    set_benchmark('000300.XSHG')
    set_order_cost(OrderCost(open_tax=0, close_tax=0.0005,
                              open_commission=0.00025, close_commission=0.00025))

    context.universe = ['601390', '600519', '000858', '002594', '601398']

    # ML selector: learns factor weights from data
    # target='past_return_5d' uses historical 5-day returns as labels (default).
    # For true forward-return prediction, pass pre-computed labels via label_data (panel).
    g.selector = MLSelector(
        model='random_forest',
        features=['rsi', 'macd_hist', 'atr', 'momentum', 'volatility'],
        target='past_return_5d',
        top_n=3,
    )

    before_trading_start(train_model)  # register pre-market training hook
    run_weekly(rebalance, day_of_week=0, time='every_bar')

def train_model(context, data=None):
    # Retrain the model before each trading day (uses latest data)
    g.selector.train(context.universe, context)

def rebalance(context):
    # Get the top-3 stocks selected by the model
    selected = g.selector.rank(context.universe, context)

    # Sell positions not in the selected list
    for pos in list(context.portfolio.positions.keys()):
        if pos not in selected:
            order_target(pos, 0)

    # Buy selected stocks
    if selected:
        cash_per_stock = context.portfolio.available_cash / len(selected)
        for stock in selected:
            order_value(stock, cash_per_stock)

# Run the backtest
result = run_strategy(
    initialize,
    start_date='2022-01-01',
    end_date='2024-01-01',
    securities=['601390', '600519', '000858', '002594', '601398'],
)

3. MLSelector parameters

MLSelector(
    model='random_forest',           # model type
    features=['rsi', 'macd_hist'],  # features to use
    target='past_return_5d',         # prediction target (default: past 5-day return)
    top_n=5,                         # number of stocks to select
    train_start=None,                # training start date (optional)
    train_end=None,                  # training end date (optional)
    lookback=60,                     # historical lookback (days)
)

model: model type

| Model | Description | Best for |
|-------|-------------|----------|
| `random_forest` | Random Forest (default) | General use; robust, less prone to overfitting |
| `logistic_regression` | Logistic Regression | Simple scenarios; interpretable |
| `gradient_boosting` | Gradient Boosting | When accuracy matters |
| `xgboost` | XGBoost | When accuracy matters (requires `xgboost` installed) |

features: available features

| Feature | Description |
|---------|-------------|
| `rsi` | RSI(14), relative strength index |
| `macd_dif`, `macd_dea`, `macd_hist` | MACD lines |
| `atr` | ATR(14), average true range |
| `boll_upper`, `boll_mid`, `boll_lower` | Bollinger bands |
| `donchian_upper`, `donchian_mid`, `donchian_lower` | Donchian channel |
| `cci` | CCI(14), commodity channel index |
| `obv` | OBV, on-balance volume |
| `volume_ratio` | 5-day average volume / 20-day average volume |
| `momentum` | 20-day momentum (price / price[-20] - 1) |
| `volatility` | 20-day standard deviation of returns |
| `roc` | 12-period rate of change |
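
The last few definitions in the table are simple enough to write out by hand. The following sketch applies those formulas to plain Python lists. It assumes that `price[-20]` means the close 20 bars before the latest one and that `volatility` is the sample standard deviation of simple returns; the library's exact conventions may differ.

```python
import math
from statistics import mean, stdev

def momentum(close, window=20):
    """Latest close relative to the close `window` bars earlier, minus 1."""
    if len(close) <= window:
        return math.nan
    return close[-1] / close[-1 - window] - 1.0

def volatility(close, window=20):
    """Sample standard deviation of the last `window` simple returns."""
    if len(close) <= window:
        return math.nan
    recent = close[-(window + 1):]
    returns = [later / earlier - 1.0 for earlier, later in zip(recent, recent[1:])]
    return stdev(returns)

def volume_ratio(volume, short=5, long=20):
    """Average volume over the last `short` bars divided by the last `long` bars."""
    if len(volume) < long:
        return math.nan
    return mean(volume[-short:]) / mean(volume[-long:])

# Synthetic series: price grows 1% per bar, volume doubles in the last 5 bars
close = [100.0 * 1.01 ** i for i in range(21)]
volume = [100.0] * 15 + [200.0] * 5
print(momentum(close), volatility(close), volume_ratio(volume))
```

Returning NaN when there is too little history mirrors the convention used by the custom feature example in section 5.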

target: prediction target

| Target | Description |
|--------|-------------|
| `past_return_5d` | Past 5-day return (default; regression) |
| `past_return_10d` | Past 10-day return (regression) |
| `will_rise_5d` | Whether the price rose over the past 5 days (0/1 classification) |

About forward_return_5d

target='forward_return_5d' is no longer supported; passing it raises NotImplementedError. Reason: true forward-return labels cannot be constructed from a single day's cross-section. For true forward-return prediction, pass pre-computed labels via the label_data parameter (a panel DataFrame with columns ['security', 'date', 'label']).
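
A label panel of that shape can be built offline from a price history. The sketch below uses plain Python and returns rows carrying the three expected keys. The function name `forward_return_labels` and the input layout are hypothetical, and converting the rows to a DataFrame (for example with `pandas.DataFrame(rows)`) is left to the caller.

```python
def forward_return_labels(prices, horizon=5):
    """Build rows of {'security', 'date', 'label'} where label is the return
    over the next `horizon` bars.

    prices: {security: [(date, close), ...]} sorted by date ascending.
    The last `horizon` dates of each security have no label yet and are skipped.
    """
    rows = []
    for security, series in prices.items():
        for i in range(len(series) - horizon):
            date, close_now = series[i]
            close_later = series[i + horizon][1]
            rows.append({
                "security": security,
                "date": date,
                "label": close_later / close_now - 1.0,
            })
    return rows

# Hypothetical closes, for illustration only
prices = {
    "600519": [
        ("2024-01-02", 100.0),
        ("2024-01-03", 101.0),
        ("2024-01-04", 102.0),
        ("2024-01-05", 110.0),
    ],
}
rows = forward_return_labels(prices, horizon=2)
for row in rows:
    print(row)
```

Each label only becomes known `horizon` bars after its date. When training on a given day, use only rows whose date lies at least `horizon` bars in the past; otherwise the model sees the future.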

4. Using FeaturePipeline standalone

Sometimes you want features directly (e.g., for visualization or analysis) without MLSelector:

from eqlib import *  # provides g, as in the strategy in section 2
from eqlib.ml import FeaturePipeline

g.pipeline = FeaturePipeline(features=['rsi', 'macd_hist', 'momentum'])

def rebalance(context):
    # Compute features
    features = g.pipeline.compute(context.universe, context, lookback=60)

    # features is a DataFrame
    # index = stock codes, columns = feature names
    print(features.head())
    #              rsi  macd_hist  momentum
    # 601390     45.2       0.12      0.03
    # 600519     62.1      -0.05     -0.01

5. Custom features

You can add your own feature functions:

from eqlib import *  # provides g, as in the strategy in section 2
from eqlib.ml import FeaturePipeline, MLSelector

def price_to_ma_ratio(close, high, low, volume):
    """Price deviation from 20-day MA."""
    if len(close) < 20:
        return float('nan')
    ma20 = close.iloc[-20:].mean()
    return float(close.iloc[-1] / ma20 - 1.0)

# Create the selector, then replace its pipeline with one that knows the custom feature
g.selector = MLSelector(
    features=['rsi', 'momentum', 'price_ma_ratio'],
    target='past_return_5d',
    top_n=3,
)
g.selector.pipeline = FeaturePipeline(
    features=['rsi', 'momentum', 'price_ma_ratio'],
    custom_features={'price_ma_ratio': price_to_ma_ratio},
)

6. Model comparison

Different models have different characteristics:

from eqlib.ml import MLSelector

# Random Forest: general-purpose, robust
rf = MLSelector(model='random_forest', features=[...], top_n=3)

# Logistic Regression: simple, interpretable
lr = MLSelector(model='logistic_regression', features=[...], top_n=3)

# Gradient Boosting: high accuracy
gb = MLSelector(model='gradient_boosting', features=[...], top_n=3)

7. Best practices

7.1 Prevent overfitting

- Use simpler models (Random Forest is less prone to overfitting than XGBoost)
- Limit max_depth and n_estimators
- Retrain regularly, but not too frequently
- Validate with Walk-Forward Analysis
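
Walk-forward analysis trains on one window, tests on the period immediately after it, then rolls both windows forward. A minimal sketch of the index bookkeeping follows; `walk_forward_splits` is a hypothetical helper, not part of eqlib, and the window sizes are arbitrary.

```python
def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train_indices, test_indices) windows that roll forward in time.

    Each test window starts right after its training window, and the pair
    advances by `test_size` so test windows never overlap.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size

# 10 time-ordered samples, train on 4, test on the next 2
splits = list(walk_forward_splits(n_samples=10, train_size=4, test_size=2))
for train, test in splits:
    print(list(train), "->", list(test))
```

Every test index is later than every index in its training window, so each evaluation uses only data the model could not have seen during fitting.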

7.2 Feature selection

- Don't use too many features at once (5–10 recommended)
- Prefer features with a direct logical relationship to the target
- Avoid highly correlated features (e.g., RSI and CCI together)
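
One way to spot redundant features is to compute pairwise correlations across the universe and flag pairs above a threshold. The sketch below does this in plain Python. The helper names, the 0.8 threshold, and the feature values are illustrative assumptions, not library defaults or real market data.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Hypothetical feature values for five stocks
columns = {
    "rsi": [30.0, 45.0, 50.0, 62.0, 70.0],
    "cci": [-80.0, -10.0, 5.0, 60.0, 110.0],
    "volume_ratio": [1.2, 0.8, 1.1, 0.9, 1.0],
}
pairs = correlated_pairs(columns)
print(pairs)
```

When a pair is flagged, keeping only one of the two features usually loses little information and makes the learned importances easier to read.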

7.3 Training frequency

# Option 1: Retrain before every trading day (recommended)
def train_model(context, data=None):
    g.selector.train(context.universe, context)

# Register in initialize: before_trading_start(train_model)

# Option 2: Retrain roughly monthly (every 20 trading days)
g.train_counter = 0
def train_model(context, data=None):
    # Train on the first call so the model is never used untrained,
    # then again on every 20th trading day after that
    if g.train_counter % 20 == 0:
        g.selector.train(context.universe, context)
    g.train_counter += 1
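
A day counter drifts relative to the calendar. If exact month boundaries matter, one alternative is to compare dates. The helper below is a hypothetical sketch, independent of eqlib; how the current date is obtained inside `train_model` depends on the framework and is not shown.

```python
from datetime import date

def should_retrain(last_trained, today):
    """Return True if the model was never trained, or if `today` falls in a
    later calendar month than `last_trained`."""
    if last_trained is None:
        return True
    return (today.year, today.month) > (last_trained.year, last_trained.month)

# Simulate a few trading days and record when retraining would fire
trading_days = [date(2024, 1, 30), date(2024, 1, 31), date(2024, 2, 1), date(2024, 2, 2)]
last_trained = None
retrain_days = []
for day in trading_days:
    if should_retrain(last_trained, day):
        retrain_days.append(day)
        last_trained = day
print(retrain_days)
```

In this simulation, retraining fires on the first day seen and then on the first trading day of each new month.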

8. Complete examples

See:

- examples/21_ml_selector.py: basic ML selection
- examples/22_feature_pipeline.py: standalone feature computation
- examples/23_model_comparison.py: model comparison
- examples/24_custom_features.py: custom features