The best way to learn about building models on our data is through our tutorials:

Benchmark Models

Numerai Benchmark Models are a set of standard models built by the Numerai team. Their predictions are released every round so that anyone can submit them, and stake on them, if they want. These models are an easy way to compare your model against the current state of the art.

The list of models and their recent performance is here:


The validation and live predictions are available through the API.

from numerapi import NumerAPI
napi = NumerAPI()
napi.download_dataset("v4.3/train_benchmark_models.parquet", "train_benchmark_models.parquet")
napi.download_dataset("v4.3/validation_benchmark_models.parquet", "validation_benchmark_models.parquet")
napi.download_dataset("v4.3/live_benchmark_models.parquet", "live_benchmark_models.parquet")

How they're made

Walk Forward Cross Validation

All predictions are made using a Walk-Forward framework. This means all predictions are made using models which were trained only on data which was available prior to the date of the prediction being made.

Specifically, the data is split up into chunks of 156 eras. Then for each chunk of eras, the predictions are given by a model which is trained up to first_era_of_chunk - purge_eras. The number of purge_eras is always 8 for 20D targets, and 16 for 60D targets.

So eras 157 to 313 are predicted using a model trained up to era 148, eras 314 to 470 are predicted using a model trained up to era 306, and so on.
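The chunking rule can be sketched in a few lines. This is a minimal illustration, not Numerai's actual pipeline: the function and field names are hypothetical, eras are assumed to be consecutive integers starting at 1, and "trained up to" is read as an inclusive last training era equal to first_era_of_chunk - purge_eras. Under a different inclusive/exclusive convention the boundaries can shift by an era relative to the worked example above.

```python
from typing import List, NamedTuple


class Split(NamedTuple):
    train_end: int      # last era the model is allowed to train on (assumed inclusive)
    predict_start: int  # first era of the predicted chunk
    predict_end: int    # last era of the predicted chunk


def walk_forward_splits(
    last_era: int, chunk_size: int = 156, purge_eras: int = 8
) -> List[Split]:
    """Enumerate walk-forward chunks; the first chunk_size eras are training-only."""
    splits = []
    for start in range(chunk_size + 1, last_era + 1, chunk_size):
        end = min(start + chunk_size - 1, last_era)
        splits.append(Split(start - purge_eras, start, end))
    return splits


# 20D targets purge 8 eras, 60D targets purge 16 eras
splits_20d = walk_forward_splits(last_era=500, purge_eras=8)
splits_60d = walk_forward_splits(last_era=500, purge_eras=16)
```

Every split keeps a gap of purge_eras between the last training era and the first predicted era, which is what prevents a target that is not yet resolved from leaking into training.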

Standard Large LGBM params

All of the models use the following LGBM parameters:

standard_large_lgbm_params = {
  "n_estimators": 20000,
  "learning_rate": 0.001,
  "max_depth": 6,
  "num_leaves": 2**6,
  "colsample_bytree": 0.1,
}

We've found that having more trees can be helpful, and we've also found that fewer trees with more depth can achieve similar results with lower compute requirements.

You can read more about our hyper-parameter research in this forum post.


All of the ensembles use the following steps:

  1. gaussianize each of the predictions on a per-era basis

  2. standardize to standard deviation 1

  3. dot-product the predictions with a weights vector representing the desired weight on each model

  4. gaussianize the resulting vector

  5. (if applicable) neutralize the vector

Steps 1 through 4 look something like this:

import pandas as pd
from scipy import stats

def rank_gauss_pow1(s: pd.Series) -> pd.Series:
  # rank-normalize (rank_keep_ties_keep_na is a helper that is not shown here)
  s_rank = rank_keep_ties_keep_na(s)

  # gaussianize
  s_rank_norm = pd.Series(stats.norm.ppf(s_rank), index=s_rank.index)

  # standardize to 1 std
  result_series = s_rank_norm / s_rank_norm.std()

  return result_series

ensemble_cols = ["V4_LGBM_NOMI20", "V42_RAIN_ENSEMBLE"]
weight_vector = [0.1, 0.9]
for col in ensemble_cols:
  if "era" in X.columns:
      # gaussianize within each era
      X[col] = X.groupby("era", group_keys=False)[col].transform(rank_gauss_pow1)
  else:
      # check X contains only a single era
      assert 1800 < X.shape[0] < 6000
      X[col] = rank_gauss_pow1(X[col])
# weighted sum of the normalized predictions
ensemble = X[ensemble_cols].dot(weight_vector)
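For experimenting with the four steps end to end, here is a self-contained sketch. It is an approximation under stated assumptions rather than the exact benchmark code: `gaussianize` stands in for `rank_gauss_pow1` and uses average ranks scaled into (0, 1) as a hypothetical replacement for the `rank_keep_ties_keep_na` helper, and the `ensemble` function, the column names and the random data are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy import stats


def gaussianize(s: pd.Series) -> pd.Series:
    """Rank into (0, 1), map through the normal inverse CDF, scale to std 1."""
    # Assumed ranking rule: ties share their average rank, NaNs stay NaN.
    ranks = (s.rank(method="average") - 0.5) / s.count()
    gauss = pd.Series(stats.norm.ppf(ranks), index=s.index)
    return gauss / gauss.std()


def ensemble(preds: pd.DataFrame, cols, weights, era_col: str = "era") -> pd.Series:
    # Steps 1 and 2: gaussianize and standardize each model within each era
    normed = pd.DataFrame(
        {c: preds.groupby(era_col)[c].transform(gaussianize) for c in cols}
    )
    # Step 3: weighted sum of the normalized predictions
    blended = normed.dot(np.asarray(weights))
    # Step 4: gaussianize the blended vector, again within each era
    return blended.groupby(preds[era_col]).transform(gaussianize)


# Illustrative data: two eras of 100 rows and two made-up model columns
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "era": np.repeat([1, 2], 100),
        "model_a": rng.normal(size=200),
        "model_b": rng.normal(size=200),
    }
)
blended = ensemble(demo, ["model_a", "model_b"], [0.1, 0.9])
```

Normalizing each model before the dot product keeps the weights meaningful: without it, a model with a wider prediction range would dominate the blend regardless of its nominal weight.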


A couple of the models have some neutralization involved. This is basically doing a regression to find out your predictions' exposures to each feature, and then subtracting those exposures from your predictions vector such that the result is a vector which is orthogonal to all of those features.

Here's the code to neutralize some set of vectors (columns) by some list of features (neutralizers):

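import numpy as np
import pandas as pd
from scipy import stats

def neutralize(
  df, columns, neutralizers=None, proportion=1.0, era_col="era"
) -> pd.DataFrame:
  if neutralizers is None:
      neutralizers = []
  unique_eras = df[era_col].unique()
  computed = []
  for u in unique_eras:
      df_era = df[df[era_col] == u]
      scores = df_era[columns].values
      # rank-gaussianize each prediction column within the era
      scores2 = []
      for x in scores.T:
          x = pd.Series(x)
          x = (x.rank(method="first") - 0.5) / len(x.dropna())
          x = stats.norm.ppf(x)
          scores2.append(x)
      scores = np.array(scores2).T
      exposures = df_era[neutralizers].values

      # subtract the least-squares projection of the scores onto the neutralizers
      scores -= proportion * exposures.dot(
          np.linalg.pinv(exposures.astype(np.float32), rcond=1e-6).dot(
              scores.astype(np.float32)
          )
      )

      # rescale each column to unit standard deviation
      scores /= pd.DataFrame(scores).std(ddof=0, axis=0, skipna=True).values

      computed.append(scores)

  return pd.DataFrame(np.concatenate(computed), columns=columns, index=df.index)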
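The orthogonality claim is easy to check numerically on a single era. The sketch below strips neutralization down to its core projection step using NumPy only; `neutralize_vector` and the random data are hypothetical and exist purely for illustration.

```python
import numpy as np


def neutralize_vector(preds: np.ndarray, features: np.ndarray, proportion: float = 1.0) -> np.ndarray:
    # Least-squares exposures of the predictions to each feature
    exposures = np.linalg.pinv(features).dot(preds)
    # Remove the chosen proportion of the feature-explained component
    return preds - proportion * features.dot(exposures)


rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
preds = features @ rng.normal(size=5) + rng.normal(size=200)

fully_neutral = neutralize_vector(preds, features, proportion=1.0)
half_neutral = neutralize_vector(preds, features, proportion=0.5)

# Dot product of every feature with the fully neutralized predictions
max_exposure = np.abs(features.T @ fully_neutral).max()
```

With proportion=1.0 the residual has numerically zero dot product with every feature, while proportion=0.5 leaves exactly half of the original dot products in place.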

What are they?

The naming formula for many benchmarks is as follows:


There are many models that have some combination of a data version (V2, V3, V4, V41, V42, V43) and a target (Nomi20, Sam60). These are models trained in the standard walk-forward way, with standard LGBM parameters, using the specified data version and target. That's all!

There are also unique models we created that don't have that naming scheme:


This is an ensemble of 22 different standard models. It uses the 11 best targets that we had made as of the release of the Rain dataset, with an emphasis on Cyrus, the scoring target, and twice as much weight on 20D targets as on 60D targets. The target weights are:

Cyrus20: 33%, Ralph20: 3.3%, Jeremy20: 3.3%, Waldo20: 3.3%, Tyler20: 3.3%, Victor20: 3.3%

Cyrus60: 17%, Ralph60: 1.7%, Jeremy60: 1.7%, Waldo60: 1.7%, Tyler60: 1.7%, Victor60: 1.7%

Alpha20: 3.3%, Bravo20: 3.3%, Charlie20: 3.3%, Delta20: 3.3%, Echo20: 3.3%

Alpha60: 1.7%, Bravo60: 1.7%, Charlie60: 1.7%, Delta60: 1.7%, Echo60: 1.7%
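The percentages above can be reproduced from a simple rule, sketched below. The rule itself is an assumption that happens to match the rounded figures: two thirds of the total weight goes to the 20D targets and one third to the 60D targets, and within each horizon Cyrus receives half of the weight while the other ten targets share the rest equally.

```python
other_targets = [
    "Ralph", "Jeremy", "Waldo", "Tyler", "Victor",
    "Alpha", "Bravo", "Charlie", "Delta", "Echo",
]

rain_weights = {}
# Assumed split: 20D targets get twice the total weight of 60D targets
for horizon, horizon_share in (("20", 2 / 3), ("60", 1 / 3)):
    # Assumed emphasis: half of each horizon's weight goes to Cyrus
    rain_weights[f"Cyrus{horizon}"] = horizon_share * 0.5
    for name in other_targets:
        rain_weights[f"{name}{horizon}"] = horizon_share * 0.5 / len(other_targets)

total_weight = sum(rain_weights.values())
```

The dictionary has 22 entries, one per standard model in the ensemble, and its values can serve as the weights vector in the ensembling steps described earlier.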


This is the same as V42_RAIN_ENSEMBLE, but with no weight on the 60D versions.

It is also neutral to the serenity set of features.


This is a simple 50/50 blend of V{42/43}_LGBM_TEAGER20 and V{42/43}_LGBM_CYRUS20.


This is a standard model using V42 data and Cyrus20 target.


This is what was once the Sunshine Example Model. It uses standard models built on 6 different targets: Nomi20, Jerome60, Ralph20, Victor20, Waldo20, Tyler20.

These models are equal weight in the ensemble and the final predictions are neutralized at a proportion of 50% to the V4.1 "medium" feature set.


This is a standard model trained on the v3 equivalent feature set and the Ralph20 target.


This is a standard model trained on the v2 equivalent feature set and the Nomi20 target.

The following models are on the Benchmark Models page, but their predictions aren't present in the predictions files because they are either obsolete or reproducible:

V4_EXAMPLE_PREDS - A standard model on V4 data and the Nomi20 target that is neutral to the 50 "riskiest features", meaning the features with the biggest change in mean correlation with Nomi between the first and second halves of the training data.

INTEGRATION_TEST - Submits our favorite model at the time. For years, this was the V2 example predictions, but with the advent of the v3 and v4 datasets it has evolved over time. At the time of writing, it submits RAIN_ENSEMBLE predictions.

NB_HELLO_NUMERAI - Submits the model created by the default Hello Numerai tutorial notebook.

NB_FEATURE_NEUTRAL - Submits the model created by the feature neutralization tutorial notebook.

NB_TARGET_ENSEMBLE - Submits the model created by the target ensemble tutorial notebook.

NB_EXAMPLE_MODEL - Submits the model created by the barebones example_model notebook.
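The "riskiest features" selection mentioned for V4_EXAMPLE_PREDS can be sketched as follows. This is one plausible reading rather than the confirmed procedure: it assumes the two halves are formed by splitting the training eras in order, that correlations are computed per era and then averaged, and that "change" means the absolute difference of the two means. The function name, column names and synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def riskiest_features(df, feature_cols, target_col, era_col="era", n=50):
    # Correlation of every feature with the target, one row per era
    per_era = pd.DataFrame(
        {era: g[feature_cols].corrwith(g[target_col]) for era, g in df.groupby(era_col)}
    ).T.sort_index()
    half = len(per_era) // 2
    first_mean = per_era.iloc[:half].mean()
    second_mean = per_era.iloc[half:].mean()
    # Features whose mean correlation moved the most between the two halves
    change = (second_mean - first_mean).abs()
    return change.sort_values(ascending=False).head(n).index.tolist()


# Synthetic check: one feature flips its relationship to the target halfway through
rng = np.random.default_rng(0)
frames = []
for era in range(1, 5):
    target = rng.normal(size=200)
    sign = 1.0 if era <= 2 else -1.0
    frames.append(
        pd.DataFrame(
            {
                "era": era,
                "target": target,
                "feature_stable": target + rng.normal(size=200),
                "feature_flip": sign * target + rng.normal(size=200),
            }
        )
    )
demo = pd.concat(frames, ignore_index=True)
top = riskiest_features(demo, ["feature_stable", "feature_flip"], "target", n=1)
```

The returned list could then be passed as the neutralizers argument of a neutralization routine such as the one shown earlier.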
