Office Hours with Arbitrage

This series was hosted by long-time Numerai participant Jon R. Taylor, PhD. With years of experience participating in Numerai's tournaments, Jon decided to share what he's learned with the community and create a video space where other participants can learn, share, and get to know one another.

These videos include interviews with Numerai participants, top tips and tricks for participating in Numerai, presentations, and more. Jon R. Taylor also makes a point of highlighting Numerai tournament updates, new features, and releases as close to launch as possible, and brings on members of the Numerai team when available to put community questions to them. There are also Special Features, which tend to home in on specific technical matters such as Feature Neutralization.

While the full playlist is on our official YouTube channel, Numerai neither hosts nor directly moderates these conversations. We include these videos here and on YouTube because we believe they help Numerai community members share knowledge.

Below is a synthesis of FAQs from discussions throughout these videos.

About Numerai and the hedge fund

What kind of hedge fund is Numerai?

A global equities fund driven by the machine learning models of its data science community: long/short and neutral to market, country, sector, currency, and common factors — "just trying to find the edges other people can't find." Many conventional funds share crowded exposures (value, momentum) and get hurt together when everyone deleverages at once; Numerai neutralizes to as many factors as possible so it can do well when other funds do badly. Equities suit machine learning because thousands of stocks generate enough data points. The fund began trading real capital in December 2015.

Is my model's performance (or the burn rate) correlated with the VIX or market direction?

No. The targets are engineered to be neutral to overall market moves. Numerai is not a derivative of the VIX, and timing your stake around market volatility is a waste of time. When a model burns, it is because its predictions of stock-specific returns were wrong, not because the market fell.

Can I estimate how the hedge fund is doing from my scores or the leaderboard?

No. Submissions feed the Meta Model, which then goes through a large optimization step — neutralization to factors participants can't see, plus substantial leverage — so neither average model performance nor any single model's scores extrapolate to fund performance.

Why is a hedge fund's earning capacity limited?

Trading on an edge closes the very inefficiency that made it profitable, so every fund struggles to find a strategy that scales. This is also why the data has so many features: if other funds crowd into any one signal its value decays — and models overloaded on that signal decay with it.

How can Numerai afford to keep paying data scientists?

The tournament is a positive feedback loop, not a reserve being drained: better predictions improve the Meta Model, a better Meta Model attracts assets under management, and fund revenue funds payouts — while burns flow back the other way.

Is Numerai competing against its own data scientists? Do employees have an edge?

No — if the team had data that would improve models, releasing it to participants would be the first move, because Numerai only makes money when the Meta Model is accurate. Team members having models is an alignment of interests: payouts are purely score-based rather than rank-based, so their stakes take nothing from other users, and the team's own models are openly published as Benchmark Models.

Could a big financial firm just buy Numerai and shut it down?

Richard Craib has said no — he holds a controlling stake and board control specifically to prevent a hostile takeover, and because the core protocol code is open source, even a shutdown couldn't kill the idea.

If individual models look like random walks, what value do they provide?

Individual performance should look close to a random walk in equities — the value comes from averaging. Many individually noisy but mutually uncorrelated models are each right at different times; the Meta Model built on top filters the noise and carries an edge even when no individual model has one.
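The averaging argument can be illustrated with a small simulation. This is only a sketch on synthetic data: the signal strength, the number of models, and the equal weighting are made-up assumptions, not properties of the real tournament or Meta Model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_models = 20_000, 50
signal_strength = 0.02  # hypothetical: each model has only a tiny edge

target = rng.standard_normal(n_stocks)
# Each model sees the same faint signal plus its own independent noise.
noise = rng.standard_normal((n_models, n_stocks))
models = signal_strength * target + noise


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


individual = np.array([corr(m, target) for m in models])
ensemble = corr(models.mean(axis=0), target)

print(f"mean individual correlation: {individual.mean():.3f}")
print(f"best individual correlation: {individual.max():.3f}")
print(f"equal-weight ensemble:       {ensemble:.3f}")
```

Because the noise terms are independent, averaging shrinks them while the shared signal survives, so the ensemble scores well above any single model in this toy setup.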

The data

Is the data encrypted? Homomorphically?

No — it's obfuscated, not encrypted ("encryption implies there's a key"). Numerai actually tried homomorphic encryption early on, but it turned 1 MB of data into 16 GB of high-dimensional polynomials. The normalization and cleaning phases already remove much of the original structure; the data remains fully modelable, which is the point.

What does it mean that the target is "residual"?

The residual is the part of a stock's return not explained by broader drivers — if Apple is up 10% while US tech is up 8%, the residual is 2%. Sector, country, and factor returns explain most raw returns, so Numerai strips them out of the target; the fund also can't sell factor exposure investors could buy cheaply themselves. Modeling residual returns generalizes across sectors, countries, and regimes instead of becoming a disguised sector bet. See Data for the current target descriptions.
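As a rough illustration of the idea, the sketch below removes the sector component from synthetic returns with an ordinary least-squares fit. The data, the three sectors, and the helper name `residualize` are all hypothetical; Numerai's actual target construction is not public and is certainly more involved.

```python
import numpy as np


def residualize(returns, exposures):
    """Return the part of `returns` not explained by a linear fit on `exposures`."""
    X = np.column_stack([np.ones(len(returns)), exposures])
    coef, *_ = np.linalg.lstsq(X, returns, rcond=None)
    return returns - X @ coef


# One-stock arithmetic from the text: stock up 10%, its sector up 8%.
# This assumes the stock moves one-for-one with its sector.
simple_residual = 0.10 - 0.08  # about 0.02

# Synthetic cross-section: 300 stocks in 3 made-up sectors.
rng = np.random.default_rng(0)
sector = rng.integers(0, 3, size=300)
sector_return = np.array([0.08, -0.03, 0.01])[sector]
stock_specific = rng.normal(0.0, 0.02, size=300)
raw = sector_return + stock_specific

# Dummy-encode sectors, dropping one column because the intercept covers it.
sector_dummies = np.eye(3)[sector][:, 1:]
resid = residualize(raw, sector_dummies)

print(f"std of raw returns:      {raw.std():.4f}")
print(f"std of residual returns: {resid.std():.4f}")
```

After the fit, the average residual inside each sector is zero, so a model trained on the residual cannot score simply by picking the sector that happened to do well.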

Why won't Numerai reveal country/sector exposure, test-set metrics, or model improvement suggestions?

Three variants of the same anti-leak principle. Exposing categories like country could create degenerate behavior, and the targets are already neutral to them. Metrics computed on the hidden test set would leak information about it — historically, fast fine-grained feedback was exploited to reverse-engineer features. And official "suggestions" would bias everyone toward the same overfit models, defeating the purpose of a diverse crowd. Validation-based diagnostics are the intended feedback channel — meaningful only if you didn't train on validation.

Can I find out which historical periods the eras cover, or simulate a crash with synthetic eras?

No — Numerai never discloses the dates the data covers, and because the data is obfuscated, only Numerai could construct meaningful synthetic market regimes. Aim for a model that is good on average across all eras; don't over-weight crisis eras or you bias the model toward those events.

Are the live targets obfuscated too?

Yes — live targets are calculated exactly like the training targets: related to real stock performance in a heavily abstracted, residualized form.

Why does Numerai change the target from time to time?

New targets are engineered to improve the fund's investment performance — for example, capturing the extremes of predictions better. Each target generation is produced by the same method with different distributions and residualization, which is why models usually transfer across generations with retraining.

Why is the target forced into fixed-proportion bins?

Partly to keep the signal stationary: free-floating bins (especially the tails) would need unlimited width and wouldn't behave consistently across eras, so the bins are forced to the same proportions every era.
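A minimal sketch of fixed-proportion binning within one era follows. The five bins and the 5/20/50/20/5 split are illustrative assumptions, and `bin_era` is a hypothetical helper, not Numerai's code.

```python
import numpy as np


def bin_era(values, proportions=(0.05, 0.20, 0.50, 0.20, 0.05)):
    """Map one era's raw values to evenly spaced labels with fixed bin sizes."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Position of each value in sorted order: 0 = smallest, n - 1 = largest.
    position = np.empty(n, dtype=int)
    position[np.argsort(values, kind="stable")] = np.arange(n)
    # Integer cut points, e.g. [5, 25, 75, 95] when n = 100.
    cuts = np.round(np.cumsum(proportions)[:-1] * n).astype(int)
    bin_index = np.searchsorted(cuts, position, side="right")
    labels = np.linspace(0.0, 1.0, len(proportions))
    return labels[bin_index]


rng = np.random.default_rng(1)
calm_era = rng.normal(0.0, 0.01, size=100)          # thin tails
wild_era = rng.standard_t(df=2, size=100) * 0.05    # fat tails

for name, era in [("calm", calm_era), ("wild", wild_era)]:
    labels, counts = np.unique(bin_era(era), return_counts=True)
    print(name, dict(zip(labels.tolist(), counts.tolist())))
```

Both eras end up with the same label distribution even though their raw values are distributed very differently, which is the stationarity property the answer describes.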

Building models

What do veteran participants say matters most?

The same three things come up in interview after interview. (1) Document everything. Keep a research journal of what you tried and why, or you'll spend weeks re-deriving your own reasoning. (2) Make small changes and be patient. Feedback takes months, so don't rewrite a model over a couple of weeks of live scores; if it looked good in research, leave it alone and build the next model instead. (3) Read the community. Forum posts and papers from other domains stir up ideas even when you never implement them directly.

How should I structure cross-validation on this data?

Treat eras, not rows, as the unit of observation; otherwise you effectively have many copies of the same security and will overfit. Check that your library actually splits by era: scikit-learn's TimeSeriesSplit, for example, accepts a groups argument in split() but ignores it. Because the current data has weekly eras with overlapping forward-looking targets, purged walk-forward CV matters (see Data).
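One way to split by era with a purge gap is sketched below. It assumes eras are sortable integers, and the gap of 4 eras is a placeholder: the right value depends on the target horizon described in Data. The function name is hypothetical.

```python
import numpy as np


def purged_walk_forward_splits(eras, n_splits=4, purge_eras=4):
    """Yield (train_idx, test_idx) row indices, split by era with a purge gap.

    Each test fold is a contiguous block of eras. Training uses only eras
    that end at least `purge_eras` eras before the test block starts.
    """
    eras = np.asarray(eras)
    unique_eras = np.unique(eras)  # np.unique returns sorted values
    folds = np.array_split(unique_eras, n_splits + 1)
    for k in range(1, n_splits + 1):
        test_eras = folds[k]
        first_test_pos = np.searchsorted(unique_eras, test_eras[0])
        train_eras = unique_eras[: max(first_test_pos - purge_eras, 0)]
        train_idx = np.flatnonzero(np.isin(eras, train_eras))
        test_idx = np.flatnonzero(np.isin(eras, test_eras))
        yield train_idx, test_idx


# Toy data: 50 eras with 10 rows each.
eras = np.repeat(np.arange(1, 51), 10)
for train_idx, test_idx in purged_walk_forward_splits(eras):
    print(
        f"train eras 1..{eras[train_idx].max()}, "
        f"test eras {eras[test_idx].min()}..{eras[test_idx].max()}"
    )
```

The gap keeps training rows whose target windows overlap the test period out of the fit; without it, the fold scores would be optimistic.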

How often can I check my model against the validation data?

Essentially once. Validating on the holdout is your hypothesis test — every additional peek degrades its validity. Do your research with cross-validation on training data, validate at the end, and stop. Bad validation results are informative (your model is weak); good ones tell you little, because validation covers only one slice of history.

Should I train on the validation data?

It's a real trade-off, and people do it: you gain training data, but you lose your only out-of-sample check, so your only performance read becomes live scores — which take months. Side-by-side comparisons of otherwise-identical models have found small gains and higher variance. If you do it, lock your hyperparameters first, and don't expect your diagnostics to generalize.

My in-sample scores look absurdly high. Am I overfit?

In-sample performance is nearly meaningless on this dataset — even standard example models look implausibly strong in-sample. The only measure of overfitting that matters is out-of-sample generalization. In general, be suspicious of any signal that looks too good.

How long before I can judge whether a model is actually good?

Months, not weeks. Rounds open daily and their scoring windows overlap heavily (~24 rounds resolving at any time — see Submissions), so consecutive rounds are essentially the same bet several times over: a few weeks of live scores tell you almost nothing, and adjacent rounds earning or burning together is expected, not a bug. Veterans consistently quote three to four months as a minimum.

Should I optimize mean correlation or something Sharpe-like?

Sharpe-like. A model that maximizes mean correlation can hide several terrible months a year; one that performs across all eras generalizes better, and live data may be dense with exactly the eras you chose to be bad at. Optimize the mean of per-era scores divided by their volatility, and prioritize limiting drawdowns — recovering from a hole costs more than the fall.
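The difference between mean score and a Sharpe-like score can be seen on two made-up series of per-era scores. The numbers below are invented for illustration, and the helper names are hypothetical.

```python
import numpy as np


def era_sharpe(per_era_scores):
    """Mean of per-era scores divided by their sample standard deviation."""
    scores = np.asarray(per_era_scores, dtype=float)
    return float(scores.mean() / scores.std(ddof=1))


def max_drawdown(per_era_scores):
    """Largest peak-to-trough fall of the cumulative score, starting from 0."""
    cumulative = np.concatenate([[0.0], np.cumsum(per_era_scores)])
    running_peak = np.maximum.accumulate(cumulative)
    return float((running_peak - cumulative).max())


steady = [0.02, 0.01, 0.03, 0.02, 0.01, 0.03, 0.02, 0.02]
spiky = [0.08, 0.07, -0.06, 0.09, -0.05, 0.08, 0.07, -0.04]

for name, scores in [("steady", steady), ("spiky", spiky)]:
    print(
        f"{name}: mean={np.mean(scores):.4f} "
        f"sharpe={era_sharpe(scores):.2f} "
        f"max_drawdown={max_drawdown(scores):.2f}"
    )
```

The spiky model wins on mean score but loses clearly on the Sharpe-like ratio and on drawdown, which is the pattern the answer warns about.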

Why does feature exposure matter, and what should I watch?

Features can proxy real-world factors (some have a value tilt), so a model concentrated on a few features burns when that factor goes out of favor — and automated feature selection is regime-fragile: the feature group that "never fails" in training is exactly the one that fails live. Low feature exposure correlates with robust long-run performance. Watch your maximum single-feature exposure, not just the average — one large hidden risk hides behind many small ones.
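A simple way to look at this is to compute the correlation between your predictions and every feature within an era, then compare the maximum to the average. The sketch below uses synthetic features and a deliberately concentrated model; `feature_exposures` is a hypothetical helper and other definitions of exposure exist.

```python
import numpy as np


def feature_exposures(features, predictions):
    """Pearson correlation of predictions with each feature column (one era)."""
    f = features - features.mean(axis=0)
    p = predictions - predictions.mean()
    return (f.T @ p) / (np.linalg.norm(f, axis=0) * np.linalg.norm(p))


rng = np.random.default_rng(0)
features = rng.standard_normal((500, 20))  # 500 stocks, 20 synthetic features
# A model that leans almost entirely on feature 3.
predictions = features[:, 3] + 0.5 * rng.standard_normal(500)

exposures = feature_exposures(features, predictions)
print(f"mean |exposure|: {np.abs(exposures).mean():.3f}")
print(f"max  |exposure|: {np.abs(exposures).max():.3f}")
print(f"most exposed feature index: {int(np.abs(exposures).argmax())}")
```

The average looks modest because nineteen small exposures dilute it, while the maximum exposes the single large bet.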

How does feature neutralization work, and how much should I apply?

Per era, fit a linear regression from the features to your predictions and subtract some proportion of that fitted component — a projection that keeps the part of your signal orthogonal to the features. It's a dial: more neutrality generally improves consistency, with diminishing returns and lower raw scores. The gap between your raw and feature-neutral score tells you how much of your model is linear feature risk versus actual alpha. See Models for the neutralization code; feature-neutral correlation (FNC) is reported as an informational score.
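The projection described above can be sketched in a few lines of NumPy for a single era. This is an illustrative version on synthetic data, with an intercept column added as an assumption; it is not the reference implementation linked from Models.

```python
import numpy as np


def neutralize(predictions, features, proportion=1.0):
    """Subtract `proportion` of the linear-in-features part of the predictions."""
    X = np.column_stack([features, np.ones(len(features))])  # add an intercept
    coef, *_ = np.linalg.lstsq(X, predictions, rcond=None)
    return predictions - proportion * (X @ coef)


def feature_exposures(features, predictions):
    """Pearson correlation of predictions with each feature column."""
    f = features - features.mean(axis=0)
    p = predictions - predictions.mean()
    return (f.T @ p) / (np.linalg.norm(f, axis=0) * np.linalg.norm(p))


rng = np.random.default_rng(0)
features = rng.standard_normal((500, 20))
predictions = features[:, 3] + 0.5 * rng.standard_normal(500)

half = neutralize(predictions, features, proportion=0.5)
full = neutralize(predictions, features, proportion=1.0)

for name, p in [("raw", predictions), ("50%", half), ("100%", full)]:
    print(f"{name:>4}: max |exposure| = {np.abs(feature_exposures(features, p)).max():.3f}")
```

In a real workflow the function would be applied separately to each era. The `proportion` argument is the dial: 1.0 removes all linear feature exposure, and values in between remove part of it.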

Should I treat this as a regression problem?

The scoring metric is a ranking metric — prediction scale doesn't matter, only relative order. Optimizing MSE incentivizes memorizing the target curve; learning-to-rank fits the actual task (you can swap XGBoost's regressor for its ranker without changing hyperparameters), and pairwise approaches traditionally beat pointwise in ranking problems.
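The scale-invariance point can be checked directly. The sketch below uses plain Spearman-style rank correlation as a stand-in for a ranking metric; the tournament's actual scoring function may differ in detail, and the data is synthetic.

```python
import numpy as np


def rank_correlation(pred, target):
    """Pearson correlation of the ranks (assumes no tied values)."""
    pred_rank = np.argsort(np.argsort(pred))
    target_rank = np.argsort(np.argsort(target))
    return float(np.corrcoef(pred_rank, target_rank)[0, 1])


rng = np.random.default_rng(0)
target = rng.standard_normal(1_000)
raw = 0.1 * target + rng.standard_normal(1_000)
pred = 1.0 / (1.0 + np.exp(-raw))  # squash into (0, 1)

base = rank_correlation(pred, target)
rescaled = rank_correlation(1000.0 * pred + 7.0, target)  # change of scale
warped = rank_correlation(pred**3, target)                # monotone warp
flipped = rank_correlation(1.0 - pred, target)            # reversed order

print(f"base={base:.4f} rescaled={rescaled:.4f} warped={warped:.4f} flipped={flipped:.4f}")
```

Any order-preserving transform leaves the score unchanged, while reversing the order negates it, so effort spent matching the target's scale earns nothing under a metric like this.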

Why don't CNNs or LSTMs work well here?

Every architecture has an inductive bias. Convolution assumes invariant structure along the dimension you convolve over — but the features are an unordered set (what dimension would you convolve over?). Sequence models assume you can track entities across time, which the era structure doesn't give you. Plain networks with strong regularization (L1, dropout, an untouched holdout) are the workable neural-net path — and tree ensembles remain the strongest out-of-the-box choice, which is why the Benchmark Models are still gradient-boosted trees.

Do I need heavy feature engineering or a giant model?

Probably not. The data is already cleaned and regularized, and with a low signal-to-noise ratio, aggressive transformations risk destroying signal — feature sampling (a small fraction of features per tree) does more to control overfitting. Simple, stable models have repeatedly matched or beaten intricate ones; complexity mostly adds overfitting surface.

Which single metric should I optimize?

None. Chasing every metric fails on all of them, and there is no one metric to rule them all. Focus on one primary objective, sanity-check that the others stay reasonable across a few random seeds, and treat a persistently bad metric as a symptom to investigate. Multi-objective searches — score, feature exposure, distance from the benchmark — surface more interesting models than maximizing any single number.

Can I detect the current market regime and adapt my model to it?

Strongly cautioned against. Eras do cluster by model performance in hindsight, but post-hoc clusters only help if you can predict from the features which cluster the live era belongs to — and nobody has demonstrated that. The regime may also change before you can capitalize, and the attempt makes overfitting much more likely. Difficult eras are best used as training data and out-of-sample tests, spread across the whole performance spectrum.

What baseline should I try to beat, and what's a "good" validation score?

Use the Benchmark Models — their predictions are downloadable every round — and study why they work; that's the fastest education in the dataset. Don't anchor on absolute validation numbers quoted in old threads: they were computed on long-gone dataset versions. Compute your own cross-validation metrics and compare against the published benchmarks.

How should I use multiple model slots?

Run genuinely different ideas in parallel, give each months of live history, and periodically run a manual evolutionary process: kill the worst performer, try a new idea in its slot, and keep your largest stakes on your longest-proven models. There are no sacred cows. Averaging individually-predictive but mutually-different models is one of the most robust improvements available.

The Meta Model and MMC

How does Numerai combine everyone's predictions? Something fancy like stacking?

Nothing complicated — it's stake-weighted. Staking is the weighting mechanism, which is why payouts and burns continuously improve the Meta Model.
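In its simplest form, stake weighting is just a weighted average. The sketch below shows that form only; any preprocessing Numerai applies before or after the average is not described here, and the predictions and stakes are made up.

```python
import numpy as np


def stake_weighted_meta_model(predictions, stakes):
    """Average model predictions, weighting each model by its share of total stake.

    predictions: array of shape (n_models, n_stocks)
    stakes:      array of shape (n_models,)
    """
    predictions = np.asarray(predictions, dtype=float)
    stakes = np.asarray(stakes, dtype=float)
    if stakes.sum() <= 0:
        raise ValueError("total stake must be positive")
    weights = stakes / stakes.sum()
    return weights @ predictions


predictions = [
    [0.9, 0.1, 0.5],  # model with 30 staked
    [0.2, 0.8, 0.5],  # model with 10 staked
    [0.0, 1.0, 0.3],  # unstaked model: no influence
]
stakes = [30.0, 10.0, 0.0]

meta = stake_weighted_meta_model(predictions, stakes)
print(meta)
```

Because weights follow stake, a payout raises a model's influence and a burn lowers it, which is the feedback mechanism the answer refers to.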

Explain MMC like I'm five.

A basketball plus-minus score: instead of measuring your individual stats, it measures whether the team (the Meta Model) performs better with you on the court than on the bench. Or: a team of only pitchers is a bad baseball team — the Meta Model wins by combining varied skill sets, and MMC pays you for bringing one.

My model is good but nearly identical to the Meta Model — why is my MMC ~zero?

To earn MMC you have to pull the Meta Model in a better direction. A clone adds nothing (after orthogonalization, what remains is mostly noise), and uniqueness adds nothing unless you're also good. The doctrine is "different and good"; there are infinitely many ways to be different and bad. Flipping a good model (submitting 1−p) doesn't work either: it's just a bad model.

Why is my MMC so volatile and confusing?

Because it depends on what every other participant submits, not just on you. A sudden MMC jump usually means you caught a signal in a regime the crowd missed. Judge MMC over many rounds: over long periods it tracks usefulness to the Meta Model well, but it's very hard to read in any single week.

Is there a trade-off between uniqueness and consistency?

Yes, by design: the more unique your model, the higher its potential contribution but the harder it is to score consistently, because you're deviating from the consensus of what a good model looks like. It's difficult but possible to be both — veteran models have run at ~30% correlation to the Meta Model with strong scores on both axes. Payouts reward balancing the two.

Should a newcomer chase MMC or correlation?

Start with correlation: it's intuitive and independent of the rest of the field, whereas contribution scores depend on everyone else. Collect live history on a solid model first, then deliberately build something different. Since payouts combine CORR and MMC with multipliers (see Staking), this is a portfolio balance, not an either/or choice.

Can I estimate my contribution before going live?

Yes — measure what your model retains after neutralizing against the published Benchmark Model predictions: if you're still good residual to them, expect positive contribution. Your correlation with the Meta Model is also reported as an informational score once you're live.
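A rough version of that check is sketched below on synthetic data: remove the part of your predictions that a benchmark already explains, then score what is left against the target. The two hidden signals, the single benchmark, and the helper `residual_score` are assumptions for illustration; this is not the official MMC calculation.

```python
import numpy as np


def residual_score(preds, benchmark, target):
    """Correlation with the target of the part of `preds` orthogonal to `benchmark`."""
    X = np.column_stack([benchmark, np.ones(len(benchmark))])
    coef, *_ = np.linalg.lstsq(X, preds, rcond=None)
    residual = preds - X @ coef
    return float(np.corrcoef(residual, target)[0, 1])


rng = np.random.default_rng(0)
n = 50_000
signal_a = rng.standard_normal(n)  # the signal the benchmark already captures
signal_b = rng.standard_normal(n)  # a signal the benchmark misses
target = 0.1 * signal_a + 0.1 * signal_b + rng.standard_normal(n)

benchmark = signal_a
clone = signal_a + 0.1 * rng.standard_normal(n)  # near copy of the benchmark
different = 0.5 * signal_a + signal_b            # overlaps, but adds signal_b

print(f"clone     residual score: {residual_score(clone, benchmark, target):+.3f}")
print(f"different residual score: {residual_score(different, benchmark, target):+.3f}")
```

The clone keeps almost nothing after neutralization, while the model carrying the second signal stays predictive, which is the "different and good" condition in miniature.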

If the Meta Model became perfect, wouldn't MMC go to zero? Is that a flaw?

A perfect Meta Model doesn't exist and never will — regime changes, currency risk, fraud, and rule changes mean the market is never solved, so there's always signal left to add. If it ever were unbeatable, MMC going to zero would be the system working, not failing.

Staking and payouts

Why doesn't staking just earn interest? Why must the tournament stay hard?

A founding design principle, in Richard Craib's words: "If we ever make the Numerai tournament easier than the stock market, then there is an attack." If any stake earned a return regardless of skill, being long and short the same model (p / 1−p) would print risk-free money. Payouts must be strictly performance-driven, and per-NMR rather than per-human, because sybil-resistant per-person schemes don't work — the system can't care whether a stake belongs to a human, a dog, or an AI. The result is the simplest symmetric system: do well, get paid; do badly, get burned. See Staking.

Is there insurance against burns during volatile periods?

No. There's no insurance in this business — when you burn, Numerai burns too; it's part of the game. If you want to burn less, stake less or build a less volatile model. Burns aren't caused by market volatility per se: bad performance means the predictions were bad, and that's exactly what you're staking on.

How should I manage staking risk?

Design a system and follow it exactly — deviating from your own rules is gambling. Recurring veteran patterns: pick a fixed NMR amount you're comfortable risking and take profit above it; treat it like poker bankroll management (risk a fixed fraction, take risk-on steps only above your high-water mark, hard stops); ease in by dollar-cost averaging rather than staking a lump sum. Remember the double exposure: your stake is denominated in NMR, so your fiat risk moves with both your model and NMR's price. The take-profit versus reinvest choice is built into Atomic Blockchain Staking's Fixed and Compound modes.

Can I (or outsiders) stake on someone else's model?

No. The purpose of staking is to prove you believe in your own model. A stake placed by a third party on the basis of leaderboard data conveys almost no information, and having the token represent fund cash flow would create legal risk. Numerai's framing is that it's buying signals, not selling investment products. The alternatives are the free Benchmark Models or buying predictions on the community marketplace NumerBay (at your own risk), but you stake your own models.

Why can't I stake on my account as a whole?

If you want a blend of your models, blend the predictions yourself and submit that as one model, then stake on it. Numerai doesn't want to be responsible for blending user submissions.

What happens if too much NMR is staked overall?

Stakes above a global threshold are pro-rated: everyone's effective stake (for both payouts and burns) scales down proportionally via the payout factor — see Staking.
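The proportional scaling can be sketched as follows. The formula here (threshold divided by total stake, capped at 1) is an assumption chosen to match the description above, and the threshold and stake amounts are invented; the actual payout factor definition is in Staking.

```python
def payout_factor(total_staked, threshold):
    """Hypothetical pro-rating: 1.0 below the threshold, threshold/total above it."""
    if total_staked <= threshold:
        return 1.0
    return threshold / total_staked


# Made-up numbers for illustration only.
stakes = {"model_a": 600.0, "model_b": 300.0, "model_c": 100.0}
threshold = 500.0

factor = payout_factor(sum(stakes.values()), threshold)
effective = {name: stake * factor for name, stake in stakes.items()}

print(f"payout factor: {factor}")
for name, value in effective.items():
    print(f"{name}: effective stake {value}")
```

Under this assumption every stake is scaled by the same factor, so relative weights between models are unchanged and both payouts and burns shrink together.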

Can the payout rules be gamed?

History says don't bother: whenever participants "solved" the payout structure, Numerai changed the rules, and the rules explicitly reserve the right to void earnings for abuse. Exploit-style models hurt everyone because the whole system is a feedback loop, and the community actively bot-hunts. Multiple accounts are against the rules for the same reason. The durable strategy is a model that makes sense.

Should I keep submitting through periods when it doesn't seem to pay?

Yes — consistent submissions build the documented track record that your reputation and rank are computed from, and veterans agree the worst move is skipping rounds or thrashing your model during drawdowns.

Community and getting started

What should newcomers know before starting?

It's tough, and it's a long game; good things take time. A NASA JPL scientist who competed called Numerai a harder problem than his day job: "a ton of data with no signal," an extremely low signal-to-noise search where the main transferable skill is quantifying uncertainty. The community is the greatest resource: ask questions on Discord, and use the forum so important conversations don't get lost.

Should I automate my submissions?

Yes — with daily rounds it's near-mandatory. Manual submission makes the tournament feel like a burden, discourages running more models, and makes it easy to give up. Use Model Uploads (Numerai runs your model server-side) or numerai-cli for your own infrastructure.

Why does Numerai need thousands of us instead of hiring the ten best?

Groupthink: hires converge toward their colleagues over time, while a large diverse user base attacks the problem from genuinely different perspectives. Numerai could never hire enough data scientists to get the Meta Model just right — crowd diversity is the whole point.

Should I trust modeling advice from the community?

Be critical — it's a mildly adversarial environment where participants compete for the same rewards, and occasionally advice is careless or deliberately bad. The vast majority is well-intentioned, but validate every idea yourself before deploying money behind it — and never test in production with your stake.

Do I need to model the entire Signals universe?

No — the philosophy behind Signals is predictions for names you're genuinely confident in, not coverage for its own sake. A robust niche model for one specific market can be very strong. The live universe file published each round defines which tickers are accepted.

Cited resources

Office Hours videos

The full series lives on Numerai's YouTube channel; later community-run sessions are collected in the Numerai — Community Office Hours playlist.

Journal Articles:

Spence, M. (1973). Job market signaling. The Quarterly Journal of Economics, 87(3), 355-374. https://www.jstor.org/stable/1882010?seq=1
Kahneman, D. (1973). Attention and effort (Vol. 1063). Englewood Cliffs, NJ: Prentice-Hall.
Malkiel, B. G., & Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25(2), 383-417. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1540-6261.1970.tb00518.x
Grossman, S. J., & Stiglitz, J. E. (1980). On the impossibility of informationally efficient markets. The American Economic Review, 70(3), 393-408. https://www.jstor.org/stable/1805228?seq=1
Jensen, M. C., & Meckling, W. H. (1979). Theory of the firm: Managerial behavior, agency costs, and ownership structure. In Economics social institutions (pp. 163-231). Springer, Dordrecht. https://link.springer.com/chapter/10.1007/978-94-009-9257-3_8
Merton, R. C. (1987). A simple model of capital market equilibrium with incomplete information. The Journal of Finance, 42(3), 483-510. https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-6261.1987.tb04565.x
Fama, E. F., & French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33, 3-56. https://www.sciencedirect.com/science/article/abs/pii/0304405X93900235
Staff, WikiHow (2006). How to dial a rotary phone. WikiHow. https://www.wikihow.com/Dial-a-Rotary-Phone
Hirshleifer, D., & Shumway, T. (2003). Good day sunshine: Stock returns and the weather. The Journal of Finance, 58(3), 1009-1032. https://www.jstor.org/stable/3094570?seq=1

Resources for Python Users: Packages, modules, etc:

Anaconda - https://www.anaconda.com/products/individual
Scikit-Learn - https://scikit-learn.org/stable/index.html
Numpy - https://numpy.org/
Pandas - https://pandas.pydata.org/pandas-docs/stable/index.html
XGBoost - https://xgboost.readthedocs.io/en/latest/
Feather File Format - https://arrow.apache.org/docs/python/feather.html
Google Colab - https://colab.research.google.com/
Example Scripts - https://github.com/numerai/example-scripts
numerai-cli (successor to the retired Numerai Compute) - https://github.com/numerai/numerai-cli
NumerAPI - https://numerapi.readthedocs.io/en/stable/
MLJAR - https://mljar.com/automl/

Resources for R Users: Packages, misc:

R - https://cran.r-project.org/
Caret - https://cran.r-project.org/web/packages/caret/
Feather - https://cran.r-project.org/web/packages/feather/index.html
Tidyr - https://cran.r-project.org/web/packages/tidyr/
XGBoost - https://cran.r-project.org/web/packages/xgboost/
R-Numerai - https://cran.r-project.org/web/packages/Rnumerai/index.html
