Building a long-only equity rotation system that actually beats the index, then proving it didn't just get lucky.

EagleView is a Russell 1000 quantitative rotation strategy. It picks ten stocks every week from a pool of roughly a thousand using five sub-scores: momentum, sector strength, point-in-time fundamentals, SEC 8-K events, and recent insider buying. Across 29 phases of testing I kept what worked, threw out what didn't, and ran statistical tests at every step to make sure I wasn't fooling myself. What follows is the full story: the math, the wins, the failures I wrote up with the same detail as the wins, and the honest expectation for what this thing can actually do with real money.

Out-of-sample CAGR
+16.71%
Walk-forward, 4 folds, 2017 through 2025
Walk-forward Sharpe
1.213
SPY over the same window: roughly 0.85
Max drawdown
-13.98%
SPY over the same window: roughly -25%
Honest live CAGR estimate
~13%
Backtest, minus four unavoidable discounts

01 The thesis

The premise is simple. A $100,000 account, fully invested, holds ten US stocks at any moment. Once a week the system looks at the universe, scores everything, and replaces any holding that has fallen out of the top ranks. No shorting, no margin, no day trading. The whole edge comes from being slightly better than the index at deciding which ten names to hold.

Why long-only and why so few names? Because retail accounts have constraints that institutional desks don't. Tax drag on short-term gains, lousy borrow on shorts, and the very real possibility that the human running the account will panic during a 30% drawdown. A ten-stock long-only book is something a normal person can actually hold through a bad year. Ten is also the empirical sweet spot I found in Phase 16b: dropping from twenty positions to ten lifted CAGR by 0.97 percentage points because the highest-conviction picks systematically beat the diluted ones below them.

Every Friday close, the system asks six questions about every Russell 1000 name. The ten stocks with the best aggregate answer become next week's portfolio.

🌐

What's the macro weather?

Yield curve, realized vol on SPY, credit spreads. If multiple stress signals fire at once I scale total exposure from 100% down toward 60%. I never go to cash. Cash is its own bet, usually a losing one.

📈

Which sectors are leading?

I look at the eleven SPDR sector ETFs and measure their six-month return relative to SPY. A stock in a sector that is beating the market gets a tailwind bonus. This is the single largest weight in the score (40%) because sector rotation explains more cross-sectional variance than any single-stock factor.

💎

Is the company actually any good?

P/E, P/B, EV/EBITDA, ROE, debt-to-equity, sales growth. Built from SEC 10-Q and 10-K filings using the date the filing hit EDGAR, not the period end. That avoids the most common look-ahead bug in academic backtests.

Is price confirming the thesis?

Twelve-month return excluding the most recent month, plus a 200-day trend filter. I do not buy falling knives no matter how cheap they look. If capital is not already flowing into the name, I wait.

⚠️

Is anything blowing up at the company?

SEC 8-K filings have item codes. Item 4.02 (non-reliance on prior financials) and Item 1.03 (bankruptcy) are basically corporate emergency flares. I count these over a 60-day window and penalize names that just lit one off.

🎯

Are the insiders buying?

Form 4 open-market purchases by officers and directors. I exclude 10%+ holders because those are usually passive funds. When operating officers put their own money in, especially several of them clustered together, they tend to know something the market doesn't yet.

Why the goal is a smoother ride, not a higher peak. The DALBAR Quantitative Analysis of Investor Behavior has shown for thirty years that the average mutual fund investor trails the funds they own by 3 to 4 percentage points per year. The cause isn't the fund manager. It's the investor buying after a good year and selling after a bad one. A strategy with a 15% backtest CAGR that draws down 50% is worthless if the person holding it sells at -30%. My regime overlay and per-position drawdown caps exist for exactly this reason: to make the equity curve something a human can actually live with.

02 The universe

I fish in the Russell 1000, not the S&P 500. That single decision was the largest source of alpha in the entire project: 2.94 additional percentage points of CAGR, matched-window, validated out-of-sample. It is worth explaining why.

The S&P 500 is the most over-analyzed slice of equity in the world. Every megacap has dozens of sell-side analysts publishing quarterly notes, every pension fund holds every name, every quant shop runs the same five factors on the same five hundred tickers. By the time a signal fires on Apple, the world has already priced it. The Russell 1000 includes those five hundred plus another five hundred mid-cap names below the S&P cap line. That second half is where the dispersion lives.

R1000 universe size
The point-in-time R1000 panel starts at roughly 770 names in mid-2016 because yfinance's shares-outstanding history is thin that far back, then expands to the full thousand by 2022. The picks for any given week are drawn only from the names that were actually in the R1000 on that date, never from later composition.

What "point-in-time" actually means and why it matters

A lot of academic backtests use the current S&P 500 list as the universe and run it back through history. This is a silent disaster. You end up testing a strategy on a universe that excludes every company that went bankrupt, got acquired, or fell out of the index because it underperformed. The survivors look great because the dead were quietly buried before the test even started.

What I do instead: rebuild the Russell 1000 membership as it actually existed on every June 30 (Russell's official reconstitution date) since 2016. A backtest considering June 2018 sees only the names that were in the index in June 2018, including names that later died or got delisted. The historical equity curve includes every loss that a real investor would have eaten.

The reconstruction math

candidates = (current Wikipedia R1000) ∪ (historical S&P 500 PIT lists)
# ≈ 1,247 distinct tickers

for each June 30 from 2016 to 2025:
    for each candidate still listed on that date:
        market_cap = close[ticker, date] × shares_outstanding[ticker, date]
    sort descending by market_cap
    top 1,000 = R1000 members for the next 12 months
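A minimal sketch of one reconstitution date from the pseudocode above, in plain Python. The function name, the dict-based inputs, and the toy tickers are illustrative assumptions; a missing price or share count is treated here as "not listed on that date".

```python
from typing import Dict, List, Optional


def reconstitute(
    candidates: List[str],
    close: Dict[str, Optional[float]],
    shares_outstanding: Dict[str, Optional[float]],
    top_n: int = 1000,
) -> List[str]:
    """Return the top_n candidates by market cap on one reconstitution date."""
    market_cap = {}
    for ticker in candidates:
        price = close.get(ticker)
        shares = shares_outstanding.get(ticker)
        # No price or share count on this date: treat the name as not listed.
        if price is None or shares is None:
            continue
        market_cap[ticker] = price * shares
    ranked = sorted(market_cap, key=market_cap.get, reverse=True)
    return ranked[:top_n]


# Toy inputs (made-up numbers); "DEAD" has no quote on this date.
members = reconstitute(
    candidates=["AAA", "BBB", "CCC", "DEAD"],
    close={"AAA": 10.0, "BBB": 50.0, "CCC": 5.0, "DEAD": None},
    shares_outstanding={"AAA": 1e9, "BBB": 1e8, "CCC": 4e9, "DEAD": 1e9},
    top_n=2,
)
```

Running the same function once per June 30 and holding each result for the following twelve months gives the point-in-time membership panel.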

I validate this against the actual Wikipedia R1000 list on the most recent available date: 952 true positives, 48 names I included that shouldn't be there, 50 names I missed. That's an F1 score of 0.951. The 5% error is almost entirely at the size boundary, names ranked around 950 to 1050 by market cap where small shares-outstanding errors flip them across the cutoff. A paid Norgate subscription at $650 per year would push F1 above 0.99 and I may upgrade later, but the marginal alpha from fixing the last 5% is small relative to what's already in the strategy.
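The F1 figure follows directly from the three counts quoted above. A short check, using the standard precision and recall definitions:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # share of included names that belong
    recall = tp / (tp + fn)     # share of true members that were found
    return 2 * precision * recall / (precision + recall)


# 952 correct, 48 wrongly included, 50 missed.
f1 = f1_score(tp=952, fp=48, fn=50)
```

Precision is 952 / 1,000 and recall is 952 / 1,002, so the two are nearly equal and the harmonic mean sits between them.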

Why mid-caps specifically have alpha left in them

03 The score stack

Every Friday at the close, each of the roughly thousand eligible names in the Russell 1000 gets a composite score. The composite is a weighted average of five sub-scores, each capturing a different kind of information. The ten names with the highest composite become next week's portfolio. Below I walk through each sub-score: what it measures, the actual formula, why it is in the stack, and what its weight was set to and why.

Score layer weights
The five layers and their weights in v0.7-candidate. Sector tailwind is by far the largest at 40% because across every test I ran, sector relative strength explained more cross-sectional return variance than any single-stock signal.

3.1 Why five layers and not one big model

The reason for stacking five independent layers instead of training one big machine-learning model is replicability. I can point at every input, name it, and reproduce it from scratch. When something stops working, I can isolate which layer broke and turn it off. A black box doesn't let you do that, and black boxes have a strong tendency to discover patterns that hold for exactly the in-sample window and then evaporate.

The other reason is that the layers actually capture different information. If momentum and value were 95% correlated, blending them would just add noise. The cross-sectional correlation matrix below shows most pairwise correlations sit below 0.30, which means each layer is contributing largely independent signal. The insider and events layers in particular are nearly orthogonal to the price-based layers, which is exactly why they add marginal alpha rather than duplicating what momentum already knows.

Layer correlations
Time-averaged cross-sectional correlations between the five layers. Values near zero mean the two layers are giving you independent information. Values near +1 or -1 mean they are substitutes for each other. The insider and events layers sit near zero against everything else, which is the structural reason they earn their slot in the blend.

3.2 Technical (weight 0.30)

technical_score = mean of four z-scored components:
    z(12-1 momentum)     = (close[t-21] / close[t-252]) - 1        # Jegadeesh-Titman
    z(above 200d SMA)    = (close[t] - SMA200) / SMA200            # Faber trend filter
    z(-5d return) * 0.5  = -(close[t] / close[t-5] - 1) * 0.5      # short-term reversal
    z(1 / vol_60d)       = inverse of 60-day realized vol          # low-vol anomaly

Four well-documented anomalies bundled into one layer because each works in a different regime and they smooth each other out. Twelve-minus-one momentum (Jegadeesh and Titman, 1993) is the workhorse: skip the most recent month because of short-term reversal, then look back twelve months. The 200-day trend filter is the Faber tactical asset allocation rule, used here as a sign filter so I don't buy names that are technically falling. Short-term reversal is the Lo-MacKinlay one-week mean reversion effect, weighted half because it works best in choppy markets and gets noisy in trends. The inverse-vol component captures the low-volatility anomaly (Ang, Hodrick, Xing and Zhang, 2006): empirically, lower-vol names earn higher risk-adjusted returns than CAPM predicts.

Combining all four matters because momentum dominates in trending markets, reversal dominates in choppy markets, and low-vol holds up in risk-off. A single one of these would underperform half the time. The blend is more stable across regimes than any component on its own.
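The technical layer can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the engine's code: realized vol is taken as the population standard deviation of daily log returns, z-scores use the population standard deviation across tickers, and the component names are made up for the example.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List

COMPONENTS = ("mom_12_1", "trend_200d", "reversal_5d", "inv_vol_60d")


def raw_components(close: List[float]) -> Dict[str, float]:
    """Raw technical inputs from daily closes (oldest first, 253 or more values)."""
    t = len(close) - 1
    sma200 = mean(close[t - 199 : t + 1])
    log_rets = [math.log(close[i] / close[i - 1]) for i in range(t - 59, t + 1)]
    return {
        "mom_12_1": close[t - 21] / close[t - 252] - 1,
        "trend_200d": (close[t] - sma200) / sma200,
        "reversal_5d": -(close[t] / close[t - 5] - 1),
        "inv_vol_60d": 1.0 / pstdev(log_rets),
    }


def zscore(values: List[float]) -> List[float]:
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


def technical_scores(raw: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    tickers = list(raw)
    z = {c: zscore([raw[tk][c] for tk in tickers]) for c in COMPONENTS}
    scores = {}
    for i, tk in enumerate(tickers):
        scores[tk] = mean([
            z["mom_12_1"][i],
            z["trend_200d"][i],
            0.5 * z["reversal_5d"][i],  # reversal enters at half weight
            z["inv_vol_60d"][i],
        ])
    return scores


# Two synthetic price paths: one drifting up, one drifting down.
up = [100 + 0.2 * i + (i % 3) for i in range(300)]
down = [200 - 0.2 * i + (i % 3) for i in range(300)]
scores = technical_scores({"UP": raw_components(up), "DOWN": raw_components(down)})
```

Each component is z-scored across the universe before averaging, so no single input dominates just because its raw units are larger.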

3.3 Sector tailwind (weight 0.40)

for each ticker:
    sector_etf        = SPDR ETF for the ticker's GICS sector (XLE for energy, etc.)
    relative_strength = (sector_etf return over 126d) - (SPY return over 126d)
    tailwind          = z-score(relative_strength across all sectors)

This is the biggest weight and the most important conceptual call I made. Rather than try to predict which sector will lead next, I let the sectors that are already leading earn a bonus for the stocks inside them. When XLE has been beating SPY for six months, every energy name in the universe gets a positive tailwind contribution. When XLE rolls over and XLK takes the lead, the tailwind automatically rotates to tech. I never have to decide; the price data decides for me.

Why does this work? Because sector leadership is persistent over multi-month windows. A sector that has been leading for six months tends to lead for another one to three months on average. That's exactly the holding-period horizon for a weekly-rebalanced book picking names with momentum half-lives in the same range. The sector layer carries the most weight because in 4-fold walk-forward tests, dropping it costs about 3 percentage points of CAGR. Dropping any single one of the other four layers costs less than 1.5.
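A minimal sketch of the tailwind calculation, assuming the 126-day returns have already been computed and that the z-score uses the population standard deviation across sectors. The return figures in the example are invented.

```python
from statistics import mean, pstdev
from typing import Dict


def sector_tailwind(sector_ret_126d: Dict[str, float], spy_ret_126d: float) -> Dict[str, float]:
    """Z-score of each sector ETF's six-month return relative to SPY."""
    rel = {etf: r - spy_ret_126d for etf, r in sector_ret_126d.items()}
    values = list(rel.values())
    mu, sd = mean(values), pstdev(values)
    return {etf: (x - mu) / sd if sd > 0 else 0.0 for etf, x in rel.items()}


# Hypothetical six-month returns for three of the eleven sector ETFs.
tailwind = sector_tailwind({"XLE": 0.20, "XLK": 0.05, "XLU": -0.01}, spy_ret_126d=0.08)
```

Every stock then inherits the tailwind value of its sector's ETF, so all energy names share the XLE score on a given Friday.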

3.4 PIT value, quality, and growth (weight 0.30)

val_pit = mean of three z-scored sub-scores:
    value:   low P/E, low P/B, low EV/EBITDA                 (Graham/Buffett)
    quality: high ROE, low D/E, gross margin stability       (QARP)
    growth:  positive year-over-year revenue and EPS growth  (GARP)

# Critical: every fundamental is keyed by SEC filing_date, NOT period_end_date.
# Q2 results filed Aug 7 are first usable on Aug 7, not on June 30.

Built from SEC 10-Q and 10-K filings parsed straight from EDGAR. The thing that makes this point-in-time is that I never use a number before the date it was actually filed. A Q2 result with a period end of June 30 doesn't enter the data until the 10-Q is filed in early August. This sounds obvious but most academic backtests get it wrong by using the period end as the availability date, which leaks roughly 45 days of future information into the past. That single leak typically inflates strategy CAGR by 1 to 3 percentage points.
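The filing-date rule amounts to an "as of" lookup keyed on when the number became public. A small sketch, where the `Filing` record, the year, and the EPS values are hypothetical:

```python
from bisect import bisect_right
from datetime import date
from typing import List, NamedTuple, Optional


class Filing(NamedTuple):
    filing_date: date  # when the 10-Q or 10-K hit EDGAR
    period_end: date   # fiscal period the numbers describe
    eps: float


def latest_known(filings: List[Filing], as_of: date) -> Optional[Filing]:
    """Most recent filing that was public on as_of (filings sorted by filing_date)."""
    idx = bisect_right([f.filing_date for f in filings], as_of)
    return filings[idx - 1] if idx > 0 else None


history = [
    Filing(date(2023, 5, 5), date(2023, 3, 31), eps=1.10),  # Q1
    Filing(date(2023, 8, 7), date(2023, 6, 30), eps=1.25),  # Q2
]

mid_july = latest_known(history, date(2023, 7, 15))  # still sees Q1
on_filing = latest_known(history, date(2023, 8, 7))  # Q2 becomes usable
```

A backtest keyed on `period_end` would have handed the Q2 number to the mid-July rebalance, which is exactly the leak described above.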

The composition (cheap + healthy + growing) is essentially Joel Greenblatt's Magic Formula generalized. I don't claim it's novel. I claim it works because it captures the empirical finding that good companies at fair prices outperform the index, and I wired it up carefully enough not to cheat.

3.5 8-K events (weight 0.10), a defensive filter

When something material happens at a US company, the SEC requires an 8-K filing within four business days, classified by item code. Item 1.03 means bankruptcy. Item 2.06 means a material asset impairment. Item 4.02 means the company is telling investors not to rely on previously-filed financials, which is the polite version of "our accounting is broken." Item 5.02 is an executive departure, often a CFO leaving suddenly.

for each ticker:
    negative_8k_count = count of items in (1.03, 2.04, 2.05, 2.06, 4.02, 5.02)
                        over the trailing 60 days
    events_score      = -1 * z-score(negative_8k_count across universe)

# The negative sign is deliberate: more negative-item filings = lower score = penalty.

The weight is only 10% because most of the time most stocks have zero negative filings, so the layer is sparse and only fires when something is actually wrong. That's exactly what I want from a risk filter: silent when nothing is broken, loud when it is. The layer prevented me from holding several names that announced restatements and then dropped 30 to 60% over the following weeks.
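A sketch of the events layer in plain Python. It assumes the 60-day window is measured in calendar days and that each filing arrives as a (date, item codes) pair; both are assumptions for illustration, as are the example dates.

```python
from datetime import date, timedelta
from statistics import mean, pstdev
from typing import Dict, List, Tuple

NEGATIVE_ITEMS = {"1.03", "2.04", "2.05", "2.06", "4.02", "5.02"}


def negative_8k_count(
    filings: List[Tuple[date, List[str]]], as_of: date, window_days: int = 60
) -> int:
    """Count negative item codes filed in the trailing window ending on as_of."""
    start = as_of - timedelta(days=window_days)
    return sum(
        1
        for filed, items in filings
        if start < filed <= as_of
        for item in items
        if item in NEGATIVE_ITEMS
    )


def events_scores(counts: Dict[str, int]) -> Dict[str, float]:
    """Negated cross-sectional z-score: more negative filings, lower score."""
    values = list(counts.values())
    mu, sd = mean(values), pstdev(values)
    return {t: -(c - mu) / sd if sd > 0 else 0.0 for t, c in counts.items()}


# Hypothetical filing history for one troubled name.
troubled = [
    (date(2023, 11, 1), ["1.03"]),          # outside the 60-day window
    (date(2024, 3, 1), ["4.02", "9.01"]),   # 9.01 is not in the negative set
    (date(2024, 3, 10), ["5.02"]),
]
count = negative_8k_count(troubled, as_of=date(2024, 3, 29))
scores = events_scores({"A": 0, "B": 0, "C": 0, "D": count})
```

Because most names have a count of zero, the clean names all share one small positive score and only the flagged name is pushed materially below the pack.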

3.6 Insider buys (weight 0.10), the Phase 19 discovery

for each ticker, over the trailing 90 days:
    informed_buys = Form 4 open-market purchases (transaction code "P")
                    by officers and directors only (NOT 10%+ holders)
    dollar_size   = sum(price * shares) for those buys
    volume_norm   = dollar_size / median(90-day dollar volume)
    cluster       = count of distinct insiders making at least one buy
    raw_score     = log(1 + volume_norm) + log(1 + cluster)

insider_score = z-score(raw_score across universe)

Insiders, meaning the C-suite and the board, are required by Section 16 of the Securities Exchange Act to disclose any trade in their own company's stock within two business days. There is a long literature (Lakonishok and Lee 2001, Cohen, Malloy and Pomorski 2012) showing that informed insider buys, the open-market purchases by operating executives, predict positive abnormal returns over the following three to twelve months.

The trick is excluding the noise. I drop sales entirely because the reasons to sell are many (taxes, diversification, divorce) and the reasons to buy are few (you think the stock will go up). I drop 10%+ holders because those are typically index funds or activist stakes that are not "informed" in the same sense. I weight by both dollar size and by the cluster of distinct insiders buying, because one insider buying $50k is noise but three insiders buying $500k each over the same month is a strong directional bet.
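The raw insider score is easy to reproduce by hand. The sketch below assumes the buys have already been filtered to code "P" purchases by officers and directors in the trailing 90 days, and uses an invented $20M median daily dollar volume to compare the two cases from the paragraph above.

```python
import math
from typing import List, Tuple

Buy = Tuple[str, float, float]  # (insider id, price, shares)


def insider_raw_score(buys: List[Buy], median_dollar_volume: float) -> float:
    """log(1 + volume_norm) + log(1 + cluster), before the cross-sectional z-score."""
    dollar_size = sum(price * shares for _, price, shares in buys)
    volume_norm = dollar_size / median_dollar_volume
    cluster = len({insider for insider, _, _ in buys})
    return math.log1p(volume_norm) + math.log1p(cluster)


MEDIAN_DV = 20_000_000.0  # hypothetical median daily dollar volume

lone_buy = insider_raw_score([("cfo", 50.0, 1_000)], MEDIAN_DV)  # one $50k buy
clustered = insider_raw_score(
    [("ceo", 50.0, 10_000), ("cfo", 50.0, 10_000), ("director", 50.0, 10_000)],
    MEDIAN_DV,
)  # three $500k buys
no_activity = insider_raw_score([], MEDIAN_DV)
```

The log transforms keep one enormous purchase from swamping the cross-section while still rewarding both size and breadth.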

I tried this signal on the S&P 500 in Phase 5 and it didn't work. The reason: megacap insider buys are reported instantly, picked up by Bloomberg terminals, and priced within hours. By the time my Friday-close rebalance could act on it, the move was already done. On the Russell 1000 mid-caps in Phase 19, the same signal added 1.32 percentage points of CAGR. The price reaction in mid-caps is more gradual and I have time to participate. This was the entire reason to expand the universe.

Insider signal examples
The insider z-score over time for the three R1000 names with the most cumulative insider-buy activity in my window. Most days the score sits near zero. The spikes above zero are clusters of officer purchases, often coincident with a company-specific catalyst. The model overweights names sitting at the top of this distribution on rebalance day.
Insider score distribution
Cross-sectional distribution of insider scores on a recent date. Most names cluster near zero because most weeks have no insider activity. The right tail, roughly the top 10% with clustered or sizable recent buys, is where the alpha concentrates. The model picks from that tail when it sees it.

3.7 What a pick actually looks like inside

A composite score on its own is a single number. To know why the model picked a name I decompose the score into its five layer contributions. Names that score high on three or four layers tend to outperform names that score high on just one, even if the total composite is the same. Broad signal beats narrow signal.

Score decomposition
Decomposition of the composite score into its five contributing layers for the top five R1000 picks on the most recent date with full data. Picks with contributions from multiple layers (a name that's both in a leading sector AND has insider buying AND is cheap on PIT value) historically outperform picks that score high on only one layer.
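A sketch of the decomposition. The weights are the ones quoted in the section headings (0.30, 0.40, 0.30, 0.10, 0.10), which sum to 1.20, so this example divides by the total to get a true weighted average; that normalization is an assumption, and the layer scores are invented.

```python
from typing import Dict, Tuple

WEIGHTS = {
    "technical": 0.30,
    "sector": 0.40,
    "val_pit": 0.30,
    "events": 0.10,
    "insider": 0.10,
}


def decompose(layer_scores: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Return the composite score and each layer's contribution to it."""
    total_weight = sum(WEIGHTS.values())  # 1.20 with the weights as quoted
    contributions = {
        layer: WEIGHTS[layer] / total_weight * layer_scores[layer] for layer in WEIGHTS
    }
    return sum(contributions.values()), contributions


composite, contributions = decompose(
    {"technical": 0.8, "sector": 1.5, "val_pit": 0.2, "events": 0.0, "insider": 2.0}
)
```

Counting how many entries of `contributions` are meaningfully positive gives a simple breadth measure for a pick.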

04 From scores to dollar positions

The scoring stack tells me which ten names to hold. It does not tell me how to size them, when to trim, or how much total exposure to run. Those decisions are the difference between a backtest that looks good on paper and one a human can actually hold through a bad year. Each of the six construction rules below was chosen for a specific reason and validated by a specific experiment.

Step 1: Eligibility filter

eligible = (in R1000 PIT for current date)
       AND (60-day median dollar volume > $1M)
       AND (price >= $5)

Liquidity and survivor filters. The dollar-volume floor rules out names where a $20k position would move the price. The $5 price filter excludes penny-stock noise. Both are conservative for a $100k book and become more binding as the account scales.

Step 2: Pick top 10 with a rotation buffer

for each name i:
    if currently_held[i]:
        adjusted_score[i] = composite_score[i] * (1 + 0.10)   # 10% hysteresis bonus for incumbents
    else:
        adjusted_score[i] = composite_score[i]

new_portfolio = top 10 names by adjusted_score

A naive "pick top ten every week" approach generates a lot of churn. Names sit near the cutoff and flip in and out depending on tiny score wiggles, paying friction on every flip with no signal advantage. The rotation buffer says: an already-held name keeps its slot unless a non-held name beats its score by more than 10% in score units. This is anti-churn hysteresis. Phase 25's parameter sweep tested buffers from 0.00 to 0.25 in steps of 0.05. The 0.10 value was the local maximum on Sharpe.
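The rule in the paragraph above (an incumbent keeps its slot unless a challenger beats its score by more than 10%) can be sketched as a ranking on an adjusted score in which held names get a 10% boost. This assumes composite scores near the cutoff are positive, since a multiplicative boost would work against a name with a negative score.

```python
from typing import Dict, List, Set


def select_portfolio(
    scores: Dict[str, float], held: Set[str], n: int = 10, buffer: float = 0.10
) -> List[str]:
    """Top-n names after giving currently held names a hysteresis boost."""
    adjusted = {
        ticker: score * (1 + buffer) if ticker in held else score
        for ticker, score in scores.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)[:n]


# One slot, one incumbent. A challenger 5% ahead does not displace it...
stays = select_portfolio({"HELD": 1.00, "NEW": 1.05}, held={"HELD"}, n=1)
# ...but a challenger 20% ahead does.
rotates = select_portfolio({"HELD": 1.00, "NEW": 1.20}, held={"HELD"}, n=1)
```

Setting `buffer=0.0` recovers the naive top-n pick, which makes the churn comparison a one-parameter change.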

Step 3: Inverse-volatility weighting, capped at 20%

raw_weight[i]   = 1 / realized_vol_63d[i]
normalized[i]   = raw_weight[i] / sum(raw_weight)
capped[i]       = min(normalized[i], 0.20)
final_weight[i] = capped[i] / sum(capped)   # renormalize back to 100%

Equal-weighting ten names treats a low-vol utility the same as a high-vol biotech, which is dumb. Inverse-vol weighting gives the steadier names larger positions because they can carry more dollars without dominating portfolio risk. The 20% cap exists because with only ten names, pure inverse-vol can occasionally produce a 35%+ position in the lowest-vol name. That's too much single-name risk. The cap keeps the largest possible position at one-fifth of the book.
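One detail worth checking in any implementation: a single cap-then-renormalize pass can push a clipped name back above the cap. A 0.40 weight clipped to 0.20, with the other nine names summing to 0.60, renormalizes to 0.20 / 0.80 = 0.25. The sketch below is one way to make the 20% cap bind, by pinning capped names at the cap and redistributing the remainder across the uncapped names in proportion to inverse vol. It is an illustrative variant, not necessarily how the engine does it.

```python
from typing import Dict


def inverse_vol_weights(vols: Dict[str, float], cap: float = 0.20) -> Dict[str, float]:
    """Inverse-vol weights summing to 1 with no weight above cap."""
    if cap * len(vols) < 1.0:
        raise ValueError("cap too tight for this many names")
    raw = {t: 1.0 / v for t, v in vols.items()}
    total = sum(raw.values())
    weights = {t: x / total for t, x in raw.items()}
    capped = set()
    while True:
        over = {t for t, w in weights.items() if t not in capped and w > cap + 1e-12}
        if not over:
            return weights
        capped |= over
        free = [t for t in weights if t not in capped]
        remaining = 1.0 - cap * len(capped)  # budget left for uncapped names
        free_total = sum(raw[t] for t in free)
        for t in capped:
            weights[t] = cap
        for t in free:
            weights[t] = remaining * raw[t] / free_total


# One very steady name next to nine ordinary ones (made-up vols).
vols = {"UTIL": 0.05, **{f"S{i}": 0.30 for i in range(9)}}
weights = inverse_vol_weights(vols)
```

In this example the steady name would take 40% under pure inverse-vol weighting; after capping it holds exactly 20% and the other nine split the remaining 80% equally.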

Step 4: Per-position drawdown cap

for each held position:
    dd = (current_price / peak_price_since_entry) - 1
    if dd < -0.15:
        weight = 0.30 * target_weight   # severe trim
    elif dd < -0.08:
        weight = 0.60 * target_weight   # initial trim
    else:
        weight = target_weight          # full size

When a name is down 8% from its peak since I bought it, I cut to 60% of full size. At 15% down, I cut to 30%. This is mechanical stop-loss logic with two important properties: it scales the position rather than fully exiting (so I don't sell the bottom and miss the rebound), and it restores the position if the price recovers. The cash freed up sits idle, not redeployed, because I want the system to express less risk during deterioration rather than rotate into something else and hope.
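The two-tier trim translates directly into a small function. The thresholds and scale factors come from the rule above; the function name and the example prices are illustrative.

```python
def drawdown_scaled_weight(
    current_price: float, peak_price_since_entry: float, target_weight: float
) -> float:
    """Scale a position's weight by its drawdown from the post-entry peak."""
    dd = current_price / peak_price_since_entry - 1
    if dd < -0.15:
        return 0.30 * target_weight  # severe trim
    if dd < -0.08:
        return 0.60 * target_weight  # initial trim
    return target_weight             # full size


full = drawdown_scaled_weight(100.0, 100.0, target_weight=0.10)
trimmed = drawdown_scaled_weight(90.0, 100.0, target_weight=0.10)  # down 10%
severe = drawdown_scaled_weight(80.0, 100.0, target_weight=0.10)   # down 20%
```

Because the weight is recomputed from the current drawdown at each rebalance, a recovery back inside the 8% band restores full size without any extra logic.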

Step 5: Macro regime overlay

Stock-by-stock risk controls catch idiosyncratic blowups. They do not catch a market-wide selloff where all ten of my names go down together. For that I have a regime overlay that scales total gross exposure based on macro stress. The inputs are the 2-year vs 10-year Treasury yield spread, realized SPY volatility, and high-yield credit spreads. When two of the three flag stress, exposure scales from 100% toward 60%.

Regime exposure
Total gross exposure over time. The two clear dips toward 60% are early 2020 (COVID lockdowns and the volatility spike) and 2022 (Fed tightening cycle plus inflation shock). Notice the system never goes to zero. Cash-timing is a notoriously bad bet, and being only 60% invested during a stress window is already a meaningful de-risk without giving up the long-run drift in equities.

Step 6: Friction model

None of the backtest results in this document are gross numbers. Every CAGR and Sharpe is computed after subtracting realistic execution costs. The table below shows the friction assumptions baked into the engine. They are tight but defensible for a retail account at Interactive Brokers or Schwab trading mid-caps in modest size.

Component | Value | Why this number
Slippage | 5 bps each side | Modest market impact at $100k notional in mid-caps
Half-spread | 3 bps each side | Typical bid-ask for R1000 names ex-megacaps
Commission | $1.00 per trade | Standard at IB; Schwab is now $0 fixed
Cash yield | 4.0% APY | On the idle cash from drawdown caps and regime de-risking

Total round-trip friction is about 16 bps per trade: 5 bps of slippage plus 3 bps of half-spread, paid once on the sell and once on the buy. With weekly rebalancing and roughly 3 to 4 name changes per week, that works out to roughly 3.5% a year in turnover cost, built into every quoted return. The cash yield gives a small offset when the regime overlay parks dollars on the sidelines.
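The annual cost figure can be sanity-checked with back-of-envelope arithmetic. The sketch assumes ten equal 10% positions, so each name change turns over one tenth of the book, and two $1 commissions per change on a $100,000 account. Those simplifications are assumptions, not the engine's actual accounting, and they ignore trims and reweights.

```python
SLIPPAGE_BPS = 5
HALF_SPREAD_BPS = 3
COMMISSION_USD = 1.00
ACCOUNT_USD = 100_000
N_POSITIONS = 10
WEEKS_PER_YEAR = 52

# Each side pays slippage plus half-spread; a round trip pays both sides.
round_trip_bps = 2 * (SLIPPAGE_BPS + HALF_SPREAD_BPS)


def annual_friction(name_changes_per_week: float) -> float:
    """Approximate yearly friction as a fraction of the account."""
    turnover_per_week = name_changes_per_week / N_POSITIONS
    price_cost = turnover_per_week * round_trip_bps / 10_000 * WEEKS_PER_YEAR
    commissions = 2 * name_changes_per_week * COMMISSION_USD * WEEKS_PER_YEAR / ACCOUNT_USD
    return price_cost + commissions


low = annual_friction(3)
high = annual_friction(4)
```

Under these assumptions three changes a week cost a little under 3% a year and four cost a little under 4%, which brackets the figure quoted above.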

What this looks like as sector exposure over time

Because the sector tailwind layer is the largest weight in the score, the portfolio's sector composition shifts dynamically as leadership rotates. I never coded an explicit "rotate into energy now" rule. The rotation is a downstream effect of the sector tailwind layer simply doing its job.

Sector rotation
Actual sector exposure of v0.7-candidate across the backtest window. Heavy tech tilt from 2018 to 2021 as XLK was leading. Pivot toward energy and materials in 2022 when XLE and XLB were the only sectors with positive relative strength during the inflation regime. Mixed allocation since 2023 as leadership has been less concentrated.

05 Am I just fooling myself?

The hardest problem in quant research isn't building the strategy. It's knowing whether the result you got is real or whether you stumbled onto something that worked by luck and won't repeat going forward. Every promising result in this project went through three independent statistical tests before I believed it. The reason for the tests is uncomfortable: most published quant strategies don't replicate.

Why a "winning" backtest probably isn't winning

Imagine you test 100 slightly different versions of a strategy. Even if every single one is identical to baseline at the population level, simple sampling noise will produce a distribution of in-sample Sharpe ratios across the 100 variants. The best one of those 100 will look impressively better than the worst, purely by luck. If you pick the best one and call it your "discovery," you have fooled yourself.

The math is well understood. Under the null hypothesis that all N variants have identical true Sharpe, the expected maximum Sharpe across N trials is approximately:

E[max Sharpe under null] ~= sigma_Sharpe * sqrt(2 * ln(N))

Plug in realistic numbers. The cross-variant standard deviation of measured Sharpe on a 5-year window is roughly 0.30. Test 100 variants. Expected best by chance: 0.30 * sqrt(2 * 4.6) = 0.91 Sharpe units. That means even if every variant is secretly identical to the baseline, you should expect the best one to measure about 0.91 Sharpe above the value they all truly share. This is why every blog post promising a Sharpe-3 strategy should be treated with extreme skepticism until somebody runs it forward in real money.
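The same approximation is a one-liner, which makes it easy to plug in other search sizes. The 0.30 dispersion is the figure from the text; evaluating at 78 variants is just an illustration using the variant count reported for this project.

```python
import math


def expected_max_sharpe(sigma_sharpe: float, n_variants: int) -> float:
    """Approximate expected best Sharpe, above the common true value, under the null."""
    return sigma_sharpe * math.sqrt(2 * math.log(n_variants))


best_of_100 = expected_max_sharpe(0.30, 100)
best_of_78 = expected_max_sharpe(0.30, 78)
```

The penalty grows only with the square root of the log of N, so cutting the search from 100 variants to 78 barely changes the expected lucky maximum.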

Across the full project I tested approximately 78 distinct strategy variants across the 29 phases. I have to assume some fraction of any "winning" result is order-statistic noise from that search, and I discount the headline numbers accordingly.

The three tests I run on every candidate

📊

Block bootstrap CI

Take the candidate's daily return series. Resample it 2,000 times in overlapping 20-day blocks (block resampling preserves the autocorrelation structure that ordinary resampling would destroy). Each resample gives a Sharpe, a CAGR, and a max drawdown. The 2.5th and 97.5th percentiles of those 2,000 numbers form a 95% confidence interval. If the candidate's lower bound sits above the baseline's point estimate, the improvement is robust to sampling noise.

🎲

Sign-flip permutation

Compute the daily (candidate minus baseline) alpha series. The null hypothesis is that this series has zero mean (the two strategies are equally good). Under the null, the sign of any individual day's alpha is random. So I build 5,000 permuted series, each with every day's sign flipped at random, compute the mean of each, and count what fraction exceed the observed mean. That fraction is the p-value. This is the cleanest paired test available: it cancels out market-wide variance because both legs ate the same market.

📐

Deflated Sharpe (DSR)

Bailey and Lopez de Prado, 2014. Given N total variants searched in the project, compute the probability that the observed Sharpe genuinely exceeds the expected maximum-by-chance under the null. The deflated Sharpe ratio bakes the multiple-testing penalty directly into a single number. DSR above 0.80 means I'm at least 80% confident the result isn't just the best of N random draws.
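The first two tests are short enough to sketch in plain Python. This is a minimal illustration under assumptions: Sharpe is annualized with 252 trading days and a population standard deviation, block starts are drawn uniformly with replacement, the series is at least one block long, and the permutation test is one-sided. The deflated Sharpe calculation is not shown.

```python
import math
import random
from typing import List, Tuple


def annualized_sharpe(returns: List[float]) -> float:
    n = len(returns)
    mu = sum(returns) / n
    var = sum((r - mu) ** 2 for r in returns) / n
    return mu / math.sqrt(var) * math.sqrt(252) if var > 0 else 0.0


def block_bootstrap_sharpe_ci(
    returns: List[float], n_boot: int = 2000, block: int = 20, seed: int = 0
) -> Tuple[float, float]:
    """95% percentile interval for Sharpe from overlapping-block resamples."""
    rng = random.Random(seed)
    n = len(returns)
    n_blocks = math.ceil(n / block)
    sharpes = []
    for _ in range(n_boot):
        sample: List[float] = []
        for _ in range(n_blocks):
            start = rng.randrange(n - block + 1)  # any overlapping block start
            sample.extend(returns[start : start + block])
        sharpes.append(annualized_sharpe(sample[:n]))
    sharpes.sort()
    return sharpes[int(0.025 * n_boot)], sharpes[int(0.975 * n_boot) - 1]


def sign_flip_pvalue(alpha: List[float], n_perm: int = 5000, seed: int = 0) -> float:
    """One-sided p-value for mean(alpha) > 0 under random daily sign flips."""
    rng = random.Random(seed)
    n = len(alpha)
    observed = sum(alpha) / n
    hits = 0
    for _ in range(n_perm):
        flipped = sum(a if rng.random() < 0.5 else -a for a in alpha) / n
        if flipped >= observed:
            hits += 1
    return hits / n_perm
```

In use, `alpha` is the day-by-day difference between the candidate's and the baseline's returns, so anything both strategies share (the market's own move that day) cancels before the test runs.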

What v0.7-candidate actually scored (Phase 21)

Gate | Observed | Verdict
G1 (CI separation) | v0.7 Sharpe lower bound 0.641 vs v0.6.4 point estimate 0.949 | FAIL (CIs overlap)
G2 (sign-flip permutation) | p = 0.0094 | PASS
G3 (deflated Sharpe with N=30) | DSR = 0.9266 | PASS
Overall | 2 of 3 gates pass | Meaningful evidence. Paper-trade required before live promotion.

G1 fails because individual-strategy Sharpe confidence intervals on 8 years of daily data are intrinsically very wide (roughly +/- 0.3 Sharpe units). For paired comparisons G1 is almost guaranteed to fail and it's not the right test to lean on. G2 is the test that matters and the p-value of 0.0094 means the observed alpha is significantly different from zero at the 1% level. The deflated Sharpe of 0.93 means I'm 93% confident the result isn't the order-statistic of my search. Together that's strong enough evidence to keep moving forward, but not strong enough to skip paper trading.

Bootstrap Sharpe distribution
Block-bootstrap Sharpe distributions, 1,000 resamples each. Gray is the v0.6.3 baseline. Indigo is v0.7-candidate. The v0.7 distribution is shifted meaningfully to the right but the distributions overlap, which is why G1 "fails" while G2 (the paired test that controls for shared variance) comfortably passes.

The save: Phase 29 caught a normalization artifact that nearly fooled me

This is the single best illustration of why the rigor framework matters. In Phase 28 I tested a variant called T5 that swapped the standard 12-1 momentum window for a 6-1 window. The result looked spectacular: +2.25 percentage points of CAGR and +0.202 Sharpe over the prior best. I almost promoted it.

Phase 29 went back and ran the test more carefully. It turned out that T5's two implementations were normalizing their technical scores differently. When I forced both to use the same normalization, the "win" collapsed: real lift was +23 basis points of CAGR and +0.048 Sharpe. The sign-flip permutation gave p = 0.464, meaning the difference was indistinguishable from zero. I had been about to promote a numerical artifact to production.

Without the rigor framework, this would have shipped. Every researcher reading this knows the feeling of seeing a result they want to be true and not poking at it as hard as they should. The framework forces the poke. T5 is the reason I trust the +1.32% CAGR from the insider layer in Phase 19, which survived identical tests.

06 What I tested and what survived

Twenty-nine numbered research phases, roughly 78 distinct variants tested across them, three genuine wins, one variant that fooled me before the rigor framework caught it, and a long list of honest failures. The failures matter as much as the wins because they map the boundary of where alpha actually exists in this universe. Below: every result, what the experiment was, what it produced, and why it survived or didn't.

Phase journey
Walk-forward CAGR delta vs the prior best, for each phase. Green bars survived rigor and were promoted. Red bars regressed or failed gates. The orange bar is Phase 28's T5 result that initially looked like a win and got caught as an artifact in Phase 29.

The three things that actually moved the needle

Phase 16b: concentration +0.97% CAGR

The experiment: a 2x2 factorial varying the number of positions (10 vs 20) and the weighting scheme (equal vs inverse-volatility). Same scoring stack across all four cells. The clean comparison let me isolate concentration from weighting. Result: dropping from 20 names to 10 with inverse-vol weighting added 0.97 percentage points of CAGR. Why: the top-ranked names by composite score systematically outperformed the names ranked 11 through 20. Holding both dilutes the edge.

Phase 18: Russell 1000 universe +2.94% CAGR

The biggest single discovery in the project. The experiment: identical strategy code, two different universes (S&P 500 vs Russell 1000 PIT), same 2017-2024 walk-forward window. Result: the R1000 version produced 2.94 more percentage points of CAGR and meaningfully better Sharpe. The mechanism is institutional capacity. Quant funds large enough to matter cannot deploy meaningfully in mid-caps, so the mid-cap cross-section is less arbitraged and signals retain more predictive power.

Phase 19: insider buys on R1000 +1.32% CAGR

Adding the Form 4 insider-buy layer described in section 3.6. Result on R1000: +1.32 percentage points of CAGR and +0.09 Sharpe over the no-insider baseline. Phase 21 then put this result through the rigor framework: the sign-flip permutation gave p = 0.0094 and the deflated Sharpe was 0.927. That's the strongest survives-everything result in the project. The same signal had been rejected at the S&P 500 level in Phase 5, which tells me the alpha is specifically in mid-cap insider activity, not megacap.

What I tested that didn't work, and why each failure was informative

L-014 technical confirmation rejected twice

Tested in both Phase 6 (S&P 500) and Phase 20 (R1000). The signal was a combination of ATR breakouts, Mansfield relative strength, and signed ADX, the kind of stack a technical-analysis textbook would recommend. Failed in both universes. Diagnosis: by the time I observe a Friday-close breakout and rebalance the following Monday open, the breakout move is already over. The signal has no edge at weekly rebalance frequency. Useful finding: the failure is universe-independent. It's not "wrong universe," it's "wrong cadence."

GDELT news sentiment rejected twice

Phase 15 tested a 30-day rolling average news tone score from the GDELT Project as a layer. Phase 15b tested a narrower event-window version, specifically because the user (correctly) pushed back on whether the first rejection was implementation-specific. Both failed. Why: the 8-K event layer already captures the categorical material news that moves stocks. General-purpose news sentiment is almost all ambient noise that adds no marginal information.

Analyst revisions -1.23% to -4.36%

Analyst grade upgrades and downgrades as a layer. Result was negative across two different implementations. Why: analyst rerates are lagging. The stock has already moved by the time the analyst publishes. Buying on a sell-side upgrade is buying after the smart money has already positioned. I'd be the dumb money.

PEAD (post-earnings drift) -2.75% CAGR

The classic Ball and Brown (1968) earnings-surprise drift signal. The finding has been documented in finance literature for fifty years but has decayed substantially since 2000 as algorithmic traders started arbitraging it. My test confirmed: it not only doesn't add lift, it actively hurts at weekly cadence. There's also overlap with the insider signal (insiders buy after good earnings reports), so adding both would just double-weight the same names.

Monthly and quarterly rebalance -4.7% / -8.5%

Phase 22 tested whether I could save friction by rebalancing less often. Monthly rebalance cost 4.7 percentage points of CAGR. Quarterly cost 8.5. Why: the signal stack has an empirical half-life of roughly 30 to 60 days. Slower rebalancing means trading on stale signals. The friction savings (maybe 50-80 bps a year) get dwarfed by the alpha decay (4-8%). Weekly is the right cadence for this stack.
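A toy model of the staleness argument, under a loud assumption: signal strength decays exponentially with a fixed half-life, and a portfolio rebalanced every T days holds, on average, the mean of that decay curve over the holding interval. The function name and the 45-day half-life in the usage note are hypothetical, and the model is not calibrated to reproduce the measured 4.7 and 8.5 point costs.

```python
import math

def average_signal_retained(rebalance_days, half_life_days):
    """Mean of 2**(-t / half_life) for t in [0, rebalance_days].

    Closed form: (1 - exp(-x)) / x with x = rebalance_days * ln(2) / half_life.
    """
    x = rebalance_days * math.log(2) / half_life_days
    return (1 - math.exp(-x)) / x
```

Calling it with 5, 21, and 63 trading days against a 45-day half-life gives a strictly decreasing sequence, which is the qualitative point: the longer the gap between rebalances, the staler the average signal in the book.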

Vol-targeting overlay Sharpe-neutral

Tested a layer that scales total exposure to maintain a constant realized vol target. Worked as designed: vol came down, max drawdown came down. But return came down proportionally, so Sharpe was flat. The existing per-position drawdown caps plus regime overlay already capture most of the available vol-management benefit. Adding a third layer of vol-control was redundant.
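A minimal sketch of what a vol-targeting overlay of this kind typically computes. The 10% target, 20-day lookback, and the cap at 1.0 (no margin, matching the long-only premise) are assumptions for illustration, not the tested configuration.

```python
import math
import statistics

def vol_target_weights(returns, target_vol=0.10, lookback=20,
                       max_exposure=1.0, periods_per_year=252):
    """Exposure for each day, using only returns observed before that day."""
    weights = []
    for t in range(len(returns)):
        window = returns[max(0, t - lookback):t]
        if len(window) < 2:
            weights.append(max_exposure)  # not enough history yet
            continue
        realized = statistics.stdev(window) * math.sqrt(periods_per_year)
        if realized == 0:
            weights.append(max_exposure)
        else:
            weights.append(min(max_exposure, target_vol / realized))
    return weights
```

Scaling a return stream by a constant factor scales its mean and its volatility by that same factor, so the ratio between them is unchanged. That is why an overlay like this can cut vol and drawdown while leaving Sharpe roughly flat.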

T5 mom_6_1 momentum window artifact, caught

Phase 28 swapped 12-1 momentum for 6-1. Headline: +2.25% CAGR, +0.202 Sharpe. Looked like a winner. Phase 29 rerun caught that the two implementations were normalizing differently. On matched normalization the real lift was +23 bps and the permutation p-value was 0.464. The single clearest illustration of why I run rigor on everything that looks good.
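To make the comparison concrete, here is a sketch of computing both momentum windows and pushing them through one shared normalization, so that any difference comes from the window and not from the scaling. It assumes month-end prices with the most recent month last and a plain cross-sectional z-score; the actual normalization used in the project is not specified here.

```python
import statistics

def momentum(prices, lookback, skip=1):
    """Return from `lookback` months ago to `skip` months ago.

    lookback=12, skip=1 is 12-1 momentum; lookback=6, skip=1 is 6-1.
    `prices` is a list of month-end prices, most recent last.
    """
    return prices[-1 - skip] / prices[-1 - lookback] - 1.0

def zscore(values):
    """Cross-sectional z-score; the same function is applied to both variants."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]
```

Computing `zscore([momentum(p, 12) for p in universe])` and `zscore([momentum(p, 6) for p in universe])` over the same list of price histories (`universe` is a hypothetical name) puts the two variants on a like-for-like footing before they enter the composite score.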

Knob tuning sweep no improvement on 4 axes

Phase 25 systematically swept four hyperparameters (rotation buffer, sector-tailwind weight, insider weight, insider lookback window) around their current values. The current values turned out to be the local maximum on every axis. That's reassuring: I picked them based on first-principles intuition and they happen to also be the empirical optima. There's no easy +1% sitting on the table from tuning.

The pattern: leading information works, lagging information doesn't

Look at the wins and failures together. Insider buys: officers acting on private information before the market knows. Works. Sector tailwind: persistent multi-month leadership I ride before it ends. Works. Mid-cap universe shift: capturing a less-arbitraged slice. Works.

Now the failures. Analyst revisions: reaction after the price moved. Doesn't work. PEAD: reaction after the earnings print. Doesn't work. L-014 breakouts at weekly cadence: reaction after the breakout already happened. Doesn't work. News sentiment: descriptive about what already happened. Doesn't work.

The pattern is consistent enough across 29 phases that I'd bet on it for any future hypothesis: if the signal arrives before the price reaction, it has a chance. If it arrives after, the market has already priced it in and there's nothing left for me.

07How it actually performed

Every number in this section comes from a 4-fold walk-forward backtest, not a single in-sample fit. The 2017 to 2025 window is split into 4 folds. For each fold I use the prior data to set parameters and the held-out period to score. Stitching the four out-of-sample returns end-to-end gives the equity curve below. This is as close to "what would have actually happened" as a backtest can get without being a live forward run.
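The fold mechanics can be sketched as follows. This assumes an expanding training window and equal-sized test blocks; the real fold boundaries are not reproduced here, and `fit` and `apply` are hypothetical callables standing in for parameter selection and scoring.

```python
def walk_forward_folds(n_obs, n_folds, min_train):
    """Yield (train_indices, test_indices); training always precedes testing."""
    test_size = (n_obs - min_train) // n_folds
    for k in range(n_folds):
        start = min_train + k * test_size
        # the last fold absorbs any remainder
        stop = n_obs if k == n_folds - 1 else start + test_size
        yield range(0, start), range(start, stop)

def stitch_out_of_sample(returns, n_folds, min_train, fit, apply):
    """Fit on each training slice, score the held-out slice, concatenate."""
    oos = []
    for train_idx, test_idx in walk_forward_folds(len(returns), n_folds, min_train):
        params = fit([returns[i] for i in train_idx])
        oos.extend(apply(params, [returns[i] for i in test_idx]))
    return oos
```

The property that matters is that no test observation is ever visible to the `fit` call that produces the parameters used to score it.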

v0.7-candidate CAGR
+16.71%
v0.6.3 baseline by the same method: +10.60%
Walk-forward Sharpe
1.213
Baseline: 0.910
Worst drawdown
-13.98%
SPY drew down roughly 25% over the same window
Sortino
1.544
Sharpe but only penalizing downside vol
Calmar
1.195
CAGR divided by max drawdown magnitude
Annualized alpha vs SPY
+0.01%
Out-of-sample, beta-adjusted
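For reference, the headline metrics can be computed from a series of periodic returns roughly like this. It is a sketch using the definitions in the cards above, assuming daily returns, a zero risk-free rate, and a zero target for downside deviation; the project's exact conventions may differ.

```python
import math
import statistics

def cagr(returns, periods_per_year=252):
    growth = math.prod(1 + r for r in returns)
    return growth ** (periods_per_year / len(returns)) - 1

def max_drawdown(returns):
    """Most negative peak-to-trough move of the compounded equity curve."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1)
    return worst

def sharpe(returns, periods_per_year=252):
    return statistics.fmean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)

def sortino(returns, periods_per_year=252):
    # downside deviation: root mean square of the negative returns only
    downside = math.sqrt(statistics.fmean(min(r, 0.0) ** 2 for r in returns))
    if downside == 0:
        return math.inf
    return statistics.fmean(returns) / downside * math.sqrt(periods_per_year)

def calmar(returns, periods_per_year=252):
    return cagr(returns, periods_per_year) / abs(max_drawdown(returns))
```

As a sanity check on the cards, 16.71 divided by 13.98 is about 1.195, which matches the Calmar figure quoted.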
Equity curves
$100,000 starting in mid-2017 grows to $371k for v0.7-candidate, $235k for v0.6.3 baseline, and $325k for SPY (dividend-adjusted) by end-2025. Plotted on log scale, which makes equal slopes correspond to equal CAGR. The candidate beats the baseline almost everywhere and beats SPY by a widening margin through 2020-2021 and 2022.
Drawdown
Underwater chart: distance from each curve's running peak, plotted as a negative number. SPY's worst was about -25% during the 2022 bear market and the recovery took about 18 months. v0.7-candidate's worst was shallower and the recovery was faster because the regime overlay had already pulled exposure down to roughly 70% by the time the selloff got bad.
Per-fold performance
Per-fold out-of-sample CAGR with alpha vs SPY annotated above each bar. The candidate beat SPY in 3 of 4 folds. Fold 4 (2022 to 2025) was the laggard, which makes mechanical sense: SPY's outperformance in that window was driven almost entirely by the Magnificent Seven megacaps, exactly the part of the universe where my PIT value and insider layers tend to be underweight. Beating SPY in 3 of 4 non-overlapping windows is a stronger signal than any single-window result because the alternative explanation (lucky window selection) is much harder to sustain across multiple windows.
Rolling Sharpe
252-day rolling Sharpe ratio. The v0.7-candidate line sits consistently above SPY's across the full window, including through the 2020 COVID volatility and the 2022 Fed tightening. The dips in early 2020 and through 2022 are visible in both curves because both ate the same market stress. The point is that the candidate's Sharpe stays both higher and less volatile than SPY's across regimes, which is the result that's most likely to persist forward.
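A rolling Sharpe series of the kind plotted here can be sketched in a few lines. It assumes daily returns, a zero risk-free rate, and a trailing window that ends on each day; those conventions are assumptions, not a description of the plotting code.

```python
import math
import statistics

def rolling_sharpe(returns, window=252, periods_per_year=252):
    """Annualized Sharpe over each trailing `window` of returns."""
    out = []
    for end in range(window, len(returns) + 1):
        chunk = returns[end - window:end]
        sd = statistics.stdev(chunk)
        if sd > 0:
            out.append(statistics.fmean(chunk) / sd * math.sqrt(periods_per_year))
        else:
            out.append(float("nan"))  # undefined when the window has no variation
    return out
```

The output has `len(returns) - window + 1` points, so the first plotted value only appears once a full window of history exists.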

08Why the live number will be lower than the backtest number

The walk-forward backtest produced a CAGR near 15%. The honest forward expectation is closer to 13%. The 2-point gap is not because the backtest is "wrong." It's because there are four specific things the backtest cannot fully account for, each of which costs a known amount. Below: each discount, quantified.

CAGR decomposition
Waterfall from the backtest CAGR through the four discount factors to the live expectation, and from there to a comparison with SPY's long-run average of roughly 10%.
  1. R1000 reconstruction error, around 50 bps. My point-in-time Russell 1000 reconstruction is 95.1% F1-accurate against the true Russell. The missing 5% are mostly names right at the size cutoff. Some of them would have been picks the strategy missed; some are spurious inclusions whose returns are noise. On net I estimate this costs roughly 0.50 percentage points of true CAGR. A paid Norgate subscription would eliminate this entirely.
  2. Multiple-testing pollution, around 70 bps. Across all 29 phases I evaluated approximately 78 distinct strategy variants. The math in section 5 says that even under the null hypothesis, the order-statistic of 78 trials inflates the best-found Sharpe by a known amount. The deflated Sharpe ratio of 0.93 implies a small but real probability that some of the headline result is search-noise. I discount by ~0.70 percentage points to be honest about this.
  3. Mid-cap execution slippage, around 50 bps. The friction model uses 5 bps slippage plus 3 bps half-spread per side. That's tight for mid-caps, where the actual bid-ask can be 5 to 15 bps wider on the less liquid names. Realistic live trading will burn an extra 0.50 percentage points of CAGR through this friction gap. The cost gets worse as the account scales because larger orders push prices more.
  4. Regime mean-reversion in SPY, around 113 bps. Over the 2017-2025 backtest window, SPY itself returned about 13.7% per year. The very long-run SPY CAGR is closer to 10%. If forward equity returns mean-revert toward the long-run average, my absolute backtest number was helped by an unusually strong market. The strategy's alpha vs SPY should persist, but the absolute CAGR drops by roughly 1.13 percentage points if SPY does 10 rather than 13.7.
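For the multiple-testing discount in item 2, the order-statistic inflation can be illustrated with the approximation from Bailey and Lopez de Prado's deflated Sharpe ratio work for the expected maximum Sharpe across N trials when no trial has real skill. Whether section 5 uses exactly this form is an assumption, and `sharpe_std` (the standard deviation of Sharpe estimates across trials) is an input that is not given in this write-up.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate E[max Sharpe] over n_trials under the null of zero true Sharpe.

    The result is in the same units as sharpe_std.
    """
    z = NormalDist().inv_cdf
    return sharpe_std * (
        (1 - EULER_GAMMA) * z(1 - 1 / n_trials)
        + EULER_GAMMA * z(1 - 1 / (n_trials * math.e))
    )
```

The value grows with the number of trials, which is why a best-of-78 Sharpe needs a larger haircut than a best-of-10 Sharpe even if every individual test was run honestly.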

The sum of those four discounts is about -2.83 percentage points. Backtest CAGR of 15.33% minus 2.83 = roughly 12.5%. I round to "~13%" as my honest forward expectation. That's still meaningfully above SPY's ~10% long-run average and represents the genuine alpha I believe the strategy can deliver.
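The arithmetic above, written out so each discount can be changed independently. The figures are the ones quoted in the list; treating the discounts as simply additive follows the text and is itself an approximation.

```python
# Discounts in basis points of CAGR, as listed above
discounts_bps = {
    "R1000 reconstruction error": 50,
    "Multiple-testing pollution": 70,
    "Mid-cap execution slippage": 50,
    "SPY regime mean-reversion": 113,
}

backtest_cagr_pct = 15.33

total_discount_pct = sum(discounts_bps.values()) / 100
live_estimate_pct = backtest_cagr_pct - total_discount_pct

print(f"total discount: {total_discount_pct:.2f} points")
print(f"live estimate:  {live_estimate_pct:.2f}%")
```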

Lift attribution
Building the ~13% live expectation bottom-up. Start from the v0.6.3 baseline (10.60% in the walk-forward). Add 0.97% from going concentrated (Phase 16b). Add 2.94% from moving to R1000 (Phase 18). Add 1.32% from adding the insider layer (Phase 19). That gets to ~16% gross. Then subtract the 2.83 percentage points of combined discount factors above to arrive at ~13% as the realistic live expectation.

09What can actually happen, with probabilities

The numbers below are subjective probabilities calibrated against three reference points: SPIVA's annual report on how active funds fare against benchmarks (most lose over 10-year windows), Bessembinder's 2018 paper showing that 4% of stocks account for all of the long-run equity return above Treasuries, and Lopez de Prado's work on multiple-testing and backtest overfitting. They are estimates, not promises. Anyone who promises you a probability of beating SPY is selling something.

Scenario | Estimated probability
Beat SPY on Sharpe (risk-adjusted return) over a 10+ year window | 70 to 75%
Beat SPY on raw CAGR, pre-tax | 55 to 60%
Beat SPY on raw CAGR, after-tax, in a Roth or traditional IRA | 55 to 60%
Beat SPY on raw CAGR, after-tax, in a taxable brokerage account | 30 to 40%
Hit the original stretch goal of 15% CAGR or better | 25 to 35%
Underperform SPY by 5+ percentage points on annualized return | 10 to 15%
Catastrophic underperformance: 10+ points behind SPY | 5 to 8%

Why is the Sharpe-beat probability much higher than the CAGR-beat probability? Because Sharpe is more stable. SPY's annualized return swings widely based on where you measure (15% in some decades, 5% in others). The strategy's edge in risk-adjusted terms is more stable than its absolute return advantage. Beating SPY on Sharpe is a much more defensible claim than beating it on raw CAGR.

The account type matters more than any signal I could add. Weekly rebalancing with 10 mid-cap positions means most exits land inside the one-year holding period, which means short-term capital gains. In the US that's taxed at ordinary income rates, which can be 35%+ at the top federal bracket. In a taxable account that tax drag eats roughly 2 to 3 percentage points of annual return. In a Roth IRA or traditional IRA it eats zero. The same exact strategy will look like a clear winner in a Roth and a close-to-coin-flip in a taxable account. If you run this, run it in a tax-advantaged wrapper.
The biggest single variable in your actual outcome is your own behavior. The DALBAR studies have shown for thirty years that the average retail investor underperforms the funds they own by 3-4 percentage points per year, almost entirely because they sell after drawdowns and buy after rallies. This strategy will draw down 20% or more at some point in any multi-year holding period. That's not pathology; that's normal equity behavior. The rules-based system removes the temptation to override decisions, but only if you actually let the rules run. The model doing the right thing isn't enough on its own.

10Ten things I learned

If I had to compress 29 phases of work into ten transferable lessons for the next quant project, these would be them. Most are not original; I picked them up the hard way and verified each one by accidentally violating it first.

1. Where you fish matters more than what bait you use

Moving from S&P 500 to Russell 1000 added 2.94 percentage points of CAGR. No single signal I tested on the S&P 500 came close to that. If you can't find alpha, try a less-arbitraged universe before you try a cleverer signal.

2. Leading information beats lagging information, always

Every win in this project (insider buys, sector tailwind, R1000 mid-caps) is the strategy acting before the market fully prices something. Every failure (analyst revisions, PEAD, technical breakouts at weekly cadence) is the strategy reacting after. If your signal arrives after the price moves, you have no edge.

3. Multiple-testing pollution is not abstract

Phase 28's T5 looked like a +2.25% CAGR win and turned out to be 90% a normalization artifact. I came within one rerun of shipping it. The discipline of running every promising result through the rigor framework is the single most important habit in the whole project.

4. Anomalies decay once they're published

PEAD has roughly halved since 2000 as algo desks arbitraged it. Analyst revisions are effectively dead at retail cadence. If an effect was first documented in the 1980s and is in every textbook, assume it's been priced away and test before deploying.

5. Most "risk management" is already built in

The vol-targeting overlay failed not because it didn't work but because the per-position drawdown cap and the macro regime overlay had already eaten the marginal vol-management benefit. Adding more risk control on top of a system that already has it just costs return without buying Sharpe.

6. Knob tuning gives diminishing returns very fast

Phase 25 tested four hyperparameters around their current values. All four were already at the local maximum. The 5 to 10 basis points you might squeeze out by tuning further is dominated by the multiple-testing noise the additional tuning itself introduces. Time is better spent on a new signal than re-tuning an existing one.

7. Negative results are evidence too

Roughly 25 of the variants I tested across 29 phases failed. That's not wasted effort. Each failure tightens the boundary of where the alpha actually lives. After 25 rejections, the surviving v0.7-candidate is much more credible as a genuine local maximum than it would be after one cherry-picked win.

8. The backtest is always optimistic, in known ways

Universe reconstruction error, multiple-testing inflation, friction under-modeling, and regime-period normalization between them cost approximately 2.83 percentage points of CAGR vs forward expectations. Bake those four discounts into every backtest result you quote. The 15.33% becomes 12.5% becomes "roughly 13%" forward.

9. Sharpe is more defensible than CAGR

CAGR depends heavily on the specific window: SPY's number changes by 5+ percentage points depending on the decade. Sharpe is more stable. Quoting "v0.7-candidate has Sharpe roughly 1.2 vs SPY's roughly 0.85" is a more replicable claim than quoting any specific CAGR.

10. The last test is forward data, not more backtests

I've extracted approximately as much signal as a careful researcher with 29 phases of effort can extract from the historical record. The remaining unknown is whether the edge survives outside the training period in real conditions. The only way to answer that is to run it forward in paper trade, which is what I'm doing for the next 30 days before any promotion decision.