Cold Email A/B Testing: What to Test, How to Measure & Common Mistakes

Most cold email A/B tests fail not because the variables were wrong, but because the test was designed badly. Wrong sample sizes, testing multiple variables at once, stopping early when one variant looks good — these mistakes produce false conclusions that get baked into your sequences and silently kill performance for months. Done correctly, email split testing is one of the fastest ways to compound reply rates over a quarter.

Key takeaways

Test one variable at a time — subject line, opening line, CTA, or send time — never two simultaneously.
You need a minimum of 100 sends per variant; 200–250 is more reliable for detecting real differences.
Measure reply rate as your primary metric, not open rate: opens are inflated by Apple's Mail Privacy Protection (MPP) and are increasingly unreliable.
Run tests in a fixed order: subject line first, then opening line, then CTA, then send cadence.
Even a perfectly optimised email underperforms against a poorly targeted list — fix targeting before obsessing over copy.

What variables should you A/B test in cold email?

There are four variables worth testing in cold email, in roughly descending order of impact: subject line, opening line, call to action, and send timing. Everything else (font, signature format, small changes in email length) moves the needle so marginally that testing it is a distraction until the core variables are optimised.

Subject line

The subject line determines whether the email gets opened. Until your open rate is consistently above 35–40%, every other optimisation is irrelevant. Test subject line length (short vs. mid-length), personalisation style (company name vs. competitor reference vs. no personalisation), and framing (question vs. statement vs. referral hook). Keep each variant meaningfully different — testing "Quick question" vs. "Quick question for you" is not a real test.

Opening line

Once opens are healthy, the opening line determines whether the reader continues. The most impactful tests here are context-specific openers (referencing a hiring signal, a recent funding round, or a tool they use) versus generic benefit-led openers. According to Woodpecker's cold email benchmark data, emails with highly personalised first lines achieve reply rates 2–3x higher than those with generic openers — but the personalisation must be genuinely specific, not just a first-name merge tag.

Call to action

CTA tests tend to have a narrower effect than subject line or opening line tests, but they matter at the bottom of the funnel. Test single low-friction asks ("Worth a 15-minute call this week?") against open-ended questions ("Is this something you're thinking about right now?"). The former works better for warm lists; the latter often outperforms for colder contacts who need to feel discovered rather than sold to.

Send timing

Timing is the most overrated variable in cold email. Day-of-week and hour-of-day effects are real but small — typically a 10–15% difference in open rates at best. Test it last, after the message itself is optimised. The one finding that does hold consistently: Tuesday through Thursday mornings outperform Monday and Friday sends in B2B, but by a margin that rarely justifies rebuilding your sequence around it.

How do you measure cold email A/B test results correctly?

Reply rate is your primary metric. Open rate is secondary and increasingly unreliable as a standalone signal. Apple's Mail Privacy Protection, introduced in 2021, pre-loads tracking pixels regardless of whether a human opened the email — which means open rate data for any list with significant Apple Mail usage is inflated and partially fabricated.

Structure your measurement around a simple hierarchy: reply rate (primary), positive reply rate (secondary), meeting booked rate (outcome). A variant that produces more replies but fewer meetings isn't a winner — it's generating noise. Track all three.
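As a rough illustration of that hierarchy, the sketch below computes all three rates per variant from raw counts. The class, its field names, and the counts are made up for the example; a real sequencing tool will expose these figures under its own names.

```python
from dataclasses import dataclass


@dataclass
class VariantResult:
    """Raw counts for one variant; all field names are illustrative."""
    name: str
    sends: int
    replies: int
    positive_replies: int
    meetings_booked: int

    def rate(self, count: int) -> float:
        # Every rate uses sends as the denominator so variants stay comparable.
        return count / self.sends if self.sends else 0.0

    def summary(self) -> dict:
        return {
            "reply_rate": self.rate(self.replies),
            "positive_reply_rate": self.rate(self.positive_replies),
            "meeting_rate": self.rate(self.meetings_booked),
        }


# Made-up counts: B draws more replies but books fewer meetings.
a = VariantResult("A", sends=200, replies=14, positive_replies=9, meetings_booked=6)
b = VariantResult("B", sends=200, replies=20, positive_replies=8, meetings_booked=4)

for v in (a, b):
    print(v.name, {k: f"{x:.1%}" for k, x in v.summary().items()})
```

Reading the three numbers side by side makes the "more replies, fewer meetings" case visible at a glance, which a single reply-rate figure would hide.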

"We ran what we thought was a winning subject line test for six weeks before realising our 'winner' had a higher open rate but a lower reply rate. We'd been optimising for a metric that Apple had already broken."
— Head of Sales Development, 60-person B2B SaaS company

Use a dedicated A/B testing view in your sequencing tool (Outreach, Salesloft, Apollo, and Lemlist all support this natively). Avoid running tests manually by splitting your list in a spreadsheet — human error in list segmentation creates confounding variables that invalidate the result.

How many emails do you need for a valid cold email split test?

The practical minimum is 100 sends per variant. Below this, random variance dominates: flip a fair coin 50 times and you will see 30 or more heads roughly one time in ten, which tells you nothing about the coin. At 100 per variant, you can detect large differences (10+ percentage points in reply rate) with reasonable confidence. For smaller differences, which is what you're usually testing once the obvious wins are captured, aim for 200–250 per variant.
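To sanity-check a planned sample size, one option is the textbook normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test at 5% significance and 80% power; those defaults, and the function name, are conventional assumptions and are not taken from the figures above.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sends_per_variant(p_a: float, p_b: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sends per variant needed to tell reply rates p_a and p_b apart.

    Normal-approximation formula for a two-sided two-proportion test.
    alpha and power are assumed defaults, not values from the article.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    ) ** 2
    return ceil(numerator / (p_a - p_b) ** 2)


# A 10-point gap versus a 5-point gap from the same 5% baseline.
print(sends_per_variant(0.05, 0.15))
print(sends_per_variant(0.05, 0.10))
```

The requirement grows quickly as the gap narrows, and under these stricter assumptions it can exceed the practical minimums quoted above, which is a reason to treat those figures as floors.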

This has a real implication for SDRs at smaller companies: if your total weekly send volume is 150 emails, running an A/B test correctly takes two or more weeks. That's fine. Rushing to a conclusion with 40 sends per variant is worse than not testing at all, because a false conclusion gets locked in as "best practice."
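The arithmetic behind that timeline is simple enough to script. This is a minimal sketch that assumes an even split across variants and a steady weekly volume; the function name is illustrative.

```python
from math import ceil


def weeks_to_complete(per_variant: int, weekly_volume: int, variants: int = 2) -> int:
    """Whole weeks needed to reach the planned sends for every variant."""
    total_sends = per_variant * variants
    return ceil(total_sends / weekly_volume)


print(weeks_to_complete(100, 150))  # 200 sends at 150 per week -> 2
print(weeks_to_complete(250, 150))  # 500 sends at 150 per week -> 4
```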

A useful heuristic from Harvard Business Review's analysis of online experiments: the most common mistake in split testing across industries is stopping early. Teams see a variant leading after 30% of the planned sample and declare a winner — but early leaders reverse in roughly 40% of cases as sample size grows. Commit to your sample size before you start, not after you see early results.
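One way to see why early stopping misleads is to simulate tests where both variants are identical, so any declared "winner" is spurious by construction. The sketch below is an illustration of the mechanism only: the 6% reply rate, the checkpoints, and the trial count are arbitrary assumptions, and it does not attempt to reproduce the reversal figure cited above.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_significant(r_a: int, r_b: int, n: int, alpha: float = 0.05) -> bool:
    """Two-sided pooled two-proportion z-test with n sends per variant."""
    pooled = (r_a + r_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variation yet, nothing to test
    se = sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(r_a - r_b) / n / se
    return z > NormalDist().inv_cdf(1 - alpha / 2)


def false_positive_rates(trials=2000, n=250, rate=0.06,
                         peeks=(50, 100, 150, 200, 250), seed=7):
    """Simulate A/A tests and count how often a spurious winner appears."""
    rng = random.Random(seed)
    peeking = final_only = 0
    for _ in range(trials):
        r_a = r_b = 0
        flagged = False
        for i in range(1, n + 1):
            r_a += rng.random() < rate
            r_b += rng.random() < rate
            if i in peeks and z_significant(r_a, r_b, i):
                flagged = True  # a team that peeks would have stopped here
            if i == n:
                final_only += z_significant(r_a, r_b, n)
        peeking += flagged
    return peeking / trials, final_only / trials


peek_rate, final_rate = false_positive_rates()
print(f"winner declared at any peek: {peek_rate:.1%}")
print(f"winner declared at planned end only: {final_rate:.1%}")
```

Because the final checkpoint is also one of the peeks, the peeking rate can never be lower than the end-only rate; the gap between the two is the cost of checking early.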

What order should you run your cold email experiments?

Run tests in a fixed sequence based on where in the funnel each variable operates. Testing a CTA before your subject line is sorted is like optimising your close rate before fixing your no-show rate — you're working on the wrong layer.

The correct order:

1. Subject line: until open rate is consistently above 35–40%
2. Opening line: until reply rate is consistently above 5%
3. CTA and email body: to improve positive reply rate and meeting conversion
4. Send timing and cadence: marginal gains once the above are locked in
5. Follow-up sequence structure: number of touches, days between, tone changes

Document every test with the same format: what you tested, the hypothesis, sample size per variant, dates, and results. A shared testing log prevents your team from re-running tests that have already been answered, which is more common than it sounds in fast-moving SDR teams.
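A shared log can be as simple as one structured record per test. The sketch below mirrors the fields listed above; the class name, field names, and the sample entry are hypothetical and only show the shape of a record.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class ExperimentLogEntry:
    """One row in a shared testing log; field names are illustrative."""
    variable: str
    hypothesis: str
    sample_size_per_variant: int
    start_date: date
    end_date: date
    result: str


# Made-up entry, purely to show the format.
entry = ExperimentLogEntry(
    variable="subject line",
    hypothesis="A question-style subject beats a statement",
    sample_size_per_variant=200,
    start_date=date(2024, 3, 4),
    end_date=date(2024, 3, 22),
    result="no clear winner",
)

# default=str turns the date objects into ISO strings for JSON.
print(json.dumps(asdict(entry), default=str, indent=2))
```

Keeping every entry in the same shape makes it easy to search the log before proposing a test that has already been run.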

What are the most common cold email A/B testing mistakes?

The most damaging mistakes in cold email split testing are not technical — they're procedural. Teams skip them under time pressure and end up with a body of "learnings" that are statistically meaningless.

Testing two variables at once

If you change both the subject line and the opening line between variants A and B, and variant B wins, you have no idea which change caused it. Next time, you'll guess — and guesses compound into bad habits. One variable per test, no exceptions.

Using different audience segments for each variant

If variant A goes to your fintech contacts and variant B goes to your SaaS contacts, the result reflects audience difference, not message difference. Segments must be randomly split from the same audience pool. Most sequencing tools handle this automatically — make sure the feature is actually enabled before you start.
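If your tool does not split the audience for you, a random split from a single pool takes only a few lines. This is a minimal sketch with placeholder addresses; the function name and the fixed seed (used so the split can be reproduced) are assumptions for the example.

```python
import random


def split_randomly(contacts: list, seed: int = 42) -> tuple:
    """Shuffle a copy of the pool and cut it in half."""
    shuffled = contacts[:]  # leave the original list untouched
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Placeholder pool: every contact comes from the same audience.
pool = [f"contact_{i}@example.com" for i in range(200)]
variant_a, variant_b = split_randomly(pool)
print(len(variant_a), len(variant_b))
```

Shuffling before cutting matters: slicing an unshuffled list in half often separates contacts by source or date added, which reintroduces the audience difference the split was meant to remove.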

Optimising for the wrong metric

Open rate is easy to measure and tempting to optimise. It's also the metric least connected to revenue. A subject line that creates curiosity gaps can spike open rates while tanking replies because the email body doesn't deliver on the implied promise. Always trace your metric chain to a downstream revenue outcome.

Declaring a winner on partial data

Addressed above, but worth repeating: the single most common mistake. Decide on your sample size and end date before you start. Write it down. Stick to it.
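One way to enforce that commitment is to make the evaluation step refuse to answer until the planned sample is reached. The sketch below pairs that guard with a pooled two-proportion z-test; the function name, the return strings, and the 5% threshold are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def evaluate(replies_a: int, sends_a: int, replies_b: int, sends_b: int,
             planned_per_variant: int, alpha: float = 0.05) -> str:
    """Return a verdict only once both variants hit the planned sample size."""
    if min(sends_a, sends_b) < planned_per_variant:
        return "keep running: planned sample size not reached"
    p_a, p_b = replies_a / sends_a, replies_b / sends_b
    pooled = (replies_a + replies_b) / (sends_a + sends_b)
    se = sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    if se == 0:
        return "no difference detected"
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value >= alpha:
        return f"no clear winner (p={p_value:.3f})"
    return f"{'B' if z > 0 else 'A'} wins (p={p_value:.3f})"


print(evaluate(5, 60, 12, 60, planned_per_variant=200))
print(evaluate(10, 200, 30, 200, planned_per_variant=200))
print(evaluate(12, 200, 14, 200, planned_per_variant=200))
```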

Not controlling for list quality across variants

If one variant goes to companies that are a tighter ICP fit, it will outperform regardless of message quality. Random assignment solves this — but if you're building lists manually from different sources, audit them for ICP consistency before splitting.
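A quick audit is to compare the segment mix of each variant before sending. The sketch below assumes each contact is a dict with an `industry` key, which is a hypothetical schema; the counts are made up, and what counts as too large a gap is a judgment call.

```python
from collections import Counter


def segment_shares(contacts: list, key: str) -> dict:
    """Share of contacts in each segment, e.g. by industry."""
    counts = Counter(c[key] for c in contacts)
    total = sum(counts.values())
    return {segment: n / total for segment, n in counts.items()}


def max_imbalance(variant_a: list, variant_b: list, key: str) -> float:
    """Largest gap in any segment's share between the two variants."""
    shares_a = segment_shares(variant_a, key)
    shares_b = segment_shares(variant_b, key)
    segments = set(shares_a) | set(shares_b)
    return max(
        (abs(shares_a.get(s, 0.0) - shares_b.get(s, 0.0)) for s in segments),
        default=0.0,
    )


# Made-up lists built from different sources: A is 30% fintech, B is 60%.
list_a = [{"industry": "fintech"}] * 30 + [{"industry": "saas"}] * 70
list_b = [{"industry": "fintech"}] * 60 + [{"industry": "saas"}] * 40
print(f"largest share gap: {max_imbalance(list_a, list_b, 'industry'):.0%}")
```

A large gap suggests the lists should be pooled and re-split at random before the test starts.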

Why does list quality matter more than any single variable?

The most consistent finding from teams running structured outreach testing is that list targeting has a larger effect on reply rate than any copy variable. A mediocre email sent to a highly relevant list outperforms a brilliant email sent to a generic one. The implication for A/B testing: if your baseline reply rate is below 3%, the problem is probably targeting, not message. Fix that first.

The highest-performing lists share one characteristic: the contacts on them have a confirmed reason to care about what you're selling. Companies that are actively using a competitor product are the clearest example — they've already validated budget, understood the problem, and are familiar with the category. If you want to test cold email variables on a list that will actually detect signal rather than noise, this is the list to start with. Tools like Stealery let you search for companies using a specific competitor and filter by size, location, and hiring activity — so you can build a focused list before a single test begins, rather than discovering halfway through that your sample was too mixed to interpret.

The practical workflow: before any A/B test, audit your list for ICP fit. Remove contacts that are obviously wrong-size or wrong-industry. Run a quick deliverability check. Then run your test. The cleaner the list, the more clearly your test results will reflect the variable you're actually testing — and the faster you'll accumulate real learning instead of noise.

Frequently asked questions

What should I A/B test first in cold email?

Start with your subject line. It determines whether the email gets opened at all, which means every other variable is irrelevant until your open rate is healthy. Once open rates exceed 40%, move to testing your opening line and CTA.

How many emails do I need per variant?

Most practitioners use a minimum of 100 sends per variant, but 200–250 per variant is more reliable for detecting meaningful differences. With smaller lists, focus on effect size rather than statistical significance: a 5-point difference in reply rate is worth acting on even without perfect sample sizes, as long as you treat the result as provisional and re-test when volume allows.

How long should a cold email A/B test run?

Run the test until each variant has reached your planned sample size and at least 5–7 business days have passed, whichever takes longer. Stopping early because one variant looks good is the most common A/B testing mistake, because early leaders frequently reverse as sample size grows.

Can I test more than one variable at a time?

No. Testing more than one variable at a time makes it impossible to know which change caused the result. Change one element per test, run it to completion, record the winner, then move to the next variable.

What is a good reply rate for cold email?

A realistic benchmark for well-targeted B2B cold email is 5–10% reply rate. Highly targeted lists, such as companies already using a competitor, can reach 12–18%. Below 3% usually signals a targeting, deliverability, or messaging problem, not just a subject line issue.
