Most app A/B tests produce conclusions that aren’t valid. The test ended too early. Multiple things changed at once. The sample wasn’t large enough. The result looked significant but wasn’t.
Running a proper A/B test isn’t complicated, but it does require following a few rules consistently. Skip them and you’ll optimize based on noise.
Before you start: calculate your sample size
The most common mistake in app A/B testing is starting a test without knowing how long it needs to run to produce a trustworthy result. If you stop it early (because one variant “looks like it’s winning”), you dramatically inflate the chance of a false positive.
Before running any test, calculate the minimum sample size required. This depends on your current baseline conversion rate, the minimum effect size you want to detect (a 10% improvement? 20%?), your desired confidence level (most teams use 90–95%), and statistical power (80% is the common default). Free calculators (Optimizely, VWO, or any A/B test sample size calculator) give you this number in 60 seconds. If you need 5,000 users per variant and roughly 500 visitors per week reach each variant, the test needs to run for at least 10 weeks. Stopping it at week 3 is not a valid test.
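If you want to see what those calculators are doing under the hood, here is a minimal sketch using the standard two-proportion sample-size formula. The baseline rate, lift, confidence, and power values below are illustrative assumptions, not recommendations for your app.

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_variant(baseline_rate, relative_lift, confidence=0.95, power=0.80):
    """Minimum users per variant to detect `relative_lift` over `baseline_rate`."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)       # e.g. 5% baseline with a 10% lift -> 5.5%
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)                       # power requirement
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Illustrative only: 5% baseline conversion, detect a 10% relative improvement
print(sample_size_per_variant(0.05, 0.10))  # roughly 31,000 users per variant
```

Notice how sensitive the result is: roughly speaking, halving the lift you want to detect quadruples the sample you need, which is why tests chasing small improvements take so long.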
Test one variable at a time
This cannot be overstated. If you change your paywall headline, the pricing layout, AND the CTA button color simultaneously, you cannot know which change caused any movement in conversion. You might keep a combination that’s held back by one poor element.
One variable per test. Always. The exception is a “big bang” test where you compare a completely different design to the current one — but in that case, you’re making a broad directional decision, not learning what specifically to optimize.
Run tests for a minimum of 30 days
Even if you hit your sample size before 30 days, keep the test running. You need to capture natural variation: weekday vs. weekend behavior, acquisition sources whose users arrive on different days, and any weekly marketing cycle that affects who’s in the app.
30 days is the minimum. For seasonal products, consider running through both a peak and off-peak period before concluding.
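To sanity-check a planned run length, here is a small sketch (reusing the hypothetical numbers from the earlier example) that converts sample size into whole weeks and enforces the 30-day floor:

```python
from math import ceil

def test_duration_days(users_per_variant, weekly_users_per_variant, min_days=30):
    """Days needed to hit the sample size, in whole weeks, never below the 30-day floor."""
    weeks = ceil(users_per_variant / weekly_users_per_variant)  # full weeks capture weekday/weekend swings
    return max(weeks * 7, min_days)

# The example from above: 5,000 users per variant, ~500 users reaching each variant per week
print(test_duration_days(5_000, 500))  # 70 days, i.e. the 10-week test, not 3
```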
What can you test on each platform?
App Store (iOS): Screenshots, app icon, and preview video only — no text. One experiment at a time. Tests run up to 90 days. You choose what percentage of traffic sees each variant.
Google Play: Screenshots, icon, short description, long description, and feature graphic. Up to 5 experiments simultaneously — a significant advantage that lets you run parallel tests on different elements. Results report first-time installers and whether those installers keep the app (retained installers).
Never apply results cross-platform. An icon that lifts downloads on Android may hurt them on iOS. User behavior differs between the platforms. Test each independently.
How to read your results properly
A test showing “no significant difference” overall may hide real results in specific segments. Segment your A/B results by platform (iOS vs. Android), acquisition source (organic vs. paid vs. social), and user geography (results can differ dramatically by market).
A test that shows −2% overall but +15% for your top acquisition source is not a null result. Treat “no significant difference” as useful information too: it tells you that element isn’t what’s driving or hurting conversion. Move on and test something higher-leverage.
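As a sketch of what that segment-level readout can look like, the snippet below runs a two-proportion z-test per segment with statsmodels; the segment names and counts are invented for illustration and would normally come from your analytics export.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical per-segment counts: (conversions_A, users_A, conversions_B, users_B)
segments = {
    "iOS / organic":  (420, 9_800, 455, 9_750),
    "iOS / paid":     (130, 4_100, 118, 4_050),
    "Android / paid": (260, 6_900, 310, 6_950),
}

for name, (conv_a, n_a, conv_b, n_b) in segments.items():
    lift = (conv_b / n_b) / (conv_a / n_a) - 1                    # relative change of B vs. A
    _, p_value = proportions_ztest([conv_a, conv_b], [n_a, n_b])  # two-proportion z-test
    print(f"{name:15s}  lift {lift:+.1%}  p = {p_value:.3f}")
```

One caution: the more segments you slice, the more likely one of them looks significant by chance. Treat a segment win as a hypothesis to confirm with a follow-up test on that segment, not as a final verdict.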
What to test first (prioritized by impact)
1. First screenshot or paywall screen (highest conversion leverage)
2. CTA button text on the paywall
3. App icon (high visibility, affects click-through before the listing)
4. Paywall headline
5. Screenshot order
6. Description first sentence (especially on Google Play, where it’s indexed)