A/B testing: a comprehensive guide to getting started

Over the past few years there’s been quite a bit of hype regarding A/B testing. From Obama’s election campaign to Veggie Tales increasing revenue by 38%, I believe the hype is well warranted, and that A/B testing belongs in every marketer’s toolkit.

I’ve had the opportunity to set up, instrument, and analyze numerous A/B tests for companies in different industries. I’ve made plenty of mistakes, and found some winners along the way. It’s been an eye-opening experience, and the purpose of this guide (yes, it’s going to be a long one, ~4000 words) is to give you the most comprehensive A/B testing resource online (well, at least that’s the goal). I’ll cover everything from the very basics, including organizational structure, to best practices (and things you should avoid).

What is A/B Testing?

There are plenty of scientific descriptions of A/B testing, but I’ll begin with an example that hopefully you can grasp quickly.

You’re in a long line at a grocery store waiting to make a purchase. You notice that the cashier flips a coin for each person checking out, giving a $10 gift card to people who land on “heads.” Those who land on “tails” get a $10 coupon booklet.

In this scenario the grocery store is running a promotion, and isn’t sure what will drive the most repeat purchases. They decide to take a sample of customers and offer two different promotions.

In a few weeks, they will measure how many customers came back and made a purchase, and whether the gift card or coupon booklet drove more sales. The one that drove the most sales will be implemented on a more regular basis.

I understand that this example is extremely basic, but A/B testing on the web is nearly identical. The best definition I found is from a Wired article:

Using A/B, new ideas can be essentially focus-group tested in real time: Without being told, a fraction of users are diverted to a slightly different version of a given web page and their behavior compared against the mass of users on the standard site. If the new version proves superior—gaining more clicks, longer visits, more purchases—it will displace the original; if the new version is inferior, it’s quietly phased out without most users ever seeing it.
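To make the "fraction of users are diverted" part concrete, here is a minimal sketch of the web version of the cashier's coin flip. It assumes a hash-based split, which is one common approach and not something the Wired article describes; the function name `assign_variation`, the experiment name, and the user IDs are all hypothetical.

```python
import hashlib


def assign_variation(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'baseline' or 'variation'.

    Hashing (experiment, user_id) stands in for the coin flip: the same
    user always gets the same answer for a given experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "variation" if bucket < treatment_share else "baseline"


# Hypothetical traffic: 10,000 made-up user IDs.
counts = {"baseline": 0, "variation": 0}
for i in range(10_000):
    counts[assign_variation(f"user-{i}", "gift-card-vs-coupons")] += 1
print(counts)
```

The point of hashing instead of calling a random number generator on every page view is that a returning visitor keeps seeing the same version, so their behavior is counted against a single variation.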

Why most companies don’t run A/B tests

I’ve seen the positive results of successful A/B tests, and to be honest, I’m a bit confused why more companies don’t do it. No company wants to leave money on the table, so it has to be something much deeper. Here’s my theory:

A/B testing sits on the fence between marketing and engineering.

Marketing teams are notorious for their lack of understanding around technical requirements, and engineering teams have trouble with marketing. As a result, something very technical in nature (A/B testing) is avoided by marketing departments, while engineering teams run experiments, only to become disheartened by the results. (P.S. – constantly setting up experiments is a significant time investment.)

Implementation Woes

To add to the confusion, there are questions that must be answered around implementation. Should the company dedicate resources and build something internally? Should they use “off-the-shelf” software, or an open source library?

Numerous options lead to analysis paralysis, and give companies another reason to avoid running A/B tests. The good news is that it’s not nearly as bad as you think… I’ll help answer these questions later in the post. Next, let’s talk about why you should even care about A/B testing.

Why you should care about A/B testing

There’s a variety of benefits to A/B testing, but I’ve distilled my top reasons into the following:

If you have customers who arrive on your website, sign up, and then pay you money, A/B testing is a surefire way to make more money. There’s very little ambiguity around what path visitors take to sign up and pay, which enables marketers to run various experiments to improve conversion (and revenue).

Here’s an example that I come across quite often:

Giggle.ly is an online store that spends $10,000/month on Google AdWords, which drives $20,000 in sales. They’ve tapped out all the relevant keywords, and the conversion rate is 2%.

Giggle.ly has an extra $5k/mo to spend, and they have two choices:

Try other forms of advertising (e.g. – Facebook ads)
Invest company resources to start running A/B tests

Which option is riskier?

From my point of view, running Facebook ads is more of a risk. It could work, but you already know the following:

Google AdWords can drive purchases
Customers convert at 2%

Let’s say Giggle.ly decides to invest resources into A/B testing, and gets the conversion rate up to 4%. The company is now making $40,000 in sales, an increase of $20k/mo. Over the course of a year, if everything holds consistent, that’s $240,000 in additional revenue.
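If you want to plug in your own numbers, the arithmetic above can be sketched in a few lines. This assumes traffic and average order value stay the same, so sales scale in proportion to the conversion rate; the function name `monthly_sales` is just for illustration.

```python
def monthly_sales(baseline_sales: float, baseline_rate: float, new_rate: float) -> float:
    """Scale sales in proportion to the conversion rate.

    Assumes traffic and average order value are unchanged, so sales
    move one-for-one with the conversion rate.
    """
    return baseline_sales * (new_rate / baseline_rate)


baseline = 20_000.0  # dollars per month, from the Giggle.ly example
after = monthly_sales(baseline, baseline_rate=0.02, new_rate=0.04)
monthly_gain = after - baseline
annual_gain = monthly_gain * 12

print(f"Monthly sales after the lift: ${after:,.0f}")
print(f"Extra per month: ${monthly_gain:,.0f}")
print(f"Extra per year:  ${annual_gain:,.0f}")
```

Swap in a more modest target (say 2% to 2.5%) to see what a smaller, more typical win would be worth.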

Less risk when making big decisions

Another major benefit to A/B testing is the ability to answer tough questions. By setting up reversible changes, you can quickly test if these changes will help or harm the business. Here are a few examples of big questions that can be answered with these tests:

Packaging – How should we offer our products/services? What should pricing tiers look like?
Price Testing – what price point resonates with users?
Design – will this new design increase or decrease the conversion rate?
Promotions – will a 10% off coupon or free shipping encourage more purchases?

In the past, companies would set up focus groups and run surveys. While this isn’t a terrible approach, A/B testing has a couple of significant advantages:

Data – running A/B tests allows you to quantify how much better (or worse) a proposed change is.
Context – focus groups and surveys typically fail to reach users at the right time (or in the users’ natural “habitat”). For example, I could ask someone in a focus group if they would be willing to pay $35/mo, or instead run an A/B test on live traffic and measure results. The data from the A/B test is more accurate because the experience is much more real.

HiPPOs

How many times have you seen companies make changes, only to realize that their customers absolutely hate them? The classic story that comes to mind is the attempted Netflix company split, where they planned to spin off the DVD rental service under the name Qwikster, while keeping the streaming service as Netflix.

In hindsight it’s obvious what the correct decision should have been, but leaders of companies have to make tough decisions. If a terrible decision is made, the blame is placed on the CEO.

A/B testing should be the CEO’s best friend. It helps remove the risk associated with big changes, and has the power to turn long planning meetings into a set of actionable experiments that can be implemented quickly.

A/B Testing creates a mindset of nimbleness

In industries with stiff competition, agility is a competitive advantage.

If company #1 has a “waterfall-mindset”, it could take months to implement changes vs. company #2, who can quickly run tests, learn, and implement/remove the change.

I won’t harp on this point – it’s pretty self-explanatory. A/B testing shapes a mindset of constant experimentation, something big businesses desperately need.

Organizational Setup

If I’ve convinced you that A/B testing is worth your time, the next step is to begin implementing it within the organization. While each situation (and company) is different, there are a few “best practices” I encourage you to follow:

Have an A/B testing “champion” within the company

I honestly don’t care if this person is an engineer or a marketer – make this person do everything they can to get the company running tests. Ideally this person should understand the immense value that can be delivered by testing, and hopefully is excited at the opportunity to make an impact on the bottom line.

Why one person?

If A/B testing were easy from an organizational perspective, you would probably already be running tests. Because you aren’t, it’s important that you make it as lightweight as possible. Hence, a single person.

Oh wait…

While I suggest having a single person be the point of contact, I do advise that the organization bring other stakeholders into the discussion as needed. For example, if a marketer is running tests, it would be a great idea to have an engineer available when more technical questions arise. Long story short, if you haven’t run a test before, have one person be responsible for making this happen, and as this becomes a habit, involve more people in the discussion.

Leadership must be on board

Leadership must be open to the changes your tests introduce. I suggest stressing the following:

These are reversible changes
Experiments are run on a subset of user-base/visitors
A/B testing improves decision-making
Optimization of known channels
“Is there anything you’d like to test Mr. CEO?” <- encourages involvement

Priority: time to first quick win

Your goal as the A/B testing champion is to get the first test up and running as quickly as possible. That’s why I suggest Optimizely or Visual Website Optimizer as your A/B testing software. Add one line of code, push it, and create your first test. This shouldn’t take very long.

Too often companies get caught up in small details. I understand this concern, but you need a quick win. For this first test, even something super basic, like changing the text on a call-to-action, will do. I typically don’t suggest small changes like this, but you need to remove the mental blockage to running the first test. Kill the mindset of “this is really tough and will take weeks of our time to instrument.”

In the future, you might want to invest significant time/resources into building your own internal solution, but to start things off, focus on getting the first test up and running. I can’t stress this enough.

A/B Testing Setup

I see three major categories of A/B testing solutions:

Internal
Open Source
Third-party

**Internal**
Creating a solution internally will take a significant amount of time and company resources. Typically larger companies invest the time, because they have very specific use-cases. The good news is that if you have the resources, you can customize this as you see fit.

**Open Source**
Open source can be a great option. Many packages like waffle (for Django) and split (for Rails) do a fantastic job at a very low cost. The downside is that the project may become stagnant. There’s a bit of ambiguity, but I highly recommend using an open source package. Don’t reinvent the wheel.

**Third-Party**
I’m a big fan of third-party tools like Optimizely & Visual Website Optimizer (if you couldn’t tell already). I suggest them because they break down the barriers to setting up and running tests. That’s probably the biggest challenge companies face. The other two options will require more engineering resources.

Conversion Rate Experts dives into specific packages if you’re interested in something else.

Create structure from the beginning

Structure and process around setup/reporting is the toughest part of running A/B tests. First, you should create a repository of experiments that you’d like to test, and secondly, you want to create a library of tests that you have run (with the key metrics included.)

Give yourself time to develop proper structure. It will serve you well in the future. I don’t want to introduce complexity in initial setup, but keep this in the back of your mind. If you are interested in the specific spreadsheets that I use, let me know by entering your email at the bottom of the post.

A/B Testing Best Practices

Now that it’s time to start running A/B tests, in this section I’m going to cover some best practices you should be aware of. I hope this next section isn’t overwhelming…

Qualitative feedback should be the bedrock

When running A/B tests, make qualitative feedback the foundation of your testing strategy. It’s easy to ask fellow coworkers for ideas of what to test, and while this certainly can work, it’s not nearly as effective as turning customer insight into fodder for tests.

I suggest Qualaroo for this – this is hands down the most effective way to solicit feedback at the optimal time. I also suggest using a service like Intercom for “personalized” emails.

A/B testing is like baseball

I’ve never played a game of baseball in my life, but a baseball player doesn’t hit a home run every time the ball is thrown at him. While your goal should be to set up winning tests, you should not expect to constantly hit home runs. This is simply unrealistic.

Most people won’t tell you this, but A/B testing can be very frustrating. It’s not fun to implement experiments and watch them underperform the baseline experiment, and that’s why I suggest using qualitative customer feedback. It increases your chances of hitting home runs.

A/B Testing is also like investing

Why do people suggest you start saving up for retirement in your 20s? Because of compound interest.

A/B testing is very similar – small incremental improvements over time add up in a big way. There are very few experiments that will increase conversions by huge amounts… these are outliers that many A/B testing articles tout. I highly advise against this “get rich quick” mentality; instead, focus on making consistent improvements over time.
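Here is a rough sketch of how that compounding plays out. The numbers are assumptions for illustration only: a 2% starting conversion rate, and a string of winning tests that each add a 5% relative lift that stacks with the previous ones.

```python
def compounded_rate(start_rate: float, lift_per_win: float, wins: int) -> float:
    """Conversion rate after `wins` successive relative lifts of `lift_per_win`."""
    return start_rate * (1 + lift_per_win) ** wins


start = 0.02  # assumed 2% baseline conversion rate
for wins in (1, 3, 6, 12):
    rate = compounded_rate(start, lift_per_win=0.05, wins=wins)
    print(f"{wins:>2} wins of +5%: {rate:.2%} ({rate / start - 1:+.0%} overall)")
```

Twelve modest wins multiply out to roughly an 80% overall improvement, which is the investing analogy in practice. In reality lifts rarely stack this cleanly, so treat this as an upper-bound illustration and not a forecast.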

“I don’t have many visitors, should I still A/B test?”

Over the course of my career, I’ve been fortunate to work with companies that have huge amounts of visitor traffic. I haven’t had to decide if A/B testing is worth it, so please take the following advice with a grain of salt.

First, if you have a tiny trickle of traffic, I don’t think it’s worth your time to A/B test. There are more important things you should work on. A/B testing is an optimization, and not a magical unicorn that can make customers appear out of thin air.

Let’s say you can increase conversion by 3x. If you are only receiving 5 signups/day for your app, is 15 signups/day going to revolutionize your business? If so, then maybe it’s worth your time to run tests, and if not, you should probably focus on something more important. P.S. – expect to spend lots of time running tests, so make sure you factor in your time creating, running, and reporting on tests.

One step further: Segmentation

Here’s something I see every day. You’re selling a product primarily to the US, and decide to run a test. The overall conversion rate doubles, but when you segment by country, you find that the conversion rate for US traffic dropped like a rock.

I recommend that you do further segmentation, whether that’s by new/returning visitors, countries, or traffic source. Don’t be afraid to dig deeper.
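A small sketch of what this looks like in the data. The rows below are made up to reproduce the situation described: the variation doubles the overall conversion rate while the US segment gets worse. In practice the rows would come from your analytics tool.

```python
from collections import defaultdict

# Hypothetical per-segment results: (variation, country, visitors, conversions).
rows = [
    ("baseline", "US", 1000, 40),
    ("baseline", "Other", 1000, 10),
    ("variation", "US", 1000, 20),
    ("variation", "Other", 1000, 80),
]


def conversion_rates(rows, by_country: bool):
    """Return {key: conversion rate}, keyed by variation or (variation, country)."""
    visitors = defaultdict(int)
    conversions = defaultdict(int)
    for variation, country, n, c in rows:
        key = (variation, country) if by_country else variation
        visitors[key] += n
        conversions[key] += c
    return {key: conversions[key] / visitors[key] for key in visitors}


overall = conversion_rates(rows, by_country=False)
segmented = conversion_rates(rows, by_country=True)

print("Overall:", overall)
print("By country:", segmented)
```

If the US is where your revenue comes from, the "winning" variation here is actually a loser, and you only find that out by slicing the results.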

Track the entire funnel

There are a few times when it’s important to test options (e.g. – choosing a pricing option) and then measure the CTR for a particular step, but when A/B testing, it’s critical that you track the entire funnel.

Fortunately, software companies offer some great integrations that make this really easy. I personally use Optimizely + Mixpanel, but you could use KISSMetrics or Google Analytics too.

The major benefits of using Mixpanel or KISSMetrics are the following:

I can track specific steps of the funnel (I’m pretty sure you can’t see this in GA)
With Mixpanel, the test variation is set as a super property, which means that I can segment these people into cohorts for further analysis. It’s a global property stored on the user.
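To show why tagging every event with the test variation matters, here is a tool-agnostic sketch. It does not use the Mixpanel or Optimizely APIs; the event log, step names, and user IDs are hypothetical stand-ins for what those tools would export.

```python
from collections import defaultdict

FUNNEL = ["visit", "signup", "payment"]  # hypothetical step names

# Hypothetical event log: (user_id, variation, step reached).
events = [
    ("u1", "baseline", "visit"), ("u1", "baseline", "signup"),
    ("u2", "baseline", "visit"),
    ("u3", "baseline", "visit"), ("u3", "baseline", "signup"), ("u3", "baseline", "payment"),
    ("u4", "variation", "visit"), ("u4", "variation", "signup"),
    ("u5", "variation", "visit"), ("u5", "variation", "signup"),
    ("u6", "variation", "visit"), ("u6", "variation", "signup"),
]


def funnel_counts(events):
    """Count distinct users per (variation, step)."""
    users = defaultdict(set)
    for user_id, variation, step in events:
        users[(variation, step)].add(user_id)
    return {key: len(ids) for key, ids in users.items()}


counts = funnel_counts(events)
for variation in ("baseline", "variation"):
    steps = {step: counts.get((variation, step), 0) for step in FUNNEL}
    print(variation, steps)
```

In this toy data the variation wins on signups but nobody in it pays. If you only measured the signup step, you would ship the wrong version.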

Drive Business Value

This next section is more of a reminder, but make driving real business value a priority when testing. Getting signups is cool, and I guess it’s a key metric for Pinterest, but I encourage you to make revenue a priority. My goal wherever I work is to generate more value (money), than I consume as an employee.

A/B testing is one of the few areas where you can easily attribute a dollar amount to the work you do. Typically most business activities have many moving parts and stakeholders, so at the very least, keep track of the money you generate for the company 😉

Stop putting lipstick on a pig

This might be the worst analogy ever, but if I spent an afternoon dressing up a pig and tried to parade it around town….it would still be a pig.

Likewise, you can start optimizing an underperforming funnel and make small improvements to the tagline, or change the color of the call-to-action. It might add a small lift, but it’s still a terrible funnel.

In A/B testing, making these small improvements to a terribly performing funnel is known as optimizing for the local maximum. Instead, you should be optimizing for the global maximum. In other words, don’t be afraid to try experiments that are wildly different.

A/B Testing for B2B SaaS

This is something I will be exploring more in the upcoming months, but the lifeblood of many B2B SaaS companies is the lead generation form. This is a bit different, because sales typically gets involved after the form has been filled out, so typically the funnel gets much harder to track (not all sales people close deals at the same rate.)

Let’s work through an example.

B2B Landing Page Experiment Example Results

Baseline – 10 leads/week
Variation #1 – 20 leads/week

Which test wins? This is a trick question.

You can’t tell from these metrics. Lead quality is critical. What happens if those 20 leads are terrible compared to the 10 leads from the baseline? We need to track this. Since Salesforce (don’t get me started on how much I hate Salesforce) is the industry standard, we need to pass experiment information into Salesforce. Right now, there’s an Optimizely/Bizible integration that makes this happen – it’s something I will be digging into shortly.

Depending on how quickly deals are closed, it’s very likely that you won’t have a clue which variation won for months. I wouldn’t get too wrapped up in the small details – instead I would just use variation information to distinguish the winner based on the information submitted.
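As a sketch of why lead count alone can mislead, here is the same experiment with some made-up downstream numbers attached. The lead counts are from the example above; the closed deals and revenue figures are hypothetical and would, in practice, come out of the CRM months later.

```python
from dataclasses import dataclass


@dataclass
class VariationResult:
    name: str
    leads: int
    closed_deals: int  # hypothetical, would come from the CRM
    revenue: float     # hypothetical closed revenue in dollars

    @property
    def close_rate(self) -> float:
        return self.closed_deals / self.leads if self.leads else 0.0

    @property
    def revenue_per_lead(self) -> float:
        return self.revenue / self.leads if self.leads else 0.0


results = [
    VariationResult("Baseline", leads=10, closed_deals=3, revenue=15_000.0),
    VariationResult("Variation #1", leads=20, closed_deals=2, revenue=10_000.0),
]

winner_by_leads = max(results, key=lambda r: r.leads)
winner_by_revenue = max(results, key=lambda r: r.revenue)

for r in results:
    print(f"{r.name}: {r.leads} leads, close rate {r.close_rate:.0%}, "
          f"${r.revenue_per_lead:,.0f} per lead")
print("Winner by leads:", winner_by_leads.name)
print("Winner by revenue:", winner_by_revenue.name)
```

With these invented figures, the variation that doubled the leads loses on revenue, which is exactly the trap in the trick question.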

Advanced: You should probably avoid “true” multivariate tests

There’s a difference between A/B tests and multivariate tests. An A/B test seeks to find a meaningful difference between versions of a single variable. The classic example here is the simple change of a button color. The color is the variable.

Let’s say you want to test the color AND the call to action (“Get started” vs. “Free Trial”) – you’ve now introduced 2 variables (color and call-to-action text), which means that you need to run 4 different variations in order to accurately find a winner. You also need a lot more traffic to find a significant result.
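A quick sketch of where the 4 comes from, and how fast it grows. The button colors are made up; the call-to-action texts are the ones from the example.

```python
from itertools import product
from math import prod

variables = {
    "button_color": ["green", "orange"],        # hypothetical colors
    "cta_text": ["Get started", "Free Trial"],  # from the example
}


def variation_count(variables: dict) -> int:
    """Number of variations in a full multivariate test: the product of option counts."""
    return prod(len(options) for options in variables.values())


variations = [dict(zip(variables, combo)) for combo in product(*variables.values())]
for v in variations:
    print(v)
print("Total variations:", variation_count(variables))
```

Add a third variable with three options (say, three headlines) and you are at 12 variations, each of which needs enough traffic on its own to reach a significant result.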

It gets a little technical, but my point is that you should probably stick to A/B testing. But there’s one “caveat.”

If you are trying a wildly different experiment (for example, a long-form landing page vs. a short, simple landing page that looks completely different), you are going after a “big learning” and sacrificing the precise knowledge of what caused the experiment to win (e.g. – “Get started” beats “Free Trial”).

This is super confusing, but my point is that you should stick to two kinds of tests most of the time:

Single-variable A/B tests (intro headline, call-to-action, etc.)
Wildly-different tests (here’s an awesome example)

TL;DR – if you want to learn if a specific change made an impact, run a true A/B test; these changes typically yield small increases/decreases in your conversion rate.

If you have no idea if your long-form sales page would win against a shorter page, run wildly different tests; these are typically where the home runs happen, but you also have a high probability of striking out.

Worst Practices

I’m leaving this section for the end because I read horror stories every day. I beg you to heed the following pieces of advice (coming from someone who has learned the hard way.)

Don’t let time become a variable

Let’s say you want to test a drip email campaign that’s sent to new signups. You decide to change the subject line and compare the new data with the data from last week. DON’T DO THIS.

It’s crucial that you not introduce time as a variable. I’ve done this before, and I see this happen all the time.

If you’re going to properly run an A/B test, you must have a baseline, and you must have a variant running at the same time. If not, time has now become a variable.

Read articles on A/B testing? Treat them as inspiration

Nothing irks me more than reading an article about how XYZ company increased their conversion rate by 6000%. First, most of these people don’t disclose things like sample size, so for all I know, the test could have been run on 30 people.

Would you like an example?
See this article where someone ran an A/B test on a total of roughly 120 visitors (30 visitors in the baseline & 88 in the variation), with a total of 13 conversions. (Update: Brian Lang pointed out that the testing software in this example used a multi-armed bandit test, so while it’s not as bad as I originally thought, you still need a larger sample size.)

Secondly, why should I care? I’m really happy that you hit a home run when testing, but there’s honestly very little that is relevant to my situation. These articles on A/B testing do provide some ideas for experiments you can run, just be very skeptical about the “winning results.”

Don’t expect tiny changes to produce massive wins

Don’t listen to people who hype making tiny changes – 98% of the time, it does not lead to a statistically significant winner. Here’s what usually ends up happening:

Baseline – CTA: “Signup now”
Variation #1 – CTA: “Try it now”

2000 visitors later, the results look something like this:

Baseline: 50 conversions
Variation #1: 52 conversions

I’ve seen this happen so many times – if you have traffic to burn, try this strategy. If you don’t have a huge influx of traffic, then don’t be afraid to try wildly different experiments.
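If you want to check a result like this yourself, here is a minimal sketch using a standard two-proportion z-test. It is a textbook method and not necessarily what your testing tool computes. One assumption to flag: I’m treating the 2000 visitors as an even split, 1000 per variation.

```python
from math import erfc, sqrt


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Assumption: the 2000 visitors were split evenly, 1000 per arm.
z, p = two_proportion_z_test(conv_a=50, n_a=1000, conv_b=52, n_b=1000)
print(f"z = {z:.2f}, p = {p:.2f}")
```

The p-value comes out far above the conventional 0.05 threshold, so a 50 vs. 52 split is indistinguishable from noise at this sample size.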

Statistical Significance is Significant

I’m not a math guy, but most A/B testing solutions are a bit misleading. Many people see that a specific variation has a 70% chance of beating the baseline, and decide to stop the test early. Evan Miller writes about why this is bad; here’s a quick excerpt:

If you run experiments: the best way to avoid repeated significance testing errors is to not test significance repeatedly. Decide on a sample size in advance and wait until the experiment is over before you start believing the “chance of beating original” figures that the A/B testing software gives you. “Peeking” at the data is OK as long as you can restrain yourself from stopping an experiment before it has run its course. I know this goes against something in human nature, so perhaps the best advice is: no peeking!

He also has a tool to make sure you don’t fall into this trap. There’s also another great tool to determine statistical significance.
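To give a feel for what "decide on a sample size in advance" involves, here is a rough sketch using a standard textbook approximation for comparing two proportions. It is not necessarily the formula behind the tools mentioned above, and the defaults (5% significance, 80% power) are conventional assumptions, not recommendations from the excerpt.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(baseline_rate: float, target_rate: float,
                              alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variation to detect baseline -> target.

    Uses the usual normal-approximation formula for a two-sided test
    comparing two proportions with equal group sizes.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline_rate, target_rate
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical scenarios: a 5% baseline with a large and a small hoped-for lift.
print("5% -> 6%:  ", sample_size_per_variation(0.05, 0.06))
print("5% -> 5.2%:", sample_size_per_variation(0.05, 0.052))
```

Notice how quickly the required traffic grows as the effect you hope to detect shrinks. That is the math behind why tiny changes rarely produce a clear winner.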

Deep Dives

Interested in learning more about A/B testing? I’ve included my favorite resources you should check out:

Patrick McKenzie Opticon 2014: Advanced A/B Testing
Evan Miller’s Blog (especially if you like math)
Optimizely Blog
Experiment!
Which Test Won? (good for inspiration)
The Math Behind A/B Testing
Determining Statistical Significance – 37 Signals

Wrapping things up

If you made it to the end of this post, thanks 🙂

If I could communicate anything, it would be that A/B testing is a wonderful (and not that risky) way to grow your business. If you aren’t A/B testing, you should convince your boss it’s important, and if you are, make sure you are doing it right.

Depending on the response to this post, I’ll create a more advanced guide (hands-on stuff) where I’ll dig into tactics like setting up experiments with Optimizely. If that interests you, I’d love it if you entered your email below. **I’ll share spreadsheets as well as hands-on screencasts.**