Are you discouraged by unexpected experiment outcomes? Do repeated experiment failures leave you feeling that your ideas must be inaccurate or plain wrong?
We have seen this pattern across start-ups, and we know it's an awful feeling!
Most managers are good at formulating innovative hypotheses but far less practiced at navigating the challenges of running randomized trials in a hyper-competitive business landscape.
“The formulation of a problem is often more essential than its solution, which may be merely a matter of mathematical or experimental skill” – Albert Einstein.
In the last post, we discussed one of the most familiar yet most misunderstood topics in start-ups: experimentation. In this post, we will dig into the details of why experiments fail.
Though start-ups are keen to test their hypotheses from time to time, the outcomes rarely match expectations. This post will help you understand what not to do in experimentation, and why. While the challenges that sabotage experiment outcomes are innumerable, let's focus on the four most common ones.
Why Experiments Fail
Let's continue with the freemium dating application example from the previous post. A user can browse profiles on the app, match, connect, and eventually go on a date. There are two types of users: those on the free plan and those who upgrade for advanced features. Free registration limits the number of profiles you can interact with per day, while a paid plan gives you unlimited access and other features. The app asks every new user a set of questions (age, gender, interests, and so on) to recommend matching profiles.
Now, with all these features in place, let’s examine a few challenges in experimentation.
Let's assume the app's product manager has a hypothesis for improving profile recommendations: add one more onboarding question asking for the user's zodiac sign, and use zodiac compatibility to make better profile recommendations.
Since the hypothesis sounded excellent in theory, the team decided to launch a test.
After the feature launched, they observed no difference between the recommendation engines with and without zodiac compatibility. The test was not fruitful because it rested on a weak hypothesis: there was no prior evidence suggesting that zodiac compatibility actually predicts match quality.
Teams sometimes plan to get directional results from an initial experiment before making significant changes to the application, then follow up with a second test for the larger change or a full factorial design. But going in with a weak hypothesis is unlikely to give you any signal; you will have wasted development and design hours and will be unable to follow up with subsequent propositions. So that's the first do not: don't test a weak hypothesis.
The team was also not precise about what they were looking for in their hypothesis, i.e., there was no well-defined success metric. Suppose the match rate improved because the algorithm worked well. Users who find good matches may leave the app sooner, so the total match count could actually go down. In that scenario, if someone asks about the result of the experiment, the typical answer is, "I am not sure. Some indices went up, and some went down. I don't know whether zodiac sign compatibility will improve the business."
So that's our second do not: don't experiment without a clearly defined success metric. One way to think about this is to ask, "If I implement this change, what metric do I genuinely want to influence?"
Ideally, an experiment should have at most two success metrics, and the test should be designed around both of them. Without a clear success metric, you can get lost among the various indices. People often think, "This is a complex test. Some things will go up, and others will go down. How do I test it?" One way to deal with this is to create a single combined metric that you clearly need to move up or down.
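As an illustration, a combined metric can be as simple as a weighted sum of the sub-metrics you care about. The metric names and weights below are hypothetical, purely to show the shape of the idea:

```python
# Sketch of a combined success metric (sometimes called an overall
# evaluation criterion). Metric names and weights are hypothetical;
# pick weights that reflect what your business actually values.

def combined_metric(match_rate, day7_retention, upgrade_rate,
                    weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the sub-metrics; higher is better."""
    parts = (match_rate, day7_retention, upgrade_rate)
    return sum(w * p for w, p in zip(weights, parts))

# Hypothetical control vs. variant readings from the test:
control = combined_metric(0.18, 0.40, 0.05)
variant = combined_metric(0.21, 0.38, 0.05)
print(f"control={control:.3f} variant={variant:.3f} "
      f"lift={variant - control:+.3f}")
```

With one number to move, "some things went up and some went down" collapses into a single, decidable comparison.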
Another issue while defining success metrics is that the metric may be a longer-term one, say, 90-day retention or engagement rate. How do you measure that when the experiment runs for only six days? There is always a way to forecast: take the six-day data, project it out to day 90, and see whether the projected delta gets you the result. So there is always a way to build a robust success metric, but the key takeaway is that if you launch an experiment without one, you are unlikely to get a successful outcome.
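One hedged sketch of such a projection: retention curves are often modeled with a power-law decay, r(t) = a * t^(-b). The six-day numbers below are made up, and the power-law shape is an assumption you should validate against your own historical cohorts before trusting any day-90 forecast:

```python
import math

# Hypothetical 6 days of cohort retention (fraction still active on day t).
days = [1, 2, 3, 4, 5, 6]
retention = [0.60, 0.48, 0.42, 0.38, 0.35, 0.33]

# Fit log r = log a - b * log t with ordinary least squares.
xs = [math.log(t) for t in days]
ys = [math.log(r) for r in retention]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
b = -slope
a = math.exp(my + b * mx)

# Extrapolate the fitted curve out to day 90.
day90 = a * 90 ** -b
print(f"fitted r(t) = {a:.3f} * t^(-{b:.3f}); "
      f"forecast day-90 retention ~ {day90:.3f}")
```

Run the same fit for control and variant, and compare the two day-90 forecasts rather than the raw six-day numbers.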
When free users can view only a few profiles daily, profile matches will go down, and free users might start avoiding the app altogether. As a result, the daily active user (DAU) count and other metrics could also decrease. The mistake here is not assessing these potential risks before launching the test.
Keep in mind that some experiments have counter-intuitive outcomes that can be detrimental to your business. A risk assessment before experimentation will save you from surprises and help you make better decisions.
The moment most people think of experimentation and sample sizes, they reach for an online sample-size calculator. The catch is that the calculator is only one part of the picture: the success metric you chose, the smallest effect you care to detect, and your significance and power criteria all drive the required sample size. If you do not plan for these together, you are unlikely to get the answer you are seeking.
So what next?
Those are the four primary mistakes to keep in mind while conducting experiments. Now that we know what experimentation is and how not to do it, the next post will cover how to carry it out properly. Till then –