“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” – Ronald Fisher
Experimentation is a powerful tool for businesses to innovate and test new ideas, but few seem to use it well. Erroneous experiment setups lead to failed experiments and lost time, which in turn causes businesses to give up on experimentation and on their innovative ideas. A McKinsey survey reports that 94% of executives feel their firm’s innovation performance is unsatisfactory.
We’re here to change that. Understanding how to run experiments right will help you and your company put your innovative ideas to use and drive higher value for your business. Let’s understand this with the help of the same dating app example we are following in this series.
The Dating App Example
The dating app shows a user the profiles of other users, and they get an option to select or reject the profile. The free account of the app gives five suggestions a day to its users, and the user needs to upgrade to a paid plan to get more suggestions.
What was the need?
Suppose that the product manager working on the profile-matching domain of this application came up with an idea to use Spotify favorites as a feature to filter suggestions for users. The assumption was that two people with similar Spotify favorite songs would have a higher likelihood of matching. With this assumption, they formed the following hypothesis: “Using Spotify favorites as a feature to filter suggestions will lead to a higher match percentage for users and a better user experience.” The team decided to run an experiment with match percentage as the success metric, and launched it with 50% of users receiving suggestions based on the Spotify favorites feature.
Initially, the experiment was a success: the cohort of users receiving suggestions based on the new feature had a statistically significantly higher match percentage. Based on the results, the team rolled the feature out to all users. After a few months, however, the app started noticing higher churn rates and fewer people using the app. A decrease in Daily Active Users (DAU) and in the number of suggestions shown to users indicated a problem.
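As an aside, here is a minimal sketch of how such a result might be checked, assuming the team has the number of matches and the number of users in each cohort. The counts below are purely illustrative, and the test uses statsmodels’ `proportions_ztest`.

```python
# Minimal sketch: compare match rates between control and treatment
# with a two-proportion z-test. All counts are illustrative.
from statsmodels.stats.proportion import proportions_ztest

control_matches, control_users = 4_200, 50_000      # old suggestion algorithm
treatment_matches, treatment_users = 4_650, 50_000  # Spotify-favorites feature

stat, p_value = proportions_ztest(
    count=[treatment_matches, control_matches],
    nobs=[treatment_users, control_users],
    alternative="larger",  # H1: treatment match rate > control match rate
)

print(f"z = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Treatment shows a statistically significant lift in match rate.")
else:
    print("No significant lift detected.")
```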
So what went wrong?
The product manager failed to realize that implementing the new Spotify favorites feature for suggestions had drawbacks. First, some users of the app had not connected their Spotify accounts. Second, some users who did connect to Spotify had such unique tastes that their favorites matched no other users. Thus, implementing the new Spotify feature caused some users to get fewer suggestions than the typical five per day, which hurt the user experience and drove up churn. The idea was innovative and had merit, but we believe a different implementation would have produced vastly different results. The product manager learned this through failure and applied the lessons in the following experiment. And guess what? It turned out well for the business!
How did they conduct a successful experiment?
This time, the product manager discussed the pros and cons of implementing the new feature. They identified edge cases and how user behavior would change once the feature was in place. With a greater understanding of the possible impact, they designed an experiment that catered to the different scenarios: for users who received fewer than five Spotify-based suggestions, the old algorithm would kick in to provide the rest. Hence, even with the new feature, users would continue to get five suggestions a day.
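To make the fallback concrete, here is a minimal sketch of what the suggestion logic might look like. The data layout and the two suggestion sources are hypothetical stand-ins for illustration, not the app’s real code.

```python
# Minimal sketch of the fallback described above: Spotify-based suggestions
# first, topped up by the old algorithm so every user still gets five a day.
DAILY_QUOTA = 5

def spotify_based_suggestions(user, limit):
    """Hypothetical: profiles filtered by overlapping Spotify favorites."""
    return user.get("spotify_matches", [])[:limit]

def legacy_suggestions(user, limit, exclude):
    """Hypothetical: the old suggestion algorithm, skipping duplicates."""
    return [p for p in user.get("legacy_matches", []) if p not in exclude][:limit]

def daily_suggestions(user):
    """Return exactly DAILY_QUOTA suggestions, topping up with the old algorithm."""
    suggestions = []
    if user.get("spotify_connected"):
        suggestions = spotify_based_suggestions(user, DAILY_QUOTA)
    if len(suggestions) < DAILY_QUOTA:
        remaining = DAILY_QUOTA - len(suggestions)
        suggestions += legacy_suggestions(user, remaining, exclude=suggestions)
    return suggestions

# Example: a Spotify user whose unique taste yields only two matches still gets five.
user = {"spotify_connected": True,
        "spotify_matches": ["p1", "p2"],
        "legacy_matches": ["p3", "p4", "p5", "p6"]}
print(daily_suggestions(user))  # ['p1', 'p2', 'p3', 'p4', 'p5']
```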
As expected, the experiment was a success. The Spotify feature gave users a higher match percentage while still providing five suggestions a day as before, improving the overall user experience. Estimates based on the improved match rates predicted an increase in future revenue on the order of $8.5M to $11.5M.
Elements of a Successful Experiment
From the above case study, we can list the five critical elements to get the most out of your experiments:
Risk Assessment
We noticed that the failure to analyze risk in the first experiment led to higher churn. Therefore, product heads should always ask, “How do changes made to one part of the app or business affect the other parts? Will they result in adverse effects, and can we do something to mitigate them?” Analyzing these risks and incorporating them into the experiment design before launch puts you on the track to success.
Speaking of success, how do we define it? That brings us to the next element.
Success Metric
Another critical aspect of successful experimentation is identifying an apt success metric. In our case study, the success metric used in the first experiment was the match rate. In the second experiment, however, the team also planned for a secondary success metric, churn rate (impacted by the number of suggestions shown to users), while preparing the experiment. Defining the right success metrics helped mitigate adverse effects and gave better control over the experiment’s outcome. Here’s a bonus tip: if you have more than one success metric, it is usually a good idea to combine them into a single overall criterion. If the metrics cannot be combined, measure them together and decide jointly whether the overall result of the experiment was positive.
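One common way to combine metrics is a weighted overall evaluation criterion (OEC). Below is a minimal sketch assuming match-rate and churn-rate lifts as the two metrics; the weights and numbers are illustrative, not prescriptions.

```python
# Minimal sketch: fold two success metrics into one overall evaluation
# criterion (OEC). Weights and example lifts are illustrative.
def overall_score(match_rate_lift, churn_rate_lift, w_match=0.7, w_churn=0.3):
    """Combine relative lifts into one number; a churn increase counts against."""
    return w_match * match_rate_lift - w_churn * churn_rate_lift

# Example: +4% match rate, +1% churn
score = overall_score(match_rate_lift=0.04, churn_rate_lift=0.01)
print(f"OEC = {score:+.4f}")  # positive -> the experiment is a net win
```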
Sample Size Calculation
While planning an experiment, one has to calculate its sample size correctly; every study has an ideal sample size range. On the one hand, we want to keep the sample size small to limit how many users are exposed to a potentially harmful change; on the other hand, we want it large enough to minimize error and detect even the smallest change in the success metric that matters. Sample size calculation is therefore an essential element of experimental design, keeping these two forces in balance.
The sample size calculation involves three aspects (a worked sketch follows the list):
- Type-I Error: Also known as a “false positive”, a Type-I error occurs when the experiment declares the feature successful even though, if rolled out, it would have no real effect.
- Type-II Error: Also known as a “false negative”, a Type-II error occurs when the experiment says the feature will not be successful even though it would have been successful if launched.
- Effect size: The effect size measures the strength of the relationship between the treatment and the success metric; in practice, it is the smallest difference in the success metric that the experiment should be able to detect. When there are two success metrics, the sample size should be chosen so that the smallest meaningful difference in each metric is observable.
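As a concrete illustration, here is a minimal sketch of a sample-size calculation for a match-rate experiment using statsmodels’ power analysis. The baseline rate, target rate, and thresholds are assumptions chosen for illustration.

```python
# Minimal sketch: how many users per group are needed to detect a lift
# in match rate, given Type-I error, power (1 - Type-II error), and effect size.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.08   # current match rate (assumption)
target_rate = 0.09     # smallest lift worth detecting (effect size, assumption)
alpha = 0.05           # Type-I error: chance of a false positive
power = 0.80           # 1 - Type-II error: chance of detecting a real effect

effect = proportion_effectsize(target_rate, baseline_rate)  # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=alpha, power=power, ratio=1.0
)
print(f"Need about {int(round(n_per_group)):,} users per group.")
```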
Mutually Exclusive Experimentation
“All life is an experiment. The more experiments you make, the better.” – Ralph Waldo Emerson
In start-ups where multiple teams are moving fast and learning through experiments, numerous experiments tend to be launched simultaneously. When running parallel experiments, it is vital to keep the sample populations separate, i.e., each experiment should have its own control and test populations. If the same cohort of users is exposed to several experiments, one experiment may distort the observed success metric of another, leading to unpredictable outcomes. Hence, it is crucial to ensure that the cohorts chosen for all the experiments are mutually exclusive.
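One simple way to keep cohorts mutually exclusive is deterministic hashing, so that each user lands in exactly one experiment. The sketch below assumes three concurrently running experiments with illustrative names.

```python
# Minimal sketch: hash each user into exactly one experiment bucket, so no
# user is enrolled in more than one concurrently running experiment.
import hashlib

EXPERIMENTS = ["spotify_favorites", "new_onboarding", "pricing_test"]  # illustrative

def assigned_experiment(user_id: str) -> str:
    """Deterministically map a user to exactly one experiment."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % len(EXPERIMENTS)
    return EXPERIMENTS[bucket]

print(assigned_experiment("user_12345"))  # same user always gets the same experiment
```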
Quantification of the Impact
The final step is quantifying the impact of the success metric on the business, typically in terms of revenue or profit. A rough estimate of how much revenue or profit the company stands to gain or lose with every increment or decrement in the success metric makes experiment results comparable. Experimentation can be expensive, so by quantifying the impact of each experiment, we can prioritize experiments and decide whether it is reasonable to conduct one at all. A confidence interval on the revenue or profit impact helps answer whether the cost of adopting the change justifies the gains.
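For instance, here is a minimal sketch of turning an estimated lift into a revenue range. The revenue-per-match value and the bounds on incremental matches are purely illustrative, picked to reproduce the $8.5M–$11.5M range quoted in the case study.

```python
# Minimal sketch: translate a success-metric lift into a revenue range.
# All figures are illustrative assumptions, not real business numbers.
revenue_per_match = 2.0          # assumed $ value of one additional match
extra_matches_low = 4_250_000    # lower bound on incremental matches per year
extra_matches_high = 5_750_000   # upper bound on incremental matches per year

impact_low = revenue_per_match * extra_matches_low
impact_high = revenue_per_match * extra_matches_high
print(f"Estimated annual revenue impact: ${impact_low:,.0f} to ${impact_high:,.0f}")
# -> Estimated annual revenue impact: $8,500,000 to $11,500,000
```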
How else can I improve my experiments?
Coming up with an idea is just the tip of the iceberg. Implementing your ideas through an aptly designed experiment framework, with the five critical elements above, will help you make the most of your time and resources and create maximum value for your business.
The next post in this series will discuss behavioral economics considerations when designing your experiments.
Happy experimenting!