Recently, I’ve been digging deeper into experimentation. Many organizations talk about data-driven decision-making, and one way to make better decisions is by running experiments and learning from them. At Cornell Tech, we asked students to run experiments and collect user-behavior data to validate product hypotheses. However, running experiments, especially pre-product or early in a product’s life, is challenging: if the experiments are set up incorrectly, you’ll do a lot of work AND draw the wrong conclusions. Below, I identify some of the challenges and draw insights from what others have written about running experiments at scale.
Background
There are various challenges you’ll experience when trying to run experiments.
Setting goals. Picking the “right” goal for a new product is difficult. Part of that challenge is recognizing how goals can roll up or down. For example, I previously wrote about the difference between business and product goals; others call this strategic versus tactical goals. So, depending on which goal comes to mind first, you may need to move up or down in your thinking.
Bias toward our ideas. Experiments are attempts to learn something. You start with a goal and brainstorm product/feature ideas that you hope will help you achieve it, but you don’t know that they will. During this stage, it’s easy to get lost and forget that the idea is unproven. We become convinced our ideas will succeed, even though the vast majority don’t.
Picking signals, proxies, and metrics. You’ve heard the saying, or some variation of it: “If you can’t measure it, you can’t improve it.” Measurements are necessary to test your idea and draw valid conclusions. But deciding what to measure to determine whether the product/feature idea achieves your goal can be just as challenging as goal setting. Challenges include:
Evaluating and choosing metrics among many options
Acquiring sufficient data
Ensuring high data quality during collection
Analyzing and interpreting metrics. Get past the above and you still have to analyze and interpret the metrics. While software is more readily available to help, knowing how to use the tools to calculate confidence levels or statistical significance, and then to interpret the results, requires training in statistics and active thinking. This is tiring work and sometimes prone to error.
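To make the mechanics concrete, here is a minimal sketch (with made-up numbers) of one common calculation: a two-proportion z-test checking whether a treatment group’s conversion rate differs from control’s. It uses statsmodels, but any statistics package offers an equivalent.

```python
# A minimal sketch of a significance test for a conversion-rate change.
# All numbers are illustrative, not real data.
from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 150]   # control, treatment conversions
visitors = [2400, 2500]    # control, treatment sample sizes

# Two-sided test: is the difference in proportions statistically significant?
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)

print(f"control rate:   {conversions[0] / visitors[0]:.1%}")  # 5.0%
print(f"treatment rate: {conversions[1] / visitors[1]:.1%}")  # 6.0%
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# Interpretation still requires judgment: p < 0.05 is a convention,
# not proof the change matters in practice.
```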
An example from a research paper: consider the ratio metric Click Through Rate (number of clicks / number of impressions), and assume the metric increased from 20% to 25%.
Note that this increase can occur in five ways, depending on how clicks and impressions each move. “Among these five possibilities, except for (a), the goodness of all other cases are ambiguous.” Thus, if you’re using this metric to evaluate a change caused by a product/feature, the ambiguity doesn’t help you conclude whether the product/feature is positive or negative.
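To see the ambiguity concretely, here is a small sketch (numbers invented) of five click/impression movements that all produce the same jump from a 20% to a 25% CTR:

```python
# Five invented scenarios where CTR rises from 20% (200 / 1,000) to 25%,
# illustrating why a ratio metric alone is ambiguous.
scenarios = {
    "clicks up, impressions flat":        (250, 1000),
    "clicks flat, impressions down":      (200, 800),
    "both up, clicks up faster":          (300, 1200),
    "both down, impressions down faster": (150, 600),
    "clicks up, impressions down":        (225, 900),
}

for name, (clicks, impressions) in scenarios.items():
    print(f"{name:38s} CTR = {clicks / impressions:.0%}")
# All five print 25%, yet "both down" likely means users saw and clicked
# less overall -- a very different story from "clicks up, impressions flat".
```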
What can you do? Here are some tactical tips.
Move up and down the “ladder” when thinking about possible goals.
While articles present goal setting as sequential, I’ve updated my thinking and now recognize that for most practitioners (myself included), it’s more iterative. One technique that helps with iterating is the ladder of abstraction: to move up, ask “why” questions; to move down, ask “how” questions. Here’s an example:
A goal comes to mind: Increase subscribers.
Move up → Ask why? Answer: More subscribers will increase the exchange of ideas through comments.
Reformulate the “why” answer into another goal statement: Increase interactions with readers.
Move down → Ask how? Answer: Produce higher-quality content.
Reformulate the “how” answer into another goal statement: Increase the quality of content.
Notice that my goal statements don’t use SMART formats at this stage. This is intentional, to keep the thinking easy and fluid. For example, you can have multiple “whys” and “hows” as you move up and down.
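If it helps, the ladder can be made explicit as a small data structure; this is my own toy sketch, not a tool from the literature:

```python
# A toy goal ladder: each goal maps to the "why" goals one rung up and
# the "how" goals one rung down. Multiple branches are allowed.
ladder = {
    "Increase subscribers": {
        "why": ["Increase interactions with readers"],
        "how": ["Increase the quality of content"],
    },
}

def move(goal: str, direction: str) -> list:
    """Walk one rung up ("why") or down ("how") from a goal."""
    return ladder.get(goal, {}).get(direction, [])

print(move("Increase subscribers", "why"))  # ['Increase interactions with readers']
print(move("Increase subscribers", "how"))  # ['Increase the quality of content']
```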
Translating ideas into hypotheses.
As you brainstorm ideas that can help you reach your goal, one technique to combat bias is to rephrase each idea as a hypothesis statement: “I predict that [product/feature] will cause [impact].” For example, I could have an idea that creating a podcast would increase subscribers. My hypothesis statement would be: “I predict that making podcasts in addition to my writing will cause new visitors to subscribe.”
Have two goals: strategic goals act as guards; tactical goals guide decision-making.
Every article talks about the importance of setting long-term, strategic, business goals (e.g., LTV, revenue per user, overall conversion rate). Yet research and personal experience have taught me that these goals and their associated metrics are difficult to influence directly. Worse, optimizing for strategic goals can drive actions that reward the “wrong” behaviors. For example, I have a long-term revenue goal for my writing on Substack, and I’ve been approached to place advertising or promote products/people there. But if I optimize for revenue, placing ads or promotions may damage the reader’s experience, so, being conservative, I’ve optimized for reader experience instead. Other long-term goals, such as user growth, are also difficult for me to influence directly. I have a subscriber goal, but I can’t force a new reader to subscribe. While the number of subscribers has increased over time, multiple actions have contributed to that increase, and a nominal bump in subscribers doesn’t tell me what to do next (other than providing a nice ego boost).
What I’ve learned instead is to treat these goals and their associated metrics as guard metrics. For example, if I see readers unsubscribing after a post, it likely signals I’m writing something people aren’t interested in reading. The metric acts as a monitor.
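As a rough sketch, a guard metric can be as simple as a threshold check you review after each post; the function, counts, and 1% threshold below are all hypothetical:

```python
# A minimal guard-metric check: the strategic metric isn't optimized
# directly; it just raises a flag when it degrades.
# The 1% threshold and the counts are hypothetical.

def guard_tripped(unsubscribes_after_post: int, subscribers: int,
                  threshold: float = 0.01) -> bool:
    """Return True if the post-publication unsubscribe rate breaches the guard."""
    return unsubscribes_after_post / subscribers > threshold

if guard_tripped(unsubscribes_after_post=12, subscribers=800):
    print("Guard tripped: investigate the last post.")  # 1.5% > 1%
else:
    print("Guard OK: keep executing on the tactical goals.")
```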
That’s why the other side of this coin is identifying goals that I can directly affect, which tend to be tactical. For example, I have a goal of publishing consistently, on the assumption that a consistent publication schedule retains subscribers. Writing and publishing are in my direct control.
Focus on tactical goals/metrics, but evaluate the predictive power of tactical metrics on strategic metrics.
Because you can’t directly influence the metrics associated with strategic/long-term goals, pre-product and early-stage products need to focus on short-term goals and metrics. This statement is probably controversial, but I think in PRACTICE it’s what everyone does: you focus on the tactical goal/metric because it’s in your control.
What’s harder is figuring out if your tactical goal/metric actually predicts/influences your strategic goal/metric.
For example, I mentioned earlier my tactical goal of publishing consistently. Why did I do this? I’m heavily influenced by what Substack advises.
But is publishing consistently actually associated with subscriber growth? You might assume yes, but a better way to test the hypothesis is what’s called “degradation of service”: intentionally doing something that negatively impacts users in order to learn. In my case, I tested the hypothesis by temporarily skipping my scheduled Wednesday post to see if some readers unsubscribed. Thus far, I’ve found no such evidence, even anecdotally. Instead, people who unsubscribe generally do so immediately after I publish a post.
In retrospect, this finding is unsurprising. No reader revolves their life around my small newsletter. Even with a 50% open rate, my email is a small ship in the ocean of your inbox. But the point of all this is that you have to evaluate the predictive power of your tactical metric as it relates to your strategic goal/metric.
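One minimal way to start that evaluation, assuming you’ve logged a tactical metric (posts per week) and a strategic one (net new subscribers), is a simple correlation. The data below is invented, and correlation is only suggestive, which is why a deliberate change like skipping a scheduled post is the sharper test:

```python
# A sketch of checking whether a tactical metric tracks a strategic one.
# All numbers are invented. Requires Python 3.10+ for statistics.correlation.
import statistics

posts_per_week = [1, 1, 0, 2, 1, 0, 1, 2]          # tactical metric
net_new_subscribers = [9, 7, 3, 14, 8, 2, 10, 15]  # strategic metric

r = statistics.correlation(posts_per_week, net_new_subscribers)
print(f"Pearson r = {r:.2f}")
# A high r says the metrics move together, not that one causes the other.
```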
Keep it as simple as possible. Simplicity applies to both experimental design and metric selection. As one researcher wrote: “Usually at the early stage of an online service, when it does not have enough users and the improvements are easy to observe, these two types of metrics [strategic and tactical] may be good enough.” Simplicity makes execution, data collection, and interpretation easier, and overcomplicating is a mistake I’ve made. For example, if you had to choose between a product with a simple email sign-up and one whose sign-up requires payment, pick the email sign-up first: it simplifies both the product and the data collection. While it’s true that not everyone who signs up will pay, the savings in time and effort are worth the exchange in learning.
Some ideas I’m noodling, but uncertain (aka, shit I don’t know).
While there’s considerable research on experimentation at scale for products with hundreds of thousands or millions of users at Google, Microsoft, etc., it’s not clear to me how to evaluate the relationship between tactical and strategic metrics for pre-product or early-stage products with limited users. The commonly recommended method is qualitative user interviews, because of the small sample size. However, interpreting the results of user interviews is challenging: two people who attend the same interview can draw very different inferences, and the small sample size can significantly bias whatever conclusion you draw.
To address the small-sample problem, early-stage companies/founders sometimes recruit paid participants for user interviews, but this can further bias the feedback, and running more qualitative interviews doesn’t necessarily make interpretation easier. This situation seems to favor intuition in decision-making, which in turn favors the HiPPO (highest-paid person’s opinion) and people with more life/work experience. So, how does one evaluate the relationship between tactical and strategic metrics at very early-stage companies? Or, perhaps the better question: is it even necessary to evaluate tactical versus strategic metrics at that stage?
Have a tip on how you’ve conducted experiments with brand new products?
Enjoy reading? Support me by picking up a copy of my book.
Sources and Additional Reading
Experimentation (A/B testing)
Controlled experiments on the web: survey and practical guide
Seven Pitfalls to Avoid when Running Controlled Experiments on the Web
Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments
Metrics
Measuring the User Experience on a Large Scale: User-Centered Metrics for Web Applications (HEART metrics)
Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned (Goal, Guardrail, Debugging, Data Quality)
Data + Intuition: A Hybrid Approach to Developing Product North Star Metrics
Online Experimentation with Surrogate Metrics: Guidelines and a Case Study (Surrogate metrics that correlate with True North Metrics)