Bring order to chaos: Setting up a production troubleshooting and remediation process
Dealing with bugs, fixes, and patches is unavoidable for PMs working with software products at a startup. In the beginning, when there are few customers and features, it's simple: identify the issue, investigate, and chat with an engineer to fix it. Maybe the technical cofounder even makes the code change. But as the number of customers grows and the codebase gets more complex, it's easy for issue triage to consume all of a PM's time. You now need to set up a production troubleshooting and remediation process. Let me show you.
What is a troubleshooting and remediation process?
It's actually two processes, best run sequentially.
Troubleshooting is the process of identifying and restoring immediate functionality.
Remediation is the process of "making up" for users or customers impacted by the issue.
Here's an analogy. You show up to the ER with a broken leg and you're bleeding all over. Troubleshooting is the doctor figuring out why you're at the ER, what's causing the bleeding, and stopping it so you don't die. Yes, you want to walk again. Maybe physical therapy (i.e., remediation) is needed, but you won't get there if you're dead.
The 5 components
It's easy, and you've likely practiced these but didn't know the vocabulary.
Identification and logging: How was the issue brought to attention (e.g., Slack, email, Trello, Jira, Linear, phone call)? What information about the issue is available (e.g., logs, screenshots, screen captures, steps to reproduce)?
Communication: How are you going to communicate about the issue? Who do we need to inform, who needs to do what, who should we consult, and who will make key decisions?
Prioritization: Should we work on a reported issue? How will we decide which issue to work on if there are simultaneous issues?
Escalation: How do we handle disagreements (e.g., if we can't resolve one among ourselves, what do we do)?
Closure: Do we celebrate? Should we investigate further? Do we need to perform a root cause analysis or a review?
Steps for Creating A Good Troubleshooting Process
Gather 3-5 people and bring them along. You're introducing a new process, so it's best to co-opt people into its creation. Pick people who are currently involved when dealing with bugs (e.g., sales, customer success, engineering, finance, even the CEO). Your goal is to create, document, and communicate the new troubleshooting process collaboratively. See the SPADE decision-making process for more detail on how to execute.
Create and use a reporting template. As issues increase, you can't scale if the rule is "yelling the loudest gets attention". People commonly complain that issue forms are long and tedious to fill out, but they don't have to be. Here's a simple 5-question template that you can use in any tool, whether Git, Jira, Trello, Asana, or Slack. The goal is to cut out the 15 minutes of conversation where you're acting as a chatbot.
Define priority and escalation process.
First, the reporter MUST pick a priority when reporting an issue (see above). It's the same reason why people decide for themselves whether to visit the ER or go to Walgreens. I can't stress this enough. Some reporters will feel very uncomfortable making this decision because it's unfamiliar ("Wait, I get to tell you to drop everything?"). You can help by asking, "What type of response do you want from me?"
Once the reporter assigns a priority, it's up to the PM (assuming the issue is sent to the PM) to determine if she or he agrees. Just because you show up at the ER doesn't mean a doctor thinks your issue is life or death, but a doctor is still going to go through the process to check.
If you agree with the reporter, move on. If you disagree, inform the reporter and downgrade (or upgrade). This is where the escalation process is needed to resolve disagreements. You need to define who has final authority (e.g., the PM has final authority; issues go up the reporting hierarchy and the VP of Sales has final authority; reporters are always right; rock-paper-scissors). There's no perfect answer, but everyone needs to understand how to resolve disagreements; otherwise, you'll have too many variations and lots of wasted time.
Define how to communicate updates. Many issues won't be solved instantly, so you'll need to give updates. That's easy when it's five people, hard when it's 50 people who all need different information to do their jobs.
Pick a channel for communication and stick to it. I recommend email via a group distribution list. Create a group (e.g., Google Groups) and allow people to auto-subscribe. This way, you can easily add people and they can remove themselves. This becomes the forum you'll use to communicate updates. Make this a broadcast, not an email chain. Use an eye-catching subject line (e.g., "PRIORITY 0 ISSUE UPDATE: WEBSITE IS DOWN"). Keep your emails short, with links to Notion or other documents for people who want more details.
For Priority 0 and 1 issues, provide updates even if the update is "nothing to update". When shit hits the fan, people from all over the company will reach out to you for information. However, sometimes you literally have no further information to give: "We're still working on it." If that's the case, provide an update stating so, with a time for the next update (e.g., "Nothing to update. Still investigating. Next update in 30 minutes"). This will free up your time to work on resolving the issue.
Define priority 2 and 3 issues as no direct communication. By default, these are issues that aren't so urgent that you need to drop what you're doing. So, set the standard ahead of time that you won't be providing updates. They'll get hammered out through your normal process of figuring out what to do next sprint.
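The update rules for priority 0 to 3 issues can be written down as a small helper. This is only a minimal sketch in Python, assuming priorities are the integers 0 through 3 used above. The function name `build_update`, the dictionary keys, and the 30-minute default are illustrative assumptions modeled on the example subject line and the example "nothing to update" message, not a prescribed format.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: only priority 0 and 1 issues get broadcast updates;
# priority 2 and 3 issues go through normal sprint planning instead.
BROADCAST_PRIORITIES = {0, 1}


def build_update(
    priority: int,
    title: str,
    status: str = "",
    now: Optional[datetime] = None,
    next_update_minutes: int = 30,
) -> Optional[dict]:
    """Return a broadcast update for a P0/P1 issue, or None for P2/P3."""
    if priority not in BROADCAST_PRIORITIES:
        return None  # no direct communication for lower priorities

    now = now or datetime.now()
    # Even with no news, send an update and commit to the next one.
    body = status.strip() or "Nothing to update. Still investigating."
    return {
        "subject": f"PRIORITY {priority} ISSUE UPDATE: {title.upper()}",
        "body": f"{body} Next update in {next_update_minutes} minutes.",
        "next_update_at": now + timedelta(minutes=next_update_minutes),
    }


if __name__ == "__main__":
    update = build_update(0, "Website is down")
    print(update["subject"])
    print(update["body"])
```

Returning `None` for priority 2 and 3 makes the "no updates" standard explicit, so nobody has to decide case by case whether a lower-priority issue deserves a broadcast.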
Have closure and rest. After telling everyone it's all resolved, it's easy to forget the stress induced by handling priority 0/1 issues. People may be tempted to jump right back into what they were previously working on, but I've learned from my own mistakes that it's better to celebrate as a team and take a 10-minute mental break. Yahhhh! Prior to remote work, maybe it was grabbing a drink or coffee. Now, maybe share some tunes or hang out. Whatever you decide, make it a fun habit.
Set a time in the future to review the process. People are going to complain about the "new" troubleshooting process. It won't be perfect, so as a way to help people adapt, set a time for people to review it (i.e., literally schedule something on people's calendars 2 months ahead). You'll want to wait 2-4 cycles so people get a chance to try the process end-to-end. And when people complain, that's okay. Invite them to the future meeting so they can vent.
What about the remediation process?
Having "stopped the bleeding", it's now time to rebuild. Sometimes, it's as easy as asking the customer to "retry". Other times, there's more work that needs to be done, typically to fix underlying data issues. Perhaps a record was deleted and needs to be restored. Maybe some messages were stuck or a customer has to be refunded. Regardless, more work needs to be done.
Handling remediation should follow the same process as troubleshooting an issue (i.e., use the template, log it, set a priority, determine if there's agreement, escalate if needed). The difference here is that you, the PM, assign a starting priority. Then, you'll want to confirm that priority, typically with sales, customer success, or customer support.
Here, it's important not to combine troubleshooting with remediation. That's because there may be no way to "compensate" injured customers without engineering work. Talk to sales or customer success. While painful, it's sometimes easier and faster to ask the customer to start over, because remediation may be more time-consuming.
Other Lessons Learned The Hard Way
Don't blindly try to perform root cause analysis. Shit's going to break. Processes that you designed aren't going to work. Customers will be upset and so will your coworkers. After an issue, someone is going to suggest conducting a root cause analysis under the good mantra: "We learn from our mistakes so we don't repeat them." Don't fall into the trap! Here's why:
a) You don't have the time or the information to conduct a proper root cause analysis.
b) At early-stage companies with fewer than 100 people, your tech stack, people, and processes will change, and some issues will no longer be applicable 5 months later. Change should be fast, so this is a waste of time.
c) Making fundamental changes is difficult (e.g., Sara worked too many hours and we didn't have a backup infrastructure engineer. In a hurry, she made a production deploy with the wrong command. If only we didn't work so many hours or so hard).
Instead, suggest some band-aids that are easy to implement. People are solution oriented and probably have them top of mind. Pick 1 or 2 at most and make a change.
Offer 1:1 guidance when someone is filling out the template for the first time. Next time someone pings you about an issue without submitting a template, say, "Hey, you got 15 minutes to fill out the reporting template? I can jump on a video call right now and guide you." Spend quality time answering questions. Don't become the scribe, but give encouragement. After 15 minutes, once the reporter has tried filling it out, take over and update the document yourself.
Introduce technology so reporters don't have to type. With tools like Loom, Jing, and CamStudio (plus the screen and video capture features built into Mac and Windows), reporters don't have to type everything. They can share a recording of the issue instead.
Don't use a risk matrix (i.e., priority vs. severity). If you Google, you'll discover the priority versus severity matrix.
It's a neat intellectual framework, but terrible in practice. From my experience, there is too much variability in how people define issues along two axes. But don't just take my word for it.
Risk matrices do not necessarily support good (e.g., better-than-random) risk management decisions and effective allocations of limited management attention and resources.
… the common assumption that risk matrices, although imprecise, do some good in helping to focus attention on the most serious problems and in screening out less serious problems is not necessarily justified.
[For severity] Users with different risk attitudes might have opposite orderings … As a result there is no objective way to classify the relative severities of such prospects with uncertain consequences.
In short, you're already dealing with people who view priorities differently (e.g., I might go to Walgreens and self-medicate while you'll visit the ER for the same issue). When you add severity as the second axis, it gets even more confusing. As a result, stick with just priority.
Timebox priority 1 issues to 15-30 minutes max to make a decision to downgrade or upgrade. When you only use the priority ordinal ranking, you'll end up with most reporters suggesting a 1, which may then get downgraded to a 2 if there's already some reasonable workaround. That's normal, and the way to manage it is to define ahead of time the number of priority 1 issues you'll handle per week. If you start collecting more priority 1 issues than you can resolve, it means you need to dedicate more resources to delivery quality rather than new development.
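The downgrade rule and the weekly budget can be sketched in a few lines of Python. This is a simplification under stated assumptions: the real downgrade call is a human judgment made inside the timebox, and the names `Issue`, `review_priority`, `over_capacity`, and `weekly_p1_budget` are hypothetical, introduced here only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Issue:
    title: str
    reported_priority: int        # priority picked by the reporter
    has_workaround: bool = False  # assumption: PM judges this in the timebox


def review_priority(issue: Issue) -> int:
    """PM review: a reported P1 with a reasonable workaround becomes a P2."""
    if issue.reported_priority == 1 and issue.has_workaround:
        return 2
    return issue.reported_priority


def over_capacity(issues: List[Issue], weekly_p1_budget: int) -> bool:
    """True when confirmed P1 issues exceed the number agreed for the week."""
    confirmed_p1 = [i for i in issues if review_priority(i) == 1]
    return len(confirmed_p1) > weekly_p1_budget


if __name__ == "__main__":
    this_week = [
        Issue("Login fails", 1),
        Issue("Export is slow", 1, has_workaround=True),
        Issue("Emails stuck", 1),
    ]
    if over_capacity(this_week, weekly_p1_budget=1):
        print("More P1 issues than budgeted: shift effort to delivery quality.")
```

Tracking the count against a number agreed in advance turns "we have too many fires" from a feeling into a signal you can bring to planning.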
Handle bugs versus issues differently, but allow reporting to be the same. Reporters don't care if the issue is a bug (i.e., something that deviates from expected behavior) or a feature request in disguise. However, it's important for you to know how to separate the two. Bugs are "easier" to solve because you have a defined expected behavior: software engineering can fix them because there's a clearly defined outcome that's currently broken. Issues that are really feature requests can't be solved as quickly because you first have to validate and define the new expected behavior.
Thank you for writing this! I read a number of PM blogs and this is probably the most practical piece of advice I've seen in 2021. I am currently serving as the Head of Product, Engineering Manager, and sole QA-tester so I will never tire of practical suggestions.
I do all of this except for having the reporter pick a priority upon submission. Across my 10 years as a PM, I've always worked at companies where "ONLY PMS CAN TOUCH THE PRIORITY FIELD!!!" Asking the reporter to think about, and articulate a priority is so clever as a conversation starter! I'm eager to introduce that step to my company and see where it takes us.
I look forward to reading through your other posts as well as future ones!