Tips on managing a team through long-running pressure or an incident

A few months ago, I was involved in one of the longest-running incidents we ever had at Intercom. Even though I wasn’t managing or leading the work to mitigate the incident, I was the people manager for a few folks from the core team and learned a few lessons that, honestly, I hope to use as rarely as possible. The other managers involved in the incident and I reflected together afterwards and wrote down our learnings.

First of all, I need to highlight that these tips and lessons come from a very specific type of incident or pressure on your team: the kind that lasts for weeks, puts a lot of pressure on everyone involved and significantly impacts everyone’s work-life balance. The two other managers involved and I were all quite experienced senior managers, and we still made a lot of mistakes. Even though these lessons are stupidly simple, trust me - don’t underestimate them.

Under pressure, we focus on the most important challenge at the moment. We let other things run on autopilot. We are willing to make some decisions without thorough consideration, even when the risks are bigger than we would usually accept. All of that is because of stress. And when things settle down, when you get through that last mile and cross the finish line, you are exhausted and happy at the same time. It’s easy to assume that when the long working hours end, the team is back to normal. They are not. And neither are you.

Even after the risks are mitigated, the incident is still running. The real end of any incident comes way later.

Tips during the incident

Do retrospectives weekly

They are a source of routine and a foundation that a team relies on. Retrospectives are especially important for giving people a place to vent emotions and for iterating on how you work in such a demanding time. Remember, all that’s currently happening is probably new for you and your team. Your team is not optimised to work under such conditions, so the majority of your processes and ways of working are probably broken. Be transparent, candid and open about it in weekly retros, and adjust to help each other get through it. It’s easy to stop doing retros or do them very rarely to save time - don’t do that.

Keep everyone aligned

Make sure that your manager and their manager know about the incident, risks, next steps, team morale, and current and ideal capacity - daily. This is the time to be at your most reliable - don’t miss even a single update, and make sure that your team knows about that transparency. I’d recommend having a Slack channel where you post a detailed update every day.

Involve senior technical leadership

Make sure that you have support from at least a few of the most experienced, most tenured technical leaders, ideally a mix of individual contributors (like principal engineers) and senior managers. This helps with morale and signals that the team and the problem matter. If these people are not willing to make an effort and support you, your incident and the pressure are probably not worth it.

Tips after mitigation to close the incident

Do the best incident review you ever did

Make sure that you spend a lot of time reviewing and reflecting on this incident. Do your best to never let it happen again.

Force people to go on leave

Force your people to go on leave. You can’t send them all at the same time, so create a weekly schedule of who takes time off when. No excuses - everyone needs to go, ideally in order of their mental health state. A lot of them will say they are fine and don’t have to go. Trust me - I made exactly this mistake of believing them, and it ended badly. Everyone needs to go.

Don’t call the forced leave holidays. They are not.

When you ask them to go on this leave, don’t call it holidays. They are not going on holidays. They are going to recharge their batteries after giving way too much to your employer. They will most probably spend most of it sleeping, barely leaving their apartment. This is leave to get their work-life balance back. Holidays are when they decide to go, not when the condition of your systems decides for them.

Recognise the work publicly

Make sure that adjacent teams or functions hear about the hard work publicly. All Hands (global or functional) is a perfect place to recognise a team after such an incident and make them feel proud.

Recognise the work financially

Last but not least, don’t forget to recognise the work financially. You owe them, and I believe that I don’t have to explain that.

Summary

I think that incidents and fires are situations during which everyone learns a lot. But this is not a sustainable learning process. So whenever they happen, we need to make sure that we really use this challenge well.

And if there is one closing note on all of these tips that I would like you to remember, it’s this: when the next incident happens, just open this blog post again. As I said, under huge stress, you probably won’t remember these tips, so it’s good to remind yourself of them.
