Follow up incident of 27/05/2018

#1
Hello everyone,

After the network incident (= stall) that happened on 27/05/2018, I'd like to bring up a couple of points for discussion:

1. Can we get a root cause analysis/postmortem of the incident?

2. Can we agree that a short root cause analysis should be produced for every major network incident that occurs? Within a week of the incident? In a dedicated forum thread? Who would be responsible for producing it? Such a process would demonstrate professionalism and show that we are not trying to sweep problems under the carpet or leave them hanging around unresolved. The postmortem should probably also analyse the response time and time to recover: can we improve them? Can we take corrective actions to mitigate the impact next time?

3. What is our incident response plan? We should write down a standard procedure to be followed step by step as soon as a problem is detected. That standard procedure should be bookmarked by all ANOs, kept handy, and followed every time. Here are some thoughts on that:
  • Steps to confirm that it is a network problem and not a local node problem (see the detection sketch below this list). Verify all the nodes you own + https://explorer.factom.com/? Contact ANOs that are online at the same time?
  • If the problem is confirmed, who should intervene? Right now we need to contact the swarm manager. How? How do we make sure (s)he wakes up?
  • Are there basic verifications/data checks that ANOs can run and provide to speed things up?
  • Who can decide to engage ANOs? What conditions should require ANO engagement?
  • https://status.factom.com/ should be updated once the problem is acknowledged. Who updates it? What other public channels should be used?
Limiting variability in the procedure tremendously reduces the time to recover and the risk of human error, especially during emergencies when everybody is stressed.
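To make the first bullet more concrete, here is a minimal sketch (Python) of the kind of check an ANO could run to tell a network stall apart from a local node problem. The factomd JSON-RPC "heights" method, the port, and the reference node URLs are assumptions/placeholders rather than anything agreed in this thread; the real procedure should use whatever endpoints we standardise on.

```python
# Sketch only: distinguish a network stall from a local node problem by
# checking whether directory block heights still advance. Assumptions not
# taken from this thread: factomd exposes a JSON-RPC v2 endpoint on port 8088
# with a "heights" method; the reference node URLs are placeholders.
import json
import time
import urllib.request

LOCAL_NODE = "http://localhost:8088/v2"
REFERENCE_NODES = [
    "http://reference-node-1.example.com:8088/v2",  # placeholder
    "http://reference-node-2.example.com:8088/v2",  # placeholder
]
WAIT_SECONDS = 11 * 60  # Factom targets 10-minute blocks; wait a bit longer

def directory_block_height(url):
    """Ask a factomd node for its current directory block height."""
    payload = json.dumps({"jsonrpc": "2.0", "id": 0, "method": "heights"}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["result"]["directoryblockheight"]

def snapshot(urls):
    """Collect heights for every reachable node; unreachable nodes map to None."""
    heights = {}
    for url in urls:
        try:
            heights[url] = directory_block_height(url)
        except Exception:
            heights[url] = None
    return heights

if __name__ == "__main__":
    nodes = [LOCAL_NODE] + REFERENCE_NODES
    before = snapshot(nodes)
    time.sleep(WAIT_SECONDS)
    after = snapshot(nodes)
    advancing = {u for u in nodes
                 if before[u] is not None and after[u] is not None
                 and after[u] > before[u]}
    if LOCAL_NODE in advancing:
        print("Local node is advancing: no stall from our point of view.")
    elif advancing:
        print("Local node stuck but references advancing: likely a LOCAL problem.")
    else:
        print("Nothing is advancing: possible NETWORK stall -> escalate per procedure.")
```

If a check like this reports a probable network stall, the rest of the procedure (contacting the swarm manager, updating https://status.factom.com/) would kick in.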

4. In this case it turned out that no ANO involvement was necessary. Had the case been more complicated, what actions could have been expected from ANOs? What kind of recovery actions should we be ready to perform?

Point 3 is by far the most important and we should have a clear process before the next incident. I'm willing to help build/review it.
 
#2
Thanks for starting this discussion, Luap.

Perhaps we should form a Network Liveness Committee*. The purpose of the Network Liveness Committee would be to create and implement procedures to ensure quick resolution of liveness faults. The committee would have the ability to wake a swarm manager and also wake other ANOs should operator intervention become necessary. A programmatic solution would be ideal for making calls via VOIP. I believe Stuart from TFA has an alert bot that he plans to open source. We could adapt that for use by the committee. The committee would additionally be responsible for conducting post-mortems.
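To illustrate what a programmatic escalation could look like, independent of which bot or VOIP provider is eventually chosen, here is a hedged sketch in Python. The contact list and the place_voip_call() and acknowledged() helpers are hypothetical placeholders, not Stuart's bot or any real API.

```python
# Hypothetical escalation-policy sketch for the proposed committee. Everything
# here is a placeholder: the contact list, place_voip_call() and acknowledged()
# stand in for whatever alert bot / VOIP integration the committee adopts.
import time

ESCALATION_ORDER = [
    {"name": "Swarm manager on duty", "phone": "+10000000001"},            # placeholder
    {"name": "Committee member (EU timezone)", "phone": "+10000000002"},   # placeholder
    {"name": "Committee member (APAC timezone)", "phone": "+10000000003"}, # placeholder
]
ACK_TIMEOUT_SECONDS = 5 * 60  # how long each person has to acknowledge

def place_voip_call(phone, message):
    """Placeholder for the real VOIP/alerting integration."""
    print(f"Calling {phone}: {message}")

def acknowledged(contact):
    """Placeholder: would check whether this contact acknowledged the alert."""
    return False

def escalate(incident_summary):
    """Call each contact in order until someone acknowledges the incident."""
    for contact in ESCALATION_ORDER:
        place_voip_call(contact["phone"], incident_summary)
        deadline = time.time() + ACK_TIMEOUT_SECONDS
        while time.time() < deadline:
            if acknowledged(contact):
                print(f"{contact['name']} acknowledged; stopping escalation.")
                return contact
            time.sleep(15)
    print("Nobody acknowledged: page the whole committee / all ANOs.")
    return None

if __name__ == "__main__":
    escalate("Factom network stall detected: directory blocks not advancing")
```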

The committee would need to be spread across timezones and have their own means of being contacted in the event of a stall. I am happy to be a member of this committee. It would be beneficial to have at least one swarm manager on the committee too.

*Liveness is the ability for the network to make decisions. A liveness fault is one where a decision cannot be made. In the case of Factom, a liveness fault may be the failure of an election to produce an outcome or the inability to agree on the next block. In both instances, the result is a network stall. Vitalik's brief explanation of liveness: https://github.com/ethereum/cbc-casper/wiki/FAQ#what-is-liveness
 
#3
Or perhaps the more generically named 'Network Uptime Committee' would be more suitable, because then it explicitly deals with anything that threatens uptime.
 
#4
Thanks @Luap for starting this discussion. I second the importance of follow-up after major technical issues. Since both you and @Alex expressed willingness to spearhead this effort, I don't see the need to wait for a committee to be set up; otherwise we have to wait until next month's ANO meeting. It makes more sense for you two to start the work now to answer #1, #2, #3, and any other recommendations. Whether we need to implement a Network Uptime Committee or any other recommendation can be decided at the ANO meeting.
 
#5
This is from Quintilian:

I already did some thinking about this a few months ago, and my idea was that the onboarding and network committee should handle the unscheduled restarts, alerting, and review. There are a few reasons for this:

1) The group needs the emergency contact information from the different teams, and I would like this to stay as private as possible. The current group consists of 2 guides and 3 Factom employees.
2) Currently Factom Inc. is handling the updates and maintaining the authority set, and the 3 main people at Factom who work with this operationally (Ian, Brian and Steven) are already in this group.
3) The review should be done by the group doing the work behind the scenes to get the network up and running again (i.e. the people mentioned above).

My take has also been that we should make a Google Form that the committee fills out detailing the incident, which would automatically be made public to the community (basically linked to a spreadsheet on the Google Drive)...

I have actually already made room for three documents in our "document tree" to handle these kinds of things, but have not gotten around to writing them yet (this guide stuff takes up a lot of time):

Doc 106 - Factom Code updating & Network restart process
Doc 107 - Authority Set - Rapid response process
Doc 108 - Network Incident reporting process

I would like to include you and others in this work going forward, but we need to figure out the best way of doing it. Let's discuss it in the forums, and I'll join in when I've got my access back.
 

jcheroske

Bedrock Solutions
#6
We just recently signed up for PagerDuty, and I can't help but wonder whether it should be a requirement for operators and Factom employees to sign up and join a Factom Operators organization there. If we had everyone added to an escalation policy, or several escalation policies if need be, we could define our incident response policy programmatically and in fine detail. Who gets alerted first? What happens if they don't pick up? How long do they have to pick up? Is it high priority or low? Each individual would also be able to choose their preferred contact methods and maintain their own contact information, so a master list of contact info might not need to be referred to during an incident. I think it would cost folks about $10/mo, and once you've signed up you could use it for your own internal purposes as well, so all the operators would get intra-org and inter-org alerting for one fee.
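For what it's worth, opening an incident against such an escalation policy is a single call to PagerDuty's Events API v2. This is a generic sketch: the routing key is a placeholder for the integration key of a hypothetical "Factom Operators" service, and the summary/severity values are just examples.

```python
# Sketch: triggering a PagerDuty incident via the Events API v2, so that the
# escalation policy (who is called first, how long they have to ack, who is
# next) lives entirely in PagerDuty. The routing key is a placeholder for the
# integration key of a service that does not exist yet.
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "REPLACE_WITH_SERVICE_INTEGRATION_KEY"  # placeholder

def trigger_incident(summary, source, severity="critical"):
    """Open a PagerDuty incident; PagerDuty then handles alerting/escalation."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)  # includes a dedup_key for later ack/resolve

if __name__ == "__main__":
    trigger_incident(
        summary="Factom network stall: directory blocks not advancing",
        source="stall-monitor",
    )
```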