Disclosure of critical bug fixed in v6.0.1

As part of the September 27th network stall (https://docs.google.com/document/d/1l3jccyx8iHryX01vVzIvZDjzljaedHNSC5XoaCjbuXc/edit), a patch was developed as the main part of v5.4.3-fix1 (https://github.com/FactomProject/factomd/compare/v5.4.3...v5.4.3-fix1). The patch white-listed a single transaction. It was an entirely valid transaction that passed all the protocol rules for inclusion: it didn't double spend coins, it was properly signed, it paid an appropriate fee, it had an appropriate timestamp, etc. The block it was included in (160181), however, was very different: it was the first block created during a network restart.

A network restart is a rare process that is needed at this stage of the factom network lifecycle, when a kickstart is needed to get the servers back in sync. While this is not part of the long-term plan, it is the system we have today. Most of the time the network can handle large losses of federated servers (up to half) on its own through the faulting process. That faulting process did not work during the September 27 stall, likely due to the minute 9 election bug (which has a fix pending in 6.0.2, described below).

Since the network did not recover gracefully, it stalled. Factom prioritizes Consistency in the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem), since the alternative is forking, which is bad in blockchain systems. The block was being created while there were stale process list items from previous runs (the root cause of which was found and patched later in v5.4.3-fix3/v6.0.0 (https://github.com/FactomProject/fa...0.0#diff-8e27b49c38baeedefdea56b9b003f90cR216)), which caused lots of faulting in the network. As the authority set progressed forward through the faults, the block time stretched out to more than an hour. A factoid transaction was included in the block (https://explorer.factom.com/transac...45cbcdf34c3958cf54e2388fd3f8ef907dca903ea2f2e) at a time beyond which its timestamp was valid (more than +-1 hour from the block start time, which is specified in the directory block header and the DBSig of VM0). The federated servers accepted it because it was within +-1 hour of the current time, and dutifully placed it into the blockchain. All the nodes following along with the process list (necessarily all federated servers) agreed the transaction was valid. They all saved the block and continued building blocks.

This highlights the difference between factom's modes of operation. Blocks can be ingested either by following along with real-time messages (via the process list) or by downloading fully formed blocks from a peer (dbstates). The two modes should validate blocks the same way; when they don't, it can cause problems, and that is what happened here. A node that was not online at the time, and that later downloaded the blockchain block by block, would fail to validate this block. The factoid transaction in question did not pass the timestamp rule: it was more than an hour away from the block start timestamp recorded in the directory block header. Being a rule violation, such a node would not accept the block and would not progress past #160180. To compensate for this misstep by the federated servers, a new version of factomd was released which specifically allowed this transaction. In blockchain parlance, this was effectively a hard fork, since it was a loosening of the rules. It required releasing a new version of factomd, which anyone who wanted to keep up with the network needed to install. This was disruptive to the entire economy, as all exchanges and all users of the protocol had to install new software to continue operations.

At the time it was thought that an exceptional circumstance was required for a single transaction to hard fork the blockchain: something like a network boot taking more than an hour to finish. The plan at the time was to abort any reboot taking more than an hour, to prevent this from happening again while the root cause was searched for. After more investigation, it turned out the risk was far greater. A hard fork could be triggered at any time, even accidentally! An average FCT wallet user's misconfigured clock could hard fork the entire network, repeatedly.

FCT transaction timestamps are set by the user who creates the transaction. This is how factom efficiently prevents duplicates and double spends. Transactions are only valid within that +-1 hour window; outside of it, they can be rejected. This allows factomd to keep track of duplicates for only a couple of hours, minimizing memory usage at scale. It has the side effect of making a transaction definitely expire within 12 blocks of when it should have been confirmed, which is beneficial: if a transaction doesn't go through, after an hour you know it never will.

If a transaction was timestamped near the 1 hour boundary, so that it was almost too old, it could be sent over the network and get confirmed. Nodes which were online would continue unimpeded, building the blockchain. Nodes booted later would reach that transaction, see a rule violation, and reject the block. The mismatched timestamp could come from something as simple as a misconfigured time zone on the wallet, which would put its clock off by the 1 hour danger point.

Needless to say this was concerning, and Paul worked throughout the weekend after discovering the problem to solve it and test the solution. As with most complex systems, it took several attempts to find an effective strategy, so the final solution presented is only part of the effort. Paul also wrote some unit tests showing the before and after effects of the patch. He considered them so dangerous that he did not push them to GitHub, perceiving the risk of showing an attacker exactly how to disrupt the network as too great.

After the problem and solution were identified, they were brought to the active members of the factom core committee to determine a path forward. A deployment strategy was formulated and debated. Some voices wanted to recommend an immediate deployment to mainnet, treating it as an extreme emergency. Others wanted to go through the testing process on the Community Testnet, with some bake time to ensure stability over time. In the end, the decision to take a measured approach to the deployment prevailed.

The update was released as 6.0.1-RC1 and gained confidence on the testnet. On October 15 the release was announced for the ANOs to update. A couple of days later, a majority of the ANOs had updated and the network was protected against this problem.

Due to the nature of the bug, only the Authority Set needs the updated software to prevent this. Non-authority nodes are fine running either 6.0.0 or 6.0.1, whether they compile from source or use docker containers.

I want to thank the ANOs for quickly responding and upgrading to protect the network. This is unlikely to be the last security update which the Authority Set servers can protect against.

Appendix: Minute 9 election bug.

Part of what was discovered after the September 27 stall was that the servers would exhibit some confusion over who was in the authority set. If there was a fault and an election in minute 9, the last minute of the block, the leader who was voted out would never accept the updated authority set. It would continue to follow along with the minutes and would appear to be up and stable as an Audit server. The problem is that it could not actually fulfill the role of an Audit server, and would not be able to step in to take the place of a failed Federated server.

When the confused Audit server is rebooted, it comes back with the correct state, ready to step in and take over. But rebooting is a manual process that wouldn't have time to happen when lots of faulting is occurring, and the result would be a stalled network.

After this problem was observed and became repeatable (repeatability is very helpful when debugging), it still took a while to figure out. Several developers attempted to find the root cause, and several theories were developed and proved unfruitful. Paul spent some time looking at this bug but didn't discover the root cause. The task was taken up by Sam Barnes, who was able to find it. It turned out to be a one line fix, seen here:


While only a one line fix, it took several days, multiple developers, and several different theories of what was going on to actually find the bug. This fix is integrated into the pending 6.0.2 release candidate, which is undergoing more debugging of its own.


Version 6.0.2 is shaping up to be a big and important release containing dozens of fixes. I am looking forward to having these many issues resolved on the mainnet.