Factom Protocol Refactor/Rewrite

Secured
#1
The Factom Protocol has a number of issues in its current implementation that limit the transaction rate the protocol can support and that lead to instability. Factom Inc. has been discussing and considering ways to address these issues. We want to publish our thoughts here and open the discussion and planning to the entire community.

Some observations about the current implementation, and about what we need in a consensus algorithm:
  • The current implementation does several things other consensus algorithms don't do:
    • Thin blocks: We don't pass around duplicate information quite as much as most consensus algorithms
    • Multi-leader: This means no one entity is in charge of the entire state for a block. In fact, most of the 28 ANOs contribute to every block today
    • Censorship Resistance: By committing to write an entry before knowing what the entry is, operators of Factom are isolated from applications running on Factom (a minimal sketch of this commit/reveal pattern follows this list). This is good both for users (censorship is harder) and for ANOs (they are not the operators of those applications)
    • User Chains, Entry Blocks, and Directory Blocks: Data is organized to have provable data sets; most blockchains don't do this.
    • ANOs exist because of the multi-leader architecture
    • The Platform can scale because of the commit/reveal, multi-leader, and structured-data approach of the consensus algorithm
  • The current implementation is structured as a state machine. Go makes concurrent processing easy, which would greatly simplify the implementation of the current Factom consensus algorithm
  • Testing points to the algorithm being able to carry us to thousands of transactions per second, and more.
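To make the commit/reveal point concrete, here is a minimal sketch of the idea (hypothetical types, not the actual factomd code, and a plain SHA-256 stands in for the real entry hash): the node accepting a commit sees only a hash, and the reveal is only accepted if a matching commit arrived first.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Commit carries only the hash of an entry (plus payment in the real
// protocol), so the node accepting it never sees the entry's content.
type Commit struct {
	EntryHash [32]byte
}

// Reveal carries the actual entry content.
type Reveal struct {
	Content []byte
}

// pendingCommits tracks commits waiting for their reveals.
var pendingCommits = map[[32]byte]bool{}

func acceptCommit(c Commit) {
	pendingCommits[c.EntryHash] = true
}

// acceptReveal only succeeds if a matching commit was seen first.
func acceptReveal(r Reveal) bool {
	h := sha256.Sum256(r.Content)
	if pendingCommits[h] {
		delete(pendingCommits, h)
		return true
	}
	return false
}

func main() {
	entry := []byte("example entry")
	acceptCommit(Commit{EntryHash: sha256.Sum256(entry)})
	fmt.Println(acceptReveal(Reveal{Content: entry})) // true
}
```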
This is not a complete list of observations that could be made, but an interesting set of them that, for us, suggests we should move forward with the refactor/rewrite effort.

So far, our observations are being worked into designs and summarized in the following documents:
@Core Committee
 
Secured
#2
This is not a complete list of observations that could be made, but an interesting set of them that, for us, suggests we should move forward with the refactor/rewrite effort.
This is a lot of information and I will have more thoughts about it in the future; for now, some quick questions:

Is the planned refactor/rewrite limited to what is described in the document (the processes), or does it also include a lot of the supporting code, like package structures, marshalling boilerplate, and things like that?

Have you identified the reason the current bottleneck exists, and does this proposed consensus rewrite fix that bottleneck?
 
Secured
#3
Not sure what bottleneck you are referring to exactly, but what the refactor does address is the fact that we do most of the consensus work within one thread, multiplexed across many other tasks. After the refactor/rewrite we get rid of the multiplexing behavior and can process in parallel. This simplifies the code and ensures we can use all the computing resources of a system running a Factom node.
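As a rough illustration of the difference (hypothetical types, not the factomd implementation), the current model conceptually multiplexes all VMs through one loop, while the refactor gives each VM its own goroutine fed by a channel, so VMs make progress independently on separate cores:

```go
package main

import (
	"fmt"
	"sync"
)

// Msg is a stand-in for a consensus message routed to one VM.
type Msg struct {
	VM      int
	Payload string
}

// Today (conceptually): one loop handles every VM in turn, so slow work
// on one VM delays work for all of them.
func runMultiplexed(msgs []Msg) {
	for _, m := range msgs {
		process(m)
	}
}

// After the refactor (conceptually): each VM gets its own goroutine and
// channel, so the VMs are processed in parallel.
func runParallel(msgs []Msg, numVMs int) {
	chans := make([]chan Msg, numVMs)
	var wg sync.WaitGroup
	for i := range chans {
		chans[i] = make(chan Msg, 16)
		wg.Add(1)
		go func(in chan Msg) {
			defer wg.Done()
			for m := range in {
				process(m)
			}
		}(chans[i])
	}
	for _, m := range msgs {
		chans[m.VM] <- m
	}
	for _, c := range chans {
		close(c)
	}
	wg.Wait()
}

func process(m Msg) { fmt.Println("vm", m.VM, m.Payload) }

func main() {
	msgs := []Msg{{0, "commit"}, {1, "reveal"}, {0, "ack"}}
	runMultiplexed(msgs)
	runParallel(msgs, 2)
}
```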
 
Secured
#4
Not sure what bottleneck you are referring to exactly
The bottleneck that's preventing the theoretically infinite TPS. Is the consensus code what's currently limiting TPS, or is it something else? Let's say the consensus code right now can handle twice as much TPS but isn't, because it's not getting messages from the network fast enough; then the bottleneck would be the network. Maybe the bottleneck is disk I/O.

I want to know whether it's more of a "yes, there is a bottleneck in the consensus code; it can't handle X TPS even though the network regularly goes up to 3X TPS; a rewrite will improve that, and when done we can demonstrate that we can handle 3X or even 5X TPS" scenario than a "well, it should improve TPS" one.
 
Secured
#5
The bottleneck that's preventing the theoretically infinite TPS. Is the consensus code what's currently limiting TPS, or is it something else? Let's say the consensus code right now can handle twice as much TPS but isn't, because it's not getting messages from the network fast enough; then the bottleneck would be the network. Maybe the bottleneck is disk I/O.

I want to know whether it's more of a "yes, there is a bottleneck in the consensus code; it can't handle X TPS even though the network regularly goes up to 3X TPS; a rewrite will improve that, and when done we can demonstrate that we can handle 3X or even 5X TPS" scenario than a "well, it should improve TPS" one.
Right now we are not at a bottleneck with the network, in our estimation. We are currently bottlenecked by multiplexing consensus across all VMs, and all consensus tasks, in a single thread. The rewrite allows us to use more dedicated threads for well-defined tasks.
 
Secured
#9
Also, this is more about the backend than the APIs. If we have a hard fork, it will be because we change messages or something about the structures. Right now, I think what is on the table is changing the messages so that commits and reveals can be sorted together in a sharded network; today, a commit doesn't reveal enough information to know where the reveal is going to show up.

There may be a few other things, like creating expansion fields in messages to support soft forks of features and the like. But what we are saying here is that the current efforts are not going to fork the messages or the APIs.
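For example (purely hypothetical field names, not a spec), a sharded network might want the commit to carry a routing hint such as the chain ID, plus a reserved expansion area for soft-forked features:

```go
package main

import "fmt"

// CommitEntryV2 is a sketch (hypothetical, not a spec) of a commit message
// extended so a sharded network can route the commit and its later reveal
// to the same place. Today's commit carries only the entry hash, which is
// not enough to know where the reveal will show up.
type CommitEntryV2 struct {
	EntryHash [32]byte // identifies the entry, as today
	ChainID   [32]byte // hypothetical routing hint: shard by chain
	Expansion []byte   // hypothetical reserved field for soft-forked features
}

func main() {
	c := CommitEntryV2{}
	fmt.Printf("commit routed by chain %x\n", c.ChainID[:4])
}
```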

@FactDev @Alex