Successful [Factom Inc.-18] Protocol Development

Was this grant successful?


Have not voted

Authority Nodes: BI Foundation, Blockrock Mining, BlockVenture, Consensus Networks, HashQuark, Kompendium, VBIF

  • Total voters: 22
  • Poll closed.

Chappie

Factomize Bot
This is your grant tracking thread. Below, you will find information from your original grant.

Grant Proposal
https://factomize.com/forums/threads/factom-inc-18-protocol-development.2286/

Sponsor(s)
User: Nolan Bauer
FCT address: FA2oecgJW3XWnXzHhQQoULmMeKC97uAgHcPd4kEowTb3csVkbDc9
FCT: 300

User: Factomatic
FCT address: FA2944TXTDQKdJDp3TLSANjgMjwK2pQnTSkzE3kQcHWKetCCphcH
FCT: 300

User: PaulSnow
FCT address: FA3LwCDE3ZdFkr9nE1Keb5JcHgwXVWpEHydshT1x2qKFdvZELVQz
FCT: 300

ANO / Committee
Group: Factom Inc.
FCT address: FA3LwCDE3ZdFkr9nE1Keb5JcHgwXVWpEHydshT1x2qKFdvZELVQz
FCT: 38040

Total FCT Requested
38940

Start Date
2019-09-09

Completion Date
2019-12-09

Success Criteria
- Maintenance of MainNet - Pauses and Issues resolved
- Refactor Code Development - Continued Development of refactoring

Timelines and Milestones
Various factomd releases as stable updates are developed - 9/9/2019 - 12/9/2019

Budget
This grant asks for $50,720 per month over 3 months totaling $152,160 for the development and support portions. Assuming a FCT price of 4 $/FCT would give a grant amount of 38,040 FCT for the Factom, Inc. portion.

There is also oversight on this grant being provided by the Sponsors, there is 300 FCT each for a total of 900 FCT.

The grand total for this grant comes out to 38,940 FCT.
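The budget figures above can be checked with a short calculation. This is only a sketch in Go; the constant and function names are made up for illustration, and the numbers are the ones stated in the budget text.

```go
package main

import "fmt"

// Figures taken from the budget text; names are illustrative only.
const (
	monthlyUSD     = 50720 // requested per month, in USD
	months         = 3     // length of the grant period
	assumedPrice   = 4     // assumed USD per FCT
	sponsorFCTEach = 300   // oversight payment per sponsor
	sponsors       = 3     // number of sponsors
)

// factomIncFCT converts the total USD request into FCT at the assumed price.
func factomIncFCT() int { return monthlyUSD * months / assumedPrice }

// grandTotalFCT adds the sponsor oversight payments to the Factom, Inc. portion.
func grandTotalFCT() int { return factomIncFCT() + sponsorFCTEach*sponsors }

func main() {
	fmt.Println("USD total:", monthlyUSD*months)
	fmt.Println("Factom, Inc. FCT:", factomIncFCT())
	fmt.Println("Grand total FCT:", grandTotalFCT())
}
```

The results match the 152,160 USD, 38,040 FCT, and 38,940 FCT figures quoted above.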
 
Hi @PaulSnow, @Brian Deery, and @Paul Bernier:
During the last grant round, Factom Inc committed to decentralizing the restart process. More specifically, Paul Snow committed to working with @Paul Bernier. For many ANOs, this commitment was a sticking point with regard to approving Factom Inc's grant, so I wanted to follow up to see how much progress has been made.

Thanks!
 
I was waiting for Factom Inc. to give an update as this is their grant, but as I was explicitly tagged I can also give an update from my point of view:
  • I got access this Monday to a web UI that displays data from the swarm and is used to diagnose a network pause.
  • I need to schedule a call with Steven next week to get an overview of how to read and interpret this UI.
  • My understanding is that I will get access to another tool that allows triggering restarts. This is pending the integration of an authentication system similar to the one used for the other web UI.
 
Final Report for [Factom Inc.-18] Protocol Development grant.



The September 9 - December 9, 2019 Factom, Inc. Protocol Development Grant (018) was successful and accomplished the goals set out at the beginning. Both of the metrics set out in the grant were met: a serious network pause was resolved and progress was made on the refactoring effort.



The biggest event during the grant period was a nearly daylong pause in the network starting November 21. A day or so before the pause, anomalies were detected on the network during the QA process for a future version of factomd. There were multiple federated servers that were being faulted out, and network followers were unable to catch back up with the network after being restarted. It was clear that something bad was going on, and analysis was in full force. By the end of the day, developers left the office apprehensive about the immediate network health.

The pause started around 7 pm Texas time November 21. This was the first network pause where the diagnosis and recovery discussions were held in a protocol stakeholder’s voice chat, allowing core committee members and others with important contributions to follow along with the debugging process.

This link shows the gap in time for the network pause. https://explorer.factom.com/dblocks/44cd12db574f4dc091c618b081e031f025ad0f95c49b8f8042596fade902d0ff

This branch contains some of the git commits leading up to and resolving the network pause. The talktalktalk suffix in the branch name indicates that it was deployed onto the backhaul network. It has changes that increase p2p chattiness so that federated servers can communicate after 1 hour has elapsed since the last block was built.
https://github.com/FactomProject/factomd/commits/FD-1251_MSGFilterTimeIssues_talktalktalk

The cause of the network troubles was narrowed down to a Pegnet mining pool running software that wrote Entry Commits with an odd choice of timestamp: the Commits were stamped exactly 1 hour in the past. The code only treats these messages as valid within 1 hour either side of the current time, so Commits sitting exactly on the edge caused problems. As the messages propagated across the p2p network, they arrived at different nodes at different times, some before and some after the cutoff, so servers disagreed on whether a message was valid. This created consensus failures between federated servers, resulting in them faulting each other out.

ANOs did not need to upgrade to resolve the pause, as the Pegnet pool operator was located and upgraded their pool software to use the current time rather than an hour in the past. This gave ANOs enough time to deploy the timing fix along with the grants release. https://github.com/FactomProject/factomd/pull/934 The pause was resolved and the network was progressing by 4pm the following day.

There were a few other pauses during the grant period that were responded to promptly. They occurred on September 28th, October 18th, and November 24th. These other pauses did not require as intensive a debugging effort as the Nov 21 pause.

These were examples of how Factom, Inc was fulfilling the maintenance side of the Development Grant.

Another thing that was requested during the grant proposal process was to distribute the restart process. This objective was also met.

During the grant period, both a restart system and a diagnosis tool were set up to allow some types of diagnosis and recovery without needing assistance from Factom, Inc. engineers. While these tools can use some improvement, they are a step in the right direction and show that we are listening to the needs of the community.


Community development efforts also continued through this grant period. Weekly community developer standups helped coordinate efforts with non-Factom, Inc people. Regular attendees were Sander and Laurens from the Sphereon team, as well as Michael from Factable. David K also attended regularly in his capacity as a grant sponsor.

There were several releases during the grant period. While the releases were based on the current codebase, they included improvements that will feed into the refactored codebase. The last bug in Xuan was found. It was building coinbase transactions improperly.

The release codenamed Post-it (v6.4.4) was given a load test. It passed the test on the testnet at 20-25 Entries per Second.

The A4 release was started during the grant period. It contains another precursor to the refactor: an improvement to the inter-process queue management.

The refactor itself also began during the grant period. An initial design was released: https://drive.google.com/file/d/19mNQJlV9ehgbDQphlOQh6hmKF5XOweT8/view
Implementation began on some elements of the refactor. The updates are going into the Wax branch. An effort that began during the grant period was to pull the leader functions out of the traditional code flows and separate them out to their own thread.
https://github.com/FactomProject/fa...=dc366a41a985e438f83fbd4553654e1b280b4334+104

The refactored code will rely heavily on a publisher-subscriber model. This can be visualized in this thread spawning graph. http://arborjs.org/halfviz/#/MTI2ODk
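For readers unfamiliar with the pattern, a publisher-subscriber model in Go can be sketched with channels as below. The `Publisher` type and its methods are hypothetical names for illustration and are not taken from the factomd codebase.

```go
package main

import (
	"fmt"
	"sync"
)

// Publisher fans each published message out to every subscriber channel.
// Hypothetical type for illustration only.
type Publisher struct {
	mu   sync.Mutex
	subs []chan string
}

// Subscribe registers a new buffered channel and returns it for reading.
func (p *Publisher) Subscribe(buffer int) <-chan string {
	ch := make(chan string, buffer)
	p.mu.Lock()
	p.subs = append(p.subs, ch)
	p.mu.Unlock()
	return ch
}

// Publish sends msg to every subscriber; it blocks while a subscriber's buffer is full.
func (p *Publisher) Publish(msg string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, ch := range p.subs {
		ch <- msg
	}
}

// Close closes all subscriber channels so that readers' range loops finish.
func (p *Publisher) Close() {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, ch := range p.subs {
		close(ch)
	}
	p.subs = nil
}

func main() {
	var pub Publisher
	var wg sync.WaitGroup
	for _, name := range []string{"leader", "follower"} {
		ch := pub.Subscribe(4)
		wg.Add(1)
		// Each subscriber runs in its own goroutine.
		go func(name string, ch <-chan string) {
			defer wg.Done()
			for msg := range ch {
				fmt.Println(name, "received", msg)
			}
		}(name, ch)
	}
	pub.Publish("block 1")
	pub.Publish("block 2")
	pub.Close()
	wg.Wait()
}
```

Each subscriber only sees the messages delivered on its own channel, so components such as the leader logic can run in separate threads without sharing state directly.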

One part of the refactor is the use of code templates. Since inheritance is not a feature of golang, we instead used software that takes templates and generates code from them. This will help to ease modification and reduce mistakes when updating large portions of code. https://github.com/FactomProject/factomd/tree/FD-1220_go_generate_queues_clay/factomgenerate

The Sphereon team made significant progress on the livefeed API during this grant period, but it lacked a key feature. In order to use the livefeed API to deliver blockchain information if there is a network disruption or a system reboot, it needs to replay historical data. Factom, Inc. implemented an extension onto the livefeed that allows playback and increases the usefulness of the API. https://github.com/FactomProject/factomd/commits/AP-484-2


The buildout of the new refactored code progressed well over the grant period. There were multiple people working on the grant over the period, exceeding the 3-4 person estimate. During the grant period, employees working on the protocol grant were Clay, Steven, Matt Y, Paul, Brian, Veena, Mike B, Daniel R, Sam, & Justin.

The goals for both the maintenance and the refactoring efforts were well exceeded during the grant period, and we self-evaluate the performance of this grant as a 7. Thank you for your support.
 

Chappie

Factomize Bot
The final determination poll has been created, and will be open for 5 days. Use the following rubric when scoring:

Exceptional (9.0 - 10.0) - Successful
Overachieved (7.0 - 8.9) - Successful
Achieved (5.0 - 6.9) - Successful
Underachieved (2.0 - 4.9) - Failure
Total Failure (0.0 - 1.9) - Failure
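The rubric above can be expressed as a small lookup. This is a sketch only: the function name `classify` is made up, and treating each band as starting at its lower bound (so that a score such as 6.95 counts as Achieved) is an assumption, since the rubric does not say how scores between bands are handled.

```go
package main

import "fmt"

// classify maps a 0-10 score onto the rubric band and its outcome.
// Bands are assumed to start at their stated lower bound.
func classify(score float64) (band string, successful bool) {
	switch {
	case score >= 9.0:
		return "Exceptional", true
	case score >= 7.0:
		return "Overachieved", true
	case score >= 5.0:
		return "Achieved", true
	case score >= 2.0:
		return "Underachieved", false
	default:
		return "Total Failure", false
	}
}

func main() {
	// Scores mentioned in this thread: self-evaluation, sponsor score, final poll score.
	for _, s := range []float64{7, 6, 6.48} {
		band, ok := classify(s)
		fmt.Println(s, band, ok)
	}
}
```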
 
The Sphereon team made significant progress on the livefeed API during this grant period, but it lacked a key feature. In order to use the livefeed API to deliver blockchain information if there is a network disruption or a system reboot, it needs to replay historical data. Factom, Inc. implemented an extension onto the livefeed that allows playback and increases the usefulness of the API. https://github.com/FactomProject/factomd/commits/AP-484-2
Just wanted to chime in on the above statements a bit. Similar functionality is available in the 2nd layer (subscriber) of the livefeed API system. We wanted to keep the impact and state keeping on the factomd side as minimal as possible, with the other side of the livefeed API doing the multiplexing and state keeping. I am not saying one is better than the other, but we made a deliberate choice in the design. Adding some of it to factomd as well opens up more possibilities, but I would not classify it as adding usefulness to the API. These are just different approaches, as we designed the system mainly for 2nd layer solutions.
 
Final Report

This grant officially finished on December 9, 2019 and covers the three months prior to that date. Our final report contains a summary of the work completed, a timeline of major events, and our recommended score as sponsors.

The grant application is available here. There were two stated goals and success criteria:
  1. Respond to any issues impacting the network (as priority) to ensure the network continues to operate.
  2. Refactor code continued development

Factom Inc’s self-determination of this grant is a success (7/10). There is one technical report available for the month of October, which can be viewed here. The final report is available here.

Major Events

Sep 10, 2019 - Start of Grant period
Sep 28, 2019 - Network Pause + Resolution
Oct 11, 2019 - Release of Xuan, v6.4.3
Oct 18, 2019 - Network Pause + Resolution
Nov 21, 2019 - Network Pause + Resolution
Nov 24, 2019 - Network Pause + Release of v6.4.5 (fixing underlying reasons for the network pause)
Nov 28, 2019 - Release of v6.5.0 (grant activation)
Dec 9, 2019 - End of Grant period

General Impression

As sponsors, we are concerned about the continued network pauses, of which there have been four (including one that lasted almost 9 hours) throughout the grant period. There is undoubtedly plenty of room for improvement in this respect, and we expect to see the network in a more stable state as the refactoring progresses.

Factom Inc. has been very responsive throughout those pauses and has managed the recovery of the network in a timely manner, thus completing one of the stated goals of this grant.

During the period of this grant, Xuan was also released, which is a major update containing several network stability fixes, including improved sequential elections. A subsequent release (v6.4.4) underwent a load test on testnet passing with ~20-25 Entries per second.

Refactoring has also been progressing, although the major improvements post-Xuan are yet to be deployed on mainnet. One of the main undertakings is isolating different aspects of factomd, such as consensus-related functionality, into separate threads.

Thread management is one of the main strengths of Golang and this change is expected to make the code simpler and more performant and also easier to reason about for developers and maintainers. This work is also a necessary precursor to sharding. It is also a complex process, which included a lot of analysis and evaluation of the existing codebase, planning and architecting before the implementation could commence. We can attest to the significant effort that is being put towards the refactoring improvements in factomd.

Scoring

As sponsors, we believe Factom Inc adequately completed the stated goals during this grant period and a score of 6 is appropriate, meaning “Achieved”.


Thank you for reading,

David, Nikola, Nolan & Valentin
 

Chappie

Factomize Bot
The final determination poll has now closed. The final score is 6.48, with 21 total counted votes. The grant has been determined to be successful.
 