Testnet Framework

Hi @Matt Osborne.

I'll answer as nobody else has stepped up yet.

Short version is that some of us have been discussing this in the core committee the past week(s).

@Alex and I have created a very short document detailing some of the aspects we ought to test with new code.

It currently looks like this:

Load testing - steadily increasing TPS until network failure (a rough load-driver sketch follows this list). The following items should be tested whilst load is high:
  1. Elections
  2. Brain-swaps
  3. Restarts
  4. Split versioning (i.e. how the testnet handles multiple versions of factomd)
  5. Verify entries are not getting dropped at higher load
Boot performance.
  1. CPU at boot
  2. Minutes after coming out of ignore.
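To make the load-ramp step concrete, here is a minimal sketch of a load driver that steadily increases TPS and stops once submissions start failing. It is only an illustration under stated assumptions: submitEntry and the ramp parameters are placeholders, not part of any existing tool or API.

```go
// Rough sketch of the "steadily increasing TPS until network failure" step.
// submitEntry is a placeholder for the real entry-submission call
// (e.g. whatever client library the load tool already uses).
package main

import (
    "fmt"
    "time"
)

func submitEntry(seq int) error {
    // ... call the real client here ...
    return nil
}

func main() {
    tps := 1            // starting rate
    step := 1           // TPS added per ramp interval
    hold := time.Minute // hold each rate before stepping up
    seq := 0

    for {
        deadline := time.Now().Add(hold)
        ticker := time.NewTicker(time.Second / time.Duration(tps))
        failures := 0
        for time.Now().Before(deadline) {
            <-ticker.C
            seq++
            if err := submitEntry(seq); err != nil {
                failures++
            }
        }
        ticker.Stop()
        fmt.Printf("held %d TPS for %s, %d submission failures\n", tps, hold, failures)
        if failures > 0 {
            // network is struggling; stop ramping and run the checks in the list above
            return
        }
        tps += step
    }
}
```

In practice "network failure" would probably also be judged from node logs, stalls and missed elections rather than submission errors alone.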

In addition we want to verify that the load-testing entries actually are included in the blocks being built and not dropped due to the high load.
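A similarly minimal sketch of that check, assuming the load tool records the entry hashes it submitted and that entryHashesInBlock stands in for however the entries included at each block height are actually listed (both names are made up for illustration):

```go
// Sketch of checking that submitted entries were included and not dropped.
package main

import "fmt"

// entryHashesInBlock is a stand-in for querying the node under test
// for the entry hashes included at a given block height.
func entryHashesInBlock(height int) []string {
    // ... fetch from the node ...
    return nil
}

func main() {
    // filled in by the load tool as it submits entries
    submitted := map[string]bool{}

    startHeight, endHeight := 1000, 1100 // test window (example values)
    included := 0
    for h := startHeight; h <= endHeight; h++ {
        for _, hash := range entryHashesInBlock(h) {
            if submitted[hash] {
                included++
                delete(submitted, hash)
            }
        }
    }
    fmt.Printf("%d entries included, %d missing (possibly dropped)\n", included, len(submitted))
}
```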

We also need to figure out who should perform the testing (I'd like the testnet admin to take the lead on this, or at least be responsible for delegating it and formally reporting to the core committee).

I'd also like @Brian Deery to sign off on the testing suite so we are all in agreement about what should and should not be tested.


Further, we need to create a spreadsheet test plan that can easily be copied and executed. I have volunteered to do this and it is on my to-do list for this or next week. I'll also write a short process document describing the administrative side of testing.

Tor
 
I have created a draft of the test plan based on the items Alex and I identified.

Before locking it in we'd need buy-in from the core committee, the testnet admin and, specifically, @Brian Deery.

A short overview:

The spreadsheet is hosted on Google Drive and contains a template sheet. The idea is to make a copy of this template for every new factomd release that will undergo testing.
[Screenshot: sheets.png]



The spreadsheet is broken down into 4 general steps:

- Controlled update of 2 Authority nodes to determine whether to proceed with the test.
- Testing on a partially updated network.
- Testing on a fully updated network (minimum 80% updated).
- Reporting back on findings and results.


[Screenshot: overview.png]


The different sections are broken down into steps/tasks:

[Screenshot: test1.png]



Each task can be expanded to reveal specific sub-tasks which should be executed as written, and then marked passed/failed:

[Screenshot: 1_1.png]
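If anyone wants to mirror that structure outside the spreadsheet (for scripting or reporting), here is a small sketch of how a task and its pass/fail sub-tasks could be represented; the type names and sub-task descriptions are made up for illustration and are not taken from the actual sheet:

```go
// Sketch of the test-plan structure as data: each task expands into
// sub-tasks that are executed as written and marked passed/failed.
package main

import "fmt"

type SubTask struct {
    Description string
    Passed      *bool // nil = not executed yet
}

type Task struct {
    Name     string
    SubTasks []SubTask
}

func main() {
    pass := true
    task := Task{
        Name: "1.1 Controlled update of 2 Authority nodes",
        SubTasks: []SubTask{
            {Description: "Update node A and confirm it rejoins the network", Passed: &pass},
            {Description: "Update node B and confirm it rejoins the network"}, // not executed yet
        },
    }

    for _, st := range task.SubTasks {
        status := "pending"
        switch {
        case st.Passed == nil:
            // leave as pending
        case *st.Passed:
            status = "passed"
        default:
            status = "failed"
        }
        fmt.Printf("%s | %s: %s\n", task.Name, st.Description, status)
    }
}
```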



The most comprehensive testing is done first on a partially updated network (50%), and then on a "fully updated network" (80% or more):

[Screenshot: 2.png]
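As a rough illustration of those thresholds, here is a small version-coverage check of the kind that could gate the 50% and 80% phases; the version strings and node counts are example values only, and in practice the numbers would come from whatever monitoring the testnet admin already uses:

```go
// Sketch of a version-coverage check for the partial (>=50%) and
// full (>=80%) update phases of the test plan.
package main

import "fmt"

// coverage returns the fraction of nodes running the target version.
func coverage(versions map[string]int, target string) float64 {
    total, updated := 0, 0
    for v, n := range versions {
        total += n
        if v == target {
            updated += n
        }
    }
    if total == 0 {
        return 0
    }
    return float64(updated) / float64(total)
}

func main() {
    // hypothetical snapshot of the authority set by factomd version
    versions := map[string]int{"v6.2.2": 4, "v6.2.3": 19}

    c := coverage(versions, "v6.2.3")
    fmt.Printf("coverage %.0f%% - partial phase (>=50%%): %v, full phase (>=80%%): %v\n",
        100*c, c >= 0.5, c >= 0.8)
}
```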




As Rene wrote above, the Testnet Admin should be responsible for coordinating and ensuring that the test plan is executed, but may of course delegate specific tasks to testnet ANOs and others as convenient and applicable.


As of now this is a draft, and I would appreciate getting the appropriate feedback before 6.2.3 is released so we can start using it.


The spreadsheet is available here.
 
I support that suggestion. We originally had a minimum spec prior to M3, as I seem to recall. There were no problems or complaints at the time, and I do not see any problems arising if we start to enforce it again now.
 
I'll take it to the core committee.

It would remove 10 servers from the current authority set of 23 servers, which is a substantial amount. I don't have the technical insight to know if a smaller but more robust authority set is better than a larger one.
There are followers that can be added to the Authority Set (Factomize has one), and I'm sure at least a few of the testnet servers would upgrade.
 
There are followers that can be added to the Authority Set (Factomize has one), and I'm sure at least a few of the testnet servers would upgrade.
Yes, I'm aware of that. I was thinking in a "short-term perspective", i.e. if it were enforced right away. If core doesn't want to take a 10-server loss right away, we could have a 2-week period where people retire under-specced servers and replace them with new ones.
 