Hello Steemians, it’s been a long couple of weeks, which is precisely why it was so important that we hold an engineering retrospective while the important events were still fresh in our heads.
For those who aren’t already aware, we perform monthly retrospectives during which we systematically reflect on how we function as a team, with the goal of continuously improving our processes. We want Steemians to have as much insight into what we are doing as possible, so today we’d like to share a summary of what we discussed in our most recent retrospective, which covered the past month. If you would like to see last month’s retrospective, go here.
All retros use the same format in the same sequence, starting with “what went well,” so if you just want to read about what we think we did wrong, you can feel free to skip to that section ;)
What went well?
- We continued to make good progress on SMTs, remaining ahead of schedule
- Most of the backend work in Hivemind for Communities was completed
- Preparation for the front end development for Communities began
- HF21 occurred (certainly more about this later)
- We released video interviews with some of our engineers, which were well received
- Testing for HF21 was much better than for HF20 (or any other previous hardfork) in that it unearthed a number of bugs that would have made hardforking even more difficult
- Despite the difficulties associated with the hardfork, the community seemed less anxious about the temporary interruption of services. We believe this was because the changes were so heavily directed by the community, and because communications were so much more extensive leading up to the hardfork
- The economic changes already appear to be having a positive impact on Steem
- The proposal system seems to be inspiring users to come up with new ways to add value to Steem
- Whether due to the changes included in the hardfork, or the intent behind those changes, it would appear that a non-trivial number of inactive users, including influential users, have become active once again
- We feel that our relationship with the Witnesses has become more collaborative and improved generally. A consequence of this is that we are better able to work together to come up with solutions, form a consensus, and implement necessary changes. This enabled us all to respond to the delegation bug extremely rapidly by releasing HF22
- Tests performed on our seed node (or “exchange node”) proved useful
- In-memory MIRA replays actually work on our account history config (as opposed to a full node) and are surprisingly fast
- Communications on Twitter and Steemit during the outages were better than they have been in the past
What could have gone better?
- Communications can always be better, especially during a crisis
- CI issues for steemd caused longer build times
- SPS API calls could be easier to work with. It would have been great to have a separate service that could handle the data on release day. Another option might be to handle a lot of this in client libraries
- Calculations that we thought were safe from overflow actually were not. This led to a chain halt and problems with certain operations on chain
- For the purposes of improved debugging, newer code could have been wrapped in FC_CAPTURE_AND_RETHROW
- The growth of the chain has made reindexes take a very long time
- While in-memory MIRA replays were surprisingly fast, migrating state to disk took much longer than expected, effectively neutralizing the unexpected benefit that could accrue from in-memory MIRA replays
- The challenges that have arisen out of hardforks have placed an abnormal, and unacceptable, burden on engineers. This is not only unfair to the engineers, but also leads to fear and anxiety about future hardforks. While Steem’s facility with respect to system upgrades is a feature we believe should be exploited, we must dedicate more effort to ensuring that this can be done in a way that sufficiently considers the psychological well-being of not just engineers, but community members, stakeholders, users, exchanges and Witnesses.
- Tests should be instrumented to exercise integers with values high enough to trigger overflow situations
- Saving only the state files from the last 5 days is insufficient in the lead-up to hardforks
- We should consider setting up a system to archive historical state files for a very long time
- @vandeberg and @gerbino need more fast local storage so that they can debug live nodes locally
- Platform independent state files, which were already part of the SMT spec, would have dramatically reduced downtime
- In-memory MIRA replays should be further optimized
- We need to profile reindexes and consider optimizing the business logic
- MIRA itself could benefit from further optimizations
- We should explore how we can optimize reindexes or engineer future releases so that reindexes are not needed
- We need better testnet infrastructure. Tinman should be copying values that are as close to 1:1 with the mainnet as possible. Delegations should also be copied to the testnet
- We must review SMT vesting calculations via tests and code inspection to ensure there is no overflow
- We should separate production deployment code from the steemd repo so that config/deployment changes do not require a rebuild
- We should investigate whether a debug build for a seed node is capable of keeping up with the live chain to a degree that will be usable
- We should consider on-call rotations for coverage to relieve the pressure on other team members
- The blockchain team should take some time off as soon as they can, and consider planning on taking time off immediately prior to hardforks to be sufficiently rested in the event of a worst case scenario
- We should explore ways to expose more of our engineers to steemd code, including those who do not work on the back end. One way to do this might be regular “brown-bags” led by @vandeberg
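To make the overflow items above a little more concrete (the chain halt, the boundary-value tests, and the SMT vesting review), here is a minimal C++ sketch of the general idea. It is not steemd code: `checked_mul`, `checked_share`, and `test_boundaries` are hypothetical names we made up for illustration, and the real vesting calculations are more involved. The sketch simply assumes non-negative 64-bit amounts and shows one way to reject an overflowing intermediate product, along with a test that deliberately uses values near the top of the integer range.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from steemd): multiplies two non-negative 64-bit
// amounts. Returns false instead of overflowing, and writes the product to
// `out` only on success.
bool checked_mul(int64_t a, int64_t b, int64_t& out) {
    if (a < 0 || b < 0) return false;  // this sketch only handles non-negative amounts
    const int64_t max = std::numeric_limits<int64_t>::max();
    if (a != 0 && b > max / a) return false;  // a * b would not fit in int64_t
    out = a * b;
    return true;
}

// Hypothetical proportional share: amount * numerator / denominator,
// rejecting any case where the intermediate product would overflow.
bool checked_share(int64_t amount, int64_t numerator, int64_t denominator,
                   int64_t& out) {
    if (denominator <= 0) return false;
    int64_t product = 0;
    if (!checked_mul(amount, numerator, product)) return false;
    out = product / denominator;
    return true;
}

// Boundary-value test: exercises values at and just past the largest
// product that still fits, rather than only small "typical" amounts.
void test_boundaries() {
    const int64_t max = std::numeric_limits<int64_t>::max();
    int64_t r = 0;
    assert(checked_mul(max, 1, r) && r == max);          // exactly fits
    assert(checked_mul(max / 2, 2, r) && r == max - 1);  // largest even product
    assert(!checked_mul(max / 2 + 1, 2, r));             // one step too far
    assert(!checked_mul(max, 2, r));                     // clearly too large
    assert(checked_share(1000, 1, 4, r) && r == 250);    // ordinary case
    assert(!checked_share(max, 3, 3, r));                // result fits, product does not
}
```

The last assertion is the interesting one: the final answer would fit comfortably, but the intermediate multiplication would not, which is the kind of case that small test values never reach.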
This was by far our longest and most extensive retro yet, and for good reason. Few months have included such exciting developments and such difficult circumstances. We remain extremely excited about how Communities and SMTs are progressing, and believe that the preparations for HF21 were better than ever. That is part of what makes the downtime as a result of HF21 so disappointing. That being said, we do feel that we’ve come out of this experience with priceless information that can help ensure that the SMT hardfork proceeds more smoothly.
This post is only intended to summarize the results of our recent retrospective. We will continue to think very deeply about HF21/HF22: what went wrong, and what we can do better next time. We look forward to communicating more about this soon, so be sure to follow @steemitblog for more information.
Thank you for keeping calm and Steeming On.
The Steemit Team