Steemit Retro: August & HF21/22

2 months ago

Hello Steemians, it’s been a long couple of weeks, which is precisely why it was so important to hold an engineering retrospective while the key events were still fresh in our minds.

Retro Recap

For those who aren’t already aware, we perform monthly retrospectives during which we systematically reflect on how we function as a team with the goal of continuously improving our processes. We want Steemians to have as much insight into what we are doing as possible, so today we’d like to share with you a summary of what we discussed in our most recent retrospective which covered the past month. If you would like to see last month’s retrospective, go here.

All retros use the same format in the same sequence, starting with “what went well,” so if you just want to read about what we think we did wrong, you can feel free to skip to that section ;)

What went well?

  • We continued to make good progress on SMTs, remaining ahead of schedule
  • Most of the backend work in Hivemind for Communities was completed
  • Preparation for the front end development for Communities began
  • HF 21 occurred (certainly more about this later)
  • We released video interviews with some of our engineers, which were well received
  • Testing for HF21 was much better than HF20 (or any other previous hardfork) in that it unearthed a number of bugs that would have made hardforking even more difficult
  • Despite the difficulties associated with the hardfork, the community seemed less anxious about the temporary interruption of services. We believe this was because the changes were so heavily directed by the community, and because communications were so much more extensive leading up to the hardfork
  • The economic changes already appear to be having a positive impact on Steem
  • The proposal system seems to be inspiring users to come up with new ways to add value to Steem
  • Whether due to the changes included in the hardfork, or the intent behind those changes, it would appear that a non-trivial number of inactive users, including influential users, have become active once again
  • We feel that our relationship with the Witnesses has become more collaborative and improved generally. A consequence of this is that we are better able to work together to come up with solutions, form a consensus, and implement necessary changes. This enabled us all to respond to the delegation bug extremely rapidly by releasing HF22
  • Tests performed on our seed node (or “exchange node”) proved useful
  • MIRA in-memory replays actually work on our account history config (as opposed to a full node) and are surprisingly fast
  • Communications on Twitter and Steemit during the outages were better than they have been in the past

What could have gone better?

  • Communications can always be better, especially during a crisis
  • CI issues for steemd caused longer build times
  • SPS API calls could be easier to work with. It would have been great to have a separate service that could handle the data on release day. Another option might be to handle a lot of this in client libraries
  • Calculations that we thought were safe from overflow actually were not. This led to a chain halt and problems with certain operations on chain.
  • For the purposes of improved debugging, newer code could have been wrapped in FC_CAPTURE_AND_RETHROW
  • The growth of the chain has made reindexes take a very long time
  • While in-memory MIRA replays were surprisingly fast, migrating state to disk took much longer than expected, effectively neutralizing the unexpected benefit that could accrue from in-memory MIRA replays
  • The challenges that have arisen out of hardforks have placed an abnormal, and unacceptable, burden on engineers. This is not only unfair to the engineers, but also leads to fear and anxiety about future hardforks. While Steem’s facility with respect to system upgrades is a feature we believe should be exploited, we must dedicate more effort to ensuring that this can be done in a way that sufficiently considers the psychological well-being of not just engineers, but community members, stakeholders, users, exchanges and Witnesses.
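
To make the overflow item above concrete: steemd is written in C++, so the following Python is only a language-neutral sketch and not the project's code. It emulates a signed 64-bit range and rejects any intermediate product that would not fit, carrying the operands in the error so the failure is easier to diagnose. The names `checked_mul`, `proportional_share`, and `OverflowDetected` are hypothetical and chosen purely for illustration.

```python
# Illustrative sketch only: emulates signed 64-bit bounds in Python.
# All names here are hypothetical; none of them come from steemd.
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


class OverflowDetected(ArithmeticError):
    """Raised when a result would not fit in a signed 64-bit integer."""


def checked_mul(a: int, b: int) -> int:
    """Multiply two integers, failing loudly instead of wrapping around."""
    result = a * b  # Python integers are unbounded, so this never wraps
    if not INT64_MIN <= result <= INT64_MAX:
        # Include the operands so the failure can be traced to its inputs.
        raise OverflowDetected(f"{a} * {b} does not fit in int64")
    return result


def proportional_share(amount: int, numerator: int, denominator: int) -> int:
    """Compute amount * numerator // denominator with a checked product."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return checked_mul(amount, numerator) // denominator
```

A check of this kind turns a silent wraparound into an explicit error with context. A production fix might instead widen the intermediate value; which approach suits a given calculation is a design decision this sketch does not settle.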

Escalations

  • Tests should be instrumented to exercise integers with higher values that could possibly trigger overflow situations
  • Saving only the last 5 days of state files is insufficient in the lead-up to hardforks
  • We should consider setting up a system to archive historical state files for a very long time
  • @vandeberg and @gerbino need more fast local storage so that they can debug live nodes locally
  • Platform independent state files, which were already part of the SMT spec, would have dramatically reduced downtime
  • MIRA in-memory replays should be further optimized
  • We need to profile reindexes and consider optimizing the business logic
  • MIRA itself could benefit from further optimizations
  • We should explore how we can optimize reindexes or engineer future releases so that reindexes are not needed
  • We need better testnet infrastructure. Tinman should be copying values that are as close to 1:1 to the mainnet as possible. Delegations should also be copied to the testnet
  • We must review SMT vesting calculations via tests and code inspection to ensure there is no overflow
  • We should separate production deployment code from the steemd repo to prevent requiring a rebuild for config/deployment changes
  • We should investigate whether a debug build for a seed node is capable of keeping up with the live chain to a degree that will be useable
  • We should consider on-call rotations for coverage to alleviate other team members
  • The blockchain team should take some time off as soon as they can, and consider planning on taking time off immediately prior to hardforks to be sufficiently rested in the event of a worst case scenario
  • We should explore ways to expose more of our engineers to steemd code, including those who do not work on the back end. One way to do this might be regular “brown-bags” led by @vandeberg
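
As a rough illustration of the first escalation (exercising integers with values high enough to trigger overflow), the sketch below generates inputs clustered around powers of two and reports which input pairs push a function past the signed 64-bit range. It is written in Python for brevity and is not taken from the steemd test suite; `boundary_values` and `find_overflows` are hypothetical helper names, and the choice of boundaries is an assumption.

```python
# Illustrative sketch only: boundary-value inputs for overflow testing.
# Helper names and the chosen boundaries are assumptions, not steemd code.
from itertools import product
from typing import Callable, List, Sequence, Tuple

INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def boundary_values(bits: int = 64) -> List[int]:
    """Return non-negative test values near common overflow boundaries."""
    top = 2 ** (bits - 1) - 1  # largest signed value for this width
    candidates = {0, 1, top - 1, top}
    for shift in (16, 32, bits - 2):
        power = 2**shift
        candidates.update({power - 1, power, power + 1})
    # Drop anything that does not fit in the requested width.
    return sorted(v for v in candidates if 0 <= v <= top)


def find_overflows(
    fn: Callable[[int, int], int], values: Sequence[int]
) -> List[Tuple[int, int]]:
    """Return every input pair whose result falls outside the int64 range."""
    return [
        (a, b)
        for a, b in product(values, repeat=2)
        if not INT64_MIN <= fn(a, b) <= INT64_MAX
    ]


# Example: list the boundary pairs for which plain multiplication overflows.
overflowing_pairs = find_overflows(lambda a, b: a * b, boundary_values())
```

Feeding the same value list to each calculation under review gives a cheap first pass before deeper code inspection.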

This was by far our longest and most extensive retro yet, and for good reason. Few months have included such exciting developments and such difficult circumstances. We remain extremely excited about how Communities and SMTs are progressing, and believe that the preparations for HF21 were better than ever. That is part of what makes the downtime that resulted from HF21 so disappointing. That being said, we do feel that we’ve come out of this experience with priceless information that can help ensure that the SMT hardfork proceeds more smoothly.

Stay Tuned


This post is only intended to summarize the results of our recent retrospective. We will continue to think very deeply about HF21/HF22; what went wrong, and what we can do better next time. We look forward to communicating more about this soon, so be sure to follow @steemitblog for more information.

Thank you for keeping calm and Steeming On.

The Steemit Team


This is really well done.

I'd like to see a communications plan revamped in escalations. 24hrs between tweets when down seems less than ideal.

Also, a lot of the ecosystem is dead even if nodes are up if they aren't Steemit nodes. This place is too centralized on you guys and that needs to change too. It's great that you've been reliable for a year and people trust the service, but it's bad that things aren't working if you're not up and running.

Thanks for your hard work on this. This was rough, but still could have been much worse, and the down time is the price we pay for being able to upgrade the chain.

·

I'm not a widely active Twitter user myself, but the downtime did make me realise how large of a following Steem/Steemit has on there. As well as more activity during downtime, it would be great to see more communication in general.

Also, completely agree on the centralisation issue. Everything shouldn't crumble because Steemit Inc's stuff goes down. The more true decentralisation, the better.

·
·

Decentralization is good. What pushes Steem/blockchain into such an incredible technological leap forward is our ability to create platforms that can and will improve how humans communicate with each other. It's still very early days, though, and experimentation with these new systems is compulsory.

Can we have a noobs guide to setting up a mira node?

·

I'd love to see that as well!

·

me too :-D

·

There is already kind of one: steem in a box by @someguy123 supports MIRA, and it's as easy as setting a parameter to true and you'll be running MIRA.

  • Most of the backend work in Hivemind for Communities was completed

that's pretty cool

  • Preparation for the front end development for Communities began

that's huge!

·

Thanks for continuing to put yourself out there and sharing the retrospective.

As I care about the future of Steem, I am pleased with what I read here. I'm being really judgy when I can't talk (with my poor English graces), but it could use a little wordsmithing. For example, the word "should" needs more detail: why is it just a "should"? Please assume I'm not very technical. I assume the "shoulds" mean it's desirable and a lower priority but you will do it, or do you mean you'll just consider it more? If so, when? I know you're a pro team and have a concept of these things, but if you don't share... I feel a bit bad, as I can see such improvements coming through, but you need a bit more push from us users and lovers of Steem. Please continue to ignore the crappy complaints and keep taking the reasonable ones on board, and you have my vote.
I really appreciate the commitment to continual improvement. We as a community also need to improve and help you... Please help us to help you.
If you give me a UAT test, or even if you want me to create one, I'm happy to if it helps, just ask. Even better, post one with your announcement of the HF change window and its heightened chance of problems.

From the top of my head without enough techo background my only other feedback/thoughts are:

If you have a HF, the backout plan should include another fast HF
...your change window should include UAT testers throughout the community, and it's OK to suggest something like 'during the first week of a HF, can the community please report problems, as we have a heightened risk of outage and/or a speedy HF fix'... something like this. It's our blockchain as well. Let us be part of its future success and give you immediate and helpful feedback.

You then simply need one person collating all the UAT feedback and engaging with the community during the one-week change window with its heightened risk of outages and problems. (I think 1 week of people being more attentive, reporting issues, and expecting an outage for a global blockchain in the rare case of a HF is reasonable. Let the entire ecosystem be your UAT testers, as we are all beneficiaries.) It's also a great way to keep up two-way contact in a way that makes us feel more useful in our support.

Cheers and keep Steeming on!

Any info on deposits being changed?
I went to transfer some steem from my binance account and its been rejected. Is there a new method that I am unaware of after the update?

I am a professional who ran a software development firm for 20 years and has consulted for hundreds of companies, large and small, on mission-critical systems.

My assessment from the outside, without even having been in the design, planning, development and business meetings, is that this migration was a complete and abject failure.

You have a run or catch fire attitude towards development with out the infrastructure to support an AGILE Development infrastructure.

Your methodologies are those of hackers, with no adherence to ISO test standards.

I don't know who is in charge of QA there if anyone.
There obviously was not a well thought out test plan.
There obviously was no regression testing, a little unit testing maybe.
You don't have an adequate Test environment.
You had no bridge back and store and forward plan.
No one is looking at the overall system integration because things like reindexing times, can easily be calculated and planned for. And knowing how your initial new account staking works was not thought of.

The worst part was because you had no fall back procedures you implemented another fork patching stuff, instead of rolling back and testing.

This cost your user base loss of time and money.

On top of it all, your communications continue to suck, and your flagship product on the blockchain, Steemit.com, has not had a single needed MAJOR enhancement in 2 years. Jeez, you can't even change the color of your background.

And to top it off, your stuff is not working and you complain about the "abnormal, and unacceptable, burden on engineers" ...... which, I might point out, was caused by your own actions.

If you had slowed down and tested enough, released incrementally, and planned a temporary fall-back to regroup in case of a major SNAFU (which in this case happened), YOUR ENGINEERS WOULD NOT BE OVERWORKED AND COMPLAINING.

Whoever the managers and upper-level decision makers on this blockchain are, they are obviously not knowledgeable about what a full SDLC takes.

You are in a mature running system with real people's money involved here.
This is unprofessional and unacceptable.

I could send you simple admins from companies in the 200-300 person level who don't know squat about this or any other blockchain, who would have had a better migration and they earn only about 120k a year!

This is your retrospective... YOU just had an unmitigated disaster on your hands and you're whitewashing it with this post!

·

Hi @richatvns, I will try to tackle these points one by one.

You have a run or catch fire attitude towards development with out the infrastructure to support an AGILE Development infrastructure.

We actually follow the AGILE software development methodology very closely, especially when it comes to blockchain development. We conduct regular daily standups, backlog grooming, story pointing, and engineering retrospectives (the post you read here was a conversion of the results of such a meeting). Our blockchain development backlog / project board is public and can be seen here.

There obviously was not a well thought out test plan.

This Hardfork was tested both internally and externally by our witnesses and developers for over two months. Issues that otherwise would not have been caught were found and addressed as a result of this testing. Of course there are some things to add to future testing which were mentioned in this post. Mistakes can happen, but it is very important to make sure you aren't making the same mistakes twice.

There obviously was no regression testing, a little unit testing maybe.

Steem has some pretty extensive tests that are regularly updated as part of our software development process. You can take a look and see for yourself here.

No one is looking at the overall system integration because things like reindexing times, can easily be calculated and planned for. And knowing how your initial new account staking works was not thought of.

And this reindexing time was planned for - what could not be planned for was the unexpected necessity of reindexing after a failure had occurred. In all cases we try to write bug fixes for the Steem blockchain that do not require a reindex when possible, but, unfortunately, this one required one. For the original rollout, it was very well coordinated (the release was available 30 days in advance and the requirement to upgrade was well communicated to all major node operators).

As mentioned in this post, some things have already been planned to prevent long reindex times and they are part of the spec for the next major update, SMTs.

The worst part was because you had no fall back procedures you implemented another fork patching stuff, instead of rolling back and testing.

What you are referring to would be a chain rollback. On immutable blockchains, that would be highly undesirable. Once a hardfork has occurred, and new transactions that are part of consensus have been added to the ledger, it would be unethical to remove them unless there were no other option. This type of 'fix' should only be done in extreme circumstances. This is why 'Ethereum Classic' exists, for example. I don't think anyone wants a 'Steem Classic' scenario. Further, ALL nodes would have to decide to also 'roll back' and ignore transactions that had already occurred. It would not be our decision alone.

As far as the traditional front end and microservices development that we do, sure, in the event of a bug that makes it to production we can and do immediately rollback to the previous version. Blockchain development is quite a different thing though and cannot be treated the same way.

The blockchain entirely halting in the event of a bug that would break consensus logic or otherwise be detrimental to the chain is absolutely the correct behavior. The bug should be fixed, and the chain should be restarted and restored.

I could send you simple admins from companies in the 200-300 person level who don't know squat about this or any other blockchain, who would have had a better migration and they earn only about 120k a year!

No, I don't think that someone who doesn't know anything about any blockchain would do a better job for the reasons listed above. This is not a database migration or a simple deployment, this is a coordinated hard fork on an immutable blockchain. These things are not the same.

I think the tl;dr here is that a lot of your assumptions do not translate directly to blockchain development. If we were 'only' a traditional software development shop, I think some of your assumptions would be much more accurate. Anyway, I welcome criticism and wanted to address some of your points here today for the community. Thanks @richatvns

·
·

I think it is important to realize that going full in on agile for a system that aims to provide (or should aim to provide) High-Availability is a no-go. You need to strike a balance between the simulation, shadow-run and DTAP requirements posed by the HE setting and the desire for velocity that an agile CICD setup promises.

The important thing to realize is that being a blockchain project doesn't so much bring you into virgin territory on this as you might think. Modern non-blockchain high availability shops have been balancing the same scale for a long time and you guys should really find some friendly faces in shops like these to show you what they did to maintain HE while carefully moving to a slightly more agile approach.

Some things you might want to talk about with HE shop people:

  1. HE-DTAP
  2. Building representative high-volume test sets
  3. Building simulation/event-generator setups with model driven feedback loops
  4. Setting up an event-fork based simulation infrastructure for partial parallelization of the A and P.
  5. "A" centered CICD possibilities and limitations.

Basically, stop thinking you need to do things differently because you are a blockchain shop and realize you need to do things differently because you should be a HE shop. Talk to non-blockchain HE shops that have managed to combine aspects of CICD with HE-DTAP and see what you can learn from them.

My own knowledge on this is mostly centered around #3 and I'm most definitely no expert on the other four, so as much as I would like to help out on the bigger picture, my help won't be of much use until you guys get the bigger picture sorted. For that, some face to face time with a modern HE team I feel has the potential to really make a huge difference.

I see Steemit Inc is based in NY. I'm pretty sure there will be quite a few modern financial and tech sector HE shops there to provide you guys with some ideas on how to successfully marry Agile CICD with HE-DTAP. Try to find one or two willing to give you guys some insights into the way they try to strike this balance for their own shop.

·

I have no idea if the things you listed are accurate, but can you get a job at steemit team in Texas?

·

I think you have to be more careful. For one, as @justinw pointed out, rolling back is just not an option on the blockchain. On the other hand, if you ran such a firm for such a long time, you should know that you're never able to catch 100% of the bugs before you deploy in practice. The most important thing, though, is to make sure you don't make the same mistake twice and to be prepared with those tests in the future.

As you pointed out in your comment and as Steemit pointed out in their post, in terms of re-indexing after a failure they should've been more prepared and should've had more recent backups on hand to be able to react more quickly.

Similarly, as Steemit also wrote, the environment on the test servers should mirror the production environment more closely.

I do agree on the Steemit updates though, as the flagship of the Steem apps there should be more focus on its development. Currently they have their devs working on communities which is probably why they don't have any free hand for the Steemit main page.

Unfortunately, most of these things should've happened already years ago, but because a certain CEO spent more time with his hair than with his leadership nothing got done.
This year I'm more positive since Steemit Inc seems finally to be learning from the mistakes and are trying strongly to do a better job, include the community and communicate.

·
·

You are delegating to a downvote bot which downvotes me for nothing! That's why I have to downvote your comments! Best regards!

·
·
·

I delegate my steempower over dlease.io. Just downvoting my comment to revenge yourself against this downvote bot without considering the quality of my comment, without mentioning who this downvotebot is and without trying to contact me first is a pretty slick move. I hope you continue like this, we need more of this on this platform.

·
·
·
·

Yep, we definitely need more of those who lease their SP to downvote bots for something like 19% via dlease. I hope you also continue like this, we need more of this on this platform.

·
·
·
·
·

And he still didn't tell me who this alleged downvote bot is =D

·
·
·
·
·
·

You should know who you are delegating to, I am not your nanny.

·
·
·
·
·
·
·

Didn't even know how terrible downvote bots are, they must be making a lot of money on this platform nowadays.

·

Not a complete whitewash. I feel this one provides hope for the willingness to learn from HF21 things they failed to learn from HF20.

"The challenges that have arisen out of hardforks have placed an abnormal, and unacceptable, burden on engineers. This is not only unfair to the engineers, but also leads to fear and anxiety about future hardforks. While Steem’s facility with respect to system upgrades is a feature we believe should be exploited, we must dedicate more effort to ensuring that this can be done in a way that sufficiently considers the psychological well-being of not just engineers, but community members, stakeholders, users, exchanges and Witnesses."

As @justinw states, "Mistakes can happen, but it is very important to make sure you aren't making the same mistakes twice." While we could claim they did make the same mistake twice now, with both HF20 and HF21 failing to do sufficiently thorough testing and resulting in massive downtime (from a HE perspective), you could also look at it this way: the mistake was that they apparently learned the wrong lesson(s) from HF21, and are now ready to learn the right lesson.

As I wrote in my response to @justinw, I think Steemit Inc should go visit a few modern HE shops and see how they deal with agile, testing and CICD in a HE-DTAP setting. Part of their problem, I think, is that they believe they are somehow in close-to virgin territory when it comes to finding the golden balance between HE and agile paradigms because they are a blockchain shop. I think once they figure out they are not, they should be ready to lose what from a distance pretty much resembles a high-school science project approach to deployment.

Excellent retrospective, love the transparency, specifically under escalations. It is essential that they do get prioritized to avoid similar issues in upcoming HFs.

You forgot something. Communication and support for exchange nodes. I assume it is non-existent as it has been in previous forks and chain interruptions leaving exchanges without the ability to transfer STEEM in or out of the exchanges.

·

We are in constant communication with all active exchanges whenever required updates are necessary and are here to support them with anything that they may need. In general, most of them have been very quick to respond. We also took time earlier this year to update our exchange node setup guide and associated deployment scripts to prevent common issues. Further, we provide real time support for exchanges that are even in opposite timezones from us.

#newsteem on

·
·

That is the first time I've heard of this. Thank you very much for informing us. In the past I've seen exchanges take anywhere from 2 weeks to 6 months to get their node back in operation, which prevents their STEEM wallets from transferring and receiving STEEM. One exchange has never recovered since February of 2018. Witnesses can get their nodes back in operation in a matter of hours. Exchanges should be able to as well, provided that coordination is maintained. In fact, it should be easier for exchanges to keep their nodes running, as they don't need to keep track of social media info (only STEEM transfers). They should have a stripped-down version of the STEEM node software that is very easy to replay. The clock is ticking. How many days has it been since HF21/22, and exchanges still don't have operational nodes? I appreciate all you do and your thoughtful response. I just want to know if we are going to be waiting additional hours, days, weeks, months, or years for exchanges to get back online. I was reading a post a couple of days ago by someone who had just bought a bunch of STEEM. They were very excited to get in at this price and power up, but then discovered that they cannot move it from the exchange, so they are stuck in a waiting game. Eventually that person is going to get pissed off. You know, the STEEM community can take advantage of a hard fork. People read about the news and want to join, but then realize that they can't transfer their newly purchased STEEM to their account and get discouraged. It is a real shame that this has happened regularly whenever there has been a disruption of the STEEM blockchain, and especially during a pre-planned hard fork.

·
·

@justinw do you know if Binance or Bittrex have planned or disclosed ETAs for when they will make the upgrades necessary on their end? Keep us posted if you are provided any details. Thank you!

so are we any better off in the long-run ??

👍
~Smartsteem Curation Team

Personally, I will never delegate to voting bots, or knowingly accept support from bots. I am just a pure content deliverer, who is (admittedly) now posting a lot less energy-intensive stuff because my rewards have been further slashed.

With the price continuing to lag after a short-lived bounce, what can you point to as a true positive, stat-wise?

TIA.

Also, can we get a really detailed explanation of what HF21 did with regard to serial downvoting? I've got one (@bloom) that downvotes EVERY SINGLE POST I make, and I'd like to know if there will ever be any relief.

TIA.

@steemitblog,
Whatever happened on that crisis day, the team did a great job. And it improved our trust in the chain as well!
$trdo

Cheers~

·

Thanks! We appreciate the support!

·

Congratulations @theguruasia, you have successfully trended the post shared by @steemitblog!
@steemitblog got 6 TRDO & @theguruasia got 4 TRDO!

"Call TRDO, Your Comment Worth Something!"

To view or trade TRDO go to steem-engine.com
Join TRDO Discord Channel or Join TRDO Web Site

All things considered, Steemit team did a great job on the hard fork.

We should consider on-call rotations for coverage to alleviate other team members

We highly recommend this. While the blockchain never sleeps, humans unfortunately need to.

The blockchain team should take some time off as soon as they can..

Happy holidays, any chance we'll be seeing them lounging in Thailand come Steemfest? They really deserved it.

Great report, and on-call rotations are a very good thing to look into. Thanks, keep up the good work. Steem on!

Thanks for the great work!
There must have been a few smoking heads; I hope you make holidays the priority it has to be!

Any idea when the Binance exchange will allow Steem transfers?


·

I came here to ask the same thing. Binance, Bittrex, and OpenLedger are all confirmed locked. I think this is holding the price down, as buyers who are unable to withdraw become impatient and dump. I wrote a bit about it on my blog. I will post ETAs if I am able to determine any; please let me know the same. Thanks.

https://steemit.com/steemit/@minerthreat/the-answer-to-the-steem-price-mystery-could-be-exchanges-with-wallets-in-maintenance-mode

That's quite a lot of tasks on Steemit Inc's plate. Glad to see what the team is working on to make things better.

I don't really recall a lot of Steemians on Twitter during the downtime. It seemed like very few people were talking about it on Twitter in the first 24 hours. It felt almost like a regular website, not a blockchain.

Having our funds inaccessible even after the blockchain was up is unacceptable. Three and a half years in, Steem is still dependent on a single entity. That's just proof that Steem is still not ready to onboard the masses, and that's not the fault of the engineers. It's really the fault of the founder and CEO @ned

How long will it take approximately to fix the issues you uncovered in your retrospective?
Have you made a prioritization of these issues yet?
Will it take resources off the development of SMTs and Communities to fix the issues?

Please take 5 minutes every few hours to communicate in case of a crisis!!! The downtime was not a problem, but the communication was.

The search feature does not appear to be working. Anyone else having this problem?

·

Should be fixed shortly, thanks @jondoe

·
·

Good deal, thanks Justin. Do you guys have longer term plans to make steemit.com more sustainable going forward? Selling large chunks of steem every week to pay salaries is not going to last very long, though I am sure you guys are aware of that.

The economic changes already appear to be having a positive impact on Steem

can this be quantified?

I'm pretty new to Hivemind -- or participating in running a node for that matter.

I was looking at the git for it today, and it suggests the hardware requirements are:

Hardware:
* Focus on Postgres performance
* 2.5GB of memory for hive sync process
* 250GB storage for database

I see it says later 'good settings for a system w/ 16G memory'.

Is the storage requirement still ~250GB, and is 16G memory still adequate?


Side-question / clarification:

Setting up Hivemind (and sharing / making public) is considered adding/improving decentralization for applications (if the node is used)? Or does that require a witness/consensus-node?.... Or a combination of both?....

Or is that kinda sorta the same difference as bitcoins "nodes" and "miners" relationship?

I just came to read snarky comments.

·

But instead of doing that, you wrote one? ;)

·
·

nah..... just put a place holder to check back later for one.

·

I am available to go into the test-net next time and try those extreme variables.

Great recap, @steemitblog. It seems you are eager to improve workflows in all major areas, that's awesome.
Keep up the good work and communication!

Posted using Partiko Android

·

Yup, we’re always trying to improve. We never claimed to be perfect, just that we’re doing things no one has ever done before and that no one else is doing. That inevitably comes with unique challenges. We appreciate everyone who sticks with us as we become better at improving the world’s most advanced Web 3.0 protocol.

How to post better topics .....

amazing job for sure! Keep up with the good work!

Keep up good work!