Starting at 12am Pacific time on August 1, 2020, Hummingbot Miner experienced a 12-hour outage. Unfortunately, a 6-hour data hole during the outage prevented us from allocating rewards for bots that were running during those 6 hours.
What was the cause?
The issue was caused by heavy load on the Serializer, the component responsible for fetching order records from the database and reformatting them so that other components can process rewards. While fetching millions of orders from the database, it ran out of memory, which caused the outage. Unfortunately, the memory exhaustion and the resulting freeze persisted even when we replayed the order stream from our database, causing the 6-hour data hole.
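For illustration only: this post does not describe how the Serializer reads orders or how its code was changed. One general way to keep memory bounded when reading millions of rows is to fetch them in fixed-size batches rather than all at once. The sketch below uses Python's built-in sqlite3 module; the `orders` table, its columns, and the function names are hypothetical and are not taken from Hummingbot Miner.

```python
import sqlite3
from typing import Iterator, List, Tuple

# Hypothetical row shape: (id, market, amount). Not the real Miner schema.
Row = Tuple[int, str, float]


def iter_order_batches(
    conn: sqlite3.Connection, batch_size: int = 1000
) -> Iterator[List[Row]]:
    """Yield rows from the hypothetical `orders` table in bounded batches.

    Only `batch_size` rows are held in memory at a time, instead of
    loading the whole result set with fetchall().
    """
    cursor = conn.execute("SELECT id, market, amount FROM orders ORDER BY id")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


def serialize(row: Row) -> dict:
    """Reformat one row into a dict for downstream components (illustrative)."""
    order_id, market, amount = row
    return {"id": order_id, "market": market, "amount": amount}


def make_demo_db(n: int) -> sqlite3.Connection:
    """Build an in-memory database with n demo orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, market TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (id, market, amount) VALUES (?, ?, ?)",
        ((i, "ETH-USDT", 1.0) for i in range(n)),
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_demo_db(2500)
    for batch in iter_order_batches(conn, batch_size=1000):
        records = [serialize(row) for row in batch]
        print(len(records))
```

With this pattern, peak memory depends on the batch size rather than on the total number of orders, at the cost of more round trips to the database.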
What steps are we taking to improve?
To prevent this issue from recurring, we have developed a process to restore the Rewards Engine without creating a data hole. In addition, we have increased the memory available to the Serializer and other components, and we will monitor them more closely going forward.