Over the last 3 weeks, you may have noticed some instability with our 
Rankings tools through missing data and error messages stating some 
tools are unavailable. On Friday, we experienced a totally different, 
unrelated problem with our rankings data. We expect to have an updated 
prognosis for that problem by tomorrow, but we want to fill you in on 
what went down at Mozplex to cause these issues in the first place. To 
be as transparent as possible about what happened and how we're working 
to fix the issue, below is a summary of what was impacted, the work we 
did to get things going again, and what we’re doing in the future to 
make the system more resilient.
Database issues? What gives?
 Our SERP data subsystem (which runs on the distributed storage technology Riak) had a couple of nodes fail. To learn more about Riak, here's a blog post we wrote when we made the switch last year. The subsystem is designed to handle such failures; however, we did not handle the failure correctly. 
In the process of fixing our Riak storage, we disrupted some of our queues for SERP data processing. Given Moz's growth over the last six months and the number of SERPs processed in the Riak cluster, Roger can no longer recover from outages in a timely manner. In late 2011, we could recover the system in 3-8 hours and be caught up on data processing in a few days. This time around, it took us six days to get the system back up and another two weeks to catch up on the missing data and the inconsistent data states that resulted.
Impacted services
 Riak stores our SERP data (rankings data), so all the systems that depend on it were impacted. The impacted systems include:
- Custom reports
- On-page reports
- Historical rankings CSVs
- Rankings
- Keyword Difficulty & Full SERP Analysis reports
Work completed to get things going again
Our dev teams have been hard at work restoring all missing and inconsistent data after the Riak malfunction. At a high level, here's what we did to get Rankings and all its dependencies going again:
- Created scripts to heal the different broken states of jobs
- Added more nodes to speed up processing and help in future failures
- Improved monitoring to get information about failures and performance bottlenecks
- Improved performance in multiple areas
Future work
It took the team 20 days to fully recover from the cascading problems that resulted from the original issue. We know that this timeframe is unacceptable, and we apologize for not being able to recover more quickly. We are now working to ensure that the same failures do not occur in the future and to lessen downtime if something like this does happen again. Work has begun on multiple improvements to help us reach these goals, including:
- Improving health checks and threshold monitoring of Riak nodes and subsystem dependencies
- Adding more Riak nodes
- Beefing up queue and job execution monitoring and alarming
- Creating a dependency matrix that indicates what’s impacted when something goes down
- Improving fault tolerance in parts of the system
- Providing additional excess service capacity
- Creating system operations documentation for dealing with emergency scenarios and how to recover
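To make the dependency matrix idea a little more concrete, here is a minimal Python sketch of how such a lookup might work. It is an illustration only, not our actual tooling: the component names and the edges between them are assumptions loosely based on the impacted services listed above, and `impacted_by` is a hypothetical helper.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the listed components.
# The names and edges are illustrative assumptions, not the real system layout.
DEPENDS_ON = {
    "serp_queues": ["riak"],
    "rankings": ["riak", "serp_queues"],
    "custom_reports": ["rankings"],
    "on_page_reports": ["riak"],
    "historical_rankings_csv": ["rankings"],
    "keyword_difficulty": ["riak"],
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over dependents to collect the full blast radius.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return sorted(seen)


if __name__ == "__main__":
    print(impacted_by("riak"))
    print(impacted_by("rankings"))
```

With a map like this, an on-call engineer could ask what is affected by a failing component before customers start noticing, and the same data could drive status-page updates.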
So, what's the current ETA?
Unfortunately, as you can probably tell, we have a lot of work to do to get Rankings back to 100%. We don't have an ETA quite yet. However, we hope to have a solid date in place by tomorrow and will update this post as soon as we know. Again, we apologize for the failure and any issues it has caused. We are working our butts off to ensure it doesn’t happen again!
 
 
 





 