Postmortem - North American Data Storage Upgrade

Dear customers and partners,

Over the last few weeks, our system reached very high loads, which resulted in various performance issues. I would like to share what we now understand about the situation, what we have done to address the root causes, and what changes we are making to prevent similar issues from occurring in the future.

What Happened?

Transaction volume growth exposed limitations in two important storage services within our platform: one that is essential to most of the features in our system, and one associated with Campaign Automation and Image Manager in particular. During peak hours, the volume of web activity, email events, and campaign automation participants processed through the system caused us to hit the I/O limits of some of our Azure storage subscriptions, which in turn caused excess requests to be delayed or to fail. The core service functioned comfortably at I/O volumes of up to 20 million requests per minute; in recent weeks, we saw sustained periods of as many as 44 million requests per minute.
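For the technically inclined, the sketch below shows one way per-minute storage request rates like these can be watched using Azure Monitor metrics. It is purely illustrative rather than a picture of our production tooling: the resource ID is a placeholder, and the 20-million-per-minute figure is simply the comfort level described above, not a hard Azure limit.

```python
# Illustrative sketch only: query the "Transactions" metric of an Azure storage
# account at one-minute granularity and flag sustained rates above a threshold.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder resource ID for a storage account; not a real subscription.
STORAGE_RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.Storage/storageAccounts/<account>"
)

# The level at which the core service ran comfortably (from the text above).
COMFORT_LIMIT_PER_MINUTE = 20_000_000

client = MetricsQueryClient(DefaultAzureCredential())
result = client.query_resource(
    STORAGE_RESOURCE_ID,
    metric_names=["Transactions"],
    timespan=timedelta(hours=1),          # look at the last hour
    granularity=timedelta(minutes=1),     # one data point per minute
    aggregations=[MetricAggregationType.TOTAL],
)

# Walk the returned time series and report any minute over the comfort level.
for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            if point.total and point.total > COMFORT_LIMIT_PER_MINUTE:
                print(f"{point.timestamp}: {point.total:,.0f} requests/min exceeds comfort level")
```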

What Have You Done to Address the Problem?

To reduce the I/O burden, we have redesigned our primary data service by separating our Azure storage into different subscriptions associated with logical areas of the system, such as web tracking and email events. Because different features consume resources at different rates, this separation gives us a better ability to monitor system performance and to tune our infrastructure. Over the last three weeks, we have performed a series of carefully orchestrated releases, the last of which was completed in our North American Azure locations on July 20. We have been monitoring the service over the past 96 hours and have seen I/O requests across each of the separated services normalize at about 2 million requests per minute.
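To make the separation concrete, here is a small illustrative sketch of the idea: each logical area gets its own storage connection, so a spike in one feature's load cannot throttle another. The area names mirror the examples above; the environment variable names and helper function are hypothetical and not our production code.

```python
# Illustrative sketch only: one storage account (in its own subscription) per
# logical feature area, so I/O limits are isolated per area.
import os

from azure.storage.blob import BlobServiceClient  # azure-storage-blob package

# Hypothetical configuration: connection strings for per-area storage accounts.
AREA_CONNECTION_STRINGS = {
    "web_tracking":        os.environ["WEB_TRACKING_STORAGE_CONN"],
    "email_events":        os.environ["EMAIL_EVENTS_STORAGE_CONN"],
    "campaign_automation": os.environ["CAMPAIGN_AUTOMATION_STORAGE_CONN"],
}

# Build one client per area so different features never share a single
# storage subscription's throughput limits.
_clients = {
    area: BlobServiceClient.from_connection_string(conn)
    for area, conn in AREA_CONNECTION_STRINGS.items()
}

def storage_client_for(area: str) -> BlobServiceClient:
    """Return the storage client dedicated to the given feature area."""
    return _clients[area]

# Example: web tracking writes land only in the web-tracking account, so a
# surge there cannot delay email event processing, and each area's request
# rate can be monitored and tuned independently.
web_tracking_storage = storage_client_for("web_tracking")
```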

We have also optimized the data service associated with Campaign Automation and Image Manager. We deployed that refactored service on June 18 and have seen improvements in that area.

We will continue to monitor performance over the next few days and plan to deploy the upgrades to both data services in our European and AsiaPac Azure data centers in a series of deployments that should be completed by August 6.

What Are You Doing to Make Sure This Doesn’t Happen Again?

As you can imagine, this situation has humbled us and led to a great deal of introspection. In simple terms, we have outgrown some of the processes and technology that have supported us through six years of extraordinary growth. In addition to the software and infrastructure changes outlined above, we are making some changes to our organization in order to better support and serve you.

  • In April, we initiated a project to re-engineer our Service Delivery and Support operations. We are accelerating this work and have added more internal and external resources dedicated to this transformation effort.
  • We have hired a new Director of Operations who will start September 1. We are also creating a dedicated Architecture team that will focus entirely on performance and capability improvements.
  • Finally, we are conducting a retrospective to make sure we learn as much as we can so that we can be better in the future.

During this situation, we recognized that our communication with customers and partners could have been better, and we are actively working to improve that area of our business as well. I welcome any feedback or advice you might have for us; you can reach me at MikeD@ClickDimensions.com.

We truly appreciate the opportunity to serve you. The entire ClickDimensions team is fully committed to providing you with the platform and service that ensure your marketing success.

We look forward to continuing to work with you and becoming an even better partner in the future.

Sincerely,

Mike Dickerson
CEO, ClickDimensions
