In a message dated 9/27/03 4:02:56 PM Central Daylight Time, firstname.lastname@example.org writes:
Subj: WestHost 2.0 Apology / Disk Space Increases email@example.com
Date: 9/27/03 4:02:56 PM Central Daylight Time
Sent from the Internet
Dear WestHost Clients,
We want to thank each of our clients personally for the enormous amount of patience during this very difficult time for WestHost. We realize that for some, this has been merely an inconvenience, yet for others, it has seemed like a disaster.
As we continue to work through remaining client support requests and return to normal operations, we felt it was necessary to offer this letter as a formal explanation of the problems that many have experienced as a result of our transition to WestHost 2.0. We also wanted to show our appreciation for your continued support. This weekend, we will be doubling the disk space quotas for all current WestHost sites and again want to say thank you for choosing WestHost!
But first, what went wrong?
We believe there are two major factors that caused the problems in our first phase of site upgrades:
1. Infrastructure problems resulting in major instability.
2. Bugs in the upgrade process.
Although many have attributed these breakdowns to insufficient testing, it is not as if the platform was launched untested. Indeed, the new platform underwent several months of rigorous testing prior to launch. We believe the breakdown occurred because no amount of preparation could have produced a test environment that resembled the real world. Current technologies in load simulation, while close, are simply incapable of generating the number of metrics required to simulate the real world with this volume. To the degree we were capable, extensive testing was performed and the green lights were given that everything was stable, even though it was not.
The second major breakdown occurred during the upgrade preparation. Several bugs made it through our testing and quality assurance procedures because they only presented themselves when large numbers of sites were being upgraded, or in a very small percentage of sites.
The decision was made at the end of the testing period to proceed with the launch as all metrics were well within the parameters established as “exceptional”. Our technology team had invested literally thousands of man-hours into the project and felt strongly that we were in a good position to launch. We proceeded by launching the transition with our smallest server in hopes that if there were problems with the procedures or the infrastructure, they would be noticed right away and impact the smallest possible number of clients. Immediately following the transition, each site was tested and verified using an automated testing tool.
A few days later, the transition of the first server was deemed a success. The number of support requests generated as a result was minimal (most were related to confusion about how to use the new platform, which taught us valuable lessons in how to present the help documents, and corrections were made). The few that were related to problems with clients' websites were investigated thoroughly, and changes to the transition procedures were made and tested to ensure we didn't hit the same problems going forward. At this point, we felt safe in moving 7 additional servers.
The upgrade of the next 7 servers, despite rigorous testing and careful transitional plans, proved that there were several bugs in our upgrade procedures that needed correction. We also realized that many factors we couldn't control (such as having to change passwords) were going to cause more overhead than we anticipated. Immediately, additional resources were brought in and trained as fast as possible. Further plans were formulated to help deal with the volume of support requests we were receiving and special assignments were made to make sure we could quickly identify “patterns” in requests that would indicate bugs.
At this point we found it necessary to discontinue our phone and live chat support in an effort to improve efficiency and productivity in informing clients of changes to their accounts and resolving problems directly related to the 2.0 upgrade (both have since been reactivated). Rollback plans were discussed, but due to the nature of hosting they are not practical, as they can result in corrupt or missing data.
Over the next several days our staff members worked around the clock, and had renewed confidence in the upgrade procedures as new bugs were identified and corrected. The decision was made a few days later to test the new procedures on a very small number of servers before proceeding any further. A small group of servers was tested, things looked good, small corrections were made, and we proceeded with the next large block.
It was at this point that major instability issues became apparent in our hosting platform. We had approximately 9,000 hosting accounts on the new system and servers were crashing. Combine this with overall slowness and we had a full-blown disaster on our hands.
The decision was made immediately to halt the transition of any new servers until:
1. All bugs were removed from the transition procedures
2. The platform could be stabilized (Linux, touted to be one of the most stable operating systems on the planet, was behaving exactly the opposite).
3. Outstanding client support issues could be resolved
Multiple engineers were brought in to assist with the instability problem. Driver engineers from Intel, Linux kernel developers, RedHat engineers, and various other industry experts were employed around the clock to resolve the problems relating to the kernel crashes we were experiencing. At this point, we were well aware of the consequences we faced with every minute that passed without a resolution. Finally, the stability problem was narrowed down to a bug in the entire 2.4.20 series of Linux kernels (even some that had been out for almost a year, and including the 2.4.9 series from RedHat Advanced Server).
The major cause of the performance issue was resolved a few days later when it was learned that the Linux NFS v3 code had a seemingly unknown bug, which caused very poor NFS performance under heavy load. Once this was discovered, we reverted to NFS v2 and the majority of the performance issues were resolved.
We reverted to an older version and found immediate stability. We are currently working with RedHat and the Linux kernel development team to resolve these issues going forward.
At this point, resources previously dedicated to stabilizing the platform could be dedicated to resolving client issues (most of which were resolved once the speed and stability issues were under control). All efforts were, and are, concentrated on working around the clock to answer each and every support request we receive. In the last few weeks we have received five months' worth of support requests. Nonetheless, we are gaining control and will soon have resolutions for everyone. Toll-free phone support is once again active, as is live chat.
The remaining accounts to be migrated can expect a far better experience than our “first batch”. We have learned valuable lessons in communicating with our clients, the infrastructure is solid and stable, and our tech support department has been exposed to solving complex problems on a relatively new platform. We plan to upgrade each server on a one-by-one basis rather than in groups, and we now have a firm understanding of the human resources required to manage an upgrade of this size.
Please understand that our decision to upgrade to WestHost 2.0 has always been driven by our dedication to offering clients the best solution possible. This transition has involved new hardware, new software, new infrastructure, and innovative concepts in hosting. Nowhere in the industry will you find the same level of service and features at the prices we have set. For current WestHost clients, the overall benefits are huge. For example, today we are one of the only hosting companies to offer true high-availability hosting, meaning we can withstand a full hardware failure on one of our servers without resorting to tape backups.
There are many things we could have done differently with this upgrade. We've made mistakes along the way and sincerely apologize to those who have been affected. We genuinely value each of our clients and want you to be happy with your decision to stay with WestHost. We are making every effort to return to our normal high standards of service and support, and are committed to putting in place the resources, tools and services to ensure the same level of reliability you have come to expect from WestHost. Please accept our apologies and enjoy the increased disk space we will be adding to your account.
Chief Technology Officer
When you expect more from your web host