Gradwell Mail Infrastructure Rebuild
We want to update customers on the progress of our email systems migration.
As customers will be aware, we are building a completely new email infrastructure and migrating customers' mailboxes from a number of different systems onto a single, consistent platform.
The challenge has been to build a cost-effective email platform that offers a high level of uptime and protection against physical server and disk failure for our low-cost email hosting products. We have built a system using multiple storage arrays on top of a VMware Server cluster, and have nearly completed the migration process.
Our email system consists of front end servers, which handle connections from customers over IMAP and POP3 and the sending of email over SMTP, and backend servers, which store customers' email files.
To recap:
- We announced the plan in December 2007, when we outlined the proposed developments for the first half of the year.
- In February 2008, we completed the deployment of 15 new servers handling POP3, IMAP and SMTP.
- One of the hurdles in deploying the new IMAP platform was that customer mailboxes needed to be restructured. We restructured all 43,000 mailboxes we host, finishing on 9th May 2008. Customers were largely unaffected by this process, although unfortunately a small percentage had to reconfigure their email programs.
This completed the majority of the “front end” work. A handful of customers are still logging into the old mail platform, and we are contacting them individually to correct their settings.
The second phase of the project has been to migrate the backend of the system (the customers' mailbox storage) to a more reliable storage platform.
- Previously, we put customers' mailboxes on individual servers. This meant that it would take a long time to recover from a server crash, and we were reliant on a single disk array.
- We therefore adopted a high availability file system from Red Hat (GFS), which provided excellent resilience, but at the expense of performance, and it was still reliant on a single disk array.
- We have therefore migrated all customers' mailboxes to a set of nine file server instances on top of our VMware platform. This gives us quick recovery (under 5 minutes) if the operating system crashes and, whilst the files are still stored on a single disk array, we can move them from one disk array to another with no downtime.
During this process, the migration of customers' mailboxes has been hampered by:
- Slow performance and unreliability on old servers.
- Capacity issues on our new VMware cluster, caused by our accelerating the migration process and by licensing delays at VMware. These led some of the new file servers to crash, particularly early in the morning, when the load has been highest.
- The need to migrate over a terabyte of storage in fairly short time windows.
During this migration period in May, customers have experienced short periods of being unable to access their email whilst we adjusted settings and moved mailboxes. We apologise for this.
As a third phase of the project, we have designed a redundant backend server, which allows us to store customer mail files on two independent disk arrays (potentially in two locations) with failover in under one minute. We plan to test this in late May and, if successful, retrofit it to our new storage setup in early June.
At the end of May we had the following short-term work to complete; as of 2nd June, this work is done:
- Add servers and RAM to our VMware cluster, to increase available computing power on the new mail cluster. Done.
- Migrate the remaining 200 mailboxes from our Red Hat GFS cluster. Done.
- Ensure all customers are using the new platform, and correct any old customer configurations. Done.
- Fix an issue in the Linux kernel, related to disk I/O timeouts under heavy load, which caused the file servers to crash. Done.
Finally, to take advantage of the new system, customers should ensure they check their email from pop3.gradwell.net and use relay.gradwell.net as their outbound smart host.
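For customers who manage several mail clients or scripts, the check can be automated. The following is only a rough sketch in Python: the two hostnames come from this update, but the port numbers (110 for POP3, 25 for SMTP) are assumed defaults rather than confirmed settings, and the `MailSettings`, `needs_update` and `probe` names are made up for illustration.

```python
import poplib
import smtplib
from dataclasses import dataclass


@dataclass(frozen=True)
class MailSettings:
    incoming_host: str
    incoming_port: int
    outgoing_host: str
    outgoing_port: int


# Hostnames are taken from this update; the ports are ASSUMED defaults.
NEW_PLATFORM = MailSettings("pop3.gradwell.net", 110, "relay.gradwell.net", 25)


def needs_update(current: MailSettings, target: MailSettings = NEW_PLATFORM) -> bool:
    """Return True if either hostname differs from the new platform's."""
    return (
        current.incoming_host.lower() != target.incoming_host
        or current.outgoing_host.lower() != target.outgoing_host
    )


def probe(settings: MailSettings, timeout: float = 10.0) -> None:
    """Open and close a POP3 and an SMTP connection; raises on failure.

    This only checks that the servers answer. It does not log in.
    """
    pop = poplib.POP3(settings.incoming_host, settings.incoming_port, timeout=timeout)
    pop.quit()
    smtp = smtplib.SMTP(settings.outgoing_host, settings.outgoing_port, timeout=timeout)
    smtp.quit()
```

For example, `needs_update(MailSettings("mail.example.com", 110, "smtp.example.com", 25))` reports that the settings still point somewhere other than the new platform, while `probe(NEW_PLATFORM)` attempts a real connection and so needs network access.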
Outstanding Issues
Our VMware-based solution is performing well, giving good speed, and has resolved many of the issues customers were experiencing. However, a number of items still need to be addressed in the medium term.
- Implement the high availability solution for our storage servers, which we hope to complete by early July. We have a proof of concept in operation and are beta testing it (as of 14th June).
- Implement a multi-site VMware cluster, giving us multi-site redundancy, which will be complete for autumn 2008. We have ordered the redundant networking equipment required to do this (as of 14th June) and expect that equipment to be in place by mid-August 2008.
