Gradwell Blog

Monthly Archive December, 2007

Server Plans for 2008

Summary

The process of becoming one of the leading (and largest) Business VoIP Providers in 2007 has taught us volumes about how to engineer our services. In addition, our customer base and server loads have grown and we need a new platform for delivering email and web services, which needs to fix the design flaws now apparent in our current (2004) platform.

Using server virtualisation, and greater amounts of replicated storage, on top of a new core network, we believe we have developed a blueprint for a 2008 enterprise-class infrastructure. We have nearly completed prototyping this and are looking forward to expanding its roll out.

Introduction
Gradwell’s customer base has continued to expand and we have spent a lot of time in 2007 thinking about and exploring how to build a new platform for email and web hosting delivery.

This blog post sets out our current thinking on our infrastructure plans for 2008, and I would welcome feedback, either using the comment box below, or via email to peter@gradwell.com.

Our web and email hosting infrastructure was largely built in 2003/2004 and whilst much of the physical equipment has been maintained, replaced, scaled and upgraded, it uses design concepts from the earlier years which are now beginning to appear dated.

In addition, usage of internet hosting has grown so that today many of our customers run significant businesses via their internet presence. Equally, the pressures on the London data centres have grown, so that they are operating at higher loads and, in some instances, higher temperatures.

The largest change in mentality has been in the handling of minor faults. Our VoIP experiences have taught us that customers can tolerate something being off/down - what they dislike is the uncertainty and instability of not knowing whether it is on or off. Recurring intermittent faults are worse than big outages, and so we have to close the gap on 100% uptime, whilst remaining a relatively low cost provider of services.

Looking back at 2007, we have also begun to appreciate that we need to build an infrastructure that is resilient to supplier failures and, importantly, to intermittent supplier faults. We would not previously have considered the need to cope with datacentre failure - but datacentres do have operational problems that are not big enough to be a disaster, yet still cause unnecessary disruption.

Finally, it may seem that the answer to these problems is simply to use premium equipment and facilities. However, this is not the case. We already use two of the best London datacentres (Telehouse and Telecity Sovereign House), and whilst we do use some premium grade servers, our business model does not support buying IBM for everything. Some of our servers also date from three years ago, when many of the redundancy features we have today in our HP ProLiant servers simply did not exist.

The Challenge

So, things have moved on greatly and therefore we set about in late summer trying to find a workable blueprint for our next iteration of server facilities. The challenge is to guard against the following:

  • Hard Disk failure, and the performance impact caused by the subsequent rebuild
  • Air conditioning failure - causing equipment to power down
  • Software Failure of an individual system that many things are dependent upon
  • Physical Server failure, e.g. RAM, CPU, Power Supply

In addition, we need to:

  • Maintain a large number of our existing server software configurations and operating systems (because customers have software written to support those), whilst replacing the physical hardware.
  • Reduce our idle power consumption from servers, because it is expensive and wasteful.
  • Balance our server load more evenly. At present, we have some very busy servers, and some very idle ones.
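To illustrate the last goal in the list above, the sketch below shows one common way of evening out load: place the largest workloads first, always onto the host that is currently least loaded. This is a generic greedy heuristic, not a description of the scheduler we use. The workload names and load figures are hypothetical, chosen only to make the example runnable.

```python
import heapq

def balance(workloads, host_count):
    """Greedy placement: biggest workload first, onto the least-loaded host."""
    # Min-heap of (current_load, host_index); the least-loaded host pops first.
    heap = [(0, host) for host in range(host_count)]
    heapq.heapify(heap)
    assignment = {host: [] for host in range(host_count)}
    for name, load in sorted(workloads.items(), key=lambda kv: kv[1], reverse=True):
        current, host = heapq.heappop(heap)
        assignment[host].append(name)
        heapq.heappush(heap, (current + load, host))
    loads = {host: sum(workloads[n] for n in names) for host, names in assignment.items()}
    return assignment, loads

# Hypothetical workloads, expressed as a percentage of one host's capacity.
workloads = {"web1": 90, "mail1": 70, "db1": 60, "imap1": 50, "list1": 20, "dns1": 10}
assignment, loads = balance(workloads, host_count=3)
print(loads)
```

The heuristic does not guarantee a perfect split, but it avoids the "very busy next to very idle" pattern without needing any knowledge of the workloads beyond an estimated load figure.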

Physical Improvements

We have made a number of improvements to our systems and network in 2007, to lay the groundwork for this project. This predominantly covered the deployment of physical equipment:

  • New switched network topology and the first phase of some core switch upgrades on our Telehouse LAN
  • Deployment of a new Juniper Networks router platform, using 5 Juniper J4350 routers.
  • 24 low voltage, quad core 1.86 GHz HP DL380 servers, with local SAS based storage, split across TH and Sov.
  • Additional disk storage, using 3 Infortrend iSCSI SAN arrays, with 16×250, 16×500 and 16×750 GB SATA disks, again split between TH and Sov.
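As a quick check on what the three arrays in the list above add up to, the following sketch computes the raw capacity from the disk counts and sizes given. It assumes one array per disk size (which is how we read the list) and the labels `array_250` and so on are just names for the example. Usable capacity will be lower once RAID parity, hot spares and any mirroring between sites are taken into account, and those are not modelled here.

```python
# (number of disks, disk size in GB) for each array, as listed above.
arrays = {
    "array_250": (16, 250),
    "array_500": (16, 500),
    "array_750": (16, 750),
}

# Raw capacity per array, before any RAID or replication overhead.
raw_gb = {name: disks * size_gb for name, (disks, size_gb) in arrays.items()}
total_raw_gb = sum(raw_gb.values())

print(raw_gb)        # {'array_250': 4000, 'array_500': 8000, 'array_750': 12000}
print(total_raw_gb)  # 24000
```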

Physically, in early 2008, we have also planned:

  • Migrate our Telehouse switch core to a Cisco switch platform (from HP) in a fully redundant configuration, and make the Sovereign House switch fabric more redundant.
  • Increase the RAM on our HP DL380 servers.
  • Add a second London inter-datacentre fibre link.

Software and Management Improvements
Adding equipment solves some issues, but creates others. For example, our current web server environment (FreeBSD, Feb 2004) won’t run on the HP servers, and we could not migrate customer websites in one go to a new environment.

We also have many new services that we wish to deploy, and we plan to do those in a redundant fashion, potentially leaving a myriad of server infrastructure in place, but unused until we ramp up the customer base.

Therefore, we have been experimenting with a number of techniques to resolve this problem, mainly using virtualisation, and have built an initial environment using VMware in which we have been testing and evaluating. Results so far have been very positive.

Moving to a virtualised environment brings us the following benefits:

  • We will separate software systems from hardware and will be able to migrate “servers” from one physical environment to another without downtime.
  • We can migrate legacy operating systems onto new hardware without significant reconfiguration.
  • We can deploy many more software instances and operating system environments into our virtualised environment than we could with physical servers, thus allowing us to “double up” on all services and processes.
  • We can further delegate the operation of our services to software automation, allowing for even faster response to problems.
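The “double up” idea only protects against a datacentre problem if the copies of a service do not all sit in the same building. The sketch below is one simple way to express that rule: alternate each service's instances between the two sites and then check that losing either site still leaves at least one instance running. The function names are made up for the example, the instance counts are taken from the platform summary later in this post, and none of this reflects how VMware itself places virtual machines.

```python
from itertools import cycle

SITES = ["TH", "Sov"]  # Telehouse and Sovereign House

def place_replicas(services):
    """Alternate each service's instances between the two sites."""
    placement = {}
    for name, count in services.items():
        sites = cycle(SITES)
        placement[name] = [next(sites) for _ in range(count)]
    return placement

def survives_site_loss(placement, lost_site):
    """True if every service keeps at least one instance outside lost_site."""
    return all(
        any(site != lost_site for site in sites)
        for sites in placement.values()
    )

services = {"zimbra": 5, "jabber": 2, "imap": 4}
placement = place_replicas(services)
print(placement["zimbra"])
print(survives_site_loss(placement, "TH"), survives_site_loss(placement, "Sov"))
```

A service with a single instance fails this check by definition, which is the argument for running at least two of everything.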

Improving Storage

One of our remaining key challenges is how to improve our storage, so that our new infrastructure platform does not collapse when a disk fails. Firstly, future developments in VMware due in 2008 will allow us to move server file systems from one machine to another.

With regard to customer file storage, we have spent time evaluating both the expensive options (http://www.equallogic.com/) and the less expensive ones (happily there is no cheap solution!), for example http://www.datacore.com/products/prod_SANmelody.asp.

Our key requirement for storage is high performance arrays which can suffer multiple disk failures and rebuilds without impacting live application performance, and on which we can cost effectively mirror and replicate the data across two datacentres (protecting against aircon and power failure).

Having deployed additional new storage, we are now progressing with our evaluation of the best option for management and backup.

In addition, we have also been working on the mechanism for distributing files from the hard disks to the web servers. Currently we use NFS,

Our Server Platform
Readers may be interested in the following brief summary of our anticipated server platform:

  • Front end web hosting servers x8
  • Caching DNS Servers x2 sets of 3
  • Primary DNS Servers x2 sets of 3
  • Call Routing DNS Servers x4
  • Load Balanced servers for Email x4
  • IMAP (Online Email Folder) servers x4
  • POP3 servers x4
  • Email virus scanners (ClamAV) x10
  • Email spam filters (SpamAssassin) x10
  • Email forwarding servers (gwh) x10
  • Email outbound relay (exim) x4
  • Email inbound relay (gwh) x4
  • Email Quarantine database x2
  • Email Logging x2
  • Email List Distribution Servers x2
  • Jabber Instant Messaging servers x2
  • MySQL database server x3 plus MySQL Cluster x4
  • Zimbra Collaboration and Email servers x5
  • Usenet servers x3

VoIP Servers

  • In/outbound Asterisk call processing servers (for iax + newsip) x10
  • SIP Registration x2
  • SIP Call routing x3
  • MySQL database cluster x4 for newsip to back off on
  • Prepay permissions servers x3
  • Voicemail servers x3
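For a sense of scale, the sketch below totals the instance counts from the two lists above. The dictionary keys are shorthand labels for the list entries. The final figure, instances per physical server, assumes every instance runs as a virtual machine on the 24 HP DL380 servers mentioned earlier. That is an assumption made for illustration only; the lists do not say which services will be virtualised or where they will run.

```python
# Instance counts copied from the hosting list above.
hosting = {
    "web_frontend": 8, "caching_dns": 2 * 3, "primary_dns": 2 * 3,
    "call_routing_dns": 4, "email_load_balanced": 4, "imap": 4, "pop3": 4,
    "virus_scanners": 10, "spam_filters": 10, "email_forwarding": 10,
    "outbound_relay": 4, "inbound_relay": 4, "quarantine_db": 2,
    "email_logging": 2, "list_distribution": 2, "jabber": 2,
    "mysql": 3 + 4, "zimbra": 5, "usenet": 3,
}

# Instance counts copied from the VoIP list above.
voip = {
    "asterisk_call_processing": 10, "sip_registration": 2,
    "sip_call_routing": 3, "mysql_cluster": 4,
    "prepay_permissions": 3, "voicemail": 3,
}

hosting_total = sum(hosting.values())  # 97
voip_total = sum(voip.values())        # 25
total = hosting_total + voip_total     # 122

PHYSICAL_HOSTS = 24  # assumption: all instances share the 24 DL380s
print(total, round(total / PHYSICAL_HOSTS, 1))  # prints "122 5.1"
```

On that assumption, roughly five instances per physical server would be needed, which is the kind of density that is impractical with one operating system per machine.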

Datacentre Issues
Over 90% of the UK Internet “population” is within London and we will always need to have a well connected presence there. However, it is increasingly apparent that we can operate more of our services from servers physically located outside London, and indeed, our most recent telecoms interconnects have been at a point of presence in Leeds.

With regard to web and email hosting, we have identified Edinburgh as a suitable location and have completed business planning and agreed funding for expansion into Edinburgh in the first half of 2008. This is dependent on our having successfully completed the migration to our VMware platform in London.

How does this help us deploy new services?

We have a number of plans for new hosted services, including improved email with mobile integration (Zimbra) and online secure messaging using Instant Messaging (Jabber). We also want to expand our web hosting again, to support new languages (e.g. new versions of Ruby) and more hosted web applications.

By being able to deploy new services in a fully redundant configuration, without having to expend significant monies on physical resources (i.e. VMware will let us set up 5 Zimbra servers, but only provide the physical memory when needed), we can build new services more quickly and to a higher standard.

Conclusions
As we have developed, in 2007, into one of the UK’s leading VoIP platform operators, we have identified a number of areas in which we could improve our email and web hosting product platforms. In doing so, we also wanted to solve a number of the problems that are faced by both ourselves and our peers in the industry, and end up with a blueprint that was suitable for our continued and rapid future expansion.

Whilst a number of questions remain unanswered and on the agenda for January 2008, we hope that this review of the work done in 2007 has assisted customers in their understanding of our plans for supporting their future growth.

Service Update for December 2007

We would like to take the opportunity to update our customers on service delivery for the last couple of weeks in December.

In general, the level of service we provided was high, but there are a few ongoing developments to update on.

Call Drops
Further to the network OSPF flap issue, we also replaced an Ethernet switch that connects our main UK PSTN gateway to the internet. This was showing some errors, and its replacement has dramatically reduced the number of call drops reported.

We do still have some customers experiencing call drops, and are working to identify whether these are related to their broadband link or to our Asterisk platform, by completing some extensive stress testing of various versions of Asterisk, our call handling software. The initial prognosis from lab testing is that Asterisk is not dropping calls under normal circumstances, and that on test calls with snom phones, the Asterisk to phone segment of the call tolerates network conditions of up to 50% packet loss.

We have done a significant amount of work on understanding call drops, and this will form the subject of a separate blog post.

Web Hosting
We suffered a power supply failure in one of our web hosting file servers on 14th December. This was very unfortunate and caused a big outage for web hosting customers for the first hour of the 14th. However, we were able to quickly switch to a backup file server and then have an engineer replace the power supply, getting the whole service back online within 2.5 hours from the initial failure.
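To put that outage in context, the sketch below converts it into a monthly availability figure and compares it with the downtime that some common availability targets would allow in a 31 day month. It assumes the full 2.5 hours counts as downtime and that there were no other web hosting outages in December. The targets are illustrative figures, not service commitments.

```python
HOURS_IN_DECEMBER = 31 * 24  # 744 hours

outage_hours = 2.5  # upper bound: service was restored within 2.5 hours
availability = 1 - outage_hours / HOURS_IN_DECEMBER

def allowed_downtime_hours(target, period_hours=HOURS_IN_DECEMBER):
    """Downtime budget, in hours, for a given availability target."""
    return (1 - target) * period_hours

print(f"December web hosting availability: {availability:.3%}")
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_hours(target) * 60
    print(f"{target:.2%} allows about {minutes:.0f} minutes of downtime")
```

On these assumptions the month works out at a little under 99.7%, so one hardware fault of this kind uses more than three times the monthly downtime budget of a 99.9% target.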

We will be setting out our 2008 server strategy in a separate blog post, which will address a number of the questions that this issue raised for customers.

Mail Delivery and Mailing Lists
Our mail system continued to operate normally and we did extra work to improve spam filtering (Spamhaus ZEN list). We have also finished migrating customers' Mailman services to our new Mailman server, which has improved the service.

Customer Support
Response times from customer support have been good, with a number of long standing customer and 2nd line issues being resolved. Customer Support in the run up to the Christmas period was reasonably quiet.

Platform Development
We have been working on the deployment of our VMware platform, as well as migrating files to our GFS based SAN. We also completed development of our high quality broadband product and provisioned a number of test lines. Finally, we have been working

Conclusion
We have been able to continue resolving outstanding issues and increasing the resilience of our services.

Broadband Support Engineer (£15k)

Are you fascinated by new technology? Do you like playing with electronic devices to find out how they work? Can you solve problems? Do you like helping people use computers? Can you explain how broadband works?

If you can, then you should think about working for Gradwell.

Gradwell dot com Limited is a rapidly growing Internet Services Provider who focuses on VoIP (www.gradwell.com/voip/), email and web service solutions. We are a pioneer in the exciting new world of Internet telephony and this section of our business is growing rapidly. We are developing a new broadband internet product and require a technical support engineer to assist our customers with this service.


Industrial Placement 2008/9 - Computing Students

Gradwell dot com Limited is a rapidly growing Internet Services Provider who focuses on VoIP (www.gradwell.com/voip/), email and web service solutions. We are a pioneer in the exciting new world of Internet telephony and this section of our business is growing rapidly.
