Capacity Planning: Know (Measure) Your Environment


As I wrote in my previous post, capacity planning for an IT business is a real challenge. There is no reason to make it harder than it already is.
For this reason, you have to keep in mind that capacity planning isn’t some kind of esoteric guesswork, but simply a disciplined engineering procedure.
Yes, I know. Looking like a kind of Computer Guru making predictions with a four-CPU crystal ball can be really helpful when dating ladies. The scientific approach to analyzing reality doesn’t have the same appeal. Unfortunately, our first duty is to serve the IT business, so the work has to follow a well-disciplined procedure.

The first step when you are asked to plan the capacity of an IT system is to establish how it is actually performing before any improvement action is taken. Are your customers satisfied, or are they complaining because a transaction takes a long time and/or freezes in the middle, losing all the data already entered?
It sounds banal: you cannot say how far you have traveled if you don’t know your starting point. It doesn’t really require explanation. Nevertheless, in my experience, I have several times seen great capacity planning task forces start their intervention without deeply analyzing the current configuration and, at the end, be unable to verify the actual improvement of the system after their activity.
If you are a really experienced engineer, a good capacity planning activity can improve the performance of your IT system by 20-30%. It is really difficult to achieve a better result. On the other hand, I have seen performance improve by 200%-300% simply by rewriting an unfortunate DB query or reorganizing transactions to avoid too many round trips (RTTs) over the WAN (remember, you can’t reduce latency over a network if you keep the number of hops constant. Latency can become a thorn in your side). This means that the first step is to meet, gently, with developers and system/network administrators to verify whether some bottleneck is hidden in the code (usually in the DB queries), whether a single transaction opens several hundred TCP connections, each with its own TCP three-way handshake, and whether data are stored logically enough to offer a good compromise between data integrity and performance requirements (have a look at this post for some useful information about data redundancy and performance). This investigation isn’t always easy to carry out. Some kind of social engineering is required to avoid conflicts and obtain full collaboration. Unfortunately, we don’t learn this kind of stuff at college, but experience will help.
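To see why round trips matter so much, here is a minimal back-of-the-envelope sketch in Python. It is only an illustration under simplifying assumptions: every round trip costs one full RTT, every new TCP connection adds one extra RTT for its three-way handshake, and bandwidth, server time, TLS and slow start are ignored. The function name and all the numbers are made up for the example.

```python
def network_wait_ms(rtt_ms, round_trips, new_connections=0):
    """Rough time one transaction spends waiting on the network.

    Assumption: each request/response exchange costs one RTT, and each
    new TCP connection costs one more RTT for the three-way handshake
    before any data can be sent.
    """
    return rtt_ms * (round_trips + new_connections)


# Hypothetical "chatty" transaction: 300 exchanges, each on a fresh connection.
chatty_ms = network_wait_ms(rtt_ms=80, round_trips=300, new_connections=300)

# Same work reorganized into 10 exchanges over one reused connection.
batched_ms = network_wait_ms(rtt_ms=80, round_trips=10, new_connections=1)

print(f"chatty:  {chatty_ms / 1000:.1f} s")
print(f"batched: {batched_ms / 1000:.1f} s")
```

With the RTT fixed by the WAN, the only lever left in this simple model is the number of round trips, which is why reorganizing transactions can pay off far more than adding hardware.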
So, after the investigation and, almost surely, after some improvement actions on code, network and data, you can set your starting point by taking the first real measurements on the system.
The usual tools are server statistics campaigns. Unfortunately, they do not reveal the whole scenario, because it is not always easy to use server metrics to characterize service usage. For example, on a web server, processing X requests per second doesn’t really mean that you are serving X users per second. Probably you are serving Y users with Y<<X. To get a handy measure, it is often necessary to estimate Y, not X, and especially how many (K) of those Y users are doing what: uploading files, browsing pages, filling in forms, searching the DB? Collecting this information and comparing it with raw server statistics is necessary to establish a relation between customers’ service usage and variations in server metrics.
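As a rough illustration of the difference between X and Y, the following Python sketch counts requests and distinct sessions per second, and how many sessions performed each activity. The record layout (second, session id, activity) and the sample data are assumptions invented for this example; a real log would first have to be parsed and mapped onto something similar.

```python
from collections import Counter, defaultdict

# Hypothetical, already-parsed log records: (second, session_id, activity).
SAMPLE_RECORDS = [
    (0, "s1", "browse"), (0, "s1", "browse"), (0, "s1", "browse"),
    (0, "s2", "upload"), (0, "s2", "upload"),
    (1, "s1", "search"), (1, "s3", "browse"), (1, "s3", "browse"),
    (1, "s3", "form"),
]


def usage_profile(records):
    """Return per-second (X requests, Y distinct sessions) and users per activity."""
    requests = Counter()
    sessions = defaultdict(set)
    by_activity = defaultdict(set)
    for second, session_id, activity in records:
        requests[second] += 1
        sessions[second].add(session_id)
        by_activity[activity].add(session_id)
    per_second = {s: (requests[s], len(sessions[s])) for s in sorted(requests)}
    users_per_activity = {a: len(ids) for a, ids in by_activity.items()}
    return per_second, users_per_activity


per_second, users_per_activity = usage_profile(SAMPLE_RECORDS)
for second, (x, y) in per_second.items():
    print(f"second {second}: X={x} requests, Y={y} users")
for activity, k in users_per_activity.items():
    print(f"{activity}: K={k} users")
```

Identifying a "user" by session id is itself an assumption; depending on the service, a login name or a client address may be a better (or worse) approximation.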
If this relation is well established, you have in your hands the first really powerful weapon for deciding what you need when the marketing department tells you that user uploads, DB searches or form filling are expected to increase by 20% in the next 6 months.
If the problem is current system performance, instead, a good function relating user behavior to server statistics can be used to show your management that by improving some parameters (disks, memory, CPU) you can serve a page Z percent faster.
In one of the following posts, I will explain some empirical methods to establish a reliable relation between service usage and server metrics.
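One simple way to picture such a relation is a straight-line fit between a usage figure and a server metric, which can then be extrapolated to the forecast load. The Python sketch below assumes the relation is roughly linear over the range of interest, which is far from guaranteed on a real system (metrics often degrade sharply near saturation). The sample measurements are made up, and the choice of uploads per minute versus CPU percentage is only an example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up measurements: uploads per minute vs. average CPU usage (%).
uploads_per_min = [10, 20, 30, 40, 50]
cpu_percent = [15.0, 25.0, 35.0, 45.0, 55.0]

slope, intercept = fit_line(uploads_per_min, cpu_percent)

# Marketing forecast: uploads grow by 20% over the current peak.
current_peak = 50
forecast_peak = current_peak * 1.2
forecast_cpu = slope * forecast_peak + intercept

print(f"forecast load: {forecast_peak:.0f} uploads/min")
print(f"estimated CPU: {forecast_cpu:.0f}%")
```

The estimate is only as good as the measurements behind it, so the fit should be checked against fresh data before it is used to justify a purchase.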