
Being a Facilitator of Chaos

As described in one of my previous posts, being a system administrator in the IT business world is not like supervising an assembly line.
Software development factories and data centers offering IT services are very complex systems, with many unpredictable events to manage and control.

In my personal experience, life in general is quite indifferent to our plans or wishes. Things tend to go their own way without great respect for our needs. There must be some kind of moral in this story, but I'm too simple a person to deal with such arguments, so I prefer to skip them.

So, when you lead an SA group, you may want to consider yourself a kind of ship captain. You and your crew are in charge of setting sail from harbor A and reaching the safe harbor B. You aren't driving a train on rails, and facing storms and rough seas is part of your daily duty.

A complex system like a data center hides an unpredictable amount of entropy. Hardware failures, electrical power variations, air conditioning outages, software bugs, malware attacks, new software releases, the unexpected behavior of intranet services, a badly configured automated backup, or simply mice eating your network cables (it happened to me, I swear). A good captain knows that he can spread probes over the network, monitor server behavior, insert or replace intranet services carefully, write down checklists (beware of people who don't write checklists), prepare plan B (and sometimes plan C), but, most important of all, he knows that it is impossible to control entropy with rigid, inflexible procedures.

When you start to forget what you learned at college, maybe you will begin to understand that you cannot control the chaos, but you can facilitate it, trying to steer events onto the route you are in charge of following.
It means that your blueprint will be really good only if it isn't carved in stone and you have the experience and mental flexibility to face unexpected events with the required discipline and knowledge.

Too many times I have seen people put their trust in new monitoring software or amazing intranet servers. In my opinion, every object (software or hardware) carries its own internal entropy rate E_O, and introducing it into your system can be considered an improvement only if the final entropy of the whole system is reduced. Otherwise you are only adding new chances of failure.

So, trust your experience more and "solutions" less.

Capacity Planning: Do Know (Measure) Your Environment

As I wrote in my previous post, capacity planning for IT business is a real challenge. There are no reasons to make it harder than it already is.
For this reason, you have to keep in mind that capacity planning isn't some kind of esoteric guesswork, but a disciplined engineering procedure.
Yes, I know. Appearing like a kind of Computer Guru making predictions with a four-CPU crystal ball can be really helpful in dating ladies. The scientific approach to analyzing reality doesn't have the same appeal. Unfortunately our first duty is to serve the IT business, so the work has to follow a well disciplined procedure.

The first step when you are asked to plan the capacity of an IT system is to state how it actually performs before any improvement action. Are your customers satisfied, or are they complaining because a transaction takes a long time and/or freezes in the middle, losing all the data already entered?
It looks banal: you cannot say how far you have traveled if you don't know your starting point. It shouldn't require any explanation. Yet, in my experience, I have seen several great capacity planning task forces start their intervention without deeply analyzing the current configuration and, at the end, prove unable to verify the actual improvement of the system after their activity.
If you are a really experienced engineer, a good capacity planning activity can improve your IT system's performance by 20-30%. It is really difficult to reach a better result. On the other hand, I have seen performance improvements of 200-300% obtained simply by rewriting an unfortunate DB query or by reorganizing transactions to avoid too many RTTs on the WAN (remember, you can't reduce latency over a network if you keep the number of hops constant; latency can become a thorn in your side). It means that the first step is to gently meet with developers and system/network administrators to verify whether some bottleneck is nested in the code (usually in the DB queries), whether performing a single transaction opens several hundred TCP connections, each with its own TCP three-way handshake, and whether data are logically stored to offer a good compromise between data integrity and performance requirements (have a look at this post for some useful information about data redundancy and performance). It isn't always easy to perform this investigation. Some kind of social engineering is required to avoid conflicts and obtain full collaboration. Unfortunately, we don't learn this kind of stuff at college, but experience will help.
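Just to give an idea of how much those handshakes cost, here is a back-of-the-envelope sketch; the connection count and the RTT are invented figures, not measurements from a real system.

```python
# Rough cost of opening a new TCP connection per request versus reusing one.
# The figures below are illustrative assumptions, not real measurements.

RTT_S = 0.050          # 50 ms round trip on a typical WAN link (assumed)
CONNECTIONS = 300      # connections opened by a single "transaction" (assumed)

# Each new TCP connection pays roughly one extra RTT for the three-way
# handshake before the first byte of application data can flow.
handshake_overhead = CONNECTIONS * RTT_S

# Reusing a single persistent connection pays that price only once.
reused_overhead = 1 * RTT_S

print(f"handshake overhead, new connection per request : {handshake_overhead:.1f} s")
print(f"handshake overhead, one persistent connection  : {reused_overhead:.3f} s")
```

Fifteen seconds of pure handshaking per transaction, and no amount of extra CPU or RAM on the server will make it go away.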
So, after the investigation and, almost surely, after some improvement actions on code, network and data, you can set your starting point by making the first real measurements on the system.
The usual tools are server statistics campaigns. Unfortunately they don't reveal the whole scenario, because it is not always easy to use server metrics to characterize service usage. E.g., on a web server, processing X requests per second doesn't really mean that you are serving X users per second. Probably you are serving Y users, with Y << X. To have a handy measure it is often necessary to estimate Y, not X, and, especially, how many K of those Y users are doing what: uploading files, browsing pages, filling forms, searching the DB? Collecting this information and comparing it with the raw server statistics is necessary to establish a kind of relation between customers' service usage and the variation of server metrics.
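As a toy illustration of what I mean by going from raw requests X to users Y and their activity mix, here is a minimal sketch; the log format and the URL-to-action mapping are invented assumptions, to be adapted to your own access logs.

```python
# Group raw web server hits into users and classify what each hit is doing.
# The log format (client_ip, url) and the URL-to-action mapping are
# illustrative assumptions; adapt them to your own access logs.
from collections import Counter

raw_hits = [
    ("10.0.0.1", "/upload"), ("10.0.0.1", "/style.css"),
    ("10.0.0.2", "/search?q=raid"), ("10.0.0.2", "/page/42"),
    ("10.0.0.3", "/form/submit"), ("10.0.0.1", "/page/7"),
]

def classify(url: str) -> str:
    if url.startswith("/upload"):
        return "uploading files"
    if url.startswith("/search"):
        return "searching DB"
    if url.startswith("/form"):
        return "filling forms"
    return "browsing pages"

users = {ip for ip, _ in raw_hits}            # Y: distinct users, not raw hits X
activity = Counter(classify(url) for _, url in raw_hits)

print(f"raw requests X = {len(raw_hits)}, estimated users Y = {len(users)}")
for action, count in activity.items():
    print(f"  {action}: {count}")
```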
If this relation is well established, you have in your hands the first really powerful weapon to decide what you need when the marketing department tells you that user uploads, DB searches or form filling are expected to increase by 20% in the next 6 months.
If the problem is the current system performance, instead, a good function between user behavior and server statistics can be used to show your management that by improving some parameters (disks, memory, CPU) you can serve a page Z percent faster.
In one of the following posts, I will explain some empirical methods to establish a reliable relation between service usage and server metrics.
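In the meantime, here is a minimal sketch of the kind of relation I have in mind: a simple linear fit between measured concurrent users and average CPU utilization, extrapolated to the +20% scenario. The sample points are invented and real workloads are rarely so linear, so take it only as an illustration of the method.

```python
# Fit a simple linear relation between concurrent users and CPU utilization,
# then extrapolate to a +20% usage scenario. Sample data points are invented.

# (concurrent users, average CPU %) collected during the measurement campaign
samples = [(100, 22.0), (200, 38.0), (300, 55.0), (400, 71.0)]

n = len(samples)
sum_x = sum(u for u, _ in samples)
sum_y = sum(c for _, c in samples)
sum_xy = sum(u * c for u, c in samples)
sum_xx = sum(u * u for u, _ in samples)

# Least-squares slope and intercept: cpu ≈ a * users + b
a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b = (sum_y - a * sum_x) / n

current_users = 400
projected_users = current_users * 1.20   # marketing's +20% in six months
projected_cpu = a * projected_users + b

print(f"cpu ≈ {a:.3f} * users + {b:.1f}")
print(f"projected CPU at {projected_users:.0f} users: {projected_cpu:.0f}%")
```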

Capacity Planning Basics

This is the first, and I hope not the last, of several posts that I want to write about the activity I consider the "state of the art" for a System Administrator: capacity planning.
Like every activity that tries to predict the future behavior of complex systems, capacity planning requires deep knowledge, discipline, science, a great deal of experience and a bit of luck. It is not easy to find all of this in just one person, especially the last. So, in every company, there are many programmers, some SAs and just a few capacity planners.

Let's imagine you are in charge of capacity planning for a factory that produces bricks. You can bet that the technology required to make bricks will stay almost the same for many years. Bricks are simple components that mankind invented a long time ago, so the "brick" product can be considered almost perfect. The same can be said for the procedure we need to implement to build it. The brick market, on the other hand, has fairly predictable behavior, due to the time required to build homes, and small excesses of production can be easily managed just by finding a good place to store the unsold bricks. Bricks do not require refrigeration and are made to stay outdoors, so a warehouse buffer to absorb the (rare) unpredictable peaks in market demand will be a cheap solution, easy to implement and maintain.
Now, please, compare the old brick Heaven with the young IT hell. Information Technology was invented this morning, at about 8:00 a.m. Just try to remember how many "final solutions" you have encountered in your life: BASIC, Pascal, Lisp, C (this one really was "final" for a long time), Artificial Intelligence, Expert Systems, Hierarchical Databases, Mainframes, Client-Server Architecture, Distributed Computing, the Web, desktops, laptops, tablets, smartphones and so on. At the same time, market requirements change quickly, following the new technologies available and the mutations of the scenario. All the systems we are building now will be obsolete (in terms of hardware, software and maintenance procedures) in a few years. Some systems are going to be obsolete in months or weeks. The growth of the hardware performance/price ratio is crucial in making the right decisions about an IT system's life cycle. Why should I worry about data storage optimization with expensive software procedures if I can buy Petabytes today for the price Gigabytes cost two years ago (a typically silly CEO argument that is always difficult to counter)?
So, the very basic consideration about IT capacity planning is that it is a damn difficult task. Do not forget it if you are in charge of planning or, especially, if you are the one asking for a capacity plan.
In the next posts I will explain the basic procedures to:

  • Define the starting point (evaluate the actual quality of the IT system's behavior).
  • Evaluate what you need to keep performance acceptable as the business grows, using scaling points.
  • Plan new infrastructure deployments, so you neither buy useless hardware/software power nor find yourself unarmed when you need it.
  • Experimentally establish reliable ceiling values for your currently available resources.

Stay tuned.


System Administration and Crime: The Really Hidden Data

Having competences both in system administration and in the military is sometimes useful, especially when I'm involved in Police operations to discover systems used by our fellow system administrators working for the dark side of the Force.
Yes, the first interesting thing is that crime uses Information Technology at the highest level, not only to steal data but also to manage information about its business. If you still imagine mob guys acting like James Gandolfini in "The Sopranos", you need to update your EEPROM. The lower level of crime organizations, the people carrying guns and shooting each other, is only the tip of the iceberg. The real danger is the people wearing tuxedos.

Talking about hidden things, in the past I worked on a field case where a business consultant was under Police surveillance because they were sure he was an accountant for an Italian crime organization. I was involved directly in the action because the Police didn't want someone destroying data during the raid. Usually, our dark-side SAs provide plenty of quick systems to destroy data in a few seconds. And when I say "destroy", I'm not talking about delete, erase or low-level format; I'm talking about small appliances that release a powerful acid directly into the data storage device to physically burn the surface, making any attempt to recover the data impossible (data that, just to be sure, are also strongly encrypted).
There are procedures to avoid this kind of problem (no, I won't explain them over the Internet), so after we broke down the door, I was sure that all computers and storage devices had been safely seized by the Police Officers.
We spent several days looking for crime data, but we didn't find anything interesting. However, examining the network card configurations, I figured out that a hidden wireless network was being used in the seized offices.
So we went back to the office and started looking for the "hidden" router. We didn't find it. Police dogs couldn't help us; it seems that routers don't have a characteristic smell. It was only a kind of intuition when we realized that thin walls are obstacles only for electromagnetic radiation between about 435 and 790 THz (i.e., visible light), not for electromagnetic waves in the range of the IEEE 802.11 specification 😉
Yes, a small router and several small NAS storage devices had been physically buried in the office walls. No doors, no access. Only light bricks allowing the signal to propagate, power cables, and internal connections to the air conditioning system to keep the required working temperature.
The (good) idea was to provide only a fast routine to immediately detach the remotely mounted disks in case of an unwelcome visit.
Since that visit, the standard surveillance and raid procedures of the Italian Police include countermeasures for what we called the "in-wall buried data".
So, if you want to fight crime, stay tuned and keep thinking in an unconventional way. They have young, smart SAs, but we have experience and a bit of knowledge.
(Yes, I know, a lot of technical details are missing, but it works the same way as with magicians: your tricks are your money.)

MASELTOV: A System Administrator Helping Migrants

IT is all about data. That's definitely true. But what about people? I'm not talking about wealthy customers ready to stand in line for a couple of days just to be the first silly dude owning the latest gold-plated smartphone model. That kind of people is well placed in the middle of the heart of our companies. That's the business, baby. The business! And there's nothing you can do about it. Nothing!
Sometimes it is sad to think that we spend so much time assuring five nines of uptime in order to let someone update his/her status on Facebook, so the whole world can know about his/her new toy.

In my own case, working now also as a researcher, I can enjoy the pleasure of using my knowledge to improve transportation safety, via the e2Call project described here, and working as a work package leader in the European MASELTOV project, a smartphone application designed and (going to be) realized to help migrants' integration in the European Community. MASELTOV is an eInclusion project 2012-2014 funded by the European Commission (FP7-ICT-7 No. 288587).

MASELTOV Project Logo
MASELTOV (Mobile Assistance for Social Inclusion & Empowerment of Immigrants with Persuasive Learning Technologies & Social Network Services) recognizes the major risks for social exclusion of immigrants from the European information society and identifies the huge potential of mobile services for promoting integration and cultural diversity in Europe. Mobile – everywhere/every time – persuasive assistance is crucial for more efficient and sustainable support of immigrants. MASELTOV researches and develops novel ICT instruments in an interdisciplinary consortium with the key objective to facilitate and foster local community building, raising consciousness and knowledge for the bridging of cultural differences.
MASELTOV realises this project goal via the development of innovative social computing services that motivate and support informal learning for the appropriation of highly relevant daily skills. A mobile assistant embeds these novel services that address activities towards the social inclusion of immigrants in a persuasive and most intuitive manner which is highlighted in MASELTOV with a representative application of most essential / beneficial information and learning services – such as ubiquitous language translation, navigation, administrative and emergency health services.
MASELTOV researches for and develops enabling technologies with the industrial potential to easily exploit and scale up the prototypical user shares within the embedment of already existing successful services with worldwide user coverage. The project with its scientifically, technically and socially relevant results will enable a massive social impact on the future with respect to more cooperative – more successful – integration of millions of (im)migrants living together with hundreds of millions cohabitating European citizens.
MASELTOV intends to motivate immigrants with persuasive learning services for the appropriation of the local second language, playful learning of cultural understanding and basic literacy. MASELTOV takes advantage of the interplay between learning and social computing in order to apply learning (i) through communication as well as (ii) in the situated context, i.e., right at the spot where it matters, therefore jointly reinforcing the learning effect and the fostering of social inclusion. (please, read more on the project website maseltov.eu)
In the Telecom Italia Tilab task, we are developing a social network interface for smartphones, an inference engine to evaluate migrants' needs via their status updates on the social network (without collecting personal information), a social network trend analysis tool mainly based on Jacob Levy Moreno's work, and GeoSocial Radar, a very interesting smartphone application well described by my colleague (and friend of mine, at least I hope so 🙂 ) Dr. Mark Gaved from the Open University.

GeoSocial Radar is the MASELTOV app that links a MASELTOV user with nearby volunteers who are willing to help with resolving problems or tasks. It requires the GPS receiver to be enabled, so the app can indicate the proximity of different volunteers. On starting the app, the user can find who is available and how far away they are. A request for assistance can be made to an individual volunteer via a chat tool.
Sensor input:
Uses GPS
User Scenario:
GeoSocial Radar enables MASELTOV users to find nearby volunteers who can help them with a range of tasks. For language learning, this could include finding a native speaker who is interested in directly helping a recent immigrant practice their language skills, or incidentally support language learning by resolving an immediate need that introduces the learner to new vocabulary and phrases, such as finding a specialist food store, asking a doctor for medical help, or getting directions. Contextual support is a central element of GeoSocial Radar: it provides help to the user dependent on their immediate needs, taking into account where they are, and the time, as volunteers’ availability is time and place dependent.
Key information from the chat exchange between the user and the volunteer could be mined to identify terms which could then form the basis of future recommendations for language learning resources ("I see you have been seeking help about 'doctors' – would you like to practice a language lesson on health care?").
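Just to make the proximity idea concrete, here is a minimal sketch of ranking volunteers by great-circle distance from the user's GPS fix. The names and coordinates are invented and this is not the actual GeoSocial Radar code, only an illustration of the principle.

```python
# Rank nearby volunteers by great-circle distance from the user's GPS position.
# Names and coordinates are invented; this is an illustrative sketch, not the
# real GeoSocial Radar implementation.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

user = (45.0703, 7.6869)                      # the user's GPS fix (Turin)
volunteers = {
    "Anna":  (45.0650, 7.6950),
    "Marco": (45.0900, 7.6600),
    "Sofia": (45.0712, 7.6880),
}

# Sort volunteers by how far they are from the user
ranked = sorted(volunteers.items(),
                key=lambda item: haversine_km(*user, *item[1]))

for name, pos in ranked:
    print(f"{name}: {haversine_km(*user, *pos):.2f} km away")
```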

Taking care of people who need help is the most satisfying aspect of my daily duty. I like to think that this is a more esoteric way of doing system administration. We do not only deal with routers, server uptime, boring backups and RTTs, but also with real people who have no time to waste standing in line for a new (sometimes useless) toy.

Please like MASELTOV Project on Facebook and follow us on Twitter to support our work.

Data integrity and the Time Machine Restore

Information Technology is all about data. We collect raw data to build inferences on them and then store the results, keeping logs. We make decisions on strategic policies using data trends. We bill our customers on their service consumption data. Our bank adds and subtracts cash amount data (not bucks), etc.
The 19th century was the century of steam, the 20th the century of the transistor; this one is the century of data. Data give the power to manage trends, anticipate and influence consumer behavior, control financial transactions and do a lot of other stuff that all the great companies know well.
Please, do not be fooled by your latest generation smartphone, your tablet or your Android stick. They will be obsolete in a few months. Well collected and well processed data are like gold. You can buy data or mine them, it doesn't really matter, but you have to keep them in a safe place.
If you want to be a good SA, your main duty is to take care of data: making processing fast through application scalability, avoiding bottlenecks, keeping data transmission reliable over a well designed network, organizing and protecting your data storage.
So, the main subject is the data. Servers, switches, routers, network appliances and disks are all useful tools, but the whole ship exists to carry data, stored in the best way you can.
Despite their value, data are usually stored in the worst place in the world: disks. At least until SSD units become a practicable industrial large-scale choice, disks have maybe the worst MTBF in the IT world. This is probably due to several factors:
  1. Disks are mechanical appliances. There is an internal motor, platters rotating at very high (angular or linear) velocity around spindles, and thin arms supporting electromagnetic/optical heads writing and reading on very small surfaces. Sometimes I think that the most amazing thing isn't the fact that a disk fails, but that disks usually work.
  2. Disks are very sensitive to thermal variations, humidity and electrical power quality.
  3. To improve capacity and reduce costs, the commonly available disks are now, generally, of lower quality (in terms of reliability) than the old ones.
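A quick back-of-the-envelope sketch shows why this matters at data center scale. Assuming independent failures with an exponential lifetime model (a strong simplification) and an illustrative MTBF figure:

```python
# Probability that at least one disk in a pool fails within a year,
# assuming independent failures with an exponential lifetime model.
# The MTBF figure and the disk count are illustrative assumptions.
from math import exp

MTBF_HOURS = 1_000_000        # optimistic vendor figure for a single disk
HOURS_PER_YEAR = 24 * 365
DISKS = 500                   # disks in a modest data center

p_single_survives = exp(-HOURS_PER_YEAR / MTBF_HOURS)
p_any_failure = 1 - p_single_survives ** DISKS

print(f"P(one given disk fails within a year)      : {1 - p_single_survives:.2%}")
print(f"P(at least one of {DISKS} fails within a year): {p_any_failure:.2%}")
```

A failure that is rare for a single disk becomes a near certainty for the whole pool.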
For these reasons, a data integrity policy absolutely has to be considered by a professional SA. A data integrity policy is not only a matter of appliances or procedures; it is a kind of architectural issue. Let me explain.
Redundancy of data is a good (and expensive) weapon to protect our wealth. So NAS and SAN are useful acronyms, but if you trust them alone, you are going to be the leading actor of the most terrifying horror movie you can imagine. NAS and SAN are appliances, and for this reason they can fail. This is a consequence of the law of conservation of energy (a perpetual motion machine of the first kind cannot exist) and of my own corollary: "a non-perpetual motion machine of any kind cannot exist either". Generally, these "data servers" preserve data using a RAID architecture. This is a very good thing, but I bet that your vendors are not going to explain to you that, for example, to improve performance in the write phase, RAID 5 often uses an incremental parity update. In other words, at no point are the parity data validated or recalculated, so:
"If any block in a stripe should fall out of sync with the parity block, that fact will never become evident in normal use; reads of the data blocks will still return the correct data. Only when a disk fails does the problem become apparent. The parity block will likely have been rewritten many times since the occurrence of the original desynchronization. Therefore, the reconstructed data block on the replacement disk will consist of essentially random data." [Ref. A]
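To make the desynchronization risk concrete, here is a greatly simplified sketch of RAID 5 style XOR parity, showing how a stale parity block reconstructs garbage when a disk is replaced. It is an illustration of the mechanism, not of any real controller's code.

```python
# Greatly simplified RAID 5 style parity: parity = XOR of the data blocks.
# Shows how a parity block that falls out of sync reconstructs garbage.
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR an arbitrary number of equal-length blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A stripe with two data blocks and a correct parity block
d0 = b"HELLO World!!!"
d1 = b"IMPORTANT DATA"
parity = xor_blocks(d0, d1)

# d1 is rewritten but, for whatever reason, the parity block is never
# updated: the stripe is now silently out of sync.
d1 = b"UPDATED  BLOCK"

# Nothing looks wrong in normal use: reads of d0 and d1 still return good data.
# Now the disk holding d0 dies and is reconstructed from d1 and the *stale* parity.
reconstructed_d0 = xor_blocks(d1, parity)

print("original d0     :", d0)
print("reconstructed d0:", reconstructed_d0)   # essentially random bytes
```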
That is why, even if you have the newest and best-performing data server, a backup policy is necessary if you really want to protect your data.
Yes, I know. I say backup and you think of some old fella like me playing with 150 MB data cartridges and the funny "mt" Unix command. You are partially right. Backups are boring and require resources in terms of money and time. Even if you run a small business, a jukebox and a separate management network are almost mandatory.
In addition, you need a behavioral procedure to manage the data cartridges, a place to store them safely far from your DC (I have read of SAs keeping copies of data cartridges at their homes), a kind of data classification to decide on different backup policies, and a routine to regularly check your backups and verify that the data you think you can recover are actually readable.
Yes, this is really the weak point. I used to back up my DC data on a huge jukebox with automatic management of the data cartridges. To improve write performance, the jukebox's designers wrote software routines that write data to many data cartridges in parallel, so even a small file was striped over three or four data cartridges. It happened a couple of times that, trying to recover a file, one of the cartridges didn't work. Unfortunately, the jukebox wasn't smart enough to keep different file versions on different cartridges (especially when an incremental backup policy was implemented), so the data were irremediably lost.
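The kind of periodic check I mean can be as simple as restoring a sample of files from the last backup and comparing checksums against the originals. The paths below are placeholders and the restore step itself depends entirely on your backup tool; this is only a sketch of the idea.

```python
# Periodically verify that a sample of backed-up files is actually restorable.
# Paths are placeholders; how you restore files depends on your backup tool.
import hashlib
import random
from pathlib import Path

LIVE_ROOT = Path("/srv/data")            # the data you back up (placeholder)
RESTORE_ROOT = Path("/tmp/restore-test") # where a test restore was extracted (placeholder)
SAMPLE_SIZE = 20

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

live_files = [p for p in LIVE_ROOT.rglob("*") if p.is_file()]
for path in random.sample(live_files, min(SAMPLE_SIZE, len(live_files))):
    restored = RESTORE_ROOT / path.relative_to(LIVE_ROOT)
    if not restored.exists():
        print(f"MISSING from backup: {path}")
    elif sha256(path) != sha256(restored):
        print(f"CORRUPTED in backup: {path}")
```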
So, you can do your best but, in the end, you could still face the fact that you deleted a file and, despite your file server, your RAID level and your backup policy, the file is irremediably lost. There is no solution. Or is there?
The “Time machine restore method”
A long time ago, a software person working in my company created and deleted an important file on the same day. At that time, we used to back up the servers every night, manually "playing with 150 MB data cartridges and the old funny 'mt' Unix command", so every file created and deleted between two backup sessions was unrecoverable.
When we tried to explain this complex concept to the software person, he came to us with an exciting and unexpected solution. He asked us to set the server's system clock to some hour before the file deletion and then recover the file. We were young, but already trained to manage software people's Hyperuranium vision of the IT business, so we kept a straight face while pondering the great suggestion.
I have to confess that we lazy SAs didn't try the "time machine restore method", so, honestly, I can't say it doesn't work. If you try it, please let me know.
Ref. A: UNIX and Linux System Administration Handbook. Evi Nemeth, Garth Snyder, Trent R. Hein, Ben Whaley.

Bandwidth, Round Trip Time (RTT) and Latency

In a worldwide distributed computing environment, scale, in terms of users and transactions spread over the whole planet, becomes the main factor for performance.
Cheap hardware, ever more powerful CPUs and growing RAM sizes still seem to have margins for growth; the network and disk spindles (we'll talk about spindles in a future post), on the other hand, are an SA's worst nightmare.
You can scale out web server front ends, create master and slave databases or federate them behind the best hardware layer 7 load balancer, and cache data in a well organized LRU cache farm, but you can't do much about the physical limitations of mechanics and electromagnetism.

Let's start with some definitions. Bandwidth is the amount of data that you can pump into an OSI layer 1 medium (with the smart cooperation of its companion layers) per unit of time. Latency is the time necessary to physically travel through the medium, and it depends (not only, but essentially) on the length of the medium, in terms of the distance that needs to be covered (keeping in mind that, until someone has better ideas than Mr. Einstein, the upper limit remains the speed of light).
Bandwidth and latency calculations are left as an exercise.
I read a nice analogy in W. Richard Stevens' TCP/IP Illustrated, Vol. 3: imagine a garden hose. The bandwidth is the volume of water that comes out of the nozzle each second, and the latency is the amount of time it takes the water to get from the faucet to the nozzle. You can grow bandwidth by buying a garden hose with a larger section (i.e., you can buy a Gigabit network), but if you want to shorten the latency the only thing you can do is shorten the hose.
Both latency and bandwidth influence the network Round Trip Time (RTT), which can be defined as the one-way time necessary to travel from a network source to a network destination plus the one-way time from the destination back to the source. The two one-way components of RTT should be about the same on a well designed route.
In reality, RTT also depends on packet size and many other factors. In real life it is almost impossible for two network elements to talk over a flat network: switches and many levels of routers have to be considered. Every network element introduces delay, especially if packet fragmentation is caused by different MTUs on the path (in which case cut-through switching cannot be used), and the serialization delay of the network elements increases the RTT.
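To get a feeling for the numbers, here is a small sketch (link length, bandwidth and payload size are invented figures): it computes the propagation floor that no hardware upgrade can remove and the transmission time that a fatter pipe can actually reduce.

```python
# Lower bound on latency imposed by physics versus time you can buy with bandwidth.
# Link length, bandwidth and payload size are illustrative assumptions.

SPEED_OF_LIGHT_FIBER = 200_000.0   # km/s, roughly 2/3 of c in glass fiber
LINK_KM = 6_000                    # e.g. a transatlantic path (assumed)
BANDWIDTH_BPS = 1_000_000_000      # a 1 Gbit/s pipe (assumed)
PAYLOAD_BITS = 10_000_000 * 8      # a 10 MB transfer (assumed)

one_way_propagation = LINK_KM / SPEED_OF_LIGHT_FIBER    # seconds
min_rtt = 2 * one_way_propagation                       # ignoring queueing and serialization
transmission_time = PAYLOAD_BITS / BANDWIDTH_BPS

print(f"one-way propagation latency : {one_way_propagation * 1000:.0f} ms")
print(f"minimum possible RTT        : {min_rtt * 1000:.0f} ms")
print(f"time to push 10 MB at 1 Gb/s: {transmission_time * 1000:.0f} ms")
```

A bigger pipe shortens the last number; only a shorter hose shortens the first two.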
So, what can we do to improve the network performance of our application? Some advice follows:
Bandwidth: it depends on two factors: your ISP's offer and the size of your wallet. You can change ISP if yours doesn't offer the necessary bandwidth. If you can change the size of your wallet, you are a lucky guy.
Latency: the only effective solution is to get closer to your customers. So, if you are offering services to the whole world, you need strategically distributed data centers housing your applications in different countries, keeping distributed databases aligned. This is a really big fight, very often reserved for big companies, but it ensures data redundancy and better uptime of your services, also because if one data center fails, business traffic can be switched to the nearest working one.
RTT: it may look a bit strange, but the best you can do to lower RTT is on the software side. If your services are offered over the Internet, you can't control the routing path, the hops or the MTUs. The best you can do is to find a good compromise between the complexity of database transactions and the number of times the client needs to ask the server for information. You can have 200 small queries, each paying its own RTT, or just 20 bigger ones paying ten times fewer of those same RTTs. In the latter case you need to strengthen your computational capacity and data access management, but you have shifted part of the performance problem from the freeway to your own backyard.
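A back-of-the-envelope comparison of the two designs follows; the RTT and the per-query service times are invented figures, just to show where the time goes.

```python
# Compare 200 small sequential queries against 20 bigger ones over the same WAN.
# RTT and per-query server times are illustrative assumptions.

RTT_S = 0.050            # 50 ms round trip between client and server
SERVER_TIME_SMALL = 0.002
SERVER_TIME_BIG = 0.020  # a bigger query costs more CPU/database time

small_queries = 200 * (RTT_S + SERVER_TIME_SMALL)
big_queries = 20 * (RTT_S + SERVER_TIME_BIG)

print(f"200 small queries : {small_queries:.1f} s")   # dominated by 200 RTTs
print(f" 20 bigger queries: {big_queries:.1f} s")     # the RTT is paid only 20 times
```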
As always, there is no grandma's perfect recipe to manage the problem. Our business was born complex and is still a very emotionally disturbed teen that we, as parents, need to manage using knowledge, experience and some madness (just to confuse the enemy).

The Underestimated Power of Copy and Paste

Someone, including me, thinks that one of the greatest computer science inventions is the "Copy & Paste" stuff. In the late '80s, when I started programming in Pascal and AIM65 assembly language, I had to use VT100 and VT340 terminals. Yes, "vt100" isn't only a value for the environment variable TERM. It is (it was) a physical terminal connected to a kind of serial hub called a terminal server. I have evidence. One piece follows:

Please, don't be fooled by the big head of the fat boy. He was a dumb one. No RAM, no disks, no interfaces (except the keyboard plug). He was just a typewriter in a kind of '70s outfit.
Writing professional software requires a well defined namespace for variables, data structures, constants and functions. This is for maintenance purposes. A variable name should be meaningful, so you aren't free to name it after your sweetheart's nickname.
For this reason, having a lot of names to assign to several objects in 10,000 lines of code, very precise rules have to be applied. So it happens that you can end up with an integer array element like this:
GSZMAIN00_SelectCursor_FreeObjectList_Local[LeftSelection][RightMainSection]
And imagine having to write this name a couple of hundred times in your code and then compile it 🙂
When X terminals and mice arrived on our desks, we discovered the "Copy & Paste" function, and we were able to write code ten times faster, simply by avoiding typing errors in the long names.
I have to be honest. Like all powerful weapons, the "Copy & Paste" tool has a dark side. In the same way you can replicate the boring names, you can replicate errors and portions of code, avoiding optimization and making software maintenance a nightmare. But this is the eternal struggle of good against evil, and anyway, that part is only on us. So, let's sincerely thank +Larry Tesler, the computer scientist who, among other things, formalized the "Copy & Paste" procedure. All the people doing our job owe him many hours of our lives.
The passage from video terminals to X terminals was a real Copernican revolution, and I have a funny story to tell about it. Maybe in the next post.