Data integrity and the Time Machine Restore


Information Technology is all about data. We collect raw data to build inferences from it, store the results and keep logs. We make decisions on strategic policies using data trends. We bill our customers on their service consumption data. Our bank adds and subtracts cash amount data (not actual bucks), and so on.
The 19th century was the century of steam, the 20th the century of transistors; this one is the century of data. Data give you the power to spot trends, to anticipate and influence consumer behavior, to control financial transactions, and a lot of other stuff that all the big companies know very well.
Please, don't be fooled by your latest-generation smartphone, your tablet or your Android stick: they will be obsolete in a few months. Well collected and well processed data are like gold. You can buy data or mine them, it doesn't really matter, but you have to keep them in a safe place.
If you want to be a good SA, your main duty is to take care of data: making processing fast through application scalability, avoiding bottlenecks, keeping data transmission reliable over a well designed network, and organizing and protecting your data storage.
So, the main concern is the data. Servers, switches, routers, network appliances and disks are all useful tools, but the whole ship exists to carry data stored in the best way you can.
Despite their value, data are usually stored in the worst place in the world: disks. At least until SSD units become a practical large-scale industrial choice, disks have perhaps the worst MTBF in the IT world. This is probably due to several factors:
  1. Disks are mechanical devices: there is an internal motor, platters rotating at very high (angular or linear) velocity around spindles, and thin arms carrying electromagnetic/optical heads that read and write on very small surfaces. Sometimes I think the most amazing thing isn’t that a disk fails, but that disks usually work.
  2. Disks are very sensitive to thermal variations, humidity and the quality of electrical power.
  3. To improve capacity and reduce costs, commonly available disks are now generally of lower quality (in terms of reliability) than the old ones.
For these reasons, a data integrity policy absolutely has to be considered by a professional SA. A data integrity policy is not only a matter of appliances or procedures; it is an architectural issue. Let me explain:
Data redundancy is a good (and expensive) weapon to protect our wealth. So, NAS and SAN are useful acronyms, but if you trust only them, you are going to be the leading actor in the most terrifying horror movie you can imagine. NAS and SAN are appliances, and for this reason they can fail. This is a consequence of the law of conservation of energy (a perpetual motion machine of the first kind cannot exist) and of my own corollary: “a non-perpetual motion machine of any kind cannot exist either”. Generally, these “data servers” preserve data using a RAID architecture. This is a very good thing, but I bet your vendors are not going to explain to you that, for example, to improve write performance, RAID 5 often uses an incremental parity update. In other words, at no point are parity data validated or recalculated, so:
“if any block in a stripe should fall out of sync with the parity block, that fact will never become evident in normal use; reads of the data blocks will still return the correct data. Only when a disk fails does the problem become apparent. The parity block will likely have been rewritten many times since the occurrence of the original desynchronization. Therefore, the reconstructed data block on the replacement disk will consist of essentially random data.” [Ref. A]
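Just to make that failure mode concrete, here is a minimal Python sketch. It has nothing to do with real RAID code: plain byte strings stand in for disk blocks, and the stripe layout is invented for illustration. The point is only that reads stay correct while the rebuild after a disk failure produces garbage.

    def xor_blocks(*blocks: bytes) -> bytes:
        """Byte-wise XOR of equally sized blocks (how RAID 5 parity is computed)."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                result[i] ^= b
        return bytes(result)

    # One stripe: three data blocks plus their parity block.
    d0 = b"AAAA"
    d1 = b"BBBB"
    d2 = b"CCCC"
    parity = xor_blocks(d0, d1, d2)

    # d1 is rewritten, but the parity block is never refreshed
    # (the stripe has silently fallen out of sync).
    d1 = b"XXXX"

    # Normal reads are still fine: data blocks are read directly, parity is ignored.
    assert d1 == b"XXXX"

    # Disk 2 dies and its content is "reconstructed" from the surviving blocks:
    rebuilt_d2 = xor_blocks(d0, d1, parity)
    print(rebuilt_d2)        # not b"CCCC" any more
    assert rebuilt_d2 != d2  # the corruption only shows up at rebuild time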
For this reason, even if you have the newest and best-performing data server, if you really want to protect your data, a backup policy is necessary.
Yes, I know. I say backup and you think of some old fella like me playing with 150 MB data cartridges and the funny “mt” Unix command. You are partially right. Backups are boring and require resources in terms of money and time. Even if you run a small business, a tape jukebox and a separate management network are almost mandatory.
In addition, you need a procedure to manage data cartridges and their storage in a place safely far from your DC (I have read of SAs keeping copies of data cartridges at home), a data classification scheme to decide on different backup policies, and a habit of regularly checking your backups to verify that the data you think you can recover are actually readable.
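As a sketch of what “regularly checking your backups” can look like in practice, here is a small Python routine that compares a random sample of restored files against the live copies using SHA-256 digests. The paths and the restore step are placeholders of my own: you would first restore last night’s set to a scratch area with whatever backup tool you use, then run the check.

    import hashlib
    import random
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Stream a file through SHA-256 so huge files don't fill memory."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_sample(original_root: Path, restored_root: Path, sample_size: int = 50) -> bool:
        """Compare a random sample of restored files against the live copies."""
        files = [p for p in original_root.rglob("*") if p.is_file()]
        ok = True
        for path in random.sample(files, min(sample_size, len(files))):
            restored = restored_root / path.relative_to(original_root)
            if not restored.exists():
                print(f"MISSING in restore: {path}")
                ok = False
            elif sha256(restored) != sha256(path):
                # May also mean the live file changed after the backup ran,
                # so check timestamps before panicking.
                print(f"CHECKSUM MISMATCH: {path}")
                ok = False
        return ok

    # Hypothetical usage, after restoring last night's set under /restore-test:
    # verify_sample(Path("/srv/data"), Path("/restore-test/srv/data"))

A mismatch is not always a disaster (the file may simply have changed since the backup ran), but a restore that cannot even produce the files is exactly the surprise you don’t want to discover during an emergency.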
Yes, verification is really the painful point. I used to back up my DC data on a huge jukebox with automatic management of data cartridges. To improve write performance, the jukebox’s designers wrote software routines that wrote data to many cartridges in parallel, so even a small file would be striped over three or four data cartridges. It happened a couple of times that, while trying to recover a file, one of the cartridges didn’t work. Unfortunately, the jukebox wasn’t smart enough to keep different versions of a file on different cartridges (especially when an incremental backup policy was in place), so the data were irremediably lost.
So, you can do your best but, in the end, you could still face the fact that you deleted a file and, despite your file server, your RAID level and your backup policy, the file is irremediably lost. There is no solution. Or is there?
The “Time machine restore method”
A long time ago, a software person working at my company created and deleted an important file on the same day. At that time, we used to back up the servers manually every night, “playing with 150 MB data cartridges and the funny old ‘mt’ Unix command”, so any file created and deleted between two backup sessions was unrecoverable.
When we tried to explain this complex concept to the software person, he came back to us with an exciting and unexpected solution: he asked us to set the server’s system clock to some hour before the file deletion and then recover the file. We were young, but already trained to manage software people’s Hyperuranium vision of the IT business, so we kept a straight face while pondering the great suggestion.
I have to confess that we lazy SAs never tried the “time machine restore method”, so, honestly, I can’t say it doesn’t work. If you try it, please let me know.
Ref. A: UNIX and Linux System Administration Handbook. Evi Nemeth, Garth Snyder, Trent R. Hein, Ben Whaley.
