Availability – like insurance – only worse
1 July 2010, by Ariel Rauch
A few years ago, a customer asked us to improve the availability of his Oracle database. The customer ran an internet shop of sorts, and while his revenues were steadily increasing, he became aware of the damage a system outage could do to his reputation and, more directly, to his pocket…
A few months after we had successfully installed and implemented our solution, an emergency call reached our service center at 1 am:
“… Our central storage system crashed, and although we switched to the standby database, the database did not start. Could you please help us …”
I think the fact that catastrophes in the IT sector always happen at night should be officially added to Murphy's Law. On the other hand, in times of globalization this phenomenon will probably die out.
It is also interesting to note that although this customer had not bought our SLA service, he knew whom to call… Never mind!
Anyway, our professionals quickly found out that the synchronization process, which was supposed to keep the standby database up to date, had stopped working two weeks earlier. Although this database could have been opened, it would not have contained the current data.
The customer was lucky in this case: we were able to recover the data from the crashed central storage, and after a total outage of six hours the site was up and running again.
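A check of the kind that might have caught the stalled synchronization early can be very small. The following is only a minimal sketch in Python, not what was running at the customer's site: the function name `standby_is_stale`, the 30-minute threshold, and the timestamps are all assumptions made for illustration, and how the time of the last applied change is obtained depends on the replication product and is left out.

```python
from datetime import datetime, timedelta


def standby_is_stale(last_applied: datetime,
                     now: datetime,
                     max_lag: timedelta = timedelta(minutes=30)) -> bool:
    """Return True when the standby lags behind by more than max_lag.

    last_applied: time of the last change applied on the standby
                  (hypothetical input, to be supplied by the replication tool).
    max_lag:      tolerated lag; 30 minutes is an arbitrary illustrative value.
    """
    return now - last_applied > max_lag


# Illustrative timestamps only: a standby that stopped applying changes
# two weeks before the check runs.
now = datetime(2024, 1, 15, 1, 0)
last_applied = now - timedelta(weeks=2)

if standby_is_stale(last_applied, now):
    lag_days = (now - last_applied).days
    print(f"ALERT: standby database is {lag_days} days behind")
```

Run from a scheduler every few minutes, a check like this turns a silent failure into a visible one. It is also one more piece of infrastructure that has to be maintained.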
Almost every technological solution aimed at increasing the availability of a system will automatically also increase the complexity of the underlying infrastructure, as well as the daily maintenance effort.
I am aware that it is quite difficult to explain to a company's finance department why it should spend more money on additional hardware that is used solely to ensure the availability of some applications, without any additional benefit in terms of functionality or performance. One way to convince a CFO might be to compare it to buying insurance: an insurance policy is only useful when something unexpected happens. Unfortunately, this comparison does not explain the higher daily maintenance cost. Let me be very clear: the higher the complexity of an infrastructure, the higher the maintenance cost. This additional cost does not pay for activating the solution in case of a crash; it covers the additional effort necessary to keep normal operation up and running.
In conclusion, and perhaps a bit provocatively, I would like to state:
Without appropriate daily maintenance, IT infrastructure that was expanded to achieve higher availability will be more prone to outages than it was before the expansion.
Yours
Ariel Rauch