Friday 2 December 2011

Danger of overcomplicating

Today there are many add-ons to operating systems and databases that will keep systems running, or automatically fail them over if a problem is detected.  Well and good, as long as everything runs according to plan, which of course it never does.
What of the undetectable problem?  There is no reason a well supplied and well administered system should fail on a known issue (unless that issue has been kept secret on the supplier's side); if the issue is known, it should have been patched.  But there are many possibilities for issues.  Very few systems are exactly the same, thanks to the manual ways of installing a system and the many permutations possible across server, storage, networking, OS, database, application, add-ons and the patching of them all.  This usually means that a change can, at any time, lead to an unexpected event.  The only way to fully account for this is to have another, unchanged system, just in case.
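To see how far two "identical" servers can drift apart, here is a minimal sketch (Python, with hypothetical hostnames, assuming ssh access and a dpkg based system; substitute your own package query) that diffs the installed package lists of the live and the standby:

    import subprocess

    def package_set(host):
        # List installed packages and versions over ssh.  dpkg-query is an
        # assumption; use rpm -qa or your platform's equivalent instead.
        out = subprocess.run(
            ["ssh", host, "dpkg-query -W -f '${Package} ${Version}\\n'"],
            capture_output=True, text=True, check=True,
        ).stdout
        return set(out.splitlines())

    live = package_set("db-live.example.com")        # hypothetical hosts
    standby = package_set("db-standby.example.com")

    # The symmetric difference is the drift: anything installed or versioned
    # differently on one side than the other.
    for line in sorted(live ^ standby):
        print("DRIFT:", line)

Anything this prints is a permutation your failover has never been tested against.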

Do not fall into the trap of creating more problems than you guard against, resulting in more downtime rather than less.  What these automatic add-ons do is add complication: more layers of things that can go wrong.  There is a lot to be said for the old manual failovers and restarts, as long as there is a 24x7 human interface in place.  Yes, they had a time delay in the data replication, but that delay could be controlled by you, the admin, down to whatever level is acceptable for the company.  Most can live with that if it means higher security for the system, i.e. less dependency on the "no system available" manual routine, and higher security regarding the maximum downtime.
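As a sketch of what a controlled delay can look like, assuming the database ships its transaction logs as files to the standby (Python; the paths and the 30 minute figure are hypothetical, and most databases offer an equivalent built-in apply delay that is preferable to rolling your own):

    import os, time, shutil

    DELAY = 30 * 60               # hold logs back 30 minutes; tune to what the company accepts
    INBOX = "/standby/log_inbox"  # hypothetical: where shipped logs arrive
    APPLY = "/standby/log_apply"  # hypothetical: where the database picks them up

    while True:
        for name in sorted(os.listdir(INBOX)):
            path = os.path.join(INBOX, name)
            if time.time() - os.path.getmtime(path) >= DELAY:
                # Only logs older than the chosen delay reach the standby, so
                # the admin has a window to stop the apply before a bad
                # transaction or a corruption is replayed on the second system.
                shutil.move(path, os.path.join(APPLY, name))
        time.sleep(60)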

Often the fastest resolution is a quick reboot, or a failover to a completely independent system running a little behind the main one, so that it can be caught in a non-failed state.  If you build systems that can fail over automatically, you often have to keep the databases running exactly in sync, which can mean that both databases end up with the same error.  You can also have problems with the failover process itself, and the worst case scenario is that both systems end up in a hung state.
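Written out, the failover decision looks deceptively simple; a minimal sketch (Python, with hypothetical hostnames and port, and a deliberately crude liveness probe) of why I would rather leave it to a human:

    import socket

    def responds(host, port, timeout=5):
        # Crude liveness probe: can we open a TCP connection to the db port?
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    PRIMARY, STANDBY, PORT = "db-live", "db-standby", 5432   # hypothetical

    if responds(PRIMARY, PORT):
        print("primary alive: do nothing")
    elif responds(STANDBY, PORT):
        # An automatic tool would promote here, but it cannot tell a dead
        # primary from a network split, which is how you end up with two live
        # systems or two hung ones.  Paging a human keeps that judgement manual.
        print("primary unreachable, standby alive: page the on-call admin")
    else:
        print("both unreachable: the worst case, page everyone")

The middle branch is the whole argument: the probe cannot distinguish "failed" from "unreachable", but a human can go and look.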
Not that a manual secondary system is any guarantee.  It requires strict discipline from the admin to see to it that it is kept fully updated and in a runnable state.  Regular testing will be required, and I would recommend regularly doing planned switches between the live system and the standby.  This ensures that both are in a production-capable state when you need them.
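A planned switch can be as simple as a scheduled drill.  A sketch of the routine (Python; every function here is a hypothetical placeholder for your own database's commands):

    import datetime

    def stop_writes(host):       print(f"{host}: stop application writes")
    def wait_until_synced(a, b): print(f"wait for {b} to catch up with {a}")
    def promote(host):           print(f"{host}: promote to live")
    def demote(host):            print(f"{host}: demote to standby, resume log apply")

    def planned_switch(live, standby):
        # Run on a schedule (e.g. quarterly) so that BOTH systems are proven
        # production capable, rather than assumed to be.
        stop_writes(live)
        wait_until_synced(live, standby)
        promote(standby)
        demote(live)
        print(f"{datetime.date.today()}: roles swapped, {standby} is now live")

    planned_switch("db-live", "db-standby")   # hypothetical hostnames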

Automation and full synchronisation can also give problems at upgrade or patching time.  How do you patch in such a way that at least one full solution remains in a pre-patch state, and stays that way until you know the patch isn't going to cause any issues?  What do you fall back to if you upgrade your live and your standby as one?
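The ordering that keeps one fallback intact can be sketched as follows (Python; the functions are hypothetical placeholders for your own patch, test and switch procedures; the soak period is the point):

    def apply_patch(host, patch): print(f"{host}: apply {patch}")
    def verify(host):             print(f"{host}: run smoke tests")
    def switch_over(a, b):        print(f"switch load from {a} to {b}")
    def soak(days):               print(f"run for {days} days before touching the fallback")

    def rolling_patch(live, standby, patch):
        apply_patch(standby, patch)   # 1. patch the standby only
        verify(standby)               # 2. test it while live is untouched
        switch_over(live, standby)    # 3. the patched system takes the load
        soak(7)                       # 4. the old live stays pre-patch as the fallback
        apply_patch(live, patch)      # 5. only now patch the second system
        verify(live)

    rolling_patch("db-live", "db-standby", "db-patch-17")   # hypothetical names

Until step 5 you always have one full, pre-patch solution to fall back to.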
