Sunday 4 December 2011

When did the full dump ever help?

Most advanced systems will automatically dump their memory, or prompt you to take a dump, when they recognise that a failure has happened. Some OS vendors also love the dump. The problem is that if the system had been able to recognise the cause of the crash, it wouldn't have crashed in the first place. A dump is usually just a snapshot of what is in memory at that exact moment, and it tells you little about what happened leading up to the problem.
If a system is able to recognise a crash situation, it means the vendor knew when building it that this could happen and included a way of logging it. But if they had known it could happen, they would have put in measures to prevent it in the first place. Most crashes are due to unforeseen circumstances and therefore cannot be logged.

A downside of dumps is that they usually become very large and take a long time to extract. If you do extract them successfully, the tools for analysing them are either long-winded or produce results that are difficult to interpret. This means you usually have to upload them to an OS or hardware supplier's site. Internet connections have grown a lot in capacity, but the amount of data in these dumps means you will still be clogging a link for a long time.
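
To put a rough number on the upload problem, here is a small back-of-the-envelope sketch. The dump size, link speed and the 80% efficiency factor are all assumptions chosen for illustration, not measurements from any real system.

```python
def upload_hours(dump_gib, link_mbit_per_s, efficiency=0.8):
    """Estimate the hours needed to upload a dump of `dump_gib` GiB.

    `efficiency` is an assumed fraction of the nominal link speed that is
    actually usable for the transfer (protocol overhead, other traffic).
    """
    bits = dump_gib * 1024**3 * 8                      # GiB -> bits
    usable_bits_per_s = link_mbit_per_s * 1_000_000 * efficiency
    return bits / usable_bits_per_s / 3600             # seconds -> hours


if __name__ == "__main__":
    # Hypothetical example: a 64 GiB memory dump over a 10 Mbit/s uplink.
    print(f"{upload_hours(64, 10):.1f} hours")
```

With those assumed figures the upload takes on the order of 19 hours, during which the link is of little use for anything else.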
 
After all that, 99% of the time you will get back: nothing found. I would argue that it is a lot better to do targeted log extracts. As an admin you will usually have an idea of where the problem lies. Work with the developers or suppliers of the application you run to find the best tool for logging what is going on. Then play with the parameters of the logging tool while you put load on your system. This may take some provocation, like artificially increasing the load or reducing the capacity of the system. That is easy if you have a multi-computer system: turn off some of the resources. But even on a single system you can simply limit the number of processors used or run up an additional load (for example from a dummy program).
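
The single-system provocation described above can be sketched roughly as follows. This is a minimal illustration, not a finished tool: the names `limited_cpus`, `start_dummy_load`, `sample_latency` and `provoke` are made up for this example, the "application" being measured is just a small fixed calculation, and the processor limit relies on `os.sched_setaffinity`, which Python only provides on Linux and some other Unix systems (elsewhere the sketch runs without the limit).

```python
import contextlib
import logging
import os
import subprocess
import sys
import time

log = logging.getLogger("provoke")

# Hypothetical dummy load: a child Python process that spins for N seconds.
BURN = (
    "import sys, time\n"
    "end = time.monotonic() + float(sys.argv[1])\n"
    "while time.monotonic() < end:\n"
    "    pass\n"
)


@contextlib.contextmanager
def limited_cpus(max_cpus):
    """Temporarily pin this process, and children started inside, to fewer CPUs."""
    if not hasattr(os, "sched_setaffinity"):
        log.warning("CPU affinity not supported here; running unrestricted")
        yield None
        return
    original = os.sched_getaffinity(0)
    allowed = set(sorted(original)[:max(1, max_cpus)])
    os.sched_setaffinity(0, allowed)
    try:
        yield allowed
    finally:
        os.sched_setaffinity(0, original)  # restore the original CPU set


def start_dummy_load(workers, seconds):
    """Start `workers` busy-looping child processes as additional load."""
    return [
        subprocess.Popen([sys.executable, "-c", BURN, str(seconds)])
        for _ in range(workers)
    ]


def sample_latency(samples, work=20000):
    """Time a small fixed piece of work repeatedly and log each measurement."""
    timings = []
    for i in range(samples):
        start = time.perf_counter()
        sum(range(work))  # stand-in for the operation you actually care about
        elapsed = time.perf_counter() - start
        log.info("sample=%d elapsed_ms=%.3f", i, elapsed * 1000)
        timings.append(elapsed)
    return timings


def provoke(max_cpus=1, workers=2, seconds=0.5, samples=5):
    """Reduce capacity, add dummy load, and log targeted timings meanwhile."""
    with limited_cpus(max_cpus) as cpus:
        log.info("cpus=%s dummy_workers=%d", cpus, workers)
        procs = start_dummy_load(workers, seconds)
        try:
            return sample_latency(samples)
        finally:
            for proc in procs:
                proc.wait()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    provoke()
```

In practice you would replace the body of `sample_latency` with a call into the real application or with the supplier's own logging tool, and vary `max_cpus` and `workers` until the problem shows itself in the log.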

There can be many different causes of system crashes and malfunctions. I have experienced, amongst others, missing non-public patches, a bent processor pin, bad programming and system limits being reached. These last have occurred in the OS, database, application and hardware. What I haven't experienced is any of them being diagnosed correctly, and the solution found, from a full system/memory dump.
