Troubleshooting 101

I find that in troubleshooting people often go with a gut instinct rather than use standard troubleshooting methods. The following techniques are more effective and more definitive than a "best guess." It's really important to put aside your pre-existing theories and start from the beginning, methodically eliminating all possibilities no matter how unlikely you might think they are. I already hear the arguments: "I'm sure I know what's wrong," "I don't have time for all this," or "I know it's not that." My response is "No, you don't know anything until you've proven it." Ultimately you will be more successful and your methods will be more defensible.

If you are a lab manager, you have a huge advantage that the average user doesn't have: there is often an identical machine nearby that is not exhibiting the problem and which you can use to test theories.

Your goal in all of this is to divide and eliminate possibilities. Many inexperienced troubleshooters take a more scattershot approach. Sometimes they get lucky. Often not. Once you isolate and can clearly define the problem, you can follow that path to a solution.

If you come to me with a "solution," I'm going to ask you to prove it. Even more important, I won't buy you a replacement part until you've convinced me that it's the problem. I might also want to know why it won't just happen again with the new part. And yes, I'll ask you to convince me of that too.


Gather Information:

Step 0. Check the obvious. Are all cables connected firmly, cards and RAM firmly seated? Look the machine over carefully (inside and out) first when off and then while powering up. Is it getting power? Do the fans spin? Do the drives spin? Does it make the expected sounds? Does it make any unexpected sounds? Did you remember to check the cables for damage and the connectors for bent or broken pins?

Step 0.1 Is the problem repeatable or predictable? Can you make the problem occur by performing a set of behaviors? Repeatable/predictable is better than random because it helps you narrow it down. Can you list the steps needed to create the problem? What happens if you remove steps or change the order of the steps (where possible)? Does the problem occur immediately or after some set period of time? How long?
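
If your steps can be scripted, the "remove steps and see what happens" idea can be sketched in a few lines of Python. This is only an illustration, not a standard tool: the names minimize_steps and fake_reproduces are made up for this example, and the "reproduces" check stands in for you actually running the steps and watching for the problem. It assumes the problem is reliably repeatable.

```python
from typing import Callable, List, Sequence


def minimize_steps(steps: Sequence[str],
                   reproduces: Callable[[List[str]], bool]) -> List[str]:
    """Drop one step at a time; keep the drop if the problem still occurs."""
    current = list(steps)
    if not reproduces(current):
        raise ValueError("the full list of steps does not reproduce the problem")
    i = 0
    while i < len(current):
        trial = current[:i] + current[i + 1:]
        if reproduces(trial):
            current = trial  # step i was not needed to trigger the problem
        else:
            i += 1           # step i is required; keep it and move on
    return current


# Hypothetical stand-in for a real test: pretend the crash only needs
# "open large file" and "print" to both be in the sequence.
def fake_reproduces(trial: List[str]) -> bool:
    return "open large file" in trial and "print" in trial


steps = ["log in", "open large file", "resize window", "print"]
minimal = minimize_steps(steps, fake_reproduces)
print(minimal)
```

The shorter the list of steps you end up with, the fewer things you have left to suspect.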

Step 0.2 Get a history of the patient. When did the problem start? Did something change before the symptoms started? For example, was the computer moved? Was something new installed?

Step 0.5 On a Mac, suspect software first because the hardware is usually pretty good. It's almost never a virus. On a PC, first suspect a virus, then cheap hardware (unless it's a good brand) or bad software, in that order.

Step 0.9 Make a list of symptoms and try to predict something that will happen or will fail to happen if your working theory is correct.

Note that you haven't done anything yet. That's a good thing. The impulse is to jump in and try something, anything! Resist that impulse. When a person is lost, the best advice is to sit down, calm down, think and wait, not wander around trying to find one's way. Until you have gathered as much information as you can, you are going in blind and increasing the chance of looking stupid or making things worse. In a rush and/or in a panic, we make mistakes.

Real life example: Countless times, people tell me that the first thing they do after spilling water on their laptop, phone or other electronic device is to turn it on to see if it still works. This is the absolute worst thing you could do. Instead, unplug it, remove the battery (if possible) and dry it out as thoroughly as possible. Only after a week or more of drying time should you turn it on to test it.


Make a Diagnosis:

Step 1. Disconnect anything unnecessary, including all peripherals, third-party cards, etc., and see if the problem persists. Yes, everything but a "stock" mouse and keyboard. Exception: On a PC, plug in some powered speakers so you can hear diagnostic beeps. Try booting up in "safe mode" if you are still having problems. (Mac: Hold down the Shift key at boot. Windows: Hold down the F8 key at boot. Newer machines may use a different method, so check the vendor's support pages if these keys do nothing.)

Step 2. Assuming sufficient RAM, remove half of the RAM and test. Then put it back and remove the other half and test.
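
The half-and-half method is a plain bisection, and it scales to any number of modules. Here is a minimal Python sketch of the logic, assuming exactly one bad module and a machine that will run on any subset. The function name find_bad_module and the slot labels are hypothetical, and fails_with stands in for you physically installing those modules and testing. Both halves are tested every round so that an inconsistent result gets flagged instead of assumed away.

```python
from typing import Callable, List, Sequence


def find_bad_module(modules: Sequence[str],
                    fails_with: Callable[[List[str]], bool]) -> str:
    """Narrow down a single faulty module by testing half at a time."""
    candidates = list(modules)
    if not candidates:
        raise ValueError("no modules to test")
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first, second = candidates[:mid], candidates[mid:]
        first_fails = fails_with(first)
        second_fails = fails_with(second)
        if first_fails == second_fails:
            # Both halves fail or both pass: the one-bad-module
            # assumption does not hold, so stop and rethink.
            raise RuntimeError("inconclusive: both halves behave the same")
        candidates = first if first_fails else second
    return candidates[0]


# Hypothetical example: pretend the module in slot B1 is the bad one.
slots = ["DIMM A1", "DIMM A2", "DIMM B1", "DIMM B2"]
bad = find_bad_module(slots, lambda installed: "DIMM B1" in installed)
print(bad)
```

If the sketch raises the "inconclusive" error in real life, that is information too: the problem is probably not a single bad module.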

Step 3. Remove your boot drive and test the hardware from a freshly wiped hard drive with a clean OS install or from a bootable CD/DVD. As a less drastic test, you can create a new user account on the same machine. If the problem is only with one user account and not others, that's a good clue.

Step 4. If possible, move the suspected bad hardware component to an identical machine and test. The problem should move to the other machine. Then do the reverse: replace the suspect component with a known good component from a working machine. The problem should go away.
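
The two swap tests only prove something when they agree. A small Python sketch of how I would read the four possible outcomes follows. The function name interpret_swap and the wording of the verdicts are my own, for illustration only.

```python
def interpret_swap(problem_follows_part: bool,
                   good_part_fixes_original: bool) -> str:
    """Combine the results of the two swap tests into one verdict."""
    if problem_follows_part and good_part_fixes_original:
        # The fault moved with the part AND left with it: both tests agree.
        return "component confirmed bad"
    if not problem_follows_part and not good_part_fixes_original:
        # The part works elsewhere and a good part does not help here.
        return "component cleared; look elsewhere in the original machine"
    # The two tests disagree, so neither one has proven anything yet.
    return "inconclusive; repeat the tests or suspect an interaction"


print(interpret_swap(True, True))
print(interpret_swap(False, False))
print(interpret_swap(True, False))
```

Only the first verdict is good enough to justify buying a replacement part.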

Step 5. Research the problem. Unless your machine is brand new, someone else has probably encountered the same issue. Use Google. Visit the support forums for the company.

Step 6. Unless you're sure you know what the problem is, resist the urge to do something drastic, time-consuming and possibly irreversible, such as updating firmware, applying OS patches or updating drivers. It may not do harm, but it probably won't help, it will waste time, and it may introduce new problems.

Step 7. Get a second opinion. Maybe there's something you haven't thought of? If you are stuck, take a break. If the solution is not coming to you, there is probably a vital element that you are missing.

Note that in forming a diagnosis, you should not do anything that is irreversible. Before you treat, you need to be sure you know what's wrong.


Perform a Treatment:

Step 1. Remember the Hippocratic Oath: First, do no harm. Running a disk utility can sometimes make the problem worse. Before you attempt a repair, first try to recover the data. Before you blow away a drive, are you sure you got everything of value off it? Do you have a recent backup? Make a list of things you should grab before wiping a drive: documents, address book, sticky notes, music, photos, movies, e-mail, browser bookmarks.

Step 2. Perform your treatment. If your diagnosis was correct and the treatment was effectively applied, the problem should be gone. Is it? Is there a new or different problem?


Do Regular Maintenance:

In most cases, regular maintenance is a good thing.

However, when a new driver or software update first comes out, don't apply it immediately. Wait and see how others do with it. Too close to the "cutting edge" is the "bleeding edge." Let someone else do the bleeding.

If you have the luxury of multiple computers, put the update on a "test" machine before you apply it to important "production" machines. Work on the machine long enough (a week?) to do all the things you regularly do. If it all works, it's probably safe to proceed.

Run system updates regularly to patch security holes and fix bugs, especially on Windows PCs.

Run an anti-virus utility regularly, especially on Windows PCs.

On a Windows PC, run an anti-spyware utility regularly.

Run a disk utility frequently.

Don't let your system drive get too close to full.
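
Checking free space is easy to automate. The sketch below uses Python's standard shutil.disk_usage call. The 10% threshold is my own rule of thumb and not an official figure, and the helper names free_fraction and is_too_full are made up for this example.

```python
import os
import shutil

# Root of the current drive ("/" on Mac and Linux, e.g. "C:\" on Windows).
ROOT = os.path.abspath(os.sep)


def free_fraction(path: str = ROOT) -> float:
    """Return the fraction of the volume holding 'path' that is free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # treat an unreadable or zero-size volume as full
    return usage.free / usage.total


def is_too_full(path: str = ROOT, min_free: float = 0.10) -> bool:
    """True if less than 'min_free' of the volume is free (assumed threshold)."""
    return free_fraction(path) < min_free


print(f"{free_fraction():.0%} free on {ROOT}")
if is_too_full():
    print("Warning: the system drive is getting close to full.")
```

Run something like this on a schedule and you will hear about a filling drive before it becomes a symptom.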

Back up your data regularly so that if something catastrophic happens, you can get back to work quickly.

Before you do a major system update, back up your data and run a disk utility to check for problems. Repair any problems first.

After you do a major system update, run a disk utility to check for problems, with one exception: third-party utilities sometimes need an update of their own to work with a new OS. Don't run a third-party disk utility unless you are sure it is compatible with the new OS.

Special rule for server administrators: If it ain't broke, don't fix it. Back it up, but only apply patches and updates when there is a compelling need, such as a serious security vulnerability or compelling new feature. Before you upgrade, do your homework... make sure the upgrade is worth the pain. Choose a good time to do the upgrade. Have a plan for reverting to the pre-upgrade state in case something goes wrong.


Now print this and put it on your wall. This is your bible. You are a troubleshooter. Go fix something.


Some good troubleshooting resources and products:

Mac:

MacFixIt

MacOS X Tips

Apple Support Forums

Great Mac Utilities

Windows:

Tom's Hardware

Ars Technica