Home Page
Posts > Always Confirm Potentially Hazardous Actions
Search:
Always Confirm Potentially Hazardous Actions
Also treat what others tell you with discretion
Time: 04/22/08 10:51:28 pm
Tags: Hardware [6], Linux [8], Diagnosing [13], Servers [31]
Most relevent 4 of 47 posts with shared tags:

So I have been having major speed issues with one of our servers. After countless hours of diagnoses, I determined the bottle neck was always I/O (input/output, accessing the hard drive). For example, when running an MD5 hash on a 600MB file load would jump up to 31 with 4 logical CPUs and it would take 5-10 minutes to complete. When performing the same test on the same machine on a second drive it finished within seconds.

Replacing the hard drive itself is a last resort for a live production server, and a friend suggested the drive controller could be the problem, so I confirmed that the drive controller for our server was not on-board (on its own card), and I attempted to convince the company hosting our server of the problem so they would replace the drive controller. I ran my own tests first with an iostat check while doing a read of the main hard drive (cat /etc/sda > /dev/null). This produced steadily worsening results the longer the test went on, and always much worse than our secondary drive. I passed these results on to the hosting company, and they replied that a “badblocks –vv” produced results that showed things looked fine.

So I was about to go run his test to confirm his findings, but decided to check parameters first, as I always like to do before running new Linux commands. Thank Thor I did. The admin had meant to write “badblocks –v” (verbose) and typoed with a double key stroke. The two v’s looked like a w due to the font, and had I ran a “badblocks –w” (write-mode test), I would have wiped out the entire hard drive.

Anyways, the test outputted the same basic results as my iostat test with throughput results very quickly decreasing from a remotely acceptable level to almost nil. Of course, the admin only took the best results of the test, ignoring the rest.

I had them swap out the drive controller anyways, and it hasn’t fixed things, so a hard drive replace will probably be needed soon. This kind of problem would be trivial if I had access to the server and could just test the hardware myself, but that is a price to pay for proper security at a server farm.


Comments
To add comments, please go to the forum page for this post (guest comments are allowed for the Projects, Posts, and Updates Forums).
Comments are owned by the user who posted them. We accept no responsibility for the contents of these comments.

No comments for this Post