The stories poured in on bright orange forms and were dutifully read by a board of judges whose primary reason for being selected was that they were not easily amused. (No easy winners for us!)
The results are in, and you can judge the First and Second Prizes for yourself:
Back in the Dark Ages of IBM 3090's and monster VAX clusters, I worked for a Nuclear Power company in England. We were a research lab and the main data center for the whole company. The machine room was huge - great halls filled with blue and grey cabinets, disk farms, rows of tape drives which looked like windmills on a distant hill. Pride of place in the middle of all the dull boxes was given to a Cray-2 which bubbled its Freon through perspex pillars back-lit with colored fluorescent light to impress the managers and justify the exorbitant price.
To operate this vast array of machinery an army of operators were employed. The main console area was named "The Bridge" since it resembled the Starship Enterprise with rows of green screens and keyboards, and hordes of people running about pressing buttons and speaking in a strange language with words like "IPL," "VTAM" and "TSO." These strange words were often followed by "is down." Anyway, in the quiet evenings after "prime shift," the teams of operators had very little to do other than watch cryptic messages scroll up a screen and wait to see if they could find a red one. To while away the hours people used to do things like driving their car into the machine room. The suspended floor could be rm'ed to make a temporary "pit" while you made a few repairs. One day creativity and the need for physical exercise after so many hours of sitting at consoles got the better of them, and they decided to pass the time enjoying the traditional English game of Cricket.
The halls in the machine room were a suitable size for a miniature cricket pitch, and the side of some of the slimmer machinery made a good wicket. The tube from a roll of graph plotting paper was an adequate bat and unwanted printout rolled into a ball and wrapped in Duck Tape was a reasonable substitute for a cricket ball, albeit rather lighter. There were long gaps between the machinery through which you could run, and on countless evenings the machine rooms echoed to the sound of laughter and, if not the crack of leather on willow, at least the dull thud of Duck Tape on cardboard. This was a popular pastime, with most of the operators involved, and great rivalry between the different ops teams. While managers and day staff were unaware of the nocturnal activities, managers actually commented that they had noticed a greater team spirit amongst Ops of late.
For those of you not familiar with the rules of cricket, basically someone throws the ball and the guy with the bat hits it. He then runs up and down a short pitch while the fielders try to retrieve the ball and throw it at the wicket while the batsman is in mid run. The object of one side is to get as many runs as possible, the object of the other side is to get all of your opponents "out" by hitting the wickets (or in this case a 3090 or similar) with the ball. When you've done this you all shake hands and drink tea and then swap sides.
The details of scoring are as convoluted as writing a C compiler in awk. The major win when playing cricket is for the batsman to hit the ball so hard that it reaches the boundary of the pitch, in which case you don't have to run anywhere and four runs are awarded. This is called a "four." When this happens everyone cheers and shouts "four." In a machine room filled with cabinets it is extremely difficult to get a "four" because all the machines get in the way. A variation on the "four" is the "six." A six is when the ball reaches the perimeter without touching the floor, and this is considered a major major win. This is easier in this particular variety of cricket due to the relatively light weight of the ball.
Obviously the ratio of the weight of the bat to the weight of the ball influences how much energy you can transfer from the bat to the ball. This led the Ops to experiment with different bats, made from cardboard tubes filled with a variety of materials to add strength and weight. All of this was to chase the highly desirable but relatively elusive "six." As the design of the bats improved, the frequency of sixes increased. One evening a particularly skillful batsman took a long swing at an easy ball. The ball sailed effortlessly over the top of a variety of mainframes, power supplies, HSM machines etc., and everyone got ready to shout "six." However the batsman's aim was remarkable, and the ball headed directly, almost as if attracted by magnetism, towards a large red button on the wall labeled "Emergency Powerdown." The ball struck the button with a dull thud, almost inaudible above the hum of the great machines. This was instantly followed by relays clicking and then a sickly silence. The entire data center had powered down . . . . This was the last game of cricket.
It was 1982, and I was working in the Pentagon. I had just completed the world's first port of UNIX V7 to a PDP 11/25 flown in from DEC. The serial number was 0007. The 11/25 had shipped with an exciting new technology, a small removable 5 MB (wow!) disk pack in addition to the internal 20 MB (wow!!) disk. Imagine, no front panel keying of the bootstrap loader!
At that time, we had a 24 by 7 support policy from DEC with a mandatory four hour escalation policy for each level of support. On the third day of operation, the mainboard failed. When the field service engineer arrived, he had "the spare" - not "a spare," but "the spare" - mainboard. The mainboard was connected to memory and peripherals as well as power by two ribbon cables. The cables were carefully labeled 1 and 2, but not "up" and "down," nor were the cables keyed. So, of course, when the FSE put the replacement board in, he attached the cables upside down, and the board smoked. Not failed, caught fire. Not only did he blow the only spare, but now the memory was dead.
Under the escalation procedure, DEC had to get us another board in four hours. So, somewhere in Maynard, Mass., a field service manager hopped on a plane with two new cards. He got a taxi straight from the airport and just made the four hour deadline. He arrived, slammed the new boards into the machine, attached the ribbon cables, took a deep breath, and fired it up. Yes, he got it wrong and smoked both boards. A pile of dead boards was forming. The next tier of support was now within engineering at DEC. An engineer was found, parts gotten, and he jumped on a plane. When he got in, he analyzed the problem and announced: "I see the problem, the cables aren't keyed and you're putting them in upside down." He replaced the now melted ribbon cables, put in new boards, etc. Result? You guessed it, he smoked another set. At this point, total frustration was beginning to give way to comedy.
End of the story: The principal design engineer at DEC responsible for the 11/25 was put on a plane with a boxload of parts. He arrived and entered the huddle of gathered DEC employees. We were up to five or six by then. I happened to notice that all the connectors were now labeled with tiny little red dots on one side, as were the cables. We had been down 20 hours when we got a good boot.
Postscript - The next revision of the hardware service manual devoted ten (10!) pages to orienting ribbon cables on the 11/25.
I received an unexpected carton shipped from a notoriously "high maintenance" offsite user. When I called him to find out what it was, he said that it was a 4mm tape drive that was not working correctly and that he would like me to get it fixed. When I opened the box, I found that several loose parts were falling out of the drive mechanism. Upon further inspection, I noticed that the drive chassis had tire marks on it! I called him back to learn that it had fallen off the tailgate of his pickup truck and he had backed over it. I had to explain to him that I didn't think this was going to be covered by our maintenance contract!
(This story is not related to system administration, rather, it is related to the Keynote Address)
There was a S/W engineer I worked with around 1989 named Bob LaTouche, who had a "Smart House" (sort of). One July in Southern California, he went on vacation for 2 weeks. Sometime soon after he left, the thermostat in his house malfunctioned, turning the furnace on full blast. The interior soon reached 110 degrees or so. A heat sensor notified Bob's computer. The computer sounded an alarm in the house, and when no one responded, it took further action - it called Bob at work. But Bob was away, and no one answered, so it called Bob's secretary. She was quite confused, then annoyed, then she stopped answering her phone. In desperation, the computer started calling the neighbors. When they answered the phone, they would hear a recording: "Hello, this is Bob LaTouche's house. I am on fire. Please call the fire department." The confused neighbors looked out their windows but saw no fire. Many of them didn't realize how persistent computers can be, and so the calls continued until Bob's return, weeks later. He was not very popular in the neighborhood after that.
This story happened a couple of years before I arrived at my employer and I'll keep names out.
We have keycard access to all areas of our workplace. Some exits, such as stairways, are alarmed if a keycard is not used first. Further, there are some that are simply alarmed, with no keycard option. Our computer room contains one such door. It also contained (at the time) a hundred or so Sun3 boxes plus another few hundred assorted Sun SPARCstations and NeXT boxes (black and white). Our computer room, like many others, is also equipped with a "Panic button." It just so happens that this button is located on the opposite side of the aforementioned alarmed door. A nameless individual had opened that door and tripped the alarm. Not knowing how to turn it off, he spied the shiny red, candy-like button and pressed it. That's when the real horror story started and this one ends.
As every System Administrator knows, reliable backups are a must. Because of this my team became suitably concerned when the operators handling our central database servers started to report "tape failures." The failures soon became regular, and required regular manual intervention to keep operational. In investigating the cause of this problem, corporate security and production floor rules forced us to depend on the operators for information. The operations staff placed the blame on the off-site tape-storage service's jostling tapes during transport, and requests for samples of failed tapes gave no indication as to the cause.
The root cause of the problem didn't come to light until this had been going on for a couple of months. During a large system upgrade, my team was able to observe the operators at work. The operations staff had been out-sourced to a low-cost contracting firm that apparently contained a large percentage of fans of the local professional hockey team. The operators were skidding the 8mm tapes across the computer-room floor like hockey pucks instead of carrying them across the floor. Adding a rule prohibiting throwing, skipping, and sliding of backup tapes quickly restored backups to a reliable state.
This story was written on a particularly bad morning after an even worse night to send to the rest of the administration team at a time when two people had become very punchy after working far too long on a job that would normally have been completed in 5-10 minutes.
A patient walks into an office to have a slight skin discoloration examined on the off chance that it might be cancerous. (translation: Ray goes to add two disks to server)
A lump appears underneath the region, which needs to have a biopsy. (translation: One of the disks won't be recognized by probe-scsi-all)
The doctor schedules some more extensive testing and blood work. The blood work comes out positive and the lump has to be removed. (translation: Swapping around disks and controllers didn't work at all; one of the new disks still won't be recognized)
After a lengthy surgery, the patient develops complications, and what seemed like a simple procedure turns into a quadruple bypass as the patient goes into cardiac arrest. (translation: And then. . . 2 hours later. . .all hell broke loose)
The patient goes into convulsions and the heart stops. The Dr. grabs the paddles and yells CLEAR!! The patient is dying. (translation: Not only do the newly installed disks not work, but now an entire other disk dies and has to be restored from tape after being replaced)
The heart gets a normal sinus rhythm again. The doctor sighs with relief. (translation: probe-scsi-all finally works! Hooray (Pyrrhic victory))
During recovery, the patient doesn't waken, however, and the Dr. becomes concerned. The patient drifts into coma. (translation: The OS disk won't boot!! Bloody hell! It gives copyright notice and hangs hard - L1-a doesn't even work.)
The coma lasts for weeks and the patient has to be put on a heart lung machine and a dialysis machine as renal failure sets in. (translation: boot net works, tried a new OS disk that worked on another machine. . .Copy all /etc files. . .stick it in. . .boot. . .ARGH! Back to CDROM. . .rinse lather repeat. . .2 OS disks not working, Ray and I sit at two different machines for hours to see what's going on. . .The sun comes up. We've now been at a 10 minute job for 10 hours.)
Other people start to pass out, it looks like it's contagious! Ebola!!! (translation: Ray hits power switch to previously well server by mistake instead of broken one during one of many swaps. . .We still need to upgrade another server, too!
Naturally, the one that is not booting, that had the bad disk, that we can't install another disk on, is also the tftp and boot-net-install server. It still isn't available, so we pilfer 6X CDROM to make things go faster)
Next of kin are gathered for organ donation to replace failed kidneys, liver, spleen, gall bladder, bone marrow, etc (translation: We're swapping parts madly. . .Taking OS disks from two other machines, swapping mother boards, RAM, CPU modules. Nada)
Then, the patient's arm twitches, and normal functions start to take over, but he's still unconscious after many many hours of transplants and surgery. (translation: Ray gets another OS disk and copies choice files one at a time, selectively, minimally, to /etc by hand. Even a restore from Sunday's level 0 didn't help previously)
The patient awakens, but has amnesia, and is not able to remember who he is, nor why he has hundreds of lacerations, contusions, stitches and staples where organs have been extracted and replaced by generous philanthropists and family members. (translation: The server comes up, but some files are missing, and it's not talking to the NIS+ replica correctly. . .Oh, did I mention that as well as being an NFS and boot-net server, it also is the root-master for NIS+? We get the original boot disk up on a separate machine and copy files over as needed. Still waiting on restore of failed disk to complete.)
Patient regains memory slowly, but has to undergo intense psychiatric therapy for post-traumatic stress disorder, and a hefty dose of Prozac and physical therapy. (translation: Keep copying over files from disk as needed. . .Restore takes a long time to a big old 5.25" Fujitsu 1.6GB drive)
Patient sues hospital for malpractice and wins 10 million dollars, but must still undergo outpatient therapy for quite some time. Patient donates old, bad organs to science.
(7:45am, morning shift comes in, unplanned disaster comes to end. We keep original OS disk to replace inevitably missing files later in the afternoon.)
Several months ago our team brought up a POP server for the Computing division. This machine has at least several hundred users and is working well. The other day, somebody from outside the team came by asking if we'd seen our extra machine. Well, it turns out that the server isn't actually ours. It materialized in our machine room, so we assumed it was ours & installed it. Anyway, now the real owners want their server back.
The newly remodeled computer room at the New Mexico Tech Computer Center featured an all-stop button right behind the door, so that if you opened the door too hard, the button was pressed. The button has since been moved.
Last Revised 11/15/96ah