Humble Pi: When Math Goes Wrong in the Real World
Author(s): Parker, Matt
ISBN No.: 9780593084687
Pages: 336
Year: 202001
Format: Trade Cloth (Hard Cover)
Price: $ 37.26
Status: Out Of Print

One

Losing Track of Time

On September 14, 2004, around eight hundred aircraft were making long-distance flights above Southern California. A mathematical mistake was about to threaten the lives of the tens of thousands of people onboard. Without warning, the Los Angeles Air Route Traffic Control Center lost radio voice contact with all the aircraft. A justifiable amount of panic ensued. The radios were down for about three hours, during which time the controllers used their personal cell phones to contact other traffic control centers to get the aircraft to retune their communications. There were no accidents but, in the chaos, ten aircraft flew closer to each other than regulations allowed (five nautical miles horizontally or two thousand feet vertically); two pairs passed within two miles of each other. Four hundred flights on the ground were delayed and a further six hundred canceled. All because of a math error.


Official details are scant on the precise nature of what went wrong, but we do know it was due to a timekeeping error within the computers running the control center. It seems the air-traffic control system kept track of time by starting at 4,294,967,295 and counting down once a millisecond. Which meant that it would take 49 days, 17 hours, 2 minutes, and 47.295 seconds to reach 0. Usually, the machine would be restarted before that happened, and the countdown would begin again from 4,294,967,295. From what I can tell, some people were aware of the potential issue, so it was policy to restart the system at least every thirty days. But this was just a way of working around the problem; it did nothing to correct the underlying mathematical error, which was that nobody had checked how many milliseconds there would be in the probable runtime of the system. So, in 2004, it accidentally ran for fifty days straight, hit zero, and shut down.
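The countdown arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation, not the control center's actual software (which is not public); the names `split_ms`, `COUNTDOWN_START`, and `MS_PER_DAY` are made up for illustration.

```python
# Sketch only: reproduces the arithmetic in the text, not any real system.
MS_PER_DAY = 24 * 60 * 60 * 1000
COUNTDOWN_START = 4_294_967_295  # starting value quoted in the text


def split_ms(total_ms: int) -> tuple[int, int, int, int, int]:
    """Split a millisecond count into (days, hours, minutes, seconds, ms)."""
    seconds, ms = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return days, hours, minutes, seconds, ms


days, hours, minutes, seconds, ms = split_ms(COUNTDOWN_START)
print(f"{days} days, {hours} hours, {minutes} minutes, {seconds}.{ms:03d} seconds")

# A thirty-day restart policy stays inside the countdown; fifty days does not.
restart_policy_is_safe = 30 * MS_PER_DAY < COUNTDOWN_START
fifty_days_hits_zero = 50 * MS_PER_DAY > COUNTDOWN_START
print(restart_policy_is_safe, fifty_days_hits_zero)
```

Run as written, the first line printed is `49 days, 17 hours, 2 minutes, 47.295 seconds`, matching the figure in the text, and both checks come out true: thirty days fits comfortably, fifty days does not.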


Eight hundred aircraft traveling through one of the world's biggest cities were put at risk because, essentially, someone didn't choose a big enough number. People were quick to blame the issue on a recent upgrade of the computer systems to run a variation of the Windows operating system. Some of the early versions of Windows (most notably Windows 95) suffered from exactly the same problem. Whenever you started the program, Windows would count up once every millisecond to give the "system time" that would drive all the other programs. But once the Windows system time hit 4,294,967,295, it would loop back to zero. Some programs (drivers, which allow the operating system to interact with external devices) would have an issue with time suddenly racing backward. These drivers need to keep track of time to make sure the devices are regularly responding and do not freeze for too long. When Windows told them that time had abruptly started to go backward, they would crash and take the whole system down with them.
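To see why a driver might conclude that time has gone backward, here is a small Python simulation of a 32-bit millisecond counter. It is an illustration under stated assumptions, not Windows or driver code: `tick`, `naive_elapsed`, and `wrap_safe_elapsed` are hypothetical helpers, and the wrap-safe version shows one common remedy (doing the subtraction modulo 2^32), which the text does not claim anyone actually used.

```python
MASK32 = 0xFFFFFFFF  # 4,294,967,295, the largest 32-bit unsigned value


def tick(counter: int, ms: int = 1) -> int:
    """Advance a simulated 32-bit millisecond counter, looping back past MASK32."""
    return (counter + ms) & MASK32


def naive_elapsed(now: int, earlier: int) -> int:
    """Plain subtraction: goes negative once the counter has wrapped."""
    return now - earlier


def wrap_safe_elapsed(now: int, earlier: int) -> int:
    """Subtraction modulo 2**32: still correct across a single wrap."""
    return (now - earlier) & MASK32


last_seen = MASK32 - 5     # device last responded 5 ms before the limit
now = tick(last_seen, 10)  # 10 ms later the counter has looped around

print(now)                                # small number again
print(naive_elapsed(now, last_seen))      # hugely negative "elapsed" time
print(wrap_safe_elapsed(now, last_seen))  # the true 10 ms
```

With the naive subtraction, a hypothetical timeout check sees an elapsed time of roughly minus 4.3 billion milliseconds, which is the "time racing backward" described above.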


It is unclear if Windows itself was directly to blame or if it was a new piece of computer code within the control center system itself. But, either way, we do know that the number 4,294,967,295 is to blame. It wasn't big enough for people's home desktop computers in the 1990s, and it was not big enough for air-traffic control in the early 2000s. Oh, and it was not big enough in 2015 for the Boeing 787 Dreamliner aircraft. The problem with the Boeing 787 lay in the system that controlled the electrical power generators. It seems they kept track of time using a counter that would count up once every 10 milliseconds (so, a hundred times a second) and it topped out at 2,147,483,647 (suspiciously close to half of 4,294,967,295). This means that the Boeing 787 could lose electrical power if turned on continuously for 248 days, 13 hours, 13 minutes and 56.47 seconds.
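The 787 figure can be verified the same way. The sketch below uses Python's standard `datetime.timedelta` and takes the counter limit and the 10-millisecond tick straight from the text; the constant names are illustrative, and nothing here reflects Boeing's real code.

```python
from datetime import timedelta

COUNTER_MAX = 2_147_483_647  # counter limit quoted in the text
TICK_MS = 10                 # one count every 10 milliseconds

# Total powered-on time before the counter tops out.
time_to_overflow = timedelta(milliseconds=COUNTER_MAX * TICK_MS)
print(time_to_overflow)

# "Suspiciously close to half": doubling and adding one gives the other number.
print(2 * COUNTER_MAX + 1)
```

The `timedelta` holds 248 days plus 47,636.47 seconds, and 47,636 seconds is 13 hours, 13 minutes and 56 seconds, in agreement with the text. The second line prints 4294967295.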


This was long enough that most planes would be restarted before there was a problem but short enough that power could, feasibly, be lost. The Federal Aviation Administration described the situation like this:

"The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing that GCU to go into failsafe mode. If the four main GCUs (associated with the engine-mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase."

I believe that "regardless of flight phase" is official FAA-speak for "This could go down midflight." Their official line on airworthiness was the requirement of "repetitive maintenance tasks for electrical power deactivation." That is to say, anyone with a Boeing 787 had to remember to turn it off and on again. It's the classic computer programmer fix. Boeing has since updated its program to fix the problem, so preparing the plane for takeoff no longer involves a quick restart.


When 4.3 Billion Milliseconds Is Just Not Enough

Why would Microsoft, Los Angeles Air Route Traffic Control Center, and Boeing all limit themselves to this seemingly arbitrary number of around 4.3 billion (or half of it) when keeping track of time? It certainly seems to be a widespread problem. There is a massive clue if you look at the number 4,294,967,295 in binary. Written in the 1s and 0s of computer code, it becomes 11111111111111111111111111111111; a string of thirty-two consecutive ones. Most humans never need to go near the actual circuits or binary code on which computers are built. They only need to worry about the programs and apps that run on their devices and, occasionally, the operating system on which those programs run (such as Windows or iOS). All these use the normal digits of 0 to 9 in the base-10 numbers we all know and love.


But beneath it all lies binary code. When people use Windows on a computer or iOS on a phone, they are interacting only with the graphical user interface, or GUI (delightfully pronounced "gooey"). Below the GUI is where it gets messy. There are layers of computer code taking the mouse clicks and swipe lefts of the human using the device and converting them into the harsh machine code of 1s and 0s that is the native language of computers. If you had space for only five digits on a piece of paper, the largest number you could write down would be 99,999. You've filled every spot with the largest digit available. What the Microsoft, air-traffic control, and Boeing systems all had in common is that they were 32-bit binary-number systems, which means the default is that the largest number they can write down is thirty-two 1s in binary, or 4,294,967,295 in base-10. It was slightly worse in systems that wanted to use one of the thirty-two spots for something else.
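The "fill every spot with the largest digit" idea works the same way in any base, and a short Python sketch makes the parallel concrete. The helper `largest_number` is a made-up name for illustration.

```python
def largest_number(spots: int, base: int) -> int:
    """Largest value writable with `spots` digits in the given base."""
    return base ** spots - 1


# Five decimal digits on a piece of paper.
print(largest_number(5, 10))

# Thirty-two binary digits in a 32-bit system.
print(largest_number(32, 2))

# Going the other way: thirty-two 1s read as a binary number.
thirty_two_ones = "1" * 32
print(int(thirty_two_ones, 2))
```

The three lines print 99999, 4294967295, and 4294967295 again: thirty-two 1s in binary is exactly the number that keeps causing trouble.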


If you wanted to use that piece of paper with room for five symbols to write down a negative number, you'd need to leave the first spot free for a positive or negative sign, which would mean that you could now write down all the whole numbers between -9,999 and +9,999. It's believed Boeing's system used such "signed numbers," so, with the first spot taken, they only had room for a maximum of thirty-one 1s, which translates into 2,147,483,647. Counting only centiseconds rather than milliseconds bought them some time, but not enough. Thankfully, this is a can that can be kicked far enough down the road that it does not matter. Modern computer systems are generally 64-bit, which allows for much bigger numbers by default. The maximum possible value is of course still finite, so any computer system is assuming that it will eventually be turned off and on again. But if a 64-bit system counts milliseconds, it will not hit that limit until 584.9 million years have passed.
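Both the signed limit and the 64-bit horizon can be worked out in a few lines of Python. This is a sketch of the arithmetic only. It assumes an unsigned 64-bit millisecond counter and years of exactly 365 days, which is the convention that reproduces the 584.9 million figure; using 365.25-day years gives about 584.5 million instead.

```python
# One spot reserved for the sign leaves thirty-one 1s for the value.
signed_32_max = 2 ** 31 - 1
print(signed_32_max)

# Unsigned 64-bit counter ticking once per millisecond.
unsigned_64_max = 2 ** 64 - 1
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000  # assumes 365-day years

years_until_limit = unsigned_64_max / MS_PER_YEAR
millions_of_years = round(years_until_limit / 1_000_000, 1)
print(millions_of_years)
```

The first value printed is 2147483647 and the second is 584.9, so the limit is still there, just comfortably out of reach.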


So you don't need to worry: it will need a restart only twice every billion years.

Calendars

The analog methods of timekeeping we used before the invention of computers would, at least, never run out of room. The hands of a clock can keep spinning around; new pages can be added to the calendar as the years go by. Forget milliseconds: with only good old-fashioned days and years to worry about, you will not have any math mistakes ruining your day. Or so thought the Russian shooting team as they arrived at the 1908 Olympic Games in London a few days before the international shooting was scheduled to start on July 10. But if you look at the results of the 1908 Olympics, you'll see that all the other countries did well but there are no Russian results for any shooting event. And that is because what was July 10 for the Russians was July 23 in the UK (and indeed most of the rest of the world). The Russians were using a different calendar.
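The thirteen-day gap can be reproduced with a short Python sketch. It rests on an assumption the passage above does not spell out: that the "different calendar" was the Julian calendar, while the UK used the Gregorian one. The function name `julian_gregorian_gap` is made up for illustration, and the formula is the standard century-based rule for how far the Julian calendar trails the Gregorian.

```python
from datetime import date, timedelta


def julian_gregorian_gap(year: int) -> int:
    """Days by which the Julian calendar trails the Gregorian calendar.

    Assumption: valid from March of the given year onward, because the gap
    only changes after February in century years not divisible by 400.
    """
    return year // 100 - year // 400 - 2


gap = julian_gregorian_gap(1908)
print(gap)

# Python's date type is Gregorian, so the Julian-labelled day is shifted
# forward by the gap to find the date everyone else was using.
russian_label = (1908, 7, 10)
london_date = date(*russian_label) + timedelta(days=gap)
print(london_date)
```

The gap comes out as 13 days, and the shifted date is 1908-07-23, matching the July 10 versus July 23 mismatch described above.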


It seems odd that something as straightforward as a calendar can go so wrong that a team of international athletes shows up at the Olympics two weeks late. But calendars are far more complex than you'd expect; it seems that dividing the year up into predictable days is not easy and there are different solutions to the same problems. The universe has given us only two units of time: the year and the day. Everything else is the creation of humankind to try to make life easier. As the protoplanetary disk congealed and separated into the planets as we know them, the Earth was made with a certain amount of angular momentum, sending it flying around the sun, spinning as it goes. The orbit we ended up in gave us the length of the year, and the rate of the Earth's spin gave us the length of the day. Except they don't match.

