Contention for resources reaches its most dangerous heights when two or more processes use two or more of the same resources. The problems can occur even if there is only one computer involved, as long as that computer is running multiple processes concurrently. Most computers do that. However, the problem becomes potentially more severe and is harder to prevent when a distributed system is involved. As we shall see, many of the traditional deadlock prevention systems work only when all the competing processes are running on one computer and when that computer owns all the resources involved.
Let’s examine two processes that use the same two resources. Just for variety, let’s look at an example from the operating system itself.
When a program starts to run, the operating system creates an entry for it in a bookkeeping area called the process table. This entry is used to keep track of information about the process and must continue to exist as long as the program is running. While a program is running, it can also branch off another process (like a subprogram) to perform some subtask, and this new process must also get an entry in the process table. Its entry will be erased when the subprocess ends and control returns to the main process.
We’ll look at two algorithms carrying out different activities, one on behalf of the operating system (compiling reports of system statistics) and one on behalf of a transaction-based system, recording transactions in a log and then archiving transactions on a CD every so often. Both use a CD-R mass storage system, ion which the computer can mount CD-s in a CD-R reader-writer under program control. Both of these processes call subroutines to do some of their work. While subroutine calls are not always implemented by creating a separate process, let’s say they are in this example.
Here’s an algorithm that uses a subroutine (a separate process) to compile a report of system usage statistics that will then be written to a CD.
Get control of the CD writer
Mount a blank volume to the CD drive
Write volume ID information (date, time, etc.)
Call subroutine to compile system usage report
Write usage report to CD
Close volume
Release CD drive
Here’s another algorithm that uses a subroutine.
While the buffer is empty, waste time
When there’s something in the buffer
Copy it to the log file
If this is the 100th item in the log file
Call archive process
Here’s the archive process (subroutine)
Get control of the CD writer
Mount the archive volume
Copy the log to the archive
Release the CD writer
If these two algorithms are executed sequentially (one starts and finishes before the other starts) there is no problem. They can switch back and forth all day and there will not be a conflict. Conflicts arise only when they run concurrently; when one starts and then the other starts before the first one ends. Multitasking systems, whether distributed or not, allow this to happen, and now we can get into trouble.
Let’s look at a situation where a problem occurs. What if one operation gets control of the CD writer, but before the first operation can start its subordinate process the second operation fills the last slot in the process table? Without an open slot in the process table, the first operation will be unable to start its subordinate process and thus will not finish. It won’t release the CD writer until it finishes.
Without the CD writer, the second operation will not finish, and it won’t release the process table slot it holds until it finishes. Now we have two processes that are in deadlock. Neither can proceed without a resource held by the other, and neither will release its resources until it proceeds to a finish.
I don’t know about you, but I found that a bit hard to follow the first time I heard it. Let’s see that laid out graphically. To make things easier to follow, I’ll color the first algorithm purple and the next main algorithm blue. The second algorithm’s subroutine I’ll color green. I’ll write the steps in the order they are executed, but I’ll write any steps from the first algorithm against the left margin and any steps from the second (or its subroutine) to the right.
Get control of the CD writer
Mount a blank volume to the CD drive
While the
buffer is empty, waste time
When there’s something
in the buffer
Copy
it to the log file
If
this is the 100th item in the log file
Call
archive process
Write volume ID information (date,
time, etc.)
Get control of
the CD writer
Call subroutine to compile system usage
report
Write usage report to CD
Close volume
Mount the
archive volume
Copy the log to the archive
Release the CD
writer
This order of operations still works fine as long as the process table doesn’t fill. The first process gets the CD writer, then the second operation starts its subprocess, which blocks because the CD writer is busy. When the first process releases the CD writer, the second process’s subroutine unblocks and finishes.
If, however, the process table is filled by the process creation in the step marked below with a big red ‘X’, then the attempt to call the purple process’s subroutine in the step marked with a ‘Y’ will be blocked because no process table slot is available. Now we’re doomed. The purple process can’t proceed because no process table slot is available, so it will never release the CD writer. The green process (and the blue one that called it) are blocked because the CD writer is unavailable, so it will never release its process table slot. All these processes are stuck, and the CD writer and process table slots are lost to the system.
Get control of the CD writer
Mount a blank volume to the CD drive
While the
buffer is empty, waste time
When there’s
something in the buffer
Copy
it to the log file
If
this is the 100th item in the log file
Call
archive process X
Write volume ID information (date,
time, etc.)
Get control of
the CD writer
Call subroutine to compile system usage
report Y
Write usage report to CD
Close volume
Release CD drive
Mount the
archive volume
Copy the log to the archive
Release the CD
writer
How can we resolve this? Well, in this particular example there’s a chance that this deadlock will be resolved without any effort on our part. Remember, this is a multitasking system. If the entire process table is filled but is not filled only by these two processes, then it is possible that some other process we don’t know about will end and release its slot(s) in the process table. Now the purple system program can unblock, finish, and release the CD writer. Then the green log copier can unblock and finish, releasing its process table slot(s).
In general, of course, we can’t rely on other processes to clean up the mess. There may be no other processes that hold the resources we need. If there are, they may be blocked themselves, perhaps even waiting for things we hold but won’t release until our processes finish. We have to come up with a less random way of dealing with deadlock. Either we have to be able to predict (and thus avoid) it, detect and correct it, or make it impossible.
Unfortunately, there’s a problem. Although it’s beyond the scope of this course to show the proof here, computer scientists have been able to prove that it is impossible to either predict deadlock or to detect it perfectly. But what about making it impossible?
Computer scientists have determined that there are three conditions that are all necessary for deadlock to occur. They are:
1. Processes must compete for nonsharable resources. Nonsharable resources are those that must be held exclusively by one process, like a CD writer which can’t be shared because it can only be writing on one CD at a time. Files are unsharable if they are being modified by the process that holds them, although they may be shared if all the processes are only reading them and not making any changes.
2. Resources must be allocated to process on a partial basis. That is, a process may have some resources allocated to it and then, later in the process, get some more. Both main processes we saw in our example were like this. They got one resource (a CD writer or a process table slot) and then later asked for another resource (whichever they did not already have.)
3. Resources cannot be removed from a process until it releases them. This means that the operating systems cannot resolve deadlock by just taking resources away from one of the competing processes and give them to another process. While this would break the deadlock, it would probably cause some harm to the work being done by the process that lost control of the resources. Image what would happen if a process were in the middle of transferring data from a buffer to CD and the operating system determined that deadlock existed because another process needed more memory to complete. If the operating system took the buffer away before the data were completely copied, some data would be lost forever.
If we remove any of these three conditions, deadlock cannot occur.
For example, let’s say that we required all processes, as soon as they start, to ask for and receive all the resources they would need to complete their work. No resource would be allocated to a process until all the resources it needs are available and can be allocated to it together. A process would be blocked until it received everything it needed, but it could not be blocked after that by a lack of resources. Since no process would be unable to finish because it lacked a resource, there could be no deadlock.
This strategy, called preallocation has been used before. IBM mainframes used a language called Job Control Language (JCL) to specify the resources that a program would need to be completed. Only when all of these resources could be allocated to the process did the process start. When the program ended, all of the resources were released for reuse. There are, however, two problems with this strategy. The first is that resources are tied up for the entire duration of the program, even though they may only be in use for a brief period. That means that the overall efficiency of the system is reduced, since system resources spend more of their time idle than they would if the were requested and released as needed. Programs now often keep running for hours, days, or weeks waiting for user interaction or network traffic. We don’t want the resources tied up for all of that time unless they are actually being used all that time.
The second problem became more obvious as programs became more interactive and data-driven. When a program starts, it is not usually possible to accurately predict what resources it will need. The amount of memory needed will depend on the amount of data received for processing. Network connections may or may not be needed, depending on whether an interactive user asks for web-based help or not.
The bottom line is that preallocation works to prevent deadlock, but it is very difficult to implement and can result in expensive inefficiencies.
Well, lets look at another strategy. How about having the system software detect that a deadlock has occurred and trying to resolve it? The first problem is that detecting deadlock is basically the same problem as predicting it. Therefore, computer scientists have already proven that it is impossible to perfectly detect deadlock.
The second problem is that once you detect deadlock, you can only resolve it by taking away resources that have already been allocated to a running process. That removes the third of the three conditions necessary for deadlock, so it does fix the deadlock. However, it has very serious consequences for the process that loses its allocated resources. Usually, process use unsharable resources to make changes. If deadlock occurs while these changes are in progress, it is likely that the process will be locked up between the time that a change is started and the time when it is safely finished. When this happens, the partly-made changes become errors in the data processing system. Since the only process that understands the data is blocked and then killed, it has no time to check and repair the data, and the errors persist.
In the past, when programmers developed programs to run on one computer under one operating system that they understood well (or could even modify if needed) programmers could build in rollback points into their data processing programs. These were moments in the processing when the data were internally consistent. Whenever a process reached one of these points, it began making copies of everything that it did until the next rollback point was reached. If deadlock was detected, the operating system could tell the process to run ‘backward’, undoing everything it had done until it reached the last completed rollback point. Then the process could be safely killed.
That was difficult to do, and it required a lot of programming skill, time, and effort. It only worked through very close cooperation between the systems software and the application software, and that becomes increasingly difficult in distributed systems. When the distributed system is running over an open network, with different owners of the different parts and no coordinated IT department, it becomes essentially impossible. So is there anything we can do about deadlock?
What can we do to limit damage from deadlock?
Well, yes, to some extent. We cannot prevent it. We cannot guarantee that it will not occur in such a way as to corrupt our data systems. What we can do is to limit the number and variety of nonsharable resources that our processes need, and try to avoid having the resources overlap. We can try to use the resources for a brief a time as possible and release them as quickly as we can. We can place sharable resources (like queues) in front of nonsharable resources (like printers) to limit our hazards. Probably most importantly, we can keep backups of our data systems and of all our changes, checking them frequently to make sure they have not been corrupted. When we detect a problem (whether it was caused by deadlock, a power failure, or user error) we can restore the systems from the backups.