CICS recovery manager

The recovery manager ensures the integrity and consistency of resources (such as files and databases) both within a single CICS® region and distributed over interconnected systems in a network. Figure 3 shows the resource managers, and their resources, with which the CICS recovery manager works.

The main functions of the CICS recovery manager are:

Figure 3. CICS recovery manager and resources it works with
This figure illustrates, in the form of a wheel, the resource managers with which the CICS recovery manager works. At the hub of the wheel is the system log. Circling the log is the recovery manager, which provides the recovery interface between the log and the resource managers that are positioned around the outer rim. The resource managers are shown in three groups: local resource managers; the resource manager interface (RMI); and the communications managers. The local resource managers are, in clockwise order, RDO, temporary storage, transient data, file control, and VSAM RLS. The RMI managers are DB2, DBCTL, and MQ. The communications managers are MRO, LU6.1, and LU6.2.

Managing the state of each unit of work

The CICS recovery manager maintains, for each UOW in a CICS region, a record of the changes of state that occur during its lifetime. Typical events that cause state changes include:

The identity of a UOW and its state are owned by the CICS recovery manager, and are recorded in storage and on the system log. The system log records are used by the CICS recovery manager during emergency restart to reconstruct the state of the UOWs in progress at the time of the earlier system failure.
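As a rough illustration of how state changes recorded on a log can be replayed to rebuild the state of UOWs that were in progress, here is a minimal Python sketch. The record layout (a `uow_id` and a `new_state`), the state names, and the function name are hypothetical; this is not the format of the CICS system log.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical terminal states: a UOW in one of these was no longer in progress.
FINISHED = {"committed", "backed-out"}


def reconstruct_uow_states(log: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Replay (uow_id, new_state) records in log order.

    Returns the last recorded state of every UOW that had not finished
    when the log ends, that is, the UOWs that were still in progress.
    """
    states: Dict[str, str] = {}
    for uow_id, new_state in log:
        states[uow_id] = new_state  # a later record supersedes an earlier one
    return {uow: state for uow, state in states.items() if state not in FINISHED}


# Illustrative log contents (invented for this sketch).
log = [
    ("UOW1", "in-flight"),
    ("UOW2", "in-flight"),
    ("UOW1", "committed"),
    ("UOW2", "in-doubt"),
    ("UOW3", "in-flight"),
]
in_progress = reconstruct_uow_states(log)
print(in_progress)
```

In this sketch UOW1 is dropped because its last record shows it completed, while UOW2 and UOW3 are kept with their last known state.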

The execution of a UOW can be distributed over more than one CICS system in a network of communicating systems.

The CICS recovery manager supports SPI commands that provide information about UOWs.

Coordinating updates to local resources

The recoverable local resources managed by a CICS region are files, temporary storage, and transient data, plus resource definitions for terminals, typeterms, connections, and sessions.

Each local resource manager can write UOW-related log records to the local system log, which the CICS recovery manager may subsequently be required to re-present to the resource manager during recovery from failure.

To enable the CICS recovery manager to deliver log records to each resource manager as required, the CICS recovery manager adds identifying information when the log records are created. Therefore, all logging by resource managers to the system log is performed through the CICS recovery manager.

During syncpoint processing, the CICS recovery manager invokes each local resource manager that has updated recoverable resources within the UOW. The local resource managers then perform the required action. This provides the means of coordinating the actions performed by individual resource managers.
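The coordination described above can be sketched as follows. This is a simplified Python model, not CICS code: the class names, the `note_update` bookkeeping call, and the `commit`/`backout` methods are all assumptions made for illustration. The point it shows is that only the managers that updated recoverable resources within the UOW are invoked at syncpoint.

```python
from typing import Dict, List


class LocalResourceManager:
    """Hypothetical stand-in for a local resource manager such as file control."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.actions: List[str] = []  # what this manager was asked to do

    def commit(self, uow_id: str) -> None:
        self.actions.append(f"commit {uow_id}")

    def backout(self, uow_id: str) -> None:
        self.actions.append(f"backout {uow_id}")


class RecoveryManagerSketch:
    """Hypothetical coordinator that tracks which managers a UOW has used."""

    def __init__(self) -> None:
        self._managers: Dict[str, LocalResourceManager] = {}
        self._touched: Dict[str, List[str]] = {}  # uow_id -> manager names

    def register(self, manager: LocalResourceManager) -> None:
        self._managers[manager.name] = manager

    def note_update(self, uow_id: str, manager_name: str) -> None:
        # Remember each manager once per UOW, in order of first update.
        touched = self._touched.setdefault(uow_id, [])
        if manager_name not in touched:
            touched.append(manager_name)

    def syncpoint(self, uow_id: str, commit: bool = True) -> List[str]:
        # Invoke only the managers that updated resources within this UOW.
        invoked: List[str] = []
        for name in self._touched.pop(uow_id, []):
            manager = self._managers[name]
            if commit:
                manager.commit(uow_id)
            else:
                manager.backout(uow_id)
            invoked.append(name)
        return invoked


rm = RecoveryManagerSketch()
file_control = LocalResourceManager("file control")
temp_storage = LocalResourceManager("temporary storage")
transient_data = LocalResourceManager("transient data")
for m in (file_control, temp_storage, transient_data):
    rm.register(m)

rm.note_update("UOW1", "file control")
rm.note_update("UOW1", "transient data")
rm.note_update("UOW1", "file control")  # repeated update, recorded once
invoked = rm.syncpoint("UOW1")
print(invoked)
```

Temporary storage is registered but never updated in the UOW, so the sketch does not invoke it at syncpoint.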

If the commit or backout of a file resource fails (for example, because of an I/O error or the inability of a resource manager to free a lock), the CICS recovery manager takes appropriate action with regard to the failed resource:

Note that a commit failure can occur during the commit phase of a completed UOW, or during the commit phase that takes place after backout has completed successfully. These two phases (or ‘directions’) of commit processing, commit after normal completion and commit after backout, are sometimes referred to as ‘forward commit’ and ‘backward commit’ respectively. Note also that a UOW can be backout-failed with respect to some resources and commit-failed with respect to others. This can happen, for example, if two data sets are updated and the UOW has to be backed out, and the following happens:

These events leave one data set commit-failed, and the other backout-failed. In this situation, the overall status of the UOW is logged as backout-failed.
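The rule in the example above, where a mix of commit-failed and backout-failed resources is logged as backout-failed, can be expressed as a small precedence function. The sketch below is an assumption-laden illustration: the function name and the `"completed"` label are hypothetical, and generalizing the precedence beyond the two-data-set example is an assumption rather than something stated in the text.

```python
from typing import Iterable


def overall_uow_status(resource_outcomes: Iterable[str]) -> str:
    """Derive one overall status from per-resource outcomes.

    Assumption: backout-failed takes precedence over commit-failed,
    as in the two-data-set example in the text.
    """
    outcomes = set(resource_outcomes)
    if "backout-failed" in outcomes:
        return "backout-failed"
    if "commit-failed" in outcomes:
        return "commit-failed"
    return "completed"  # hypothetical label for "no failures recorded"


status = overall_uow_status(["commit-failed", "backout-failed"])
print(status)
```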

During emergency restart following a CICS failure, each UOW and its state is reconstructed from the system log. If any UOW is in the backout-failed or commit-failed state, CICS automatically retries the UOW to complete the backout or commit.

Coordinating updates in distributed units of work

If the execution of a UOW is distributed across more than one system, the CICS recovery managers (or their non-CICS equivalents) in each pair of connected systems ensure that the effects of the distributed UOW are atomic (see footnote 1). Each CICS recovery manager (or its non-CICS equivalent) issues the requests necessary to effect two-phase syncpoint processing to each of the connected systems with which a UOW may be in conversation.

Note:
In this context, the non-CICS equivalent of a CICS recovery manager could be the recovery component of a database manager, such as DBCTL or DB2®, or any equivalent function where one of a pair of connected systems is not CICS.

In each connected system in a network, the CICS recovery manager uses interfaces to its local recovery manager connectors (RMCs) to communicate with partner recovery managers. The RMCs are the communication resource managers (LU6.2, LU6.1, MRO, and RMI), which understand the transport protocols and construct the flows between the connected systems.

As remote resources are accessed during UOW execution, the CICS recovery manager keeps track of data describing the status of its end of the conversation with that RMC. The CICS recovery manager also assumes responsibility for the coordination of two-phase syncpoint processing for the RMC.
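To make the idea of two-phase syncpoint processing more concrete, here is a generic two-phase commit sketch in Python. It is a textbook-style model under stated assumptions, not the flows that CICS or its RMCs actually exchange: the `Participant` class, its `prepare`, `commit`, and `backout` methods, and the outcome strings are all hypothetical.

```python
from typing import List


class Participant:
    """Hypothetical partner system reached through an RMC."""

    def __init__(self, name: str, vote_yes: bool = True) -> None:
        self.name = name
        self.vote_yes = vote_yes
        self.outcome = "in-flight"

    def prepare(self) -> bool:
        # Phase 1: the participant votes on whether it is able to commit.
        return self.vote_yes

    def commit(self) -> None:
        self.outcome = "committed"

    def backout(self) -> None:
        self.outcome = "backed-out"


def two_phase_syncpoint(participants: List[Participant]) -> str:
    # Phase 1: collect votes; all() stops at the first "no" vote.
    if all(p.prepare() for p in participants):
        decision = "commit"
    else:
        decision = "backout"
    # Phase 2: send the same decision to every participant, so the
    # distributed UOW is committed everywhere or backed out everywhere.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.backout()
    return decision


good = [Participant("SYSA"), Participant("SYSB")]
mixed = [Participant("SYSA"), Participant("SYSB", vote_yes=False)]
good_decision = two_phase_syncpoint(good)
mixed_decision = two_phase_syncpoint(mixed)
print(good_decision, mixed_decision)
```

A single "no" vote in phase 1 causes every participant to back out, which is what keeps the distributed UOW atomic in this model.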

Managing in-doubt units of work

During the syncpoint phases, for each RMC, the CICS recovery manager records the changes in the status of the conversation, and also writes, on behalf of the RMC, equivalent information to the system log.

If a session fails at any time during the running of a UOW, it is the RMC's responsibility to notify the CICS recovery manager, which takes appropriate action with regard to the unit of work as a whole. If the failure occurs during syncpoint processing, the CICS recovery manager may be in doubt and unable to determine immediately how to complete the UOW. In this case, the CICS recovery manager causes the UOW to be shunted awaiting UOW resolution, which follows notification from its RMC of successful resynchronization on the failed session.

During emergency restart following a CICS failure, each UOW and its state is reconstructed from the system log. If any UOW is in the in-doubt state, it remains shunted awaiting resolution.

Resynchronization after system or connection failure

Units of work that fail while in an in-doubt state remain shunted until the in-doubt state can be resolved following successful resynchronization with the coordinator.

Resynchronization takes place automatically when communications are next established between subordinate and coordinator. Any decisions held by the coordinator are passed to the subordinate, and in-doubt units of work complete normally. If a subordinate has meanwhile taken a unilateral decision following the loss of communication, this decision is compared with that taken by the coordinator, and messages report any discrepancy.
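The comparison made at resynchronization can be sketched as a small function. This Python example is illustrative only: the function name, the decision strings, and the wording of the message are hypothetical and do not reproduce any CICS message.

```python
from typing import Optional, Tuple


def resynchronize(
    coordinator_decision: str,
    unilateral_decision: Optional[str] = None,
) -> Tuple[str, Optional[str]]:
    """Return (outcome at the subordinate, discrepancy message or None)."""
    if unilateral_decision is None:
        # The UOW stayed shunted in doubt, so it now completes
        # with the decision held by the coordinator.
        return coordinator_decision, None
    if unilateral_decision == coordinator_decision:
        # The unilateral decision happened to match; nothing to report.
        return unilateral_decision, None
    # The subordinate has already acted, so the mismatch can only be reported.
    message = (
        f"subordinate took unilateral {unilateral_decision}, "
        f"coordinator decided {coordinator_decision}"
    )
    return unilateral_decision, message


shunted = resynchronize("commit")
matched = resynchronize("commit", "commit")
mismatch = resynchronize("commit", "backout")
print(shunted, matched, mismatch)
```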

For an explanation and illustration of the roles played by subordinate and coordinator CICS regions, and for information about recovery and resynchronization of distributed units of work generally, see the CICS Intercommunication Guide.


1. Atomic. A unit of work is said to be atomic when the changes it makes to resources within the UOW are either all committed or all backed out. See also ACID properties in the Glossary.
