Unit of work recovery

Units of work and transactions

A unit of work in CICS® is also the "unit of recovery"--that is, it’s the atomic component of the transaction in which any changes made either must all be committed, or must all be backed out.

A transaction can be composed of a single unit of work or multiple units of work. In CICS, recovery is managed at the unit of work level.

For recovery purposes, CICS recovery manager is concerned only with the units of work that have not yet completed a syncpoint because of some failure. This topic discusses how CICS handles these failed units of work.

The CICS recovery manager has to manage the recovery of the following types of unit of work failure:

In-flight-failed
The transaction fails before the current unit of work reaches a syncpoint, as a result either of a task abend, or the abnormal termination of CICS. The transaction is abnormally terminated, and recovery manager initiates backout of any changes made by the unit of work.

See Transaction backout.

Commit-failed
A unit of work fails during commit processing while taking a syncpoint. A partial copy of the unit of work is shunted to await retry of the commit process when the problem is resolved.

This does not cause the transaction to terminate abnormally.

See Commit-failed recovery.

Backout-failed
A unit of work fails while backing out updates to file control recoverable resources. (The concept of backout-failed applies in principle to any resource that performs backout recovery, but CICS file control is the only resource manager to provide backout failure support.) A partial copy of the unit of work is shunted to await retry of the backout process when the problem is resolved.
Note:
Although the failed backout may have been attempted as a result of the abnormal termination of a transaction, the backout failure itself does not cause the transaction to terminate abnormally.

For example, if a transaction initiates backout through an EXEC CICS SYNCPOINT ROLLBACK command, CICS returns a normal response (not an exception condition) and the transaction continues executing. It is up to recovery manager to ensure that locks are preserved until backout is eventually completed.

If some resources involved in a unit of work are backout-failed, while others are commit-failed, the UOW as a whole is flagged as backout-failed.

See Backout-failed recovery.

In-doubt-failed
A distributed unit of work fails while in the in-doubt state of the two-phase commit process (see footnote 2). The transaction is abnormally terminated. If further units of work would normally follow the one that failed in-doubt, they are not executed because of the abend.

A partial copy of the unit of work is shunted to await resynchronization when CICS re-establishes communication with its coordinator. This action happens only when the transaction resource definition specifies that units of work are to wait in the event of failure while in-doubt. If they are defined with WAIT(NO), CICS takes the action specified on the ACTION parameter, and the unit of work cannot become failed in-doubt.

See In-doubt failure recovery.
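
The four failure types above can be restated as a small lookup. The following Python sketch is illustrative only: the names UowFailure, Outcome, OUTCOMES, and overall_state are hypothetical and are not CICS interfaces. The values repeat what the descriptions above say, and the in-doubt entry assumes the transaction is defined to wait.

```python
from dataclasses import dataclass
from enum import Enum


class UowFailure(Enum):
    IN_FLIGHT = "in-flight-failed"
    COMMIT = "commit-failed"
    BACKOUT = "backout-failed"
    IN_DOUBT = "in-doubt-failed"


@dataclass(frozen=True)
class Outcome:
    transaction_abends: bool  # is the transaction abnormally terminated?
    shunted: bool             # is a partial copy of the UOW shunted?
    next_step: str            # what happens to the UOW next


# Hypothetical summary table built from the descriptions above.
OUTCOMES = {
    UowFailure.IN_FLIGHT: Outcome(True, False, "backout of changes"),
    UowFailure.COMMIT: Outcome(False, True, "retry of commit"),
    UowFailure.BACKOUT: Outcome(False, True, "retry of backout"),
    # Shunting applies only when the transaction is defined to wait;
    # with WAIT(NO) the ACTION parameter decides instead.
    UowFailure.IN_DOUBT: Outcome(True, True, "resynchronization"),
}


def overall_state(resource_states):
    """Return the flag for a UOW whose resources failed in mixed ways.

    Backout-failed takes precedence over commit-failed.
    """
    states = set(resource_states)
    if UowFailure.BACKOUT in states:
        return UowFailure.BACKOUT
    if UowFailure.COMMIT in states:
        return UowFailure.COMMIT
    return None
```

For example, overall_state([UowFailure.COMMIT, UowFailure.BACKOUT]) returns UowFailure.BACKOUT, matching the rule that a mixed unit of work is flagged as backout-failed.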

Transaction backout

If the resources updated by a failed unit of work are defined as recoverable, CICS automatically performs transaction backout of all uncommitted changes to the recoverable resources. Transaction backout is mandatory and automatic--there isn’t an option on the transaction resource definition allowing you to control this. You can, however, control backout of the resources on which your transactions operate by defining whether or not they are recoverable.

In transaction backout, CICS restores the resources specified as recoverable to the state they were in at the beginning of the interrupted unit of work (that is, at start of task or completion of the most recent synchronization point). The resources are thus restored to a consistent state.
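
The idea of restoring recoverable resources to their state at the start of the unit of work can be sketched with a toy before-image log. This is a conceptual model only, not how CICS is implemented: the class UnitOfWork and its methods are hypothetical, and resources are reduced to entries in a Python dictionary.

```python
class UnitOfWork:
    """Toy model: keep before-images so updates can be backed out."""

    _ABSENT = object()  # marks a resource that did not exist before the UOW

    def __init__(self, store, recoverable):
        self.store = store              # dict: resource name -> value
        self.recoverable = recoverable  # names defined as recoverable
        self.before_images = {}         # value at start of the current UOW

    def update(self, name, value):
        # Save the before-image only once per UOW, and only for
        # recoverable resources.
        if name in self.recoverable and name not in self.before_images:
            self.before_images[name] = self.store.get(name, self._ABSENT)
        self.store[name] = value

    def syncpoint(self):
        # Changes are committed; a new unit of work starts here.
        self.before_images.clear()

    def backout(self):
        # Restore every recoverable resource to its state at the start of
        # the interrupted UOW. Nonrecoverable resources are left as they are.
        for name, old in self.before_images.items():
            if old is self._ABSENT:
                self.store.pop(name, None)
            else:
                self.store[name] = old
        self.before_images.clear()
```

In this model, a backout after a syncpoint returns recoverable resources to their values at that syncpoint, while changes to nonrecoverable resources stay in place.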

In general, the same process of transaction backout is used for individual units of work that abend while CICS is running and for in-flight tasks recovered during emergency restart. One difference is that dynamic backout of a single abnormally terminating transaction takes place immediately. Therefore, it does not cause any active locks to be converted into retained locks. In the case of a CICS region abend, in-flight tasks have to wait to be backed out when CICS is restarted, during which time the locks are retained to protect uncommitted resources.

To restore the resources to the state they were in at the beginning of the unit of work, CICS preserves a description of their state at that time:

This topic discusses the way the individual resource managers handle their part of the backout process in terms of the following resources:

Files

CICS file control is presented with the log records of all the recoverable files that have to be backed out. File control:

If backout fails for any file-control-managed resources, file control invokes backout failure support before the unit of work is marked as backout-failed. See Backout-failed recovery.

BDAM files and VSAM ESDS files

In the special case of the file access methods that do not support delete requests (VSAM ESDS and BDAM) CICS cannot remove new records added by the unit of work. In this case, CICS invokes the global user exit program enabled at the XFCLDEL exit point whenever a WRITE to a VSAM ESDS, or to a BDAM data set, is being backed out. This enables your exit program to perform a logical delete by amending the record in some way that flags it as deleted.

If you do not have an XFCLDEL exit program, CICS handles the unit of work as backout-failed, and shunts the unit of work to be retried later (see Backout-failed recovery). For information about resolving backout failures, see Logical delete not performed.

Such flagged records can be physically deleted when you subsequently reorganize the data set offline with a utility program.
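
As a rough illustration of the logical delete idea, the sketch below marks a record as deleted and later drops flagged records during an offline reorganization. It is a conceptual model in Python, not an XFCLDEL exit program. The one-byte status flag at the start of each record and the function names are assumptions made for this example; the real flagging convention is chosen by the installation.

```python
# Hypothetical record layout: byte 0 is a status flag chosen by the
# application ("A" = active, "D" = logically deleted).
ACTIVE_FLAG = b"A"
DELETED_FLAG = b"D"


def logical_delete(record: bytes) -> bytes:
    """Return the record with its status byte set to 'deleted'.

    The record length is unchanged, because the access method cannot
    remove the record.
    """
    return DELETED_FLAG + record[1:]


def reorganize(records):
    """Model of an offline reorganization: drop flagged records."""
    return [r for r in records if not r.startswith(DELETED_FLAG)]
```

Applications that read such a data set would also need to skip records carrying the deleted flag until the reorganization has run.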

CICS data tables

For CICS-maintained data tables, the updates made to the source VSAM data set are backed out. For user-maintained data tables, the in-storage data is backed out.

Intrapartition transient data

Intrapartition destinations specified as logically recoverable are restored by transaction backout. Read and write pointers are restored to what they were before the transaction failure occurred.

Physically recoverable queues are recovered on warm and emergency restarts.

Transient data does not provide any support for the concept of transaction backout, which means that:

CICS does not support recovery of extrapartition queues.

Auxiliary temporary storage

CICS transaction backout backs out updates to auxiliary temporary storage queues if they are defined as recoverable in a temporary storage table. Read and write pointers are restored to what they were before the transaction failure occurred.

CICS does not back out changes to temporary storage queues held in main storage or in a TS server temporary storage pool.

START requests

Recovery of EXEC CICS START requests during transaction backout depends on some of the options specified on the request. The options that affect recoverability are:

PROTECT
This option effectively causes the start request to be treated like any other recoverable resource, and the request is committed only when the task issuing the START takes a syncpoint. It ensures that the new task cannot be attached for execution until the START request is committed.
FROM, QUEUE, RTERMID, RTRANSID
These options pass data to the started task using temporary storage.

When designing your applications, consider the recoverability of data that is being passed to a started transaction.

Recovery of START requests during transaction backout is described below for different combinations of these options.

START with no data (no PROTECT)
Transaction backout does not affect the START request. The new task will start at its specified time (and could already be executing when the task issuing the START command is backed out). Abending the task that issued the START does not abend the started task.
START with no data (PROTECT)
Transaction backout of the task issuing the START command causes the START request also to be backed out (canceled). If the abended transaction is restarted, it can safely reissue the START command without risk of duplication.
START with recoverable data (no PROTECT)
Transaction backout of the task issuing the START also backs out the data intended for the started task, but does not back out the START request itself. Thus the new task will start at its specified time, but the data will not be available to the started task, to which CICS will return a NOTFND condition in response to the RETRIEVE command.
START with recoverable data (PROTECT)
Transaction backout of the task issuing the START command causes the START request and the associated data to be backed out. If the abended transaction is restarted, it can safely reissue the START command without risk of duplication.
START with nonrecoverable data (no PROTECT)
Transaction backout of the task issuing the START does not back out either the START request or the data intended for the started task. Thus the new task will start at its specified time, and the data will be available, regardless of the abend of the issuing task.
START with nonrecoverable data (PROTECT)
Transaction backout of the task issuing the START command causes the START request to be canceled, but not the associated data, which is left stranded in temporary storage.
Note:
Recovery of temporary storage (whether or not PROTECT is specified) does not cause the new task to start immediately. (It may qualify for restart like any other task, if RESTART(YES) is specified on the transaction resource definition.) On emergency restart, a started task is restarted only if it was started with data written to a recoverable temporary storage queue.
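
The six combinations above reduce to two independent rules: the START request is backed out only if PROTECT is specified, and the data is backed out only if it is recoverable. The Python sketch below restates this; the function and type names are hypothetical and are used only for illustration.

```python
from typing import NamedTuple, Optional


class StartBackout(NamedTuple):
    request_backed_out: bool  # is the START request itself canceled?
    data_backed_out: bool     # is the data for the started task removed?


def start_backout(protect: bool, data: Optional[str]) -> StartBackout:
    """Effect of backing out the task that issued the START.

    data is None (no data), "recoverable", or "nonrecoverable".
    """
    if data not in (None, "recoverable", "nonrecoverable"):
        raise ValueError(f"unexpected data kind: {data!r}")
    return StartBackout(
        request_backed_out=protect,
        data_backed_out=(data == "recoverable"),
    )
```

Two combinations deserve attention in application design. With recoverable data and no PROTECT, the started task still runs but finds no data. With nonrecoverable data and PROTECT, the request is canceled but the data is left stranded.
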
Restart of started transactions

Started transactions that are defined with RESTART(YES) are eligible for restart only in certain circumstances. The effect of RESTART(NO) and RESTART(YES) on started transactions is shown in Table 2.

Table 2. Effect of RESTART option on started transactions
Description of START command | Events | Effect of RESTART(YES) | Effect of RESTART(NO)
Specifies either recoverable or nonrecoverable data | Started task ends normally, but does not retrieve data | START request and its data (TS queue) are discarded at normal end. | START request and its data (TS queue) are discarded at normal end.
Specifies recoverable data | Started task abends after retrieving its data | START request and its data are recovered and restarted, up to n times (see note 1). | START request and its data are discarded.
Specifies recoverable data | Started task abends without retrieving its data | START request and its data are recovered and restarted, up to n times (see note 1). | START request and its data are discarded.
Specifies nonrecoverable data | Started task abends after retrieving its data | START request is discarded and not restarted. | Not restarted.
Specifies nonrecoverable data | Started task abends without retrieving its data | Transaction is restarted with its data still available, up to n times (see note 1). | START request and its data are discarded.
Without data | Started task abends | Transaction is restarted up to n times (see note 1). | --

Note 1: n is defined in the transaction restart program, DFHREST, where the CICS-supplied default is 20.
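
The restart rules in Table 2 can be condensed into a single predicate. The sketch below is a hypothetical Python restatement of the table, not the logic of DFHREST itself. The function name, its parameters, and the restart counter are assumptions made for this example.

```python
# CICS-supplied default for n in DFHREST, as stated in the note to Table 2.
DEFAULT_MAX_RESTARTS = 20


def is_restarted(data, retrieved, abended, restart_yes,
                 restarts_so_far=0, max_restarts=DEFAULT_MAX_RESTARTS):
    """Return True if Table 2 says the started transaction is restarted.

    data is None (no data), "recoverable", or "nonrecoverable".
    retrieved says whether the started task had retrieved its data.
    """
    if not abended:
        return False  # normal end: request and data are discarded
    if not restart_yes:
        return False  # RESTART(NO): never restarted
    if data == "nonrecoverable" and retrieved:
        return False  # the data has gone, so the request is discarded
    return restarts_so_far < max_restarts
```

The only abend case that RESTART(YES) does not cover is a task that has already retrieved nonrecoverable data.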

EXEC CICS CANCEL requests

Recovery from CANCEL requests during transaction backout depends on whether:

During transaction backout of a failed task that has canceled a START request that has recoverable data associated with it, CICS recovers both the temporary storage queue and the start request. Thus the effect of the recovery is as if the CANCEL command had never been issued.

If there is no data associated with the START command, or if the temporary storage queue is not recoverable, neither the canceled started task nor its data is recovered, and it stays canceled.

Basic mapping support (BMS) messages

Recovery of BMS messages affects those BMS operations that store data on temporary storage. They are:

Backout of these BMS operations is based on backing out START requests because, internally, BMS uses the START mechanism to implement the operations listed above. You request backout of these operations by making the BMS temporary storage queues recoverable, by defining their DATAIDs in the temporary storage table. For more information about the temporary storage table, see the CICS Resource Definition Guide.

Application programmers can override the default temporary storage DATAIDs by specifying the following operands:

Note:
If backout fails, CICS does not try to restart regardless of the setting of the restart program.

Backout-failed recovery

In principle, backout failure support can apply to any resource that performs backout, but the support is currently provided only by CICS file control.

Files

If backout to a VSAM data set fails for any reason, CICS:

If a unit of work updates more than one data set, the backout may fail for only one, or some, of the data sets. When this occurs, CICS converts to retained locks only those locks held by the unit of work for the data sets for which backout has failed. When the unit of work is shunted, CICS releases the locks for records in data sets that are backed out successfully. The log records for the updates made to the data sets that fail backout are kept for the subsequent backout retry. CICS does not keep the log records that are successfully backed out.

For a given data set, it is not possible for some of the records updated by a unit of work to fail backout and for other records not to fail. For example, if a unit of work updates several records in the same data set, and backout of one record fails, they are all deemed to have failed backout. The backout failure exit is invoked once only within a unit of work, and the backout failure message is issued once only, for each data set that fails backout. However, if the backout is retried and fails again, the exit is reinvoked and the message is issued again.
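
The per-data-set granularity described above can be sketched as follows. This Python function is a conceptual model with hypothetical names. It takes the records updated by a unit of work and the records whose backout failed, and reports which data sets are treated as backout-failed, which locks become retained, and which are released.

```python
def classify_backout(updates, failed_records):
    """Split a UOW's record locks after a partial backout failure.

    updates: iterable of (data_set, key) pairs updated by the UOW.
    failed_records: set of (data_set, key) pairs whose backout failed.
    Returns (failed_data_sets, retained_locks, released_locks).
    """
    # One failing record makes the whole data set backout-failed
    # for this unit of work.
    failed_data_sets = {data_set for data_set, _key in failed_records}

    retained, released = [], []
    for data_set, key in updates:
        if data_set in failed_data_sets:
            retained.append((data_set, key))  # kept for backout retry
        else:
            released.append((data_set, key))  # backed out successfully
    return failed_data_sets, retained, released
```

For example, if a unit of work updates two records in data set A and one in data set B, and backout fails for one record in A, both locks on A are retained and the lock on B is released.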

For BDAM data sets, there is only limited backout failure support: the backout failure exit, XFCBFAIL, is invoked (if enabled) to take installation-defined action, and message DFHFC4702 is issued.

Retrying backout-failed units of work

Backout retry for a backout-failed data set can be driven manually (using the SET DSNAME RETRY command) or, in many situations, occurs automatically when the cause of the failure has been resolved (see Possible reasons for VSAM backout failure). When CICS performs backout retry for a data set, any backout-failed UOWs that are shunted because of backout failures on that data set are unshunted (see footnote 4), and the recovery manager passes the log records for that data set to file control. File control attempts to back out the updates represented by the log records and, if the original cause of the backout failure is now resolved, the backout retry succeeds. If the cause of a backout failure is not resolved, the backout fails again, and backout failure support is reinvoked.
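
The retry flow can be pictured with the following Python sketch. It is a simplified model under stated assumptions: shunted units of work are held in a dictionary, and try_backout is a hypothetical callback standing in for file control's attempt to back out the log records. None of these names are CICS interfaces.

```python
def retry_backout(data_set, shunted, try_backout):
    """Model a backout retry for one data set.

    shunted: dict mapping UOW id -> {data_set: [log records]} for the
             data sets on which that UOW is backout-failed.
    try_backout: callable(data_set, records) -> bool, True on success.
    Returns the ids of UOWs that are still backout-failed on data_set.
    """
    still_failed = []
    for uow_id, logs in shunted.items():
        records = logs.get(data_set)
        if records is None:
            continue  # this UOW did not fail backout on this data set
        if try_backout(data_set, records):
            del logs[data_set]  # backed out; log records no longer kept
        else:
            still_failed.append(uow_id)  # failure support is reinvoked
    return still_failed
```

If the callback reports failure again, the log records stay in place, so a later retry sees the same input.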

Disposition of data sets after backout failures

Because individual records are locked when a backout failure occurs, CICS need not set the entire data set into a backout-failed condition. CICS may be able to continue using the data set, with only the locked records being unavailable. Some kinds of backout failure can be corrected without any need to take the data set offline (that is, without needing to stop all current use of the data set and prevent further access). Even for those failures that cannot be corrected with the data set online, it may still be preferable to schedule the repair at some future time and to continue to use the data set in the meantime, if this is possible.

Possible reasons for VSAM backout failure

There are many reasons why backout can fail, and these are described in this topic. In general, each of these descriptions corresponds with a REASON returned on an INQUIRE UOWDSNFAIL command.

I/O error
You must take the data set offline to repair it, but there may be occasions when the problem is localized and use of the data set can continue until it is convenient to carry out the repair.

Message DFHFC4701 with a failure code of X'24' indicates that an I/O error (a physical media error) has occurred while backing out a VSAM data set. This indicates that there is some problem with the data set, but it may be that the problem is localized. A better indication of the state of a data set is given by message DFHFC0157 (followed by DFHFC0158), which CICS issues whenever an I/O error occurs (not just during backout). Depending on the data set concerned, and other factors, your policy may be to repair the data set:

It might be worth initially deciding to leave a data set online for some time after a backout failure, to evaluate the level of impact the failures have on users.

To recover from a media failure, recreate the data set by applying forward recovery logs to the latest backup. The steps you take depend on whether the data set is opened in RLS or non-RLS mode:

Logical delete not performed
This error occurs if, during backout of a write to an ESDS, the XFCLDEL logical delete exit was either not enabled, or requested that the backout be handled as a backout failure.

You can correct this by enabling a suitable exit program and manually retrying the backout. There is no need to take the data set offline.

Open error
Investigate the cause of any error that occurs in a file open operation. A data set is normally already open during dynamic backout, so an open error should occur only during backout processing if the backout is being retried, or is being carried out following an emergency restart. Some possible causes are:

For other cases, manually retry the backout after the cause of the problem has been resolved. There is no need to take the data set offline.

SMSVSAM server failure
This error can occur only for VSAM data sets opened in RLS access mode. The failure of the SMSVSAM server might be detected by the backout request, in which case CICS file control starts to close the failed SMSVSAM control ACB and issues a console message. If the failure has already been detected by some other (earlier) request, CICS has already started to close the SMSVSAM control ACB when the backout request fails.

The backout is normally retried automatically when the SMSVSAM server becomes available. (See Dynamic RLS restart.) There is no need to take the data set offline.

SMSVSAM server recycle during backout
This error can occur only for VSAM data sets opened in RLS access mode.

This is an extremely unlikely cause of a backout failure. CICS issues message DFHFC4701 with failure code X'C2'. Retry the backout manually: there is no need to take the data set offline.

Coupling facility cache structure failure
This error can occur only for VSAM data sets opened in RLS access mode. The cache structure to which the data set is bound has failed, and VSAM has been unable to rebuild the cache, or to re-bind the data set to an alternative cache.

The backout is retried automatically when a cache becomes available again. (See Cache failure support.) There is no need to take the data set offline.

DFSMSdss non-BWO backup in progress
This error can occur only for VSAM data sets opened in RLS access mode.

DFSMSdss makes use of the VSAM quiesce protocols when taking non-BWO backups of data sets that are open in RLS mode. While a non-BWO backup is in progress, the data set does not need to be closed, but updates to the data set are not allowed. This error means that the backout request was rejected because it was issued while a non-BWO backup was in progress.

The backout is retried automatically when the non-BWO backup completes.

Data set full
The data set ran out of storage during backout processing.

Take the data set offline to reallocate it with more space. (See Moving recoverable data sets that have retained locks for information about preserving retained locks in this situation.) You can then retry the backout manually, using the CEMT, or EXEC CICS, SET DSNAME(...) RETRY command.

Non-unique alternate index full
Take the data set offline to rebuild the data set with a larger record size for the alternate index. (See Moving recoverable data sets that have retained locks for information about preserving retained locks in this situation.) You can then retry the backout manually, using the CEMT, or EXEC CICS, SET DSNAME(...) RETRY command.

Deadlock detected
This error can occur only for VSAM data sets opened in non-RLS access mode.

This is a transient condition, and a manual retry should enable backout to complete successfully. There is no need to take the data set offline.

Duplicate key error
The backout involved adding a duplicate key value to a unique alternate index. This error can occur only for VSAM data sets opened in non-RLS access mode.

This situation can be resolved only by deleting the rival record with the duplicate key value.

Lock structure full error
The backout required VSAM to acquire a lock for internal processing, but it was unable to do so because the RLS lock structure was full. This error can occur only for VSAM data sets opened in RLS access mode.

To resolve the situation, you must allocate a larger lock structure in an available coupling facility, and rebuild the existing lock structure into the new one. The failed backout can then be retried using SET DSNAME RETRY.

None of the above
If any other error occurs, it indicates a possible error in CICS or VSAM code, or a storage overwrite in the CICS region. Diagnostic information is given in message DFHFC4700, and a system dump is provided.

If the problem is only transient, a manual retry of the backout should succeed.
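
For operational runbooks, the guidance above can be condensed into a lookup of how each failure is retried and whether the data set has to be taken offline. The Python table below is a hypothetical summary. Its keys are the headings used in this topic, not the REASON values returned by INQUIRE UOWDSNFAIL, and None means that the text above does not state the answer.

```python
from typing import NamedTuple, Optional


class Handling(NamedTuple):
    retry: Optional[str]     # "automatic", "manual", or None if not stated
    offline: Optional[bool]  # must the data set go offline? None if not stated


HANDLING = {
    # Offline repair is needed, though it can sometimes be deferred.
    "I/O error": Handling(None, True),
    "Logical delete not performed": Handling("manual", False),
    # Manual retry applies to the cases not otherwise listed for open errors.
    "Open error": Handling("manual", False),
    "SMSVSAM server failure": Handling("automatic", False),
    "SMSVSAM server recycle during backout": Handling("manual", False),
    "Coupling facility cache structure failure": Handling("automatic", False),
    "DFSMSdss non-BWO backup in progress": Handling("automatic", None),
    "Data set full": Handling("manual", True),
    "Non-unique alternate index full": Handling("manual", True),
    "Deadlock detected": Handling("manual", False),
    # Resolved only by deleting the rival record.
    "Duplicate key error": Handling(None, None),
    # Needs a larger lock structure first, then SET DSNAME RETRY.
    "Lock structure full error": Handling("manual", None),
}


def needs_operator_retry(reason: str) -> bool:
    """True if the text above calls for a manual retry for this reason."""
    return HANDLING[reason].retry == "manual"
```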

Auxiliary temporary storage

All updates to recoverable auxiliary temporary storage queues are managed in main storage until syncpoint. TS always commits forwards; therefore TS can never suffer a backout failure.

Transient data

All updates to logically recoverable intrapartition queues are managed in main storage until syncpoint, or until a buffer must be flushed because all buffers are in use. TD always commits forwards; therefore, TD can never suffer a backout failure on DFHINTRA.

Commit-failed recovery

Commit failure support is provided only by CICS file control, because it is the only CICS component that needs this support.

Files

A commit failure is one that occurs during the commit stage of a unit of work (either following the prepare phase of two-phase commit, or following backout of the unit of work). It means that the unit of work has not yet completed, and the commit must be retried successfully before the recovery manager can forget about the unit of work.

When a failure occurs during file control’s commit processing, CICS ensures that all the unit of work log records for updates made to data sets that have suffered the commit failure are kept by the recovery manager. Preserving the log records ensures that the commit processing for the unit of work can be retried later when conditions are favorable.

The most likely cause of a file control commit failure, from which a unit of work can recover, is that the SMSVSAM server is not available when file control is attempting to release the RLS locks. When other SMSVSAM servers in the sysplex detect that a server has failed, they retain all the active exclusive locks held by the failed server on its behalf. Therefore, CICS does not need to retain locks explicitly when a commit failure occurs. When the SMSVSAM server becomes available again, the commit is automatically retried.

However, it is also possible for a file control commit failure to occur as a result of some other error when CICS is attempting to release RLS locks during commit processing, or is attempting to convert some of the locks into retained locks during the commit processing that follows a backout failure. In this case it may be necessary to retry the commit explicitly using the SET DSNAME RETRY command. Such failures should be rare, and may be indicative of a more serious problem.

It is possible for a unit of work that has not performed any recoverable work, but which has performed repeatable reads, to suffer a commit failure. If the SMSVSAM server fails while holding locks for repeatable read requests, it is possible to access the records when the server recovers, because all repeatable read locks are released at the point of failure. If the commit failure is not due to a server failure, the locks are held as active shared locks. The INQUIRE UOWDSNFAIL command distinguishes between a commit failure where recoverable work was performed, and one for which only repeatable read locks were held.

In-doubt failure recovery

The CICS recovery manager is responsible for maintaining the state of each unit of work in a CICS region. For example, typical events that cause a change in the state of a unit of work are temporary suspension and resumption, receipt of syncpoint requests, and entry into the in-doubt period during two-phase commit processing.

The CICS recovery manager shunts a unit of work if all the following conditions apply:

Files

When file control shunts its resources for the unit of work, it detects that the shunt is being issued during the first phase of two-phase commit, indicating an in-doubt failure. Any active exclusive lock held against a data set updated by the unit of work is converted into a retained lock. The result of this action is as follows:

For information about types of locks, see Locks.

For data sets opened in RLS mode, interfaces to VSAM RLS are used to retain the locks. For VSAM data sets opened in non-RLS mode, and for BDAM data sets, the CICS enqueue domain provides an equivalent function. It is not possible for some of the data sets updated in a particular unit of work to be failed in-doubt and for the others not to be.

It is possible for a unit of work that has not performed any recoverable work, but which has performed repeatable reads, to be shunted when an in-doubt failure occurs. In this event, repeatable read locks are released. Therefore, for any data set against which only repeatable reads were issued, it is possible to access the records, and to open the data set in non-RLS mode for batch processing, despite the existence of the in-doubt failure. The INQUIRE UOWDSNFAIL command distinguishes between an in-doubt failure where recoverable work has been performed, and one for which only repeatable read locks were held.

If you want to open the data set in non-RLS mode in CICS, you need to resolve the in-doubt failure before you can define the file as having RLSACCESS(NO). If the unit of work has updated any other data sets, or any other resources, you should try to resolve the in-doubt correctly, but if the unit of work has only performed repeatable reads against VSAM data sets and has made no updates to other resources, it is safe to force the unit of work using the SET DSNAME or SET UOW commands.

CICS saves enough information about the unit of work to allow it to be either committed or backed out once the in-doubt unit of work is unshunted, which happens when the coordinator provides the resolution (or when the transaction wait time expires). This information includes the log records written by the unit of work.

When CICS has re-established communication with the coordinator for the unit of work, it can resynchronize all in-doubt units of work. This involves CICS first unshunting the units of work, and then proceeding with the commit or backout. All CICS enqueues and VSAM RLS record locks are released, unless a commit failure or backout failure occurs.

For information about the resynchronization process for units of work that fail in-doubt, see the CICS Intercommunication Guide.

Intrapartition transient data

When a UOW that has updated a logically recoverable intrapartition transient data queue fails in-doubt, CICS converts the locks held against the TD queue to retained locks. Until the UOW is unshunted, the default action is to reject with the LOCKED condition further requests of the following types:

You can use the WAITACTION option on the TD queue resource definition to control the action that CICS takes when a request is made against a queue that has been updated by a shunted in-doubt UOW. In addition to the default option, which is WAITACTION(REJECT), you can specify WAITACTION(QUEUE) to queue further requests while the queue is locked by the failed in-doubt UOW.

After resynchronization, the shunted updates to the TD queue are either committed or backed out, and the retained locks are released.
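
The difference between the two WAITACTION settings can be illustrated with a toy model. The Python class below is hypothetical and much simpler than CICS transient data. It assumes that write requests are among the requests affected by the retained lock, and that requests queued under WAITACTION(QUEUE) proceed once the lock is released; both are assumptions made for this sketch.

```python
from collections import deque


class LockedError(Exception):
    """Stands in for the LOCKED condition in this model."""


class TdQueueModel:
    def __init__(self, waitaction="REJECT"):
        if waitaction not in ("REJECT", "QUEUE"):
            raise ValueError("waitaction must be REJECT or QUEUE")
        self.waitaction = waitaction
        self.retained_lock = False
        self.items = []          # committed queue contents
        self.waiting = deque()   # requests held while the queue is locked

    def fail_indoubt(self):
        # The updating UOW fails in-doubt: its locks become retained locks.
        self.retained_lock = True

    def write(self, item):
        if not self.retained_lock:
            self.items.append(item)
            return "written"
        if self.waitaction == "REJECT":
            raise LockedError("queue locked by a failed in-doubt UOW")
        self.waiting.append(item)  # WAITACTION(QUEUE): hold the request
        return "queued"

    def resynchronize(self):
        # The in-doubt UOW is resolved: release the retained lock and let
        # any held requests proceed (an assumption of this model).
        self.retained_lock = False
        while self.waiting:
            self.items.append(self.waiting.popleft())
```

With the default setting an application must be prepared to handle the rejection, whereas with queuing it simply waits longer.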

Auxiliary temporary storage

When a UOW that has updated a recoverable temporary storage queue fails in-doubt, the locks held against the queue are converted to retained locks. Until the UOW is unshunted, further update requests against the locked queue items are rejected with the LOCKED condition.

After resynchronization, the shunted updates to the TS queue are either committed or backed out, and the retained locks are released.


2. Two-phase commit. The protocol used by CICS when taking a syncpoint in a distributed unit of work, where the first prepare phase is followed by the actual commit phase. See two-phase commit in the "Glossary".
3. Shunting. The process of suspending a unit of work in order to allow time to resolve the problem that has caused the suspension. Shunting releases the user’s terminal, virtual storage, and CP resources, and allows completion of the unit of work to be deferred for as long as necessary.
4. Unshunting. The process of attaching a transaction to provide an environment under which to resume the processing of a shunted unit of work.
