six demon bag

Wind, fire, all that kind of thing!

2014-03-26

Cleanup of DB2 Backups Fails with Return Code 136

Recently I encountered a rather weird problem with the cleanup of backups of some of our DB2 databases. The database backups are done via TSM by running the following command:

db2 "backup db DBNAME online use tsm"

Cleanup of obsolete backups is done by running the following commands via a scheduled task:

db2adutl delete full older than TIMESTAMP db DBNAME without prompting
db2adutl delete logs between S0000000.LOG and S(xxxxxxx-1).LOG db DBNAME without prompting
db2 "connect to DBNAME"
db2 "prune history DATE and delete"
db2 "prune logfile prior to Sxxxxxxx.LOG"
db2 "connect reset"

Sxxxxxxx.LOG is the oldest log of the oldest backup to be kept, which is extracted from the output of db2adutl query full db DBNAME. S(xxxxxxx-1).LOG is that log number minus one.
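The "minus one" step can be sketched as follows. This is only an illustration, not the script we actually run: the function name `previous_log` is made up for this example, and it assumes log names always follow the `S` + seven digits + `.LOG` pattern seen above.

```python
import re

# Assumption: archived log names look like S0014554.LOG (S, seven digits, .LOG).
LOG_NAME = re.compile(r"^S(\d{7})\.LOG$")


def previous_log(name: str) -> str:
    """Return the log file name that directly precedes `name`.

    Hypothetical helper for illustration; not part of DB2 or db2adutl.
    """
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a DB2 log file name: {name!r}")
    number = int(match.group(1))
    if number == 0:
        # S0000000.LOG is the first log, so there is nothing before it.
        raise ValueError("S0000000.LOG has no predecessor")
    # Keep the seven-digit zero padding of the original name.
    return f"S{number - 1:07d}.LOG"


print(previous_log("S0014554.LOG"))  # S0014553.LOG
```

The result is what goes into the upper bound of the `db2adutl delete logs between ...` command, while the unmodified name is used for `prune logfile prior to`.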

Although this setup had been working for several years without problems and no changes were made to the system, the log cleanup step suddenly started to fail for some databases while it still worked fine for the others. Both querying and deleting the archived logs produced TSM errors:


C:\>db2adutl query logs db DBNAME verbose

Query for database DBNAME
Retrieving LOG ARCHIVE information.

Error: Get next image failed with TSM return code 122

   Log file: S0014324.LOG, Chain Num: 13, DB Partition Number: 0, Taken at: 2014-02-20-00...
   Log file: S0014325.LOG, Chain Num: 13, DB Partition Number: 0, Taken at: 2014-02-20-00...
[...]
   Log file: S0013238.LOG, Chain Num: 11, DB Partition Number: 0, Taken at: 2014-03-12-14...
   Log file: S0013239.LOG, Chain Num: 11, DB Partition Number: 0, Taken at: 2014-03-12-14...


C:\>db2adutl delete logs between S0000000.LOG and S0014553.LOG db DBNAME without prompting

Query for database DBNAME
Retrieving LOG ARCHIVE information.

Error: Get next image failed with TSM return code 122

   Log file: S0014324.LOG, Chain Num: 13, DB Partition Number: 0, Taken at: 2014-02-20-00.25.39
Error: dsmBeginQuery failed with TSM return code 136
Error: dsmBeginTxn failed with TSM return code 136
   Log file: S0014325.LOG, Chain Num: 13, DB Partition Number: 0, Taken at: 2014-02-20-00.26.18
Error: dsmBeginQuery failed with TSM return code 136
Error: dsmBeginTxn failed with TSM return code 136
[...]
   Log file: S0013238.LOG, Chain Num: 11, DB Partition Number: 0, Taken at: 2014-03-12-14.45.11
Error: dsmBeginQuery failed with TSM return code 136
Error: dsmBeginTxn failed with TSM return code 136
   Log file: S0013239.LOG, Chain Num: 11, DB Partition Number: 0, Taken at: 2014-03-12-14.46.10
Error: dsmBeginQuery failed with TSM return code 136
Error: dsmBeginTxn failed with TSM return code 136

The return codes weren't really helpful. According to the documentation, return code 122 means "unknown format" and return code 136 means "communication protocol error", neither of which gave us a clear indication of the cause of the problem.
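Since the failures only showed up in the command output, a scheduled task could at least count the error lines to notice when something is off. A rough sketch, assuming the output has been captured as text; the names `summarize_tsm_errors`, `MEANINGS` and `sample` are invented for this example, and the two meanings are simply the ones quoted above, not a complete list:

```python
import re
from collections import Counter

# Matches lines such as "Error: dsmBeginQuery failed with TSM return code 136".
ERROR_LINE = re.compile(r"^Error: (.+?) failed with TSM return code (\d+)$")

# Only the two codes discussed in this post, with the meanings quoted above.
MEANINGS = {122: "unknown format", 136: "communication protocol error"}


def summarize_tsm_errors(output: str) -> Counter:
    """Count (failing call, return code) pairs in captured db2adutl output."""
    counts = Counter()
    for line in output.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Shortened excerpt of the output shown above.
sample = """\
Query for database DBNAME
Retrieving LOG ARCHIVE information.

Error: Get next image failed with TSM return code 122

   Log file: S0014324.LOG, Chain Num: 13, DB Partition Number: 0, Taken at: 2014-02-20-00.25.39
Error: dsmBeginQuery failed with TSM return code 136
Error: dsmBeginTxn failed with TSM return code 136
"""

for (call, code), count in summarize_tsm_errors(sample).items():
    meaning = MEANINGS.get(code, "meaning not listed here")
    print(f"{call}: return code {code} ({meaning}), seen {count} time(s)")
```

Any non-empty result would be a reason to have the task raise an alert instead of finishing silently.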

We opened a case with IBM and got a response stating that other customers had reported failures with return code 136 when backups had been made with a more recent TSM version than the one doing the cleanup. A blog post I found while researching the error also mentioned a version difference as the cause of a similar issue. We therefore decided to update DB2 to the latest fixpack and the TSM client to the latest release. The issue went away after that.

Posted 21:10