Friday, August 3, 2012

How do I tune to reduce the log file sync wait events?

A commit is not complete until LGWR has written the log buffer contents, including the commit redo records, to the log files. In a nutshell, after posting LGWR to write, the user or background process waits for LGWR to signal back, with a 1-second timeout. The user process charges this wait time to the ‘log file sync’ event.

The root causes of ‘log file sync’ waits essentially boil down to a few scenarios. The following list is not exhaustive, by any means!

1. LGWR is unable to complete writes fast enough for one of the following reasons:
  • Disk I/O performance to the log files is not good enough. Even though LGWR can use asynchronous I/O, redo log files are opened with the DSYNC flag, so buffers must be flushed to disk (or at least written to the disk array cache, in the case of a SAN) before LGWR can mark the commit as complete.
  • LGWR is starved of CPU. If the server is very busy, LGWR can be starved of CPU too. This leads to slower response from LGWR and increases ‘log file sync’ waits. After all, these system calls and I/O calls must use CPU. In this case, ‘log file sync’ is a secondary symptom, and resolving the root cause of the high CPU usage will reduce ‘log file sync’ waits.
  • Due to memory starvation, LGWR can be paged out. This can also lead to slower response from LGWR.
  • LGWR is unable to complete writes fast enough due to file system or Unix buffer cache limitations.
2. LGWR is unable to post the processes fast enough, due to excessive commits. It is quite possible that there is no starvation for CPU or memory and that I/O performance is decent enough. Still, if there are excessive commits, LGWR has to perform many writes and semctl calls, and this can increase ‘log file sync’ waits. It can also result in a sharp increase in the ‘redo wastage’ statistic.
3. IMU undo/redo threads. With private strands, a process can generate a few megabytes of redo before committing. LGWR must write all the redo generated so far, and processes must wait on ‘log file sync’ even if the redo generated by other processes is small.
4. LGWR is suffering from other database contention, such as enqueue waits or latch contention. For example, we have seen LGWR freeze due to CF enqueue contention. This scenario is possible, however unlikely.
5. Various bugs. Oh, yes, there are bugs introducing unnecessary ‘log file sync’ waits.
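
To get a rough feel for causes 1 and 2, the commit and redo statistics can be read from v$sysstat, and the average ‘log file sync’ time can be compared with the average ‘log file parallel write’ time in v$system_event. The following is only a minimal sketch. It assumes the statistic and event names shown exist under these names in your release, and the values are cumulative since instance startup, so they are a starting point rather than a diagnosis.

```sql
-- Cumulative commit and redo statistics since instance startup.
SELECT name, value
  FROM v$sysstat
 WHERE name IN ('user commits', 'redo size', 'redo writes',
                'redo wastage', 'redo synch writes')
 ORDER BY name;

-- Average wait per event, in milliseconds, since instance startup.
-- NULLIF guards against a divide-by-zero if an event has no waits.
SELECT event,
       total_waits,
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 2) AS avg_ms
  FROM v$system_event
 WHERE event IN ('log file sync', 'log file parallel write')
 ORDER BY event;
```

As a rule of thumb only: if the average ‘log file sync’ time is close to the average ‘log file parallel write’ time, the I/O itself is the likely suspect, while a large gap between the two points more toward CPU starvation or commit volume.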

It is worthwhile to identify the root cause and resolve it.
  • First, make sure that ‘log file sync’ is indeed a major wait event.
  • Identify and break down the LGWR wait events by querying the wait events for the LGWR session. In this instance the LGWR sid is 3 (and usually it is).
  SELECT   sid, event, time_waited, time_waited_micro
    FROM   v$session_event
   WHERE   sid = 3
ORDER BY   3 DESC
SQL> / 
SID EVENT                               TIME_WAITED TIME_WAITED_MICRO
--- ----------------------------------- ----------- -----------------
  3 rdbms ipc message                       2889367        2.8894E+10
  3 log file parallel write                  295343        2953429267
  3 LGWR wait for redo copy                     843           8425950
  3 log file single write                        67            674960
  3 control file parallel write                  47            471217
  3 latch free                                    6             60546
  3 control file sequential read                  4             39457
  3 log file sequential read                      2             17374
  3 direct path write                             2             23908
  3 direct path read                              0                14

10 rows selected.
It is worth noting that v$session_event is a cumulative counter from 
instance startup and hence the raw numbers can be misleading. The difference 
between two snapshots of this view, for the same session, can be quite useful.
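
The snapshot difference described above can be sketched as follows. This is one possible approach, not the only one. The table name lgwr_event_snap is hypothetical, the sketch assumes the account has direct SELECT privileges on the underlying views and may create tables, and it looks up the LGWR sid through v$bgprocess instead of assuming it is 3.

```sql
-- Step 1: save a baseline of LGWR's cumulative wait events.
-- lgwr_event_snap is a hypothetical scratch table used for illustration.
CREATE TABLE lgwr_event_snap AS
SELECT SYSDATE AS snap_time,
       e.event,
       e.total_waits,
       e.time_waited_micro
  FROM v$session_event e
 WHERE e.sid = (SELECT s.sid
                  FROM v$session s, v$bgprocess b
                 WHERE s.paddr = b.paddr
                   AND b.name = 'LGWR');

-- Step 2: let the workload run for a while, then compare the current
-- counters with the baseline. Events that first appeared after the
-- baseline have no saved row, so NVL treats their baseline as zero.
SELECT e.event,
       e.total_waits - NVL(s.total_waits, 0)             AS waits_delta,
       e.time_waited_micro - NVL(s.time_waited_micro, 0) AS micro_delta
  FROM v$session_event e
  LEFT JOIN lgwr_event_snap s
    ON s.event = e.event
 WHERE e.sid = (SELECT s2.sid
                  FROM v$session s2, v$bgprocess b
                 WHERE s2.paddr = b.paddr
                   AND b.name = 'LGWR')
 ORDER BY micro_delta DESC;

-- Step 3: remove the scratch table when done.
DROP TABLE lgwr_event_snap;
```

The idle event ‘rdbms ipc message’ will usually dominate the output and can be ignored when reading the deltas.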

  • If excessive redo size is the root cause, the redo size can be reduced.
  • Solid-state disk devices can also be used if the redo size is extreme. Even in that case, it is preferable to decrease the redo size as well.
  • If the commit rate is high, then reducing the number of commits is the correct step. In the few cases where that is not possible, increasing the priority of LGWR (using nice) or raising the priority class of LGWR to RT might provide some relief.
  • If I/O bandwidth is the issue, then doing anything other than improving 
    I/O bandwidth is not useful. Switching to a file system that provides better 
    write throughput is one option. RAW devices are another option. Reducing 
    the number of log file members in a group is another option, as it reduces 
    the number of write calls. But this option comes with a cost. 
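
Before changing the number of members per group, it helps to see the current redo log layout. The queries below are a small sketch against the standard v$log and v$logfile views; only the column aliases are made up here.

```sql
-- One row per redo log group: member count, size, and current status.
SELECT group#,
       thread#,
       members,
       ROUND(bytes / 1024 / 1024) AS size_mb,
       status
  FROM v$log
 ORDER BY thread#, group#;

-- The individual member files behind each group.
SELECT group#, member
  FROM v$logfile
 ORDER BY group#, member;
```

A group with a single member leaves no multiplexed copy of that log inside the database, which is the cost mentioned above.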
