Friday, August 24, 2012

Strategies and techniques to resolve 'log file sync' waits

A commit is not complete until LGWR has written the log buffers, including the commit redo records, to
the log files. In a nutshell, after posting LGWR to write, the user or background process waits for
LGWR to signal back, with a 1-second timeout. The user process charges this wait time to the
'log file sync' event.
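
The post-and-wait handshake described above can be illustrated with a toy model. This is only a sketch in Python, not Oracle's implementation: the class name ToyLogWriter, the write_latency parameter, and the choice to simply wait again after a timeout are all assumptions made for illustration.

```python
import queue
import threading
import time


class ToyLogWriter:
    """Toy stand-in for LGWR: one thread serving write requests in order."""

    def __init__(self, write_latency):
        # write_latency (seconds) is a hypothetical stand-in for redo write time.
        self.write_latency = write_latency
        self.requests = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            done = self.requests.get()
            if done is None:          # shutdown marker
                break
            time.sleep(self.write_latency)  # pretend to write the log buffer
            done.set()                # signal the waiting committer

    def commit(self):
        """Post the writer, wait for its signal, return the time waited."""
        done = threading.Event()
        start = time.perf_counter()
        self.requests.put(done)       # "post LGWR to write"
        # Wait with a 1-second timeout; in this toy, a timeout just means
        # waiting again until the writer signals back.
        while not done.wait(timeout=1.0):
            pass
        return time.perf_counter() - start  # the toy's 'log file sync' time

    def stop(self):
        self.requests.put(None)
        self.thread.join()


if __name__ == "__main__":
    writer = ToyLogWriter(write_latency=0.02)
    print(f"waited {writer.commit():.4f} s")
    writer.stop()
```

In this model, anything that delays the writer thread (a slow write, no CPU, a long queue of requests) shows up directly as wait time in the committing thread, which is the pattern the scenarios below describe.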

The root causes of 'log file sync' waits essentially boil down to a few scenarios:
  • Disk I/O performance to the log files is not good enough. Even though LGWR can use
    asynchronous I/O, redo log files are opened with the DSYNC flag, and buffers must be
    flushed to disk (or at least written to the disk array cache, in the case of a SAN) before
    LGWR can mark the commit as complete.
  • LGWR is starving for CPU. If the server is very busy, LGWR can be starved of CPU
    as well. This leads to slower response from LGWR, increasing 'log file sync'
    waits. After all, these system calls and I/O calls must use CPU. In this case, 'log file
    sync' is a secondary symptom, and resolving the root cause of the high CPU usage will reduce
    'log file sync' waits.
  • Due to memory starvation, LGWR can be paged out. This can also lead to slower
    response from LGWR.
  • LGWR is unable to complete writes fast enough due to file system or Unix buffer
    cache limitations.
  • LGWR is unable to post the processes fast enough, due to excessive commits. It is quite
    possible that there is no CPU or memory starvation and that I/O performance is decent enough. Still, if
    there are excessive commits, LGWR has to perform many writes and semctl calls, and this can
    increase 'log file sync' waits. This can also result in a sharp increase in the 'redo wastage' statistic.
  • With private strands, a process can generate a few megabytes of redo
    before committing. LGWR must write all the redo generated so far, and committing processes must wait
    on 'log file sync' even if the redo generated by the other processes is small.
  • LGWR is suffering from other database contention, such as enqueue waits or latch contention.
    For example, we have seen LGWR freeze due to CF enqueue contention.

Finding and understanding the root cause is essential to resolving a performance issue.

  • If I/O bandwidth is the issue, then doing anything other than improving I/O bandwidth is not
    useful. Switching to a file system that provides better write throughput is one option. Raw devices are
    another option. Reducing the number of log file members in a group is another option, as it reduces the
    number of write calls. But this option comes with a cost: fewer members means less protection against
    losing a log file.
  • If CPU starvation is the issue, then reducing CPU starvation is the correct step to resolve it.
    Increasing the priority of LGWR is a workaround.
  • If the commit rate is high, then decreasing commits is the correct step. In the few cases where that is
    not possible, increasing the priority of LGWR (using nice) or raising the priority class of LGWR to RT
    might provide some relief.
  • Solid-state disk devices can also be used if the redo size is extreme. In that case, it is also
    preferable to decrease the redo size.
  • If excessive redo size is the root cause, redo size can be reduced using various techniques.
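
As an illustration of "decreasing commits", the sketch below commits once per batch of rows instead of once per row. It is a minimal example under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in database (not Oracle), and the table t and the helpers new_db and insert_rows are hypothetical names invented for this example.

```python
import sqlite3


def new_db():
    """Create an in-memory database with a hypothetical table t."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
    conn.commit()
    return conn


def insert_rows(conn, rows, batch_size):
    """Insert rows, committing every batch_size rows; return the commit count."""
    commits = 0
    for i, row in enumerate(rows, start=1):
        conn.execute("INSERT INTO t (id, val) VALUES (?, ?)", row)
        if i % batch_size == 0:
            conn.commit()
            commits += 1
    if conn.in_transaction:   # commit any leftover partial batch
        conn.commit()
        commits += 1
    return commits


if __name__ == "__main__":
    rows = [(i, f"v{i}") for i in range(250)]
    print("commit per row:  ", insert_rows(new_db(), rows, 1), "commits")
    print("commit per batch:", insert_rows(new_db(), rows, 100), "commits")
```

The same rows are inserted either way; only the number of commits changes. Against a real database, each avoided commit is one fewer request for the log writer to service, at the price of larger transactions, so the batch size is a trade-off to be chosen per application.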
