Road To OCM: Strategies and techniques to resolve 'log file sync' waits

Friday, August 24, 2012

Strategies and techniques to resolve 'log file sync' waits

A commit is not complete until LGWR has written the log buffer contents, including the commit redo
records, to the log files. In a nutshell, after posting LGWR to write, the user or background process
waits for LGWR to signal back, with a 1 second timeout. The user process charges this wait time to the
'log file sync' event.
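
The post-and-wait handshake described above can be pictured with a small toy model. This is only a
sketch in Python using threads, not Oracle's actual implementation: the function timed_commit and the
lgwr_write callback are hypothetical names, and the 1 second timeout is the only value taken from the
text.

```python
import threading
import time


def timed_commit(lgwr_write, timeout=1.0):
    """Post a writer thread, wait for its signal, return the seconds waited."""
    done = threading.Event()

    def lgwr():
        lgwr_write()   # stand-in for writing the log buffer to the log file
        done.set()     # signal the waiting session back

    start = time.monotonic()
    threading.Thread(target=lgwr, daemon=True).start()   # "post LGWR"

    # The waiter sleeps with a timeout and simply waits again if it expires.
    while not done.wait(timeout):
        pass

    return time.monotonic() - start   # time charged to 'log file sync'


# A "redo write" that takes about 50 ms (hypothetical figure).
waited = timed_commit(lambda: time.sleep(0.05))
print(f"log file sync wait: {waited * 1000:.1f} ms")
```

Anything that delays the writer thread (a slow write, no CPU to run on, being paged out) shows up
directly in the time the committing session records.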

The root causes of 'log file sync' waits essentially boil down to a few scenarios:

Disk I/O performance to the log files is not good enough. Even though LGWR can use asynchronous I/O,
redo log files are opened with the DSYNC flag, and buffers must be flushed to disk (or at least
written to the disk array cache, in the case of a SAN) before LGWR can mark the commit as complete.

LGWR is starving for CPU. If the server is very busy, LGWR can starve for CPU too, and its slower
response increases 'log file sync' waits. After all, the system calls and I/O calls it makes must use
CPU. In this case 'log file sync' is a secondary symptom, and resolving the root cause of the high CPU
usage will reduce 'log file sync' waits.

Due to memory starvation, LGWR can be paged out. This also leads to slower response from LGWR.

LGWR is unable to complete writes fast enough due to file system or Unix buffer cache limitations.

LGWR is unable to post the processes fast enough, due to excessive commits. It is quite possible that
there is no starvation for CPU or memory and that I/O performance is decent enough. Still, if there are
excessive commits, LGWR has to perform many writes and semctl calls, and this can increase 'log file
sync' waits. It can also result in a sharp increase in the 'redo wastage' statistic.

With private strands, a process can generate a few megabytes of redo before committing. LGWR must
write all the redo generated so far, and processes must wait on 'log file sync' even if the redo
generated by other processes is small.

LGWR is suffering from other database contention, such as enqueue waits or latch contention. For
example, we have seen LGWR freeze due to CF enqueue contention.
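
As a rough aid for telling these scenarios apart, the sketch below compares the average 'log file
sync' time with the average 'log file parallel write' time (the event LGWR itself waits on while
writing redo) and looks at 'redo wastage' relative to 'redo size'. The function name, the sample
figures and both thresholds are illustrative assumptions, not Oracle recommendations; the inputs
would have to be collected separately, for example from an AWR report.

```python
def diagnose(avg_sync_ms, avg_write_ms, redo_bytes, wastage_bytes, commits,
             io_share=0.7, wastage_share=0.1):
    """Return (hints, redo bytes per commit). Thresholds are assumptions."""
    hints = []

    if avg_sync_ms > 0 and avg_write_ms / avg_sync_ms >= io_share:
        # Most of the commit wait is the redo write itself.
        hints.append("io")
    else:
        # Time is lost outside the write: CPU starvation, paging, slow posting.
        hints.append("cpu_or_scheduling")

    if redo_bytes > 0 and wastage_bytes / redo_bytes >= wastage_share:
        # High wastage relative to redo size suggests many small, frequent commits.
        hints.append("excessive_commits")

    redo_per_commit = redo_bytes / commits if commits else 0.0
    return hints, redo_per_commit


# Hypothetical figures: 10 ms sync, 9 ms write, little wastage.
io_case = diagnose(10.0, 9.0, 1_000_000, 10_000, 1000)

# Hypothetical figures: 10 ms sync, 2 ms write, 20% wastage.
commit_case = diagnose(10.0, 2.0, 1_000_000, 200_000, 4000)

print(io_case)
print(commit_case)
```

The classification is deliberately coarse. It only narrows down which of the scenarios above deserves
a closer look first.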

Finding and understanding the root cause is essential to resolving a performance issue.

If I/O bandwidth is the issue, then doing anything other than improving I/O bandwidth is not useful.
Switching to a file system that provides better write throughput is one option. RAW devices are
another. Reducing the number of log file members in a group is a third, as it reduces the number of
write calls, but this option comes with a cost: fewer members means less protection against the loss
of a log file.

If CPU starvation is the issue, then reducing CPU starvation is the correct step to resolve it.
Increasing the priority of LGWR is a workaround.

If the commit rate is high, then decreasing commits is the correct step. In the few cases where that
is not possible, increasing the priority of LGWR (using nice) or raising the priority class of LGWR
to RT (real time) might provide some relief.

Solid state disk devices can also be used if redo size is extreme. In that case, it is also preferable
to decrease redo size.

If excessive redo size is the root cause, redo size can be reduced using various techniques.
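
Where the application can be changed, decreasing commits usually means committing once per batch
instead of once per row. Here is a minimal sketch, assuming a DB-API style connection. SQLite from the
Python standard library stands in for Oracle purely so that the example runs on its own, and the
table t, the function insert_rows and the batch size are hypothetical.

```python
import sqlite3


def insert_rows(conn, rows, batch_size=1000):
    """Insert rows, committing once per batch; return the number of commits."""
    cur = conn.cursor()
    commits = 0
    pending = 0
    for row in rows:
        cur.execute("INSERT INTO t (id, val) VALUES (?, ?)", row)
        pending += 1
        if pending == batch_size:
            conn.commit()      # one commit (one wait on the log write) per batch
            commits += 1
            pending = 0
    if pending:                # commit the final partial batch
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
conn.commit()

rows = [(i, "x") for i in range(2500)]
commits = insert_rows(conn, rows, batch_size=1000)
row_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(commits, row_count)
```

With a batch size of 1 the same load would issue 2500 commits. Batching trades a larger unit of
rollback on failure for far fewer commit waits, so the batch size has to fit the application's
recovery requirements.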
