本文转载自微信公众号「DBA闲思杂想录」,作者潇湘隐者 。转载本文请联系DBA闲思杂想录公众号。
监控工具DPA(SolarWinds的Database Performance Analyzer)发现海外一台Oracle数据库服务器DB Commit Time指标告警,超过红色告警线(40毫秒左右,黄色告警线是10毫秒,红色告警线是20毫秒),如下截图所示,生成了对应时段的AWR报告,分析时发现Top 5 Timed Events里面,log file sync等待事件的平均等待时间为37毫秒,log file parallel write等待事件的平均等待时间为40毫秒
如果对'Tanel Poder::Understanding LGWR, Log File Sync Waits and Commit Performance'这篇文章所讲述的内容很熟悉的话(经典图如下),那么通过等待事件log file sync与log file parallel write的Avg Wait(ms)指标:37ms & 40ms,基本上就可以判断redo log所在的磁盘I/O出现了性能问题
然后在用lfsdiag.sql脚本分析一下详细的统计数据,如下所示:
- HISTOGRAM DATA FOR LFS AND OTHER RELATED WAITS:
-
- APPROACH: Look at the wait distribution for log file sync waits
- by looking at "wait_time_milli". Look at the high wait times then
- see if you can correlate those with other related wait events.
-
- INST_ID EVENT WAIT_TIME_MILLI WAIT_COUNT
- ---------- ---------------------------------------- --------------- ----------
- 1 log file sync 1 4363
- 1 log file sync 2 835
- 1 log file sync 4 1650
- 1 log file sync 8 4937
- 1 log file sync 16 146252
- 1 log file sync 32 606674
- 1 log file sync 64 263377
- 1 log file sync 128 253254
- 1 log file sync 256 2
- 1 log file switch completion 1 124
- 1 log file switch completion 2 9
- 1 log file switch completion 4 19
- 1 log file switch completion 8 21
- 1 log file switch completion 16 35
- 1 log file switch completion 32 97
- 1 log file switch completion 64 133
- 1 log file switch completion 128 326
- 1 log file switch completion 256 1736
- 1 log file switch completion 512 3042
- 1 log file switch completion 1024 2020
- 1 log file parallel write 1 0
- 1 log file parallel write 2 0
- 1 log file parallel write 4 80
- 1 log file parallel write 8 2142
- 1 log file parallel write 16 170987
- 1 log file parallel write 32 779205
- 1 log file parallel write 64 311463
- 1 log file parallel write 128 79688
- 1 log file parallel write 256 42763
- 1 log file parallel write 512 13052
- 1 log file parallel write 1024 20468
- 1 log file parallel write 2048 14020
- 1 log file parallel write 4096 921
- 1 log file parallel write 8192 96
- 1 log file parallel write 16384 18
- 1 log file parallel write 32768 18
- 1 log file parallel write 65536 8
- 1 log file parallel write 131072 2
- 1 LGWR wait for redo copy 1 8516
- 1 LGWR wait for redo copy 2 20
- 1 LGWR wait for redo copy 4 19
- 1 LGWR wait for redo copy 8 20
- 1 LGWR wait for redo copy 16 11
-
- ORDERED BY WAIT_TIME_MILLI
-
- INST_ID EVENT WAIT_TIME_MILLI WAIT_COUNT
- ---------- ---------------------------------------- --------------- ----------
- 1 log file sync 1 4363
- 1 log file switch completion 1 124
- 1 log file parallel write 1 0
- 1 LGWR wait for redo copy 1 8516
- 1 log file sync 2 835
- 1 log file switch completion 2 9
- 1 log file parallel write 2 0
- 1 LGWR wait for redo copy 2 20
- 1 log file sync 4 1650
- 1 log file switch completion 4 19
- 1 log file parallel write 4 80
- 1 LGWR wait for redo copy 4 19
- 1 log file sync 8 4937
- 1 log file switch completion 8 21
- 1 log file parallel write 8 2142
- 1 LGWR wait for redo copy 8 20
- 1 log file sync 16 146252
- 1 log file switch completion 16 35
- 1 log file parallel write 16 170987
- 1 LGWR wait for redo copy 16 11
- 1 log file sync 32 606674
- 1 log file switch completion 32 97
- 1 log file parallel write 32 779205
- 1 log file sync 64 263377
- 1 log file switch completion 64 133
- 1 log file parallel write 64 311463
- 1 log file sync 128 253254
- 1 log file switch completion 128 326
- 1 log file parallel write 128 79688
- 1 log file sync 256 2
- 1 log file switch completion 256 1736
- 1 log file parallel write 256 42763
- 1 log file switch completion 512 3042
- 1 log file parallel write 512 13052
- 1 log file switch completion 1024 2020
- 1 log file parallel write 1024 20468
- 1 log file parallel write 2048 14020
- 1 log file parallel write 4096 921
- 1 log file parallel write 8192 96
- 1 log file parallel write 16384 18
- 1 log file parallel write 32768 18
- 1 log file parallel write 65536 8
- 1 log file parallel write 131072 2
-
- REDO WRITE STATS
-
- "redo write time" in centiseconds (100 per second)
- 11.1: "redo write broadcast ack time" in centiseconds (100 per second)
- 11.2: "redo write broadcast ack time" in microseconds (1000 per millisecond)
-
- VERSION INST_ID NAME VALUE MILLISECONDS
- ----------------- ---------- ---------------------------------------- --------------------- -------------------
- 10.2.0.5.0 1 redo write time 9551524 95515240.000
- 10.2.0.5.0 1 redo writer latching time 51
- 10.2.0.5.0 1 redo writes 1434931
-
- AWR WORST AVG LOG FILE SYNC SNAPS:
从上面数据可以看到,log file sync等待事件数量最多的是32ms这个区间的,log file parallel write等待事件发生最多的也是32ms这个区间的,其实这个值已经远远超过了正常值7ms,极其不正常。log file parallel write事件是LGWR进程专属的等待事件,发生在LGWR将log_buffer中的重做日志信息写入联机重做日志文件组的成员文件,LGWR在该事件上等待该写入过程的完成。该事件等待时间过长,说明日志文件所在磁盘缓慢或存在争用。log file sync和log file parallel write是相互关联的。换句话讲,假设log file parallel write的时间非常长,那么必定导致log file sync等待时间拉长。如果log file parallel write 等待非常高,那么可能一般是物理磁盘I/O的问题
另外,我们也检查了一下redo log的切换频率,如下所示,redolog_sitch_time.sql查看发现redo log切换的次数并不频繁,生成的归档日志的量也并不大。大部分时候一小时切换零次或一次。
然后我们找了一台服务器(上述指标正常的服务器)简单对测试了一下IO的速度,这个方法极其简单,就是看看生成一个大文件需要多长时间,简单测试一下I/O性能(没有考虑cache等,测试采样也不详尽),但是对比数据也基本能验证、反馈磁盘IO存在问题)。
问题服务器:
- # time dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct
-
- 2048+0 records in
-
- 2048+0 records out
-
- 1073741824 bytes (1.1 GB) copied, 88.271 seconds, 12.2 MB/s
-
-
-
- real 1m28.273s
-
- user 0m0.010s
-
- sys 0m0.655s
对比服务器(正常的服务器):
- # time dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct
-
- 2048+0 records in
-
- 2048+0 records out
-
- 1073741824 bytes (1.1 GB) copied, 2.48344 seconds, 432 MB/s
-
-
-
- real 0m2.485s
-
- user 0m0.004s
-
- sys 0m0.386s
如上对比所示,两台服务器生成同样一个大小文件,耗费的时间,I/O性能差别非常大,完全验证了告警的服务器所在的存储I/O存在性能问题,但是公司分工非常明确,DBA也不清楚底层存储出了什么问题,只能将这个问题反馈出来,等待海外负责维护系统和存储的同事的回复。