一业务系统在白天业务时间出现了严重了IO性能问题,下面是下午业务高峰时间(3-5)的awr报告
从等待事件来看主要都是与IO相关
从上面可以看到除了几个语句的逻辑读很高,其实物理不是很高,每秒产生的重做日志以及物理读也不高.
检查磁盘IO
rx6600-1:[/]#sar -d 1 10
HP-UX rx6600-1 B.11.23 U ia64 07/15/14
16:18:45 device %busy avque r+w/s blks/s avwait avserv
16:18:46 c39t0d3 100.00 0.50 18 1130 0.00 132.11
c41t0d3 83.50 0.50 6 450 0.00 290.78
16:18:47 c3t0d0 0.99 0.50 2 63 0.00 7.12
c39t0d3 91.09 0.50 10 982 0.00 115.53
c41t0d3 100.00 0.50 12 586 0.00 291.67
16:18:48 c3t0d0 3.03 0.50 2 32 0.00 15.93
c39t0d3 100.00 0.50 9 1034 0.00 139.76
c41t0d3 92.93 0.50 7 388 0.00 310.07
16:18:49 c3t0d0 2.00 0.50 4 64 0.00 19.59
c39t0d3 100.00 0.50 12 1088 0.00 127.33
c41t0d3 86.00 0.50 8 416 0.00 251.32
16:18:50 c3t0d0 1.01 0.50 1 2 0.00 8.99
c39t0d3 100.00 0.50 16 954 0.00 117.10
c41t0d3 100.00 0.50 9 614 0.00 295.52
16:18:51 c3t0d0 0.99 0.50 1 8 0.00 10.60
c39t0d3 93.07 0.50 17 913 0.00 110.59
c41t0d3 100.00 0.50 9 350 0.00 326.92
16:18:52 c39t0d3 100.00 0.50 21 1168 0.00 127.22
c41t0d3 88.00 0.50 11 544 0.00 252.08
16:18:53 c3t0d0 2.02 0.50 3 48 0.00 18.51
c39t0d3 88.89 0.50 19 1164 0.00 98.25
c41t0d3 100.00 0.50 11 630 0.00 324.39
16:18:54 c3t0d0 3.00 0.50 3 20 0.00 12.39
c39t0d3 95.00 0.50 20 954 0.00 131.90
c41t0d3 81.00 0.50 9 610 0.00 289.05
16:18:55 c3t0d0 9.00 0.50 11 134 0.00 8.62
c39t0d3 100.00 0.50 19 1090 0.00 137.20
c41t0d3 100.00 0.50 11 512 0.00 327.16
Average c39t0d3 99.50 0.50 16 1048 0.00 123.38
Average c41t0d3 100.00 0.50 9 510 0.00 296.44
Average c3t0d0 2.20 0.50 3 37 0.00 12.28
rx6600-1:[/]#sar -d 1 10
HP-UX rx6600-1 B.11.23 U ia64 07/15/14
16:20:04 device %busy avque r+w/s blks/s avwait avserv
16:20:05 c3t0d0 1.00 0.50 1 16 0.00 8.33
c39t0d3 98.00 0.50 16 928 0.00 114.86
c41t0d3 98.00 0.50 10 684 0.00 266.43
16:20:06 c3t0d0 1.98 0.50 4 81 0.00 8.57
c39t0d3 93.07 0.50 19 1251 0.00 128.81
c41t0d3 91.09 0.50 6 475 0.00 365.83
16:20:07 c3t0d0 2.00 0.50 3 48 0.00 5.87
c39t0d3 98.00 0.50 23 1216 0.00 113.66
c41t0d3 98.00 0.50 8 576 0.00 307.92
16:20:08 c3t0d0 1.00 0.50 2 32 0.00 5.36
c39t0d3 100.00 0.50 21 1132 0.00 118.47
c41t0d3 100.00 0.50 7 592 0.00 300.71
16:20:09 c3t0d0 6.00 0.58 13 194 2.22 26.05
c39t0d3 89.00 0.50 17 1152 0.00 123.54
c41t0d3 87.00 0.50 8 512 0.00 298.26
16:20:10 c3t0d0 3.00 0.50 6 96 0.00 22.78
c39t0d3 85.00 0.50 17 1136 0.00 114.79
c41t0d3 98.00 0.50 9 592 0.00 252.52
16:20:11 c3t0d0 1.00 0.50 1 2 0.00 8.04
c39t0d3 100.00 0.50 17 1216 0.00 138.04
c41t0d3 100.00 0.50 12 672 0.00 291.69
16:20:12 c3t0d0 2.00 0.50 3 34 0.00 9.24
c39t0d3 99.00 0.50 16 1024 0.00 122.11
c41t0d3 88.00 0.50 9 476 0.00 299.79
16:20:13 c39t0d3 91.00 0.50 18 1024 0.00 111.77
c41t0d3 92.00 0.50 3 384 0.00 396.25
16:20:14 c39t0d3 99.00 0.50 17 892 0.00 132.15
c41t0d3 100.00 0.50 10 608 0.00 233.54
Average c3t0d0 1.80 0.53 3 50 0.87 17.64
Average c39t0d3 96.00 0.50 18 1097 0.00 121.54
Average c41t0d3 100.00 0.50 8 557 0.00 290.35
在业务人员下班后重启的双机软件,但在启动数据库时停在了Completed redo application这一步
SQL> startup
ORACLE instance started.
Total System Global Area 1.0318E+10 bytes
Fixed Size 2073176 bytes
Variable Size 3238006184 bytes
Database Buffers 7063207936 bytes
Redo Buffers 14700544 bytes
Database mounted.
从alert.log文件中可以看到如下信息:
Tue Jul 15 22:23:29 2014
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
IMODE=BR
ILAT =61
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:
processes = 500
sessions = 555
__shared_pool_size = 3154116608
__large_pool_size = 16777216
__java_pool_size = 33554432
__streams_pool_size = 33554432
sga_target = 10317987840
control_files = /sx_data/ORCL/control01.ctl, /sx_data/ORCL/control02.ctl, /sx_data/ORCL/control03.ctl
db_block_size = 8192
__db_cache_size = 7063207936
compatible = 10.2.0.3.0
log_archive_dest_1 = LOCATION=/sx_data/arch_ORCL/
log_archive_format = %t_%s_%r.dbf
db_file_multiblock_read_count= 16
db_recovery_file_dest = /oracle/flash_recovery_area
db_recovery_file_dest_size= 2147483648
undo_management = AUTO
undo_tablespace = UNDOTBS1
undo_retention = 39600
fast_start_parallel_rollback= FALSE
remote_login_passwordfile= EXCLUSIVE
db_domain =
dispatchers = (PROTOCOL=TCP) (SERVICE=ORCLXDB)
local_listener = ORCL
job_queue_processes = 10
background_dump_dest = /oracle/admin/ORCL/bdump
user_dump_dest = /oracle/admin/ORCL/udump
core_dump_dest = /oracle/admin/ORCL/cdump
audit_file_dest = /oracle/admin/ORCL/adump
db_name = ORCL
open_cursors = 2000
optimizer_index_cost_adj = 20
optimizer_index_caching = 90
pga_aggregate_target = 2576351232
PMON started with pid=2, OS id=13613
PSP0 started with pid=3, OS id=13615
MMAN started with pid=4, OS id=13617
DBW0 started with pid=5, OS id=13619
LGWR started with pid=6, OS id=13621
CKPT started with pid=7, OS id=13623
SMON started with pid=8, OS id=13625
RECO started with pid=9, OS id=13627
CJQ0 started with pid=10, OS id=13629
MMON started with pid=11, OS id=13631
Tue Jul 15 22:23:30 2014
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=12, OS id=13635
Tue Jul 15 22:23:30 2014
starting up 1 shared server(s) ...
Tue Jul 15 22:23:31 2014
ALTER DATABASE MOUNT
Tue Jul 15 22:23:39 2014
Setting recovery target incarnation to 2
Tue Jul 15 22:23:42 2014
Successful mount of redo thread 1, with mount id 1380841571
Tue Jul 15 22:23:42 2014
Database mounted in Exclusive Mode
Completed: ALTER DATABASE MOUNT
Tue Jul 15 22:23:42 2014
ALTER DATABASE OPEN
Tue Jul 15 22:23:47 2014
Beginning crash recovery of 1 threads
parallel recovery started with 7 processes
Tue Jul 15 22:23:50 2014
Started redo scan
Tue Jul 15 22:23:52 2014
Completed redo scan
336597 redo blocks read, 78835 data blocks need recovery
Tue Jul 15 22:23:52 2014
Started redo application at
Thread 1: logseq 2270, block 29
Tue Jul 15 22:23:53 2014
Recovery of Online Redo Log: Thread 1 Group 4 Seq 2270 Reading mem 0
Mem# 0: /sx_data/ORCL/redo04.log
Tue Jul 15 22:23:58 2014
Completed redo application
一直停在Completed redo application这,而这时的等待事件是checkpoint complete开始以为是并行恢复慢造成的,就查询了v$transaction,v$fast_start_transactions但视图中并没有进行恢复操作的事务存在.后来咨询了老熊,老熊说检查一下IO情况看是不是存储出问题了,如是再次检查存储IO性能:
rx6600-1:[/]#sar 1 10
HP-UX rx6600-1 B.11.23 U ia64 07/15/14
22:36:41 %usr %sys %wio %idle
22:36:42 2 2 12 84
22:36:43 1 0 12 87
22:36:44 0 0 17 83
22:36:45 0 0 13 87
22:36:46 0 1 12 87
22:36:47 2 1 13 84
22:36:48 1 1 16 82
22:36:49 0 0 12 88
22:36:50 0 0 12 88
22:36:51 0 0 22 78
Average 1 0 14 85
从上面可以看到现在实际上并没有业务在跑居然还存在IO等待这是不正常的
rx6600-2:[/]#bdf
Filesystem kbytes used avail %used Mounted on
/dev/vg00/lvol3 983040 422504 556176 43% /
/dev/vg00/lvol1 1835008 135048 1686776 7% /stand
/dev/vg00/lvol8 8912896 8535352 374824 96% /var
/dev/vg00/lvol7 7962624 2762312 5159704 35% /usr
/dev/vg00/lvol4 524288 83192 437784 16% /tmp
/dev/vg00/tmplv 2064384 93512 1847942 5% /oratmp
/dev/vg00/orasoft 10256384 3144652 6668390 32% /orasoft
/dev/vg00/oracle 20480000 5480497 14062042 28% /oracle
/dev/vg00/lvol6 9076736 5206384 3840128 58% /opt
/dev/vg00/lvol5 131072 25472 104824 20% /home
/dev/cwjcvg/cwjc_datalv
414973952 134188114 263239842 34% /cwjc_data
/dev/sxvg/sx_datalv
624689152 298665485 305654147 49% /sx_data
rx6600-2:[/]#time dd if=/dev/zero of=/var/test bs=8k count=100000
下面对小机自身的磁盘进行IO测试写800M的数据只要12秒左右
msgcnt 2 vxfs: mesg 001: vx_nospace - /dev/vg00/lvol8 file system full (1 block extent)
I/O error
47185+0 records in
47184+1 records out
real 11.7
user 0.0
sys 0.8
但是对EMC存储进行IO测试写800M的数据只要30多分还没有完成
rx6600-2:[/]#time dd if=/dev/zero of=/sx_data/test bs=8k count=100000
711856+0 records in
711855+0 records out
real 30:58.4
user 0.5
sys 13.0
这明显的是存储出了问题,后面得知管理人员早上10点多发现了存储有一个磁盘损坏了,存储做的raid 5,有热备盘.而且还有上百G的数据进行存储级的同步.与出现性能问题的时间一至。
到此问题原因找到了解决起来也就简单了.幸亏问题解决了,第二天有大领导来检查要不就.......哈哈