【MySQL故障处理】 Seconds_Behind_Master= NULL Error

将处理主从故障的过程记录在此：

故障1，Error_code: 1197 （max_binlog_cache_size）

数据库版本：5.7.9

报错信息：

[ERROR] Slave SQL for channel '': 
Worker 1 failed executing transaction 'bea4612c-4828-11e7-90b3-a0423f31cad6:716212' at master log mysql.135371, end_log_pos 103016490; 
Could not execute Write_rows event on table ******; 
Multi-statement transaction required more than 'max_binlog_cache_size' bytes of storage; 
increase this mysqld variable and try again, Error_code: 1197; 
Writing one row to the row-based binary log failed, Error_code: 1534; 
handler error HA_ERR_RBR_LOGGING_FAILED; 
the event's master log FIRST, end_log_pos 103016490, Error_code: 1197

处理过程：

根据报错信息，知道是max_binlog_cache_size小了，

解决办法：临时增加该值后restart slave即可以继续复制线程。

stop slave;
set global max_binlog_cache_size= 201326592;
start slave;

关于max_binlog_cache_size请看这里：

如果一个事务需要多于设置值的内存，就会报上面的错。最小4096字节，最大支持4GB，因为binlog位置点最大4GB，支持动态修改。

解决了问题，继续查找发生问题的原因，生产环境中的该参数设置为64M，当一个事务影响的数据量超过该值时，即会报错。

查看主库中日志的具体内容：

数据库日志格式为mixed，文件中记录的是row格式，使用下面命令查看

/bin/mysqlbinlog   -vv --base64-output=decode-rows mysql.135371 | more

精简后的内容：
#  13:43:59 server id *******  end_log_pos 10538 CRC32  Write_rows: table id 2776
# at 10538                                          
#  13:43:59 server id *******  end_log_pos 18175 CRC32  Write_rows: table id 2776
# at 18175                                         
#  13:43:59 server id *******  end_log_pos 25789 CRC32  Write_rows: table id 2776
# at 25789                                           
#  13:43:59 server id *******  end_log_pos 33424 CRC32  Write_rows: table id 2776
# at 33424                                            
#  13:43:59 server id *******  end_log_pos 40961 CRC32  Write_rows: table id 2776
# at 40961                                           
#  13:43:59 server id *******  end_log_pos 48553 CRC32  Write_rows: table id 2776
# at 48553                                          
#  13:43:59 server id *******  end_log_pos 56126 CRC32  Write_rows: table id 2776
# at 56126                                          
#  13:43:59 server id *******  end_log_pos 63671 CRC32  Write_rows: table id 2776
# at 63671                                          
#  13:43:59 server id *******  end_log_pos 71205 CRC32  Write_rows: table id 2776
# at 71205                                           
#  13:43:59 server id *******  end_log_pos 78753 CRC32  Write_rows: table id 2776
# at 78753                                           
#  13:43:59 server id *******  end_log_pos 86322 CRC32  Write_rows: table id 2776
# at 86322                                           
#  13:43:59 server id *******  end_log_pos 93925 CRC32  Write_rows: table id 2776

不出意外，执行的sql为

 insert into table1 select * from table2 where time > '2016-10-01'

形式的多语句事务，查看binlog文件的大小，最大的超过了200M（max_binlog_size设置为50M）。

因此是大事务造成的，大事务还容易造成主从延时的问题，推荐将大事务拆分为小事务执行

107M  14:07 mysql.135380
101M  14:08 mysql.135381
112M  14:08 mysql.135382
110M  14:09 mysql.135383
124M  14:09 mysql.135384
 31M  14:10 mysql.135385
226M  14:10 mysql.135386
120M  14:12 mysql.135387
111M  14:13 mysql.135388
102M  14:14 mysql.135389
...
126M  14:16 mysql.135394
...
118M  14:23 mysql.135404
...
110M  14:28 mysql.135409
 63M  14:29 mysql.135410
104M  14:29 mysql.135411
 281  14:30 mysql.135412
115M  14:30 mysql.135413
112M  14:30 mysql.135414
127M  14:30 mysql.135415
119M  14:31 mysql.135416
 85M  14:32 mysql.135417
 60M  14:32 mysql.135418
151M  14:33 mysql.135419

故障2，Errno: 1872，Error: Slave failed to initialize relay log info structure from the repository
数据库在 'in place' 升级之后报出该错误，重启主从后问题解决
查看问题有人说原因是从库信息变化导致的，解决方法可以reset slave，重新change master。

欢迎批评指正