ceph mon fails to start: RocksDB data corruption

2019-06-15 17:36


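I. Problem description

An error in the RocksDB database prevents the mon process from being started.

II. Symptoms:

The first call trace from the failing mon is as follows: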

  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,237 2020-07-31 19:36:31.926040 7fdc0142d700 -1 /root/rpmbuild/BUILD/ceph-12.2.12-1/src/mon/Monitor.cc: In function ‘bool Monitor::_scrub(ScrubResult*, std::pair, std::basic_string >*, int*)‘ thread 7fdc0142d700 time 2020-07-31 19:36:31.895145
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,238 /root/rpmbuild/BUILD/ceph-12.2.12-1/src/mon/Monitor.cc: 5374: FAILED assert(err == 0)
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,239  ceph version 12.2.12-30-ged2e5c3 (ed2e5c3c26215c395ed024dabce34321e1f650b3) luminous (stable)
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,240  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x5583990f8b10]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,241  2: (Monitor::_scrub(ScrubResult*, std::pair*, int*)+0xc11) [0x558398e7db61]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,242  3: (Monitor::handle_scrub(boost::intrusive_ptr)+0x22f) [0x558398e8adbf]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,243  4: (Monitor::dispatch_op(boost::intrusive_ptr)+0xc08) [0x558398ea6ab8]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,244  5: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x558398ea7a4b]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,245  6: (Monitor::ms_dispatch(Message*)+0x23) [0x558398ed4323]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,246  7: (DispatchQueue::entry()+0x792) [0x5583993c2ef2]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,247  8: (DispatchQueue::DispatchThread::entry()+0xd) [0x5583991a120d]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,248  9: (()+0x7e25) [0x7fdc0a21ae25]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,249  10: (clone()+0x6d) [0x7fdc077b534d]

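A second form of call trace was also observed (shown as a screenshot in the original post; the image is not preserved here).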

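III. Analysis

Investigation showed that when ceph mon handles a scrub message, the read from RocksDB returns an error, which suggests that the database is corrupted.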

bool Monitor::_scrub(ScrubResult *r,
                     pair<string,string> *start,
                     int *num_keys)
{
  ...
  MonitorDBStore::Synchronizer it = store->get_synchronizer(*start, prefixes);
  while (it->has_next_chunk()) {
    pair<string,string> k = it->get_next_key();
    bufferlist bl;
    int err = store->get(k.first, k.second, bl);  // the RocksDB read fails here; the database is probably corrupted
    assert(err == 0);  // the failed assert reported in the call trace
    ...
  }
  ...
}

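Further investigation showed that this problem had already been reported as a bug in the RocksDB repository. I reviewed the comments on that report and the fix made by the RocksDB community.

Issue report: https://github.com/facebook/rocksdb/issues/5558

RocksDB fix: https://github.com/facebook/rocksdb/pull/5744

The description of the RocksDB fix says: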

Open-source users recently reported two occurrences of LSM-tree corruption (#5558 is one), which would be caught by options.force_consistency_checks = true. options.force_consistency_checks has a usability limitation because it crashes the service once inconsistency is detected. This makes the feature hard to use. Most users serve from multiple RocksDB shards per server and the impacts of crashing the service is higher than it should be.

Instead, we just pass the error back to users without killing the service, and ask them to deal with the problem accordingly.

When user uses options.force_consistency_check in RocksDb, instead of crashing the process, we now pass the error back to the users without killing the process.
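  1. When the force_consistency_checks option is enabled, RocksDB calls CheckConsistency or CheckConsistencyForDeletes during the Apply operation to run consistency checks. In version 5.4.0, force_consistency_checks defaults to false, so in release builds these checks are not run at all.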
void CheckConsistency(VersionStorageInfo* vstorage) {
#ifdef NDEBUG
    if (!vstorage->force_consistency_checks()) {
      // Dont run consistency checks in release mode except if
      // explicitly asked to
      return;
    }
#endif
...
}

void CheckConsistencyForDeletes(VersionEdit* edit, uint64_t number,
                                  int level) {
#ifdef NDEBUG
    if (!base_vstorage_->force_consistency_checks()) {
      // Dont run consistency checks in release mode except if
      // explicitly asked to
      return;
    }
#endif    
...
}
  2. In the configuration, force_consistency_checks can be set to true so that RocksDB runs these checks.
mon_rocksdb_options = write_buffer_size=33554432,compression=kNoCompression,level_compaction_dynamic_level_bytes=true,force_consistency_checks=true
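The effect of the NDEBUG guard together with the option can be illustrated with a small self-contained sketch. It is not RocksDB code: ToyStorage, kReleaseBuild and ChecksWouldRun are hypothetical names used only to mirror the shape of the guard in CheckConsistency.

```cpp
// Hypothetical stand-in for VersionStorageInfo: it only carries the flag.
struct ToyStorage {
  bool force_consistency_checks;
};

// True when the translation unit is compiled as a release build (-DNDEBUG).
#ifdef NDEBUG
constexpr bool kReleaseBuild = true;
#else
constexpr bool kReleaseBuild = false;
#endif

// Returns true if the consistency checks would run for this storage.
// Same structure as the guard at the top of CheckConsistency.
bool ChecksWouldRun(const ToyStorage& s) {
#ifdef NDEBUG
  if (!s.force_consistency_checks) {
    return false;  // release build with the option off: checks are skipped
  }
#endif
  (void)s;  // the parameter is otherwise unused in debug builds
  return true;
}
```

In a release build, ChecksWouldRun returns true only when the flag is set. In a debug build it always returns true, which is why the option only matters for packaged binaries.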

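With force_consistency_checks=true, RocksDB does run these checks. However, when an inconsistency is detected (for example, SST files that are out of order, overlapping key ranges, or a file to be deleted that cannot be found), version 5.4.0 calls abort(), which terminates the process immediately. https://github.com/facebook/rocksdb/pull/5744 improves this: when such an inconsistency occurs, the error is passed back to the caller instead of killing the process. Excerpts of the change follow: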

// rocksdb 5.4.0

  void CheckConsistency(VersionStorageInfo* vstorage) {
...
    // make sure the files are sorted correctly
    for (int level = 0; level < vstorage->num_levels(); level++) {
      auto& level_files = vstorage->LevelFiles(level);
      for (size_t i = 1; i < level_files.size(); i++) {
        auto f1 = level_files[i - 1];
        auto f2 = level_files[i];
        if (level == 0) {
          if (!level_zero_cmp_(f1, f2)) {
            fprintf(stderr, "L0 files are not sorted properly");
            abort();  // the process exits immediately
          }

          if (f2->smallest_seqno == f2->largest_seqno) {
            // This is an external file that we ingested
            SequenceNumber external_file_seqno = f2->smallest_seqno;
            if (!(external_file_seqno < f1->largest_seqno ||
                  external_file_seqno == 0)) {
              fprintf(stderr, "L0 file with seqno %" PRIu64 " %" PRIu64
                              " vs. file with global_seqno %" PRIu64 "\n",
                      f1->smallest_seqno, f1->largest_seqno,
                      external_file_seqno);
              abort();  // the process exits immediately
            }
          } else if (f1->smallest_seqno <= f2->smallest_seqno) {
            fprintf(stderr, "L0 files seqno %" PRIu64 " %" PRIu64
                            " vs. %" PRIu64 " %" PRIu64 "\n",
                    f1->smallest_seqno, f1->largest_seqno, f2->smallest_seqno,
                    f2->largest_seqno);
            abort();  // the process exits immediately
          }
        } else {
          if (!level_nonzero_cmp_(f1, f2)) {
            fprintf(stderr, "L%d files are not sorted properly", level);
            abort();  // the process exits immediately
          }
...

// rocksdb v6.10.2

  Status CheckConsistency(VersionStorageInfo* vstorage) {
...
        if (level == 0) {
          if (!level_zero_cmp_(f1, f2)) {
            fprintf(stderr, "L0 files are not sorted properly");
            // does not exit; returns the error to the caller
            return Status::Corruption("L0 files are not sorted properly");
          }

          if (f2->fd.smallest_seqno == f2->fd.largest_seqno) {
            // This is an external file that we ingested
            SequenceNumber external_file_seqno = f2->fd.smallest_seqno;
            if (!(external_file_seqno < f1->fd.largest_seqno ||
                  external_file_seqno == 0)) {
              fprintf(stderr,
                      "L0 file with seqno %" PRIu64 " %" PRIu64
                      " vs. file with global_seqno %" PRIu64 "\n",
                      f1->fd.smallest_seqno, f1->fd.largest_seqno,
                      external_file_seqno);
              // does not exit; returns the error to the caller
              return Status::Corruption(
                  "L0 file with seqno " +
                  NumberToString(f1->fd.smallest_seqno) + " " +
                  NumberToString(f1->fd.largest_seqno) +
                  " vs. file with global_seqno" +
                  NumberToString(external_file_seqno) + " with fileNumber " +
                  NumberToString(f1->fd.GetNumber()));
            }
          } else if (f1->fd.smallest_seqno <= f2->fd.smallest_seqno) {
            fprintf(stderr,
                    "L0 files seqno %" PRIu64 " %" PRIu64 " vs. %" PRIu64
                    " %" PRIu64 "\n",
                    f1->fd.smallest_seqno, f1->fd.largest_seqno,
                    f2->fd.smallest_seqno, f2->fd.largest_seqno);
            // does not exit; returns the error to the caller
            return Status::Corruption(
                "L0 files seqno " + NumberToString(f1->fd.smallest_seqno) +
                " " + NumberToString(f1->fd.largest_seqno) + " " +
                NumberToString(f1->fd.GetNumber()) + " vs. " +
                NumberToString(f2->fd.smallest_seqno) + " " +
                NumberToString(f2->fd.largest_seqno) + " " +
                NumberToString(f2->fd.GetNumber()));
          }
        } else {
          if (!level_nonzero_cmp_(f1, f2)) {
            fprintf(stderr, "L%d files are not sorted properly", level);
            return Status::Corruption("L" + NumberToString(level) +
                                      " files are not sorted properly");
          }

          // Make sure there is no overlap in levels > 0
          if (vstorage->InternalComparator()->Compare(f1->largest,
                                                      f2->smallest) >= 0) {
            fprintf(stderr, "L%d have overlapping ranges %s vs. %s\n", level,
                    (f1->largest).DebugString(true).c_str(),
                    (f2->smallest).DebugString(true).c_str());
            // does not exit; returns the error to the caller
            return Status::Corruption(
                "L" + NumberToString(level) + " have overlapping ranges " +
                (f1->largest).DebugString(true) + " vs. " +
                (f2->smallest).DebugString(true));
          }
        }
...
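The difference between the two versions comes down to aborting versus returning an error value. The following is a minimal sketch of the second style, not RocksDB code: ToyStatus, ToyFile and CheckLevel are hypothetical names, and key ranges are simplified to plain integers.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, minimal stand-in for rocksdb::Status.
struct ToyStatus {
  bool ok;
  std::string msg;
  static ToyStatus OK() { return {true, ""}; }
  static ToyStatus Corruption(const std::string& m) {
    return {false, "Corruption: " + m};
  }
};

// Hypothetical file metadata: the key range is reduced to two integers.
struct ToyFile {
  uint64_t smallest;
  uint64_t largest;
};

// Checks that consecutive files of a level > 0 do not overlap.
// On failure it reports the problem through the return value
// instead of calling abort(), so the caller decides what to do.
ToyStatus CheckLevel(int level, const std::vector<ToyFile>& files) {
  for (std::size_t i = 1; i < files.size(); i++) {
    const ToyFile& f1 = files[i - 1];
    const ToyFile& f2 = files[i];
    if (f1.largest >= f2.smallest) {
      return ToyStatus::Corruption("L" + std::to_string(level) +
                                   " have overlapping ranges");
    }
  }
  return ToyStatus::OK();
}
```

A caller can log the message and refuse to serve the affected store while the process keeps running, which is the behaviour the pull request description asks for.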

The fix has been applied to the following release branches:

 v6.11.4  v6.10.2 v6.10.1 v6.8.1 v6.7.3 v6.6.4 v6.6.3 v6.5.3 v6.5.2

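The L (Luminous) release of Ceph still uses an older RocksDB version by default.

The RocksDB version is printed in the log when an OSD restarts: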
2020-08-10 16:08:22.350071 7f57988afd00  4 rocksdb: RocksDB version: 5.4.0
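When checking several nodes, the version can be pulled out of such a log line programmatically. The helper below is a hypothetical sketch that assumes the line contains the text "RocksDB version: " followed by a major.minor.patch triple, as in the line above.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: extracts major.minor.patch from a log line that
// contains "RocksDB version: X.Y.Z". Returns false if the marker is
// missing or the version does not have three numeric components.
bool ParseRocksDBVersion(const std::string& line, int* major, int* minor,
                         int* patch) {
  const std::string marker = "RocksDB version: ";
  const std::string::size_type pos = line.find(marker);
  if (pos == std::string::npos) {
    return false;
  }
  const char* version = line.c_str() + pos + marker.size();
  return std::sscanf(version, "%d.%d.%d", major, minor, patch) == 3;
}
```

The parsed triple can then be compared against whichever fixed release is being targeted.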

IV. Summary

Upgrading to a newer RocksDB version that contains the fix is recommended.

V. Workaround:

  1. SSH to node-3 and back up the mon database files on that node
cd /var/lib/ark/ceph/ceph/mon/mon/
mv ceph-node-3 bak.ceph-node-3

  2. SSH to the primary mon node and copy its mon files to node-3

scp -rp /var/lib/ceph//mon/ceph-node-1 node-3:/var/lib/ceph//mon/ceph-node-3

In a k8s deployment, the local mon directory may be /var/lib/ark/ceph/ceph/mon/mon/

  3. Restart the mon service.


Original article: https://blog.51cto.com/wendashuai/2518835
