前言
目前,我们大多数的应用都是没有专门接入ANR监控框架的,所以,本文的讲解主要是以借助log和trace的方式进行ANR问题的分析。
一.ANR问题分类
大分类 | 小分类 | Reason/Subject | 超时时间 | 原因 |
输入无响应 | 输入事件未响应 | Input dispatching timed out (...) | 5S | 具体细分的话()中有6种不同的类型。 主线程阻塞,输入的事件5S得不到响应。 |
界面无焦点 | Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.) | 5S | 应用的Activity进入到了resume周期,但是window却无法进行刷新操作。 | |
Broadcast超时 | 有序广播超时 | Broadcast of Intent(...) | 10S | 分发有序广播时,处理单个广播接收者的时间超过了10S。 |
Service超时 | 启动超时 | executing service (...) | 前台服务:20S 后台服务:200S | Application.onCreate()+Service.onCreate()+Application.onStartCommand()耗时大于超时时间 |
前台服务未绑定 | Context.startForegroundService() did not then call Service.startForeground(): | 0版本:10S 13(含)+版本:30S | Application.onCreate()+Service.onCreate()+Application.onStartCommand()耗时大于超时时间 | |
Provider超时 | Provider连接超时 | ContentProvider not responding | 应用自身所设定 | 连接时间超过应用所设定时间 |
其中输入事件未响应可以细分为6种类型,如下:
InputDispatcher.cpp::checkWindowReadyForMoreInputLocked
细分类型 | 备注 |
Waiting because the touched window is paused. | 窗口被暂停导致无法接收新的输入事件 |
Waiting because the touched window's input channel is not registered with the input dispatcher. The window may be in the process of being removed. | 窗口输入信号渠道未注册,此时有可能应用被杀死 |
Waiting because the %s window's input connection is %s. The window may be in the process of being removed. | 输入信号的FD文件未注册,此时有可能window被移除 |
Waiting because the %s window's input channel is full. Outbound queue length: %d. Wait queue length: %d. | 管道已满 |
Waiting to send key event because the %s window has not finished processing all of the input events that were previously delivered to it. Outbound queue length: %d. Wait queue length: %d. | input发出的输入事件还未收到应用的响应,并且是key类型事件。比如返回就是key类型事件。 Outbound queue代表发送给应用的事件。 Wait queue代表还未发送给应用但是进入排队的事件。 |
Waiting to send non-key event because the %s window has not finished processing certain input events that were delivered to it over %0.1fms ago. Wait queue length: %d. Wait queue head age: %0.1fms. | input发出的输入事件还未收到应用的响应,并且是普通类型事件。比如屏幕点击就是普通类型。 Wait queue代表还未发送给应用但是进入排队的事件,最早的一个已经等待Nms |
二.ANR相关文件介绍
2.1 三份异常文件
出现ANR时,一般会有三份记录文件:
logcat日志文件
data/anr/目录下的anr_2023-05-06-00-58-49-976的txt文件。
data/system/dropbox目录下的system_app_anr@1688888.txt.gz的压缩文件。
其中,第三份文件算是第二份的详细说明版,所以第二份文件我们一般可以不看。只看第一份和第三份即可。
2.2 system_app_anr文件介绍
system_app_anr文件样例如下:
Process: com.xt.clientPID: 1816Flags: 0x38d8be45Package: com.xt.client v1 (1.0.1)Foreground: YesActivity: com.xt.client.activitys/.TestActivityParent-Activity: com.xt.client.activitys/.MainActivitySubject: Input dispatching timed out (Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago. Wait queue length: 2. Wait queue head age: 5504.5ms.)Build: qti/gwm/gwm:9/PQ1A.190100.004/xt04260513:userdebug/test-keysCPU usage from 0ms to 12069ms later (2023-04-28 03:56:53.297 to 2023-04-28 03:57:05.366):52% 1131/system_server: 27% user + 24% kernel / faults: 5828 minor 6 major...4.4% 521/android.hardware.sensors@1.0-service: 0.6% user + 3.8% kernel / faults: 1 minor0% 130/kswapd0: 0% user + 0% kernel0.4% 390/vendor.ts.someip@1.0-service: 0.3% user + 0.1% kernel0.4% 1668/com.android.phone: 0.4% user + 0% kernel / faults: 501 minor...32% TOTAL: 17% user + 13% kernel + 0% iowait + 1.1% irq + 0.4% softirq----- pid 1816 at 2023-04-28 03:56:53 -----Cmd line: com.xt.clientBuild fingerprint: 'qti/gwm/gwm:9/PQ1A.190105.004/xt04260513:userdebug/test-keys'ABI: 'arm64'Build type: optimizedZygote loaded classes=10642 post zygote classes=11256Intern table: 88073 strong; 400 weakJNI: CheckJNI is off; globals=3029 (plus 106 weak)Libraries: /system/app/XTMap-app-BUX20_M83/XTMap-app-BUX2.apk!/lib/arm64-v8a/libDexHelper.so (1)//加载的SOHeap: 33% free, 23MB/35MB; 462206 objects //Java内存状态,用来排查是否是GC引起Dumping cumulative Gc timingsStart Dumping histograms for 1002 iterations for concurrent copying...Done Dumping histograms---对内存的一些描述开始---Total bytes allocated 15GBTotal bytes freed 15GBFree memory 11MBFree memory until GC 11MBFree memory until OOME 360MBTotal memory 35MBMax memory 384MBZygote space size 1176KBTotal mutator paused time: 235.004msTotal time waiting for GC to complete: 352.491msTotal GC count: 1002Total GC time: 110.770sTotal blocking GC count: 3Total blocking GC time: 436.933msHistogram of GC count per 10000 ms: 0:2876,1:970,2:8,3:5Histogram of blocking GC count per 10000 ms: 0:3856,1:3---对内存的一些描述结束---Registered native bytes allocated: 24074579...suspend all histogram: Sum: 198.094ms 99% C.I. 10.341us-4065.279us Avg: 195.166us Max: 85963us //挂起状态描述//一般来说,第一个是信号量线程,第二个主线程DALVIK THREADS (105):"Signal Catcher" daemon prio=5 tid=3 Runnable | group="system" sCount=0 dsCount=0 flags=0 obj=0x13700088 self=0x7c55416400 | sysTid=1828 nice=0 cgrp=default sched=0/0 handle=0x7c4ecde4f0 | state=R schedstat=( 3884517931 270873627 2516 ) utm=224 stm=164 core=1 HZ=100 | stack=0x7c4ebe3000-0x7c4ebe5000 stackSize=1009KB | held mutexes= "mutator lock"(shared held) native: #00 pc 00000000003c162c /system/lib64/libart.so (offset 15c000) (art::DumpNativeStack(std::__1::basic_ostream>&, int, BacktraceMap*, char const*, art::ArtMethod*, void*, bool)+220) ..."main" prio=5 tid=1 Native | group="main" sCount=1 dsCount=0 flags=1 obj=0x75c2fb08 self=0x7c55414c00 | sysTid=1816 nice=0 cgrp=default sched=0/0 handle=0x7cdaf79550 | state=S schedstat=( 340898551613 262139988290 2410472 ) utm=17556 stm=16533 core=4 HZ=100 //启动后,用户态时间占用17556*10=170秒,内核态160秒 | stack=0x7fdf117000-0x7fdf119000 stackSize=8MB | held mutexes= kernel: (couldn't read /proc/self/task/1816/stack) native: #00 pc 000000000001f06c /system/lib64/libc.so (syscall+28) native: #01 pc 00000000000221d0 /system/lib64/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+140) native: #02 pc 0000000000080f4c /system/lib64/libc.so (pthread_cond_wait+60) native: #03 pc 000000000047f1fc /system/lib64/libhwui.so (android::uirenderer::renderthread::DrawFrameTask::postAndWait()+260) native: #04 pc 000000000047f0c8 /system/lib64/libhwui.so (android::uirenderer::renderthread::DrawFrameTask::drawFrame()+44) at android.view.ThreadedRenderer.nSyncAndDrawFrame(Native method) at android.view.ThreadedRenderer.draw(ThreadedRenderer.java:823) at android.view.ViewRootImpl.draw(ViewRootImpl.java:3308) at android.view.ViewRootImpl.performDraw(ViewRootImpl.java:3112) at android.view.ViewRootImpl.performTraversals(ViewRootImpl.java:2481) at android.view.ViewRootImpl.doTraversal(ViewRootImpl.java:1457) at android.view.ViewRootImpl$TraversalRunnable.run(ViewRootImpl.java:7198) at android.view.Choreographer$CallbackRecord.run(Choreographer.java:949) at android.view.Choreographer.doCallbacks(Choreographer.java:761) at android.view.Choreographer.doFrame(Choreographer.java:696) at android.view.Choreographer$FrameDisplayEventReceiver.run(Choreographer.java:935) at android.os.Handler.handleCallback(Handler.java:873) at android.os.Handler.dispatchMessage(Handler.java:99) at android.os.Looper.loop(Looper.java:193) at android.app.ActivityThread.main(ActivityThread.java:6734) at java.lang.reflect.Method.invoke(Native method) at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:506) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:858)
字段 | 行数 | 介绍 | 备注 |
Process | 1 | 发生ANR的进程名 (区分应用包名) | com.xt.client |
PID | 2 | 进程ID | 1816 进程ID可以通过 adb shell top查看 |
Flags | 3 | ||
Package | 4 | 应用包名/版本号/版本code | com.xt.client v1 (1.0.1) |
Foreground | 5 | 是否前台 YES=前台,FALSE=后台 | Foreground: Yes |
Activity | 6 | 发生ANR的页面名 | |
Parent-Activity | 7 | 发生ANR的页面由谁唤起的 | |
Subject | 8 | ANR的原因分类,类型可以参考第一章 | |
Build | 9 | 系统版本 | |
CPU | 11 | CPU使用率 | from 0ms to 12069ms later指的是从发生ANR的时刻开始统计,到ANR发生之后的12069ms结束,这期间的CPU使用统计。所以这里的CPU使用率是ANR发生之后的,只能做参考 |
user/kernel/faults | 12 | 应用CPU占用率 | 52% 1131/system_server: 27% user + 24% kernel / faults: 5828 minor 6 major 代表1131这个应用的CPU占用率是52%(按单核算的),其中用户态27%,内核态24%。 faults代表内存缺页,分为两个等级minor——轻微的,major——重度,需要从磁盘拿数据 system_server也属于一个应用,只不过是系统应用而已。同理surfaceflinger也一样。 |
TOTAL | 19 | 总CPU使用率的统计 | 32% TOTAL: 17% user + 13% kernel + 0% iowait + 1.1% irq + 0.4% softirq 这里的32%是按照四核算的,也是就是说上面的CPU使用率之和为32%*4=128% 其中17%属于用户态,13属于内核态,IO等待0%,1.1%属于硬中断,0.4%属于软中断 |
utm/stm | 61 | 线程CPU占用 | state=S schedstat=( 340898551613 262139988290 2410472 ) utm=17556 stm=16533 core=4 HZ=100 HZ代表100hz,则每个时间片就是1000ms/100=10ms。 启动后,utm代表用户态时间占用17556时间片,约为175秒,内核态165秒。 core=4代表最后一次执行是在4号CPU |
2.2 异常文件查看
先看一下system_app_anr文件,这份文件,可以让我们对ANR异常有一个初步的了解。
1.看原因
查看Subject/主题
Subject: Context.startForegroundService() did not then call Service.startForeground(): ServiceRecord{7cac113 u0 com.gwm.app.dock.devices/com.gwm.app.dock.service.pool.DockRemotePoolService}
属于前台服务未绑定的超时,就是说service启动之后,在10S内没有去绑定前台UI(这里要区分版本的),这时候系统会通知APP去抛出异常,但是因为APP主线程阻塞,超过10S以上无法处理这个异常,就会提示这种类型的ANR错误。
所以,参考第一章的表格,我们可以做一个初步推断,service启动后10S内没有绑定前台视图。属于前台服务未绑定的类型,则对应的可以按照3.5中的步骤去排查。
2.看负载
首先,看ANR附近的负载
查看从发生ANR之后的5319ms内,每个应用的CPU占用率
CPU usage from 0ms to 5319ms later (2023-05-06 00:58:44.906 to 2023-05-06 00:58:50.225):11% 537/vendor.bosch.hardware.receiverbasesoftware@1.0-service: 9.5% user + 2% kernel / faults: 1921 minor8.4% 1087/system_server: 4% user + 4.4% kernel / faults: 1508 minor6.8% 524/surfaceflinger: 3.3% user + 3.5% kernel / faults: 109 minor5.3% 491/android.hardware.sensors@1.0-service: 0.5% user + 4.8% kernel...1.6% 4350/com.xt.client.TestService: 0.9% user + 0.7% kernel / faults: 1995 minor 14 major41% TOTAL: 23% user + 16% kernel + 0% iowait + 1.5% irq + 0.3% softirq
这里的整体CPU的使用率并不高,其中com.gwm.app.dock.service.dockservice的用户态使用率0.9%,内核态0.7%。
如果CPU占用率较高的话,则有可能是因为资源抢占的原因导致的。
41% TOTAL: 23% user + 16% kernel + 0% iowait + 1.5% irq + 0.3% softirq
总负载41%(如果是4核的,总负载已经把上面的所有累加值除以4了),23%是用户态,16%是内核态,io等待占比0%,irq:硬中断,softirq:软中断。
超过70%的总负载,可以认为是高负载。
其次,还可以看一下应用启动后的负载,甚至手机启动后的负载。
因为ANR之后捕获的是CPU占用率较高+最近使用过的应用,所以可以查看他们所有线程的时间片占用,来评估对应用启动后的CPU时间占用。
3.看线程
"main" prio=5 tid=1 Native | group="main" sCount=1 dsCount=0 flags=1 obj=0x748afb08 self=0x78dd014c00 | sysTid=4350 nice=0 cgrp=default sched=0/0 handle=0x7962bd2550 | state=R schedstat=( 45985468 60216099 94 ) utm=0 stm=4 core=0 HZ=100 | stack=0x7fd7678000-0x7fd767a000 stackSize=8MB | held mutexes= kernel: (couldn't read /proc/self/task/4350/stack) native: #00 pc 00000000000c993c /system/bin/linker64 (__dl_mprotect+8) native: #01 pc 000000000002172c /system/bin/linker64 (__dl__ZN20LinkerBlockAllocator11protect_allEi+36) native: #02 pc 0000000000014fd0 /system/bin/linker64 (__dl__ZN18ProtectedDataGuardC1Ev+44) native: #03 pc 0000000000018520 /system/bin/linker64 (__dl__Z9do_dlopenPKciPK17android_dlextinfoPKv+1456) native: #04 pc 0000000000014244 /system/bin/linker64 (__loader_android_dlopen_ext+72) native: #05 pc 0000000000000fc4 /system/lib64/libdl.so (android_dlopen_ext+12) native: #06 pc 0000000000003b68 /system/lib64/libnativeloader.so (android::OpenNativeLibrary(_JNIEnv*, int, char const*, _jobject*, _jstring*, bool*, std::__1::basic_string, std::__1::allocator>*)+384) native: #07 pc 00000000002e85c8 /system/lib64/libart.so (art::JavaVMExt::LoadNativeLibrary(_JNIEnv*, std::__1::basic_string, std::__1::allocator> const&, _jobject*, std::__1::basic_string, std::__1::allocator>*)+2144) native: #08 pc 0000000000004038 /system/lib64/libopenjdkjvm.so (JVM_NativeLoad+412) at java.lang.Runtime.nativeLoad(Native method) at java.lang.Runtime.loadLibrary0(Runtime.java:1014) - locked <0x025e5a64> (a java.lang.Runtime) at java.lang.System.loadLibrary(System.java:1669) at com.tencent.mmkv.MMKV.p(MMKV.java:13) at com.xt.client.TestService.onCreate(DockRemotePoolService.java:5) at android.app.ActivityThread.handleCreateService(ActivityThread.java:3570) at android.app.ActivityThread.access$1300(ActivityThread.java:200) at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1672) at android.os.Handler.dispatchMessage(Handler.java:106) at android.os.Looper.loop(Looper.java:193) at android.app.ActivityThread.main(ActivityThread.java:6734) at java.lang.reflect.Method.invoke(Native method) at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:506) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:858)
这里可以看到,最终ANR之后,系统捕获到的主线程堆栈位于Service.onCreate()方法中,所以很有可能是Service的onCreate方法耗时导致的ANR。
当然并不绝对,ANR是主线阻塞时间累计达到了超时时间,比如之前的方法耗时4.9S,onCreate耗时0.2秒也会超时,但是很明显onCreate并不是主因。并且这里还会出现堆栈漂移的情况,所以这里的堆栈是一个很重要的参考,但并不能100%认定。
2.3 日志介绍
时间范围
再来看一下log日志,因为上面提到,是service启动相关的原因,所以我们的log日志范围可以截取从启动service开始,到打印ANR信息之间的所有日志。
如果要更精准一点的话,可以只筛选相关进程的日志,比如应用进程/system_serve进程/关联进程比如launcher等。
//启动service05-06 00:58:34.880 1087 1104 I am_proc_start : [0,4350,1000,com.xt.client.TestService,service,com.xt.client/com.xt.client.TestService]...//系统发现service超时05-06 00:58:44.922 1087 1102 I am_anr : [0,4350,com.xt.client.TestService,550026821,Context.startForegroundService() did not then call Service.startForeground(): ServiceRecord{7cac113 u0 com.xt.client/com.xt.client.TestService}]...//打印ANR信息05-06 00:58:50.350 1087 1102 E ActivityManager : ANR in com.xt.client.TestService05-06 00:58:50.350 1087 1102 E ActivityManager : PID: 435005-06 00:58:50.350 1087 1102 E ActivityManager : Reason: Context.startForegroundService() did not then call Service.startForeground(): ServiceRecord{7cac113 u0 com.xt.client/com.xt.client.TestService}05-06 00:58:50.350 1087 1102 E ActivityManager : 0.1% 290/kworker/1:1H: 0% user + 0.1%05-06 00:58:50.351 1087 1099 W ActivityManager : Unable to start service Intent { act=com.tencent.wecarspeech.vframework.service.WecarService pkg=com.tencent.wecarspeech } U=0: not found callingPid :4274
关联信息
另外,我们还可以搜索一下,ANR时间点附近,是否有其它的应用也发生了ANR,如果有,则说明有可能是CPU过载导致的。
三.案例排查讲解
3.1 输入事件未响应
说明:
输入事件未响应主要是输入事件在5S内未能完成响应导致的。未能按照完成响应的主要原因有如下几种可能:
主线程有单个高耗时任务,如IO/wait/sleep等。
主线程有大量主线程的任务,累计耗时较高。
CPU负载较高,导致主线程任务迟迟无法得到调度。
所以针对这一类的问题,我们主要看以下几个指标:
日志主线程的相关日志;
日志中其它进程是否出现ANR或者高耗时的操作日志,比如绘制丢包等等。
data_app_anr中的CPU负载。
案例:
首先我们看data_app_anr文件,有两条比较重要的信息:
Subject: Input dispatching timed out (Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago. Wait queue length: 2. Wait queue head age: 6040.8ms.)...72% 6732/com.gameloft.android.BTCN.GloftA9BT: 60% user + 12% kernel / faults: 10035 minor 28 major
第1行,排队中未处理的输入事件有2条,并且排队时间6000ms。
第3行,ANR之后的20S内,CPU负载率很高,则有理由怀疑ANR之后也是较高的水平。
接下来看日志:
涉及到的进程id:system_server:1130ANR的进程:673206-03 06:57:37.073 6732 6732 W AdapterController === : doClusterVisible set cluster visible : false06-03 06:57:39.938 6732 6814 I SurfaceControlWatchdog : finalize is complete,do not need trace06-03 06:57:51.960 6732 6732 I System.out : MainActivity onPause//这中间整整7秒没有6732进程的日志,直到57秒ANR发生06-03 06:57:58.561 1130 1145 I am_anr : [0,6732,com.gameloft.android.BTCN.GloftA9BT,986201156,Input dispatching timed out (Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago. Wait queue length: 2. Wait queue head age: 6040.8ms.)]
我们可以看到,06-03 06:57:51.960之后的7秒时间内,主线程都没有输出一条日志。结合之前的CPU高负载,则有理由怀疑onPause之后主线程有高耗时操作,从而导致ANR问题的发生。
3.2 应用界面无焦点
说明
针对这一类的问题,需要查看的进程有如下几个:
sytem_server进程,重点查看其ActivityManager,WindowManager,InputDispatcher的日志
发生ANR的应用的日志,主要看起主线程是否阻塞。
案例
ANR的进程PID= 6596
system_server的进程PID=1042
//55最后一次渲染成功05-24 14:33:55.391 1042 1162 I WindowManager : finishDrawingWindow: Window{2068493 u0 SmartMaskView} mDrawState=HAS_DRAWN//setting切换到resume状态05-24 14:33:56.199 6596 6596 W xt_settings : onResume, mCurrentPosition =205-24 14:33:56.199 6596 6596 I am_on_resume_called : [0,com.xt.client.MainActivity,RESUME_ACTIVITY]//多个window排队等待绘制05-24 14:33:56.231 1042 1162 I WindowManager : finishDrawingWindow: Window{211f896 u0 com.xt.client.map/com.xt.client.map.activity.MapMainActivity} mDrawState=DRAW_PENDING05-24 14:33:56.261 1042 2318 I WindowManager : finishDrawingWindow: Window{292de3b u0 com.android.settings/com.android.settings.ui.activity.MainActivity} mDrawState=DRAW_PENDING05-24 14:33:56.332 1042 2301 I WindowManager : finishDrawingWindow: Window{48413fe u0 StatusBar} mDrawState=DRAW_PENDING05-24 14:33:58.591 1042 2301 I WindowManager : finishDrawingWindow: Window{5569848 u0 com.android.systemui} mDrawState=DRAW_PENDING//前台就是setting05-24 14:34:00.820 1042 2301 D ActivityManager : current display id is 0 topActivity is com.android.settings/.ui.activity.MainActivity//6596的主线程正常日志输出,代表主线程并没有阻塞。而且其CPU负载并不高,代表没有执行大量的任务。05-24 14:34:00.266 6596 6596 W xt_settings : getNewIntent path =null05-24 14:34:03.798 1042 5770 D ActivityManager : current display id is 0 topActivity is com.android.settings/.ui.activity.MainActivity//发生ANR了05-24 14:34:04.954 1042 1131 I InputDispatcher : Application is not responding: AppWindowToken{eb57e0c token=Token{373653f ActivityRecord{3766d5e u0 com.android.settings/.ui.activity.MainActivity t291}}}. It has been 5004.0ms since event, 5002.3ms since wait started. Reason: Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.05-24 14:34:04.954 1042 1131 I WindowManager : Input event dispatching timed out sending to application AppWindowToken{eb57e0c token=Token{373653f ActivityRecord{3766d5e u0 com.android.settings/.ui.activity.MainActivity t291}}}. Reason: Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.05-24 14:34:05.032 1042 1059 I am_anr : [0,6596,com.android.settings,684244549,Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)]
所以通过上面的判断,应用页面可以正常的切换到resume状态,但是无法正常绘制,并且主线程并不阻塞,这种情况推测是系统侧渲染进程繁忙或者其它异常导致的,这种问题就可以转给对应的系统开发。
3.3 服务启动超时
说明
这一类的ANR,一般来说是由于Application.onCreate()+Service.onCreate()+Service.onStartCommand()三个生命周期的时间大于超时时间导致的。所以这一类问题,我们应该把关注点放到系统何时通知启动的Service,Service何时创建完成,何时执行完onStartCommand方法。
案例:
这个问题比较明显,system_app_anr日志中直接捕获到了有问题的堆栈:
at android.os.BinderProxy.transactNative(Native method)at android.os.BinderProxy.transact(Binder.java:1129)at com.bangcle.andJni.JniLib1637326792.cV(Native method)at com.xt.qnxservice.adapter.IQnxAdapterService$Stub$Proxy.registerDataChangedListener(IQnxAdapterService.java:-1)at com.xt.qnxservice.clientsdk.XTAdapterClient.checkService(XTAdapterClient.java:304)at com.bangcle.andJni.JniLib1637326792.cV(Native method)at com.xt.qnxservice.clientsdk.XTAdapterClient.init(XTAdapterClient.java:-1)at com.xt.rdhelper.RightDirectionHelperService.initAdapter(RightDirectionHelperService.java:219)at com.xt.rdhelper.RightDirectionHelperService.initTool(RightDirectionHelperService.java:116)at com.xt.rdhelper.RightDirectionHelperService.onCreate(RightDirectionHelperService.java:108)
跨进程通信的时候,binder服务提供者是跑在子线程上的,但是使用者却是主线程,所以很容易导调用者的主线程阻塞,因此,主线程跨进程调用时请慎重。
另外,即使捕获的堆栈定位不到出问题的堆栈,我们也可以往这个方向去排查,筛选system_server和ANR进程的相关日志,比如下面这样。以及关注CPU负载,负载较高会导致应用得不到时间片。
06-02 22:10:06.085 1130 20281 W ActivityManager : start service Intent { act=com.gwm.app.dock.remote pkg=com.gwm.app.dock.devices }
3.4 前台服务未绑定
说明
这个其实和3.3类似,前台服务未绑定是从13以后才和3.3中的异常有了明显的区分。
13以下,前台服务未绑定,一般是由于Application.onCreate()+Service.onCreate()+Service.onStartCommand()三个生命周期的时间>10s,或者就是没有绑定前台视图导致的。
所以排查流程和上面类似,不再赘述。
3.5 有序广播超时
说明
这种类型的ANR,只有两种可能:
有序广播接收者中,单个的接受者处理时间>10s。
整个有序广播处理流程中,累计时间>2*单个超时时间*接受者数量。
第二种场景是不是会有个疑问,这种场景下不是一定会优先触发第一种类型的超时吗?正常情况下却是这样,但是如果system_server也处于主线程阻塞的状态的话,那就不一定了,第二种属于一种异常补偿措施。
所以,大多数情况下,我们只要关注第一种就好了,看一下onReceive何时被通知,何时执行完。同样,也要关注CPU负载。
另外,如果是静态广播的话,还要关注application的时间。
案例
首先我们看一下data_app_anr文件:
PID: 6066UID: 10162Frozen: falseFlags: 0x38e8bf46Package: com.xt.client v1 (1.0)Foreground: YesCPU usage from 84817ms to -1ms ago (2023-06-05 11:45:03.104 to 2023-06-05 11:46:27.921): 3.9% 631/system_server: 2.2% user + 1.7% kernel / faults: 20764 minor 797 major 1.2% 129/kswapd0: 0% user + 1.2% kernel +1.2% 6066/com.xt.client: 0% user + 0% kernel +0% 6100/logcat: 0% user + 0% kernel5.2% TOTAL: 1.7% user + 2.8% kernel + 4.2% iowait + 1.3% irq + 1% softirqSubject: Broadcast of Intent { act=hhhh flg=0x810 }----- pid 6066 at 2023-06-05 11:46:28.263385266+0800 -----..."main" prio=5 tid=1 Native | group="main" sCount=1 ucsCount=0 flags=1 obj=0x71ac6f78 self=0xb400007d1e23a7b0 | sysTid=6066 nice=-10 cgrp=top-app sched=0/0 handle=0x7e6554b4f8 | state=S schedstat=( 808244194 250487416 988 ) utm=49 stm=31 core=3 HZ=100 | stack=0x7fd5329000-0x7fd532b000 stackSize=8188KB | held mutexes= native: #00 pc 00000000000a33b8 /apex/com.android.runtime/lib64/bionic/libc.so (__epoll_pwait+8) (BuildId: 01331f74b0bb2cb958bdc15282b8ec7b) native: #01 pc 0000000000010dfc /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+176) (BuildId: 5a0d720732600c94ad8354a1188e9f52) native: #02 pc 000000000015a56c /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44) (BuildId: a31474ac581b716d4588f8c97eb06009) at android.os.MessageQueue.nativePollOnce(Native method) at android.os.MessageQueue.next(MessageQueue.java:335) at android.os.Looper.loopOnce(Looper.java:161) at android.os.Looper.loop(Looper.java:288) at android.app.ActivityThread.main(ActivityThread.java:7872) at java.lang.reflect.Method.invoke(Native method) at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:548) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:936)
堆栈定位到了nativePollOnce上面,明显是堆栈漂移了。
CPU负载不算太高,但是iowait等待的比较较高,怀疑有大量的IO操作。
接下来看日志
2023-06-05 11:46:16.787 6066-6066 test com.xt.plugin I click dotest2023-06-05 11:46:17.802 4066-4066 test com.xt.client I MyBroadcast,onReceive2023-06-05 11:46:27.970 4066-4066 test com.xt.client I MyBroadcast2,onReceive2023-06-05 11:46:27.970 4066-4066 test com.xt.client I MyBroadcast3,onReceive2023-06-05 11:40:57.457 631-5024 ActivityManager system_process E ANR in com.xt.client PID: 4066 Reason: Broadcast of Intent { act=collect flg=0x810 }
日志里面就比较明显了,17秒的时候MyBroadcast收到通知,直到27秒MyBroadcast2才收到通知,中间差了整整10S。这10S最大的可能性是应用侧主线程阻塞导致的,结合上面的IO占比,所以怀疑MyBroadcast之后有大量的IO操作阻塞了主线程。
3.6 Provider连接超时
说明
可以说这种ANR超时,是人为造出来的,对于应用来说,只有主动设置超时时间,才会出现这种类型的ANR。95%的开发应该都懒的进行设置。
如果出现,那么一定是provider的提供者服务超时,这种问题就要让provider提供者帮忙排查了。
3.7 总结
ANR类型 | 优先处理人 | 处理方案 |
输入事件未响应 | 应用侧优先排查 | 重点查看主线程是否阻塞 |
应用界面无焦点 | 系统侧优先排查 | 查看是否正常完成渲染 |
有序广播超时 | 应用侧优先排查 | 排查onCreate/onReceive等方法总耗时 |
服务启动超时 | 应用侧优先排查 | 排查onCreate/onStartCommand等方法总耗时 |
前台服务未绑定 | 应用侧优先排查 | 排查onCreate/onStartCommand等方法总耗时 |
Provider连接超时 | provider提供方排查 | 这个是提供方线程耗时操作导致的 |
四.相关资料
Logcat使用:
新版logcat最全使用指南_logcat筛选_失落夏天的博客-CSDN博客
ANR原理:
ANR系列之四:ContentProvider类型ANR产生原理讲解
资料参考:
来源地址:https://blog.csdn.net/AA5279AA/article/details/131048598