Android RIL Debugging Analysis: Frequent Data Disconnections
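- 1. Problem Description
- 2. Problem Analysis
- 2.1 Initial RIL Analysis
- 2.2 Data Service Retry Mechanism: Walkthrough and Analysis
- 2.3 Analysis of Network Interface Statistics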
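1. Problem Description
When using an SDX55 module with the corresponding RIL that supports NSA/SA networks, the data service frequently disconnects and reconnects, which degrades the user experience.
2. Problem Analysis
2.1 Initial RIL Analysis
An initial look at the RIL log shows that every disconnection has the same cause, visible in the following log excerpt: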
07-31 15:06:42.594 D/DCT ( 1617): [0]Data stall alarm
07-31 15:06:42.594 D/DCT ( 1617): [0]onActionIntentDataStallAlarm: action=com.android.internal.telephony.data-stall
07-31 15:06:42.596 D/DCT ( 1617): [0]updateDataStallInfo: mDataStallTxRxSum={txSum=0 rxSum=0} preTxRxSum={txSum=0 rxSum=0}
07-31 15:06:42.596 D/DCT ( 1617): [0]updateDataStallInfo: NO IN/OUT
07-31 15:06:42.597 D/DCT ( 1617): [0]getRecoveryAction: 1
07-31 15:06:42.597 D/DCT ( 1617): [0]onDataStallAlarm: tag=14890 do recovery action=1
07-31 15:06:42.597 D/DCT ( 1617): [0]getRecoveryAction: 1
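This log shows that the disconnect-and-reconnect cycle is caused by the data recovery mechanism in the Android framework. What triggers that recovery is still unknown, so we need to understand how the mechanism works and under which conditions it fires before we can analyze and fix the problem.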
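When the log is long, it can help to extract the recovery events mechanically. The following is a minimal sketch, assuming the log lines keep the `onDataStallAlarm: tag=... do recovery action=...` format seen in the excerpt; the class name `DataStallLogScanner` is hypothetical and not part of Android.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts data-stall recovery actions from logcat text. */
class DataStallLogScanner {
    // Matches lines such as "onDataStallAlarm: tag=14890 do recovery action=1".
    // Assumption: the log format is the one shown in the excerpt above.
    private static final Pattern RECOVERY =
            Pattern.compile("onDataStallAlarm: tag=(\\d+) do recovery action=(\\d+)");

    /** Returns the recovery action numbers in the order they appear in the log. */
    static List<Integer> recoveryActions(List<String> logLines) {
        List<Integer> actions = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = RECOVERY.matcher(line);
            if (m.find()) {
                actions.add(Integer.parseInt(m.group(2)));
            }
        }
        return actions;
    }
}
```

Feeding the whole log through `recoveryActions` gives the sequence of recovery actions, which makes it easy to see how often the recovery fires and whether it escalates.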
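2.2 Data Service Retry Mechanism: Walkthrough and Analysis
In the telephony framework, data connection error handling generally falls into three cases:
- SETUP_DATA_CALL returns an error
- The DATA_CALL_LIST reported by the modem contains an error code or an interrupted connection
- No uplink/downlink data (TX/RX) for a period of time
The handling of each case is described below.
- 1. SETUP_DATA_CALL failure: after DataConnection receives the SETUP_DATA_CALL result, it notifies DcTracker with a Message: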
protected void onDataSetupComplete(AsyncResult ar) {
    if (ar.exception == null) { // connection succeeded
    } else {
        ...
        // Count permanent failures; this affects the decision in onDataSetupCompleteError
        if (isPermanentFail(cause)) apnContext.decWaitingApnsPermFailCount();
        // Remove the failed APN from the waiting list
        apnContext.removeWaitingApn(apnContext.getApnSetting());
        onDataSetupCompleteError(ar); // continue handling the error
        ...
    }
}
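The error handling logic is:
- If every waiting APN in the apnContext has failed, and not all of them failed permanently, set a delay and restart the connection attempt
- If the apnContext still has waiting APNs that have not been tried, set a delay and try the next APN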
/**
 * Error has occurred during the SETUP {aka bringUP} request and the DCT
 * should either try the next waiting APN or start over from the
 * beginning if the list is empty. Between each SETUP request there will
 * be a delay defined by {@link #getApnDelay()}.
 */
@Override
protected void onDataSetupCompleteError(AsyncResult ar) {
    String reason = "";
    ApnContext apnContext = getValidApnContext(ar, "onDataSetupCompleteError");
    if (apnContext == null) return;
    // All APNs have been tried
    if (apnContext.getWaitingApns().isEmpty()) {
        apnContext.setState(DctConstants.State.FAILED); // mark the apnContext as FAILED
        mPhone.notifyDataConnection(Phone.REASON_APN_FAILED, apnContext.getApnType());
        // Clear the DataConnection
        apnContext.setDataConnectionAc(null);
        // If every APN hit a permanent failure, do not retry
        if (apnContext.getWaitingApnsPermFailCount() == 0) {
            if (DBG) {
                log("onDataSetupCompleteError: All APN's had permanent failures, stop retrying");
            }
        } else { // retry
            int delay = getApnDelay(Phone.REASON_APN_FAILED);
            if (DBG) {
                log("onDataSetupCompleteError: Not all APN's had permanent failures delay=" + delay);
            }
            startAlarmForRestartTrySetup(delay, apnContext);
        }
    } else { // some waiting APNs have not been tried yet; try the next one
        if (DBG) log("onDataSetupCompleteError: Try next APN");
        apnContext.setState(DctConstants.State.SCANNING);
        // Wait a bit before trying the next APN, so that
        // we're not tying up the RIL command channel
        startAlarmForReconnect(getApnDelay(Phone.REASON_APN_FAILED), apnContext); // try the next APN
    }
}
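The branching in `onDataSetupCompleteError` can be condensed into a small decision function. This is a simplified sketch rather than framework code: `ApnRetryDecision` and `NextStep` are hypothetical names, and it assumes the permanent-failure counter starts at the number of waiting APNs and is decremented once per permanent failure, so that zero means every APN failed permanently.

```java
/** Hypothetical, simplified model of the APN retry decision described above. */
class ApnRetryDecision {
    enum NextStep { TRY_NEXT_APN, RESTART_FROM_FIRST_APN, STOP_RETRYING }

    /**
     * @param remainingWaitingApns number of waiting APNs not yet tried
     * @param permFailCountdown    countdown mirroring getWaitingApnsPermFailCount();
     *                             assumed to reach 0 only when every APN failed permanently
     */
    static NextStep decide(int remainingWaitingApns, int permFailCountdown) {
        if (remainingWaitingApns > 0) {
            return NextStep.TRY_NEXT_APN;          // state becomes SCANNING
        }
        if (permFailCountdown == 0) {
            return NextStep.STOP_RETRYING;         // all failures were permanent
        }
        return NextStep.RESTART_FROM_FIRST_APN;    // state is FAILED, retry after a delay
    }
}
```

For example, with two APNs where only one failed permanently, the list is empty but the countdown is still 1, so the whole sequence is restarted after the delay.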
- Appendix: all ApnContext states
/**
 * IDLE: ready to start data connection setup, default state
 * CONNECTING: state of issued startPppd() but not finish yet
 * SCANNING: data connection fails with one apn but other apns are available
 *           ready to start data connection on other apns (before INITING)
 * CONNECTED: IP connection is setup
 * DISCONNECTING: Connection.disconnect() has been called, but PDP
 *                context is not yet deactivated
 * FAILED: data connection fail for all apns settings
 * RETRYING: data connection failed but we're going to retry.
 *
 * getDataConnectionState() maps State to DataState
 *      FAILED or IDLE : DISCONNECTED
 *      RETRYING or CONNECTING or SCANNING: CONNECTING
 *      CONNECTED : CONNECTED or DISCONNECTING
 */
public enum State {
    IDLE,
    CONNECTING,
    SCANNING,
    CONNECTED,
    DISCONNECTING,
    FAILED,
    RETRYING
}
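The State to DataState mapping described in the comment can be written out as a switch. This is an illustrative sketch only: the enums are redeclared locally, `ApnStateMapping` is a hypothetical name, and the local `DataState` is reduced to the three values the comment mentions.

```java
/** Sketch of the State -> DataState mapping listed in the comment above. */
class ApnStateMapping {
    enum State { IDLE, CONNECTING, SCANNING, CONNECTED, DISCONNECTING, FAILED, RETRYING }

    // Simplified: only the three values named in the comment.
    enum DataState { DISCONNECTED, CONNECTING, CONNECTED }

    static DataState toDataState(State state) {
        switch (state) {
            case RETRYING:
            case CONNECTING:
            case SCANNING:
                return DataState.CONNECTING;
            case CONNECTED:
            case DISCONNECTING:
                return DataState.CONNECTED;
            case FAILED:
            case IDLE:
            default:
                return DataState.DISCONNECTED;
        }
    }
}
```

Note that SCANNING and RETRYING both surface as CONNECTING, so an observer of the data state alone cannot tell a first attempt from a retry.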
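- 2. Connection interruption: DcController listens for the RIL_UNSOL_DATA_CALL_LIST_CHANGED message to receive updates for every data connection:
mPhone.mCi.registerForDataNetworkStateChanged(getHandler(), DataConnection.EVENT_DATA_STATE_CHANGED, null);
(1) When RIL reports DATA_CALL_LIST_CHANGED, it carries the modem's current data call list. DcController compares this list with the framework's active list:
1) Connections that are lost or disconnected are retried
2) Connections that have changed, or that hit a permanent error, are cleaned up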
private void onDataStateChanged(ArrayList<DataCallResponse> dcsList) {
    // Create hashmap of cid to DataCallResponse
    HashMap<Integer, DataCallResponse> dataCallResponseListByCid =
            new HashMap<Integer, DataCallResponse>();
    for (DataCallResponse dcs : dcsList) {
        dataCallResponseListByCid.put(dcs.cid, dcs);
    }
    // If an active connection is missing from the reported dcsList,
    // treat it as lost and add it to the retry list
    ArrayList<DataConnection> dcsToRetry = new ArrayList<DataConnection>();
    for (DataConnection dc : mDcListActiveByCid.values()) {
        if (dataCallResponseListByCid.get(dc.mCid) == null) {
            if (DBG) log("onDataStateChanged: add to retry dc=" + dc);
            dcsToRetry.add(dc);
        }
    }
    // Find which connections have changed state and send a notification or cleanup
    // and any that are in active need to be retried.
    ArrayList<ApnContext> apnsToCleanup = new ArrayList<ApnContext>();
    boolean isAnyDataCallDormant = false;
    boolean isAnyDataCallActive = false;
    for (DataCallResponse newState : dcsList) {
        DataConnection dc = mDcListActiveByCid.get(newState.cid);
        // A connection missing from the active map has not been synced to the
        // framework yet; it is handled elsewhere.
        if (dc == null) {
            // UNSOL_DATA_CALL_LIST_CHANGED arrived before SETUP_DATA_CALL completed.
            loge("onDataStateChanged: no associated DC yet, ignore");
            continue;
        }
        if (dc.mApnContexts.size() == 0) {
            if (DBG) loge("onDataStateChanged: no connected apns, ignore");
        } else {
            // Determine if the connection/apnContext should be cleaned up
            // or just a notification should be sent out.
            if (newState.active == DATA_CONNECTION_ACTIVE_PH_LINK_INACTIVE) {
                // Connection is INACTIVE: handle it according to the failure type
                DcFailCause failCause = DcFailCause.fromInt(newState.status);
                if (failCause.isRestartRadioFail()) {
                    // Recovery requires a radio restart
                    mDct.sendRestartRadio();
                } else if (mDct.isPermanentFail(failCause)) {
                    // Unrecoverable error on this connection: clean it up
                    apnsToCleanup.addAll(dc.mApnContexts.keySet());
                } else {
                    for (ApnContext apnContext : dc.mApnContexts.keySet()) {
                        if (apnContext.isEnabled()) {
                            // The APN is enabled: retry
                            dcsToRetry.add(dc);
                            break;
                        } else {
                            // The APN is disabled: clean it up
                            apnsToCleanup.add(apnContext);
                        }
                    }
                }
            } else {
                // Check whether the LinkProperties changed
                UpdateLinkPropertyResult result = dc.updateLinkProperty(newState);
                if (result.oldLp.equals(result.newLp)) {
                    if (DBG) log("onDataStateChanged: no change");
                } else {
                    // Check whether the interface is still the same
                    if (result.oldLp.isIdenticalInterfaceName(result.newLp)) {
                        if (!result.oldLp.isIdenticalDnses(result.newLp)
                                || !result.oldLp.isIdenticalRoutes(result.newLp)
                                || !result.oldLp.isIdenticalHttpProxy(result.newLp)
                                || !result.oldLp.isIdenticalAddresses(result.newLp)) {
                            // If the same address type was removed and
                            // added we need to cleanup
                            CompareResult<LinkAddress> car =
                                    result.oldLp.compareAddresses(result.newLp);
                            if (DBG) {
                                log("onDataStateChanged: oldLp=" + result.oldLp
                                        + " newLp=" + result.newLp + " car=" + car);
                            }
                            boolean needToClean = false;
                            // If an address changed, the old connection must be cleaned up
                            for (LinkAddress added : car.added) {
                                for (LinkAddress removed : car.removed) {
                                    if (NetworkUtils.addressTypeMatches(
                                            removed.getAddress(),
                                            added.getAddress())) {
                                        needToClean = true;
                                        break;
                                    }
                                }
                            }
                            if (needToClean) {
                                apnsToCleanup.addAll(dc.mApnContexts.keySet());
                            } else {
                                if (DBG) log("onDataStateChanged: simple change");
                                // Other LinkProperties changes only trigger a notification
                                for (ApnContext apnContext : dc.mApnContexts.keySet()) {
                                    mPhone.notifyDataConnection(
                                            PhoneConstants.REASON_LINK_PROPERTIES_CHANGED,
                                            apnContext.getApnType());
                                }
                            }
                        } else {
                            if (DBG) {
                                log("onDataStateChanged: no changes");
                            }
                        }
                    } else {
                        // The interface changed: clean up the old connection
                        apnsToCleanup.addAll(dc.mApnContexts.keySet());
                        if (DBG) {
                            log("onDataStateChanged: interface change, cleanup apns="
                                    + dc.mApnContexts);
                        }
                    }
                }
            }
        }
        ...
    }
    ...
    // Clean up connections
    for (ApnContext apnContext : apnsToCleanup) {
        mDct.sendCleanUpConnection(true, apnContext);
    }
    // Notify each DataConnection that the link was lost so it can reconnect
    for (DataConnection dc : dcsToRetry) {
        dc.sendMessage(DataConnection.EVENT_LOST_CONNECTION, dc.mTag);
    }
}
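The first step of `onDataStateChanged`, detecting connections the modem no longer reports, is essentially a set difference keyed by cid. A minimal sketch of that idea, using plain integers in place of `DataConnection` and `DataCallResponse` objects (the class `LostConnectionFinder` is hypothetical):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper illustrating the "active but not reported" check. */
class LostConnectionFinder {
    /** Returns the active cids that are missing from the modem's reported list. */
    static List<Integer> findLostCids(Collection<Integer> activeCids,
                                      Collection<Integer> reportedCids) {
        Set<Integer> reported = new HashSet<>(reportedCids);
        List<Integer> lost = new ArrayList<>();
        for (Integer cid : activeCids) {
            if (!reported.contains(cid)) {
                lost.add(cid); // the framework would queue this connection for retry
            }
        }
        return lost;
    }
}
```

Connections that are still reported go through the second loop instead, where their active flag, failure cause, and link properties decide between retry, cleanup, and a plain notification.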
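(2) When the DataConnection ActiveState receives the LOST_CONNECTION message:
1) If the retry count has not reached its limit, it schedules a timed retry and transitions to RetryingState
2) If no retry is needed, it transitions to the Inactive state and may notify DcTracker to handle it (onDataSetupCompleteError, see the first case)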
case EVENT_LOST_CONNECTION: {
    if (DBG) {
        log("DcActiveState EVENT_LOST_CONNECTION dc=" + DataConnection.this);
    }
    if (mRetryManager.isRetryNeeded()) {
        // We're going to retry
        int delayMillis = mRetryManager.getRetryTimer();
        // Schedule the retry
        mDcRetryAlarmController.startRetryAlarm(EVENT_RETRY_CONNECTION, mTag, delayMillis);
        transitionTo(mRetryingState);
    } else {
        mInactiveState.setEnterNotificationParams(DcFailCause.LOST_CONNECTION);
        transitionTo(mInactiveState);
    }
    retVal = HANDLED;
    break;
}
(3) When RetryingState receives the RETRY message, it starts the connection and transitions to ActivatingState
case EVENT_RETRY_CONNECTION: {
    if (msg.arg1 == mTag) {
        mRetryManager.increaseRetryCount(); // count this retry
        onConnect(mConnectionParams);       // start connecting
        transitionTo(mActivatingState);     // switch to ActivatingState
    } else {
        if (DBG) {
            log("DcRetryingState stale EVENT_RETRY_CONNECTION"
                    + " tag:" + msg.arg1 + " != mTag:" + mTag);
        }
    }
    retVal = HANDLED;
    break;
}
(4) RetryManager keeps the retry-related counters:
public boolean isRetryNeeded() {
    boolean retVal = mRetryForever || (mRetryCount < mCurMaxRetryCount);
    if (DBG) log("isRetryNeeded: " + retVal);
    return retVal;
}
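To see how `isRetryNeeded()` and `increaseRetryCount()` work together across repeated LOST_CONNECTION events, here is a stripped-down counter. It is a sketch under the assumption that the count is only ever incremented when a retry actually starts; `SimpleRetryCounter` is a hypothetical class, not the real RetryManager, and it omits the retry timer table entirely.

```java
/** Hypothetical minimal model of the retry counting described above. */
class SimpleRetryCounter {
    private final boolean retryForever;
    private final int maxRetryCount;
    private int retryCount;

    SimpleRetryCounter(boolean retryForever, int maxRetryCount) {
        this.retryForever = retryForever;
        this.maxRetryCount = maxRetryCount;
    }

    /** Same condition as isRetryNeeded() in the excerpt above. */
    boolean isRetryNeeded() {
        return retryForever || retryCount < maxRetryCount;
    }

    /** Called when a retry is actually started (EVENT_RETRY_CONNECTION). */
    void increaseRetryCount() {
        retryCount++;
    }
}
```

With a limit of 2, the first two lost connections lead to RetryingState and the third falls through to the Inactive state; with `retryForever` set, the connection never gives up on its own.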
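- 3. No new packets received for a sustained period: once the data connection is up, DcTracker periodically checks the TX/RX counters. If RX stays unchanged while the number of sent packets reaches the configured threshold, a recovery action is triggered.
Start with the method onDataStallAlarm. It is triggered periodically by an Alarm and performs these steps: update the TX/RX data -> decide whether recovery is needed and run it -> re-arm the Alarm for the next check.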
protected void onDataStallAlarm(int tag) {
    if (mDataStallAlarmTag != tag) {
        if (DBG) {
            log("onDataStallAlarm: ignore, tag=" + tag + " expecting " + mDataStallAlarmTag);
        }
        return;
    }
    // Update mSentSinceLastRecv
    updateDataStallInfo();
    // The default value is 10
    int hangWatchdogTrigger = Settings.Global.getInt(mResolver,
            Settings.Global.PDP_WATCHDOG_TRIGGER_PACKET_COUNT,
            NUMBER_SENT_PACKETS_OF_HANG);
    boolean suspectedStall = DATA_STALL_NOT_SUSPECTED;
    if (mSentSinceLastRecv >= hangWatchdogTrigger) {
        // No RX for a while and the watchdog threshold is reached: recover
        suspectedStall = DATA_STALL_SUSPECTED;
        sendMessage(obtainMessage(DctConstants.EVENT_DO_RECOVERY));
    } else {
        if (VDBG_STALL) {
            log("onDataStallAlarm: tag=" + tag + " Sent " + String.valueOf(mSentSinceLastRecv)
                    + " pkts since last received, < watchdogTrigger=" + hangWatchdogTrigger);
        }
    }
    // Re-arm the Alarm so this method (onDataStallAlarm) runs again after a while
    startDataStallAlarm(suspectedStall);
}
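updateDataStallInfo() does the counting and distinguishes three cases:
- Both TX and RX -> normal; reset the counter and the recovery action (recovery actions are covered later)
- TX but no RX -> abnormal; accumulate the TX count
- No TX, only RX -> normal; reset the counter and the recovery action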
private void updateDataStallInfo() {
    long sent, received;
    TxRxSum preTxRxSum = new TxRxSum(mDataStallTxRxSum);
    mDataStallTxRxSum.updateTxRxSum();
    sent = mDataStallTxRxSum.txPkts - preTxRxSum.txPkts;
    received = mDataStallTxRxSum.rxPkts - preTxRxSum.rxPkts;
    if (sent > 0 && received > 0) {
        // Sending and receiving normally: reset the recovery action
        if (VDBG_STALL) log("updateDataStallInfo: IN/OUT");
        mSentSinceLastRecv = 0;
        putRecoveryAction(RecoveryAction.GET_DATA_CALL_LIST);
    } else if (sent > 0 && received == 0) {
        // No RX: accumulate the packets sent in this interval unless a call is in progress
        if (isPhoneStateIdle()) {
            mSentSinceLastRecv += sent;
        } else {
            mSentSinceLastRecv = 0;
        }
    } else if (sent == 0 && received > 0) {
        // Nothing sent: reset the recovery action
        if (VDBG_STALL) log("updateDataStallInfo: IN");
        mSentSinceLastRecv = 0;
        putRecoveryAction(RecoveryAction.GET_DATA_CALL_LIST);
    } else {
        if (VDBG_STALL) log("updateDataStallInfo: NONE");
    }
}
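The interaction between the per-interval counting and the watchdog threshold can be modeled in a few lines. This is a simplified sketch, not the framework code: `DataStallCounter` is a hypothetical class, the threshold is fixed at the default of 10 mentioned above, and the reset that doRecovery performs on the counter is left out.

```java
/** Hypothetical model of the data-stall counter and its trigger check. */
class DataStallCounter {
    // Assumption: the default watchdog trigger of 10 packets is in effect.
    static final int DEFAULT_TRIGGER = 10;

    private long sentSinceLastRecv;

    /**
     * Feeds the packet deltas of one polling interval.
     * @return true when a stall would be suspected after this interval
     */
    boolean onPoll(long sent, long received, boolean phoneIdle) {
        if (received > 0) {
            sentSinceLastRecv = 0;            // downlink traffic seen: healthy
        } else if (sent > 0) {
            // uplink only: accumulate while idle, reset during a call
            sentSinceLastRecv = phoneIdle ? sentSinceLastRecv + sent : 0;
        }
        // sent == 0 && received == 0: counter is left unchanged
        return sentSinceLastRecv >= DEFAULT_TRIGGER;
    }
}
```

For example, two consecutive intervals with 4 and then 6 packets sent and nothing received reach the threshold on the second poll, while any received packet in between would have reset the count.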
The TX/RX data comes from static methods of TrafficStats; a native-layer method aggregates the counters of all mobile ifaces and returns the result:
public void updateTxRxSum() {
    this.txPkts = TrafficStats.getMobileTcpTxPackets();
    this.rxPkts = TrafficStats.getMobileTcpRxPackets();
}
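- 4. Finally, look at how the doRecovery method restores data. doRecovery has five recovery actions, each with its own handling:
- Actively query the modem for the DATA CALL LIST
- Clean up the existing data connections
- Re-register on the network
- Restart the radio
- Deep-restart the radio (according to Qualcomm's comments, this operation depends on the design of the RIL)
If the connection is still broken after one action has run, the next action is executed. The order works like a circular linked list, and it continues until the connection recovers and updateDataStallInfo() resets the action: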
protected void doRecovery() {
    if (getOverallState() == DctConstants.State.CONNECTED) {
        // Go through a series of recovery steps, each action transitions to the next action
        int recoveryAction = getRecoveryAction();
        switch (recoveryAction) {
            case RecoveryAction.GET_DATA_CALL_LIST:
                mPhone.mCi.getDataCallList(obtainMessage(DctConstants.EVENT_DATA_STATE_CHANGED));
                putRecoveryAction(RecoveryAction.CLEANUP);
                break;
            case RecoveryAction.CLEANUP:
                cleanUpAllConnections(Phone.REASON_PDP_RESET);
                putRecoveryAction(RecoveryAction.REREGISTER);
                break;
            case RecoveryAction.REREGISTER:
                mPhone.getServiceStateTracker().reRegisterNetwork(null);
                putRecoveryAction(RecoveryAction.RADIO_RESTART);
                break;
            case RecoveryAction.RADIO_RESTART:
                putRecoveryAction(RecoveryAction.RADIO_RESTART_WITH_PROP);
                restartRadio();
                break;
            case RecoveryAction.RADIO_RESTART_WITH_PROP:
                // This is in case radio restart has not recovered the data.
                // It will set an additional "gsm.radioreset" property to tell
                // RIL or system to take further action.
                // The implementation of hard reset recovery action is up to OEM product.
                // Once RADIO_RESET property is consumed, it is expected to set back
                // to false by RIL.
                EventLog.writeEvent(EventLogTags.DATA_STALL_RECOVERY_RADIO_RESTART_WITH_PROP, -1);
                if (DBG) log("restarting radio with gsm.radioreset to true");
                SystemProperties.set(RADIO_RESET_PROPERTY, "true");
                // give 1 sec so property change can be notified.
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {}
                restartRadio();
                putRecoveryAction(RecoveryAction.GET_DATA_CALL_LIST);
                break;
            default:
                throw new RuntimeException("doRecovery: Invalid recoveryAction="
                        + recoveryAction);
        }
        mSentSinceLastRecv = 0;
    }
}
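The escalation order in doRecovery behaves like a ring: each action hands over to the next, and the last one wraps back to the first. A compact sketch of that ordering follows; `RecoveryCycle` is a hypothetical class, and it assumes the numeric action values follow the order of the switch cases, starting at 0.

```java
/** Hypothetical model of the circular recovery-action order in doRecovery. */
class RecoveryCycle {
    // Assumption: ordinal values match the framework's numeric actions (0..4).
    enum Action {
        GET_DATA_CALL_LIST, CLEANUP, REREGISTER, RADIO_RESTART, RADIO_RESTART_WITH_PROP
    }

    /** Returns the action that would be stored for the next recovery round. */
    static Action next(Action current) {
        Action[] all = Action.values();
        return all[(current.ordinal() + 1) % all.length];
    }
}
```

If that numbering assumption holds, the `do recovery action=1` entries in the log from section 2.1 would correspond to CLEANUP, which matches the observed symptom of the connection being torn down and re-established.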
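2.3 Analysis of Network Interface Statistics
From the walkthrough of Android's data retry mechanism, we learned that the framework mainly uses the network interface's TX/RX statistics as the trigger for data recovery. The next step is therefore to check whether the interface's TX/RX counters behave as this analysis predicts. Comparative testing confirmed that when the problem occurs, the interface's TX/RX counters do not change, which is consistent with the analysis above. This behavior needs to be handed over to the driver side for further analysis.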
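One way to run such a comparison by hand is to sample the interface counters from `/proc/net/dev` at two points in time and compare them. The sketch below parses a single interface line of that file; it is an illustration only, `NetDevLine` is a hypothetical class, the interface name `rmnet_data0` in the usage note is an assumption, and these raw interface counters are not the same source as the TrafficStats TCP packet counts used by the framework.

```java
/** Hypothetical parser for one interface line of /proc/net/dev. */
class NetDevLine {
    final String iface;
    final long rxPackets;
    final long txPackets;

    private NetDevLine(String iface, long rxPackets, long txPackets) {
        this.iface = iface;
        this.rxPackets = rxPackets;
        this.txPackets = txPackets;
    }

    /**
     * Parses a line of the form "iface: <8 receive fields> <8 transmit fields>".
     * Receive packets is the 2nd field, transmit packets the 10th.
     */
    static NetDevLine parse(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("not an interface line: " + line);
        }
        String iface = line.substring(0, colon).trim();
        String[] fields = line.substring(colon + 1).trim().split("\\s+");
        if (fields.length < 16) {
            throw new IllegalArgumentException("unexpected field count: " + fields.length);
        }
        return new NetDevLine(iface, Long.parseLong(fields[1]), Long.parseLong(fields[9]));
    }
}
```

Parsing the line for an interface such as `rmnet_data0` before and after a test window and finding identical `rxPackets` and `txPackets` values would reproduce the "no change" observation described above.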