Returning from holiday recently, I was expecting a calm day of catching up and doing some basic tasks. The opposite was true: that day I was introduced to a problem that puzzled us for two weeks.
Situation
It was reported to us that the Android app sometimes gets the wrong reply to a particular GET request. OK, let us investigate. I've got this, it will be quick...
Reproducibility
So far the bug has appeared non-deterministic to us. At first we could not find the determining factor: it just occurred occasionally, persisted for some minutes (maybe up to half an hour) and then disappeared without a trace. This made the investigation, and any communication about it, much harder. It happened in both the iOS and Android apps.
We had ourselves a Mandelbug:
A bug whose underlying causes are so complex and obscure as to make its behavior appear chaotic or even non-deterministic
First hypothesis
We decided to focus only on the Android part. A debugging proxy was soon attached to capture all traffic between the app and our internal infrastructure. We spent a lot of time on reproducibility, but every attempt to pinpoint the underlying cause was unsuccessful. The faulty GET requests never showed up in our API, which led to the first hypothesis:
'Requests are blocked/cached somewhere on the way from app to preference API'
Verifying
Several days into the investigation, we started to look around other streams through which the GET requests should flow. This involved bringing in several members from other teams, long chats, calls and repeated explaining of the problem.
There was also a parallel investigation on the Android side to check whether the problem might be within the app.
We went back and forth, compared data, searched for differences between particular right and wrong requests, and verified several times that the requests were missing from each component on the way. An intriguing complication was an interfering bug in the POST requests, which was messing with our perception. Another interesting observation was that although the bug occurred on both Android and iOS, it did not necessarily occur on both at the same time.
I even familiarized myself with Android Studio and built the apps myself just to get detailed logs.
After we had eliminated the possibility of any party blocking these requests and gathered data from several occasions on which the bug appeared, we had to dismiss the first hypothesis.
Second hypothesis
Summarizing all the data and findings brought us to a refined hypothesis:
'Problematic GET reply is coming from within the app'
This was the trigger to get more people from the Android team involved, and after another round of explaining the intricacies of the bug, they found the root cause.
Gotcha!
The GET requests had been hiding inside the app from the beginning! It goes deeper than that, however. The reply to the GET request was missing a header that should specify the caching behaviour. One component on the way did not like this and filled it in with a default 'private' value (i.e. you can cache as you like, but keep it inside the client). We fixed it by filling in the caching header ourselves. The bug remains non-reproducible to this day: we never found out the exact conditions under which the caching occurred (it was no longer needed).
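To make the fix more concrete, here is a minimal Python sketch of the idea, not our actual code. It assumes the header in question is `Cache-Control` and that the value we want is `no-store`; the function name `with_explicit_cache_control` is made up for illustration. The point is simply that the reply always leaves our side with an explicit caching directive, so no component on the way has a reason to fill in its own default.

```python
def with_explicit_cache_control(headers, default="no-store"):
    """Return a copy of the reply headers with Cache-Control set if it was missing.

    The default value "no-store" is an assumption for this sketch; pick
    whatever directive matches how the reply may really be cached.
    """
    # HTTP header names are case-insensitive, so compare in lower case.
    has_directive = any(name.lower() == "cache-control" for name in headers)
    result = dict(headers)
    if not has_directive:
        result["Cache-Control"] = default
    return result


# A reply without any caching header gets an explicit one.
reply_headers = {"Content-Type": "application/json"}
print(with_explicit_cache_control(reply_headers))
```

A reply that already carries a caching directive is returned unchanged, so the helper only closes the gap that bit us.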
I would like to express a big thank you to everyone involved. Software development is rarely the feat of an individual.
Lessons learned
Consider the possibility of caching
For the next 10 years, in a similar situation, caching will always be my prime suspect.
Look at all data, not only parts
When investigating, I primarily focused on each part in isolation and may have neglected some of the context.
For example, when the problematic GET request did not show up in the debugging proxy, I wrongly dismissed this as a possible fault of the proxy itself (accepting the developer's explanation).
Search for similarities, not only differences
When comparing the correct GET request logs with the problematic ones, I was focused on finding differences. In hindsight, searching for patterns of similarity could have shortened the search, as I would have seen several identical GETs among the problematic ones.
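A similarity check like this can be automated with very little code. The sketch below is a hypothetical illustration in Python, not something we ran at the time: it assumes the problematic reply bodies have already been pulled out of the logs as strings, and the sample data is invented. It groups replies by a short hash and reports the ones seen more than once, which is exactly the pattern a cached reply would produce.

```python
import hashlib
from collections import Counter


def fingerprint(body):
    """Short, stable hash of a reply body, used to spot identical replies."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]


def repeated_replies(replies):
    """Return {fingerprint: count} for reply bodies that occur more than once."""
    counts = Counter(fingerprint(body) for body in replies)
    return {fp: n for fp, n in counts.items() if n > 1}


# Invented sample: three of the four problematic replies are byte-identical.
problematic = [
    '{"theme": "dark"}',
    '{"theme": "dark"}',
    '{"theme": "light"}',
    '{"theme": "dark"}',
]
print(repeated_replies(problematic))
```

Several identical replies to requests made minutes apart are a strong hint that something is serving a stored copy.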
Prezi of this presentation, which I also used for a South West Test lightning talk.