
Mandelbug - the bug that didn't want to be found

Returning from holiday recently, I was expecting a calm day of catching up and doing some basic tasks. The opposite was true: that day I was introduced to a situation that puzzled us for two weeks.

Situation

It was reported to us that the Android app sometimes gets the wrong reply to a particular GET request. OK, let us investigate; I've got this, it will be quick...

Reproducibility

To this day, the bug appears non-deterministic to us. At first we were unable to find the determining factor: it just occurred occasionally, persisted for some minutes (maybe up to half an hour) and then disappeared without a trace. This made the investigation, and any communication about it, much harder. It happened in both the iOS and Android apps.
We had a Mandelbug on our hands:
A bug whose underlying causes are so complex and obscure as to make its behavior appear chaotic or even non-deterministic.

First hypothesis

We decided to focus only on the Android part. A debugging proxy was soon attached to catch all traffic between the app and our internal infrastructure. A lot of effort went into reproducing the bug, but every attempt to pinpoint the underlying cause was unsuccessful. The faulty GET requests were not found in our API, which led to the first hypothesis:
'Requests are blocked/cached somewhere on the way from app to preference API'

Verifying

Several days into the investigation, we started to look at the other streams through which the GET requests should flow. This involved bringing in several members of other teams, long chats, calls and repeated explanations of the problem.
There was also a parallel investigation on the Android side to check whether the problem might be within the app.
We went back and forth, compared data, searched for differences between particular right and wrong requests, and verified several times that the requests were missing from each component along the way. An intriguing complication was an unrelated bug in the POST requests, which interfered with our perception. Another interesting observation was that although the bug occurs on both Android and iOS, it does not necessarily occur on both at the same time.
I even familiarized myself with Android Studio and built the apps myself just to get deep logs.
After we had eliminated the possibility of any party blocking these requests and gathered data from several occasions on which the bug appeared, we had to dismiss the first hypothesis.

Second hypothesis

Summarizing all the data and findings brought us to a refined hypothesis:

'Problematic GET reply is coming from within the app'
This was the trigger to get more people from the Android team involved, and after another round of explaining the intricacies of the bug, they found the root cause.

Gotcha!

The GET requests had been hiding inside the app from the beginning! It goes deeper than that, however. The reply to the GET request was missing the header that should specify caching behavior. One component along the way did not like this and filled it in with a default 'private' value (i.e. you may cache it as you like, but keep it inside the client). We fixed it by filling in the caching header ourselves. The non-reproducibility has remained to this day: we never found out the exact conditions under which the caching occurred (it was no longer needed).
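A minimal sketch of what "filling the caching header ourselves" could look like, assuming the reply headers are available as a plain dictionary. The function name and the `no-store` default are my assumptions for illustration; this post does not record which directive was actually chosen or in which component the fix was applied.

```python
def ensure_cache_control(headers: dict[str, str], directive: str = "no-store") -> dict[str, str]:
    """Return a copy of `headers` that always carries an explicit Cache-Control.

    `directive` is a hypothetical default; pick whatever policy fits the endpoint.
    """
    # Header names are case-insensitive, so compare in lower case.
    already_set = any(name.lower() == "cache-control" for name in headers)
    result = dict(headers)
    if not already_set:
        # Without an explicit value, an intermediate component may pick its own default.
        result["Cache-Control"] = directive
    return result


if __name__ == "__main__":
    reply_headers = {"Content-Type": "application/json"}
    print(ensure_cache_control(reply_headers))
```

The point of the sketch is only that the reply states its caching policy explicitly, so no component in between has to guess one.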
I would like to express a big thanks to all the people involved; software development is rarely the feat of an individual.

Lessons learned

Realize caching possibility

For the next 10 years, in a similar situation, caching will always be my prime suspect.

Look at all data, not only parts

When investigating, I primarily focused on each part in isolation and may have neglected some of the context.
For example, when the problematic GET request did not show up in the debugging proxy, I wrongly dismissed it as a possible fault of the proxy itself (accepting the developer's explanation).
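One way to take the whole picture seriously is to put the app's view and the proxy's view side by side. The sketch below assumes both logs can be reduced to lists of records sharing a request identifier; the `id` field and the function name are hypothetical, not something our tooling provided.

```python
def answered_without_network(app_requests: list[dict], proxy_requests: list[dict]) -> list[str]:
    """Return ids of requests the app got a reply for but the proxy never saw."""
    seen_by_proxy = {record["id"] for record in proxy_requests}
    return [record["id"] for record in app_requests if record["id"] not in seen_by_proxy]


if __name__ == "__main__":
    app_log = [{"id": "a1"}, {"id": "a2"}, {"id": "a3"}]
    proxy_log = [{"id": "a1"}, {"id": "a3"}]
    print(answered_without_network(app_log, proxy_log))
```

A reply that exists in the app but has no matching request in the proxy is a strong hint that it was served from somewhere before the network, such as a cache.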

Search for similarities, not only differences

When comparing the correct GET request logs with the problematic ones, I was focused on finding differences. In hindsight, searching for patterns of similarity could have shortened the search, as I would have seen several identical GETs among the problematic ones.
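A small sketch of such a similarity search, assuming each log entry has been reduced to a URL and a reply body. The field names and the sample data are made up for illustration.

```python
from collections import Counter


def repeated_replies(entries: list[dict]) -> dict[tuple[str, str], int]:
    """Count identical (url, body) pairs and keep those seen more than once."""
    counts = Counter((entry["url"], entry["body"]) for entry in entries)
    return {key: count for key, count in counts.items() if count > 1}


if __name__ == "__main__":
    problematic = [
        {"url": "/preferences", "body": '{"theme": "dark"}'},
        {"url": "/preferences", "body": '{"theme": "dark"}'},
        {"url": "/preferences", "body": '{"theme": "light"}'},
    ]
    print(repeated_replies(problematic))
```

Byte-identical replies to requests made at different times, while the underlying data was changing, point toward a stored copy being replayed.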


A Prezi of this presentation, which I also used for a lightning talk at South West Test.
