Troubleshooting a client’s web connectivity can be tricky business. Modern web applications are increasingly demanding, and their need for richer, more open networking resources is greater than ever. In this article we will go through a process for troubleshooting your product’s issues related to web connectivity.
If you are familiar with the protocols and network demands of a modern web application (HTTP, WebSockets, WebRTC), read on; otherwise we strongly encourage you to first read our article on Modern Web Applications Network Requirements.
Troubleshooting A Customer’s Issue
Troubleshooting is about information: we need to collect everything available so we can start connecting the dots and reach a conclusion that allows us to solve the problem. Below, we provide a framework for that kind of troubleshooting.
Step 1: Identify the problem
The customer reports “it doesn’t work” and you are left staring into the void. You need to understand what exactly “doesn’t work” and how it fails. Which part of the application failed to perform? What network operations and protocols were involved in that operation? Unfortunately your customer cannot give you this type of information; you need to collect it proactively, so that when an incident happens you can go back to what was recorded and see what went wrong.
In implementation terms, you need to establish a logging system for both your Web Services (webserver) and your Frontend Application (the code of your app that runs in your customers’ browsers). These logs need to carry customer-specific information along with everything else they convey, so you can easily search and look up logs on a per-customer basis.
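As a minimal sketch of what such a log record could look like, the TypeScript below tags every entry with a customer identifier so failures can be filtered per customer. The `LogEntry` shape, its field names, and the `makeLogEntry` and `failuresForCustomer` helpers are all hypothetical, chosen for illustration rather than taken from any particular logging library.

```typescript
// Hypothetical log entry shape; the field names are illustrative only.
interface LogEntry {
  timestamp: string; // ISO 8601, so entries sort chronologically as strings
  customerId: string; // enables per-customer lookup
  source: "frontend" | "webserver";
  protocol: "http" | "websocket" | "webrtc";
  operation: string; // which part of the application was running
  outcome: "ok" | "error";
  detail?: string; // free-form context, e.g. a close code or HTTP status
}

// Build an entry, stamping it with the current time unless one is supplied.
function makeLogEntry(
  fields: Omit<LogEntry, "timestamp">,
  now: Date = new Date(),
): LogEntry {
  return { timestamp: now.toISOString(), ...fields };
}

// Return only the failed operations recorded for one customer.
function failuresForCustomer(
  entries: LogEntry[],
  customerId: string,
): LogEntry[] {
  return entries.filter(
    (e) => e.customerId === customerId && e.outcome === "error",
  );
}
```

Recording the protocol and operation alongside the customer identifier is what lets you answer "which part failed, and over which protocol" after the fact, without asking the customer.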
Step 2: Understand the environment
Identifying the problem does not necessarily mean that we understand why it is happening. In these kinds of cases we need to dig deeper and understand the environment in which the issue manifests.
First we need some standard triaging to happen:
- What was the sequence of events that triggered the problem?
- Can this issue be reliably reproduced?
- If not, has it happened once or multiple times?
- If it happened once, has the user tried refreshing the page to check whether the issue is still there?
- Has this issue been reported by multiple users?
- Does this issue belong to a general group of “unexplained” or “impossible” issues?
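One way to keep triage answers consistent across incidents is to record them in a fixed structure. The sketch below is only an illustration: the `TriageAnswers` fields, the four categories, and the order in which `classify` checks them are assumptions made for this example, not an established taxonomy.

```typescript
// Hypothetical record mirroring the triage questions above.
interface TriageAnswers {
  reproducible: boolean; // can the issue be reliably reproduced?
  occurrences: number; // how many times this user has seen it
  persistsAfterRefresh: boolean; // still there after refreshing the page?
  reportingUsers: number; // distinct users who reported it
}

type TriageClass = "widespread" | "reproducible" | "intermittent" | "one-off";

// Assumed priority: multiple reporters first, then reproducibility,
// then whether the issue recurred for this single user.
function classify(a: TriageAnswers): TriageClass {
  if (a.reportingUsers > 1) return "widespread";
  if (a.reproducible) return "reproducible";
  if (a.occurrences > 1 || a.persistsAfterRefresh) return "intermittent";
  return "one-off";
}
```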
Then we need to dive into the user’s environment as a whole:
- What operating system are they on? What browser type and version do they use?
- What type of Internet connection do they have? Home/ISP? School/Education? Enterprise network? Cellular?
- What quality of Internet connection do they have? What is their bandwidth? Is their line stable? Does it suffer from what we call “packet loss”, where packets are dropped and never reach their destination?
- How stable is their Internet connectivity? Does it disconnect frequently? Does their modem have a poor signal to the ISP, so that it fluctuates and constantly reconnects? Is it over cellular, where anything goes?
- Which part of the world was the user in, and at what local time did the issue manifest? It might be a case of general network overload during busy hours, or a known outage in some part of the Internet.
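To put rough numbers on connection quality, one option is to send a series of small probe requests and summarize how many completed and how long they took. The sketch below covers only the summarizing step. The `Probe` and `LinkSummary` shapes are hypothetical, and the fraction of failed probes is a coarse proxy rather than true packet loss, since TCP retransmits lost packets underneath an HTTP or WebSocket probe.

```typescript
// One probe = one small request. rttMs is null when the probe timed out or failed.
interface Probe {
  rttMs: number | null;
}

interface LinkSummary {
  sent: number;
  lossRate: number; // fraction of probes that never completed, 0..1
  meanRttMs: number | null; // null when no probe completed
  maxRttMs: number | null; // a large gap between mean and max hints at instability
}

function summarize(probes: Probe[]): LinkSummary {
  // Keep only the round-trip times of probes that completed.
  const rtts = probes
    .map((p) => p.rttMs)
    .filter((r): r is number => r !== null);
  const sent = probes.length;
  const lossRate = sent === 0 ? 0 : (sent - rtts.length) / sent;
  if (rtts.length === 0) {
    return { sent, lossRate, meanRttMs: null, maxRttMs: null };
  }
  const total = rtts.reduce((sum, r) => sum + r, 0);
  return {
    sent,
    lossRate,
    meanRttMs: total / rtts.length,
    maxRttMs: Math.max(...rtts),
  };
}
```

For example, five probes of which two never complete give a `lossRate` of 2 / 5 = 0.4, which would be worth recording next to the user's connection type and local time.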
Step 3: Make Assertions and Test Them
Having collected all of this data, if you still haven’t figured out the real cause, you are in a good position to start forming hypotheses about what happened.
If you are having a lot of unpredictable problems up and down your product that come and go as you push new changes, then you most likely have a systemic problem, which means that your system itself is inherently unstable. In these cases the remedies aren’t many: you have accumulated too much technical debt, and it needs to be paid back. In most cases this means a complete rewrite of the offending part, or of the whole application if things are that severe.
If the issues are more narrowly defined and you have excluded all application factors, meaning the application works fine for you and 95% of your clients, then the cause most likely lies in the reporting client’s specific configuration and network connectivity. Suppose you hypothesize that the problem comes from WebSocket connectivity; you then need to verify that hypothesis by testing it. You either build an in-application test or use an external test service, direct your customer there, and have them run the test and send you the results.
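A minimal sketch of such an in-application test for the WebSocket hypothesis is shown below. It assumes a test endpoint that you operate and that accepts WebSocket connections; the URL, the `SocketLike` interface, and the `testWebSocket` function are hypothetical names introduced for this example. The socket is created through an injected factory so the same logic can run against a stand-in object outside the browser.

```typescript
// Minimal subset of a WebSocket-like object that this sketch relies on.
interface SocketLike {
  addEventListener(type: string, listener: () => void): void;
  close(): void;
}

type WsResult = "open" | "error" | "timeout";

// Try to open a socket and report exactly one outcome: opened, failed, or timed out.
function testWebSocket(
  createSocket: (url: string) => SocketLike,
  url: string,
  timeoutMs: number,
  onResult: (result: WsResult, elapsedMs: number) => void,
): void {
  const started = Date.now();
  let socket: SocketLike;
  try {
    socket = createSocket(url);
  } catch {
    // Creating the socket can fail synchronously, e.g. on an invalid URL.
    onResult("error", Date.now() - started);
    return;
  }

  let done = false;
  // Connections blocked on the network path may hang, so bound the wait.
  const timer = setTimeout(() => finish("timeout"), timeoutMs);

  function finish(result: WsResult): void {
    if (done) return; // report only the first outcome
    done = true;
    clearTimeout(timer);
    socket.close();
    onResult(result, Date.now() - started);
  }

  socket.addEventListener("open", () => finish("open"));
  socket.addEventListener("error", () => finish("error"));
  // A close before open means the connection was never established.
  socket.addEventListener("close", () => finish("error"));
}
```

In a browser, the factory could be as simple as `(url) => new WebSocket(url)`, and the callback could write the result into the customer's log so it reaches you without a manual report. Distinguishing "timeout" from "error" is a design choice: a hang and an immediate refusal often point to different causes on the network path.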
Hopefully by now you have a good understanding of the problem and you can proceed to apply a solution that can cope with the connectivity challenges a browser can face.
The Next Step: Automation
As you have already realized, the troubleshooting process can be very lengthy and painful for you and, most importantly, for your customers. While in the early stages of your business you can afford to handle all your customers personally, this will not scale, and the user experience is terrible.
By automating this process you gain time, your customers do not have to follow a 12-step troubleshooting guide, and you can fix, iterate and evolve faster than ever before.