Part 2 - Application Troubles in the Cloud (Symptom: Performance Problems whilst part of a Service Mesh)

Overview

Introduction

I have recently supported a customer in deploying numerous applications across multiple OpenShift clusters running on top of hundreds of worker nodes, using external storage infrastructure for persistence and Red Hat OpenShift Service Mesh (OSSM) for encryption, authentication and traffic management.

The method of delivering in this environment did not differ much from delivering the applications in any other environment, with automation involved throughout. However, when an application issue was reported, many of the platform maintenance team members could not quickly home in on the area they should focus on. It was not immediately obvious what information would assist in identifying and resolving the problem, or where and how to collect it.

In the Part 1 - Application Troubles in the Cloud (Symptom: Kubernetes POD Failures/Restarts) blog post we looked at the cluster health state in order to resolve POD startup failures. In this second blog I will focus on performance problems that may be caused by the Service Mesh setup, configuration and health.

Causes of Application Performance Problems in Service Mesh

The second symptom the platform maintenance personnel came up against was various kinds of performance failures related to connecting to, or from within, the Service Mesh. The result was that responses were delayed and at times eventually failed. When troubleshooting such issues we will focus on two possible causes: failed connections and slow responses.

Below we present the causes one by one, along with the areas to focus on for each, as well as the information that should be gathered and reviewed.

Cause 1 - Failed Connections

In our use case all traffic was secured by exercising Mutual TLS (mTLS) authentication between callers and services. Service Mesh based services exposed externally receive their own certificates to present to the caller (with a passthrough OCP Route declaring a DNS-resolvable service hostname and a Gateway resource defining the secret hosting the certificate for that hostname). This brings us to the first set of possible failures, related to mTLS configurations. Troubleshooting should focus on the client and Service Mesh certificate configurations: check the defined Service Mesh certificates and verify the Service Mesh external (ingress/egress) network configurations, as there can be misalignments in presenting or accepting the correct certificates, or in pointing to the correct service. Similarly, outgoing communications can suffer from misconfigured egress certificates (check the applied certificates and the egress resource configurations). In addition to authentication, some authorisation failures can be caused by the Service Mesh security settings, hence exploring the configuration settings applied in the ServiceMeshControlPlane resource is also necessary.
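As an illustration, a minimal sketch of such checks is shown below. It assumes the control plane runs in the istio-system namespace, that istioctl is available against the cluster, and that APP_HOST, APP_NS and APP_POD are hypothetical placeholders for the exposed hostname, the application namespace and an application POD; adjust all of these to your own environment.

```
# Placeholders (assumptions): adjust to your environment.
APP_HOST=app.example.com        # hostname exposed via the passthrough Route/Gateway
APP_NS=my-app                   # application namespace
APP_POD=my-app-pod              # an application POD with an Envoy sidecar

# Route and Gateway exposing the service: confirm the hostname and TLS/credential settings match.
oc get route -n istio-system
oc get gateway -n "$APP_NS" -o yaml

# Inspect the certificate actually presented to external callers.
openssl s_client -connect "$APP_HOST:443" -servername "$APP_HOST" </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates

# Certificates loaded into the sidecar, plus mTLS/authorization related resources.
istioctl proxy-config secret "$APP_POD" -n "$APP_NS"
oc get peerauthentication,destinationrule,authorizationpolicy -n "$APP_NS"

# Egress-related configuration and the mesh-wide security settings.
oc get serviceentry -n "$APP_NS"
oc get smcp -n istio-system -o yaml
```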

Furthermore, special focus must be given to possible network issues resulting from the service mesh configuration, which can break certain locality expectations of the application deployment (based on the involved PODs, check the topology of the nodes that calls originate from and are received at, to verify that network connectivity is viable). The cause could also be that the calling service is not in a service mesh while the called service is (check that the PODs are in the same Service Mesh, and if they are in different Service Meshes check the certificates used), or that the calling service is part of a different service mesh (check which Service Mesh the service is a member of). Further checks should verify that the OCP Route exposing the service exists and that the service mesh configuration renders it accessible. Finally, validate that the Service Mesh Envoy sidecar has the correct, up-to-date configuration (first check the Service Mesh Operator state, and if the cause is still unknown check the Service Mesh configuration).
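The sketch below shows one way these membership, locality and sidecar checks could be performed. It assumes an OSSM ServiceMeshMemberRoll named default in istio-system, the standard topology.kubernetes.io/zone node label, and the same hypothetical APP_NS/APP_POD placeholders as before.

```
# Placeholders (assumptions): adjust to your environment.
APP_NS=my-app
APP_POD=my-app-pod

# Is the application namespace enrolled in the mesh (ServiceMeshMemberRoll)?
oc get servicemeshmemberroll default -n istio-system -o yaml

# Does the POD actually carry the Envoy sidecar, and is its config in sync with the control plane?
oc get pod "$APP_POD" -n "$APP_NS" -o jsonpath='{.spec.containers[*].name}{"\n"}'
istioctl proxy-status

# Locality checks: which nodes/zones are the caller and callee PODs scheduled on?
oc get pods -n "$APP_NS" -o wide
oc get nodes -L topology.kubernetes.io/zone

# Operator health, in case the sidecar configuration is stale.
oc get csv -n openshift-operators | grep -i servicemesh
```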

Cause 2 - Slow Responses

Applications can suffer performance degradation because POD or Deployment resource limits have been reached. First, ensure that the CPU/RAM/network/storage POD limits have not been exceeded (inspect the POD resource usage) and that the Deployment state is as expected (by checking that the number of deployed PODs matches the desired replica count). If issues are found, check whether any configured resource limits are being hit.
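A minimal sketch of these resource checks follows, again using hypothetical APP_NS/APP_POD/APP_DEPLOY placeholders that you would replace with your own namespace, POD and Deployment names.

```
# Placeholders (assumptions): adjust to your environment.
APP_NS=my-app
APP_POD=my-app-pod
APP_DEPLOY=my-app

# Current CPU/memory consumption versus the configured requests/limits.
oc adm top pods -n "$APP_NS"
oc describe pod "$APP_POD" -n "$APP_NS" | grep -A 4 -i 'Limits\|Requests'

# Desired versus ready replicas for the Deployment.
oc get deployment "$APP_DEPLOY" -n "$APP_NS" \
  -o jsonpath='desired={.spec.replicas} ready={.status.readyReplicas}{"\n"}'
```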

If reviewing the resources does not make clear what could be causing the applications to fail or to perform erratically, look at the cluster events for possible causes behind POD restarts. The application and service mesh logs can also reveal whether the logging level is behind the performance degradation, or whether some other functional cause is behind requests backing up towards a failing or slow service. The latter can be further investigated using the Service Mesh observability stack (validate configuration and errors with Kiali to check, for instance, whether retries occur but keep failing constantly).
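A sketch of how events, logs and retry behaviour could be gathered is given below. It assumes the sidecar container is named istio-proxy and that the Kiali route is named kiali in istio-system (both may differ in your installation), and reuses the hypothetical APP_NS/APP_POD placeholders.

```
# Placeholders (assumptions): adjust to your environment.
APP_NS=my-app
APP_POD=my-app-pod

# Recent cluster events, ordered by time, to spot restarts and their reasons.
oc get events -n "$APP_NS" --sort-by=.lastTimestamp

# Application and sidecar logs (the sidecar container is normally named istio-proxy).
oc logs "$APP_POD" -n "$APP_NS" -c istio-proxy --tail=200
oc logs "$APP_POD" -n "$APP_NS" --all-containers --tail=200

# Envoy retry statistics, to see whether retries keep firing against a slow/failing upstream.
oc exec "$APP_POD" -n "$APP_NS" -c istio-proxy -- pilot-agent request GET stats | grep -i retry

# Kiali console hostname, for visual inspection of the traffic graph (route name may differ).
oc get route kiali -n istio-system -o jsonpath='{.spec.host}{"\n"}'
```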

Conclusion

Application failures can be caused by misconfigurations at many levels, from the network to the client, the service mesh or the application itself. These misconfigurations can result in slow response rates, reduced throughput, timeouts and so on, all of which degrade the performance of the overall solution. In this post we focused on the possible causes of these performance-related issues when the application is part of the mesh or communicates with a service in a mesh.

The combined resources for troubleshooting this symptom can also be found at Application (in Service Mesh) performance problems.

Posts in this Series