Recently I've been working on some interesting OCP deployments with Service Mesh. It is a very powerful and, I'd say, complicated subject – even for experts on the matter it doesn't seem trivial.
The context here is Istio, just to be clear – the Cloud Native Computing Foundation project. Service Mesh is basically an extension of OCP that provides customizable features. In this sense, Service Mesh adds a lot of flexibility and enables centralized control over how microservices are handled.
Features of Service Mesh include load balancing, automatic canary releases, access control, and even end-to-end authentication (via Istio mTLS). Everything in one place.
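Just to illustrate the mTLS part, here is a minimal sketch: a namespace-wide strict mTLS policy is a single PeerAuthentication resource. The namespace name below is a placeholder, and the exact apiVersion may vary with the Istio/OSSM release you run.

    # Sketch: enforce strict mTLS for all workloads in the (placeholder) my-app namespace.
    cat <<'EOF' | oc apply -f -
    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: my-app
    spec:
      mtls:
        mode: STRICT
    EOF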
Objectively, Service Mesh adds a transparent transport layer – all without any application change. To do that, Service Mesh captures/intercepts traffic between services and can modify, redirect, or create new requests to other services.
To do this interception/capture of requests, Service Mesh relies on the Envoy proxy running as a sidecar – an extra container deployed alongside the application container in the same pod.
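A quick sanity check that the injection actually happened is to list the containers of the pod (pod and namespace names below are placeholders) and look for the istio-proxy container next to the application one:

    # With injection enabled, the pod should show the app container plus "istio-proxy"
    # (and usually an "istio-init" init container as well).
    oc get pod my-app-pod -n my-app -o jsonpath='{.spec.containers[*].name}'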
For deployments such as JBoss EAP/WildFly this can be very interesting, to be able to control communication and establish a level of network control beyond what the services (e.g. eap-app-ping for clustering) already provide.
On the other hand, some architectures are emerging that use Istio without the sidecar – the so-called sidecarless approach. One example is Ambient Mesh. Sidecarless implementations can be useful for environments where instrumenting the pod increases its complexity (deployment and instrumentation) and where it is simply easier not to instrument the pod.
Get the namespace inspect for troubleshooting OCP issues.
One of the most useful tools in OCP, together with must-gather, is, I think, the inspect.
DevOps in OCP can be chaotic sometimes, with so many pods, operators, and other objects to keep track of. But that's exactly why the inspect can be a core tool to debug pods/services/deployments in OCP.
So to avoid it all going over your head, just get the inspect first. From there you can do a top-down approach – start with the Deployment(Config) and move to the services and pods – or bottom-up, meaning the pod YAML/logs first, and then move up to the DeploymentConfig.
The idea I'm trying to get across is to grab the inspect – via oc adm inspect ns/$namespace – so it can be a leading indicator for several issues: pod crashes, application issues, pod resource starvation. What happens if the application logs are OK, but the YAML shows the service's label is wrong?
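For reference, a minimal sketch of getting the inspect (the namespace and destination directory below are placeholders):

    # Collect the namespace resources, pod logs and events into a local directory.
    oc adm inspect ns/my-namespace --dest-dir=./inspect-my-namespace

    # For cluster-wide problems, the heavier option is a full must-gather.
    oc adm must-gather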
This avoids, for example, only looking at the pod logs and forgetting about resource allocation – in terms of CPU and memory.
It allows a more global review: pod YAML, core ConfigMaps, services, deployments, everything at once.
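Once the dump is on disk, the review can be done offline. The exact directory layout may change slightly between oc versions, but something along these lines is usually enough to start:

    # See which resources (YAML manifests) were captured in the inspect directory.
    find ./inspect-my-namespace -name '*.yaml' | sort | head -20

    # Search every captured pod log at once instead of opening pods one by one.
    grep -ri 'OutOfMemoryError' ./inspect-my-namespace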
For increasingly complex deployments, with several components – and sometimes with Service Mesh – the Istio sidecar will be inside the pod, and the user can see the sidecar containers and their access logs (configured on the SMCP – Service Mesh Control Plane). I will write some presentations on this subject.
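In the meantime, to read the Envoy access logs once they are enabled on the SMCP, something simple like this works (pod and namespace names are placeholders):

    # Tail the access logs written by the Envoy sidecar of a meshed pod.
    oc logs my-app-pod -n my-app -c istio-proxy --tail=100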
I'm sure the above is not consensus; some people will opt for getting the pod YAMLs or just the pod logs first, and only then get the inspect. But for OCP problems I start with the inspect, because I can see the complete deployment, see all pods in the namespace, and do the work once. So almost by definition you will have an overview of the data, which is much better than a narrow view of only the pod logs.
Shenandoah is also interesting because of its great performance with large heaps – we are talking about 50 GB+, for example. It is also very simple to understand: its pause times do not grow with larger heaps, and so on.
However, as I've explained before, Shenandoah (in its non-generational form) is not applicable to all situations and workloads; there will be workloads where performance is hurt more than helped by Shenandoah – given it is not generational. Being non-generational is a core part of the algorithm and helps considerably in several aspects, but it can hurt in other, more specific ones.
An example is when a high number of very short-lived objects is created in random bursts, which leads to all the GC threads kicking in at the same time and can result in several full pauses in a row. For those cases a generational collector, like G1GC or Parallel, would likely handle the situation better – by splitting the heap into generations and cleaning up the short-lived objects in cheaper young collections. For those (generational) workloads, Amazon (Corretto) is developing Generational Shenandoah.
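Just as an illustrative sketch (flag availability depends on the JDK build; app.jar and the heap size are placeholders), switching between these collectors is a matter of startup flags, and unified GC logging makes the comparison measurable:

    # Non-generational Shenandoah (older JDKs may also need -XX:+UnlockExperimentalVMOptions).
    java -XX:+UseShenandoahGC -Xmx50g -Xlog:gc*:file=gc-shenandoah.log -jar app.jar

    # Generational baseline (G1GC) for the same workload, for comparison.
    java -XX:+UseG1GC -Xmx50g -Xlog:gc*:file=gc-g1.log -jar app.jar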
In this respect, I've also seen some comments/discussions suggesting Shenandoah will eventually surpass everything and should replace G1GC/Parallel for all workloads, similar to how G1GC replaced CMS. That wouldn't be the case, given some workloads perform better with generational collectors. In this sense, Shenandoah is not necessarily an "improved" G1GC, so I won't suggest that all workloads be moved to Shenandoah.
Consequently, there needs to be due diligence from the development team to verify how a non-generational collector is behaving – in terms of latency, throughput, and last (but not least) footprint – the latter being the one most often sacrificed when developing in Java or in other garbage-collected environments.
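A minimal way to check those dimensions on a running JVM, assuming jcmd is available and <pid> is the application process:

    # Footprint: current heap usage and capacity.
    jcmd <pid> GC.heap_info

    # Confirm which collector and sizing flags are actually in effect.
    jcmd <pid> VM.flags

    # Latency/throughput: pause times and GC overhead come from the unified GC log (-Xlog:gc*).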
But this can be generalized to pretty much anything in the JVM/Java: no magic JVM flag will cut latency in half (except in very specific cases, for example where a certain collector is more adequate than another).
During the pandemic I watched a lot of Petter's content on the Mentour Pilot YouTube channel – the channel covers several aspects of aviation: deep technical, procedural, and behavioural analysis.
I can say Petter taught several important life lessons: beware of confirmation bias and prejudice, verify your assumptions, always keep learning, and trust the team.
One important thing I learned from him was the use of the PIOSEE decision model – and how pilots use this model in critical situations. I show it below:
Problem: it requires you to swiftly identify the problem at hand.
Information: gather information about the problem that is occurring.
Options: with the gathered information, you and your team generate options to solve the problem.
Select: you need to select an option after efficiently evaluating the alternatives.
Execute: options are worthless without swift and effective execution.
Evaluate: after execution, you and your team evaluate the process, noting places for improvement.
PIOSEE model – PIOSEE is similar to the FORDEC model, which has the same number of stages.
This decision-making model can be very useful in several situations and can be applied to IT troubleshooting as well – from war rooms (where an actual malfunction in systems operations takes a system offline) to upgrade and migration procedures.
First, defining the problem is the first step towards understanding it. A well-defined problem will be troubleshot much better and faster. Sometimes the problem definition can be much harder than finding the actual solution. Knowing the problem, we will know which resources (human and material/IT) are needed to solve it.
Then comes collecting the right information: it can be an inspect from OpenShift (oc adm inspect), a server report (from Infinispan), or even a few heap/thread dumps or VM.info from Java applications (deployed in Kubernetes or not). It can also mean collecting custom resources, in case we need to see the API/resources created by some Operator (Service Mesh, Data Grid Operator, MTA Operator) and so on. Knowing what data to collect for each situation will result in a much faster troubleshooting phase.
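A few of those collection commands, sketched with placeholder names and PIDs (which custom resources exist obviously depends on the Operators installed):

    # OpenShift: namespace inspect and, if the Service Mesh operator is installed, its custom resources.
    oc adm inspect ns/my-namespace --dest-dir=./inspect
    oc get servicemeshcontrolplane -n istio-system -o yaml > smcp.yaml

    # Java: thread dump, heap dump and VM.info from the running JVM.
    jcmd <pid> Thread.print > threads.txt
    jcmd <pid> GC.heap_dump /tmp/heap.hprof
    jcmd <pid> VM.info > vm-info.txt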
Later comes the analysis of the data, provided all the information has been collected – which can follow a top-down approach, from the custom resource (for example, in case an Operator is used) down to the very low level: kernel tracing data, kernel audit logs, or even specific heap-dump interpretation. This leads to the analysis of the options we have: restart, reboot, upgrade, downgrade, remove certain JVM flags, add certain JVM flags, rewrite the system.
The selection of an option, with its trade-offs, is the next step in the decision model – one needs to understand the data, interpret it, and then select the option. I think it is very important to consider two aspects at this stage: trade-offs and time to implement. If an option has too many trade-offs, other options should be considered. Once the candidate options are all listed, the selection should be made promptly.
Then the execution of the option should be done thoroughly – with the right resources following the procedures (with or without checklists), ideally tested beforehand – but sometimes the procedure is sui generis, meaning it is the first time it is happening and it might not have been tested/prepared before.
Finally, there is the evaluation of the system after the procedure – this includes observable references; in Java particularly, jcmd thread dumps and heap information will usually provide enough data. These references give clues about whether the system is performing well or not. If more information is required, more data can be collected from the system, and this process can be iterative until the (initial) problem is 100% solved.
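For the Java case above, that evaluation can be as simple as taking a couple of snapshots a few minutes apart after the change and comparing them (<pid> is a placeholder):

    # Snapshot the thread and heap state after the change; repeat and compare.
    jcmd <pid> Thread.print > threads-after.txt
    jcmd <pid> GC.heap_info > heap-after.txt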
I think trying to establish procedures, methods, and preparations for critical situations can help considerably here. In this regard, the QA/QE of a system/Java application can prevent problems – and it is very useful, if not essential, before deploying to production. Knowing in advance which procedures to follow and how long they should take can bring the system back much faster.