Service Mesh


Recently I’ve been working on some interesting OCP deployments with Service Mesh. It is a very powerful and, I’d say, complicated subject – even for experts on the matter it doesn’t seem trivial.

The context here is Istio, just to be clear – the Cloud Native Computing Foundation project. Service Mesh is basically an extension of OCP that provides customizable features. In this regard, Service Mesh adds a lot of flexibility and enables centralized control over how microservices communicate.

Features of Service Mesh include load balancing, full/automatic authentication, canary releases, access control, and even end-to-end authentication (via Istio mTLS). Everything in one place.
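
Just as an illustrative sketch of the mTLS part (the bookinfo namespace here is hypothetical), a PeerAuthentication resource can enforce strict mutual TLS for every workload in a namespace:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: bookinfo        # hypothetical namespace used for illustration
spec:
  mtls:
    mode: STRICT             # reject plain-text traffic between workloads in this namespace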

Objectively, Service Mesh adds a transparent transport layer – all without any application change. To do that, Service Mesh captures/intercepts the traffic between services and can modify, redirect, or create new requests to other services.

To do this interception/capture of requests, Service Mesh relies on the Envoy sidecar – a container that runs alongside the application container in the same pod.
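
For illustration, in OpenShift Service Mesh the injection is opt-in per workload: annotating the pod template of a Deployment asks the mesh to add the Envoy container (the deployment name below is hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-eap-app                        # hypothetical deployment name
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"   # request Envoy sidecar injection for these pods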

For deployments such as JBoss EAP/WildFly this can be very interesting, since it makes it possible to control communication and establish a level of network control beyond what the services (eap-app-ping for clustering) already provide.

On the other hand, some architectures are coming up that use Istio without the sidecar – so-called sidecarless meshes. One example is Ambient Mesh. Sidecarless implementations can be useful for environments where instrumenting the pod increases its complexity (deployment and instrumentation) and where it is simply easier not to instrument the pod.

Using the inspect on OpenShift


Get the namespace inspect for troubleshooting OCP issues.

One of the most useful tools in OCP, together with must-gather, I think, is the inspect.

DevOps in OCP can be chaotic sometimes, with so many pods and operators in play. But that’s exactly why the inspect can be a core component to debug the pods/services/deployments in OCP.

So, to avoid all of this going over your head, just get the inspect first. From there you can take a top-down approach – start with the deployment(config) and move to the services and pods – or a bottom-up one, meaning the pod yaml/logs first, and then move up to the deployment config.

I mean, the idea I’m trying to lay out is to get the inspect – via oc adm inspect ns/$namespace – so it can be a leading indicator for several issues: pod crashes, application issues, pod resource starvation. What if the application logs are fine, but the yaml shows the service’s label selector is wrong?
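
A minimal sketch (the namespace name is hypothetical) – the inspect dumps the namespace resources and pod logs into a local directory that can be browsed offline:

# collect everything from the namespace into ./inspect.local
oc adm inspect ns/my-app-namespace --dest-dir=./inspect.local

# the dump contains the namespace resources (deployments, services, pods, events)
# plus the current and previous container logs for each pod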

This avoids, for example, only looking at the pod logs and forgetting about the resource allocations – in terms of CPU and memory.

It is about doing a more global review: pod yaml, core configmaps, services, deployments – everything at once.

For increasingly complex deployments, with several components – and sometimes with Service Mesh – the Istio sidecar will be inside the pod, and the user can see the sidecar containers and their access logs (enabled on the SMCP – the ServiceMeshControlPlane). I intend to write some presentations on this subject; a sketch of the access-log setting follows below.
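
As a rough sketch, assuming an SMCP v2 resource named basic in istio-system, the Envoy access log can be sent to stdout like this:

apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:
  name: basic
  namespace: istio-system
spec:
  proxy:
    accessLogging:
      file:
        name: /dev/stdout      # send Envoy access logs to the container stdout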

I’m sure the above is not consensus; some people will opt for getting the pod yamls or just the pod logs first, and only then get the inspect. But for OCP problems I start with the inspect, because I can see the complete deployment, see all the pods in the namespace, and do the work once. So almost by definition you will have an overview of the data, which is much better than a narrow view of only the pod logs.

Shenandoah won’t save every badly performing situation


The Shenandoah garbage collector has already been discussed on this blog several times, and I’m certainly a fan of it.

First, because at Red Hat I was mentored by Roman Kennke – who developed the algorithm and co-wrote the original paper: “Shenandoah: An open-source concurrent compacting garbage collector for OpenJDK”, by Christine H. Flood, Roman Kennke, and Andrew Dinn.

But also because of its awesome performance for large heaps – we are talking about 50 GB+ heaps, for example. It is also quite simple to understand: pause times don’t grow with larger heaps, and so on.

However, as I’ve explained before, Shenandoah (in its non-generational form) is not applicable to all situations and workloads, and there will be workloads whose performance is hurt more than helped by Shenandoah – precisely because it is not generational. Being non-generational is a core part of the algorithm and helps considerably in several aspects, but it can hurt in other, more specific ones.

On this matter, actually, the Amazon team is working on generational Shenandoah, which it announced in 2021 – to try it, download Corretto and set:

-XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions
-XX:ShenandoahGCMode=generational

An example is when a high number of very short-lived objects is created in random bursts, which leads to all the GC threads kicking in and running at the same time, and can result in several full pauses in a row. For those cases a generational collector, like G1GC or Parallel, would likely handle the situation better – by splitting the heap into generations and collecting the short-lived objects separately. For those (generational) workloads Amazon (Corretto) is developing its Generational Shenandoah.

In this respect as well, I’ve seen some comments/discussions saying that Shenandoah will eventually surpass everything and should replace G1GC/Parallel for all workloads, similar to how G1GC replaced CMS. That isn’t the case, given that some workloads perform better with generational collectors. Shenandoah is not necessarily an “improved” G1GC, so I wouldn’t suggest moving all workloads to Shenandoah.

Consequently, there needs to be due diligence from the development team to verify how a non-generational collector is behaving – in terms of latency, throughput, and last (but not least) footprint – the latter being the aspect most often sacrificed when developing in Java or on other garbage-collected runtimes.
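
A minimal sketch of that due diligence, assuming a JDK where Shenandoah is available and a hypothetical application jar: run the same workload with unified GC logging under each collector and compare the pause/throughput profiles:

# run with Shenandoah and detailed GC logging
java -XX:+UseShenandoahGC -Xlog:gc*:file=gc-shenandoah.log \
     -jar my-app.jar          # hypothetical application jar

# same workload with G1 for comparison
java -XX:+UseG1GC -Xlog:gc*:file=gc-g1.log -jar my-app.jar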

But this can be generalized to pretty much anything in JVM/Java – no magic JVM flag will cut latency in half (except in very specific cases, for example where one collector is more adequate than another).

PIOSEE Decision Model and preparations for critical situations


During the pandemic I watched a lot of Petter’s content on the Mentour Pilot YouTube channel – the channel covers several aspects of aviation: deep technical, procedural, and behavioural analysis.

I can say Petter taught me several important life lessons: beware of confirmation bias and prejudice, verify your assumptions, always keep learning, and trust the team.

One important thing I learned from him was the use of the PIOSEE decision model – and how pilots use this model in critical situations. I show it below:

P – Problem: swiftly identify the problem at hand.
I – Information: gather information about the problem that is occurring.
O – Options: with the gathered information, you and your team generate options to solve the problem.
S – Select: select an option after efficiently evaluating the alternatives.
E – Execute: options are worthless without swift and effective execution.
E – Evaluate: after execution, you and your team evaluate the process, noting places for improvement.

PIOSEE Model – PIOSEE is similar to the FORDEC model, given it has the same number of stages [1][2][3].


This decision-making model can be very useful in several situations and can be applied to IT troubleshooting as well – from war rooms (where an actual operational malfunction has taken the system offline) to upgrade and migration procedures.

First, defining the problem is the first step towards understanding it. A well-defined problem will be troubleshot much better and faster. Sometimes the problem definition can be much harder than finding the actual solution. Knowing the problem, we are able to know which resources (human and material/IT) are needed to solve it.

Then comes collecting the right information: it can be an inspect from OpenShift (oc adm inspect), a server report (from Infinispan), or even a few heap/thread dumps/VM.info outputs from Java applications (deployed in Kubernetes or not). Or even collecting custom resources, in case we need to see the API/resources created by some operator (Service Mesh, Data Grid Operator, MTA Operator) and so on. Knowing what data to collect for each situation results in a much faster troubleshooting phase.
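
As a small sketch of the Java-side collection (the PID 12345 below is hypothetical), jcmd can grab the thread dump, a heap dump, and the VM.info report:

jcmd 12345 Thread.print > threads.txt        # thread dump
jcmd 12345 GC.heap_dump /tmp/heap.hprof      # heap dump
jcmd 12345 VM.info > vm-info.txt             # JVM configuration and environment report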

Later, after the analysis of the data – provided all the information has been collected, which can go from a top-down view (the custom resources, for example, in case an operator is used) down to a very low level (kernel tracing data, kernel audit logs, or even heap-dump interpretation) – comes the analysis of the options we have: restart, reboot, upgrade, downgrade, remove certain JVM flags, add certain JVM flags, rewrite the system.

The selection of an option, and its trade-offs, is the next step in the decision model – one needs to understand the data, interpret it, and then select the option. I think it is very important to consider two aspects at this stage: the trade-offs and the time to implement. If an option has too many trade-offs, other options should be considered. Once the candidate options are all listed, the selection should be made at once.

Later, the execution of the option should be done thoroughly – with the right resources following the procedures (with or without checklists), and of course it is better if those can be tested beforehand – but sometimes the procedure is sui generis: it is the first time it is happening and may not have been tested/prepared before.

Finally comes the evaluation of the system after the procedure – this includes visual references; in Java particularly, jcmd thread/heap data will usually provide enough evidence. Those references give clues about whether the system is performing well or not. If more information is required, more data can be fed back from the system, and this process can be iterated until the (initial) problem is 100% solved.

I think trying to establish procedures, methods, and preparations for critical situations helps considerably here. In this matter, the QA/QE of a system/Java application can avoid problems – and it is very useful, if not essential, before deploying to production. Also, knowing in advance which procedures to follow and how long they should take can bring the system back much faster.

[1] https://medium.com/@sb_30by50/7-leveraging-piosee-and-nits-in-a-time-of-massive-uncertainty-2c3582f77475

[2] https://khurrambhatti.com/2021/12/27/piosee-for-team-coaching/

[3] https://pilot-network.com/news/decision-making-models

Resolution/Update for 2023.


In 2022 I stayed away from the blog for several months – I will try to update it more often, like in 2021/2020.

The reason is that I didn’t want to write short blog posts, preferring to focus on long ones.

As 2023 resolutions, some aspects I would like to cover this year are:

Shenandoah, JNPL, EJB

OpenShift: operators, OpenShift networking, routes, services…

DataGrid/Infinispan, and JVM.

But also WildFly as well, and what I’ve learned outside of IT, like movies (mostly character-driven plots, not plot-driven ones).

LinkedIn is bad at basic stuff


I mean, I barely visit the website, and I think it is great for looking for jobs and such – for finding people it is an amazing tool. But it fails very badly at some basic features when you are not looking for a job and don’t have the time to keep several profiles updated. Main reasons below:

The default/main profile is written in stone, meaning it is a pain to switch your default profile. You need to go there, delete it and re-create it from scratch – and that’s because they enable this now; for years you just couldn’t. If you moved from one country to another, you could write your profile in a new language but just had to give up on setting it as primary.

There is no integration between languages. I literally have to write 5 profiles if I want to keep Portuguese, English, French, German, and Italian as profile languages. For any minimal change in your current work, it is a pain to go there and update everything again. I gave up and keep only 3 main profiles.

No language verification. You can add a new Spanish profile and write everything in another language; no validation is done. This is both good and bad, because at least you can rewrite your main profile in another language, like I did.

Jobs are dynamic, like life, not static. This is a more general problem, also related to writing a CV. You start doing one thing, focus on another, then do another thing; the role you are assigned has the duties you are supposed to do, but from time to time the focus changes. Keeping all of this updated, every time, is painful. A more interactive experience would be more useful than a few lines of description. Maybe a better timeline: you started doing that, now you do this, you were someone’s mentee, now you are a mentor, and so on – perhaps broken down by project or product.

Sometimes I even think about deleting my account there and keeping just my personal blog. But then I couldn’t see my friends’ work, their promotions, or even job changes. Maybe they should try to be less of a Facebook/Meta-style feed and more of a professional graph network: you could select the people you interact with and see their projects and how they are presented.

DG Kubernetes Operators


I’m sure if you are into Kubernetes/OCP you have already played with those bundles of Go automation called operators. If you haven’t, it’s pretty cool stuff in the sense of automating the deployment; some operators cover the whole deployment lifecycle and, as usual, if a pod or a resource differs from the spec, the operator will self-adjust – for instance, it will spin up another pod if the number of pods differs from the operator settings.

What is interesting about this, I would say for DevOps teams, is how easy it is to deploy applications/start EAP/DG/SSO with the operators. As soon as you download one from OperatorHub in OCP, for instance, and install it in your respective project (given you have storage set up), the deployment completes seamlessly; only the namespace needs to be known beforehand.

Depending on the operator you get more or fewer features, from seamless upgrades up to complete autopilot.

I think it is amazing how powerful it is: with a couple of commands using an OpenShift template one can spawn a complete cluster very easily – for example oc process -f rhdg-setup.yaml | oc apply -f - – where the template defines a namespace for the operator, a namespace for the pods, the operator itself (Subscription and OperatorGroup) and the cluster itself:

- apiVersion: operators.coreos.com/v1      # API version
  kind: OperatorGroup                      # OperatorGroup
  metadata:
    name: datagrid
    namespace: ${OPERATOR_NAMESPACE}
  spec:
    targetNamespaces:
      - ${CLUSTER_NAMESPACE}
      # - ${GRAFANA_NAMESPACE}
- apiVersion: operators.coreos.com/v1alpha1
  kind: Subscription                       # Subscription
  metadata:
    name: datagrid-operator
    namespace: ${OPERATOR_NAMESPACE}       # the namespace the operator is installed into
  spec:
    channel: 8.2.x                         # channel to fetch the operator from, here 8.2.x instead of 8.1.x or 8.3.x
    installPlanApproval: Manual            # manual approval
    name: datagrid
    source: redhat-operators               # catalog source
    sourceNamespace: openshift-marketplace # catalog source namespace
    startingCSV: datagrid-operator.v8.2.8  # starting version

Above, the fields are complete for the Subscription and OperatorGroup – as described here, along with the API version and so on, defined here. It is very easy to add JVM flags or expose routes: just do it in the custom resource, for instance the Infinispan or Cache CR, and that’s all.
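
Just as a hedged sketch (the cluster name and the JVM flag are hypothetical; the fields follow the operator’s Infinispan CRD), the JVM options and the route exposure would look roughly like this:

apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: example-cluster                    # hypothetical cluster name
  namespace: ${CLUSTER_NAMESPACE}
spec:
  replicas: 2
  container:
    extraJvmOpts: "-XX:+UseShenandoahGC"   # extra JVM flags passed to the server pods
  expose:
    type: Route                            # expose the cluster through an OpenShift route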

Learning nihongo update – Ado


After about 6 months of learning nihongo – aka Japanese – I can say I moved from A1 to A1. The language is considerably difficult and one must be very disciplined to keep studying every day and make progress. But I did make some progress: when listening to a song, I can now recognize a few words (yes, that was my progress). I want to end on a good (and hopeful) note, though, and will keep learning for the next 2-3 years and be able to progress considerably. It takes more discipline than anything else – and consistency.

For sure I will make more progress when learning Japanese in Japan, but I will push as much as I can here in カナダ.

To share (シェアする – shea suru), here is a song that I like very much by a Japanese singer (yes, born in 2002) ~ Ado. She has a magnificent and strong voice.

Verbs to learn in this song:

喰らって –> eat

 アンダスタン –> lol! anderstand

Spring Boot multiple profiles


Developing Java web applications with Spring comes in handy considerably when building production (and fast non-production) software. I mean, just adding an application.properties with the line below makes it very simple to use several profiles, one per environment (since Spring Boot 1.3):

spring.profiles.active=@activatedProperties@

And then defining the profiles in the Maven pom.xml:

<profiles>
    <profile>
        <id>local</id>
        <properties>
            <activatedProperties>local</activatedProperties>
        </properties>
        <activation>
            <!-- set to true to make this profile active by default -->
            <activeByDefault>false</activeByDefault>
        </activation>
    </profile>
</profiles>

And then you have a series of application-*.properties files, which must follow the pattern application-{custom_suffix}.properties and are placed under src/main/resources/ – for example src/main/resources/application-test.properties – where you define the properties:

# AUTO-CONFIGURATION - set false for those which is not required
#spring.autoconfigure.exclude= # Auto-configuration classes to exclude.
spring.main.banner-mode=off
spring.jmx.enabled=false
...
server.jsp-servlet.registered=false
spring.freemarker.enabled=false
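
A small usage sketch (the artifact name is hypothetical): the Maven profile fills in @activatedProperties@ at build time, or the active Spring profile can simply be overridden at startup:

# build with the 'local' Maven profile, so @activatedProperties@ becomes 'local'
mvn clean package -Plocal

# or override the active Spring profile at runtime
java -jar target/my-app.jar --spring.profiles.active=test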

Interestingly, logging.config seems to be resolved from the server deployment, not from the application sources. We can use JAVA_OPTS="$JAVA_OPTS -Dlogfile.name=test_file_name" to set a file name at runtime:

# read the file name from the -Dlogfile.name system property
appender.R.fileName = ${sys:logfile.name}