During the pandemic times I’ve watch the content of Peter on the Mentor Pilot’s Youtube channel considerably – the channel brings several aspects of Aviation – deep technical, procedural, behavioural analysis – within the aviation themes.
I can say Peter taught several important life lessons: beware confirmation bias, prejudice, verify the assumptions, learning always, and trust on the team.
One important thing I’ve learned with him was the usage of PIOSEE decision model – and how the Pilots use this model on critical situations. I show below:
|P||Problem||It requires you to swiftly identify the problem at hand.|
|I||Information||Gather information about the problem that is occurring.|
|O||Options||With the gathered information about the problem, you and your team generate options to solve the problem|
|S||Select||You need to select an option after efficiently evaluating the alternatives.|
|E||Execute||Options are worthless without swift and effective execution.|
|E||Evaluate||After execution, you and your team evaluate the process, noting places for improvement.|
PIOSEE Model – PIOSEE is similar to FORDEC model – given the same number of stages .
This decision making model can be very useful in several situations and can be applied for IT troubleshooting as well – from war room ( where actual systems operations mal functions results in system off-line) but also for upgrade and migration procedures.
First, defining the problem can be very useful and it is the first step to be understanding the problem. A well defined problem will be much better/faster troubleshooted. Sometimes the problem definition can be much harder than finding the actual solutions. Knowing the problem we will be able to know who are the resources (human and material/IT resources) that are needed to solve them.
Then collecting the right information, it can be an inspect from Openshift (oc adm inspect), or a server report (from Infinispan) or even a few heap/thread dumps/VM.info in java applications (deployed in kubernetes or not). Or even collecting custom resources, in case we need to see the API/resources created with some Operator (Service Mesh, or Data Grid Operator, MTA Operator) and so on. Knowing what aspects/data to collect for each situation will result in much faster troubleshooting phase.
Later, after the analysis of the data, provided that all the information is collected – which can be a top down approach (custom resource for example – in case a operator is used) up to the down/very low level – which can be kernel tracing data, kernel audit logs, or even heap dump specific interpretation. This goes on the analysis of the options we have: restart, reboot, upgrade, downgrade, remove certain JVM flags, add certain JVM flags, re-write the system.
The selection of the options, and its trade-offs, is the next step on the decision model – one needs to understand the data, interpret it, and then select the option – I think it is very important to considering two aspects on this stage: trade-off and time for implementing. If the options have many trade-offs, other options should be considered. Once the selection options are all listed – the selection should be done at once.
Later the execution of the option should be done thoroughly – with the right resources following the procedures (with or without the checklists) , but of course better if those can be tested – but sometimes the procedure is sui generis therefore that’s the first time this is happening and might not have been tested/prepared before.
Finally, the evaluation of the system after the procedure will do – this includes visual references, in java particularly – jcmd threads/heap, will provide enough data. If not enough data/references will provide clues to how the system is performing well or not. If required more information, more data can be feed from the system and this process can be iterative until the (initial) problem is 100% solved.
I think trying to establish several procedures, methods, and preparations for critical situations can help considerably for this. In this matter, the QA/QE of a system/java application can avoid problems – and it is very useful if not essential before deploying in production. However, how/what procedures/how long they should take can put the system back much faster.