KES Checks
Rotating your KES keys is one of the regular maintenance activities of Stake Pool Operators. Unfortunately, this activity can introduce hidden issues like the ones below:
Issue 1: KES Access Rights Not Set Correctly
When performing a KES rotation, the stake pool operator may choose to work as a different user, such as root. This is not ideal because newly created files will be owned by that user, which causes unforeseen issues if not remediated.
Symptoms
As with all advanced checks, there are no obvious symptoms until you have missed a block.
When that happens, gLiveView will show Leader has increased by 1 but Adopted and Invalid will have no increase.
For this particular error, you will also notice that node.cert is owned by root or some user other than the user that you usually start your nodes with.
Check
Assuming the user and group you use to start your nodes is "cardano", you can check if this problem is present in your system with a simple list command:
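The exact command is not reproduced here, so the following is only a sketch. It assumes that your key files live in $NODE_HOME and are named node.cert, kes.skey and vrf.skey as in the rest of this guide; the helper name show_key_files is hypothetical.

```shell
# show_key_files DIR: long-list the key files in DIR so that the owner,
# group and permission columns can be inspected.
# (Hypothetical helper; the file names are the ones used in this guide.)
show_key_files() {
    dir="$1"
    for f in node.cert kes.skey vrf.skey; do
        if [ -e "$dir/$f" ]; then
            ls -l "$dir/$f"
        else
            echo "missing: $dir/$f"
        fi
    done
}

# Usage on the block producing node (uncomment to run):
# show_key_files "$NODE_HOME"
```

In each line printed by ls -l, the third and fourth columns are the owner and the group. In this example both should read cardano.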
You should see an output similar to the following:
Ensure that the user has read-write permissions at a minimum. If it does not, this problem may occur and you may fail to create blocks.
If you already noticed that you failed to create a block, you can confirm if this is the cause using the method below.
Run this from your block producing node to extract data from the console:
Open blockfail.log and search for "TraceNodeIsLeader" and you will find details of the block you missed.
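The extraction command is not reproduced here. As a rough sketch, assuming that your node runs as a systemd service (so its console output ends up in the journal) and that the unit is called cardano-node, you could dump the log and search it as follows. Both helper names are hypothetical, and the unit name and time range must be adjusted to your setup.

```shell
# dump_node_log UNIT SINCE OUTFILE: write the journal of a systemd unit to a file.
# Assumption: the node runs under systemd; UNIT is whatever you named the service.
dump_node_log() {
    journalctl -u "$1" --since "$2" --no-pager > "$3"
}

# find_leader_traces FILE: print every line that mentions TraceNodeIsLeader,
# prefixed with its line number in FILE.
find_leader_traces() {
    grep -n "TraceNodeIsLeader" "$1"
}

# Usage on the block producing node (uncomment to run):
# dump_node_log cardano-node yesterday blockfail.log
# find_leader_traces blockfail.log
```

The line numbers printed by grep -n make it easy to jump to the same place in blockfail.log and read the surrounding messages.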
Root Cause and Avoidance
The KES rotation was run under a sudo shell or directly as root. Therefore, the ownership of node.cert was assigned to another user (e.g. root) instead of the user you use to start up your nodes.
To avoid this issue, always log in as the user that you use to start up your nodes, and use sudo only for the administrative activities that need it.
Remediation
Assuming you use "cardano" as the user and group name for starting your nodes, use the following commands to fix the permissions:
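The original commands are not reproduced here. A minimal sketch using chown and chgrp could look like this. The helper name fix_owner and the SUDO variable are hypothetical conveniences; the file path assumes that node.cert lives in $NODE_HOME.

```shell
# fix_owner USER GROUP FILE...: hand each FILE over to USER and GROUP.
# Set SUDO=sudo when the files are currently owned by someone else (e.g. root).
fix_owner() {
    user="$1"
    group="$2"
    shift 2
    for f in "$@"; do
        ${SUDO:-} chown "$user" "$f" || return 1
        ${SUDO:-} chgrp "$group" "$f" || return 1
    done
}

# Usage on the block producing node (uncomment to run):
# SUDO=sudo fix_owner cardano cardano "$NODE_HOME/node.cert"
```

After running it, list the file again to confirm that both the owner and the group have changed.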
It is a good idea to check $NODE_HOME regularly to see if any files have owner or group assigned to someone other than the user that you use to start your nodes (e.g. cardano). Apply the chown and chgrp commands to these files accordingly.
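One way to do this regular check is with find, which can filter on owner and group. This is a sketch only; the helper name list_foreign_files is hypothetical, and cardano stands in for whatever user and group you actually use.

```shell
# list_foreign_files DIR USER GROUP: print every file or directory under DIR
# whose owner is not USER or whose group is not GROUP.
list_foreign_files() {
    find "$1" \( ! -user "$2" -o ! -group "$3" \) -print
}

# Usage on the block producing node (uncomment to run):
# list_foreign_files "$NODE_HOME" cardano cardano
```

No output means that everything under the directory already belongs to the expected user and group.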
Credits to Luc of LVLUP Pool for providing the basic details of this issue
Issue 2: KES File Was Not Copied After Rotation
As of this writing, KES rotation is a very manual activity and the stake pool operator needs to make sure that each step is followed correctly. Mistakes will usually be discovered through errors, but this particular mistake does not produce any easily viewable errors.
Symptoms
As with all advanced checks, there are no obvious symptoms until you have missed a block. When that happens, gLiveView will show Leader has increased by 1 but Adopted and Invalid will have no increase.
For this particular error, you may also notice that vrf.skey is newer than node.cert and kes.skey, though this does not always mean that you have this error.
Check
When you rotate your KES keys, you create a new kes.skey and node.cert. Therefore, the timestamps of both files should be very close to each other. You can check this quickly using a list command:
You should see an output similar to this:
If you see that kes.skey is more than a few minutes older than node.cert (especially if they are days apart), then it is likely that you have this issue.
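If you would rather have a number than compare two dates by eye, the gap can be computed from the modification times. This is a sketch that assumes GNU stat (as found on most Linux systems) and that both files are in $NODE_HOME. The helper name mtime_gap is hypothetical, and the 600-second threshold in the usage lines is only an arbitrary stand-in for "a few minutes".

```shell
# mtime_gap FILE_A FILE_B: print the absolute difference, in seconds,
# between the modification times of the two files (GNU stat assumed).
mtime_gap() {
    a=$(stat -c %Y "$1") || return 1
    b=$(stat -c %Y "$2") || return 1
    if [ "$a" -ge "$b" ]; then
        echo $((a - b))
    else
        echo $((b - a))
    fi
}

# Usage on the block producing node (uncomment to run):
# ls -l --time-style=long-iso "$NODE_HOME/kes.skey" "$NODE_HOME/node.cert"
# gap=$(mtime_gap "$NODE_HOME/kes.skey" "$NODE_HOME/node.cert")
# [ "$gap" -gt 600 ] && echo "kes.skey and node.cert are ${gap}s apart"
```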
Root Cause and Avoidance
When you performed your KES rotation, you only copied node.cert but forgot to copy kes.skey from your cold machine.
To avoid this issue, make a checklist of the steps you will do prior to performing maintenance activities like KES rotation and physically check each step as you complete it. It is also a good idea to postpone maintenance activities until you are well rested and alert.
Remediation
Go back to your cold machine and copy the correct kes.skey onto your block producing node.
If you cannot find the correct kes.skey, the safest solution is to perform the KES rotation again, making sure this time that you copy both kes.skey and node.cert at the same time to your block producing node.
Issue 3: Incorrect KES Counter File Was Used
As of this writing, KES rotation is a very manual activity and the stake pool operator needs to make sure that each step is followed correctly. Mistakes will usually be discovered through errors, but this particular mistake does not produce any easily viewable errors.
Symptoms
As with all advanced checks, there are no obvious symptoms until you have missed a block. When that happens, gLiveView will show Leader has increased by 1 but Adopted and Invalid will have no increase.
Check
First, check the current number of rotations for your KES files:
You should see an output similar to this:
When you rotate your KES keys, you also update your node.counter file, which is located on your cold machine. Go to your cold machine and check the value in the node.counter file:
The number that you see should be higher than your current number of rotations which you found in the previous step. If it is lower, you should do the steps in the Remediation section below.
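The comparison can be scripted once you have both numbers. The sketch below assumes that node.counter is a small JSON text file whose description field reads "Next certificate issue number: N"; check your own file first, since that layout is an assumption here. Both helper names are hypothetical, and the current number of rotations is whatever you found in the previous step.

```shell
# next_issue_number FILE: print the number found in the description field
# of a node.counter file.
# Assumption: the field reads "Next certificate issue number: N".
next_issue_number() {
    sed -n 's/.*Next certificate issue number: \([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# counter_is_ahead COUNTER CURRENT: succeed only if COUNTER is greater than CURRENT.
counter_is_ahead() {
    [ "$1" -gt "$2" ]
}

# Usage on the cold machine (uncomment to run; replace 4 with your current
# number of rotations):
# next=$(next_issue_number node.counter)
# counter_is_ahead "$next" 4 || echo "node.counter is too low: redo the KES rotation"
```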
Root Cause and Avoidance
If you did not point your KES rotation command to your previous node.counter file, you will end up creating a new counter file which will definitely have a lower number than your current one. Always remember to specify your old counter file when you perform a KES rotation.
To avoid this issue, make a checklist of the steps you will do prior to performing maintenance activities like KES rotation and physically check each step as you complete it. It is also a good idea to postpone maintenance activities until you are well rested and alert.
Remediation
If you have this error, redo the KES rotation steps until the value in node.counter is greater than your current number of rotations.
There have been suggestions that it is possible to increase the number manually in your node.counter file. However, from observation, I saw that one number in the cborHex portion also increments. If you have backups of your old node.counter files, you may be able to deduce which number needs to increment in the cborHex portion. However, to be safe and not introduce new errors, I suggest simply performing the KES rotation again and letting the program update node.counter automatically.
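If you only want to see which number in the cborHex portion is meant, without editing anything, it can be decoded read-only. This sketch rests on an assumption that you should verify against your own backups: that the cborHex value starts with 82 (a CBOR array of two items) followed directly by the counter encoded as a CBOR unsigned integer. The helper name counter_from_cborhex is hypothetical, and only counters up to 65535 are handled.

```shell
# counter_from_cborhex HEX: decode the unsigned integer that follows the
# leading "82" in HEX. Fails if HEX does not have the assumed layout.
counter_from_cborhex() {
    hex="$1"
    [ "$(printf '%s' "$hex" | cut -c1-2)" = "82" ] || return 1
    head_byte=$(printf '%s' "$hex" | cut -c3-4)
    case "$head_byte" in
        18) v=$(printf '%s' "$hex" | cut -c5-6) ;;   # one-byte value follows
        19) v=$(printf '%s' "$hex" | cut -c5-8) ;;   # two-byte value follows
        0?|1[0-7]) v="$head_byte" ;;                 # values 0..23 are stored inline
        *) return 1 ;;
    esac
    echo $((0x$v))
}

# Usage (uncomment to run; paste the cborHex string from your own file):
# counter_from_cborhex "<cborHex value from node.counter>"
```

If the decoded number and the number in the description field of the same file disagree, treat the file as suspect and redo the rotation rather than repairing it by hand.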