KES Checks

Rotating your KES keys is one of the regular maintenance activities of Stake Pool Operators. Unfortunately, this activity can introduce hidden issues like the ones below:

Issue 1: KES Access Rights Not Set Correctly

When performing a KES rotation, the stake pool operator may choose to work as a different user, such as root. This is not ideal because newly created files will be owned by that user, which causes unforeseen issues if not remediated.

Symptoms

As with all advanced checks, there are no obvious symptoms until you have missed a block.

When that happens, gLiveView will show Leader has increased by 1 but Adopted and Invalid will have no increase.

For this particular error, you will also notice that node.cert is owned by root or some user other than the user that you usually start your nodes with.

Check

Assuming the user and group you use to start your nodes is "cardano", you can check if this problem is present in your system with a simple list command:

cd $NODE_HOME
ls -l node.cert

You should see an output similar to the following:

-rw------- 1 cardano cardano  365 Apr 12 21:55 node.cert

Ensure that the file is owned by the user that starts your nodes and that this user has at least read-write permissions. If not, this problem may occur and you may fail to create blocks.
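The same check can be scripted so that it is easy to repeat after every rotation. The following is a minimal sketch, not part of any Cardano tooling: the function name check_file_access is hypothetical, and it assumes GNU coreutils stat (the -c option), as found on most Linux systems.

```shell
# Hypothetical helper (not part of any Cardano tooling).
# Assumes GNU coreutils stat, which provides the -c format option.
check_file_access() {
    local file="$1" expected_user="$2" owner perms
    owner=$(stat -c '%U' "$file") || return 2
    perms=$(stat -c '%A' "$file") || return 2   # for example: -rw-------
    if [ "$owner" != "$expected_user" ]; then
        echo "WARN: $file is owned by $owner, expected $expected_user"
        return 1
    fi
    # Characters 2 and 3 of the mode string are the owner's read and write bits.
    case "$perms" in
        ?rw*) echo "OK: $file is owned by $owner with mode $perms" ;;
        *)    echo "WARN: owner of $file lacks read-write access (mode $perms)"
              return 1 ;;
    esac
}

# Example usage (the user name "cardano" follows the text above):
# check_file_access "$NODE_HOME/node.cert" cardano
```

The function returns a non-zero status on any warning, so it can also be used as a guard in a larger maintenance script.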

If you already noticed that you failed to create a block, you can confirm if this is the cause using the method below.

Run this from your block producing node to extract data from the console:

sudo journalctl -u cardano-node.service > blockfail.log

Open blockfail.log and search for "TraceNodeIsLeader" and you will find details of the block you missed.
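Instead of opening the file in an editor, the search can be done with grep. This is a small sketch; the function name show_leader_traces and the default of 20 lines are arbitrary choices for illustration.

```shell
# Hypothetical helper: print the most recent log lines that mention
# TraceNodeIsLeader, prefixed with their line numbers in the file.
# Usage: show_leader_traces LOGFILE [MAX_LINES]
show_leader_traces() {
    local logfile="$1" max_lines="${2:-20}"
    grep -n 'TraceNodeIsLeader' "$logfile" | tail -n "$max_lines"
}

# Example usage:
# show_leader_traces blockfail.log 5
```

The line numbers make it easy to jump to the surrounding entries in blockfail.log and read what happened just after the node was elected leader.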

Root Cause and Avoidance

The KES rotation was run under a sudo shell or directly as root. Therefore, the ownership of node.cert was assigned to another user (e.g. root) instead of the user you use to start up your nodes.

To avoid this issue, always log in as the user that you use to start up your nodes and use sudo only when you need to perform administrative activities.
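If you keep your rotation steps in a script, a guard at the top can stop it from running as the wrong user. The sketch below is one possible way to do this; the function name require_user is hypothetical.

```shell
# Hypothetical guard: succeed only when the current user matches EXPECTED_USER.
require_user() {
    local expected_user="$1" current_user
    current_user=$(id -un)
    if [ "$current_user" != "$expected_user" ]; then
        echo "Refusing to continue: running as $current_user, expected $expected_user" >&2
        return 1
    fi
}

# Example usage at the top of a rotation script:
# require_user cardano || exit 1
```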

Remediation

Assuming you use "cardano" as the user and group name for starting your nodes, use the following commands to fix the permissions:

cd $NODE_HOME
sudo chown cardano node.cert
sudo chgrp cardano node.cert
sudo chmod u+rw node.cert

It is a good idea to check $NODE_HOME regularly to see if any files have owner or group assigned to someone other than the user that you use to start your nodes (e.g. cardano). Apply the chown and chgrp commands to these files accordingly.
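A regular check like this can be done with find, which can filter on owner and group. The following is a sketch under the assumption that everything under $NODE_HOME should belong to one user and group; the function name list_foreign_files is hypothetical.

```shell
# Hypothetical helper: list everything under DIR whose owner is not USER
# or whose group is not GROUP. Prints nothing when all entries match.
list_foreign_files() {
    local dir="$1" user="$2" group="$3"
    find "$dir" \( ! -user "$user" -o ! -group "$group" \) -print
}

# Example usage:
# list_foreign_files "$NODE_HOME" cardano cardano
```

An empty result means no file needs attention. Any path that is printed is a candidate for the chown and chgrp commands shown above.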

Credits to Luc of LVLUP Pool for providing the basic details of this issue

Issue 2: KES File Was Not Copied After Rotation

As of this writing, KES rotation is a very manual activity and the stake pool operator needs to make sure each step is followed correctly. Mistakes are usually discovered through errors, but this particular mistake does not produce any easily viewable errors.

Symptoms

As with all advanced checks, there are no obvious symptoms until you have missed a block. When that happens, gLiveView will show Leader has increased by 1 but Adopted and Invalid will have no increase.

For this particular error, you may also notice that vrf.skey is newer than node.cert and kes.skey, though this does not always mean that you have this error.

Check

When you rotate your KES keys, you create a new kes.skey and node.cert. Therefore, the timestamps of both files should be very close to each other. You can check this quickly using a list command:

cd $NODE_HOME
ls -l kes.skey node.cert

You should see an output similar to this:

-rw------- 1 cardano cardano 1327 Apr 12 21:32 kes.skey
-rw------- 1 cardano cardano  365 Apr 12 21:55 node.cert

If you see that kes.skey is more than a few minutes older than node.cert (especially if they are days apart), then it is likely that you have this issue.
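To avoid comparing timestamps by eye, the gap between the two files can be computed directly. This is a minimal sketch: the function name mtime_gap_seconds is hypothetical, it assumes GNU coreutils stat, and the 30-minute threshold in the usage example is an arbitrary assumption rather than a rule.

```shell
# Hypothetical helper: print how many seconds NEWER was modified after OLDER.
# Assumes GNU coreutils stat (%Y is the modification time in epoch seconds).
mtime_gap_seconds() {
    local older="$1" newer="$2" t_older t_newer
    t_older=$(stat -c '%Y' "$older") || return 2
    t_newer=$(stat -c '%Y' "$newer") || return 2
    echo $((t_newer - t_older))
}

# Example usage; the 30-minute threshold is an arbitrary assumption,
# so pick a value that matches how long your own rotation usually takes:
# gap=$(mtime_gap_seconds kes.skey node.cert)
# [ "$gap" -le 1800 ] || echo "kes.skey is $((gap / 60)) minutes older than node.cert"
```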

Root Cause and Avoidance

When you performed your KES rotation, you only copied node.cert but forgot to copy kes.skey from your cold machine.

To avoid this issue, make a checklist of the steps you will do prior to performing maintenance activities like KES rotation and physically check each step as you complete it. It is also a good idea to postpone maintenance activities until you are well rested and alert.

Remediation

Go back to your cold machine and copy the correct kes.skey onto your block producing node.

If you cannot find the correct kes.skey, the safest solution is to perform the KES rotation again, making sure this time that you copy both kes.skey and node.cert to your block producing node at the same time.
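One way to confirm that both files arrived intact and belong together is to record their checksums on the cold machine and verify them on the block producing node. This is only a sketch: the function names and the file name kes-rotation.sha256 are hypothetical, and it assumes sha256sum from GNU coreutils is available on both machines.

```shell
# Hypothetical helpers; the checksum file name is an arbitrary choice.
# Assumes sha256sum from GNU coreutils is available on both machines.

# Run in the key directory on the cold machine.
write_rotation_checksums() {
    sha256sum kes.skey node.cert > kes-rotation.sha256
}

# Run in $NODE_HOME on the block producer after copying kes.skey, node.cert
# and kes-rotation.sha256 there. Fails if either file differs or is missing.
verify_rotation_checksums() {
    sha256sum -c kes-rotation.sha256
}
```

If a stale kes.skey is still in place on the block producer, the verification step reports it as FAILED instead of letting the mistake go unnoticed until a block is missed.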

Issue 3: Incorrect KES Counter File Was Used

As of this writing, KES rotation is a very manual activity and the stake pool operator needs to make sure each step is followed correctly. Mistakes are usually discovered through errors, but this particular mistake does not produce any easily viewable errors.

Symptoms

As with all advanced checks, there are no obvious symptoms until you have missed a block. When that happens, gLiveView will show Leader has increased by 1 but Adopted and Invalid will have no increase.

Check

First, check the current number of rotations recorded in your node.cert:

cd $NODE_HOME
cardano-cli text-view decode-cbor --in-file node.cert | grep int | head -1

You should see an output similar to this:

      02  # int(2)

When you rotate your KES keys, you also update your node.counter file, which is located on your cold machine. Go to your cold machine and check the value in the node.counter file:

cd /location-of-your-coldkeys
cat node.counter

The number that you see should be higher than your current number of rotations which you found in the previous step. If it is lower, you should do the steps in the Remediation section below.
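The comparison can be made less error-prone with two small helpers. This is a sketch, not part of cardano-cli: the function names first_cbor_int and counter_is_ahead are hypothetical, and the value from node.counter is entered by hand because that file lives on the cold machine.

```shell
# Hypothetical helpers (not part of cardano-cli).

# Read decode-cbor output on stdin and print the first integer value,
# e.g. a line such as "02  # int(2)" yields 2.
first_cbor_int() {
    sed -n 's/.*int(\([0-9][0-9]*\)).*/\1/p' | head -n 1
}

# Succeed only when the value from node.counter is higher than the
# number of rotations recorded in node.cert.
counter_is_ahead() {
    local cert_count="$1" counter_value="$2"
    [ "$counter_value" -gt "$cert_count" ]
}

# Example usage on the block producer, then with the number read from
# node.counter on the cold machine passed in by hand (3 is a placeholder):
# cert_count=$(cardano-cli text-view decode-cbor --in-file node.cert | first_cbor_int)
# counter_is_ahead "$cert_count" 3 || echo "node.counter is behind node.cert"
```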

Root Cause and Avoidance

If you did not point your KES rotation command to your previous node.counter file, you will end up creating a new counter file which will definitely have a lower number than your current one. Always remember to specify your old counter file when you perform a KES rotation.

To avoid this issue, make a checklist of the steps you will do prior to performing maintenance activities like KES rotation and physically check each step as you complete it. It is also a good idea to postpone maintenance activities until you are well rested and alert.

Remediation

If you have this error, redo the KES rotation steps until the value in node.counter is greater than your current number of rotations.

There have been suggestions that it is possible to increase the number manually in your node.counter file. However, from observation, one number in the cborHex portion also increments. If you have backups of your old node.counter files, you may be able to deduce which number needs to increment in the cborHex portion. However, to be safe and avoid introducing new errors, I suggest simply performing the KES rotation and letting the program update node.counter automatically.
