ValOS

Abstract

This specification defines risks that can apply when operating a blockchain node.

It describes mitigations that reduce the likelihood of those risks being realized and causing problems such as losing the ability to manage a node, reduced economic rewards, or penalties such as slashing.

Finally, it provides a set of controls to verify that a Node Operator is appropriately managing the relevant risks.

ID	Risk Group	Risk Vectors	Risk Vector Description	Relevant Mitigations
FIN1	Process	Onboarding	Onboarded entities are not adequately vetted to ensure financial, operational, regulatory, or reputational appropriateness, resulting in potential financial, legal, or reputational damage	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.1 Identified Individuals 4.2.2 Training 4.3.1 Update Third-party Software 4.4.1 Controlled and Audited Secret Access 4.4.4 Key Management 4.5.1 Least Privilege 4.5.2 Employee Authorization Management 4.5.4 Authentication Policies 4.9.3 Security and Compliance 4.10.1 Stakeholder Communication Management 4.10.2 Incident Response Plans
FIN2	Infrastructure	Deposit	Fiat and digital assets deposited are not received in the appropriate currency, address, or fiat account, leading to financial loss	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.5 Operational Information Management 4.9.3 Security and Compliance
FIN3	Infrastructure	Deposit	Fiat and digital assets are not correctly processed and assets are misallocated to individuals, entities, or operational addresses leading to financial loss	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.3 Signature Management 4.4.5 Operational Information Management 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.6 Process Automation 4.9.2 Node, System and Network Health
FIN4	Process	Withdrawal	Fiat and digital assets are not correctly disbursed to individuals, entities, or addresses, leading to financial and reputational loss	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.3 Signature Management 4.4.5 Operational Information Management 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.6 Process Automation 4.9.2 Node, System and Network Health
FIN5	Infrastructure	Compounding	Staking rewards are not appropriately collected, governed, restaked, compounded, or allocated to clients leading to financial loss	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.8.6 Process Automation 4.9.1 Blockchain 4.9.3 Security and Compliance
FIN6	Process	Reporting	Financial reporting requirements are not adhered to or inconsistently applied, leading to regulatory, legal, and financial consequences	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.8.6 Process Automation 4.9.3 Security and Compliance 4.10.1 Stakeholder Communication Management 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.7 Incident Communication
FIN7	Process	Up to date compliance	Failure to review relevant regulation and update compliance procedures leading to financial, legal, and regulatory repercussions	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.2 Training 4.4.5 Operational Information Management 4.9.3 Security and Compliance 4.10.1 Stakeholder Communication Management 4.10.2 Incident Response Plans 4.10.4 Analyzing Security Events

ID	Risk Group	Risk Vectors	Risk Vector Description	Relevant Mitigations
SLS1	Infrastructure	Operational Failure: Single validator signs two different blocks	Single node signs two different blocks through failure in setting up the anti-slashing mechanism correctly (e.g. local anti-slashing database is disabled or has been deleted) or failure in the validator migration process.	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.3.2 Local Anti-Slashing Database 4.3.3 Signature Management 4.4.4 Key Management 4.6.2 Physically Distributed Infrastructure 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.2 Configuration Management 4.8.3 Protection against Supply-chain Malware 4.8.5 Containerized and Orchestrated Environments 4.8.6 Process Automation 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.5 Doppelgänger Protection
SLS2	Infrastructure	Operational Failure: Shutting down validator only temporarily	Validator shuts down temporarily. System spins up a new validator with the same key	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.3.2 Local Anti-Slashing Database 4.3.3 Signature Management 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.3 Protection against Supply-chain Malware 4.8.5 Containerized and Orchestrated Environments 4.8.6 Process Automation 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.5 Doppelgänger Protection
SLS3	Infrastructure	Operational Failure: Validator keys are used on 2 different validators	System takes the same keys twice from the key database and deploys them on two different validators.	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.3.2 Local Anti-Slashing Database 4.3.3 Signature Management 4.4.4 Key Management 4.4.5 Operational Information Management 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.2 Configuration Management 4.8.3 Protection against Supply-chain Malware 4.8.5 Containerized and Orchestrated Environments 4.8.6 Process Automation 4.9.1 Blockchain 4.9.2 Node, System and Network Health
SLS4	Infrastructure	Operational Failure: Failure in setting up the anti-slashing mechanisms correctly	Failure in setting up the anti-slashing mechanisms correctly (e.g. Web3Signer has no slashing protection enabled, no database, database only in memory and not on disk, 2 or several copies of Web3Signer, slashing database can be deleted)	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.2 Local Anti-Slashing Database 4.3.3 Signature Management 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.2 Configuration Management 4.8.3 Protection against Supply-chain Malware 4.8.5 Containerized and Orchestrated Environments 4.8.6 Process Automation 4.9.1 Blockchain 4.9.2 Node, System and Network Health
SLS5	Infrastructure	Double key usage in the CI/CD pipeline	Usage of same key within different environments causing a slashing	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.3 Signature Management 4.4.1 Controlled and Audited Secret Access 4.4.4 Key Management 4.4.5 Operational Information Management 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.1 Avoid Customizing Third-party Software 4.8.2 Configuration Management 4.8.3 Protection against Supply-chain Malware 4.8.5 Containerized and Orchestrated Environments 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.5 Doppelgänger Protection 4.10.4 Analyzing Security Events
SLS6	Software	Software Bug (e.g. Validator Client) (Intentional or accidental) through update	New versions of a validator client may cause errors that lead to slashing (Supply chain attack)	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.4 Client Diversity 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.2 Configuration Management 4.8.3 Protection against Supply-chain Malware 4.8.4 Deployment testing environments 4.8.5 Containerized and Orchestrated Environments 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.4 Upgrades
SLS7	Software	Software Bug (e.g. Validator Client) through software customization	New versions of a validator client have errors that lead to slashing	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.4 Client Diversity 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.1 Avoid Customizing Third-party Software 4.8.3 Protection against Supply-chain Malware 4.8.4 Deployment testing environments 4.8.5 Containerized and Orchestrated Environments 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.4 Upgrades
SLS8	Replaced by HCK1
SLS9	Replaced by HCK2
SLS10	Replaced by HCK3
SLS11	Replaced by HCK4
SLS12	Replaced by HCK4
SLS13	Replaced by HCK4
SLS14	Process	Operational Failure: Incorrect implementation of the failover mechanism: Failover system comes unexpectedly online	The failover does not ensure that the old system is fully stopped, or it uses a stale version of the anti-slashing database, e.g. the failover system starts accidentally although the primary system is not down	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.3 Signature Management 4.4.5 Operational Information Management 4.6.4 Protection against Environmental Threat 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.3 Protection against Supply-chain Malware 4.8.4 Deployment testing environments 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.10.4 Analyzing Security Events
SLS15	Process	Operational Failure: Incorrect implementation of the failover mechanism: Primary system comes unexpectedly back online	The failover does not ensure that the old system is fully stopped, or it uses a stale version of the anti-slashing database, e.g. the failover system starts (manually or automatically) because the primary system is down, and the primary system then comes back online	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.3 Signature Management 4.6.4 Protection against Environmental Threat 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.3 Protection against Supply-chain Malware 4.9.1 Blockchain 4.9.2 Node, System and Network Health
SLS16	Removed
SLS17	Process	Operational Failure: Slashing monitoring ignores alerts	Slashing events continue or recur because alerts are not monitored	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.2 Training 4.3.2 Local Anti-Slashing Database 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.3 Protection against Supply-chain Malware 4.8.6 Process Automation 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events
SLS18	Process	Operational Failure: Slashing monitoring does not shut down the validators	Slashing continues because monitoring system fails to automatically shut down malfunctioning validator	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.2 Local Anti-Slashing Database 4.7.2 Comprehensive Testing for Changes to Code 4.8.3 Protection against Supply-chain Malware 4.8.6 Process Automation 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events
SLS19	Process	Incident Response does not update Slashing Database	A slashing event recurs, because the database is not updated as part of Incident Response	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.2 Local Anti-Slashing Database 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.3 Protection against Supply-chain Malware 4.8.6 Process Automation 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events
SLS20	Infrastructure	Chainsplit increases slashing penalty	After a chainsplit occurs, continuing to support the leading fork can lead to a larger penalty if it is later rejected	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.4 Client Diversity 4.8.3 Protection against Supply-chain Malware 4.9.1 Blockchain 4.9.2 Node, System and Network Health
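
The double-signing risks above (SLS1 to SLS4) all come down to a validator key signing two different blocks for the same slot. The following Python sketch illustrates the idea behind a local anti-slashing database under simplifying assumptions. The class name `SlashingGuard`, the table layout, and the method `may_sign_block` are hypothetical and do not correspond to the API of any validator client or remote signer. The sketch covers block proposals only and does not address attestations or database migration.

```python
import sqlite3


class SlashingGuard:
    """Hypothetical minimal guard against signing two blocks in one slot."""

    def __init__(self, db_path: str) -> None:
        # A file path keeps the record on disk, so it survives a restart.
        # SLS4 names an in-memory-only database as a risk.
        self.conn = sqlite3.connect(db_path)
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS signed_blocks ("
                "pubkey TEXT NOT NULL, slot INTEGER NOT NULL, "
                "root TEXT NOT NULL, PRIMARY KEY (pubkey, slot))"
            )

    def may_sign_block(self, pubkey: str, slot: int, root: str) -> bool:
        """Return True only if signing `root` cannot conflict with an
        earlier signature by the same key for the same slot."""
        with self.conn:
            # The primary key means only the first block recorded for a
            # (pubkey, slot) pair is stored; later inserts are ignored.
            self.conn.execute(
                "INSERT OR IGNORE INTO signed_blocks (pubkey, slot, root) "
                "VALUES (?, ?, ?)",
                (pubkey, slot, root),
            )
        stored = self.conn.execute(
            "SELECT root FROM signed_blocks WHERE pubkey = ? AND slot = ?",
            (pubkey, slot),
        ).fetchone()
        # Re-signing the identical block is harmless; a different one is not.
        return stored[0] == root
```

A caller would ask `may_sign_block` before every proposal and refuse to sign when it returns `False`. The guard only helps if every signer for a given key shares the same database file, which is why duplicated signers and deleted or stale databases appear as separate risks above.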

ID	Risk Group	Risk Vectors	Risk Vector Description	Relevant Mitigations
DOW1	Infrastructure	External: Operational Failure of Cloud Service Provider	Cloud Downtime, malfunction	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.6.2 Physically Distributed Infrastructure 4.6.4 Protection against Environmental Threat 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.5 Disaster Recovery Plans
DOW2	Infrastructure	Operational Failure of own bare metal set-up due to malfunctioning software	Malfunction of software (e.g. validator client or third party software) leads to downtime	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.4 Client Diversity 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.1 Avoid Customizing Third-party Software 4.8.3 Protection against Supply-chain Malware 4.8.4 Deployment testing environments 4.9.2 Node, System and Network Health 4.9.4 Upgrades 4.9.5 Doppelgänger Protection 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.5 Disaster Recovery Plans
DOW3	Infrastructure	Operational Failure of own bare metal set-up due to malfunctioning hardware	Malfunction of hardware (e.g. physical network, computer system, CPU, RAM) leads to downtime	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.6.1 Managed Physical Access 4.6.5 Manage Equipment Lifecycle 4.9.2 Node, System and Network Health 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.5 Disaster Recovery Plans
DOW4	Infrastructure	External: Operational Failure of own bare metal set-up due to people (man-made)	Employees are responsible for the downtime event (accidentally or intentionally)	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.5.2 Employee Authorization Management 4.5.4 Authentication Policies 4.6.1 Managed Physical Access 4.8.6 Process Automation 4.9.2 Node, System and Network Health 4.9.3 Security and Compliance 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events 4.10.5 Disaster Recovery Plans
DOW5	Infrastructure	External: Operational Failure of own bare metal set-up due to natural causes	A natural event (e.g. earthquake, flood, hurricane) leads to downtime	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.6 Deletion protection 4.6.2 Physically Distributed Infrastructure 4.6.4 Protection against Environmental Threat 4.9.2 Node, System and Network Health 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.5 Disaster Recovery Plans
DOW6	Infrastructure	Failure to design for high availability	Having too few beacon nodes relative to validator clients, leading to: - opportunity costs - slashing on some networks	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.6 Deletion protection 4.6.2 Physically Distributed Infrastructure 4.6.3 Protection against Utility Failure 4.6.4 Protection against Environmental Threat 4.6.5 Manage Equipment Lifecycle 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.9.2 Node, System and Network Health 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events 4.10.5 Disaster Recovery Plans
DOW7	Infrastructure	External: Internet connectivity	Loss of infrastructure network connection due to: - Sudden cloud outage - Sudden internet failure in on-premise machines - Accidental firewall change locks out access.	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.6.1 Managed Physical Access 4.6.2 Physically Distributed Infrastructure 4.6.3 Protection against Utility Failure 4.6.4 Protection against Environmental Threat 4.6.5 Manage Equipment Lifecycle 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.5 Disaster Recovery Plans
DOW8	This risk has been merged into DOW9
DOW9	Infrastructure	Power supply	Volatile power supply damages infrastructure or causes system downtime	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.6.1 Managed Physical Access 4.6.2 Physically Distributed Infrastructure 4.6.3 Protection against Utility Failure 4.6.4 Protection against Environmental Threat 4.6.5 Manage Equipment Lifecycle 4.9.2 Node, System and Network Health 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.5 Disaster Recovery Plans
DOW10	Infrastructure	External: DDOS attack	Systems unresponsive, slowed down, and compromised	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.6.1 Managed Physical Access 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.7.3 Validated Inputs and Outputs 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.3 Security and Compliance 4.9.5 Doppelgänger Protection 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events
DOW11	Software	Software Bug in the Validator Client	Downtime or accidental interpretation of dishonest behavior	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.3 Protection against Supply-chain Malware 4.8.4 Deployment testing environments 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.4 Upgrades 4.10.3 Identifying and Responding to Security Incidents
DOW12	Software	Software Bug in the Validator Client (Intentional or accidental) through software update	New versions of a validator client that may cause errors that lead to downtime (Supply chain attack)	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.2 Configuration Management 4.8.3 Protection against Supply-chain Malware 4.8.4 Deployment testing environments 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.3 Security and Compliance 4.9.4 Upgrades 4.10.3 Identifying and Responding to Security Incidents
DOW13	Software	Software Bug in the Validator Client through software customization	New versions of a validator client may cause errors that lead to downtime	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.1 Avoid Customizing Third-party Software 4.8.2 Configuration Management 4.8.4 Deployment testing environments 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.4 Upgrades 4.10.3 Identifying and Responding to Security Incidents
DOW14	Software	Software Bug in third party software	Third party software failure can lead to downtime of the whole system	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.6 Deletion protection 4.7.2 Comprehensive Testing for Changes to Code 4.8.3 Protection against Supply-chain Malware 4.8.4 Deployment testing environments 4.9.2 Node, System and Network Health 4.10.3 Identifying and Responding to Security Incidents
DOW15	Software	Latency / Failure of relays	Latency / Failure of relays	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.6.3 Protection against Utility Failure 4.6.4 Protection against Environmental Threat 4.6.5 Manage Equipment Lifecycle 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.10.3 Identifying and Responding to Security Incidents
DOW16	Replaced by HCK2
DOW17	Replaced by HCK3
DOW18	Replaced by HCK4
DOW19	Software	Running outdated validator software	The node operator is not updating its validator software	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.4 Client Diversity 4.7.2 Comprehensive Testing for Changes to Code 4.8.1 Avoid Customizing Third-party Software 4.8.6 Process Automation 4.9.1 Blockchain 4.9.3 Security and Compliance 4.9.4 Upgrades 4.10.3 Identifying and Responding to Security Incidents
DOW20	Software	Validator client update incompatible with IT system	System downtime after validator client update caused by incompatibility	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.6 Deletion protection 4.6.5 Manage Equipment Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.1 Avoid Customizing Third-party Software 4.8.4 Deployment testing environments 4.8.6 Process Automation 4.10.3 Identifying and Responding to Security Incidents
DOW21	Software	Updates take too long	System downtime caused by software update processes taking longer than planned, with no failover capacity	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.2 Training 4.3.1 Update Third-party Software 4.3.4 Client Diversity 4.4.5 Operational Information Management 4.6.5 Manage Equipment Lifecycle 4.8.1 Avoid Customizing Third-party Software 4.8.2 Configuration Management 4.8.4 Deployment testing environments 4.8.5 Containerized and Orchestrated Environments 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents

ID	Risk Group	Risk Vectors	Risk Vector Description	Relevant Mitigations
KEC1	Infrastructure	Failure to use vault system	No audit trail and controlled access to secrets	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.3 Signature Management 4.4.1 Controlled and Audited Secret Access 4.4.2 Encrypted Data 4.4.3 Cold Storage 4.4.4 Key Management 4.4.5 Operational Information Management 4.5.4 Authentication Policies 4.6.5 Manage Equipment Lifecycle
KEC2	Replaced by HCK1, HCK2
KEC3	Replaced by HCK1, HCK2
KEC4	Replaced by HCK4
KEC5	Replaced by HCK4
KEC6	Process	Loss of Signing Keys (Operational Failure)	Signing keys are lost in an operational process	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.3.3 Signature Management 4.4.1 Controlled and Audited Secret Access 4.4.2 Encrypted Data 4.4.3 Cold Storage 4.4.4 Key Management 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.6.2 Physically Distributed Infrastructure 4.6.5 Manage Equipment Lifecycle 4.7.3 Validated Inputs and Outputs 4.8.3 Protection against Supply-chain Malware 4.8.6 Process Automation 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events 4.10.5 Disaster Recovery Plans
KEC7	Process	Privilege escalation mechanisms not prevented	Someone with access to one service/node can increase their privileges and do more harm on further nodes.	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.1 Identified Individuals 4.4.4 Key Management 4.4.5 Operational Information Management 4.5.1 Least Privilege 4.5.4 Authentication Policies 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.7.3 Validated Inputs and Outputs 4.8.3 Protection against Supply-chain Malware 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events 4.10.5 Disaster Recovery Plans
KEC8	Replaced by HCK6
KEC9	Process	Loss of Withdrawal Keys (Operational Failure)	Loss of Withdrawal Keys (Operational Failure)	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.3.3 Signature Management 4.4.1 Controlled and Audited Secret Access 4.4.3 Cold Storage 4.4.4 Key Management 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.7.3 Validated Inputs and Outputs 4.8.3 Protection against Supply-chain Malware 4.8.6 Process Automation 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events 4.10.5 Disaster Recovery Plans
KEC10	Replaced by HCK1, HCK2
KEC11	Replaced by HCK4

ID	Risk Group	Risk Vectors	Risk Vector Description	Relevant Mitigations
HCK1 (replaces SLS8, KEC2, KEC3, KEC10)	People	Malicious Internal Employee intentionally causes operational failure with appropriate user rights	Anything that an internal employee has access to is at risk of being exploited to sabotage the operation resulting in a slashing incident.	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.1 Identified Individuals 4.4.1 Controlled and Audited Secret Access 4.4.2 Encrypted Data 4.4.4 Key Management 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.5.1 Least Privilege 4.5.2 Employee Authorization Management 4.5.4 Authentication Policies 4.6.1 Managed Physical Access 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.7.3 Validated Inputs and Outputs 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.3 Security and Compliance 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events
HCK2 (replaces SLS9, DOW16, KEC2, KEC3, KEC10, GIR2)	People	Malicious Internal Employee intentionally causes operational failure via privilege escalation	A malicious internal employee can get additional rights via privilege escalation.	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.1 Identified Individuals 4.4.1 Controlled and Audited Secret Access 4.4.2 Encrypted Data 4.4.4 Key Management 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.5.1 Least Privilege 4.5.2 Employee Authorization Management 4.5.4 Authentication Policies 4.6.1 Managed Physical Access 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.7.3 Validated Inputs and Outputs 4.8.1 Avoid Customizing Third-party Software 4.8.2 Configuration Management 4.8.3 Protection against Supply-chain Malware 4.8.5 Containerized and Orchestrated Environments 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.3 Security and Compliance 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events
HCK3 (replaces SLS10, DOW17, GIR2, GIR5)	People	Malicious Ex-Employee intentionally causes an operational failure	A former employee whose access is not blocked or removed	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.1 Identified Individuals 4.3.3 Signature Management 4.4.4 Key Management 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.5.1 Least Privilege 4.5.2 Employee Authorization Management 4.6.1 Managed Physical Access 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.7.3 Validated Inputs and Outputs 4.8.1 Avoid Customizing Third-party Software 4.8.2 Configuration Management 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.3 Security and Compliance 4.9.4 Upgrades 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events
HCK4 (replaces SLS11, SLS12, SLS13, DOW18, KEC4, KEC5, KEC11, GIR1)	People	Malicious External Hacker intentionally causes operational failure	Malicious External Hacker gets system access through absence of or weak cyber security standards	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.1 Controlled and Audited Secret Access 4.4.2 Encrypted Data 4.4.4 Key Management 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.5.1 Least Privilege 4.5.2 Employee Authorization Management 4.5.3 Managed Network Access to Nodes 4.5.4 Authentication Policies 4.6.1 Managed Physical Access 4.6.5 Manage Equipment Lifecycle 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.7.3 Validated Inputs and Outputs 4.8.3 Protection against Supply-chain Malware 4.9.1 Blockchain 4.9.2 Node, System and Network Health 4.9.3 Security and Compliance 4.9.4 Upgrades 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events 4.10.7 Incident Communication
HCK5 (replaces GIR8)	Process	No Input validation	Attacks induce buffer overflow, DoS, code injection, etc.	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.5.4 Authentication Policies 4.6.1 Managed Physical Access 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.7.3 Validated Inputs and Outputs 4.9.2 Node, System and Network Health 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events
HCK6 (replaces KEC8)	Infrastructure	Failure to protect infrastructure against physical access	Someone who gains physical access to a server can have access to locally exposed ports and can access the software API	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.1 Controlled and Audited Secret Access 4.4.2 Encrypted Data 4.4.4 Key Management 4.5.1 Least Privilege 4.5.4 Authentication Policies 4.6.1 Managed Physical Access 4.6.2 Physically Distributed Infrastructure 4.6.5 Manage Equipment Lifecycle 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.2 Configuration Management 4.8.3 Protection against Supply-chain Malware 4.9.3 Security and Compliance 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events 4.10.5 Disaster Recovery Plans
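
Because several risk IDs in these tables have been replaced, merged, or removed, tooling that references an old ID needs a way to find its current equivalent. The following Python sketch transcribes the "Replaced by", "merged into", and "Removed" rows of this document into a lookup table. The names `REPLACED_BY` and `resolve` are illustrative only and are not defined by this specification.

```python
# Retired risk IDs mapped to their successors, as listed in the tables
# of this document. An empty list means the risk was removed outright.
REPLACED_BY = {
    "SLS8": ["HCK1"],
    "SLS9": ["HCK2"],
    "SLS10": ["HCK3"],
    "SLS11": ["HCK4"],
    "SLS12": ["HCK4"],
    "SLS13": ["HCK4"],
    "SLS16": [],
    "DOW8": ["DOW9"],
    "DOW16": ["HCK2"],
    "DOW17": ["HCK3"],
    "DOW18": ["HCK4"],
    "KEC2": ["HCK1", "HCK2"],
    "KEC3": ["HCK1", "HCK2"],
    "KEC4": ["HCK4"],
    "KEC5": ["HCK4"],
    "KEC8": ["HCK6"],
    "KEC10": ["HCK1", "HCK2"],
    "KEC11": ["HCK4"],
    "GIR1": ["HCK4"],
    "GIR2": ["HCK2", "HCK3"],
    "GIR5": ["HCK3"],
    "GIR8": ["HCK5"],
}


def resolve(risk_id: str) -> list[str]:
    """Return the current risk IDs for `risk_id`.

    IDs that are not listed as retired are treated as current and map to
    themselves. This sketch does not check that the ID exists at all.
    """
    return list(REPLACED_BY.get(risk_id, [risk_id]))


def replaces(current_id: str) -> list[str]:
    """Return the retired IDs whose successors include `current_id`."""
    return [old for old, new in REPLACED_BY.items() if current_id in new]
```

For example, `resolve("KEC2")` yields `["HCK1", "HCK2"]`, and `replaces("HCK5")` yields `["GIR8"]`, matching the "replaces" note on the HCK5 row.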

ID	Risk Group	Risk Vectors	Risk Vector Description	Relevant Mitigations
GIR1	Replaced by HCK4
GIR2	Replaced by HCK2, HCK3
GIR3	Infrastructure	Fix versions on every deploy	Downtime when a system that only needs a restart accidentally pulls the newest version	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.8.1 Avoid Customizing Third-party Software 4.8.2 Configuration Management
GIR4	Process	Insufficient monitoring/logging	- Inability to learn from incidents - Late detection of incidents - insufficient automation to react to incidents	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.5.2 Employee Authorization Management 4.8.2 Configuration Management 4.9.1 Blockchain
GIR5	Replaced by HCK3
GIR6	Process	No password rotation	- Leak of passwords - brute force	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.4.4 Key Management 4.5.4 Authentication Policies 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.9.3 Security and Compliance 4.10.4 Analyzing Security Events
GIR7	Process	Use of direct auth	Authentication information does not expire timely and can be used later.	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.3 Signature Management 4.4.4 Key Management 4.5.1 Least Privilege 4.5.2 Employee Authorization Management 4.5.4 Authentication Policies 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.9.2 Node, System and Network Health 4.10.4 Analyzing Security Events
GIR8	Replaced by HCK5
GIR9	Infrastructure	Failure to properly perform network segmentation	Having containers or nodes accessible from any IP address increases the attack surface enormously	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.5.1 Least Privilege 4.5.3 Managed Network Access to Nodes 4.5.4 Authentication Policies 4.6.1 Managed Physical Access 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.9.3 Security and Compliance
GIR10	Infrastructure	Lack of encrypted traffic between services and deployment scripts	Anyone on the network can sniff packets that include secrets, and may be able to steal passwords and tokens in this way	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.2 Encrypted Data 4.4.3 Cold Storage 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code
GIR11	Infrastructure	No separate tests and staging environments	Improper change management and testing of software updates "in production"	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.7.2 Comprehensive Testing for Changes to Code 4.8.4 Deployment testing environments 4.9.4 Upgrades
GIR13	Infrastructure	High Blast radius of software bug in overall system	A small error affects the whole system and all clients right away, instead of being caught early with limited effect on the whole system.	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.6 Deletion protection 4.5.2 Employee Authorization Management 4.5.4 Authentication Policies 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.4 Deployment testing environments 4.8.5 Containerized and Orchestrated Environments 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events
GIR14	Infrastructure	Low Infrastructure provider security	Attacks through the APIs of the infrastructure provider	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.4 Key Management 4.9.3 Security and Compliance 4.9.4 Upgrades 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events
GIR15	Infrastructure	CVE Monitoring	Attack on the system suddenly possible once published	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.5 Operational Information Management 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.3 Protection against Supply-chain Malware 4.9.3 Security and Compliance 4.10.3 Identifying and Responding to Security Incidents 4.10.4 Analyzing Security Events 4.10.5 Disaster Recovery Plans
GIR16	People	Human error	Anything a human can touch can go wrong	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.2 Training 4.3.1 Update Third-party Software 4.4.4 Key Management 4.4.5 Operational Information Management 4.5.1 Least Privilege 4.8.1 Avoid Customizing Third-party Software 4.8.6 Process Automation 4.10.5 Disaster Recovery Plans
GIR17	Process	Use of non-hardened images	Attack on the system using the weakest link of a given node/container	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.2 Encrypted Data 4.4.3 Cold Storage 4.5.3 Managed Network Access to Nodes 4.5.4 Authentication Policies 4.6.1 Managed Physical Access 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.3 Protection against Supply-chain Malware 4.9.3 Security and Compliance
GIR18	Process	Insufficient change management mechanisms in place	- Downtime on update - Slow down in reaction time to incident	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.4.4 Key Management 4.4.5 Operational Information Management 4.7.2 Comprehensive Testing for Changes to Code 4.8.4 Deployment testing environments 4.8.6 Process Automation 4.9.4 Upgrades
GIR19	Process	Lack of automation for deployment	- Downtime on update - Slow down in reaction time to incident	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.5 Operational Information Management 4.8.6 Process Automation 4.9.4 Upgrades 4.10.5 Disaster Recovery Plans
GIR20	Process	Lack of testing (software and infrastructure)	- Downtime on update - Slow down in reaction time to incident	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.4.5 Operational Information Management 4.8.4 Deployment testing environments 4.8.6 Process Automation 4.9.4 Upgrades
GIR21	Process	Lack of enforced code review	- Downtime on update - Slow down in reaction time to incident	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.4.5 Operational Information Management 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code 4.8.4 Deployment testing environments 4.8.6 Process Automation 4.9.4 Upgrades 4.10.5 Disaster Recovery Plans
GIR22	Process	Lack of Security training (password hygiene, phishing attacks, ...)	Employees spill secrets	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.2 Training 4.3.1 Update Third-party Software 4.4.5 Operational Information Management 4.5.1 Least Privilege 4.5.4 Authentication Policies 4.9.3 Security and Compliance 4.10.3 Identifying and Responding to Security Incidents 4.10.5 Disaster Recovery Plans
GIR23	Process	Make-shift container orchestration procedures	Failure when, e.g., a failover actually needs to be performed	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.7.2 Comprehensive Testing for Changes to Code 4.8.5 Containerized and Orchestrated Environments
GIR24	Software	Third party software and vendors	Suboptimal third-party software practices	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.7.2 Comprehensive Testing for Changes to Code
GIR25	People	Centralized knowledge	If the infrastructure knowledge is not shared across the team, this could lead to a heavy dependency on a single person	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.2 Training 4.3.1 Update Third-party Software 4.4.1 Controlled and Audited Secret Access 4.4.5 Operational Information Management 4.5.1 Least Privilege 4.5.2 Employee Authorization Management 4.5.4 Authentication Policies 4.8.6 Process Automation 4.10.5 Disaster Recovery Plans

ID	Risk Group	Risk Vectors	Risk Vector Description	Relevant Mitigations
SPS0	Counterparty	General Counterparty Risk	Whenever a service is provided by a third party, the relevant risks are run by the third party, but in most cases at least some and often the bulk of the consequences for a failure will be borne by the node operator.	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.1 Identified Individuals 4.4.5 Operational Information Management 4.5.1 Least Privilege 4.5.2 Employee Authorization Management 4.5.4 Authentication Policies 4.8.3 Protection against Supply-chain Malware 4.9.5 Doppelgänger Protection 4.10.1 Stakeholder Communication Management 4.10.3 Identifying and Responding to Security Incidents 4.10.7 Incident Communication
SPS1	Process	Exit Risk - Delinquent state	No new stake will be allocated to the Node Operator (happens automatically); the daily rewards sent to the Node Operator will be halved, with the remaining half sent towards that day’s rebase (happens automatically); and reduced rewards will continue for the duration of a cooldown period long enough to determine whether, immediately after service restoration by the Node Operator, subsequently received validator exit requests are processed in a timely manner.	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.3.5 Delinquent State 4.4.5 Operational Information Management 4.7.1 Secure Development Lifecycle 4.7.2 Comprehensive Testing for Changes to Code

ID	Risk Group	Risk Vectors	Risk Vector Description	Relevant Mitigations
RER1	Process	Mismanagement during incident	Reputation damage due to mismanagement of slashing, downtime or access loss to keys	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.4.2 Encrypted Data 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.8.1 Avoid Customizing Third-party Software 4.10.1 Stakeholder Communication Management 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.7 Incident Communication
RER2	People	Negative appearance in public	Damage to reputation due to bad behavior in public	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.2.1 Identified Individuals 4.2.2 Training 4.3.1 Update Third-party Software 4.4.2 Encrypted Data 4.10.7 Incident Communication
RER3	Process	Mismanagement of Post-Incident	Reputation damage due to mismanagement after slashing, downtime, or loss of access to keys	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.4.5 Operational Information Management 4.4.6 Deletion protection 4.10.1 Stakeholder Communication Management 4.10.2 Incident Response Plans 4.10.3 Identifying and Responding to Security Incidents 4.10.7 Incident Communication
RER4	Infrastructure	Withdrawal	Staking withdrawal requests cannot be met efficiently, leading to delays in payment processing causing reputational loss	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.8.1 Avoid Customizing Third-party Software
RER5	Process	Poor Communication	Poor reputation or reputational damage due to insufficient operational communication and overall transparency	4.1.1 Assessing risks 4.1.2 Assessing Financial Impact 4.1.3 Assessing Incident Probability 4.3.1 Update Third-party Software 4.4.5 Operational Information Management 4.10.3 Identifying and Responding to Security Incidents

This Mitigation Strategies section serves as a go-to resource for node operators, providing actionable insights and mitigation options to enhance the security, reliability, and efficiency of their operations.

Most of the best practices that optimize up-time, access control and general stability directly apply to operating a node properly. However, for some risks specific to node operation, high levels of process segregation need to be achieved for mitigation to be effective.

A core principle for mitigating risks is to actively identify and manage the risks. This means understanding the particular risks, the likelihood of something going wrong, and the likely impact if that does occur. That information enables a Node Operator to decide what level of risk is reasonable and how to prioritise available resources to mitigate risk.

Risk management decisions need to take into account any regulation that obliges a Node Operator to meet specific benchmarks or implement specific mitigation strategies or other activities.

A first step for effective risk management is to document the potential risks, as well as the tools and processes currently in place to address those risks.

Documentation needs to include an assessment of the relevant risks, an explanation of what level of risk is acceptable and why, and how each process or infrastructure component contributes to and protects against risks.

This enables Node Operators to identify activities that are not contributing to the business, or that actually increase the potential risks they face. The accuracy, availability and completeness of this information are crucial.

A common industry approach to assessing risks is to consider the probability of an event occurring and the likely impact of that event.

If the probability is expressed on a linear numerical scale (e.g. between 0 and 1) and the overall financial impact is approximated, the two can be multiplied to provide a simple initial ranking for the priority of mitigating each risk.

Since the cost of risk mitigation varies considerably, the overall priority for addressing risk, or deciding that a given level of risk is acceptable, generally depends on comparing the risk ranking with the cost of mitigation, and available resources.
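As an illustration only, the ranking described above can be sketched as follows. The risk identifiers, probabilities, impact figures and mitigation costs are hypothetical placeholders, and the comparison of expected loss against mitigation cost is a deliberately crude first filter rather than a complete decision procedure.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    risk_id: str            # hypothetical identifier, not one from this specification
    probability: float      # estimated likelihood over the period, between 0 and 1
    impact: float           # approximate overall financial impact
    mitigation_cost: float  # estimated cost of mitigating the risk

    @property
    def expected_loss(self) -> float:
        # Simple initial ranking value: probability multiplied by impact.
        return self.probability * self.impact


def rank_risks(risks):
    """Return risks sorted by expected loss, highest first."""
    return sorted(risks, key=lambda r: r.expected_loss, reverse=True)


def worth_mitigating(risk):
    """Crude filter: mitigation costs less than the expected loss."""
    return risk.expected_loss > risk.mitigation_cost


# Hypothetical example figures.
risks = [
    Risk("EX1", probability=0.05, impact=200_000, mitigation_cost=4_000),
    Risk("EX2", probability=0.50, impact=3_000, mitigation_cost=5_000),
    Risk("EX3", probability=0.01, impact=2_000_000, mitigation_cost=15_000),
]

for risk in rank_risks(risks):
    print(risk.risk_id, round(risk.expected_loss), worth_mitigating(risk))
```

In practice the available resources and any regulatory obligations also affect the final priority, so a ranking like this is a starting point for discussion.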

Identify relevant staff and others responsible for identifying, assessing, and determining how to manage risks.
Ensure that every service, where possible, is configuration hardened. Common benchmarks such as CIS provide helpful guidance.
Analyze each infrastructure component's security, availability, processing integrity, confidentiality and privacy.
Create and continuously analyze a Software Bill of Materials [SBOM].

All risks

There are a number of factors to take into account when assessing the overall financial impact of a given risk, with the direct cost incurred as the most obvious. It is important to understand the time required to mitigate the impact of an event, and the cost that will be incurred over that time.

An incident can incur a variety of costs: employee time spent managing the incident, communication, and follow-up; new measures implemented to limit concrete or reputational damage, such as replacement or additional infrastructure; and potential compensation or legal costs.

It is also useful to consider opportunity costs such as competitors taking advantage of an incident to promote themselves as a better alternative.

Tools to support assessing financial impact

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

Validator Penalty Simulator

All risks

Predicting the likelihood of an unexpected future event is generally difficult, and results are unlikely to precisely match the predictions. Nevertheless it is important to consider the context of a specific operation and attempt useful predictions.

Analyzing historical data to understand past trends and incidents (external, internal incidents, and near-miss incidents)
Reviewing industry reports for insights into common risks and their fiscal consequences in similar scenarios
Consulting with experts in the field to gain a comprehensive perspective on risk probabilities and impacts
Using risk assessment tools or software for a more data-driven analysis

All risks

Unless a validator system is immutable and fully automated, there will be people involved in managing it. It is therefore important that appropriate management of people is part of managing the validator node. This has an impact in various areas, from mitigating the risk of hacking by unknown parties with access to privileged roles, to the ability to provide timely incident response and minimize the damage caused by a security incident.

As well as the Controls for People Management, some relevant controls are grouped with other areas, such as

Limit Physical Access
Minimize Authorization
Log Personnel Information

It is important to identify individuals who have access to and can control aspects of the operations of the Validator node. While a globalized workforce can provide multiple benefits, it is difficult to hold an anonymous individual accountable. This fact is repeatedly used by large-scale hacking operations to infiltrate valuable targets with a goal of eventually using access granted willingly to rob, damage, or destroy the target.

FIN1
KEC7
HCK1, HCK2, HCK3
SPS0
RER2

It is important that individuals whose actions influence the Validator node have appropriate skills. As the ecosystem evolves, training helps maintain a relevant skillset.

As well as themes specific to the individuals' tasks and Node Operator internal policies (such as this document), there are a number of areas where up-to-date skills matter, such as:

Security practices, including protection from social engineering attacks such as phishing
Relevant regulatory requirements, a broad topic possibly including privacy, anti-bribery, conflict of interest, and more

FIN1, FIN7
SLS17
DOW21
GIR16, GIR22, GIR25
RER2

In a nutshell: technology needs to serve the business goal, not the other way around.

To ensure this happens, it is important to consider both the business goals and the available technology, and then use appropriate technology to meet those goals.

Updates to software components provided by third parties often address newly discovered or longstanding vulnerabilities. It is a best practice to update software regularly. However, it is important to check for vulnerabilities that an upgrade can introduce as part of a supply-chain attack, and to verify that any customization of open-source software, specific configuration options, and other software used by the node operator are all compatible with an update and do not create new vulnerabilities on updating.

"version-pinning"
actively managing dependencies
testing updates before automatically deploying them
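A minimal sketch of how version pinning might be checked before a deployment is shown below. The package and image names are hypothetical, and the two helper functions are illustrative assumptions rather than part of any particular tool. Note that a container tag can be moved by its publisher, so a digest reference is a stronger pin than a tag.

```python
import re

# Exact "name==version" requirement lines count as pinned in this sketch.
PINNED_REQUIREMENT = re.compile(r"^[A-Za-z0-9_.\-]+==[A-Za-z0-9_.\-+!]+$")


def is_pinned_requirement(line: str) -> bool:
    """True if a requirement line names one exact version."""
    return bool(PINNED_REQUIREMENT.match(line.strip()))


def is_pinned_image(ref: str) -> bool:
    """True if an image reference has a digest or an explicit non-'latest' tag."""
    if "@sha256:" in ref:
        return True
    _, sep, tag = ref.rpartition(":")
    # A colon followed by a "/" belongs to a registry port, not to a tag.
    return bool(sep) and "/" not in tag and tag != "latest"


# Hypothetical dependency and image references.
requirements = ["example-lib==1.2.3", "example-lib>=1.0", "example-lib"]
images = ["example/client:1.4.2", "example/client:latest", "example/client"]

unpinned = [r for r in requirements if not is_pinned_requirement(r)]
unpinned += [i for i in images if not is_pinned_image(i)]
print("Unpinned references:", unpinned)
```

A check of this kind could run in a deployment pipeline so that a restart cannot silently pull a newer, untested version.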

Controls for Development and Update

FIN1
SLS1, SLS2, SLS3
DOW4, DOW12, DOW21
GIR4, GIR6, GIR16, GIR18, GIR21, GIR22, GIR25
KEC6, KEC9
SPS1
RER1, RER2, RER3, RER4, RER5

To avoid double signing, validators can maintain a history of messages they signed. This data is crucial, as inconsistencies can cause a double-signing event. The data needs to be reliably persistent, and properly connected to the systems that use it.

A common format for anti-slashing data is defined by Slashing Protection Interchange Format.
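The following sketch illustrates the idea of checking a signing history before producing a new signature. It is an in-memory illustration under simplifying assumptions: the class and method names are hypothetical, the rules are a conservative simplification of Ethereum-style double-signing and surround-vote conditions, and a real anti-slashing database needs to persist each record reliably before the signature is released.

```python
class SigningHistory:
    """Hypothetical in-memory history of signed blocks and attestations."""

    def __init__(self):
        self.highest_block_slot = None
        self.attestations = []  # list of (source_epoch, target_epoch)

    def sign_block(self, slot: int) -> bool:
        """Record and allow a block signature only for a strictly newer slot."""
        if self.highest_block_slot is not None and slot <= self.highest_block_slot:
            return False
        self.highest_block_slot = slot
        return True

    def sign_attestation(self, source: int, target: int) -> bool:
        """Record and allow an attestation only if it conflicts with no earlier one."""
        for prev_source, prev_target in self.attestations:
            if target == prev_target:
                return False  # same target epoch: possible double vote
            if source < prev_source and prev_target < target:
                return False  # new vote would surround an earlier vote
            if prev_source < source and target < prev_target:
                return False  # new vote would be surrounded by an earlier vote
        self.attestations.append((source, target))
        return True


history = SigningHistory()
print(history.sign_block(10))           # first block at slot 10 is allowed
print(history.sign_block(10))           # a second block at slot 10 is refused
print(history.sign_attestation(2, 3))   # first attestation is allowed
print(history.sign_attestation(1, 4))   # refused: it surrounds the vote (2, 3)
```

The checks are deliberately stricter than necessary (for example, re-signing an identical message is also refused), which trades a small amount of availability for a lower chance of a slashable signature.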

Tools to support anti-slashing databases

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

SLS1, SLS2, SLS3, SLS4, SLS17, SLS18, SLS19

Tools that manage signatures for transactions generally provide a workflow that includes passive and active protection against a variety of risks. Using these tools helps minimize the chances that a signature is given without checking what is being signed, and that risk-bearing transactions require appropriate authorization.

Properly configured signature management tools also provide the ability to recover, or mitigate any problems, in the case where a transaction was not completed.

As well as the use of various kinds of "multi-sig", which can include simple requirements for multiple signatures or incorporate techniques such as multi-party computation ("MPC"), signature management tools can include automated verification steps in the process of authorizing a transaction.

Tools to support signature management

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

FIN3, FIN4
SLS1, SLS2, SLS3, SLS4, SLS5, SLS14, SLS15
KEC1, KEC6, KEC9
HCK3
GIR7

A diverse set of clients for different protocols can reduce "blast radius" in a case where one client has a protocol error or other bug. This can be especially important if the bug causes a chain split. A common scenario is when an upgrade introduces a problem. The ability to migrate relevant keys to a different client, if a specific client error is observed, provides an important layer of protection. In addition, maintaining client diversity helps ensure that the network as a whole does so, ideally providing real protection against a vulnerability present in a single version of a single client by ensuring that particular version does not dominate the network.

Note that there is often a different range of clients available at different levels of the infrastructure. For example in Ethereum, it is possible to run different clients on each of the Execution and Consensus layers.

Running multiple Execution and Consensus clients. See also Ethereum Merge: Run the majority client at your own peril! [ETHdiverse]

SLS6, SLS7, SLS20
DOW2, DOW19, DOW21

Node operators need to withdraw validators correctly, as they can otherwise be put into a delinquent state. This can result in direct penalties, or an opportunity cost realized as monetary losses.

SPS1

Information management can mitigate many risks. One aspect is the management of highly confidential information, such as the management of signing keys or withdrawal keys, but it is also important to manage operational information.

Best practice for credential management is to use a Single Sign-On system that gives users authorized access to secrets through, e.g., certificates and/or vault mechanisms.

In this way, access to secrets can be audited, and anomaly detection can be activated for those vaults.

Using multi-sig wallets that require authorization from multiple parties for specific actions helps to ensure both that relevant access is monitored and that it is correctly controlled.

FIN1
SLS5
KEC1, KEC6, KEC9
HCK1, HCK2, HCK4, HCK6
GIR25

Many different components interplay while a staking operation is going on. If confidential information is not protected by encryption, it can be intercepted and read during transmission. There is also a risk of accidental or malicious leaking of stored information, which can be somewhat mitigated if that information is stored in encrypted form.

It is therefore crucial to ensure that confidential data is only stored and transmitted in an encrypted state.

KEC1, KEC6
HCK1, HCK2, HCK4, HCK6
GIR10, GIR17
RER1, RER2

Cold Storage, in particular "air-gapped" storage, can help protect information not used often such as withdrawal keys, private key generation materials, and the like, by making it more difficult for malicious entities to access the information and by reducing the chance that it will be leaked in the event of accidentally publishing data.

KEC1, KEC6, KEC9
GIR10, GIR17

Operating a node normally entails the use of a range of keys, such as

Keys used by signature management tools
A vault
SSH keys
API keys for cloud infrastructure

It is important to protect private keys from accidental or malicious misuse, and in particular unplanned deletion. It is not normal to provide broad access to unencrypted signing keys.

following relevant standards such as CCSS v9.0 Table and BSSC Key Management Standard version 1,
ensuring that there are no single individuals with the capability to access or delete them,
having backups with strong access control,
actively managing access to keys and key material, and
"key rotation", i.e. periodic changes of keys as well as rapid managed changes if a data breach occurs.
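As one illustration of periodic key rotation, the sketch below lists keys that are older than a chosen maximum age. The key names, creation dates and the 90-day limit are hypothetical assumptions; an appropriate rotation period depends on the key type and on the operator's own policy.

```python
from datetime import datetime, timedelta, timezone


def keys_due_for_rotation(created_at_by_key, max_age, now=None):
    """Return the names of keys older than max_age, sorted alphabetically."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, created_at in created_at_by_key.items()
        if now - created_at > max_age
    )


# Hypothetical inventory of keys and their creation times.
inventory = {
    "ssh-deploy": datetime(2024, 5, 20, tzinfo=timezone.utc),
    "cloud-api": datetime(2023, 12, 1, tzinfo=timezone.utc),
    "vault-unseal": datetime(2024, 1, 15, tzinfo=timezone.utc),
}

check_time = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(keys_due_for_rotation(inventory, timedelta(days=90), now=check_time))
```

A report like this only flags overdue keys. The rotation itself, including rapid rotation after a suspected breach, still needs a managed procedure.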

Modern vault systems enable the enforcement of policies to ensure that access to keys is only available with verified roles, and deletion is managed according to established protocols.

Tools to support key management

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

FIN1
SLS1, SLS3, SLS5
KEC1, KEC6, KEC7, KEC9
HCK1, HCK2, HCK3, HCK4, HCK6
GIR6, GIR7, GIR14, GIR16, GIR18

Node operators are likely to rely on a wide range of operational information, including internal procedures, understanding software configurations, plans for future development, and employee management.

Best practice includes ensuring there is no single point of failure due to centralized information being held by a single external provider or only being known to a single employee.

Documentation, even if rarely actively read by those responsible for operations (who presumably know their job), is important for many reasons including

to enable onboarding new employees and service partners, or helping employees take on new roles
to ensure smooth continued operation in the case that a key employee's role changes, particularly where they leave the organization
to enable accurate reporting as necessary
to enable monitoring of operations and investigation of security incidents and other failures

FIN2, FIN3, FIN4, FIN6, FIN7
SLS3, SLS4, SLS5, SLS14
DOW1, DOW2, DOW3, DOW4, DOW12, DOW13, DOW21
KEC1, KEC6, KEC7, KEC9
HCK1, HCK2, HCK3, HCK4
GIR4, GIR15, GIR16, GIR18, GIR19, GIR20, GIR21, GIR22, GIR25
SPS0, SPS1
RER1, RER3, RER5

Loss of important information, especially loss of control over keys, can have a crippling impact. It is important to have mechanisms to protect against, and recover from, unintentional or malicious deletion of important data.

Best practice includes having journaled backups of important information.

FIN6
SLS4
DOW1, DOW2, DOW3, DOW4, DOW5, DOW6, DOW12, DOW13, DOW14, DOW20
KEC6, KEC9
HCK1, HCK2, HCK3, HCK4
GIR4, GIR13
RER1, RER3

Access Control covers physical access to devices and facilities, the ability to connect to servers through networks, and the ability to perform specific tasks, such as getting answers to requests.

The core principle to follow in granting authorization is Least Privilege. This is generally achieved by using some form of Role-Based Access Control, in combination with an inventory of assets and services, to ensure that only those who need access are granted that access, and that it is revoked as soon as appropriate.

Tracking this information is important to ensure that access can be audited and verified.

Tools to support access control

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

FIN1
DOW7, DOW16
GIR1, GIR7, GIR9, GIR16, GIR22
HCK1, HCK2, HCK3, HCK4
KEC2, KEC4
SPS0

The core of Least Privilege is that access is only granted to those who need it, and only for as long as it is relevant. This means that an individual user's privileges are likely to change over time, and in particular any offboarding process includes a rapid revocation of the user's assigned roles.

Almost all Least Privilege implementation is managed through Role-based Access Control (commonly known as "RBAC"), where a set of roles are defined according to the tasks they need to perform, and access rights are based on holding a particular role, with individual users assigned relevant roles that are revoked or deliberately renewed on a timely basis. It is important to ensure that individuals can fulfil their designated tasks, without having authorizations they do not need.

A Single Sign-On mechanism that allows rapid assigning and revoking of roles
Authentication tokens that have a limited lifetime
Regular review of roles and permissions for both users and software
Disable privilege escalation mechanisms (e.g. executing as root user in Docker, docker exec -uroot, or impersonation in Keycloak)
Use of roles on the API endpoint level to determine the correct authorization.
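The combination of role-based access, limited lifetimes and rapid revocation listed above can be sketched as follows. The role names, permission strings and the AccessControl class are hypothetical and exist only for illustration; production systems would normally rely on an established identity and access management product.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical roles and the permissions each role carries.
ROLE_PERMISSIONS = {
    "operator": {"node:restart", "node:read-logs"},
    "auditor": {"node:read-logs"},
}


class AccessControl:
    """Role grants that expire unless they are deliberately renewed."""

    def __init__(self):
        self._grants = {}  # (user, role) -> expiry time

    def grant(self, user, role, lifetime, now):
        self._grants[(user, role)] = now + lifetime

    def revoke_all(self, user):
        """Remove every role held by a user, e.g. during offboarding."""
        for key in [k for k in self._grants if k[0] == user]:
            del self._grants[key]

    def is_allowed(self, user, permission, now):
        return any(
            holder == user
            and now < expiry
            and permission in ROLE_PERMISSIONS.get(role, set())
            for (holder, role), expiry in self._grants.items()
        )


start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
acl = AccessControl()
acl.grant("alice", "operator", timedelta(hours=8), now=start)
print(acl.is_allowed("alice", "node:restart", now=start + timedelta(hours=1)))
print(acl.is_allowed("alice", "node:restart", now=start + timedelta(hours=9)))
```

Because every grant carries an expiry, a forgotten role lapses on its own, and `revoke_all` covers the case where access needs to end immediately.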

Tools to support least privilege control

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

FIN1
GIR7, GIR9, GIR16, GIR22, GIR25
KEC7
HCK1, HCK2, HCK3, HCK4, HCK6
SPS0

Ensuring that employees whose roles have changed do not have lingering credentials reduces the risk they or others can misuse those credentials to cause harm.

ensure authorization changes are automated as part of management of employee lifecycles, covering role changes as well as termination, transfer, and promotion procedures

FIN1
DOW4
HCK1, HCK2, HCK3, HCK4
GIR4, GIR7, GIR13, GIR25
SPS0

Following the principles of defense in depth and least privilege, it is important that nodes are not directly accessible without permission, and that they do not leak information to the Web that can help malicious parties gain unauthorized access.

An internal virtual private network with only well-defined endpoints accessible from the web
A load-balancer that has a firewall
Disable meta-data serving through public endpoints (e.g. port scans, or what server is running in what version)
Limits on outbound traffic of a node that runs a certain service
Rate limits to ensure that internal services cannot unintentionally DDoS each other
Require explicit authorization of external access capability
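One common way to implement the rate limits mentioned above is a token bucket, sketched below. The class name and parameters are illustrative assumptions, and the clock is passed in explicitly so that the behaviour is easy to test. In many deployments this function is provided by a proxy or load balancer rather than by application code.

```python
class TokenBucket:
    """Allow short bursts up to `capacity`, then a steady `rate` of requests per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        """Refill according to elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=1.0, capacity=2.0)
print([bucket.allow(now=0.0) for _ in range(3)])  # burst of two allowed, third refused
print(bucket.allow(now=1.0))                      # one token refilled after a second
```

Each internal caller can be given its own bucket, so that one misbehaving service exhausts only its own allowance.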

HCK4
GIR9, GIR17

Best practice is to use password and related authentication policies to ensure that access control mechanisms are sufficiently strong at every layer of the infrastructure. This can include appropriate requirements for the strength of passwords and the use of Multi-Factor Authentication as well as Multi-Sig requirements.

FIN1
DOW4
GIR6, GIR7, GIR9, GIR13, GIR17, GIR22, GIR25
KEC1, KEC7
HCK1, HCK2, HCK4, HCK5, HCK6
SPS0

Physical devices are subject to physical changes, including environmental issues such as temperature extremes that can cause damage, and utility failures such as power or internet failure.

This covers all physical devices that can access the Node, as well as all areas in which such devices are kept, whether "on-premises", distributed, hosted by a third party, or remote mobile devices such as laptops.

Best practice for managing physical access includes ensuring that authorization is only granted as necessary, following the principles of Least Privilege. Generally this means some devices are physically segregated in areas where access is restricted according to function. Note that this covers the use of devices authorized to access the networks that nodes operate on, and is particularly important for devices authorized to access management and analytical functions of nodes.

Ideally all physical access to premises and facilities is monitored, both to deter piggybacking and to determine whether the facility is subject to it. This term refers to the situation where an unauthorized entrant is allowed in by someone who has a valid authorization for themselves. In the context of remote operators' access through a computer, controlling this is particularly challenging in practice.

Piggybacking can occur inadvertently through politely holding a door for someone without checking that they have current valid authorization to enter, negligently by allowing someone to enter for a legitimate purpose despite knowing that person does not have valid authorization, or maliciously allowing someone to enter knowing that their purpose is nefarious.

In the inadvertent case, relevant mitigations include

ensuring that all those with authorization understand the necessity to enforce physical access control,
providing simple and effective ways to check authorization,
ensuring that remote access devices as far as possible are dedicated to the defined purposes (rather than allowing the use of general-purpose laptops that could be attacked when being used for a different task such as general email, or playing games).

To minimize negligently allowed access, it is important to ensure that access systems are effectively maintained and managed to ensure there is no good reason to allow an unauthorized person access. This can range from the design of onboarding systems to the effectiveness of internal management feedback systems for discovering unanticipated problems faced by operators.

Best practice includes managing physical access with systems that can efficiently enable access to authorized parties (keycards, biometric scanners), and that monitor actual access, for example through visual verification that the authorized party is the one entering.

It is important to log and audit access sufficiently frequently to detect problems - see also Monitoring.

DOW3, DOW4, DOW7, DOW9, DOW10
GIR9, GIR17
HCK1, HCK2, HCK3, HCK4, HCK5, HCK6

A single validator represents a single point of failure, that can introduce slashing or downtime risks.

[DVT] (Distributed Validator Technology) provides an approach to mitigating this problem, by distributing the keys and the hardware that runs validation, in such a way that multiple clients physically located in different places share the task of validation. Thus if a single client or small number of them fail, the overall validation is unaffected. (Note that while the Ethereum Foundation provides a specific technical specification for DVT that has been implemented, the principles can be implemented in different ways.)

Likewise, maintaining multiple validators running on separate hardware and software can increase resilience to a failure in any one platform.
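As a rough illustration of why distributing validation improves resilience, the following sketch computes the probability that at least a threshold number of nodes in a cluster are available. The assumption that nodes fail independently, the 3-of-4 threshold, and the 99% per-node availability are illustrative only, and are not taken from this specification or from any particular DVT implementation.

```python
from math import comb


def cluster_availability(n: int, threshold: int, node_availability: float) -> float:
    """Probability that at least `threshold` of `n` independent nodes are up."""
    return sum(
        comb(n, k) * node_availability**k * (1 - node_availability) ** (n - k)
        for k in range(threshold, n + 1)
    )


# Illustrative figures: one node alone versus a 3-of-4 cluster.
single = cluster_availability(1, 1, 0.99)
cluster = cluster_availability(4, 3, 0.99)
```

Under these assumptions the cluster is unavailable far less often than a single node. In practice failures are often correlated (shared software, shared utility supply), which reduces the benefit.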

SLS1
DOW1, DOW5, DOW6, DOW7, DOW9
KEC6
HCK6

To ensure that a local utility failure does not impact a validator, it is useful to have redundant systems, such as a backup power supply e.g. through local batteries or power generation, and for connectivity e.g. physical connection such as fibre-optic cable, and one or more modes of wireless connection.

The level of mitigation that is appropriate depends on the level of risk, and the costs of both failure and mitigating failure. These calculations mean economies of scale often enable larger-scale operations to be more robust than smaller ones, for a given price.
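The trade-off described above can be sketched as a simple expected-loss comparison. The function names and all figures are hypothetical, and a real assessment would consider more factors than this single inequality.

```python
def expected_annual_loss(annual_probability: float, impact: float) -> float:
    """Expected yearly loss from a failure with the given probability and cost."""
    return annual_probability * impact


def mitigation_is_worthwhile(
    p_before: float, p_after: float, impact: float, annual_cost: float
) -> bool:
    """True if the reduction in expected loss exceeds the cost of the mitigation."""
    saved = expected_annual_loss(p_before, impact) - expected_annual_loss(p_after, impact)
    return saved > annual_cost


# The same mitigation (same cost, same reduction in probability) applied to a
# larger and a smaller operation, distinguished only by the impact of a failure.
large_operation = mitigation_is_worthwhile(0.10, 0.01, impact=50_000, annual_cost=2_000)
small_operation = mitigation_is_worthwhile(0.10, 0.01, impact=10_000, annual_cost=2_000)
```

With these illustrative numbers the mitigation pays for itself only for the larger operation, which is one way economies of scale appear in practice.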

DOW6, DOW7, DOW9, DOW15

It is also important to ensure that facilities have appropriate protection from relevant environmental risks such as fire, flooding, extreme wind, as well as earthquakes and destructive physical attacks. Appropriate mitigations will depend in part on the specific location and nature of the facility, but will generally revolve around siting of facilities, their architecture, and specific measures to ensure resilience.

SLS14, SLS15
DOW1, DOW5, DOW6, DOW7, DOW9, DOW15

Monitoring can also identify specific conditions that adversely affect equipment and suggest that a lifecycle plan needs adjustment - whether writing off equipment destroyed by fire, or increasing preventive maintenance for physical access systems that are being used far in excess of expectations that drove the existing maintenance plan.

The lifecycle of equipment, most particularly node servers and computers used to access and manage them, is a determinant of overall security.

a capability to remotely pause, shut down, and wipe devices clean

DOW3, DOW6, DOW7, DOW9, DOW15, DOW20, DOW21
KEC1, KEC6
HCK4, HCK6

A secure development lifecycle helps ensure that vulnerabilities are not introduced to codebases, and subsequently deployed.

auditable version control systems
thorough testing and authorization before changes are accepted

FIN3, FIN4
SLS1, SLS2, SLS3, SLS4, SLS5, SLS6, SLS7, SLS14, SLS15, SLS17, SLS19
DOW2, DOW6, DOW10, DOW11, DOW12, DOW13
KEC7
HCK1, HCK2, HCK3, HCK4, HCK5, HCK6
GIR6, GIR7, GIR9, GIR10, GIR13, GIR15, GIR17, GIR21
SPS1

A comprehensive test suite helps ensure changes do not introduce new vulnerabilities or situations that lead to operational failures. Equally, it is important that someone other than the developer who produces code changes reviews them.

Static and Dynamic analysis is important, as well as user testing wherever changes impact user interface or user-generated content.

Measuring test coverage, and requiring new tests that are reviewed as part of code review, help ensure that coverage is sufficiently comprehensive to detect errors that can arise through later changes.

incorporating static and dynamic testing in the integration pipeline for code development.

FIN3, FIN4
SLS1, SLS2, SLS3, SLS4, SLS5, SLS6, SLS7, SLS14, SLS15, SLS17, SLS18, SLS19
DOW2, DOW6, DOW10, DOW11, DOW12, DOW13, DOW14, DOW19, DOW20
KEC7
HCK1, HCK2, HCK3, HCK4, HCK5, HCK6
GIR6, GIR7, GIR9, GIR10, GIR11, GIR13, GIR15, GIR17, GIR18, GIR21, GIR23, GIR24
SPS1

Unchecked inputs are a major vector for a range of attacks. These include

brute force authorization, or denial of service (including DDoS) attacks, often identifiable by a high rate of failing requests using inputs with minimal variation
overflow attacks, where excessive input causes a problem, generally mitigated by programming practices or overflow-safe languages
targeted efforts to inject code that executes functionality that should not be authorized, or causes an adverse system reaction including a crash
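The first pattern in the list above, a high rate of failing requests from one source, can be detected with a sliding window. The following is a minimal sketch. The class name, the thresholds, and the use of a source address as the key are assumptions for illustration, not a recommended configuration.

```python
from collections import defaultdict, deque


class FailureRateMonitor:
    """Flags a source that produces too many failed requests within a time window."""

    def __init__(self, max_failures: int = 5, window_seconds: float = 60.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # source -> timestamps of recent failures

    def record_failure(self, source: str, now: float) -> bool:
        """Record a failed request at time `now`; return True if the source exceeds the limit."""
        window = self._failures[source]
        window.append(now)
        # Discard failures that have fallen out of the window.
        while window and now - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_failures


monitor = FailureRateMonitor(max_failures=3, window_seconds=10)
burst = [monitor.record_failure("10.0.0.1", t) for t in (0, 1, 2, 3)]
later = monitor.record_failure("10.0.0.1", 100)
```

Passing the timestamp explicitly keeps the sketch deterministic. A deployed version would more likely read a monotonic clock.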

Ideally, the load balancer in front of the node filters out all traffic with payloads that cause overflow. Additionally, it is important to validate inputs against the relevant parameters, particularly where these allow a range of functionalities to be triggered.

using a data schema such as JSON schema with schema evolution techniques,
defining minimum and maximum input sizes and MIME types.
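A minimal sketch of these checks, using only the Python standard library, follows. The size limit, the accepted MIME type, and the field names in `SCHEMA` are hypothetical. A production system would more likely use a schema validation library such as those listed below.

```python
import json

MAX_BODY_BYTES = 4096                         # assumed upper bound on request size
ALLOWED_CONTENT_TYPES = {"application/json"}  # assumed accepted MIME types
# Hypothetical schema: field name -> (type, minimum length, maximum length)
SCHEMA = {
    "method": (str, 1, 64),
    "validator_id": (str, 1, 128),
}


def validate_request(content_type: str, body: bytes) -> dict:
    """Return the parsed request, or raise ValueError if any check fails."""
    if content_type not in ALLOWED_CONTENT_TYPES:
        raise ValueError("unsupported MIME type")
    if not 0 < len(body) <= MAX_BODY_BYTES:
        raise ValueError("body size out of bounds")
    try:
        data = json.loads(body)
    except (json.JSONDecodeError, UnicodeDecodeError, RecursionError) as exc:
        raise ValueError("malformed JSON") from exc
    # Reject both missing and unexpected fields.
    if not isinstance(data, dict) or set(data) != set(SCHEMA):
        raise ValueError("unexpected or missing fields")
    for name, (kind, lo, hi) in SCHEMA.items():
        value = data[name]
        if not isinstance(value, kind) or not lo <= len(value) <= hi:
            raise ValueError(f"invalid value for {name!r}")
    return data


def is_acceptable(content_type: str, body: bytes) -> bool:
    """Convenience wrapper that reports validity as a boolean."""
    try:
        validate_request(content_type, body)
    except ValueError:
        return False
    return True
```

The size and MIME type checks run before parsing, so oversized or mistyped payloads are rejected cheaply.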

Tools to support input and output validation

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

ajv
Apache Ranger
In the Apache web-server, control request sizes of different pieces of the request:
- LimitRequestBody
- LimitRequestFields
ORM systems exist for almost all programming languages and frameworks, such as
validatorjs

DOW10
KEC6, KEC7, KEC9
HCK1, HCK2, HCK3, HCK4, HCK5

Updating software is a major risk vector. Good processes for software development and managing the deployment of updates are important to mitigate some of this risk. As well as having control over the update process, it is important to have the capacity to revert to a known environment in an emergency where an update has been found to introduce unexpected problems.

Validator software, and other software validators use, is very often open source. However, customizing software can introduce errors. In addition customizations can produce incompatibilities when software is updated.

This means that any customization introduces a need for continued extra testing, in particular whenever relevant software is updated. Customization also increases the risk that test coverage is inadequate, meaning a future error will not be found in pre-deployment testing and only discovered through a failure operating in production, with attendant risks of reputational damage, direct losses, and increased cost for incident management.

SLS5, SLS7
DOW2, DOW13, DOW19, DOW20, DOW21
HCK2, HCK3
GIR3, GIR16
RER1, RER4

It is important to manage the configuration of hardware, and software. A minimal profile helps reduce possible attack surface, while minimizing, and carefully tracking, customization is important to ensure smooth and safe upgrades.

Software configuration to follow includes, among others:

Firewall configurations
Docker image setups
Container orchestration configurations
Database configurations
Webserver/Load balancer configurations

Tools to support configuration management

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

CIS benchmarks
CoGuard
Using GIT to manage configurations
Liquibase

SLS1, SLS3, SLS4, SLS5, SLS6
DOW12, DOW13, DOW21
HCK2, HCK3, HCK6
GIR3, GIR4

Protection against malware needs to be implemented on all assets and users need to exercise proper caution.

Regularly check the latest CVE entries, to cover all software tools used.
Specifically check for any announcements of vulnerabilities before upgrading any software component.

Tools to support supply-chain protection

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

Trivy

All Slashing Risks
DOW2, DOW11, DOW12, DOW14
KEC6, KEC7, KEC9
HCK2, HCK4, HCK6
GIR15, GIR17
SPS0

Use separate test and staging environments

This minimizes the potential blast radius. It is important to run any change (even an update of validator software or Web3Signer) through a test environment first to maximize the likelihood that any errors can be discovered before they impact a production environment.

Tools to support deployment testing

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

The "Blue-Green Deployment" pattern [WikipediaBG]

SLS6, SLS7, SLS14
DOW2, DOW11, DOW12, DOW13, DOW14, DOW20, DOW21
GIR11, GIR13, GIR18, GIR20, GIR21

Containerized and orchestrated environments are designed to reinforce security by automating many good practices, with mechanisms that have been widely tested in diverse environments. As tools that can be used well or badly, their best practice recommendations are important to ensure the full benefits are realized.

SLS1, SLS2, SLS3, SLS4, SLS5, SLS6, SLS7
DOW21
HCK2
GIR13, GIR23

Human error is always a risk. An automated script, whether or not invoked by a human, can help minimize inadvertent errors.

Another benefit of properly set up automation is that it can help reduce the risk of exposing secrets.
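As one illustration, an automated script can obtain a secret from its environment so that no operator types or pastes it, and can redact it from anything it logs. This is a sketch only. The variable name `SIGNER_API_TOKEN` is hypothetical, and a dedicated secret manager is usually preferable to plain environment variables.

```python
import os
from typing import Iterable, Mapping


def load_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Fetch a secret injected by the automation platform; fail loudly if absent."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"required secret {name!r} is not set")
    return value


def redact(message: str, secrets: Iterable[str]) -> str:
    """Replace any occurrence of a known secret before the message is logged."""
    for secret in secrets:
        if secret:
            message = message.replace(secret, "[REDACTED]")
    return message


# Usage with an explicit mapping standing in for the process environment.
fake_env = {"SIGNER_API_TOKEN": "s3cr3t"}
token = load_secret("SIGNER_API_TOKEN", fake_env)
log_line = redact(f"calling signer with token {token}", [token])
```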

Tools to support process automation

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

FIN3, FIN4, FIN5, FIN6
SLS1, SLS2, SLS3, SLS4, SLS17, SLS18, SLS19
DOW4, DOW19, DOW20
KEC6, KEC9
GIR16, GIR18, GIR19, GIR20, GIR21, GIR25

Monitoring is an important tool to identify risks and gain relevant data, and some requirement for it is a very common feature of compliance and security frameworks.

Monitoring takes many forms. It can be done internally, and provided as a service. The latter is especially common for monitoring the health of widely available third-party infrastructure such as blockchains, and cloud services.

Monitoring can take place throughout the ecosystem. Low-level indicators such as whether network traffic is within expected or design parameters, whether databases are being updated at expected rates, or whether server facilities are maintaining an appropriate temperature are all examples of monitoring with fairly obvious value, and where immediate remediations or further investigation is straightforward.

Monitoring access to physical infrastructure is more complex, and the resulting information about people is subject to privacy requirements, but can be a useful diagnostic tool if something goes very wrong, or if you just want to know who keeps blocking the server-room door open on warm days.

As well as monitoring in real time, logging information allows analysis to discover information that is only observable through variations (or non-variations) in specific monitored information over time.
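Both kinds of signal can be sketched with simple statistics over logged samples. The functions below are illustrative assumptions rather than a recommended detection method: the three-standard-deviation threshold and the sample counts are arbitrary choices.

```python
from statistics import mean, pstdev


def is_anomalous(history: list, latest: float, threshold: float = 3.0) -> bool:
    """True if `latest` deviates from the historical mean by more than `threshold` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


def is_stalled(history: list, min_samples: int = 3) -> bool:
    """True if a value that is expected to change has been identical for the last `min_samples` samples."""
    recent = history[-min_samples:]
    return len(recent) == min_samples and len(set(recent)) == 1
```

For example, `is_anomalous` could be applied to request rates, while `is_stalled` could be applied to a counter such as block height that is expected to keep advancing.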

Given the importance of logged information, and of privacy requirements, best practice is to have a clearly documented policy for record retention. This needs to retain enough information to enable historical analysis and comparison. Some data are best only retained in anonymized form, or stored with extra security provisions applied.
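A retention policy of this kind can be expressed as data and enforced automatically. In the sketch below the categories, retention periods, and record layout are all hypothetical. It only selects records for deletion, leaving anonymization and secure deletion to the surrounding process.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per record category.
RETENTION = {
    "access_log": timedelta(days=365),
    "debug_log": timedelta(days=30),
}


def records_to_delete(records: list, now: datetime) -> list:
    """Return the records that are older than the retention period for their category."""
    return [r for r in records if now - r["created"] > RETENTION[r["category"]]]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "category": "debug_log", "created": now - timedelta(days=45)},
    {"id": 2, "category": "access_log", "created": now - timedelta(days=45)},
]
expired_ids = [r["id"] for r in records_to_delete(records, now)]
```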

A good monitoring system provides very broad coverage, with redundancy both as an aspect that can be monitored to detect anomalies and to eliminate the risk of a single point of failure - when monitoring is compromised it can indicate a simple failure of the monitoring system, but can also mask a broader issue that the system is expected to detect.

With a good monitoring system in place providing broad coverage of operations, there needs to be a useful and targeted alerting system based on the monitoring system.

To learn that a potential problem has been identified, as soon as possible, and act on it effectively, a monitoring system needs a robust targeted alerting system. A system that overloads its watchers with alerts is likely to lead to alert fatigue, where the alerts are ignored in practice because too often they require an onerous human response when they are not identifying a real problem. Like monitoring systems in general, redundancy in alert systems is important.
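One common way to reduce alert fatigue is to suppress repeats of the same alert for a cooldown period. The following is a minimal sketch. The class name, the five-minute default, and the use of a string key to identify an alert are assumptions.

```python
class AlertThrottle:
    """Suppresses repeats of the same alert within a cooldown period."""

    def __init__(self, cooldown_seconds: float = 300.0):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # alert key -> time the alert was last sent

    def should_send(self, alert_key: str, now: float) -> bool:
        """Return True if the alert should be delivered at time `now`."""
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # same alert sent recently; suppress the repeat
        self._last_sent[alert_key] = now
        return True


throttle = AlertThrottle(cooldown_seconds=60)
decisions = [
    throttle.should_send("disk_full", 0),
    throttle.should_send("disk_full", 30),   # repeat within the cooldown
    throttle.should_send("peer_loss", 30),   # a different alert
    throttle.should_send("disk_full", 61),   # cooldown has elapsed
]
```

Suppression of this kind trades noise for the risk of hiding a worsening condition, so severity escalation is typically handled separately.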

Knowing an incident has occurred can trigger an Incident Response Plan, but if it relies on individuals, it is important to provide 24/7 response. Many attacks are deliberately targeted for times when responders are less likely to have high availability.

Alert systems can in turn drive automated emergency responses, ranging from capture of increased levels of detail, through requesting additional authorization beyond the normal requirements, to full system shutdowns.

Here again, there are important trade-offs between ensuring a highly responsive system, and one that is robust in the face of real-world variability. For example, a system that can automatically suspend multi-sig transactions unless they are authorized within a short time is not always appropriate, because it can interfere with normal operations over a high-latency network or where a number of individuals are expected to coordinate extensively, taking a significant amount of time, before authorizing a particular action.

Among many aspects of Validator Operations to monitor directly are the following:

are Slashing Events occurring on the beacon chain? To whom? How is this impacting the network?
is the Anti-Slashing Database functioning correctly?
how well are Relay Lists balancing load and availability to avoid downtime conditions?
are Chain Reorganizations occurring? Are there patterns of causes?
is the Consensus layer reaching Finality in accordance with expectations?
is MEV affecting performance or returns?
are Block Proposals, Block Height, Attestations proceeding in line with history and expectations?
are there anomalies in Sync Committees?

FIN5
All Slashing Risks
DOW1, DOW7, DOW10, DOW11, DOW12, DOW13, DOW15, DOW19
GIR4
HCK1, HCK2, HCK3, HCK4

do key operational metrics like CPU usage, memory usage, restarts, and uptime of nodes indicate Healthy Node conditions?
is Peering Connectivity normal?
are Failover Systems functional, ready to operate, and not operating unexpectedly?
are Cloud Systems functioning according to agreements?
do Cloud Service Notifications help effectively anticipate and manage expected downtime and maintenance?
are App-specific metrics within expected parameters?
are Redundant Monitors producing consistent results?
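Several of the questions above reduce to comparing collected metrics against expected ranges. The sketch below is illustrative only: the metric names, limits, and tolerance are assumptions, and real values depend on the deployment.

```python
# Hypothetical upper limits for node health metrics.
UPPER_LIMITS = {"cpu_percent": 90.0, "memory_percent": 90.0, "restarts_last_hour": 3}
MIN_PEERS = 10  # assumed minimum healthy peer count


def health_issues(metrics: dict) -> list:
    """Return a list of human-readable problems; an empty list means healthy."""
    issues = []
    for name, limit in UPPER_LIMITS.items():
        if name not in metrics:
            issues.append(f"{name}: no data")  # missing data is itself a finding
        elif metrics[name] > limit:
            issues.append(f"{name}: {metrics[name]} exceeds {limit}")
    peers = metrics.get("peer_count")
    if peers is None:
        issues.append("peer_count: no data")
    elif peers < MIN_PEERS:
        issues.append(f"peer_count: {peers} below {MIN_PEERS}")
    return issues


def monitors_agree(readings: list, tolerance: float) -> bool:
    """True if redundant monitors report the same metric within `tolerance` of each other."""
    return max(readings) - min(readings) <= tolerance
```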

FIN3, FIN4
All Slashing Risks
DOW1, DOW2, DOW3, DOW4, DOW5, DOW6, DOW7, DOW9, DOW10, DOW11, DOW12, DOW13, DOW14, DOW15
HCK1, HCK2, HCK3, HCK4, HCK5
GIR7

Monitoring for unusual patterns or spikes can help detect a security breach or an exploit in progress. In many cases, even if security is breached, secure and accurate logs are important to determine how this took place, in order to protect against recurrence. The following are among indicators of a security issue, and information that can help determine what happened.

Anomalies in Key Usage, Authorised Access, and Access Control Changes, especially in sensitive systems such as 2FA configuration, security platforms, network monitoring solutions, or VPNs.
Phishing and similar attempts to attack authorized users through social engineering.
Attacks on Firewalls and Endpoint Attacks, both for employee devices and infrastructure nodes, or directed at Bastion Nodes. These can be indicative of an attack or exploit in preparation or underway.
Relay behavior such as compliance aspects and availability metrics.
Ideally, Bug Reports and Community Discussion will not be the first source of notification about a problem, but it is important to monitor them.
Various services can monitor whether Confidential Data are available publicly, demonstrating there has been a data breach.

FIN1, FIN2, FIN5, FIN6, FIN7
DOW4, DOW10, DOW12, DOW19
HCK1, HCK2, HCK3, HCK4, HCK6
GIR6, GIR9, GIR14, GIR15, GIR17, GIR22

does the Upgrade Process, including client code source, configuration, and testnet and production deployment, work as desired, consume unexpected time, or generate errors and issues?
how does Customized Code in Testnet behave compared to the code deployed in production? This is especially relevant for network updates.
is System Configuration stable?

SLS6, SLS7
DOW2, DOW11, DOW12, DOW13, DOW19
HCK3, HCK4
GIR11, GIR14, GIR18, GIR19, GIR20, GIR21

If two validators with the same identifiers are running at the same time, it is important to shut one down as fast as possible. Most validators provide built-in mechanisms to detect doppelgangers. Other tools and techniques can also detect and act on this.
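The general idea behind such detection can be sketched as follows: before a validator starts signing, observe the network for a number of epochs and refuse to start if any of its keys already appears to be active. This is a simplified illustration and not the implementation used by any particular client. The function name and the `was_active` callback, which stands in for a query to a beacon node, are hypothetical.

```python
from typing import Callable, Sequence


def safe_to_start_signing(
    pubkeys: Sequence[str],
    observed_epochs: Sequence[int],
    was_active: Callable[[str, int], bool],
) -> bool:
    """Return False if any key was seen signing during the observation period."""
    for pubkey in pubkeys:
        for epoch in observed_epochs:
            if was_active(pubkey, epoch):
                return False  # another instance appears to be using this key
    return True


# Usage with a fake activity record standing in for beacon node data.
seen_activity = {("0xaaa", 101)}


def probe(pubkey: str, epoch: int) -> bool:
    return (pubkey, epoch) in seen_activity


conflict = safe_to_start_signing(["0xaaa", "0xbbb"], range(100, 103), probe)
clear = safe_to_start_signing(["0xbbb"], range(100, 103), probe)
```

A check of this kind delays the start of validation by the length of the observation period, which is a deliberate trade of a small amount of downtime against slashing risk.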

Tools to support Doppelgänger protection

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

Lighthouse
Prysm
Teku
Nimbus
Doppelganger protection in ssv.network
DoppelBuster
StatefulSet handling in Kubernetes

SLS1, SLS2, SLS5
DOW2, DOW10
SPS0

Tools to support Monitoring

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

Within AWS, Cognito's Userpool Addons for auditing authentications and the WAF module to filter anomalies are just examples of the range of tools available
ELK stack
ESD monitors slashing events on the Ethereum chain
Ethereum validator monitoring
Grafana - an example of an alerting setup in Grafana
MEV monitoring tool from SimplyStaking
Prometheus
Wazuh

Communication is important both during normal operations, and when an exceptional security incident occurs that could adversely affect the normal operations, or the users of a system.

There are therefore two core parts to a Node Operator's communication strategy:

Normal Operational Communication provides information about ongoing operations, to ensure confidence in and transparency of everyday operations.
Incident Communication is the collection of communications processes that occur as part of an Incident Response Plan.

Developing appropriate communication procedures relies on understanding both the communications channels an organization has or can have, and its stakeholders. The goal is to ensure those stakeholders have timely access to relevant information in a useful format.

Some key stakeholders are Anonymous Stakeholders, who might follow a Node Operator's public information channels, or operate independently, but who do not provide individual communication information to Operators.

Low stake investors
Potential investors
Communities developing technical standards
Education Providers
Corporate Regulators

Regulators of various kinds can require that Node Operators provide them with specific information, but do not necessarily communicate with Node Operators on an individual basis.

Node operators will also have Known Stakeholders, who have an identity known to the Node Operator that includes at least one direct communications channel such as messaging, email, or telephone. These typically include at least some of

High stake investors - with some of whom the Operator could also have contractual obligations
Service Partners, who might be involved in operating and managing protocols and requiring governance votes, or hosting, managing or operating infrastructure as part of the node operation setup
Media channels, platforms, and accounts covering technical and non-technical news and reports
Other Node Operators running validators on the same network
Staff such as those developing and maintaining critical node operations software
Individuals or organizations using additional services provided by Node Operators (e.g., API users, customers for white-label solutions etc.)

Stakeholders' preferences for communication channels differ. While many Known Stakeholders will have explicitly requested direct communication, it is important to have additional channels that enable Anonymous Stakeholders to follow important developments.

Broadly, communication channels can be considered two-way, enabling communication with an individual Known Stakeholder or with all of them at once, or broadcast, enabling Anonymous Stakeholders to receive important information, often while preserving their anonymity.

Additionally, some mechanisms allow for persistent information, while others are only temporary. A website can be maintained long-term or the information can be removed, information sent by email can easily be retained by the recipient in perpetuity, while information in e.g. a Slack or Telegram channel could be deleted after a matter of days or weeks.

It is also important, especially for services used for two-way communication with Known Stakeholders, to consider the security and privacy of the channels used. While channels such as Telegram or WhatsApp use encryption, in the case of the former all communication is decoded at some unknown centralized point, while in the latter large amounts of metadata are available to the service provider.

While many messaging services can behave in either manner, some such as websites are well-suited to broadcast communication and others are more suited to individual two-way communication.

As well as identifying the most appropriate channels for communication with Known Stakeholders or classes of Anonymous Stakeholders, it is important to understand what it is appropriate to communicate, and to whom. Some stakeholders will expect "close management", with direct individualized two-way communication, and very rapid reporting on incidents and important information. Others will want to know that they are informed in case of security incidents, or important regulatory changes, but prefer a lower volume of information. It is likely that different circumstances will mean that a given Stakeholder moves between "categories", with different communications strategies or procedures being more appropriate depending on specific context.

Track and categorise Known Stakeholders
Assess communication tools relevant to Anonymous Stakeholders

Tools to support stakeholder management

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

Broadcast communication tools include Websites, X (formerly Twitter), Bluesky, Facebook/Instagram
A Stakeholder Map
A Stakeholder Register Spreadsheet
CRM systems
Email
Messaging services such as Telegram, Discord, Slack, Signal, and WhatsApp

A number of jurisdictions (such as the EU, with the [GDPR]) regulate the use of information about individuals, and it is important to understand and comply with such regulations to avoid reputational, legal and financial risks.

FIN1, FIN6, FIN7
SPS0
RER1, RER3

An Incident Response Plan documents procedures for managing security incidents and events, as guidance for employees or incident responders who believe they have discovered, or are responding to, a security incident. A well-documented Incident Response Plan helps employees in a high-stress situation by providing a reminder of all important actions and considerations. To be useful, it is necessary that relevant employees know the plans exist, and how to find them.

Identify relevant participants in advance, with well-defined decision-making responsibilities
Redundancy against specific failures such as a key employee being unavailable
Clear information about how to investigate and triage incidents, including when to notify and involve particular participants and how to escalate issues to the most appropriate person or team.
Define clear procedures to follow for specific sets of circumstances. Where it is possible and appropriate, automated responses and alerting triggered by Monitoring can help ensure rapid response.
Data collection and distribution to enable effective response, external communication, and Post Mortem analysis
Identify relevant Stakeholders and define communication strategies for both internal and external communications

Tools to support incident response planning

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

FIN1, FIN6, FIN7
SLS17, SLS18, SLS19
DOW1, DOW2, DOW3, DOW4, DOW5, DOW6, DOW7, DOW9, DOW10, DOW21
HCK3
GIR13
RER1, RER3

There are several ways to identify that a security incident is taking place. Best practice is to have extensive monitoring in place, to identify anomalies early, with alerting and potentially direct reaction mechanisms. Although learning from third-party discussions is a terrible way to find out about an incident, it is still better than simply not discovering it, so monitoring channels where such discussions take place is a valuable part of an overall strategy.

FIN6
SLS17, SLS18, SLS19
All Downtime Risks
KEC6, KEC7, KEC9
HCK1, HCK2, HCK3, HCK4, HCK5, HCK6
GIR13, GIR14, GIR15, GIR22
SPS0
RER1, RER3, RER5

This is often referred to as a "Post Mortem", used to learn from the event and improve relevant Incident Response Plans.

Determine the root cause or causes of an incident
Examine how the incident was allowed to occur
Consider what changes can be implemented to prevent or mitigate similar events from occurring.

FIN7
SLS5, SLS14, SLS17, SLS18, SLS19
DOW4, DOW6, DOW10
KEC6, KEC7, KEC9
HCK1, HCK2, HCK3, HCK4, HCK5, HCK6
GIR6, GIR7, GIR13, GIR14, GIR15

A Disaster Recovery Plan is a specialized Incident Response Plan that gives guidance on recovering one or more information systems at an alternate facility, in response to a major hardware or software failure including the partial or complete destruction of facilities.

Maintain secured up-to-date copies of production environments to enable fast restoration.

Tools to support disaster recovery plans

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

NIST Disaster Response template

DOW1, DOW2, DOW3, DOW4, DOW5, DOW6, DOW7, DOW9
KEC6, KEC7, KEC9
HCK6
GIR15, GIR16, GIR19, GIR21, GIR22, GIR25

These are also known as "Pre-Mortems".

Regular simulations of implementing an Incident Response Plan ensure that relevant personnel are familiar with its procedures and can efficiently follow them when necessary. "Pre-Mortems" simulating or "war-gaming" a specific failure also test those procedures to give some idea of whether they are appropriate and adequate. They also often motivate participants to think about other risks, and whether appropriate procedures and mitigations are in place.

There are many possible approaches to an incident simulation, and many eventualities that they can cover. Example topics for Pre-Mortems include variations on themes such as

Unauthorized users gain access to the servers and set about making mischief
A complex security compromise where details are not immediately available
A specific scenario (environmental disaster, utility failure, operational error) results in system downtime

Articles such as How to Use Pre-mortems to Prevent Problems, Blunders, and Disasters offer further information on how to plan and implement simulations, and how to derive the maximum benefit from them.

National Institute of Standards & Technology Template

All risks

As well as direct financial losses, security incidents can also result in substantial reputational damage. Appropriate Incident Communication with stakeholders about security incidents, both during and after the relevant incident, can significantly mitigate this risk.

It is important to note that inappropriate communication during an incident can increase the damage. External communication has to balance stakeholders' need for information that enables them to respond in a well-informed manner against the importance of providing clear information with as much certainty as feasible that it will not later be contradicted.

Providing information as soon as possible
Providing a detailed post-incident summary.

FIN6
HCK4
SPS0
RER1, RER2, RER3

This section contains controls that are material to Node Operator risks. Some of these control criteria correspond to similar controls from three common frameworks:

Where relevant, corresponding controls from those frameworks are identified and linked from ValOS controls.

🔗 Node Operators MUST document how their processes and tools serve their business goals

[SOC2] CC 5.2

🔗 Node Operators MUST review their dependencies on staff and external suppliers and how to replace key staff or suppliers every year

External suppliers can change terms, shut down products or support, and key staff can leave or be indisposed for long enough to impact business functions.

🔗 Node Operators MUST document their assessments of risks, and what risks they class as acceptable

[SOC2] CC 3.1

🔗 Node Operators MUST ensure that processes for risk mitigation are followed in practice

Best practice is to ensure that, where possible, processes are automated.

FIN1, FIN2, FIN3, FIN4, FIN5, FIN6

🔗 Node Operators MUST document payment processes including currency and exchange details

🔗 Node Operators MUST review relevant regulation and update processes for compliance as necessary at least quarterly

FIN6
RER2

🔗 Node Operators MUST know the identity of entities who are authorized to manage operations

Best practice is to identify every individual who works for the Node Operator. In the case of corporate third-party providers, sensible due diligence does not always extend to identifying specific individuals.

[ISO27001] Annex A 5.16

🔗 Node Operators MUST implement documented procedures for evaluating and reviewing counterparty risks from vendors and partners

Establishing a process for Vendor and Business Partner engagement and assessing existing as well as new vendors and business partners
Ensuring that any identified issues are fixed, and regressions can be identified.
Terminating relationships efficiently where problems arise, or the relationship ends.

[SOC2] CC 9.2

🔗 Node Operators MUST ensure entities who are authorized to manage operations have and maintain the necessary knowledge to minimize risks to the Node Operator in the course of performing their work

🔗 Node Operators MUST keep third-party software up to date

This control does not imply that the latest available update is automatically applied, rather that Node Operators have clear and effective mechanisms to ensure they are aware of updates and apply them in accordance with their update management procedures, taking into account the controls in Controls for Development and Update Process.

Best practice is to monitor software in use, to know when an update is available, and to update as fast as possible while following procedures to manage those updates securely. In some cases, assessing an update will lead to a decision that there is no need to apply a specific update, or a risk in doing so that outweighs the benefits.

DOW19
HCK4

🔗 Node Operators MUST have a persistent local anti-slashing database

SLS1, SLS2, SLS3, SLS4, SLS14, SLS15, SLS18, SLS19

🔗 Node Operators MUST document signature requirements for high-value transactions, including the definitions used to identify such transactions

🔗 Node Operators SHOULD use signature management tools to help secure high-value transactions

🔗 The primary and backup/failover versions of Signature management tools MUST implement mechanisms to ensure data continuity

🔗 Node Operators MUST deploy at least 2 distinct client applications for any level of the blockchain where at least 3 clients are available
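One possible automated check of this control for a single layer is sketched below. The function name is hypothetical, and the client names in the usage example are illustrative.

```python
def meets_client_diversity(available_clients: set, deployed_clients: set) -> bool:
    """Check the 'at least 2 of at least 3' client rule for one layer."""
    if len(available_clients) < 3:
        return True  # the control only applies where at least 3 clients exist
    # Only count deployed clients that are recognized as available for this layer.
    return len(deployed_clients & available_clients) >= 2


consensus_clients = {"Lighthouse", "Prysm", "Teku", "Nimbus"}
diverse = meets_client_diversity(consensus_clients, {"Lighthouse", "Teku"})
monoculture = meets_client_diversity(consensus_clients, {"Prysm"})
few_available = meets_client_diversity({"OnlyClient", "OtherClient"}, {"OnlyClient"})
```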

🔗 Devices that control critical functions MUST be dedicated to that purpose, and configured with only the necessary software for their intended purpose

This applies to servers acting as validators, but also to devices authorized to access and administer those servers remotely.

DOW2
HCK1, HCK2, HCK3, HCK4

🔗 Node Operators MUST implement processes to withdraw validators from a network in such a way that they are not penalised for disappearing

SLS2
SPS1

🔗 Node Operators MUST document configuration of software and hardware

[SOC2] CC 7.1
[ISO27001] Annex A 8.9

🔗 Node Operators MUST implement appropriate key management procedures

Best practice includes following a commonly recognized key management standard such as

[CCSS]: a set of requirements for securing Cryptocurrency systems, focusing on Key Management. Certification for systems is available at three levels, and is granted by certified CCSS Auditors.
[KMS]: a set of requirements for Key Management designed for organizations working in blockchain, allowing self-attestation of conformance.

All Key Custody risks
HCK1, HCK2, HCK3, HCK4

🔗 Node Operators MUST document and follow information lifecycle processes for important operational information

This includes the definition and enforcement of retention periods, and the use of thorough deletion mechanisms, such as shred.

[ISO27001] Annex A 8.10

SLS10
DOW17
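
Enforcement of retention periods can be sketched as a periodic job that selects records past their limit and hands them to a thorough deletion mechanism. The categories, periods, and record layout below are invented for illustration; actual values come from the Node Operator's own retention policy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per information category.
RETENTION = {
    "access_log": timedelta(days=365),
    "debug_trace": timedelta(days=30),
}

def expired(records, now):
    """Return the records whose category's retention period has elapsed."""
    out = []
    for rec in records:
        limit = RETENTION.get(rec["category"])
        # Records in an unknown category are kept rather than silently deleted.
        if limit is not None and now - rec["created"] > limit:
            out.append(rec)
    return out
```

Keeping uncategorized records is a deliberate choice in this sketch: deleting information that has no documented retention period is harder to undo than retaining it until it has been classified.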

🔗 Node Operators MUST implement backup procedures, at minimum daily, for important operational data

🔗 Backup Procedures SHOULD produce journaled backups covering relevant retention periods

🔗 Node Operators MUST implement protection against accidental or malicious deletion of data

These requirements cover all information required by controls in this specification.

🔗 Node Operators MUST record and maintain important operational information

Best practice is to use a documentation management system. While this is likely to have different levels of access control, it is important that no information is available to only one employee.

🔗 Node Operators MUST have a policy for data retention

This needs to provide adequate retention to enable historical analysis and checking for anomalous patterns, while minimizing stored data and ensuring compliance with relevant data protection regulation.

[OWASP_ACCESS_CONTROL]
[ISO27001] Annex A 5.15
[SOC2] CC 6.1

🔗 All services MUST require appropriate authentication privileges

For example, a Node does not respond to anonymous requests from an unknown user.

🔗 Networks MUST be segmented, to restrict access to systems that are identified as needing it

🔗 Nodes MUST NOT respond to requests from outside a defined network, except those that are explicitly defined as necessary

Fulfilling this requirement means maintaining a whitelist of individual services that are authorized to respond to requests from broader networks.

[ISO27001] Annex A 8.22
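
The whitelist described above can be sketched as a decision function. The network range and the service names are assumptions chosen for this example; in practice this policy is usually enforced by firewalls or network configuration rather than application code.

```python
import ipaddress

# Hypothetical values: the defined internal network and the only
# service that is authorized to answer requests from broader networks.
INTERNAL_NETWORK = ipaddress.ip_network("10.0.0.0/8")
PUBLIC_SERVICES = {"p2p"}

def should_respond(source_ip, service):
    """Answer internal requests, and external ones only for listed services."""
    inside = ipaddress.ip_address(source_ip) in INTERNAL_NETWORK
    return inside or service in PUBLIC_SERVICES
```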

🔗 Entry to physical server locations MUST require authorization

For example, a biometric scan or the use of a keycard.

DOW3, DOW4
HCK6

🔗 Software MUST NOT run with, and a user MUST NOT have a higher level of privilege than necessary

For example, check that software does not run as root, that users do not log in directly with root privileges, and that software and users are granted fine-grained access based on need rather than broad-based access for simplicity.

[SOC2] CC 6.3
[ISO27001] Annex A 8.2
[ISO27001] Annex A 8.18
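
The first of these checks can be automated at process start-up. The sketch below assumes a POSIX system, where the root user has effective user ID 0; the function name is illustrative.

```python
import os

def running_as_root(euid=None):
    """Return True if the effective user ID is 0 (root) on a POSIX system."""
    if euid is None:
        euid = os.geteuid()  # available on POSIX platforms only
    return euid == 0
```

A service could call `running_as_root()` when it starts and refuse to continue if the result is true.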

🔗 A review of Access Rights MUST take place regularly

This covers both the processes and tools for granting and revoking access rights, and verifying that they are effectively managing access rights according to the relevant principles (Least Privilege, Role-based Access Control). Best practice for this review includes:

analyzing access logs for physical access to hardware, and ensuring only authorized individuals are given access to hardware
verifying access to signing keys is limited to individuals whose roles mean they need it, and that all who need that access have it
ensuring that processes are effectively followed and meet the Node Operator's business needs
verifying that software is run in a way that minimizes its access

[ISO27001] Annex A 5.17
[ISO27001] Annex A 5.18
[ISO27001] Annex A 8.18
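
Part of such a review can be automated by comparing who holds each permission with who needs it. The sketch below assumes both sets are available as simple mappings; the function name and data layout are illustrative.

```python
def review_access(granted, required):
    """Compare who holds each permission with who needs it.

    Both arguments map a permission name to a set of people.
    Returns (excess, missing), each mapping a permission to a set of people.
    """
    excess, missing = {}, {}
    for perm in granted.keys() | required.keys():
        have = granted.get(perm, set())
        need = required.get(perm, set())
        if have - need:
            excess[perm] = have - need   # access that should be revoked
        if need - have:
            missing[perm] = need - have  # access that should be granted
    return excess, missing
```

Reporting both directions reflects the review described above: access is limited to those who need it, and everyone who needs it has it.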

🔗 All data in transit MUST be encrypted, 🔗 and SHOULD use the most direct transmission available

🔗 All data "at rest" MUST be stored in encrypted form

This covers all services that communicate data, such as Databases, Web servers, Load balancers, Authentication systems, CI/CD pipeline tools, etc.

Best practices include ensuring that the latest version of TLS is being used, with secure algorithms.

Current best practice includes assessing the cost and risk associated with moving to quantum-safe cryptography, and appropriate timelines.

[CRYPTOFAIL]
[SOC2] CC 6.7
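
For data in transit, a client that refuses anything older than TLS 1.3 can be sketched with the Python standard library. Treating TLS 1.3 as the required minimum is an assumption for this example; the appropriate floor depends on what the peers of a given service support.

```python
import ssl

def strict_client_context():
    """Build a client-side TLS context that rejects versions below TLS 1.3."""
    # The default context already verifies certificates and host names.
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx
```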

🔗 Node Operators MUST log network traffic, and analyze the logs for anomalous behavior

SLS9, SLS10, SLS11, SLS12, SLS13, SLS14, SLS15
DOW1

🔗 Any operation that requires privileged access MUST be logged

🔗 Any assignment of a key, or assignment of a role to or removal of a role from a particular key, MUST be logged

This includes monitoring software that has privileged access.

[ISO27001] Annex A 8.18

FIN1
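
A minimal sketch of logging privileged operations, assuming they are exposed as functions whose first argument identifies the actor. The decorator, the logger name, and the example operation are illustrative assumptions.

```python
import functools
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("audit")

def privileged(operation):
    """Decorator: write an audit record each time the wrapped function is called."""
    def wrap(func):
        @functools.wraps(func)
        def inner(actor, *args, **kwargs):
            audit_log.info(json.dumps({
                "time": datetime.now(timezone.utc).isoformat(),
                "actor": actor,
                "operation": operation,
            }))
            return func(actor, *args, **kwargs)
        return inner
    return wrap

@privileged("assign_role_to_key")
def assign_role(actor, key_id, role):
    # Placeholder for the real role assignment.
    return (key_id, role)
```

The record is written before the operation runs, so an attempt that fails part-way still leaves a trace.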

🔗 Every change in the status of people who have access to any function of the Node, or physical access to any hardware, MUST be logged

FIN1
HCK3

🔗 Any event that results in slashing MUST be logged

SLS4, SLS17, SLS18, SLS19, SLS20
RER1, RER3

🔗 Logs MUST provide a sufficiently detailed view of hardware and network performance to enable upgrade needs to be forecast, and to alert if validators are operating with excess latency

Tools such as Zabbix can also display a live feed of CPU and memory usage of each compute instance.

[SOC2] A 1.1
[SOC2] CC 7.2
[ISO27001] Annex A 8.16
[ISO27001] Annex A 8.21

DOW3, DOW7, DOW10, DOW15
GIR4
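
An excess-latency alert can be derived from logged samples, for example by comparing a high percentile with a threshold. The nearest-rank percentile, the 95th percentile default, and the function names are assumptions chosen for this sketch.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

def latency_alert(samples_ms, threshold_ms, pct=95):
    """Return True if the chosen latency percentile exceeds the threshold."""
    return percentile(samples_ms, pct) > threshold_ms
```

Using a percentile rather than the mean keeps a single slow response from triggering an alert, while a sustained slowdown still does.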

🔗 Node Operators SHOULD have processes in place to manage environmental threats

This includes monitoring for such threats, physically hardened facilities (e.g. fire- and flood-resistant server rooms), and physically decentralized infrastructure. It can also incorporate the use of DVT or related approaches to managing physical decentralization.

[ISO27001] Annex A 7

SLS14, SLS15
DOW1, DOW5, DOW7, DOW9

🔗 Node Operators SHOULD implement failover validators in different physical locations

DOW1, DOW2, DOW3, DOW4, DOW5, DOW7, DOW9

🔗 Node Operators SHOULD have processes in place to manage equipment lifecycles

This includes monitoring performance and performing preventive maintenance, upgrades, or replacing equipment as appropriate, as well as processes that ensure equipment is correctly retired including removing data and any hardware-based authorization.

[ISO27001] Annex A 7

🔗 Code development MUST follow secure development processes to avoid introducing security risks

This is a broad area. A few specific controls are included in this specification, but this requirement is intended to ensure a general production philosophy.

[ISO27001] Annex A 8.25

🔗 Node Operators MUST document procedures for updates to code

[SOC2] CC 8.1
[ISO27001] Annex A 8.32

🔗 Source code MUST be managed in a repository

🔗 All changes to deployed production code MUST be tested and reviewed before deployment

This covers all changes to code, including when it is necessary to roll back an upgrade.

🔗 Updates to third-party software MUST be checked for vulnerabilities before deployment

This covers verifying that all software updates, including validator and other node clients as well as specifically written custom code or updates, have been audited to ensure they are not introducing known or new vulnerabilities.

Best practice is to perform both internal and independent external audits, and to ensure the identity of the coders is known. Likewise, in best practice third-party code developers are only given access to code they need to do their work, are held to high standards of confidentiality, and work with a well-defined set of expectations.

[ISO27001] Annex A 8.7
[ISO27001] Annex A 8.30
[ISO27001] Annex A 8.32
[SOC2] CC 8.1

🔗 Software update procedures MUST include an assessment and application of configuration settings

🔗 Code MUST verify that input is safe before operating on it

🔗 Code MUST NOT produce invalid outputs

🔗 Components SHOULD use Cross-Origin Resource Sharing and Content Security Policy Level 3 to protect against Server Side Request Forgery

These requirements ensure that data passed between software components can be handled safely by the receiving component. This includes data entered manually by users.

[SSRF]
[SOC2] PI 1.2
[SOC2] PI 1.3

HCK5
GIR16
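
Verifying input before operating on it can be sketched as follows for a hypothetical withdrawal request. The field names, the address format (`0x` followed by 40 hexadecimal digits), and the function name are assumptions made for this example.

```python
import re

HEX_ADDRESS = re.compile(r"0x[0-9a-fA-F]{40}")

def parse_withdrawal(payload):
    """Validate an untrusted dict and return a clean (address, amount) pair."""
    address = payload.get("address")
    amount = payload.get("amount")
    if not isinstance(address, str) or not HEX_ADDRESS.fullmatch(address):
        raise ValueError("address must be 0x followed by 40 hex digits")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount must be a positive integer")
    return address, amount
```

Rejecting anything that does not match the expected shape, rather than trying to repair it, keeps malformed data from reaching later components.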

🔗 Node Operators MUST have thorough test coverage of their software and operating procedures

There is no magic percentage figure, but ideally unit tests and integration tests cover every functionality and interaction managed by code the Node Operator uses, whether self-managed or provided by a third party.

[ISO27001] Annex A 8.29

🔗 Updates MUST include an audit of all code and user interactions they impact

This means testing not just the new code deployed, but also existing code that interacts with anything the update changes, to ensure that integration is not introducing a vulnerability. This extends to non-blockchain code used to interact with the Validator, where applicable.

🔗 Updates MUST be tested on a staging environment that as closely as possible matches the proposed deployment environment before deployment as "production" on a live network

[ISO27001] Annex A 8.31

🔗 Node Operators MUST have a process to enable emergency rollback of upgrades

🔗 Node Operators SHOULD provide regular normal operational communication

This covers general information similar to financial reporting, major changes in staffing (overall size, key positions, strategic focus), and operator-specific information such as governance of onchain systems, key third-party relationships, software partnerships, participation in standards-setting, and the like.

The purpose is to provide confidence to stakeholders that the Node Operator is effectively managed, to enable them to understand the overall goals, and to show operational strengths, and plans to address perceived weaknesses and strategic threats.

FIN6
RER5

🔗 The Node Operator MUST have documented Incident Response Plans corresponding to all risks identified in this specification

[SOC2] CC 7.4
[SOC2] CC 9.1 of Trust Services Criteria

🔗 The Node Operator MUST have documented Disaster Recovery Plans corresponding to risks identified in this specification that lead to destruction of crucial data or loss of assets

[SOC2] CC 7.5

🔗 Incident Response Plans and Disaster Recovery Plans MUST include revising the relevant plans whenever they are activated, based on lessons learned

This covers both responses to real incidents and Simulated activation, or Pre-mortems.

[SOC2] CC 7.3

🔗 Node Operators MUST perform a simulated Incident and activation of the associated Incident Response Plan or Disaster Recovery Plans at least twice per year

🔗 Node Operators MUST document Incident Communication strategies or policies

This requirement includes internal and external communication, both during and after incidents.

RER5

🔗 Node Operators MUST verify that third parties providing services, or with whom the Node Operator contracts, are in compliance with relevant standards (including this one) and regulations

This includes areas such as the uptime guarantees of cloud providers and other core counterparties, response times and Service Level Agreements, security procedures, and the like, as well as relevant regulatory compliance.

[ISO27001] Annex A 8.30
[SOC2] CC 9.2

🔗 Service agreements MUST specify termination procedures and obligations

FIN1
HCK3

This section provides a summary of the Controls provided by this Specification.

Control Group	Control(s)	Risks	External Controls
Node Operators MUST document how their processes and tools serve their business goals	Node Operators MUST document how their processes and tools serve their business goals	SLS1, SLS2, SLS3, SLS4, SLS5, SLS11, SLS12, SLS13, SLS14, SLS15, SLS17, SLS18 DOW16, DOW18 GIR5	[SOC2] CC 5.2
Node Operators MUST review their dependencies on staff and external suppliers and how to replace key staff or suppliers every year	Node Operators MUST review their dependencies on staff and external suppliers and how to replace key staff or suppliers every year	FIN1 SLS6 DOW11, DOW14, DOW19, DOW20 GIR24, GIR25
Node Operators MUST document their assessments of risks, and what risks they class as acceptable	Node Operators MUST document their assessments of risks, and what risks they class as acceptable	All risks	[SOC2] CC 3.1
Node Operators MUST ensure that processes for risk mitigation are followed in practice	Node Operators MUST ensure that processes for risk mitigation are followed in practice	FIN1, FIN2, FIN3, FIN4, FIN5, FIN6
Node Operators MUST document payment processes including currency and exchange details	Node Operators MUST document payment processes including currency and exchange details	FIN2, FIN3, FIN4, FIN5, FIN6 HCK1 RER4
Node Operators MUST review relevant regulation and update processes for compliance as necessary at least quarterly	Node Operators MUST review relevant regulation and update processes for compliance as necessary at least quarterly	FIN6 RER2
Node Operators MUST know the identity of entities who are authorized to manage operations	Node Operators MUST know the identity of entities who are authorized to manage operations	FIN1 KEC7 HCK1, HCK2, HCK3 SPS0 RER2	[ISO27001] Annex A 5.16
Node Operators MUST implement documented procedures for evaluating and reviewing counterparty risks from vendors and partners	Node Operators MUST implement documented procedures for evaluating and reviewing counterparty risks from vendors and partners	FIN1 SLS9 GIR5 DOW1, DOW19	[SOC2] CC 9.2
Node Operators MUST ensure entities who are authorized to manage operations have and maintain the necessary knowledge to minimize risks to the Node Operator in the course of performing their work	Node Operators MUST ensure entities who are authorized to manage operations have and maintain the necessary knowledge to minimize risks to the Node Operator in the course of performing their work	FIN1, FIN7 SLS17 DOW21 GIR16, GIR22, GIR25 RER2
Node Operators MUST keep third-party software up to date	Node Operators MUST keep third-party software up to date	DOW19 HCK4
Node Operators MUST have a persistent local anti-slashing database	Node Operators MUST have a persistent local anti-slashing database	SLS1, SLS2, SLS3, SLS4, SLS14, SLS15, SLS18, SLS19
Node Operators MUST document signature requirements for high-value transactions, including the definitions used to identify such transactions	Node Operators MUST document signature requirements for high-value transactions, including the definitions used to identify such transactions Node Operators SHOULD use signature management tools to help secure high-value transactions The primary and backup/failover versions of Signature management tools MUST implement mechanisms to ensure data continuity	SLS1, SLS2, SLS3, SLS4, SLS7, SLS15 KEC6, KEC9 GIR7, GIR16
Node Operators MUST deploy at least 2 distinct client applications for any level of the blockchain where at least 3 clients are available	Node Operators MUST deploy at least 2 distinct client applications for any level of the blockchain where at least 3 clients are available	SLS6, SLS20 DOW2, DOW11, DOW12, DOW13, DOW14 GIR13, GIR24
Devices that control critical functions MUST be dedicated to that purpose, and configured with only the necessary software for their intended purpose	Devices that control critical functions MUST be dedicated to that purpose, and configured with only the necessary software for their intended purpose	DOW2 HCK1, HCK2, HCK3, HCK4
Node Operators MUST implement processes to withdraw validators from a network in such a way that they are not penalised for disappearing	Node Operators MUST implement processes to withdraw validators from a network in such a way that they are not penalised for disappearing	SLS2 SPS1
Node Operators MUST document configuration of software and hardware	Node Operators MUST document configuration of software and hardware	FIN3, FIN4, FIN5, FIN6 DOW12, DOW13, DOW14, DOW20, DOW21 KEC6, KEC9 GIR3, GIR18, GIR19, GIR20 RER1, RER4	[SOC2] CC 7.1 [ISO27001] Annex A 8.9
Node Operators MUST implement appropriate key management procedures	Node Operators MUST implement appropriate key management procedures	All Key Custody risks HCK1, HCK2, HCK3, HCK4
Node Operators MUST document and follow information lifecycle processes for important operational information	Node Operators MUST document and follow information lifecycle processes for important operational information	SLS10 DOW17	[ISO27001] Annex A 8.10
Node Operators MUST implement backup procedures, at minimum daily, for important operational data	Node Operators MUST implement backup procedures, at minimum daily, for important operational data Backup Procedures SHOULD produce journaled backups covering relevant retention periods Node Operators MUST implement protection against accidental or malicious deletion of data	FIN6 SLS4, SLS10, SLS11, SLS12 KEC6, KEC9 HCK2, HCK4 GIR4, GIR13 RER1, RER3
Node Operators MUST record and maintain important operational information	Node Operators MUST record and maintain important operational information	FIN1, FIN5, FIN6 SLS3, SLS4, SLS10, SLS14 DOW1, DOW4, DOW16, DOW18 KEC2, KEC3, KEC6, KEC9, KEC10 HCK4 GIR4, GIR25 SPS0 RER1, RER3
Node Operators MUST have a policy for data retention	Node Operators MUST have a policy for data retention	FIN6 DOW4, DOW13 GIR4
All services MUST require appropriate authentication privileges	All services MUST require appropriate authentication privileges	FIN1 DOW4 KEC7 HCK1, HCK2, HCK3, HCK4 GIR22
Networks MUST be segmented, to restrict access to systems that are identified as needing it	Networks MUST be segmented, to restrict access to systems that are identified as needing it Nodes MUST NOT respond to requests from outside a defined network, except those that are explicitly defined as necessary	DOW10 GIR9 HCK6	[ISO27001] Annex A 8.22
Entry to physical server locations MUST require authorization	Entry to physical server locations MUST require authorization	DOW3, DOW4 HCK6
Software MUST NOT run with, and a user MUST NOT have a higher level of privilege than necessary	Software MUST NOT run with, and a user MUST NOT have a higher level of privilege than necessary	DOW4, DOW5, DOW11, DOW12, DOW13, DOW14 KEC7 HCK1, HCK2, HCK3, HCK4 SPS0	[SOC2] CC 6.3 [ISO27001] Annex A 8.2 [ISO27001] Annex A 8.18
A review of Access Rights MUST take place regularly	A review of Access Rights MUST take place regularly	SLS9, SLS10, SLS11, SLS12, SLS13 DOW16, DOW17, DOW18 GIR1, GIR5, GIR7	[ISO27001] Annex A 5.17 [ISO27001] Annex A 5.18 [ISO27001] Annex A 8.18
All data in transit MUST be encrypted	All data in transit MUST be encrypted All data "at rest" MUST be stored in encrypted form	SLS11, SLS12, SLS13 DOW18 KEC1, KEC6, KEC7, KEC9 HCK1, HCK2, HCK3, HCK4, HCK6 GIR10	[CRYPTOFAIL] [SOC2] CC 6.7
Node Operators MUST log network traffic, and analyze the logs for anomalous behavior	Node Operators MUST log network traffic, and analyze the logs for anomalous behavior	SLS9, SLS10, SLS11, SLS12, SLS13, SLS14, SLS15 DOW1
Any operation that requires privileged access MUST be logged	Any operation that requires privileged access MUST be logged Any assignment of a key, or assignment of a role to or removal of a role from a particular key, MUST be logged	FIN1	[ISO27001] Annex A 8.18
Every change in the status of people who have access to any function of the Node, or physical access to any hardware, MUST be logged	Every change in the status of people who have access to any function of the Node, or physical access to any hardware, MUST be logged	FIN1 HCK3
Any event that results in slashing MUST be logged	Any event that results in slashing MUST be logged	SLS4, SLS17, SLS18, SLS19, SLS20 RER1, RER3
Logs MUST provide a sufficiently detailed view of hardware and network performance to enable upgrade needs to be forecast, and to alert if validators are operating with excess latency	Logs MUST provide a sufficiently detailed view of hardware and network performance to enable upgrade needs to be forecast, and to alert if validators are operating with excess latency	DOW3, DOW7, DOW10, DOW15 GIR4	[SOC2] A 1.1 [SOC2] CC 7.2 [ISO27001] Annex A 8.16 [ISO27001] Annex A 8.21
Node Operators SHOULD have processes in place to manage environmental threats	Node Operators SHOULD have processes in place to manage environmental threats	SLS14, SLS15 DOW1, DOW5, DOW7, DOW9	[ISO27001] Annex A 7
Node Operators SHOULD implement failover validators in different physical locations	Node Operators SHOULD implement failover validators in different physical locations	DOW1, DOW2, DOW3, DOW4, DOW5, DOW7, DOW9
Node Operators SHOULD have processes in place to manage equipment lifecycles	Node Operators SHOULD have processes in place to manage equipment lifecycles	DOW3, DOW6, DOW7, DOW9, DOW15, DOW20, DOW21 KEC1, KEC3, KEC6 HCK4, HCK6	[ISO27001] Annex A 7
Code development MUST follow secure development processes to avoid introducing security risks	Code development MUST follow secure development processes to avoid introducing security risks	All Slashing Risks DOW2, DOW10, DOW11, DOW12, DOW13, DOW14, DOW20 KEC7 HCK1, HCK2, HCK3, HCK4, HCK5 GIR6, GIR10, GIR11, GIR13, GIR14, GIR15 SPS0	[ISO27001] Annex A 8.25
Node Operators MUST document procedures for updates to code	Node Operators MUST document procedures for updates to code	SLS6, SLS7 DOW2, DOW11, DOW12, DOW13, DOW14, DOW19, DOW20, DOW21 GIR4, GIR13, GIR18, GIR19 SPS0	[SOC2] CC 8.1 [ISO27001] Annex A 8.32
Source code MUST be managed in a repository	Source code MUST be managed in a repository All changes to deployed production code MUST be tested and reviewed before deployment	SLS6, SLS7 DOW2, DOW11, DOW12, DOW13, DOW14, DOW19, DOW20, DOW21 GIR4, GIR13, GIR18, GIR19, GIR21 SPS0
Updates to third-party software MUST be checked for vulnerabilities before deployment	Updates to third-party software MUST be checked for vulnerabilities before deployment	FIN3, FIN4, FIN5, FIN6 SLS6, SLS7, SLS17, SLS18, SLS19 DOW11, DOW12, DOW13, DOW14 KEC7 HCK1 SPS0	[ISO27001] Annex A 8.7 [ISO27001] Annex A 8.30 [ISO27001] Annex A 8.32 [SOC2] CC 8.1
Software update procedures MUST include an assessment and application of configuration settings	Software update procedures MUST include an assessment and application of configuration settings	SLS7 DOW2, DOW13, DOW21 GIR3, GIR15
Code MUST verify that input is safe before operating on it	Code MUST verify that input is safe before operating on it Code MUST NOT produce invalid outputs Components SHOULD use Cross-Origin Resource Sharing and Content Security Policy Level 3 to protect against Server Side Request Forgery	HCK5 GIR16
Node Operators MUST have thorough test coverage of their software and operating procedures	Node Operators MUST have thorough test coverage of their software and operating procedures	All risks	[ISO27001] Annex A 8.29
Updates MUST include an audit of all code and user interactions they impact	Updates MUST include an audit of all code and user interactions they impact	SLS6, SLS7 DOW2, DOW11, DOW12, DOW13, DOW14, DOW19, DOW20, DOW21 GIR4, GIR13, GIR18, GIR19, GIR21 SPS0
Updates MUST be tested on a staging environment that as closely as possible matches the proposed deployment environment before deployment as "production" on a live network	Updates MUST be tested on a staging environment that as closely as possible matches the proposed deployment environment before deployment as "production" on a live network	SLS6, SLS7 DOW2, DOW11, DOW12, DOW13, DOW14, DOW19, DOW20, DOW21 GIR4, GIR11, GIR13, GIR18, GIR19, GIR21 SPS0	[ISO27001] Annex A 8.31
Node Operators MUST have a process to enable emergency rollback of upgrades	Node Operators MUST have a process to enable emergency rollback of upgrades	FIN3, FIN4, FIN5, FIN6 SLS6, SLS7 DOW2, DOW11, DOW12, DOW13, DOW14, DOW19, DOW20, DOW21 GIR4, GIR13, GIR18, GIR19, GIR21 SPS0
Node Operators SHOULD provide regular normal operational communication	Node Operators SHOULD provide regular normal operational communication	FIN6 RER5
The Node Operator MUST have documented Incident Response Plans corresponding to all risks identified in this specification	The Node Operator MUST have documented Incident Response Plans corresponding to all risks identified in this specification	All risks	[SOC2] CC 7.4 [SOC2] CC 9.1 of Trust Services Criteria
The Node Operator MUST have documented Disaster Recovery Plans corresponding to risks identified in this specification that lead to destruction of crucial data or loss of assets	The Node Operator MUST have documented Disaster Recovery Plans corresponding to risks identified in this specification that lead to destruction of crucial data or loss of assets	DOW1, DOW2, DOW3, DOW4, DOW5, DOW10 GIR13, GIR19 RER1, RER4, RER5	[SOC2] CC 7.5
Incident Response Plans and Disaster Recovery Plans MUST include revising the relevant plans whenever they are activated, based on lessons learned	Incident Response Plans and Disaster Recovery Plans MUST include revising the relevant plans whenever they are activated, based on lessons learned	All risks	[SOC2] CC 7.3
Node Operators MUST perform a simulated Incident and activation of the associated Incident Response Plan or Disaster Recovery Plans at least twice per year	Node Operators MUST perform a simulated Incident and activation of the associated Incident Response Plan or Disaster Recovery Plans at least twice per year	All risks
Node Operators MUST document Incident Communication strategies or policies	Node Operators MUST document Incident Communication strategies or policies	RER5
Node Operators MUST verify that third parties providing services, or with whom the Node Operator contracts, are in compliance with relevant standards (including this one) and regulations	Node Operators MUST verify that third parties providing services, or with whom the Node Operator contracts, are in compliance with relevant standards (including this one) and regulations	FIN1 SLS9 DOW1, DOW7, DOW9, DOW19 GIR5, GIR14, GIR22, GIR24, GIR25 SPS0 RER2, RER5	[ISO27001] Annex A 8.30 [SOC2] CC 9.2
Service agreements MUST specify termination procedures and obligations	Service agreements MUST specify termination procedures and obligations	FIN1 HCK3

ValOS

Abstract

1. Introduction

1.1 Purpose

2. Conformance

3. Risks

3.1 Financial and Regulatory Risk

3.2 Slashing Risk

3.3 Downtime Risk

3.4 Key Custody Risk

3.5 Hacking Risk

3.6 General Infrastructure Risk

3.7 Service Partner Specific Risk

3.8 Reputational Risk

4. Risk Mitigation Strategies

4.1 Risk Management

4.1.1 Assessing risks

4.1.1.1 Best practices for assessing risk include

Risks that risk assessment can mitigate

4.1.2 Assessing Financial Impact

Risks that assessing financial impact can mitigate

4.1.3 Assessing Incident Probability

4.1.3.1 Best practices for assessing incident probability include

Risks that assessing incident probability can mitigate

4.2 People Management

4.2.1 Identified Individuals

Risks that identifying individuals involved in managing Validators can mitigate

4.2.2 Training

Risks that training can mitigate

4.3 Technology Stack

4.3.1 Update Third-party Software

4.3.1.1 Best practices for updating software include

4.3.1.2 Relevant controls for updated software

Risks that updated software can mitigate

4.3.2 Local Anti-Slashing Database

Risks that a local anti-slashing database can mitigate

4.3.3 Signature Management

Risks that signature management can mitigate

4.3.4 Client Diversity

4.3.4.1 Best practice for client diversity includes

Risks that client diversity can mitigate

4.3.5 Delinquent State

Risks that handling delinquent state can mitigate

4.4 Information and Secret Management

4.4.1 Controlled and Audited Secret Access

Risks that secret access management can mitigate

4.4.2 Encrypted Data

Risks that data encryption can mitigate

4.4.3 Cold Storage

Risks that cold storage can mitigate

4.4.4 Key Management

4.4.4.1 Best practices for key management include

Risks that key management can mitigate

4.4.5 Operational Information Management

Risks that operational information management can mitigate

4.4.6 Deletion protection

Risks that deletion protection can mitigate

4.5 Access Controls and Access Management

Access control helps address the following risks

4.5.1 Least Privilege

4.5.1.1 Best practices for access control include

Risks that least privilege can mitigate

4.5.2 Employee Authorization Management

4.5.2.1 Best practices for employee authorization process includes

Risks that employee authorization process can mitigate

4.5.3 Managed Network Access to Nodes

4.5.3.1 Best practices for managed network access include

Risks that managed network access can mitigate

4.5.4 Authentication Policies

Risks that authentication policy can mitigate

4.6 Managing Hardware

4.6.1 Managed Physical Access

Risks that managed physical access can mitigate

4.6.2 Physically Distributed Infrastructure

Risks that distributed Infrastructure can mitigate

4.6.3 Protection against Utility Failure

Risks that protection against utility failure threats can mitigate

4.6.4 Protection against Environmental Threat

Risks that protection against environmental threats can mitigate

4.6.5 Manage Equipment Lifecycle