ValOS

Abstract

This specification defines risks that can apply when operating a blockchain node. It describes mitigations that can minimise the likelihood that particular risks will be realised and cause a problem, such as compromising the ability to manage a node or actions that result in reduced economic rewards, or penalties such as slashing. Finally, it provides a set of controls to verify that a Node Operator is appropriately managing the relevant risks.

## Introduction {#sec-introduction}

### Purpose {#sec-purpose}

This specification builds on the [DUCK knowledge base](https://duck-initiative.gitbook.io/d.u.c.k.-knowledge-base) [[?DUCK]]. The risk framework and explanation of mitigation strategies have been updated, based on feedback from practitioners. A specific set of controls has been added: statements of requirement that can be tested to ensure that, as far as possible, a Node Operator is following the recognised best practices to minimise risk and effectively maximise their returns.

While other standards such as AICPA's SOC 2® [[?SOC2]] or ISO's 27001 standard [[?ISO27001]] can be applied to Node Operators, they often include more general requirements than this specification, reflecting a broader scope. The relevant controls from several such standards are explicitly linked to the controls in this specification. The purpose of this is twofold: to simplify the process of certifying conformance to this specification for Operators who have already undergone testing against those standards, and to simplify the process of assessing Node Operators who have been certified as conforming to this specification against those standards.

Conformance to this specification is based on meeting the requirements expressed in the [Controls Catalog](#sec-controls-catalog).

## Risks {#sec-risks}

This specification divides risk into eight categories for Node Operators to consider in ensuring the quality of their overall setup. Where applicable, risk identifiers correspond to those that were introduced in [[?DUCK]], and will remain stable. This may mean that some identifiers are retired and will not be re-used.

### Financial and Regulatory Risk {#sec-risks-financial}

These are risks explicitly arising from management of financial and regulatory compliance processes. (Many other risk categories also have a direct financial impact.)

ID	Risk Group	Risk Vectors	Risk Vector Description
FIN1	Process	Onboarding	Onboarded entities are not adequately vetted to ensure financial, operational, regulatory, or reputational appropriateness, resulting in potential financial, legal, or reputational damage
FIN2	Infrastructure	Deposit	Fiat and digital assets deposited are not received in the appropriate currency, address, or fiat account, leading to financial loss
FIN3	Infrastructure	Deposit	Fiat and digital assets are not correctly processed and assets are misallocated to individuals, entities, or operational addresses leading to financial loss
FIN4	Process	Withdrawal	Fiat and digital assets are not correctly disbursed to individuals, entities, or addresses, leading to financial and reputational loss
FIN5	Infrastructure	Compounding	Staking rewards are not appropriately collected, governed, restaked, compounded, or allocated to clients leading to financial loss
FIN6	Process	Reporting	Financial reporting requirements are not adhered to or inconsistently applied, leading to regulatory, legal, and financial consequences
FIN7	Process	Up to date compliance	Failure to review relevant regulation and update compliance procedures leading to financial, legal, and regulatory repercussions

### Slashing Risk {#sec-risks-slashing}

These risks arise from performing slashable actions that lead to penalties. Note that frequent slashing penalties are likely to incur a reputational risk.

ID	Risk Group	Risk Vectors	Risk Vector Description
SLS1	Infrastructure	Operational Failure: Single validator signs two different blocks	Single node signs two different blocks through failure in setting up the anti-slashing mechanism correctly (e.g. local anti-slashing database is disabled or has been deleted) or failure in the validator migration process.
SLS2	Infrastructure	Operational Failure: Shutting down validator only temporarily	Validator shuts down temporarily. System spins up a new validator with the same key
SLS3	Infrastructure	Operational Failure: Validator keys are used on 2 different validators	System takes the same keys twice from the key database and deploys them on two different validators.
SLS4	Infrastructure	Operational Failure: Failure in setting up the anti-slashing mechanisms correctly	Failure in setting up the anti-slashing mechanisms correctly (e.g. Web3Signer has no slashing protection enabled, no database, database only in memory and not on disk, 2 or several copies of Web3Signer, slashing database can be deleted)
SLS5	Infrastructure	Double key usage in the CI/CD pipeline	Usage of same key within different environments causing a slashing
SLS6	Software	Software Bug (e.g. Validator Client) (Intentional or accidental) through update	New versions of a validator client may cause errors that lead to slashing (Supply chain attack)
SLS7	Software	Software Bug (e.g. Validator Client) through software customization	New versions of a validator client have errors that lead to slashing
SLS8	Replaced by HCK1
SLS9	Replaced by HCK2
SLS10	Replaced by HCK3
SLS11	Replaced by HCK4
SLS12	Replaced by HCK4
SLS13	Replaced by HCK4
SLS14	Process	Operational Failure: Incorrect implementation of the failover mechanism: Failover system comes unexpectedly online	If the failover does not ensure that old system is not still alive in some way or is using a stale version of the anti-slashing database, e.g.: failover system starts accidentally although primary system is not down
SLS15	Process	Operational Failure: Incorrect implementation of the failover mechanism: Primary system comes unexpectedly back online	If the failover does not ensure that old system is not still alive in some way or is using a stale version of the anti-slashing database, e.g.: failover system starts (manually / automatically) because primary system is down and primary system comes back online
SLS16	Removed
SLS17	Process	Operational Failure: Slashing monitoring ignores alerts	Slashing events continue or recur because alerts are not monitored
SLS18	Process	Operational Failure: Slashing monitoring does not shut down the validators	Slashing continues because monitoring system fails to automatically shut down malfunctioning validator
SLS19	Process	Incident Response does not update Slashing Database	A slashing event recurs, because the database is not updated as part of Incident Response
SLS20	Infrastructure	Chainsplit increases slashing penalty	After a chainsplit occurs, continuing to support the leading fork can lead to a larger penalty if it is later rejected

### Downtime Risk {#sec-risks-downtime}

These risks are due to connectivity issues. Depending on the network these can lead to reduced rewards, in effect an opportunity cost, or to more direct financial penalties.

ID	Risk Group	Risk Vectors	Risk Vector Description
DOW1	Infrastructure	External: Operational Failure of Cloud Service Provider	Cloud Downtime, malfunction
DOW2	Infrastructure	Operational Failure of own bare metal set-up due to malfunction software	Malfunction of software (e.g. validator client or third party software) leads to downtime
DOW3	Infrastructure	Operational Failure of own bare metal set-up due to malfunction hardware	Malfunction of hardware (e.g. physical network, computer system, CPU, RAM) leads to downtime
DOW4	Infrastructure	External: Operational Failure of own bare metal set-up due to people (ManMade)	Employees are responsible for the downtime event (accidentally or intentionally)
DOW5	Infrastructure	External: Operational Failure of own bare metal set-up due to natural causes	A natural event (e.g. earthquake, flood, hurricane, ...) leads to downtime
DOW6	Infrastructure	Failure to design for high availability	Having too few beacon nodes relative to validator clients, leading to: - opportunity costs - slashing on some networks
DOW7	Infrastructure	External: Internet connectivity	Loss of infrastructure network connection due to: - Sudden cloud outage - Sudden internet failure in on-premise machines - Accidental firewall change locks out access.
DOW8	This risk has been merged into DOW9
DOW9	Infrastructure	Power supply	Volatile power supply damages infrastructure or causes system downtime
DOW10	Infrastructure	External: DDOS attack	Systems unresponsive, slowed down, and compromised
DOW11	Software	Software Bug in the Validator Client	Downtime or accidental interpretation of dishonest behavior
DOW12	Software	Software Bug in the Validator Client (Intentional or accidental) through software update	New versions of a validator client may cause errors that lead to downtime (Supply chain attack)
DOW13	Software	Software Bug in the Validator Client through software customization	New versions of a validator client may cause errors that lead to downtime
DOW14	Software	Software Bug in third party software	Third party software failure can lead to downtime of the whole system
DOW15	Software	Latency / Failure of relays	Latency / Failure of relays
DOW16	Replaced by HCK2
DOW17	Replaced by HCK3
DOW18	Replaced by HCK4
DOW19	Software	Running outdated validator software	The node operator is not updating its validator software
DOW20	Software	Validator client update incompatible with IT system	System downtime after validator client update caused by incompatibility
DOW21	Software	Updates take too long	System downtime caused by software update processes taking longer than planned, with no failover capacity

### Key Custody Risk {#sec-risks-keys}

The risks associated with key custody cover all private keys and key material. However, in some cases the way a risk manifests or impacts depends on the type of key. For example:

- Validator Keys enable the operator to manage their nodes so compromises have a direct impact on operations, which is very likely to have a financial and reputational impact.
- Withdrawal Keys enable the operator to manage their digital assets so compromises often have a direct financial impact, as well as a likely reputational impact.

ID	Risk Group	Risk Vectors	Risk Vector Description
KEC1	Infrastructure	Failure to use vault system	No audit trail and controlled access to secrets
KEC2	Replaced by HCK1, HCK2
KEC3	Replaced by HCK1, HCK2
KEC4	Replaced by HCK4
KEC5	Replaced by HCK4
KEC6	Process	Loss of Signing Keys (Operational Failure)	Signing keys are lost in an operational process
KEC7	Process	Privilege escalation mechanisms not prevented	Someone with access to one service/node can increase their privileges and do more harm on further nodes.
KEC8	Replaced by HCK6
KEC9	Process	Loss of Withdrawal Keys (Operational Failure)	Loss of Withdrawal Keys (Operational Failure)
KEC10	Replaced by HCK1, HCK2
KEC11	Replaced by HCK4

### Hacking Risk {#sec-risks-hacking}

Many risks arise through vulnerability to hacking, whether carried out by a malicious external actor, or facilitated by a current or former member of the operational team.

ID	Risk Group	Risk Vectors	Risk Vector Description
HCK1 (replaces SLS8, KEC2, KEC3, KEC10)	People	Malicious Internal Employee intentionally causes operational failure with appropriate user rights	Anything that an internal employee has access to is at risk of being exploited to sabotage the operation resulting in a slashing incident.
HCK2 (replaces SLS9, DOW16, KEC2, KEC3, KEC10, GIR2)	People	Malicious Internal Employee intentionally causes operational failure via privilege escalation	A malicious internal employee can get additional rights through privilege escalation.
HCK3 (replaces SLS10, DOW17, GIR2, GIR5)	People	Malicious Ex-Employee intentionally causes an operational failure	A former employee whose access is not blocked or removed
HCK4 (replaces SLS11, SLS12, SLS13, DOW18, KEC4, KEC5, KEC11, GIR1)	People	Malicious External Hacker intentionally causes operational failure	Malicious External Hacker gets system access through absence of or weak cyber security standards
HCK5 (replaces GIR8)	Process	No Input validation	Attacks induce buffer overflow, DoS, code injection, etc.
HCK6 (replaces KEC8)	Infrastructure	Failure to protect infrastructure against physical access	Someone who gains physical access to a server can have access to locally exposed ports and can access the software API

### General Infrastructure Risk {#sec-risks-infra}

Risks related to process errors, inefficiencies, and weak general infrastructure.

ID	Risk Group	Risk Vectors	Risk Vector Description
GIR1	Replaced by HCK4
GIR2	Replaced by HCK2, HCK3
GIR3	Infrastructure	Fix versions on every deploy	Downtime if a system that only needs to be restarted accidentally pulls the newest version
GIR4	Process	Insufficient monitoring/logging	- Inability to learn from incidents - Late detection of incidents - insufficient automation to react to incidents
GIR5	Replaced by HCK3
GIR6	Process	No password rotation	- Leak of passwords - brute force
GIR7	Process	Use of direct auth	Authentication information does not expire timely and can be used later.
GIR8	Replaced by HCK5
GIR9	Infrastructure	Failure to properly perform network segmentation	Having containers or nodes accessible from any IP address greatly increases the attack surface
GIR10	Infrastructure	Lack of encrypted traffic between services and deployment scripts	Anyone on the network can sniff packets that include secrets, and may be able to steal passwords and tokens in this way
GIR11	Infrastructure	No separate tests and staging environments	Improper change management and testing of software updates "in production"
GIR13	Infrastructure	High Blast radius of software bug in overall system	A small error affects the whole system and all clients right away, instead of being caught early with limited effect on the whole system.
GIR14	Infrastructure	Low Infrastructure provider security	Hacks through the APIs of the infrastructure provider
GIR15	Infrastructure	CVE Monitoring	Attack on the system suddenly possible once published
GIR16	People	Human error	Anything a human can touch can go wrong
GIR17	Process	Use of non-hardened images	Attack on the system using the weakest link of a given node/container
GIR18	Process	Insufficient change management mechanisms in place	- Downtime on update - Slow down in reaction time to incident
GIR19	Process	Lack of automation for deployment	- Downtime on update - Slow down in reaction time to incident
GIR20	Process	Lack of testing (software and infrastructure)	- Downtime on update - Slow down in reaction time to incident
GIR21	Process	Lack of enforced code review	- Downtime on update - Slow down in reaction time to incident
GIR22	Process	Lack of Security training (password hygiene, phishing attacks, ...)	Employees spill secrets
GIR23	Process	Make-shift container orchestration procedures	Failure when e.g. failover is actually needed to be performed
GIR24	Software	Third party software and vendors	Suboptimal third-party software practices
GIR25	People	Centralized knowledge	If the infrastructure knowledge is not shared across the team, this could lead to a heavy dependency on a single person

### Service Partner Specific Risk {#sec-risks-partner}

Risk related to partners and specific third-party services.

ID	Risk Group	Risk Vectors	Risk Vector Description	Relevant Mitigations
SPS0	Counterparty	General Counterparty Risk	Whenever a service is provided by a third party, the relevant risks are run by the third party, but in most cases at least some and often the bulk of the consequences for a failure will be borne by the node operator.
SPS1	Process	Exit Risk - Delinquent state	No new stake will be allocated to the Node Operator (happens automatically); the daily rewards sent to the Node Operator will be halved, with the remaining half sent towards that day’s rebase (happens automatically); reduced rewards will continue for the duration of a cooldown period long enough to determine whether, immediately after service restoration by the Node Operator, subsequently received validator exit requests are processed in a timely manner.

### Reputational Risk {#sec-risk-reputation}

Risks that impact the reputation of a Node Operator, likely to lead potential and actual customers to choose a different partner to work with.

ID	Risk Group	Risk Vectors	Risk Vector Description
RER1	Process	Mismanagement during incident	Reputation damage due to mismanagement of slashing, downtime or access loss to keys
RER2	People	Negative appearance in public	Damage to reputation due to bad behavior in public
RER3	Process	Mismanagement of Post-Incident	Reputation damage due to mismanagement of Post-slashing, -downtime or access loss to keys
RER4	Infrastructure	Withdrawal	Staking withdrawal requests cannot be met efficiently, leading to delays in payment processing causing reputational loss
RER5	Process	Poor Communication	Poor reputation or reputational damage due to insufficient operational communication and overall transparency

## Risk Mitigation Strategies {#sec-mitigation}

This Mitigation Strategies section serves as a go-to resource for node operators, providing actionable insights and mitigation options to enhance the security, reliability, and efficiency of their operations. Most of the best practices that optimize up-time, access control and general stability directly apply to operating a node properly. However, for some risks specific to operating a node, high levels of process segregation need to be achieved for mitigation to be effective.

### Risk Management {#sec-mitigations-risk-management}

A core principle for mitigating risks is to actively identify and manage the risks. This means understanding the particular risks, the likelihood of something going wrong, and the likely impact if that does occur. That information enables a Node Operator to decide what level of risk is reasonable and how to prioritise available resources to mitigate risk.

Risk management decisions need to take into account any regulation that obliges a Node Operator to meet specific benchmarks or implement specific mitigation strategies or other activities.

A first step for effective risk management is to document the potential risks, as well as the tools and processes currently in place to address those risks. Documentation needs to include an assessment of the relevant risks, an explanation of what level of risk is acceptable and why, and how each process or infrastructure component contributes to and protects against risks. This enables Node Operators to identify activities that are not contributing to the business, or that actually increase the potential risks they face. The accuracy, availability and completeness of this information is of crucial importance.

#### Assessing risks {#sec-mit-assess-risk}

A common industry approach to assessing risks is to consider the probability of an event occurring and the likely impact of that event. If these are expressed on linear numerical scales (e.g. a probability between 0 and 1, and an approximate overall financial impact), they can be multiplied to provide a simple initial ranking for the priority of mitigating each risk. Since the cost of risk mitigation varies considerably, the overall priority for addressing risk, or deciding that a given level of risk is acceptable, generally depends on comparing the risk ranking with the cost of mitigation, and available resources.

##### Best practices for assessing risk include

* Identify relevant staff and others responsible for identifying, assessing, and determining how to manage risks
* Ensure that every service, where possible, is configuration hardened. Common benchmarks such as [CIS](https://www.cisecurity.org) provide helpful guidance.
* Analyze each infrastructure component's security, availability, processing integrity, confidentiality and privacy.
* Create and continuously analyze a Software Bill of Materials [[?SBOM]].
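The probability-times-impact ranking described above can be sketched as follows. This is a minimal illustration, not part of the specification: the `Risk` class, the `worth_mitigating` rule (which assumes, as a simplification, that a mitigation removes the whole expected loss), and all of the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    risk_id: str
    probability: float      # estimated likelihood over the planning period (0..1)
    impact: float           # approximate overall financial impact if it occurs
    mitigation_cost: float  # estimated cost of mitigating the risk

    @property
    def expected_loss(self) -> float:
        # Simple ranking score: probability multiplied by impact.
        return self.probability * self.impact


def rank_risks(risks):
    """Return risks ordered by expected loss, highest first."""
    return sorted(risks, key=lambda r: r.expected_loss, reverse=True)


def worth_mitigating(risk):
    """Hypothetical rule: mitigate when the expected loss exceeds the cost."""
    return risk.expected_loss > risk.mitigation_cost


# Illustrative figures only; they are not taken from any real operator.
risks = [
    Risk("GIR6", probability=0.5, impact=1_000, mitigation_cost=2_000),
    Risk("SLS1", probability=0.125, impact=80_000, mitigation_cost=5_000),
    Risk("DOW9", probability=0.25, impact=20_000, mitigation_cost=3_000),
]

for r in rank_risks(risks):
    print(r.risk_id, r.expected_loss, worth_mitigating(r))
```

In practice the comparison with mitigation cost also depends on available resources and on any regulatory obligations, so a score like this is a starting point for discussion, not a decision rule.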

##### Risks that risk assessment can mitigate

- All risks

#### Assessing Financial Impact {#sec-mitigating-assess-risk-impact}

There are a number of factors to take into account when assessing the overall financial impact of a given risk, with the direct cost incurred as the most obvious. It is important to understand the time required to mitigate the impact of an event, and the cost that will be incurred over that time. An incident can incur a variety of costs in terms of employee time spent managing the incident, communication, and follow-up, new mitigations implemented to mitigate concrete or reputational damage such as replacement or additional infrastructure, as well as potential costs of compensation or legal costs. It is also useful to consider opportunity costs such as competitors taking advantage of an incident to promote themselves as a better alternative.

Tools to support assessing financial impact

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

Validator Penalty Simulator

##### Risks that assessing financial impact can mitigate

- All risks

#### Assessing Incident Probability {#sec-mitigating-assess-risk-probability}

Predicting the likelihood of an unexpected future event is generally difficult, and results are unlikely to precisely match the predictions. Nevertheless it is important to consider the context of a specific operation and attempt useful predictions.

##### Best practices for assessing incident probability include

* Analyzing historical data to understand past trends and incidents (external, internal incidents, and near-miss incidents)
* Reviewing industry reports for insights into common risks and their fiscal consequences in similar scenarios
* Consulting with experts in the field to gain a comprehensive perspective on risk probabilities and impacts
* Using risk assessment tools or software for a more data-driven analysis

##### Risks that assessing incident probability can mitigate

- All risks

### People Management {#sec-mitigations-manage-people}

Unless a validator system is immutable and fully automated, there will be people involved in managing it. It is therefore important that appropriate management of people is part of managing the validator node. This has an impact in various areas, from mitigating the risk of hacking by unknown parties with access to privileged roles, to the ability to provide timely incident response and minimize the damage caused by a security incident.

As well as the [Controls for People Management](#controls-for-people-management), some relevant controls are grouped with other areas, such as

- [Limit Physical Access](#req-protect-server-locations)
- [Minimize Authorization](#req-least-privilege)
- [Log Personnel Information](#req-log-personnel)

#### Identified Individuals {#sec-mit-identified-individuals}

It is important to identify individuals who have access to and can control aspects of the operations of the Validator node. While a globalized workforce can provide multiple benefits, it is difficult to hold an anonymous individual accountable. This fact is repeatedly used by large-scale hacking operations to infiltrate valuable targets with a goal of eventually using access granted willingly to rob, damage, or destroy the target.

##### Risks that identifying individuals involved in managing Validators can mitigate

* [FIN1](#risk-fin-1)
* [KEC7](#risk-kec-7)
* [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3)
* [SPS0](#risk-sps-0)
* [RER2](#risk-rer-2)

#### Training {#sec-mit-training}

It is important that individuals whose actions influence the Validator node have appropriate skills, and as the ecosystem evolves training helps maintain a relevant skillset. As well as themes specific to the individuals' tasks and Node Operator internal policies (such as this document), there are a number of areas where up-to-date skills matter, such as:

- Security practices, including protection from social engineering attacks such as phishing
- Relevant regulatory requirements, a broad topic possibly including privacy, anti-bribery, conflict of interest, and more

##### Risks that training can mitigate

* [FIN1](#risk-fin-1), [FIN7](#risk-fin-7)
* [SLS17](#risk-sls-17)
* [DOW21](#risk-dow-21)
* [GIR16](#risk-gir-16), [GIR22](#risk-gir-22), [GIR25](#risk-gir-25)
* [RER2](#risk-rer-2)

### Technology Stack {#sec-mitigations-tech-stack}

In a nutshell: technology needs to serve the business goal, not the other way around. To ensure this happens, it is important to consider both the business goals and the available technology, and then use appropriate technology to meet those goals.

#### Update Third-party Software {#sec-mit-update-software}

Updates to software components provided by third parties often address newly-discovered or longstanding vulnerabilities. It is a best practice to update software regularly, but it is important to [check for vulnerabilities](#req-check-vulnerabilities) that can be introduced by an upgrade as part of a supply-chain attack, and to verify that any customisation of open-source software, or [specific configuration](#req-check-config-on-update) options, as well as other software used by the node operator, are all compatible with an update and do not create new vulnerabilities on updating.

##### Best practices for updating software include

- "version-pinning"
- actively managing dependencies
- testing updates before automatically deploying them

##### Relevant controls for updated software

- [Controls for Development and Update](#sec-controls-updates)
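As an illustration of the "version-pinning" practice listed under this heading, the following sketch flags dependency lines that are not pinned to an exact version. It assumes, purely for the example, a pip-style dependency list in which an exact pin is written as `name==version`; the function name and the sample list are hypothetical, and other package managers express pins differently.

```python
import re

# An exact pin looks like "name==1.2.3"; ranges ("name>=1.2") and bare
# names ("name") are treated as unpinned. Assumes pip-style syntax.
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\]]+==[A-Za-z0-9_.!+\-]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return the dependency lines that do not pin an exact version."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Hypothetical dependency list, for illustration only.
example = """
# deployment tooling
requests==2.31.0
pyyaml>=6.0
web3
"""

print(unpinned_requirements(example))  # ['pyyaml>=6.0', 'web3']
```

A check like this could run in a deployment pipeline so that an unpinned dependency fails the build before it can pull an untested version into production.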

##### Risks that updated software can mitigate

* [FIN1](#risk-fin-1)
* [SLS1](#risk-sls-1), [SLS2](#risk-sls-2), [SLS3](#risk-sls-3)
* [DOW4](#risk-dow-4), [DOW12](#risk-dow-12), [DOW21](#risk-dow-21)
* [GIR4](#risk-gir-4), [GIR6](#risk-gir-6), [GIR16](#risk-gir-16), [GIR18](#risk-gir-18), [GIR21](#risk-gir-21), [GIR22](#risk-gir-22), [GIR25](#risk-gir-25)
* [KEC6](#risk-kec-6), [KEC9](#risk-kec-9)
* [SPS1](#risk-sps-1)
* [RER1](#risk-rer-1), [RER2](#risk-rer-2), [RER3](#risk-rer-3), [RER4](#risk-rer-4), [RER5](#risk-rer-5)

#### Local Anti-Slashing Database {#sec-mit-antislash-db}

To avoid double signing, validators can maintain a history of messages they signed. This data is crucial, as inconsistencies can cause a double-signing event. The data needs to be reliably persistent, and properly connected to the systems that use it. A common format for anti-slashing data is defined by [[[?EIP3076]]].
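The idea of a persistent signing history can be sketched as follows. This is a simplified illustration under stated assumptions, not the [[?EIP3076]] format or any client's implementation: it covers block proposals only, uses a deliberately conservative rule (never sign at a slot that is not strictly higher than the last recorded one), and the class name, table layout, and `0xabc` key are hypothetical.

```python
import os
import sqlite3
import tempfile


class SlashingProtectionDB:
    """Hypothetical on-disk record of the last slot signed per validator key."""

    def __init__(self, path: str):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS signed_blocks ("
            "pubkey TEXT PRIMARY KEY, last_slot INTEGER NOT NULL)"
        )
        self.conn.commit()

    def check_and_record(self, pubkey: str, slot: int) -> bool:
        """Record the slot and return True only if signing is considered safe."""
        with self.conn:  # commits on success, rolls back on an exception
            row = self.conn.execute(
                "SELECT last_slot FROM signed_blocks WHERE pubkey = ?", (pubkey,)
            ).fetchone()
            if row is not None and slot <= row[0]:
                return False  # same or older slot: refuse to sign
            self.conn.execute(
                "INSERT OR REPLACE INTO signed_blocks (pubkey, last_slot) "
                "VALUES (?, ?)",
                (pubkey, slot),
            )
        return True


# The database lives on disk so that the history survives a restart.
path = os.path.join(tempfile.mkdtemp(), "slashing.sqlite")
db = SlashingProtectionDB(path)
print(db.check_and_record("0xabc", 100))  # True: first signature for this key
print(db.check_and_record("0xabc", 100))  # False: slot already signed
print(db.check_and_record("0xabc", 101))  # True: strictly newer slot

# A process restarted against the same file still sees the history.
restarted = SlashingProtectionDB(path)
print(restarted.check_and_record("0xabc", 101))  # False
```

The check and the write happen before any signature is produced, which is why a database that is disabled, held only in memory, or deleted (as in SLS1 and SLS4) removes the protection entirely.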

Tools to support anti-slashing databases

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

##### Risks that a local anti-slashing database can mitigate

* [SLS1](#risk-sls-1), [SLS2](#risk-sls-2), [SLS3](#risk-sls-3), [SLS4](#risk-sls-4), [SLS17](#risk-sls-17), [SLS18](#risk-sls-18), [SLS19](#risk-sls-19)

#### Signature Management {#sec-mit-signature-management}

Tools that manage signatures for transactions generally provide a workflow that includes passive and active protection against a variety of risks. Using these tools helps minimise the chances that a signature is given without checking what is being signed, and helps ensure that risk-bearing transactions require appropriate authorization. Properly configured signature management tools also provide the ability to recover, or mitigate any problems, in the case where a transaction was not completed.

As well as the use of various kinds of "multi-sig", which can include simple requirements for multiple signatures, or incorporate such techniques as multi-party computation ("MPC") or the like, signature management tools can include automated verification steps in the process of authorizing a transaction.

Tools to support signature management

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

##### Risks that signature management can mitigate

* [FIN3](#risk-fin-3), [FIN4](#risk-fin-4)
* [SLS1](#risk-sls-1), [SLS2](#risk-sls-2), [SLS3](#risk-sls-3), [SLS4](#risk-sls-4), [SLS5](#risk-sls-5), [SLS14](#risk-sls-14), [SLS15](#risk-sls-15)
* [KEC1](#risk-kec-1), [KEC6](#risk-kec-6), [KEC9](#risk-kec-9)
* [HCK3](#risk-hck-3)
* [GIR7](#risk-gir-7)

#### Client Diversity {#sec-mit-client-diversity}

A diverse set of clients for different protocols can reduce "blast radius" in a case where one client has a protocol error or other bug. This can be especially important if the bug causes a chain split. A common scenario is when an upgrade introduces a problem. The ability to migrate relevant keys to a different client, if a specific client error is observed, provides an important layer of protection.

In addition, maintaining client diversity helps ensure that the network as a whole does so, ideally providing real protection against a vulnerability present in a single version of a single client by ensuring that particular version does not dominate the network.

Note that there is often a different range of clients available at different levels of the infrastructure. For example in Ethereum, it is possible to run different clients on each of the Execution and Consensus layers.

##### Best practice for client diversity includes

- Running multiple Execution and Consensus clients. See also [[[?ETHdiverse]]] [[?ETHdiverse]]
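One way to keep an eye on client diversity within an operator's own fleet is to compute each client's share of nodes and flag any client above a chosen limit. The sketch below is illustrative only: the node and client names are placeholders, and the 50% default limit is an assumption for the example, not a threshold defined by this specification.

```python
from collections import Counter


def client_shares(assignments: dict[str, str]) -> dict[str, float]:
    """Map each client name to its fraction of the nodes in `assignments`."""
    counts = Counter(assignments.values())
    total = sum(counts.values())
    return {client: n / total for client, n in counts.items()}


def dominant_clients(assignments: dict[str, str], max_share: float = 0.5) -> list[str]:
    """Return clients whose share of nodes exceeds `max_share` (assumed limit)."""
    shares = client_shares(assignments)
    return sorted(c for c, share in shares.items() if share > max_share)


# Hypothetical fleet: node name -> consensus client it runs.
fleet = {
    "node-1": "client-a",
    "node-2": "client-a",
    "node-3": "client-a",
    "node-4": "client-b",
}

print(client_shares(fleet))     # {'client-a': 0.75, 'client-b': 0.25}
print(dominant_clients(fleet))  # ['client-a']
```

The same calculation can be run separately for each layer, for example once for Execution clients and once for Consensus clients.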

##### Risks that client diversity can mitigate

* [SLS6](#risk-sls-6), [SLS7](#risk-sls-7), [SLS20](#risk-sls-20)
* [DOW2](#risk-dow-2), [DOW19](#risk-dow-19), [DOW21](#risk-dow-21)

#### Delinquent State {#sec-mit-delinquent-state}

Node operators need to withdraw validators correctly, as they can otherwise be put into a delinquent state. This can result in direct penalties, or an opportunity cost realised as monetary losses.

##### Risks that handling delinquent state can mitigate

* [SPS1](#risk-sps-1)

### Information and Secret Management {#sec-mitigations-secret-management}

Information management can mitigate many risks. One aspect is the management of highly confidential information, such as the management of signing keys or withdrawal keys, but it is also important to manage operational information.

#### Controlled and Audited Secret Access {#sec-mit-control-secret-access}

Best practice for credential management is to use a [Single Sign-On](https://en.wikipedia.org/wiki/Single_sign-on) system that gives users authorised access to secrets through e.g. [certificates](https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Certificate-based_Authentication), and/or [vault mechanisms](https://developer.hashicorp.com/vault/docs/secrets/ssh/signed-ssh-certificates). In this way, access is audited, and anomaly detection can be activated for those vaults. Using [=multi-sig=] wallets requiring authorization from multiple parties for specific actions helps to ensure both that relevant access is monitored and that it is correctly controlled.

##### Risks that secret access management can mitigate

* [FIN1](#risk-fin-1)
* [SLS5](#risk-sls-5)
* [KEC1](#risk-kec-1), [KEC6](#risk-kec-6), [KEC9](#risk-kec-9)
* [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK4](#risk-hck-4), [HCK6](#risk-hck-6)
* [GIR25](#risk-gir-25)

#### Encrypted Data {#sec-mit-encrypt-data}

Many different components interplay while a staking operation is going on. If confidential information is not protected by encryption, it can be intercepted and read during transmission. There is also a risk of accidental or malicious leaking of stored information, which can be somewhat mitigated if that information is stored in encrypted form. It is therefore crucial to ensure that confidential data is only stored and transmitted in an encrypted state.

##### Risks that data encryption can mitigate

* [KEC1](#risk-kec-1), [KEC6](#risk-kec-6)
* [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK4](#risk-hck-4), [HCK6](#risk-hck-6)
* [GIR10](#risk-gir-10), [GIR17](#risk-gir-17)
* [RER1](#risk-rer-1), [RER2](#risk-rer-2)

#### Cold Storage {#sec-mit-cold-storage}

Cold Storage, in particular "air-gapped" storage, can help protect information not used often such as withdrawal keys, private key generation materials, and the like, by making it more difficult for malicious entities to access the information and by reducing the chance that it will be leaked in the event of accidentally publishing data.

##### Risks that cold storage can mitigate * [KEC1](#risk-kec-1), [KEC6](#risk-kec-6), [KEC9](#risk-kec-9) * [GIR10](#risk-gir-10), [GIR17](#risk-gir-17)

#### Key Management {#sec-mit-key-management}

Operating a node normally entails the use of a range of keys, such as

* Keys used by signature management tools
* A vault
* SSH keys
* API keys for cloud infrastructure

It is important to protect private keys from accidental or malicious misuse, and in particular unplanned deletion. It is not normal to provide broad access to unencrypted signing keys.

##### Best practices for key management include

- following relevant standards such as [[[?CCSS]]] and [[[?KMS]]],
- ensuring that no single individual has the capability to access or delete them,
- having backups with strong access control,
- actively managing access to keys and key material, and
- "key rotation", i.e. periodic changes of keys as well as rapid managed changes if a data breach occurs.

Modern vault systems enable the enforcement of policies to ensure that access to keys is only available with verified roles, and deletion is managed according to established protocols.
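The key rotation practice listed above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical inventory of key records with creation timestamps; the `KeyRecord` type, the key type names, and the maximum ages are illustrative assumptions, not recommendations of this specification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical rotation policy: maximum age per key type (illustrative values only).
MAX_KEY_AGE = {
    "ssh": timedelta(days=90),
    "cloud-api": timedelta(days=30),
}


@dataclass(frozen=True)
class KeyRecord:
    key_id: str
    key_type: str
    created_at: datetime


def keys_due_for_rotation(records, now):
    """Return the IDs of keys older than the maximum age for their type."""
    due = []
    for record in records:
        max_age = MAX_KEY_AGE.get(record.key_type)
        # Keys of an unknown type are flagged so that they get reviewed.
        if max_age is None or now - record.created_at > max_age:
            due.append(record.key_id)
    return due


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
inventory = [
    KeyRecord("ssh-ops-1", "ssh", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    KeyRecord("api-deploy", "cloud-api", datetime(2024, 5, 20, tzinfo=timezone.utc)),
]
due_keys = keys_due_for_rotation(inventory, now)
```

In this example only `ssh-ops-1` is reported, because it is older than the assumed 90-day limit. In practice such a check would more likely be enforced by the vault system itself than by a separate script.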

Tools to support key management

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

##### Risks that key management can mitigate * [FIN1](#risk-fin-1) * [SLS1](#risk-sls-1), [SLS3](#risk-sls-3), [SLS5](#risk-sls-5) * [KEC1](#risk-kec-1), [KEC6](#risk-kec-6), [KEC7](#risk-kec-7), [KEC9](#risk-kec-9) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4), [HCK6](#risk-hck-6) * [GIR6](#risk-gir-6), [GIR7](#risk-gir-7), [GIR14](#risk-gir-14), [GIR16](#risk-gir-16), [GIR18](#risk-gir-18)

#### Operational Information Management {#sec-mit-operational-info-management}

Node Operators are likely to rely on a wide range of operational information, including internal procedures, understanding of software configurations, plans for future development, and employee management. Best practice includes ensuring there is no single point of failure due to centralized information being held by a single external provider or only being known to a single employee.

Documentation, even if rarely actively read by those responsible for operations (who presumably know their job), is important for many reasons, including

- to enable onboarding new employees and service partners, or helping employees take on new roles
- to ensure smooth continued operation in the case that a key employee's role changes, particularly where they leave the organisation
- to enable accurate reporting as necessary
- to enable monitoring of operations and investigation of security incidents and other failures

##### Risks that operational information management can mitigate: * [FIN2](#risk-fin-2), [FIN3](#risk-fin-3), [FIN4](#risk-fin-4), [FIN6](#risk-fin-6), [FIN7](#risk-fin-7) * [SLS3](#risk-sls-3), [SLS4](#risk-sls-4), [SLS5](#risk-sls-5), [SLS14](#risk-sls-14) * [DOW1](#risk-dow-1), [DOW2](#risk-dow-2), [DOW3](#risk-dow-3), [DOW4](#risk-dow-4), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW21](#risk-dow-21) * [KEC1](#risk-kec-1), [KEC6](#risk-kec-6), [KEC7](#risk-kec-7), [KEC9](#risk-kec-9) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4) * [GIR4](#risk-gir-4), [GIR15](#risk-gir-15), [GIR16](#risk-gir-16), [GIR18](#risk-gir-18), [GIR19](#risk-gir-19), [GIR20](#risk-gir-20), [GIR21](#risk-gir-21), [GIR22](#risk-gir-22), [GIR25](#risk-gir-25) * [SPS0](#risk-sps-0), [SPS1](#risk-sps-1) * [RER1](#risk-rer-1), [RER3](#risk-rer-3), [RER5](#risk-rer-5)

#### Deletion protection {#sec-mit-deletion-protection}

Loss of important information, especially loss of control over keys, can have a crippling impact. It is important to have mechanisms to protect against, and recover from, unintentional or malicious deletion of important data. Best practice includes having journaled backups of important information.
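The idea behind journaled backups can be illustrated with a minimal sketch: every change, including a deletion, is appended to a journal rather than overwriting earlier data, so an earlier state can be rebuilt. The `JournaledStore` class and the stored names are hypothetical, and a real system would also need to protect the journal itself against tampering and loss.

```python
from dataclasses import dataclass, field


@dataclass
class JournaledStore:
    """Append-only journal: every write or delete is recorded, never overwritten."""
    journal: list = field(default_factory=list)

    def put(self, name, value):
        self.journal.append(("put", name, value))

    def delete(self, name):
        # A deletion is recorded as a new entry; earlier entries are kept.
        self.journal.append(("delete", name, None))

    def state_at(self, version):
        """Rebuild the stored data as it was after the first `version` entries."""
        state = {}
        for operation, name, value in self.journal[:version]:
            if operation == "put":
                state[name] = value
            else:
                state.pop(name, None)
        return state


store = JournaledStore()
store.put("validator-config", "v1")
store.put("validator-config", "v2")
store.delete("validator-config")  # accidental or malicious deletion
current = store.state_at(len(store.journal))
recovered = store.state_at(2)  # the state just before the deletion
```

Here `current` no longer contains the configuration, while `recovered` still holds the last value written before the deletion.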

##### Risks that deletion protection can mitigate: * [FIN6](#risk-fin-6) * [SLS4](#risk-sls-4) * [DOW1](#risk-dow-1), [DOW2](#risk-dow-2), [DOW3](#risk-dow-3), [DOW4](#risk-dow-4), [DOW5](#risk-dow-5), [DOW6](#risk-dow-6), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14), [DOW20](#risk-dow-20) * [KEC6](#risk-kec-6), [KEC9](#risk-kec-9) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4) * [GIR4](#risk-gir-4), [GIR13](#risk-gir-13) * [RER1](#risk-rer-1), [RER3](#risk-rer-3)

### Access Controls and Access Management {#sec-mitigations-access-management} Access Control covers physical access to devices and facilities, the ability to connect to servers through networks, and the ability to perform specific tasks, such as getting answers to requests. The core principle to follow in granting authorization is [=Least Privilege=]. This is generally achieved by using some form of [=Role-Based Access Control=], in combination with an inventory of assets and services, to ensure that only those who need access are granted that access, and that it is revoked as soon as appropriate. Tracking this information is important to ensure that access can be audited and verified.

Tools to support access control

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

#### Access control helps address the following risks * [FIN1](#risk-fin-1) * [DOW7](#risk-dow-7), [DOW16](#risk-dow-16) * [GIR1](#risk-gir-1), [GIR7](#risk-gir-7), [GIR9](#risk-gir-9), [GIR16](#risk-gir-16), [GIR22](#risk-gir-22) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4) * [KEC2](#risk-kec-2), [KEC4](#risk-kec-4) * [SPS0](#risk-sps-0)

#### Least Privilege {#sec-mit-least-privilege}

The core of Least Privilege is that access is only granted to those who need it, and only for as long as it is relevant. This means that an individual user's privileges are likely to change over time, and in particular that any offboarding process includes a rapid revocation of the user's assigned roles.

Almost all Least Privilege implementation is managed through Role-based Access Control (commonly known as "RBAC"), where a set of roles is defined according to the tasks they need to perform, and access rights are based on holding a particular role, with individual users assigned relevant roles that are revoked or deliberately renewed on a timely basis. It is important to ensure that individuals can fulfil their designated tasks, without having authorizations they do not need.

##### Best practices for access control include

* A [Single Sign-On](https://en.wikipedia.org/wiki/Single_sign-on) mechanism that allows rapid assigning and revoking of roles
* Authentication tokens that have a limited lifetime
* Regular review of roles and permissions for both users and software
* Disabling privilege escalation mechanisms (e.g. [executing as root user in Docker](https://docs.docker.com/engine/reference/commandline/container_exec/) with `docker exec -uroot`, or [impersonation in Keycloak](https://github.com/keycloak/keycloak/blob/main/docs/documentation/server_admin/topics/users/con-user-impersonation.adoc))
* Use of roles on the API endpoint level to determine the correct authorization.
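The role-based approach described above can be sketched as follows. This is a minimal illustration under stated assumptions: the role names, permission strings, and the `RoleAssignment` type are hypothetical, and a production system would normally delegate these checks to an identity provider rather than implement them directly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical roles and the permissions each role carries.
ROLE_PERMISSIONS = {
    "node-operator": {"node:restart", "node:read-logs"},
    "auditor": {"node:read-logs"},
}


@dataclass(frozen=True)
class RoleAssignment:
    user: str
    role: str
    expires_at: datetime


def is_allowed(assignments, user, permission, now):
    """Allow an action only through a role that is assigned and not yet expired."""
    for assignment in assignments:
        if assignment.user != user or assignment.expires_at <= now:
            continue
        if permission in ROLE_PERMISSIONS.get(assignment.role, set()):
            return True
    return False


def offboard(assignments, user):
    """Revoke every role held by a user, e.g. as part of an offboarding process."""
    return [a for a in assignments if a.user != user]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
assignments = [
    RoleAssignment("alice", "node-operator", now + timedelta(days=30)),
    RoleAssignment("bob", "auditor", now - timedelta(days=1)),  # lapsed, not renewed
]
```

Because every assignment carries an expiry time, access lapses by default unless a role is deliberately renewed, and `offboard` removes all of a user's roles in one step.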

Tools to support least privilege control

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

##### Risks that least privilege can mitigate * [FIN1](#risk-fin-1) * [GIR7](#risk-gir-7), [GIR9](#risk-gir-9), [GIR16](#risk-gir-16), [GIR22](#risk-gir-22), [GIR25](#risk-gir-25) * [KEC7](#risk-kec-7) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4), [HCK6](#risk-hck-6) * [SPS0](#risk-sps-0)

#### Employee Authorization Management {#sec-mit-employee-auth-management}

Ensuring that employees whose roles have changed do not have lingering credentials reduces the risk that they or others can misuse those credentials to cause harm.

##### Best practices for the employee authorization process include

- ensuring authorization changes are automated as part of management of employee lifecycles, covering role changes as well as termination, transfer, and promotion procedures

##### Risks that employee authorization process can mitigate * [FIN1](#risk-fin-1) * [DOW4](#risk-dow-4) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4) * [GIR4](#risk-gir-4), [GIR7](#risk-gir-7), [GIR13](#risk-gir-13), [GIR25](#risk-gir-25) * [SPS0](#risk-sps-0)

#### Managed Network Access to Nodes {#sec-mit-manage-network-access}

Following the principles of defense in depth and [**least privilege**](#def-least-privilege), it is important that nodes are not directly accessible without permission, and that they do not leak information to the Web that can help malicious parties gain unauthorized access.

##### Best practices for managed network access include

* An internal virtual private network with only well-defined endpoints accessible from the web
* A load-balancer that has a firewall
* Disabling meta-data serving through public endpoints (e.g. port scans, or what server is running in what version)
* Limits on outbound traffic of a node that runs a certain service
* Rate limits to ensure that internal services cannot unintentionally DDoS each other
* Requiring explicit authorization of external access capability
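As one illustration of the rate limits mentioned above, the following sketch shows a token bucket, a common rate-limiting technique. It is not prescribed by this specification, and the class name, rate, and capacity are hypothetical values chosen for the example.

```python
class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last_refill = now

    def allow(self, now):
        # Add the tokens earned since the last call, capped at the bucket capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=1.0, capacity=2)
burst = [bucket.allow(now=0.0) for _ in range(3)]  # third request exceeds the burst size
after_wait = bucket.allow(now=1.0)                 # one token has been refilled
```

Passing the current time in explicitly keeps the sketch deterministic; a deployed limiter would typically read a monotonic clock and keep one bucket per calling service.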

##### Risks that managed network access can mitigate * [HCK4](#risk-hck-4) * [GIR9](#risk-gir-9), [GIR17](#risk-gir-17)

#### Authentication Policies {#sec-mit-auth-policies} Best practice is to use password and related authentication policies to ensure that access control mechanisms are sufficiently strong at every layer of the infrastructure. This can include appropriate requirements for the strength of passwords and the use of Multi-Factor Authentication as well as [=Multi-Sig=] requirements.

##### Risks that authentication policy can mitigate * [FIN1](#risk-fin-1) * [DOW4](#risk-dow-4) * [GIR6](#risk-gir-6), [GIR7](#risk-gir-7), [GIR9](#risk-gir-9), [GIR13](#risk-gir-13), [GIR17](#risk-gir-17), [GIR22](#risk-gir-22), [GIR25](#risk-gir-25) * [KEC1](#risk-kec-1), [KEC7](#risk-kec-7) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK4](#risk-hck-4), [HCK5](#risk-hck-5), [HCK6](#risk-hck-6) * [SPS0](#risk-sps-0)

### Managing Hardware {#sec-mitigations-environment}

Physical devices are subject to physical changes, including environmental issues such as temperature extremes that can cause damage, and utility failures such as power or internet failure.

#### Managed Physical Access {#sec-mit-manage-physical-access}

This covers all physical devices that can access the Node, as well as all areas in which such devices are kept, whether "on-premises", distributed, hosted by a third party, or remote mobile devices such as laptops.

Best practice for managing physical access includes ensuring that authorization is only granted as necessary, following the principles of [Least Privilege](#def-least-privilege). Generally this means some devices are physically segregated in areas where access is restricted according to function. Note that this covers the use of devices authorized to access the networks that nodes operate on, and is particularly important for devices authorized to access management and analytical functions of nodes.

Ideally all physical access to premises and facilities is monitored, both to deter piggybacking and to determine whether the facility is subject to it. This term refers to the situation where an unauthorized entrant is allowed in by someone who has a valid authorization for themselves. In the context of remote operators' access through a computer, controlling this is particularly challenging in practice.

[=Piggybacking=] can occur inadvertently, through politely holding a door for someone without checking that they have current valid authorization to enter; negligently, by allowing someone to enter for a legitimate purpose despite knowing that person does not have valid authorization; or maliciously, by allowing someone to enter knowing that their purpose is nefarious.

In the inadvertent case, relevant mitigations include

- ensuring that all those with authorization understand the necessity to enforce physical access control,
- providing simple and effective ways to check authorization,
- ensuring that remote access devices as far as possible are dedicated to the defined purposes (rather than allowing the use of general-purpose laptops that could be attacked when being used for a different task such as general email, or playing games).

To minimize negligently allowed access, it is important to ensure that access systems are effectively maintained and managed, so that there is no good reason to allow an unauthorized person access. This can range from the design of onboarding systems to the effectiveness of internal management feedback systems for discovering unanticipated problems faced by operators.

Best practice includes managing physical access with systems that can efficiently enable access to authorised parties (keycards, biometric scanners), and monitor actual access, for example through visual verification that the authorized party is the one entering. It is important to log and audit access sufficiently frequently to detect problems - see also [Monitoring](#sec-mitigations-monitoring).

##### Risks that managed physical access can mitigate * [DOW3](#risk-dow-3), [DOW4](#risk-dow-4), [DOW7](#risk-dow-7), [DOW9](#risk-dow-9), [DOW10](#risk-dow-10) * [GIR9](#risk-gir-9), [GIR17](#risk-gir-17) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4), [HCK5](#risk-hck-5), [HCK6](#risk-hck-6)

#### Physically Distributed Infrastructure {#sec-mit-distribute-hardware}

A single validator represents a single point of failure that can introduce slashing or downtime risks. [[?DVT]] (Distributed Validator Technology) provides an approach to mitigating this problem, by distributing the keys and the hardware that runs validation in such a way that multiple clients physically located in different places share the task of validation. Thus if a single client or a small number of them fail, the overall validation is unaffected. (Note that while the Ethereum Foundation provides a specific technical specification for DVT that has been implemented, the principles can be implemented in different ways.) Likewise, maintaining multiple validators running on separate hardware and software can increase resilience to a failure in any one platform.
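The fault-tolerance property described above can be sketched with a simple threshold check. This is an illustration only, assuming a hypothetical cluster in which any `threshold` of the participating clients can jointly complete a validation duty; the cluster size, threshold, and client names are invented for the example and are not taken from any DVT specification.

```python
def can_validate(online_clients, threshold):
    """Validation can proceed while at least `threshold` distinct clients are online."""
    return len(set(online_clients)) >= threshold


def tolerated_failures(total_clients, threshold):
    """Number of clients that can fail without interrupting validation."""
    return total_clients - threshold


# Hypothetical cluster of four clients in different locations, three of which must cooperate.
cluster = ["frankfurt", "singapore", "toronto", "sydney"]
threshold = 3
one_down = can_validate(cluster[:3], threshold)  # a single client has failed
two_down = can_validate(cluster[:2], threshold)  # too many clients have failed
```

With these assumed values the cluster tolerates the loss of one client, but not two.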

##### Risks that distributed infrastructure can mitigate * [SLS1](#risk-sls-1) * [DOW1](#risk-dow-1), [DOW5](#risk-dow-5), [DOW6](#risk-dow-6), [DOW7](#risk-dow-7), [DOW9](#risk-dow-9) * [KEC6](#risk-kec-6) * [HCK6](#risk-hck-6)

#### Protection against Utility Failure {#sec-mit-protect-utilities}

To ensure that a local utility failure does not impact a validator, it is useful to have redundant systems: for power, a backup supply e.g. through local batteries or power generation; and for connectivity, more than one channel, e.g. a physical connection such as fibre-optic cable together with one or more modes of wireless connection. The level of mitigation that is appropriate depends on the level of risk, and the costs of both failure and mitigating failure. These calculations mean economies of scale often enable larger-scale operations to be more robust than smaller ones, for a given price.
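The trade-off described above can be made concrete with simple expected-loss arithmetic. The sketch below is illustrative only: the function names, outage frequencies, losses, and costs are all hypothetical figures, chosen to show why the same backup system can pay off for a larger operation but not for a smaller one.

```python
def expected_annual_loss(outages_per_year, loss_per_outage):
    """Expected yearly loss from utility outages."""
    return outages_per_year * loss_per_outage


def mitigation_pays_off(outages_per_year, residual_outages_per_year,
                        loss_per_outage, annual_mitigation_cost):
    """True if the reduction in expected loss exceeds the yearly cost of the mitigation."""
    before = expected_annual_loss(outages_per_year, loss_per_outage)
    after = expected_annual_loss(residual_outages_per_year, loss_per_outage)
    return before - after > annual_mitigation_cost


# Hypothetical figures: a backup system costing 600 per year
# reduces effective outages from 2 to 0.5 per year.
large_operator = mitigation_pays_off(2, 0.5, loss_per_outage=500, annual_mitigation_cost=600)
small_operator = mitigation_pays_off(2, 0.5, loss_per_outage=200, annual_mitigation_cost=600)
```

With these assumed numbers the larger operator avoids 750 of expected loss per year for a cost of 600, while the smaller operator would avoid only 300.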

##### Risks that protection against utility failure threats can mitigate * [DOW6](#risk-dow-6), [DOW7](#risk-dow-7), [DOW9](#risk-dow-9), [DOW15](#risk-dow-15)

#### Protection against Environmental Threat {#sec-mit-protect-from-environment} It is also important to ensure that facilities have appropriate protection from relevant environmental risks such as fire, flooding, extreme wind, as well as earthquakes and destructive physical attacks. Appropriate mitigations will depend in part on the specific location and nature of the facility, but will generally revolve around siting of facilities, their architecture, and specific measures to ensure resilience.

##### Risks that protection against environmental threats can mitigate * [SLS14](#risk-sls-14), [SLS15](#risk-sls-15) * [DOW1](#risk-dow-1), [DOW5](#risk-dow-5), [DOW6](#risk-dow-6), [DOW7](#risk-dow-7), [DOW9](#risk-dow-9), [DOW15](#risk-dow-15)

#### Manage Equipment Lifecycle {#sec-mit-manage-equipment-life}

The lifecycle of equipment, most particularly node servers and computers used to access and manage them, is a determinant of overall security.

##### Best practices for lifecycle management include

- a capability to remotely pause, shut down, and wipe devices clean

[Monitoring](#sec-mitigations-monitoring) can also identify specific conditions that adversely affect equipment and suggest that a lifecycle plan needs adjustment - whether writing off equipment destroyed by fire, or increasing preventive maintenance for physical access systems that are being used far in excess of expectations that drove the existing maintenance plan.

##### Risks that equipment lifecycle management can mitigate * [DOW3](#risk-dow-3), [DOW6](#risk-dow-6), [DOW7](#risk-dow-7), [DOW9](#risk-dow-9), [DOW15](#risk-dow-15), [DOW20](#risk-dow-20), [DOW21](#risk-dow-21) * [KEC1](#risk-kec-1), [KEC6](#risk-kec-6) * [HCK4](#risk-hck-4), [HCK6](#risk-hck-6)

### Software Development and Update Process {#sec-mitigations-development-and-updates}

#### Secure Development Lifecycle {#sec-mit-ssdlc}

A secure development lifecycle helps ensure that vulnerabilities are not introduced to codebases and subsequently deployed.

##### Best practices for secure development lifecycle include

- auditable version control systems
- thorough testing and authorisation before changes are accepted

##### Risks that secure development lifecycle can mitigate * [FIN3](#risk-fin-3), [FIN4](#risk-fin-4) * [SLS1](#risk-sls-1), [SLS2](#risk-sls-2), [SLS3](#risk-sls-3), [SLS4](#risk-sls-4), [SLS5](#risk-sls-5), [SLS6](#risk-sls-6), [SLS7](#risk-sls-7), [SLS14](#risk-sls-14), [SLS15](#risk-sls-15), [SLS17](#risk-sls-17), [SLS19](#risk-sls-19) * [DOW2](#risk-dow-2), [DOW6](#risk-dow-6), [DOW10](#risk-dow-10), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13) * [KEC7](#risk-kec-7) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4), [HCK5](#risk-hck-5), [HCK6](#risk-hck-6) * [GIR6](#risk-gir-6), [GIR7](#risk-gir-7), [GIR9](#risk-gir-9), [GIR10](#risk-gir-10), [GIR13](#risk-gir-13), [GIR15](#risk-gir-15), [GIR17](#risk-gir-17), [GIR21](#risk-gir-21) * [SPS1](#risk-sps-1)

#### Comprehensive Testing for Changes to Code {#sec-mit-code-testing}

A comprehensive test suite helps ensure changes do not introduce new vulnerabilities or situations that lead to operational failures. Equally, it is important that code changes are reviewed by someone other than the developer who produced them. Static and dynamic analysis are important, as well as user testing wherever changes impact user interface or user-generated content. Measuring test coverage, and requiring new tests that are reviewed as part of code review, help ensure that coverage is sufficiently comprehensive to detect errors that can arise through later changes.

##### Best practices for testing code changes include

- incorporating static and dynamic testing in the integration pipeline for code development.

##### Risks that testing and code review can mitigate * [FIN3](#risk-fin-3), [FIN4](#risk-fin-4) * [SLS1](#risk-sls-1), [SLS2](#risk-sls-2), [SLS3](#risk-sls-3), [SLS4](#risk-sls-4), [SLS5](#risk-sls-5), [SLS6](#risk-sls-6), [SLS7](#risk-sls-7), [SLS14](#risk-sls-14), [SLS15](#risk-sls-15), [SLS17](#risk-sls-17), [SLS18](#risk-sls-18), [SLS19](#risk-sls-19) * [DOW2](#risk-dow-2), [DOW6](#risk-dow-6), [DOW10](#risk-dow-10), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14), [DOW19](#risk-dow-19), [DOW20](#risk-dow-20) * [KEC7](#risk-kec-7) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4), [HCK5](#risk-hck-5), [HCK6](#risk-hck-6) * [GIR6](#risk-gir-6), [GIR7](#risk-gir-7), [GIR9](#risk-gir-9), [GIR10](#risk-gir-10), [GIR11](#risk-gir-11), [GIR13](#risk-gir-13), [GIR15](#risk-gir-15), [GIR17](#risk-gir-17), [GIR18](#risk-gir-18), [GIR21](#risk-gir-21), [GIR23](#risk-gir-23), [GIR24](#risk-gir-24) * [SPS1](#risk-sps-1)

#### Validated Inputs and Outputs {#sec-mit-validate-inputs-outputs}

Unchecked inputs are a major vector for a range of attacks. These include

- brute force authorization, or denial of service (including DDoS) attacks, often identifiable by a high rate of failing requests using inputs with minimal variation
- overflow attacks, where excessive input causes a problem, generally mitigated by programming practices or overflow-safe languages
- targeted efforts to inject code that executes functionality that should not be authorized, or causes an adverse system reaction including a crash

Ideally, the load balancer in front of the node filters out all traffic with payloads that cause overflow. Additionally, it is important to validate inputs against the relevant parameters, particularly where these allow a range of functionalities to be triggered.

##### Best practices for input and output validation include

- using a data schema such as [JSON schema](https://json-schema.org) with [schema evolution techniques](https://en.wikipedia.org/wiki/Schema_evolution),
- defining minimum and maximum input sizes and MIME types.
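The validation practices above can be sketched using only the Python standard library. This is a simplified stand-in for a full JSON Schema validator: the size limit, the permitted MIME type, and the expected fields are hypothetical values for an illustrative request, not requirements of this specification.

```python
import json

# Hypothetical limits and expected fields for an illustrative request payload.
MAX_BODY_BYTES = 1024
ALLOWED_MIME_TYPES = {"application/json"}
REQUIRED_FIELDS = {"method": str, "id": int}


def validate_request(content_type, body):
    """Return a list of validation errors; an empty list means the input is accepted."""
    if content_type not in ALLOWED_MIME_TYPES:
        return ["unsupported MIME type"]
    if not 0 < len(body) <= MAX_BODY_BYTES:
        return ["body size outside the permitted range"]
    try:
        payload = json.loads(body)
    except ValueError:
        return ["body is not valid JSON"]
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        # Exact type comparison, so that a boolean is not accepted as an integer.
        if type(payload.get(name)) is not expected_type:
            errors.append(f"field '{name}' is missing or has the wrong type")
    # Reject unexpected fields rather than silently passing them on.
    for name in sorted(payload.keys() - REQUIRED_FIELDS.keys()):
        errors.append(f"unexpected field '{name}'")
    return errors
```

For example, `validate_request("application/json", b'{"method": "status", "id": 1}')` returns an empty list, while an oversized body or an unknown field produces an error entry. Checking size and MIME type before parsing means that malformed or excessive input is rejected as cheaply as possible.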

Tools to support input and output validation

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

* ajv
* Apache Ranger
* In the Apache web server, control the request sizes of different pieces of the request:
  - LimitRequestBody
  - LimitRequestFields
* ORM systems exist for almost all programming languages and frameworks, such as
* validatorjs

##### Risks that input checking can mitigate * [DOW10](#risk-dow-10) * [KEC6](#risk-kec-6), [KEC7](#risk-kec-7), [KEC9](#risk-kec-9) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4), [HCK5](#risk-hck-5)

### Manage Software Updates {#sec-mitigations-manage-updates} Updating software is a major risk vector. Good processes for software development and managing the deployment of updates are important to mitigate some of this risk. As well as having control over the update process, it is important to have the capacity to revert to a known environment in an emergency where an update has been found to introduce unexpected problems. #### Avoid Customizing Third-party Software {#sec-mit-minimize-customizing-software} Validator software, and other software validators use, is very often open source. However, customising software can introduce errors. In addition customizations can produce incompatibilities when software is updated. This means that any customization introduces a need for continued extra testing, in particular whenever relevant software is updated. Customization also increases the risk that test coverage is inadequate, meaning a future error will not be found in pre-deployment testing and only discovered through a failure operating in production, with attendant risks of reputational damage, direct losses, and increased cost for incident management.

##### Risks that not customising third-party software can mitigate * [SLS5](#risk-sls-5), [SLS7](#risk-sls-7) * [DOW2](#risk-dow-2), [DOW13](#risk-dow-13), [DOW19](#risk-dow-19), [DOW20](#risk-dow-20), [DOW21](#risk-dow-21) * [HCK2](#risk-hck-2), [HCK3](#risk-hck-3) * [GIR3](#risk-gir-3), [GIR16](#risk-gir-16) * [RER1](#risk-rer-1), [RER4](#risk-rer-4)

#### Configuration Management {#sec-mit-configuration-management}

It is important to manage the configuration of hardware and software. A minimal profile helps reduce the possible attack surface, while minimising, and carefully tracking, customisation is important to ensure smooth and safe upgrades. Software configurations to manage include, among others:

* Firewall configurations
* Docker image setups
* Container orchestration configurations
* Database configurations
* Webserver/Load balancer configurations

Tools to support configuration management

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

* CIS benchmarks
* CoGuard
* Using Git to manage configurations
* Liquibase

##### Risks that managing configuration can mitigate * [SLS1](#risk-sls-1), [SLS3](#risk-sls-3), [SLS4](#risk-sls-4), [SLS5](#risk-sls-5), [SLS6](#risk-sls-6) * [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW21](#risk-dow-21) * [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK6](#risk-hck-6) * [GIR3](#risk-gir-3), [GIR4](#risk-gir-4)

#### Protection against Supply-chain Malware {#sec-mit-protect-against-malware}

Protection against malware needs to be implemented on all assets, and users need to exercise proper caution.

##### Best practices for protecting against supply-chain malware include

- regularly checking the latest [CVE entries](https://cve.mitre.org), to cover all software tools used,
- specifically checking for any announcements of vulnerabilities before upgrading any software component.

Tools to support supply-chain protection

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

Trivy

##### Risks that protection against supply-chain malware can mitigate * All [Slashing Risks](#sec-risks-slashing) * [DOW2](#risk-dow-2), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW14](#risk-dow-14) * [KEC6](#risk-kec-6), [KEC7](#risk-kec-7), [KEC9](#risk-kec-9) * [HCK2](#risk-hck-2), [HCK4](#risk-hck-4), [HCK6](#risk-hck-6) * [GIR15](#risk-gir-15), [GIR17](#risk-gir-17) * [SPS0](#risk-sps-0)

#### Deployment testing environments {#sec-mit-test-predeployment}

Use separate test and staging environments. This minimizes the potential blast radius of a faulty change. It is important to run any change (even an update of validator software or Web3Signer) through a test environment first, to maximize the likelihood that any errors can be discovered before they impact a production environment.

Tools to support deployment testing

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

The "Blue-Green Deployment pattern" [[[?WikipediaBG]]] [[?WikipediaBG]]

##### Risks that deployment testing can mitigate * [SLS6](#risk-sls-6), [SLS7](#risk-sls-7), [SLS14](#risk-sls-14) * [DOW2](#risk-dow-2), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14), [DOW20](#risk-dow-20), [DOW21](#risk-dow-21) * [GIR11](#risk-gir-11), [GIR13](#risk-gir-13), [GIR18](#risk-gir-18), [GIR20](#risk-gir-20), [GIR21](#risk-gir-21)

#### Containerized and Orchestrated Environments {#sec-mit-containerized-environments}

Containerized and orchestrated environments are designed to reinforce security by automating many good practices, with mechanisms that have been widely tested in diverse environments. As tools that can be used well or badly, their best practice recommendations are important to ensure that the full benefits are realised.

##### Risks that containerized environments can mitigate * [SLS1](#risk-sls-1), [SLS2](#risk-sls-2), [SLS3](#risk-sls-3), [SLS4](#risk-sls-4), [SLS5](#risk-sls-5), [SLS6](#risk-sls-6), [SLS7](#risk-sls-7) * [DOW21](#risk-dow-21) * [HCK2](#risk-hck-2) * [GIR13](#risk-gir-13), [GIR23](#risk-gir-23)

#### Process Automation {#sec-mit-process-automation} Human error is always a risk. An automated script, whether or not invoked by a human, can help minimise inadvertent errors. Another benefit of properly set up automation is that it can help reduce the risk of exposing secrets.

Tools to support process automation

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

##### Risks that process automation can mitigate * [FIN3](#risk-fin-3), [FIN4](#risk-fin-4), [FIN5](#risk-fin-5), [FIN6](#risk-fin-6) * [SLS1](#risk-sls-1), [SLS2](#risk-sls-2), [SLS3](#risk-sls-3), [SLS4](#risk-sls-4), [SLS17](#risk-sls-17), [SLS18](#risk-sls-18), [SLS19](#risk-sls-19) * [DOW4](#risk-dow-4), [DOW19](#risk-dow-19), [DOW20](#risk-dow-20) * [KEC6](#risk-kec-6), [KEC9](#risk-kec-9) * [GIR16](#risk-gir-16), [GIR18](#risk-gir-18), [GIR19](#risk-gir-19), [GIR20](#risk-gir-20), [GIR21](#risk-gir-21), [GIR25](#risk-gir-25)

### Monitoring, Logging and Alerting {#sec-mitigations-monitoring}

Monitoring is an important tool to identify risks and gain relevant data, and some requirement for it is a very common feature of compliance and security frameworks.

Monitoring takes many forms. It can be done internally, or provided as a service. The latter is especially common for monitoring the health of widely available third-party infrastructure such as blockchains and cloud services.

Monitoring can take place throughout the ecosystem. Low-level indicators such as whether network traffic is within expected or design parameters, whether databases are being updated at expected rates, or whether server facilities are maintaining an appropriate temperature are all examples of monitoring with fairly obvious value, and where immediate remediation or further investigation is straightforward. Monitoring access to physical infrastructure is more complex, and the resulting information about people is subject to privacy requirements, but can be a useful diagnostic tool if something goes very wrong, or if you just want to know who keeps blocking the server-room door open on warm days.

As well as monitoring in real time, logging information allows analysis to discover information that is only observable through variations (or non-variations) in specific monitored information over time.

Given the importance of logged information, and of privacy requirements, best practice is to have a clearly documented policy for record retention. This needs to retain enough information to enable historical analysis and comparison. Some data are best only retained in anonymized form, or stored with extra security provisions applied.

A good monitoring system provides very broad coverage, with redundancy both as an aspect that can be monitored to detect anomalies and to eliminate the risk of a single point of failure - when monitoring is compromised it can indicate a simple failure of the monitoring system, but can also mask a broader issue that the system is expected to detect.

To learn as soon as possible that a potential problem has been identified, and to act on it effectively, a monitoring system that provides broad coverage of operations needs a robust, targeted alerting system. A system that overloads its watchers with alerts is likely to lead to alert fatigue, where the alerts are ignored in practice because too often they require an onerous human response when they are not identifying a real problem. As with monitoring systems in general, redundancy in alert systems is important.

Knowing an incident has occurred can trigger an [=Incident Response Plan=], but if it relies on individuals, it is important to provide 24/7 response. Many attacks are deliberately targeted for times when responders are less likely to have high availability.

Alert systems can in turn drive automated emergency responses, ranging from capture of increased levels of detail, through requesting additional authorization beyond the normal requirements, to full system shutdowns. Here again, there are important trade-offs between ensuring a highly responsive system, and one that is robust in the face of real-world variability. For example, a system that can automatically suspend [=multi-sig=] transactions unless they are authorized within a short time is not always appropriate, because it can interfere with normal operations over a high-latency network or where a number of individuals are expected to coordinate extensively, taking a significant amount of time, before authorizing a particular action.
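One way to reduce the alert fatigue described above is to suppress repeats of the same alert within a cooldown window. The following is a minimal sketch of that idea; the `AlertThrottle` class, the alert keys, and the 300-second window are hypothetical, and a real alerting system would combine this with escalation and redundancy.

```python
class AlertThrottle:
    """Suppress repeats of the same alert within a cooldown window."""

    def __init__(self, cooldown_seconds):
        self.cooldown_seconds = cooldown_seconds
        self.last_sent = {}  # alert key -> time the alert was last forwarded

    def should_notify(self, alert_key, now):
        previous = self.last_sent.get(alert_key)
        if previous is not None and now - previous < self.cooldown_seconds:
            return False  # duplicate within the window: do not page again
        self.last_sent[alert_key] = now
        return True


throttle = AlertThrottle(cooldown_seconds=300)
decisions = [
    throttle.should_notify("node-1/missed-attestation", now=0),
    throttle.should_notify("node-1/missed-attestation", now=60),   # repeat, suppressed
    throttle.should_notify("node-2/disk-full", now=60),            # different alert
    throttle.should_notify("node-1/missed-attestation", now=400),  # window has passed
]
```

The window is measured from the last alert that was actually forwarded, so a problem that persists keeps producing notifications at a bounded rate instead of being silenced indefinitely.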
Among many aspects of Validator Operations to monitor directly are the following:

#### Blockchain {#sec-mit-monitor-blockchain}

* are **Slashing Events** occurring on the beacon chain? To whom? How is this impacting the network?
* is the **Anti-Slashing Database** functioning correctly?
* how well are **Relay Lists** balancing load and availability to avoid downtime conditions?
* are **Chain Reorganizations** occurring? Are there patterns of causes?
* is the Consensus layer reaching **Finality** in accordance with expectations?
* is **MEV** affecting performance or returns?
* are **Block Proposals**, **Block Height**, and **Attestations** proceeding in line with history and expectations?
* are there anomalies in **Sync Committees**?
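As an illustration of the finality question in the list above, the sketch below classifies the gap between the head epoch and the last finalized epoch. The function names and thresholds are hypothetical. The default expected lag of 2 epochs is an assumption based on typical Ethereum behaviour, and would need adjusting for other networks.

```python
def finality_lag(head_epoch: int, finalized_epoch: int) -> int:
    """Number of epochs between the chain head and the last finalized epoch."""
    if finalized_epoch > head_epoch:
        raise ValueError("finalized epoch cannot be ahead of the head epoch")
    return head_epoch - finalized_epoch


def finality_status(head_epoch: int, finalized_epoch: int,
                    expected_lag: int = 2, alert_lag: int = 4) -> str:
    """Classify the lag as 'ok', 'warn' or 'alert' (thresholds are assumptions)."""
    lag = finality_lag(head_epoch, finalized_epoch)
    if lag <= expected_lag:
        return "ok"
    if lag < alert_lag:
        return "warn"
    return "alert"
```

The two epoch values would come from whatever node or monitoring API the Operator already uses; that data source is deliberately left out of the sketch.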

##### Risks that blockchain monitoring can mitigate * [FIN5](#risk-fin-5) * All [Slashing Risks](#sec-risks-slashing) * [DOW1](#risk-dow-1), [DOW7](#risk-dow-7), [DOW10](#risk-dow-10), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW15](#risk-dow-15), [DOW19](#risk-dow-19) * [GIR4](#risk-gir-4) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4)

#### Node, System and Network Health {#sec-mit-monitor-systems} * do key operational metrics like CPU usage, memory usage, restarts, and uptime of nodes indicate **Healthy Node** conditions? * is **Peering Connectivity** normal? * are **Failover Systems** functional, ready to operate, and not operating unexpectedly? * are **Cloud Systems** functioning according to agreements? * do **Cloud Service Notifications** help effectively anticipate and manage expected downtime and maintenance? * are **App-specific** metrics within expected parameters? * are **Redundant Monitors** producing consistent results?

##### Risks that system monitoring can mitigate * [FIN3](#risk-fin-3), [FIN4](#risk-fin-4) * All [Slashing Risks](#sec-risks-slashing) * [DOW1](#risk-dow-1), [DOW2](#risk-dow-2), [DOW3](#risk-dow-3), [DOW4](#risk-dow-4), [DOW5](#risk-dow-5), [DOW6](#risk-dow-6), [DOW7](#risk-dow-7), [DOW9](#risk-dow-9), [DOW10](#risk-dow-10), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14), [DOW15](#risk-dow-15) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4), [HCK5](#risk-hck-5) * [GIR7](#risk-gir-7)

#### Security and Compliance {#sec-mit-monitor-compliance}

Monitoring for unusual patterns or spikes can help detect a security breach or an exploit in progress. In many cases, even if security is breached, secure and accurate logs are important to determine how this took place, in order to protect against recurrence. The following are among the indicators of a security issue, and information that can help determine what happened.

* **Key Usage**, **Authorised Access**, and **Access Control Changes** anomalies, especially in sensitive systems such as 2FA configuration, security platforms, network monitoring solutions, or VPNs.
* **Phishing** and similar attempts to attack authorized users through social engineering.
* Attacks on **Firewalls** and **Endpoint Attacks**, both for employee devices and infrastructure nodes, or directed at **Bastion Nodes**. These can be indicative of an attack or exploit in preparation or underway.
* **Relay behavior** such as compliance aspects and availability metrics.
* Ideally, **Bug Reports** and **Community Discussion** will not be the first source of notification about a problem, but it is important to monitor them.
* Various services can monitor whether **Confidential Data** are available publicly, demonstrating there has been a data breach.

##### Risks that security and compliance monitoring can mitigate * [FIN1](#risk-fin-1), [FIN2](#risk-fin-2), [FIN5](#risk-fin-5), [FIN6](#risk-fin-6), [FIN7](#risk-fin-7) * [DOW4](#risk-dow-4), [DOW10](#risk-dow-10), [DOW12](#risk-dow-12), [DOW19](#risk-dow-19) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4), [HCK6](#risk-hck-6) * [GIR6](#risk-gir-6), [GIR9](#risk-gir-9), [GIR14](#risk-gir-14), [GIR15](#risk-gir-15), [GIR17](#risk-gir-17), [GIR22](#risk-gir-22)

#### Upgrades {#sec-mit-monitor-upgrades} * does the **Upgrade Process** including client code source, configuration and testnet and production deployment, work as desired, consume unexpected time, or generate errors and issues? * how does **Customized Code in Testnet** behave compared to the code deployed in production? This is especially relevant for network updates. * is **System Configuration** stable?
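One way to check whether **System Configuration** is stable is to compare a fingerprint of the current configuration against a recorded baseline. The sketch below assumes the configuration can be represented as a JSON-compatible dictionary; the function names are hypothetical.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """SHA-256 of a canonical JSON rendering, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_drifted(baseline_fingerprint: str, current_config: dict) -> bool:
    """True if the current configuration no longer matches the recorded baseline."""
    return config_fingerprint(current_config) != baseline_fingerprint
```

Recording the baseline fingerprint at each approved upgrade means that any later mismatch points to a change made outside the upgrade process.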

##### Risks that monitoring upgrades can mitigate * [SLS6](#risk-sls-6), [SLS7](#risk-sls-7) * [DOW2](#risk-dow-2), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW19](#risk-dow-19) * [HCK3](#risk-hck-3), [HCK4](#risk-hck-4) * [GIR11](#risk-gir-11), [GIR14](#risk-gir-14), [GIR18](#risk-gir-18), [GIR19](#risk-gir-19), [GIR20](#risk-gir-20), [GIR21](#risk-gir-21)

#### Doppelgänger Protection {#sec-mit-doppelganger-protection}

If two validators with the same identifiers are running at the same time, it is important to shut one down as fast as possible. Most validator clients provide built-in mechanisms to detect doppelgängers. Other tools and techniques can also detect and act on this.
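A common detection approach is to listen to the network for a short period before signing, and refuse to start if the validator's key is already active. The sketch below illustrates that idea only; it is not the implementation used by any particular client, and the function name, the data structure, and the default of 2 watched epochs are assumptions.

```python
def safe_to_start_signing(pubkey: str,
                          attesters_by_epoch: dict[int, set[str]],
                          current_epoch: int,
                          epochs_to_watch: int = 2) -> bool:
    """Return True only if pubkey was not seen attesting in the watched epochs.

    attesters_by_epoch maps an epoch number to the set of public keys observed
    attesting in that epoch (hypothetical input gathered by listening first).
    """
    if epochs_to_watch < 1:
        raise ValueError("must watch at least one epoch")
    for epoch in range(current_epoch - epochs_to_watch, current_epoch):
        seen = attesters_by_epoch.get(epoch)
        if seen is None:
            # No observation for this epoch: a duplicate cannot be ruled out.
            return False
        if pubkey in seen:
            # Another instance is already signing with this key.
            return False
    return True
```

Treating missing observations as unsafe trades a slightly longer startup for a lower risk of a slashable double signature.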

Tools to support Doppelgänger protection

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

Lighthouse
Prysm
Teku
Nimbus
Doppelganger protection in `ssv.network`
DoppelBuster
StatefulSet handling in Kubernetes

##### Risks that doppelgänger protection can mitigate * [SLS1](#risk-sls-1), [SLS2](#risk-sls-2), [SLS5](#risk-sls-5) * [DOW2](#risk-dow-2), [DOW10](#risk-dow-10) * [SPS0](#risk-sps-0)

Tools to support Monitoring

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

Within AWS, Cognito's [Userpool Addons for auditing authentications](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-cognito-userpool-userpooladdons.html) and the [WAF module](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-wafv2-webacl.html) to filter anomalies are just examples of the range of tools available
ELK stack
ESD monitors slashing events on the Ethereum chain
Ethereum validator monitoring
[Grafana](https://grafana.com) - an example of alerting setup in Grafana
MEV monitoring tool from SimplyStaking
Prometheus
Wazuh

### Communications and Incident Response {#sec-mitigations-incident-response}

Communication is important both during normal operations, and when an exceptional security incident occurs that could adversely affect the normal operations, or the users of a system. There are therefore two core parts to a Node Operator's communication strategy:

- Normal Operational Communication provides information about ongoing operations, to ensure confidence in and transparency of everyday operations.
- Incident Communication is the collection of communications processes that occur as part of an [=Incident Response Plan=].

Developing appropriate communication procedures relies on understanding both the communications channels an organisation has or can have, and its stakeholders. The goal is to ensure those stakeholders have timely access to relevant information in a useful format.

#### Stakeholder Communication Management {#sec-mit-comms-stakeholders}

Some key stakeholders are Anonymous Stakeholders, who might follow a Node Operator's public information channels, or operate independently, but who do not provide individual communication information to Operators. These can include:

* Low stake investors
* Potential investors
* Communities developing technical standards
* Education Providers
* Corporate Regulators

Regulators of various kinds can require that Node Operators provide them with specific information, but do not necessarily communicate with Node Operators on an individual basis.

Node Operators will also have Known Stakeholders, who have an identity known to the Node Operator that includes at least one direct communications channel such as messaging, email, or telephone.
These typically include at least some of:

* High stake investors - with some of whom the Operator could also have contractual obligations
* Service Partners, who might be involved in operating and managing protocols and requiring governance votes, or hosting, managing or operating infrastructure as part of the node operation setup
* Media channels, platforms, and accounts covering technical and non-technical news and reports
* Other Node Operators running validators on the same network
* Staff such as those developing and maintaining critical node operations software
* Individuals or organizations using additional services provided by Node Operators (e.g., API users, customers for white-label solutions etc.)

Stakeholders' preferences for communication channels differ. While many [=Known Stakeholders=] will have explicitly requested direct communication, it is important to have additional channels that enable [=Anonymous Stakeholders=] to follow important developments.

Broadly, communication channels can be considered two-way, enabling communication with an individual Known Stakeholder or with all of them at once, or broadcast, enabling [=Anonymous Stakeholders=] to receive important information, often while preserving their anonymity.

Additionally, some mechanisms allow for persistent information, while others are only temporary: a website can be maintained long-term or the information can be removed, information sent by email can easily be retained by the recipient in perpetuity, while information in e.g. a Slack or Telegram channel could be deleted after a matter of days or weeks.

It is also important, especially for services used for two-way communication with [=Known Stakeholders=], to consider the security and privacy of the channels used.
While channels such as Telegram or WhatsApp use encryption, in the case of the former all communication is decoded at some unknown centralized point, and in the case of the latter large amounts of metadata are available to the service provider.

While many messaging services can behave in either manner, some such as websites are well-suited to broadcast communication and others are more suited to individual two-way communication.

As well as identifying the most appropriate channels for communication with [=Known Stakeholders=] or classes of [=Anonymous Stakeholders=], it is important to understand what it is appropriate to communicate, and to whom. Some stakeholders will expect "close management", with direct individualized two-way communication, and very rapid reporting on incidents and important information. Others will want to know that they will be informed in case of security incidents or important regulatory changes, but prefer a lower volume of information. It is likely that different circumstances will mean that a given Stakeholder moves between "categories", with different communications strategies or procedures being more appropriate depending on specific context.

##### Best practices for stakeholder management include

- Track and categorise [=Known Stakeholders=]
- Assess communication tools relevant to [=Anonymous Stakeholders=]

Tools to support stakeholder management

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

Broadcast communication tools include Websites, X (the former Twitter), BlueSky, Facebook/Instagram
[A Stakeholder Map](https://duck-initiative.gitbook.io/d.u.c.k.-knowledge-base/~gitbook/image?url=https%3A%2F%2F3935398949-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxTRnDyIanlwU7cCKcAju%252Fuploads%252Fhdg4kca5vuEMKsr01CxP%252FStakeholder_Map_Template.png%3Falt%3Dmedia%26token%3Dc83db7a4-f0ed-464f-ab69-630ba1450597&width=768&dpr=4&quality=100&sign=7ea4928c&sv=2)
A Stakeholder Register Spreadsheet
CRM systems
Email
Messaging services such as Telegram, Discord, Slack, Signal, and WhatsApp

A number of jurisdictions (such as the EU, with the [[?GDPR]]) regulate the use of information about individuals, and it is important to understand and comply with such regulations to avoid reputational, legal and financial risks.

##### Risks that stakeholder management can mitigate * [FIN1](#risk-fin-1), [FIN6](#risk-fin-6), [FIN7](#risk-fin-7) * [SPS0](#risk-sps-0) * [RER1](#risk-rer-1), [RER3](#risk-rer-3)

#### Incident Response Plans {#sec-mit-incident-response} An Incident Response Plan documents procedures for managing security incidents and events, as guidance for employees or incident responders who believe they have discovered, or are responding to, a security incident. A well-documented Incident Response Plan helps employees in a high-stress situation by providing a reminder of all important actions and considerations. To be useful, it is necessary that relevant employees know the plans exist, and how to find them. ##### Best practices for incident response plans include - Identify relevant participants in advance, with well-defined decision-making responsibilities - Redundancy against specific failures such as a key employee being unavailable - Clear information about how to investigate and triage incidents, including when to notify and involve particular participants and how to escalate issues to the most appropriate person or team. - Define clear procedures to follow for specific sets of circumstances. Where it is possible and appropriate, automated responses and alerting triggered by [Monitoring](#sec-mitigations-monitoring) can help ensure rapid response. - Data collection and distribution to enable effective response, external communication, and [=Post Mortem=] analysis - Identify relevant Stakeholders and define communication strategies for both internal and external communications
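The redundancy point above can be made concrete with an ordered escalation chain, where the first available participant takes responsibility. This is a minimal sketch; the function name and the role names used to exercise it are hypothetical.

```python
def pick_responder(escalation_chain: list[str], available: set[str]) -> str:
    """Return the first person in the escalation chain who is currently available."""
    for person in escalation_chain:
        if person in available:
            return person
    # Nobody in the chain can respond: the plan lacks sufficient redundancy.
    raise LookupError("no responder available in the escalation chain")
```

An empty result is treated as an error rather than silently ignored, because it signals a gap in the plan that needs fixing before an incident occurs.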

Tools to support incident response planning

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

NIST Incident Response template [[?NIST800_61]]
DUCK Incident Response Template

##### Risks that incident response planning can mitigate * [FIN1](#risk-fin-1), [FIN6](#risk-fin-6), [FIN7](#risk-fin-7) * [SLS17](#risk-sls-17), [SLS18](#risk-sls-18), [SLS19](#risk-sls-19) * [DOW1](#risk-dow-1), [DOW2](#risk-dow-2), [DOW3](#risk-dow-3), [DOW4](#risk-dow-4), [DOW5](#risk-dow-5), [DOW6](#risk-dow-6), [DOW7](#risk-dow-7), [DOW9](#risk-dow-9), [DOW10](#risk-dow-10), [DOW21](#risk-dow-21) * [HCK3](#risk-hck-3) * [GIR13](#risk-gir-13) * [RER1](#risk-rer-1), [RER3](#risk-rer-3)

#### Identifying and Responding to Security Incidents {#sec-mitigations-identify-incidents} There are several ways to identify that a security incident is taking place. Best practice is to have extensive monitoring in place, to identify anomalies early, with alerting and potentially direct reaction mechanisms. Although learning from third-party discussions is a terrible way to find out about an incident, it is still better than simply not discovering it, so monitoring channels where such discussions take place is a valuable part of an overall strategy.

##### Risks that identifying and responding to security incidents can mitigate * [FIN6](#risk-fin-6) * [SLS17](#risk-sls-17), [SLS18](#risk-sls-18), [SLS19](#risk-sls-19) * All [Downtime Risks](#sec-risks-downtime) * [KEC6](#risk-kec-6), [KEC7](#risk-kec-7), [KEC9](#risk-kec-9) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4), [HCK5](#risk-hck-5), [HCK6](#risk-hck-6) * [GIR13](#risk-gir-13), [GIR14](#risk-gir-14), [GIR15](#risk-gir-15), [GIR22](#risk-gir-22) * [SPS0](#risk-sps-0) * [RER1](#risk-rer-1), [RER3](#risk-rer-3), [RER5](#risk-rer-5)

#### Analyzing Security Events {#sec-mit-incident-learning} This is often referred to as a "Post Mortem", used to learn from the event and improve relevant Incident Response Plans. ##### Best practices for analyzing security events include - Determine the root cause or causes of an incident - Examine how the incident was allowed to occur - Consider what changes can be implemented to prevent or mitigate similar events from occurring.

##### Risks that analyzing security events can mitigate * [FIN7](#risk-fin-7) * [SLS5](#risk-sls-5), [SLS14](#risk-sls-14), [SLS17](#risk-sls-17), [SLS18](#risk-sls-18), [SLS19](#risk-sls-19) * [DOW4](#risk-dow-4), [DOW6](#risk-dow-6), [DOW10](#risk-dow-10) * [KEC6](#risk-kec-6), [KEC7](#risk-kec-7), [KEC9](#risk-kec-9) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4), [HCK5](#risk-hck-5), [HCK6](#risk-hck-6) * [GIR6](#risk-gir-6), [GIR7](#risk-gir-7), [GIR13](#risk-gir-13), [GIR14](#risk-gir-14), [GIR15](#risk-gir-15)

#### Disaster Recovery Plans {#sec-mit-disaster-recovery} A Disaster Recovery Plan is a specialized [=Incident Response Plan=] that gives guidance on recovering one or more information systems at an alternate facility, in response to a major hardware or software failure including the partial or complete destruction of facilities. ##### Best practices for disaster recovery plans include - Maintain secured up-to-date copies of production environments to enable fast restoration.

Tools to support disaster recovery plans

The following list is an uncurated selection, alphabetically sorted, and not a specific recommendation

NIST Disaster Response template

##### Risks that disaster recovery plans can mitigate * [DOW1](#risk-dow-1), [DOW2](#risk-dow-2), [DOW3](#risk-dow-3), [DOW4](#risk-dow-4), [DOW5](#risk-dow-5), [DOW6](#risk-dow-6), [DOW7](#risk-dow-7), [DOW9](#risk-dow-9) * [KEC6](#risk-kec-6), [KEC7](#risk-kec-7), [KEC9](#risk-kec-9) * [HCK6](#risk-hck-6) * [GIR15](#risk-gir-15), [GIR16](#risk-gir-16), [GIR19](#risk-gir-19), [GIR21](#risk-gir-21), [GIR22](#risk-gir-22), [GIR25](#risk-gir-25)

#### Incident Simulations {#sec-mit-incident-simulation}

These are also known as "Pre-Mortems". Regular simulations of implementing an [=Incident Response Plan=] ensure that relevant personnel are familiar with the plan and can efficiently follow it when necessary. "Pre-Mortems" simulating or "war-gaming" a specific failure also test those procedures, to give some idea of whether they are appropriate and adequate. They also often motivate participants to think about other risks, and whether appropriate procedures and mitigations are in place.

There are many possible approaches to an incident simulation, and many eventualities that they can cover. Example topics for Pre-Mortems include variations on themes such as:

* Unauthorized users gain access to the servers and set about making mischief
* A complex security compromise where details are not immediately available
* A specific scenario (environmental disaster, utility failure, operational error) results in system downtime

Articles such as [[[?PREMORTEM]]] offer further information on how to plan and implement simulations, and how to derive the maximum benefit from them.

##### Tools and templates for incident simulation * [National Institute of Standards & Technology Template](https://csrc.nist.gov/files/pubs/sp/800/34/r1/upd1/final/docs/sp800-34-rev1_cp_template_high_impact_system.docx) * [#automation](../mitigation-and-controls-library/collection-of-tools-scripts-and-templates.md#automation "mention") ##### Risks that incident simulations can mitigate - All risks

#### Incident Communication {#sec-mit-incident-communication}

As well as direct financial losses, security incidents can also result in substantial reputational damage. Appropriate [=Incident Communication=] with stakeholders about security incidents, both during and after the relevant incident, can significantly mitigate this risk.

It is important to note that inappropriate communication during an incident can increase the damage. External communication has to balance stakeholders' need for information that enables them to respond in a well-informed manner against the importance of providing clear information with as much certainty as feasible that it will not later be contradicted.

##### Best practices for incident communication include

- Providing information as soon as possible
- Providing a detailed post-incident summary.

##### Risks that incident communications help address: * [FIN7](#risk-fin-7) * [HCK4](#risk-hck-4) * [SPS0](#risk-sps-0) * [RER1](#risk-rer-1), [RER2](#risk-rer-2), [RER3](#risk-rer-3)

## Controls Catalog {#sec-controls-catalog} This section contains controls that are material to Node Operator risks. Some of these control criteria correspond to similar controls from three common frameworks: * [[[?OWASP_TOP10]]] * [[[?ISO27001]]] * [[[?SOC2]]] Where relevant, corresponding controls from those frameworks are identified and linked from ValOS controls. ### Controls for Risk Management {#sec-controls-risk-management} #### Ensure Activities Support Operational Goals 🔗 Node Operators MUST document how their processes and tools serve their business goals

##### Relevant external Controls for aligning processes and tools with business goals * [[?SOC2]] CC 5.2 ##### Assessment of activities' relevance helps address the following risks * [SLS1](#risk-sls-1), [SLS2](#risk-sls-2), [SLS3](#risk-sls-3), [SLS4](#risk-sls-4), [SLS5](#risk-sls-5), [SLS11](#risk-sls-11), [SLS12](#risk-sls-12), [SLS13](#risk-sls-13), [SLS14](#risk-sls-14), [SLS15](#risk-sls-15), [SLS17](#risk-sls-17), [SLS18](#risk-sls-18) * [DOW16](#risk-dow-16), [DOW18](#risk-dow-18) * [GIR5](#risk-gir-5)

#### Assess Dependencies and Replacement Strategies and Costs 🔗 Node Operators MUST review their dependencies on staff and external suppliers and how to replace key staff or suppliers every year External suppliers can change terms, shut down products or support, and key staff can leave or be indisposed for long enough to impact business functions.

##### Assessing dependency replacement helps address the following risks - [FIN1](#risk-fin-1) - [SLS6](#risk-sls-6) - [DOW11](#risk-dow-11), [DOW14](#risk-dow-14), [DOW19](#risk-dow-19), [DOW20](#risk-dow-20) - [GIR24](#risk-gir-24), [GIR25](#risk-gir-25)

#### Document Risk Assessments 🔗 Node Operators MUST document their assessments of risks, and what risks they class as acceptable

##### Relevant external controls for risk assessment * [[?SOC2]] CC 3.1 ##### Internal risk assessment is an important part of addressing all risks

#### Follow Processes 🔗 Node Operators MUST ensure that processes for risk mitigation are followed in practice Best practice is to ensure that where possible, [processes are automated](#process-automation)

##### Following processes helps address the following risks * [FIN1](#risk-fin-1), [FIN2](#risk-fin-2), [FIN3](#risk-fin-3), [FIN4](#risk-fin-4), [FIN5](#risk-fin-5), [FIN6](#risk-fin-6)

### Controls for Financial and Regulatory Risk {#sec-controls-fin-reg} #### Payment rails 🔗 Node Operators MUST document payment processes including currency and exchange details

##### Payment rails help address the following risks * [FIN2](#risk-fin-2), [FIN3](#risk-fin-3), [FIN4](#risk-fin-4), [FIN5](#risk-fin-5), [FIN6](#risk-fin-6) * [HCK1](#risk-hck-1) * [RER4](#risk-rer-4)

#### Updated Regulatory Compliance 🔗 Node Operators MUST review relevant regulation and update processes for compliance as necessary at least quarterly

##### Updated regulatory compliance helps address the following risks: * [FIN6](#risk-fin-6) * [RER2](#risk-rer-2)

### Controls for People Management

#### Identify Relevant Entities

🔗 Node Operators MUST know the identity of entities who are authorized to manage operations

Best practice is to identify every individual who works for the Node Operator. In the case of corporate third-party providers, sensible due diligence does not always extend to identifying specific individuals.

##### Relevant external controls for Identifying individuals * [[?ISO27001]] Annex A 5.16 ##### Identifying individuals involved in managing Validators helps address the following Risks * [FIN1](#risk-fin-1) * [KEC7](#risk-kec-7) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3) * [SPS0](#risk-sps-0) * [RER2](#risk-rer-2)

#### Document Vendors and Partner Risk 🔗 Node Operators MUST implement documented procedures for evaluating and reviewing counterparty risks from vendors and partners * Establishing a process for Vendor and Business Partner engagement and assessing existing as well as new vendors and business partners * Ensuring that any identified issues are fixed, and regressions can be identified. * Terminating relationships efficiently where problems arise, or the relationship ends.

##### Relevant external controls for counterparty risk management * [[?SOC2]] CC 9.2 ##### Counterparty risk management helps address the following risks * [FIN1](#risk-fin-1) * [SLS9](#risk-sls-9) * [GIR5](#risk-gir-5) * [DOW1](#risk-dow-1), [DOW19](#risk-dow-19)

#### Provide Training 🔗 Node Operators MUST ensure entities who are authorized to manage operations have and maintain the necessary knowledge to minimize risks to the Node Operator in the course of performing their work

##### Training helps address the following Risks * [FIN1](#risk-fin-1), [FIN7](#risk-fin-7) * [SLS17](#risk-sls-17) * [DOW21](#risk-dow-21) * [GIR16](#risk-gir-16), [GIR22](#risk-gir-22), [GIR25](#risk-gir-25) * [RER2](#risk-rer-2)

### Controls for Technology Stack {#sec-controls-tech-stack} #### Keep Software Updated 🔗 Node Operators MUST keep third-party software up to date This control does not imply that the latest available update is automatically applied, rather that Node Operators have clear and effective mechanisms to ensure they are aware of updates and apply them in accordance with their update management procedures, taking into account the controls in . Best practise is to monitor software in use, to know when an update is available, and to update as fast as possible while following procedures to manage those updates securely. In some cases, assessing an update will lead to a decision that there is no need to apply a specific update, or a risk in doing so that outweighs the benefits.

##### Updated software helps address the following risks * [DOW19](#risk-dow-19) * [HCK4](#risk-hck-4)

#### Anti-slashing Database 🔗 Node Operators MUST have a persistent local anti-slashing database
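The sketch below shows the kind of check a persistent local anti-slashing database makes for attestations: refuse to sign anything that would be a double vote or a surround vote relative to what has already been signed. It is a simplified illustration using SQLite, not a replacement for the slashing protection built into validator clients, and it does not cover block proposals. The table and function names are hypothetical, and the check is deliberately conservative: it refuses any second attestation for the same target epoch.

```python
import sqlite3


def open_db(path: str) -> sqlite3.Connection:
    """Open (or create) the database file; a path on disk makes it persistent."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signed_attestations ("
        "pubkey TEXT NOT NULL, source_epoch INTEGER NOT NULL, "
        "target_epoch INTEGER NOT NULL, PRIMARY KEY (pubkey, target_epoch))"
    )
    return conn


def try_record_attestation(conn: sqlite3.Connection, pubkey: str,
                           source_epoch: int, target_epoch: int) -> bool:
    """Record the attestation and return True only if signing it looks safe."""
    if source_epoch > target_epoch:
        return False  # malformed attestation
    conflict = conn.execute(
        "SELECT 1 FROM signed_attestations WHERE pubkey = ? AND ("
        " target_epoch = ?"                              # same target: double vote
        " OR (source_epoch > ? AND target_epoch < ?)"    # new vote surrounds an old one
        " OR (source_epoch < ? AND target_epoch > ?)"    # an old vote surrounds the new one
        ") LIMIT 1",
        (pubkey, target_epoch, source_epoch, target_epoch,
         source_epoch, target_epoch),
    ).fetchone()
    if conflict is not None:
        return False
    with conn:  # commit before the caller signs
        conn.execute(
            "INSERT INTO signed_attestations VALUES (?, ?, ?)",
            (pubkey, source_epoch, target_epoch),
        )
    return True
```

Writing the record before signing, rather than after, means a crash between the two steps can only cause a missed attestation, never an unrecorded signature.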

##### A local anti-slashing database helps address the following risks * [SLS1](#risk-sls-1), [SLS2](#risk-sls-2), [SLS3](#risk-sls-3), [SLS4](#risk-sls-4), [SLS14](#risk-sls-14), [SLS15](#risk-sls-15), [SLS17](#risk-sls-18), [SLS19](#risk-sls-19)

#### Signature management 🔗 Node Operators MUST document signature requirements for high-value transactions, including the definitions used to identify such transactions 🔗 Node Operators SHOULD use signature management tools to help secure high-value transactions 🔗 The primary and backup/failover versions of Signature management tools MUST implement mechanisms to ensure data continuity

##### Signature management helps address the following risks * [SLS1](#risk-sls-1), [SLS2](#risk-sls-2), [SLS3](#risk-sls-3), [SLS4](#risk-sls-4), [SLS7](#risk-sls-7), [SLS15](#risk-sls-15) * [KEC6](#risk-kec-6), [KEC9](#risk-kec-9) * [GIR7](#risk-gir-7), [GIR16](#risk-gir-16)

#### Client Diversity 🔗 Node Operators MUST deploy at least 2 distinct client applications for any level of the blockchain where at least 3 clients are available
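The control above can be expressed as a simple check per layer. This is a sketch of one reading of the requirement; the function name and the client names used to exercise it are placeholders.

```python
def meets_client_diversity(deployed: set[str], available: set[str]) -> bool:
    """Check the client diversity control for one layer of the blockchain.

    The control only applies where at least 3 clients are available; in that
    case at least 2 distinct available clients must be deployed.
    """
    if len(available) < 3:
        return True  # control does not apply to this layer
    return len(deployed & available) >= 2
```

Running the check once per layer (for example consensus and execution, where a network separates them) gives a per-layer pass or fail result.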

##### Client diversity helps address the following risks * [SLS6](#risk-sls-6), [SLS20](#risk-sls-20) * [DOW2](#risk-dow-2), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14) * [GIR13](#risk-gir-13), [GIR24](#risk-gir-24)

#### Secure Devices 🔗 Devices that control critical functions MUST be dedicated to that purpose, and configured with only the necessary software for their intended purpose This applies to servers acting as validators, but also to devices authorized to access and administer those servers remotely.

##### Securing devices helps address the following risks * [DOW2](#risk-dow-2) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4)

#### Validator Withdrawal 🔗 Node Operators MUST implement processes to withdraw validators from a network in such a way that they are not penalised for disappearing

##### Managed validator withdrawal helps address the following risks * [SLS2](#risk-sls-2) * [SPS1](#risk-sps-1)

#### Manage Software and Hardware Configuration 🔗 Node Operators MUST document configuration of software and hardware

##### Relevant external controls for configuration management * [[?SOC2]] CC 7.1 * [[?ISO27001]] Annex A 8.9 ##### Configuration management helps address the following risks * [FIN3](#risk-fin-3), [FIN4](#risk-fin-4), [FIN5](#risk-fin-5), [FIN6](#risk-fin-6) * [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14), [DOW20](#risk-dow-20), [DOW21](#risk-dow-21) * [KEC6](#risk-kec-6), [KEC9](#risk-kec-9) * [GIR3](#risk-gir-3), [GIR18](#risk-gir-18), [GIR19](#risk-gir-19), [GIR20](#risk-gir-20) * [RER1](#risk-rer-1), [RER4](#risk-rer-4)

### Controls for Information and Secret Management {#sec-controls-info-secrets}

#### Key Management

🔗 Node Operators MUST implement appropriate key management procedures

Best practice includes following a commonly recognised key management standard such as:

- [[?CCSS]]: a set of requirements for securing Cryptocurrency systems, focusing on Key Management. Certification for systems is available at three levels, and is granted by certified CCSS Auditors.
- [[?KMS]]: a set of requirements for Key Management designed for organisations working in blockchain, allowing self-attestation of conformance.

##### Key management helps address the following risks * All [Key Custody risks](#sec-risks-keys) * [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4)

#### Manage Information Lifecycles 🔗 Node Operators MUST document and follow information lifecycle processes for important operational information This includes the definition and enforcement of retention periods, and the use of thorough deletion mechanisms, such as [shred](https://man.archlinux.org/man/shred.1.en).

##### Relevant external controls for information lifecycles: * [[?ISO27001]] Annex A 8.10 ##### Information Lifecycle management helps address the following risks: * [SLS10](#risk-sls-10) * [DOW17](#risk-dow-17)

#### Backup and Protect Data against Loss 🔗 Node Operators MUST implement backup procedures, at minimum daily, for important operational data 🔗 Backup Procedures SHOULD produce journaled backups covering relevant retention periods 🔗 Node Operators MUST implement protection against accidental or malicious deletion of data These requirements cover all information required by controls in this specification.
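A simple way to verify the daily backup requirement is to compare the timestamp of the most recent backup of each dataset against a maximum age. The sketch below assumes timestamps are timezone-aware `datetime` values; the function and dataset names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def backup_is_fresh(last_backup: datetime, now: datetime,
                    max_age: timedelta = timedelta(days=1)) -> bool:
    """True if the backup is no older than max_age and not dated in the future."""
    age = now - last_backup
    return timedelta(0) <= age <= max_age


def stale_backups(last_backups: dict[str, datetime], now: datetime,
                  max_age: timedelta = timedelta(days=1)) -> list[str]:
    """Names of datasets whose latest backup fails the freshness check."""
    return sorted(name for name, ts in last_backups.items()
                  if not backup_is_fresh(ts, now, max_age))
```

A backup timestamp in the future is treated as stale, since it usually indicates a clock or metadata problem worth investigating.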

##### Protection against information loss helps address the following risks: * [FIN6](#risk-fin-6) * [SLS4](#risk-sls-4), [SLS10](#risk-sls-10), [SLS11](#risk-sls-11), [SLS12](#risk-sls-12) * [KEC6](#risk-kec-6), [KEC9](#risk-kec-9) * [HCK2](#risk-hck-2), [HCK4](#risk-hck-4) * [GIR4](#risk-gir-4), [GIR13](#risk-gir-13) * [RER1](#risk-rer-1), [RER3](#risk-rer-3)

#### Record Important Operational Knowledge 🔗 Node Operators MUST record and maintain important operational information Best practice is to use a documentation management system. While this is likely to have different levels of access control, it is important that no information is available to only one employee.

##### Recording operational knowledge helps address the following risks: * [FIN1](#risk-fin-1), [FIN5](#risk-fin-5), [FIN6](#risk-fin-6) * [SLS3](#risk-sls-3), [SLS4](#risk-sls-4), [SLS10](#risk-sls-10), [SLS14](#risk-sls-14) * [DOW1](#risk-dow-1), [DOW4](#risk-dow-4), [DOW16](#risk-dow-16), [DOW18](#risk-dow-18) * [KEC2](#risk-kec-2), [KEC3](#risk-kec-3), [KEC6](#risk-kec-6), [KEC9](#risk-kec-9), [KEC10](#risk-kec-10) * [HCK4](#risk-hck-4) * [GIR4](#risk-gir-4), [GIR25](#risk-gir-25) * [SPS0](#risk-sps-0) * [RER1](#risk-rer-1), [RER3](#risk-rer-3)

#### Document data retention policy

🔗 Node Operators MUST have a policy for data retention

This needs to provide adequate retention to enable historical analysis and checking for anomalous patterns, while minimizing stored data and ensuring compliance with relevant data protection regulation.

##### Data retention policy helps address the following risks

* [FIN6](#risk-fin-6)
* [DOW4](#risk-dow-4), [DOW13](#risk-dow-13)
* [GIR4](#risk-gir-4)

### Controls for Access Control {#sec-controls-access}

#### Relevant external controls for Access Management in General

* [[?OWASP_ACCESS_CONTROL]]
* [[?ISO27001]] Annex A 5.15
* [[?SOC2]] CC 6.1

#### Authentication required for services

🔗 All services MUST require appropriate authentication privileges

For example, a Node does not respond to anonymous requests from an unknown user.
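One common way for a service to reject anonymous requests is to require a keyed signature on each request. The sketch below uses HMAC-SHA256 with a shared secret and a constant-time comparison. It is one possible approach under assumed names (`sign`, `is_authenticated`), not a mechanism mandated by this specification.

```python
import hashlib
import hmac
from typing import Optional

def sign(secret: bytes, message: bytes) -> str:
    """Return the hex-encoded HMAC-SHA256 of the message under the shared secret."""
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def is_authenticated(secret: bytes, message: bytes, presented: Optional[str]) -> bool:
    """Accept a request only if it carries a valid signature.

    Requests with no signature at all (anonymous requests) are rejected.
    """
    if presented is None:
        return False
    expected = sign(secret, message)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), presented.encode())
```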

##### Authenticating access to services helps address the following risks

* [FIN1](#risk-fin-1)
* [DOW4](#risk-dow-4)
* [KEC7](#risk-kec-7)
* [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4)
* [GIR22](#risk-gir-22)

#### Segment Networks to Limit Access

🔗 Networks MUST be segmented, to restrict access to systems that are identified as needing it

🔗 Nodes MUST NOT respond to requests from outside a defined network, except those that are explicitly defined as necessary

Fulfilling this requirement means maintaining a whitelist of individual services that are authorized to respond to requests from broader networks.
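The decision logic behind such a whitelist can be sketched as follows. The network range, service names and port numbers are hypothetical placeholders. In practice this policy would normally be enforced by firewalls or network configuration rather than application code.

```python
import ipaddress

# Hypothetical internal network segment (assumed value).
INTERNAL_NETWORK = ipaddress.ip_network("10.0.0.0/24")

# Hypothetical whitelist of services allowed to answer requests from outside.
PUBLIC_SERVICES = {("p2p", 30303)}

def may_respond(source_ip: str, service: str, port: int) -> bool:
    """Return True if a request from source_ip to the given service may be answered."""
    if ipaddress.ip_address(source_ip) in INTERNAL_NETWORK:
        return True
    # Outside the defined network: only explicitly whitelisted services respond.
    return (service, port) in PUBLIC_SERVICES
```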

##### Relevant external controls for network segmentation

* [[?ISO27001]] Annex A 8.22

##### Segmenting networks helps address the following risks

* [DOW10](#risk-dow-10)
* [GIR9](#risk-gir-9)
* [HCK6](#risk-hck-6)

#### Access to physical hardware is limited

🔗 Entry to physical server locations MUST require authorization

For example, a biometric scan or the use of a keycard.

##### Limiting access to hardware helps address the following risks

* [DOW3](#risk-dow-3), [DOW4](#risk-dow-4)
* [HCK6](#risk-hck-6)

#### Least Privilege is applied to individuals and software

🔗 Software MUST NOT run with, and a user MUST NOT have a higher level of privilege than necessary

For example, check that software does not run as root, that users do not log in directly with root privileges, and software and users are granted fine-grained access based on need rather than broad-based access for simplicity.
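The "does not run as root" check can be made part of a service's startup. The following is a minimal sketch for Unix-like systems, where an effective user ID of 0 means root. The function names are illustrative only.

```python
import os
from typing import Optional

def is_root(euid: int) -> bool:
    """Return True if the given effective user ID is root (0 on Unix-like systems)."""
    return euid == 0

def refuse_root(euid: Optional[int] = None) -> None:
    """Raise PermissionError if running as root.

    When no euid is passed, the current process is checked (Unix only).
    """
    if euid is None:
        euid = os.geteuid()
    if is_root(euid):
        raise PermissionError("refusing to run with root privileges")
```

Calling `refuse_root()` at the top of an entry point makes a misconfigured deployment fail immediately rather than run with excess privilege.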

##### Least Privilege helps address the following risks

* [DOW4](#risk-dow-4), [DOW5](#risk-dow-5), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14)
* [KEC7](#risk-kec-7)
* [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4)
* [SPS0](#risk-sps-0)

##### Relevant external controls for Least Privilege

* [[?SOC2]] CC 6.3
* [[?ISO27001]] Annex A 8.2
* [[?ISO27001]] Annex A 8.18

#### Regularly Review Access Rights Management

🔗 A review of Access Rights MUST take place regularly

This covers both the processes and tools for granting and revoking access rights, and verifying that they are effectively managing access rights according to the relevant principles ([=Least Privilege=], [=Role-based Access Control=]).

Best practice for this review includes:

- analyzing access logs for physical access to hardware, and ensuring unauthorized individuals are not given access to hardware
- verifying access to signing keys is limited to individuals whose roles mean they need it, and that all who need that access have it
- ensuring that processes are effectively followed and meet the Node Operator's business needs
- verifying that software is run in a way that minimises its access
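Part of such a review can be automated by comparing the access each person has been granted with the access their role requires. The sketch below reports both excess access and missing access. The data shapes and names are assumptions for illustration.

```python
def review_access(granted: dict, required: dict) -> dict:
    """Compare granted access against role-required access.

    Both arguments map a person to a set of resource names.
    Returns only the people whose access deviates, with the
    excess and missing resources listed in sorted order.
    """
    findings = {}
    for person in sorted(granted.keys() | required.keys()):
        have = granted.get(person, set())
        need = required.get(person, set())
        excess = have - need    # access held but not needed
        missing = need - have   # access needed but not held
        if excess or missing:
            findings[person] = {"excess": sorted(excess), "missing": sorted(missing)}
    return findings
```

An empty result means granted access matches role requirements exactly. It does not replace the manual parts of the review, such as checking that the processes themselves are being followed.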

##### Regular review of access rights helps address the following risks

* [SLS9](#risk-sls-9), [SLS10](#risk-sls-10), [SLS11](#risk-sls-11), [SLS12](#risk-sls-12), [SLS13](#risk-sls-13)
* [DOW16](#risk-dow-16), [DOW17](#risk-dow-17), [DOW18](#risk-dow-18)
* [GIR1](#risk-gir-1), [GIR5](#risk-gir-5), [GIR7](#risk-gir-7)

##### Relevant external controls for Access Rights Review

* [[?ISO27001]] Annex A 5.17
* [[?ISO27001]] Annex A 5.18
* [[?ISO27001]] Annex A 8.18

#### Protect Data in Transit and Storage

🔗 All data in transit MUST be encrypted,

🔗 and SHOULD use the most direct transmission available

🔗 All data "at rest" MUST be stored in encrypted form

This covers all services that communicate data, such as Databases, Web servers, Load balancers, Authentication systems, CI/CD pipeline tools, etc.

Best practices include ensuring that the latest version of TLS is being used, with secure algorithms. Current best practice includes assessing the cost and risk associated with moving to quantum-safe cryptography, and appropriate timelines.
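As an example of requiring a recent TLS version, the sketch below builds a client-side TLS context with Python's standard `ssl` module that refuses anything older than TLS 1.3 and keeps certificate and hostname verification enabled. The function name is illustrative, and it assumes the underlying OpenSSL build supports TLS 1.3.

```python
import ssl

def strict_client_context() -> ssl.SSLContext:
    """Return a client TLS context that only negotiates TLS 1.3 or newer."""
    # The default context already verifies certificates and hostnames.
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx
```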

##### Relevant external controls for encrypted data

* [[?CRYPTOFAIL]]
* [[?SOC2]] CC 6.7

##### Data encryption helps address the following risks

* [SLS11](#risk-sls-11), [SLS12](#risk-sls-12), [SLS13](#risk-sls-13)
* [DOW18](#risk-dow-18)
* [KEC1](#risk-kec-1), [KEC6](#risk-kec-6), [KEC7](#risk-kec-7), [KEC9](#risk-kec-9)
* [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4), [HCK6](#risk-hck-6)
* [GIR10](#risk-gir-10)

### Controls for Automated Monitoring {#sec-controls-monitoring}

#### Log and Analyze Network Traffic

🔗 Node Operators MUST log network traffic, and analyze the logs for anomalous behaviour

##### Traffic log analysis helps address the following risks

- [SLS9](#risk-sls-9), [SLS10](#risk-sls-10), [SLS11](#risk-sls-11), [SLS12](#risk-sls-12), [SLS13](#risk-sls-13), [SLS14](#risk-sls-14), [SLS15](#risk-sls-15)
- [DOW1](#risk-dow-1)

#### Log privileged access

🔗 Any operation that requires privileged access MUST be logged

🔗 Any assignment of a key, or assignment of a role to or removal of a role from a particular key, MUST be logged

This includes monitoring software that has privileged access.
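Structured log entries make privileged operations easier to search and audit later. The sketch below emits one JSON line per privileged action through Python's standard `logging` module. The field names (`actor`, `action`, `target`) and the logger name are assumptions, not a format defined by this specification.

```python
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("audit")  # hypothetical logger name

def audit_record(actor: str, action: str, target: str, when: datetime) -> str:
    """Build one JSON-encoded audit entry for a privileged operation."""
    return json.dumps(
        {"time": when.isoformat(), "actor": actor, "action": action, "target": target},
        sort_keys=True,
    )

def log_privileged(actor: str, action: str, target: str) -> str:
    """Log a privileged operation with the current UTC time and return the entry."""
    line = audit_record(actor, action, target, datetime.now(timezone.utc))
    audit_log.info(line)
    return line
```

For example, `log_privileged("alice", "assign-role", "key-1")` would record a role assignment to a key, as the second requirement above calls for.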

##### Relevant external controls for privileged access logging

* [[?ISO27001]] Annex A 8.18

##### Logging privileged access helps address the following risks:

* [FIN1](#risk-fin-1)

#### Log personnel changes

🔗 Every change in the status of people who have access to any function of the Node, or physical access to any hardware, MUST be logged

##### Logging personnel changes helps address the following risks:

* [FIN1](#risk-fin-1)
* [HCK3](#risk-hck-3)

#### Log slashing events

🔗 Any event that results in slashing MUST be logged

##### Logging slashing events helps address the following risks

* [SLS4](#risk-sls-4), [SLS17](#risk-sls-17), [SLS18](#risk-sls-18), [SLS19](#risk-sls-19), [SLS20](#risk-sls-20)
* [RER1](#risk-rer-1), [RER3](#risk-rer-3)

#### Monitor hardware and network performance

🔗 Logs MUST provide a sufficiently detailed view of hardware and network performance to enable upgrade needs to be forecast, and to alert if validators are operating with excess latency

Tools such as [Zabbix](tool-zabbix) can also display a live feed of CPU and memory usage of each compute instance.
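An "excess latency" alert needs a concrete rule. One simple option, sketched below, is to compare a high percentile of recent latency samples against a threshold, so that a single outlier does not trigger an alert but sustained slowness does. The nearest-rank percentile method, the 95th percentile and the threshold value are all assumptions chosen for illustration.

```python
def percentile(samples, pct: int):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

def excess_latency(samples_ms, threshold_ms: float, pct: int = 95) -> bool:
    """Return True if the chosen percentile of latency exceeds the threshold."""
    return percentile(samples_ms, pct) > threshold_ms
```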

##### Relevant external controls for Monitoring

- [[?SOC2]] A 1.1
- [[?SOC2]] CC 7.2
- [[?ISO27001]] Annex A 8.16
- [[?ISO27001]] Annex A 8.21

##### Monitoring hardware helps address the following risks

* [DOW3](#risk-dow-3), [DOW7](#risk-dow-7), [DOW10](#risk-dow-10), [DOW15](#risk-dow-15)
* [GIR4](#risk-gir-4)

### Controls for Environmental Threat Management {#sec-controls-environment}

#### Manage Environmental Threats

🔗 Node Operators SHOULD have processes in place to manage environmental threats

This includes monitoring for such threats and physically hardened facilities (e.g. fire- and flood-resistant server rooms), and physically decentralized infrastructure. It can also incorporate the use of DVT or related approaches to managing physical decentralization.

##### Relevant external controls for environmental threats

* [[?ISO27001]] Annex A 7

##### Environmental threat management helps address the following risks

* [SLS14](#risk-sls-14), [SLS15](#risk-sls-15)
* [DOW1](#risk-dow-1), [DOW5](#risk-dow-5), [DOW7](#risk-dow-7), [DOW9](#risk-dow-9)

#### Distribute Validators physically

🔗 Node Operators SHOULD implement failover validators in different physical locations

##### Distributed failover validators help address the following risks

* [DOW1](#risk-dow-1), [DOW2](#risk-dow-2), [DOW3](#risk-dow-3), [DOW4](#risk-dow-4), [DOW5](#risk-dow-5), [DOW7](#risk-dow-7), [DOW9](#risk-dow-9)

#### Manage Equipment Lifecycles

🔗 Node Operators SHOULD have processes in place to manage equipment lifecycles

This includes monitoring performance and performing preventive maintenance, upgrades, or replacing equipment as appropriate, as well as processes that ensure equipment is correctly retired including removing data and any hardware-based authorization.

##### Relevant external controls for equipment lifecycles

* [[?ISO27001]] Annex A 7

##### Equipment life-cycle management helps address the following risks

* [DOW3](#risk-dow-3), [DOW6](#risk-dow-6), [DOW7](#risk-dow-7), [DOW9](#risk-dow-9), [DOW15](#risk-dow-15), [DOW20](#risk-dow-20), [DOW21](#risk-dow-21)
* [KEC1](#risk-kec-1), [KEC3](#risk-kec-3), [KEC6](#risk-kec-6)
* [HCK4](#risk-hck-4), [HCK6](#risk-hck-6)

### Controls for Development and Update Process {#sec-controls-updates}

#### Develop Software as Secure by Design

🔗 Code development MUST follow secure development processes to avoid introducing security risks

This is a broad area. A few specific controls are included in this specification, but this requirement is intended to ensure a general production philosophy.

##### Relevant external controls for secure development

* [[?ISO27001]] Annex A 8.25

##### Secure software development lifecycle (SSDLC) helps address the following risks

* All [Slashing Risks](#sec-risks-slashing)
* [DOW2](#risk-dow-2), [DOW10](#risk-dow-10), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14), [DOW20](#risk-dow-20)
* [KEC7](#risk-kec-7)
* [HCK1](#risk-hck-1), [HCK2](#risk-hck-2), [HCK3](#risk-hck-3), [HCK4](#risk-hck-4), [HCK5](#risk-hck-5)
* [GIR6](#risk-gir-6), [GIR10](#risk-gir-10), [GIR11](#risk-gir-11), [GIR13](#risk-gir-13), [GIR14](#risk-gir-14), [GIR15](#risk-gir-15)
* [SPS0](#risk-sps-0)

#### Follow Update Procedures

🔗 Node Operators MUST document procedures for updates to code

##### Relevant external controls for managed software updates

* [[?SOC2]] CC 8.1
* [[?ISO27001]] Annex A 8.32

##### Managed software updates help address the following risks

* [SLS6](#risk-sls-6), [SLS7](#risk-sls-7)
* [DOW2](#risk-dow-2), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14), [DOW19](#risk-dow-19), [DOW20](#risk-dow-20), [DOW21](#risk-dow-21)
* [GIR4](#risk-gir-4), [GIR13](#risk-gir-13), [GIR18](#risk-gir-18), [GIR19](#risk-gir-19)
* [SPS0](#risk-sps-0)

#### Use Code Repositories

🔗 Source code MUST be managed in a repository

🔗 All changes to deployed production code MUST be tested and reviewed before deployment

This covers all changes to code, including when it is necessary to roll back an upgrade.

##### Using Code Repositories helps address the following risks

* [SLS6](#risk-sls-6), [SLS7](#risk-sls-7)
* [DOW2](#risk-dow-2), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14), [DOW19](#risk-dow-19), [DOW20](#risk-dow-20), [DOW21](#risk-dow-21)
* [GIR4](#risk-gir-4), [GIR13](#risk-gir-13), [GIR18](#risk-gir-18), [GIR19](#risk-gir-19), [GIR21](#risk-gir-21)
* [SPS0](#risk-sps-0)

#### Check Third-party Code for Vulnerabilities before Updating

🔗 Updates to third-party software MUST be checked for vulnerabilities before deployment

This covers verifying that all software updates, including validator and other node clients as well as specifically written custom code or updates, have been audited to ensure they are not introducing known or new vulnerabilities.

Best practice is to perform both internal and independent external audit, and to ensure the identity of the coders is known. Likewise, in best practice third-party code developers are only given access to code they need to do their work, are held to high standards of confidentiality, and work with a well-defined set of expectations.

##### Relevant external controls for checking third-party software

* [[?ISO27001]] Annex A 8.7
* [[?ISO27001]] Annex A 8.30
* [[?ISO27001]] Annex A 8.32
* [[?SOC2]] CC 8.1

##### Checking third-party software helps address the following risks

* [FIN3](#risk-fin-3), [FIN4](#risk-fin-4), [FIN5](#risk-fin-5), [FIN6](#risk-fin-6)
* [SLS6](#risk-sls-6), [SLS7](#risk-sls-7), [SLS16](#risk-sls-16), [SLS17](#risk-sls-17), [SLS18](#risk-sls-18), [SLS19](#risk-sls-19)
* [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14)
* [KEC7](#risk-kec-7)
* [HCK1](#risk-hck-1)
* [SPS0](#risk-sps-0)

#### Verify Configuration on Update

🔗 Software update procedures MUST include an assessment and application of configuration settings
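One way to assess configuration after an update is to compare the settings actually in effect against the documented expected settings and report every difference. The sketch below does this for flat key-value settings. The setting names used in the usage example are hypothetical.

```python
def config_diff(expected: dict, actual: dict) -> dict:
    """Return settings that differ, mapped to (expected, actual) pairs.

    A key that is absent on one side is reported with None on that side,
    so an explicit None value is not distinguished from a missing key.
    """
    keys = expected.keys() | actual.keys()
    return {
        key: (expected.get(key), actual.get(key))
        for key in sorted(keys)
        if expected.get(key) != actual.get(key)
    }
```

For instance, `config_diff({"max_peers": 50}, {"max_peers": 25, "debug": True})` reports both the changed `max_peers` value and the unexpected `debug` setting.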

##### Verifying configurations on update helps address the following risks

* [SLS7](#risk-sls-7)
* [DOW2](#risk-dow-2), [DOW13](#risk-dow-13), [DOW21](#risk-dow-21)
* [GIR3](#risk-gir-3), [GIR15](#risk-gir-15)

#### Validate Inputs and outputs

🔗 Code MUST verify that input is safe before operating on it

🔗 Code MUST NOT produce invalid outputs

🔗 Components SHOULD use [[[CORS]]] and [[[CSP]]] to protect against Server Side Request Forgery

These requirements ensure that data passed between software components can be handled safely by the receiving component. This includes data entered manually by users.
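Input validation is easiest to audit when each component checks the type, range and format of every field before using it, and rejects anything else. The sketch below validates a hypothetical request carrying a validator index and a hex-encoded public key. The field names and the 48-byte key length are assumptions for illustration and will differ between networks.

```python
import re

# Hypothetical key format: "0x" followed by 96 hex digits (48 bytes).
HEX_KEY = re.compile(r"0x[0-9a-fA-F]{96}")

def parse_request(raw: dict) -> dict:
    """Validate an incoming request and return a normalized copy.

    Raises ValueError for any field that is missing or malformed.
    """
    index = raw.get("validator_index")
    key = raw.get("public_key")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(index, int) or isinstance(index, bool) or index < 0:
        raise ValueError("validator_index must be a non-negative integer")
    if not isinstance(key, str) or not HEX_KEY.fullmatch(key):
        raise ValueError("public_key must be 0x followed by 96 hex digits")
    # Only validated fields are passed on, in a canonical form.
    return {"validator_index": index, "public_key": key.lower()}
```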

##### External controls for validating data passed between components

* [[?SSRF]]
* [[?SOC2]] PI 1.2
* [[?SOC2]] PI 1.3

##### Data validation helps address the following risks

* [HCK5](#risk-hck-5)
* [GIR16](#risk-gir-16)

#### Ensure Good Test Coverage

🔗 Node Operators MUST have thorough test coverage of their software and operating procedures

There is no magic percentage figure, but ideally unit tests and integration tests cover every functionality and interaction managed by code the Node Operator uses, whether self-managed or provided by a third party.

##### Relevant external controls for test coverage

* [[?ISO27001]] Annex A 8.29

##### Good test coverage helps address all risks

#### Test All Interactions Impacted by Software Updates

🔗 Updates MUST include an audit of all code and user interactions they impact

This means testing not just the new code deployed, but also existing code that interacts with anything the update changes, to ensure that integration is not introducing a vulnerability. This extends to non-blockchain code used to interact with the Validator, where applicable.

##### Testing all updated interactions helps address the following risks

* [SLS6](#risk-sls-6), [SLS7](#risk-sls-7)
* [DOW2](#risk-dow-2), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14), [DOW19](#risk-dow-19), [DOW20](#risk-dow-20), [DOW21](#risk-dow-21)
* [GIR4](#risk-gir-4), [GIR13](#risk-gir-13), [GIR18](#risk-gir-18), [GIR19](#risk-gir-19), [GIR21](#risk-gir-21)
* [SPS0](#risk-sps-0)

#### Deploy via staging test environments

🔗 Updates MUST be tested on a staging environment that as closely as possible matches the proposed deployment environment before deployment as "production" on a live network

##### Relevant external controls for pre-deployment testing

* [[?ISO27001]] Annex A 8.31

##### Using test environments helps address the following risks

* [SLS6](#risk-sls-6), [SLS7](#risk-sls-7)
* [DOW2](#risk-dow-2), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14), [DOW19](#risk-dow-19), [DOW20](#risk-dow-20), [DOW21](#risk-dow-21)
* [GIR4](#risk-gir-4), [GIR11](#risk-gir-11), [GIR13](#risk-gir-13), [GIR18](#risk-gir-18), [GIR19](#risk-gir-19), [GIR21](#risk-gir-21)
* [SPS0](#risk-sps-0)

#### Maintain Emergency Rollback Procedures

🔗 Node Operators MUST have a process to enable emergency rollback of upgrades
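A rollback process depends on knowing which version was running before the current one. The sketch below keeps an ordered deployment history and returns the version to restore. The class and method names are hypothetical, and a real process would also cover restoring configuration and data.

```python
class ReleaseHistory:
    """Track deployed versions so the previous one can be restored."""

    def __init__(self):
        self._deployed = []

    def deploy(self, version: str) -> None:
        """Record that a version has been deployed to production."""
        self._deployed.append(version)

    def current(self) -> str:
        """Return the version currently recorded as deployed."""
        if not self._deployed:
            raise RuntimeError("nothing has been deployed")
        return self._deployed[-1]

    def rollback(self) -> str:
        """Drop the current version and return the one to restore."""
        if len(self._deployed) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._deployed.pop()
        return self._deployed[-1]
```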

##### Emergency rollback helps address the following risks

* [FIN3](#risk-fin-3), [FIN4](#risk-fin-4), [FIN5](#risk-fin-5), [FIN6](#risk-fin-6)
* [SLS6](#risk-sls-6), [SLS7](#risk-sls-7)
* [DOW2](#risk-dow-2), [DOW11](#risk-dow-11), [DOW12](#risk-dow-12), [DOW13](#risk-dow-13), [DOW14](#risk-dow-14), [DOW19](#risk-dow-19), [DOW20](#risk-dow-20), [DOW21](#risk-dow-21)
* [GIR4](#risk-gir-4), [GIR13](#risk-gir-13), [GIR18](#risk-gir-18), [GIR19](#risk-gir-19), [GIR21](#risk-gir-21)
* [SPS0](#risk-sps-0)

### Controls for Communication and Incident Response {#sec-controls-response}

#### Provide Operational Communication

🔗 Node Operators SHOULD provide regular normal operational communication

This covers general information similar to financial reporting, major changes in staffing (overall size, key positions, strategic focus), and operator-specific information such as governance of onchain systems, key third-party relationships, software partnerships, participation in standards-setting, and the like.

The purpose is to provide confidence to stakeholders that the Node Operator is effectively managed, to enable them to understand the overall goals, and to show operational strengths, and plans to address perceived weaknesses and strategic threats.

##### Operational communications help address the following risks

* [FIN6](#risk-fin-6)
* [RER5](#risk-rer-5)

#### Document Adequate Incident Response Plans

🔗 The Node Operator MUST have documented [=Incident Response Plans=] corresponding to all risks identified in this specification

##### Relevant external controls for incident response

* [[?SOC2]] CC 7.4
* [[?SOC2]] CC 9.1 of Trust Services Criteria

##### Incident Response Planning helps address almost all risks

#### Document Disaster Recovery Plans

🔗 The Node Operator MUST have documented [=Disaster Recovery Plans=] corresponding to risks identified in this specification that lead to destruction of crucial data or loss of assets

##### Relevant external controls for disaster recovery plans

* [[?SOC2]] CC 7.5

##### Disaster recovery plans help address the following risks

* [DOW1](#risk-dow-1), [DOW2](#risk-dow-2), [DOW3](#risk-dow-3), [DOW4](#risk-dow-4), [DOW5](#risk-dow-5), [DOW10](#risk-dow-10)
* [GIR13](#risk-gir-13), [GIR19](#risk-gir-19)
* [RER1](#risk-rer-1), [RER4](#risk-rer-4), [RER5](#risk-rer-5)

#### Analyze Incidents

🔗 [=Incident Response Plans=] and [=Disaster Recovery Plans=] MUST include revising the relevant plans whenever they are activated, based on lessons learned

This covers both responses to real incidents and simulated activation, or [=Pre-mortems=].

##### Relevant external controls for analyzing security events

* [[?SOC2]] CC 7.3

##### Analyzing incidents helps address all risks

#### Perform Regular Incident Response Simulations

🔗 Node Operators MUST perform a simulated Incident and activation of the associated [=Incident Response Plan=] or [=Disaster Recovery Plans=] at least twice per year

##### Incident simulations help address all risks

#### Plan Incident Communication

🔗 Node Operators MUST document [=Incident Communication=] strategies or policies

This requirement includes internal and external communication, both during and after incidents.

##### Incident communications help address the following risks

* [RER5](#risk-rer-5)

#### Verify Counterparty Compliance

🔗 Node Operators MUST verify that third parties providing services, or with whom the Node Operator contracts, are in compliance with relevant standards (including this one) and regulations

This includes areas such as the uptime guarantees of cloud providers and other core counterparties, response times and Service Level Agreements, security procedures, and the like, as well as relevant regulatory compliance.

##### Relevant external controls for counterparty verification

* [[?ISO27001]] Annex A 8.30
* [[?SOC2]] CC 9.2

##### Counterparty verification helps address the following risks

* [FIN1](#risk-fin-1)
* [SLS9](#risk-sls-9)
* [DOW1](#risk-dow-1), [DOW7](#risk-dow-7), [DOW9](#risk-dow-9), [DOW19](#risk-dow-19)
* [GIR5](#risk-gir-5), [GIR14](#risk-gir-14), [GIR22](#risk-gir-22), [GIR24](#risk-gir-24), [GIR25](#risk-gir-25)
* [SPS0](#risk-sps-0)
* [RER2](#risk-rer-2), [RER5](#risk-rer-5)

#### Manage Counterparty Relationship Lifecycles

🔗 Service agreements MUST specify termination procedures and obligations

##### Managing counterparty relationship lifecycles helps address the following risks

* [FIN1](#risk-fin-1)
* [HCK3](#risk-hck-3)

## Status and Feedback

This document is an Editor's draft, for a proposed revision to the [DUCK Knowledge Base (version 1)](https://duck-initiative.gitbook.io/d.u.c.k.-knowledge-base). Feedback is welcome, and is preferred as Issues, Pull requests and comments in this GitHub Repository. Please note the [Conditions of Contributing](./CONTRIBUTING.md).

### History and Future

The original content of this specification was developed as the D.U.C.K Knowledge Base, and the current work is a direct evolution of that content. In updating it, there are several changes being made.

The key change is to move from a general explanation of risks and good practices to a specification that is well-suited to assessment of conformance.

Several somewhat cosmetic changes have been made. Most obviously, the name has been changed to ValOS - the Validator Operator Standard - and instead of a multi-page website it is available primarily as a single-page specification, in particular enabling easier use offline. More importantly, there is a set of controls specific to ValOS, rather than only references to individual controls from other frameworks.

The update process aims to meet some general goals:

- Reduce redundancy
- Use linking more effectively
- Respond to feedback from real-world use, to improve the utility of the specification
- Increase the transparency of and community participation in the maintenance of the specification

### Versions and Version Numbers

The approach to versions for this specification is to maintain a publicly visible "latest Editor's draft", representing the current state of what has been proposed and agreed as updates for a new version, and release versions, numbered 1, 2, 3 etc.

The "Editor's Draft" version may change frequently, for example weekly. It is primarily to serve the needs of the community involved or interested in the process of updating the specification. Part of that community is practitioners such as Node Operators themselves, developers and service providers, and assessors, who want to understand changes that they will need to make to their workflows in the short- to medium-term future. We seek to provide transparency into proposed changes, and the process by which they are agreed or rejected, as well as the history of changes that have been made.

The release versions are intended to provide stable reference points, primarily for clarity in understanding the meaning of a specific assessment against a specific version. The timing of new release versions seeks to balance keeping up with current best practice, and providing a stable target for learning and implementing. It is likely that a release cycle will be on the order of 6 to 18 months. The motivation for a new release can be the time elapsed since the last version, a major change to best practices or risks, or a combination of these factors, among others.

Summary of Controls

This section provides a summary of the Controls provided by this Specification.