Designing High Availability for HashiCorp Vault in AWS
As businesses increasingly look to adopt DevOps practices and culture within their organizations, it’s not uncommon for new tools to be introduced to provide additional capabilities for automation and build pipelines. One of these key tools that organizations are quickly adopting places a spotlight on a subject long overlooked by many – secrets management. A secrets management solution provides a centralized repository for the storage and maintenance of privileged accounts and credentials across a variety of environments within the organization. As a result, this solution can also help strengthen an organization’s security and compliance posture by limiting access to credentials, reducing the number of long-lived credentials, and enabling enhanced auditing for accountability and compliance.
HashiCorp Vault is an enterprise-ready secrets management solution being adopted by many businesses today. With a wealth of features focused on secrets management, Vault makes adoption simple for organizations looking to introduce a solution, or to consolidate existing ones that aren't scalable or fit for enterprise consumption. A wide variety of capabilities, along with many available plugins, make Vault an ideal solution for many use cases, including applications accessing data in MySQL, a continuous delivery pipeline making dynamic API calls to AWS, or an operations team storing persistent secrets for future consumption. With so many options, it's easy to see why Vault is quickly becoming an integral solution for organizations across all industries deploying workloads on-premises or in the public cloud.
As dependency on Vault increases, the Vault design itself must provide all the features of an enterprise-ready solution, such as high availability, scalability, and data protection. Moreover, when deploying Vault in a public cloud such as AWS, the design must meet these requirements at the software level, since access to the underlying hardware and related solutions is not available to consumers. Consequently, AWS constructs such as regions and Availability Zones must be considered to ensure a deployment that is highly available, scalable, and provides ample data protection. The question, then, is this: how do you deploy a critical service such as Vault on a platform where you must always plan for failure?
For high availability, Vault supports clustering to protect against downtime from unplanned outages or planned maintenance, provided it is backed by an HA-capable storage backend, such as HashiCorp Consul. A single node becomes the Active node, while the remaining nodes are Standby nodes. A Standby node can assume the Active role if the current Active node becomes unavailable; until then, Standby nodes forward all requests to the Active node. Using a load balancer, or Consul service discovery, downstream clients can continue interacting with Vault after a failure. Nodes should therefore be deployed in separate fault domains for protection against simultaneous failures.
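As a minimal sketch, the server configuration for one node in such a cluster might look like the following. The hostnames, TLS file paths, and Consul agent address are illustrative placeholders, not values from this article:

```hcl
# Illustrative Vault server configuration for one HA cluster node.
# Hostnames, certificate paths, and addresses are placeholders.

storage "consul" {
  address = "127.0.0.1:8500"  # local Consul agent
  path    = "vault/"          # KV prefix Vault uses in Consul
}

listener "tcp" {
  address         = "0.0.0.0:8200"  # client listener (default port)
  cluster_address = "0.0.0.0:8201"  # node-to-node cluster listener
  tls_cert_file   = "/etc/vault/tls/vault.crt"
  tls_key_file    = "/etc/vault/tls/vault.key"
}

# Addresses advertised to clients and to other cluster members.
api_addr     = "https://vault.example.internal:8200"
cluster_addr = "https://node-a.vault.example.internal:8201"
```

Each node advertises its own `cluster_addr`, which is how Standby nodes know where to forward requests when they are not Active.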
From an AWS perspective, this means that the following design considerations should be taken into account before deploying a cluster:
- Cluster nodes should be provisioned in separate Availability Zones for physical redundancy within a given region
- In an Availability Zone where multiple nodes reside, a Spread Placement Group should be used to ensure nodes do not reside on the same underlying host
- Security groups should permit the listener port (default: tcp/8200) from downstream clients for Vault access and permit the cluster port (default: tcp/8201) between nodes for request forwarding and cluster communication
- Self-referencing security groups can be used to permit communication among cluster nodes while reducing administrative changes when adding or replacing a cluster node
- Use an Application Load Balancer to front-end client access to the Vault service. Health checks should be used to route traffic to the Active node, which returns an HTTP 200 status code
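The security group and load balancer considerations above can be sketched in Terraform as follows. The resource names, the referenced `aws_security_group.vault`, and `var.vpc_id` are assumptions for illustration:

```hcl
# Self-referencing rule: any node in the Vault security group may reach
# the cluster port (8201) on any other node in the same group, so adding
# or replacing a node requires no rule changes.
resource "aws_security_group_rule" "vault_cluster" {
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 8201
  to_port           = 8201
  self              = true
  security_group_id = aws_security_group.vault.id
}

# ALB target group: with a matcher of HTTP 200, only the Active node
# (which answers 200 on /v1/sys/health) is marked healthy; Standby
# nodes answer with a non-200 code and receive no client traffic.
resource "aws_lb_target_group" "vault" {
  name     = "vault"
  port     = 8200
  protocol = "HTTPS"
  vpc_id   = var.vpc_id

  health_check {
    protocol = "HTTPS"
    path     = "/v1/sys/health"
    matcher  = "200"
  }
}
```

Registering every cluster node in the target group is what allows a failover to be transparent to clients: when a Standby is promoted, it simply starts passing the health check.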
To support redundancy and business continuity, many organizations provision applications and critical services in multiple regions, and Vault should be no different. Vault Enterprise has replication functionality built in and offers a variety of methods to keep the Vault service highly available, even across AWS regions. Replication ensures that a consistent set of secrets and enforced policies is available to applications in either region. HashiCorp offers three distinct options to extend the primary cluster for high availability and protection against failure.
Disaster Recovery (DR) Replication has a single purpose: to serve as a warm standby should its primary cluster fail completely. While DR Replication does mirror all secrets, policies, and even authentication tokens and leases, it will not service client requests unless manually, or programmatically, promoted to a primary cluster. Within AWS, DR Replication is generally used to protect against regional failures, as traditional clustering should be sufficient for high availability within a single region. Other use cases may dictate configuring DR Replication within the same region, though. The following design aspects should be considered when provisioning DR Replication in AWS:
- Connectivity between regions should be established for communication between the primary and secondary clusters while minimizing latency. This can be accomplished via cross-region VPC peering, a transit network, or even region-to-region VPN connectivity
- Similar to the primary configuration, cluster nodes should be spread across multiple Availability Zones within the region
- Security groups and NACLs should permit communication between the primary and secondary clusters and all possible downstream clients.
- DR Replicated clusters should be provisioned in a second region, where practical.
- DR Replicated clusters provisioned in the same region should be placed in different Availability Zones than the primary cluster, where available
- To maintain a single URL for Vault services, use a Route53 Failover routing policy, along with health checks, to fail the primary Vault DNS record over to the DR Replicated cluster
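A Route53 failover policy along those lines might be sketched in Terraform as below. The zone ID, record names, and endpoint FQDNs are placeholders, and the health check assumes the primary answers HTTP 200 on its health endpoint:

```hcl
# Health check against the primary cluster's health endpoint.
resource "aws_route53_health_check" "vault_primary" {
  fqdn              = "vault-primary.example.com"
  port              = 8200
  type              = "HTTPS"
  resource_path     = "/v1/sys/health"
  failure_threshold = 3
  request_interval  = 30
}

# PRIMARY record: answered while the health check passes.
resource "aws_route53_record" "vault_primary" {
  zone_id         = var.zone_id
  name            = "vault.example.com"
  type            = "CNAME"
  ttl             = 60
  set_identifier  = "primary"
  records         = ["vault-primary.example.com"]
  health_check_id = aws_route53_health_check.vault_primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

# SECONDARY record: answered only when the primary check fails.
resource "aws_route53_record" "vault_dr" {
  zone_id        = var.zone_id
  name           = "vault.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "secondary"
  records        = ["vault-dr.example.com"]

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

Because the DR cluster still has to be promoted before it will service requests, DNS failover covers the client-facing address; the promotion itself remains a manual or programmatic step.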
Performance Replication, on the other hand, is a way to extend Vault services to additional regions, or even on-premises, and have the cluster service its local clients. Similar to DR Replication, Performance Replication mirrors the configuration and static secrets from the primary cluster; however, it does not replicate tokens and leases, so local clients must authenticate with the local cluster to obtain their own. Performance Replication also detects any writes to Vault and forwards them to the primary cluster. The resulting benefit is a local Vault cluster that can service all Vault reads, allowing faster queries and less network traffic between regions. From an AWS perspective, Performance Replication is generally used to extend Vault functionality across accounts, VPCs, or regions for high availability, or to provide scale-out capacity for heavy read workloads. Reduced network traffic between regions and VPCs, especially through a transit VPC, can also reduce outbound network charges on monthly AWS bills. Design considerations for Performance Replication include:
- Place Performance Replicated clusters near the applications they will service (i.e., same region, same VPC, same Availability Zones, etc.)
- A separate Route53 DNS record should be used for communication with the Performance Replicated cluster, and applications needing read-only access should use it
- To reduce outbound network traffic, use Mount Filters to limit the secrets replicated to the secondary cluster, where applicable. Mount Filters can also help satisfy data residency requirements such as GDPR
- Performance Replicated clusters require appropriate access across the AWS infrastructure to create dynamic secrets, communicate with AWS APIs, and access additional resources for all enabled secrets engines
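As one example of that last point, a secondary cluster running the AWS secrets engine needs IAM permissions to manage the users it issues. A minimal sketch in Terraform, assuming dynamically created IAM users carry a `vault-` prefix (the policy name, prefix, and exact action list are illustrative and should be checked against the secrets engine's documentation for your Vault version):

```hcl
# Illustrative IAM policy for a cluster running the AWS secrets engine.
# Scope the Resource to the prefix used for Vault-generated users.
resource "aws_iam_policy" "vault_aws_secrets" {
  name = "vault-aws-secrets-engine"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "iam:CreateUser",
        "iam:CreateAccessKey",
        "iam:DeleteUser",
        "iam:DeleteAccessKey",
        "iam:PutUserPolicy",
        "iam:DeleteUserPolicy",
        "iam:AttachUserPolicy",
        "iam:DetachUserPolicy",
        "iam:ListAccessKeys",
        "iam:ListAttachedUserPolicies",
        "iam:ListUserPolicies",
        "iam:ListGroupsForUser",
        "iam:RemoveUserFromGroup",
      ]
      Resource = ["arn:aws:iam::*:user/vault-*"]
    }]
  })
}
```

The same reasoning applies to any other secrets engine enabled on the replicated cluster: each one needs its backing permissions provisioned in the secondary's account and region, not just the primary's.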
Last but not least, HashiCorp recently released functionality for Performance Standby nodes. Similar to Performance Replication clusters, Performance Standby nodes can locally service read-only requests from clients while transparently forwarding write requests to the Active node. Within AWS, this can be used to horizontally scale read capacity rather than having to scale up the primary node. A few design practices to consider when deploying Performance Standby nodes:
- Provision Performance Standby nodes as close as possible to the applications they will service (same subnet, same VPC, same Availability Zone, same region)
- Create a separate Route53 DNS record to direct read-heavy applications to the Performance Standby nodes. Alternatively, build logic into the application to use the Performance Standby address for reads and the primary cluster address for writes and updates
- Performance Standbys are still Standby nodes, so provision them in Availability Zones where existing nodes do not reside, where applicable
While ensuring Vault services are highly available, it's equally important to ensure the storage backend is highly available as well. Many of the same practices listed above should be used to provision HashiCorp Consul across multiple regions, accounts, VPCs, and subnets. In addition, architects can make use of Consul Enterprise features such as Redundancy Zones and Enhanced Read Scalability to help ensure Consul remains available in the event of an Availability Zone or regional failure. By combining AWS best practices for high availability with the Enterprise features of Vault and Consul, organizations can ensure Vault remains highly available as it becomes a critical service to the IT organization.