Consul cluster in AWS with autoscaling, Lambda and lifecycle hooks

Introduction

Recently I have been playing around with microservices-based applications using the Amazon EC2 Container Service (ECS). Amazon ECS is a container management service that facilitates running a cluster of containers on top of EC2 instances.

Service discovery is an important component of most systems based on a microservices architecture. One option in AWS is to rely upon Elastic Load Balancing, Route 53 and Lambda to implement service discovery. An example of this is described in this blog post.

Another option for service discovery is Consul by Hashicorp, and in this post I will outline how to deploy a Consul server cluster in AWS using Amazon Cloudformation. The cluster will be deployed in Docker containers on EC2 instances in an autoscaling group, and I will also go through how I solved automatic bootstrapping of the cluster as well as graceful (kind of..) termination of Consul nodes using a combination of lifecycle hooks and a Lambda function.

Cloudformation templates and other resources referred to in this post can be found on GitHub.

Solution

The Cloudformation template for deploying the Consul cluster stack can be found here.

The Cloudformation template does the following:

  1. Creates a VPC stack with a VPC and two subnets. The VPC stack is described in a separate template and linked from the main template. The VPC stack template also creates an Application Load Balancer that will later be placed in front of the Consul web GUI.
  2. Creates a security group to allow communication between the Consul servers.
  3. Creates a security group to allow SSH access to the EC2 instances.
  4. Creates the IAM role for the Consul server instances.
  5. Creates a launch configuration for the EC2 instances, using metadata and the cfn-init script to configure the instances, e.g. to install and start the Consul Docker containers. See below for more details about bootstrapping of the Consul cluster.
  6. Launches an autoscaling group.
  7. Creates a load balancer TargetGroup for the EC2 Consul server instances.
  8. Creates a new listener for port 8080 on the load balancer and sets the TargetGroup above as default. This will be used to load balance requests to the Consul GUI and HTTP API.
  9. Creates an SNS topic, a Lambda function and an autoscaling lifecycle hook to handle removal of the Consul node when an EC2 instance terminates as part of scaling down. More details about this are described below.

 

[Figure: VPC design]

[Figure: Consul cluster design]

Deploy the stack

In the GitHub repo for this post there are a couple of scripts that can be used to easily create and update the complete Cloudformation stack. Alternatively you can create the stack using the AWS Console.

Prerequisites

To deploy the stack using the script you need to run on Mac or Linux and have the following (a quick tooling check is shown after the list):

  • AWS CLI installed and configured
  • jq installed
  • AWS permissions to create VPC, EC2, SNS and Lambda resources
  • An AWS S3 bucket. This will be used to upload the Cloudformation templates.
  • An AWS Key Pair.
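
A quick way to sanity-check that the tooling is in place before running the scripts:

aws --version        # AWS CLI installed
aws configure list   # credentials and default region configured
jq --version         # jq installed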

The Cloudformation template used in this post currently only supports deployment in the EU or US regions.

Create the stack

  1. Clone the GitHub repo
  2. Run the script to create the Consul cluster. Replace <your_stack_name> with a name that you choose for your stack. Replace <s3_bucket> with the name of your S3 bucket. Replace <key_pair_name> with the name of your EC2 Key Pair. Replace <availability_zones> with a list of two availability zones to be used, e.g. eu-west-1a,eu-west-1b
    cd limestone-consul/cloudformation
    chmod 700 create-consul-stack
    ./create-consul-stack <your_stack_name> <s3_bucket> <key_pair_name> <availability_zones>
    

    Example:

    ./create-consul-stack Consul MyConsulBucket mykeypair eu-west-1a,eu-west-1b
    

The script does the following (a sketch of the corresponding CLI calls is shown after the list):

  1. Zips the NodeJS code for the Lambda function
  2. Uploads the Lambda zip file to S3
  3. Uploads the Cloudformation templates for the VPC and Consul stacks to S3
  4. Creates the stack using the AWS CLI
  5. Waits for the stack to be created. Go and grab a coffee, it will take a few minutes to complete.
  6. Outputs the URL where the Consul GUI can be accessed. The user name for the Consul GUI is admin and the default password is conSuL@aws1.
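
For reference, here is a minimal sketch of the kind of CLI calls such a script wraps; file names, template names and parameter names are placeholders, see the actual script in the repo for the real ones:

# Package and upload the Lambda function (the S3 key matches the one referenced
# by the Cloudformation template)
zip asg-lifecycle-hook-lambda.zip index.js
aws s3 cp asg-lifecycle-hook-lambda.zip s3://<s3_bucket>/limestone/asg-lifecycle-hook-lambda.zip

# Upload the Cloudformation templates (file names are placeholders)
aws s3 cp consul.json s3://<s3_bucket>/limestone/consul.json
aws s3 cp vpc.json s3://<s3_bucket>/limestone/vpc.json

# Create the stack and wait for completion
aws cloudformation create-stack --stack-name <your_stack_name> \
  --template-url https://s3.amazonaws.com/<s3_bucket>/limestone/consul.json \
  --capabilities CAPABILITY_IAM \
  --parameters ParameterKey=KeyName,ParameterValue=<key_pair_name>
aws cloudformation wait stack-create-complete --stack-name <your_stack_name>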

Now you can try to add and remove instances from the autoscaling group and verify that the Consul nodes are added and removed from the cluster as expected.

Bootstrapping details

One challenge with Consul in combination with auto scaling is how to handle bootstrapping of the cluster. The manual way of bootstrapping Consul is to first launch one server in bootstrap mode and then let the other servers join the bootstrap server. This however requires that the IP of the bootstrap server (or any other node) is known. Hashicorp has a service called Atlas which can be used to auto-join servers into a cluster. In this post I will however use another technique to join the servers into a cluster.

The approach I am using is as follows:

  1. When starting a Consul server EC2 instance we use the AWS CLI to look up the IPs of the other instances in the auto scaling group
  2. We then supply all these IPs to the -join parameter when starting the Consul server docker container

To find out which auto scaling group an EC2 instance belongs to we can do the following (note that if you want to test this you have to run it from within an EC2 instance):

ASG_NAME=$(aws autoscaling describe-auto-scaling-instances \
--region eu-west-1 \
--instance-ids $(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id) \
| jq -r '.AutoScalingInstances[0].AutoScalingGroupName')

Now that we know the auto scaling group name, we can use it to find out which EC2 instances are currently running within that group.

INSTANCE_IDS=$(aws autoscaling describe-auto-scaling-groups \
--region eu-west-1 --auto-scaling-group-names $ASG_NAME \
| jq -r '.AutoScalingGroups[0].Instances[] | .InstanceId')

Finally, we can obtain the private DNS names of the Consul server instances and use them to join the new instance to the cluster, as sketched below.

INSTANCE_DNS_NAMES=$(aws ec2 describe-instances --region eu-west-1 \
--instance-ids $INSTANCE_IDS \
| jq -r '.Reservations[].Instances[] | .PrivateDnsName')
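
The DNS names can then be fed into the Consul server container. Below is a minimal sketch, assuming the official consul Docker image and a three-server cluster; the exact command lives in the launch configuration in the Cloudformation template. The node name is set to the EC2 instance id, which is what the lifecycle hook Lambda described later uses when it force-leaves the node:

# Turn the DNS names into -join flags (sketch; the launch configuration in the
# template may differ)
JOIN_FLAGS=""
for NAME in $INSTANCE_DNS_NAMES; do
  JOIN_FLAGS="$JOIN_FLAGS -join $NAME"
done

INSTANCE_ID=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)
PRIVATE_IP=$(wget -q -O - http://169.254.169.254/latest/meta-data/local-ipv4)

# Start the Consul server container; -node is set to the instance id so the
# node can later be removed with force-leave using the instance id
docker run -d --net=host --name consul consul agent -server -ui \
  -node $INSTANCE_ID \
  -advertise $PRIVATE_IP \
  -client 0.0.0.0 \
  -bootstrap-expect 3 \
  $JOIN_FLAGS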

Termination of Consul nodes

Removing Consul servers from a cluster must be done with care in order to avoid an availability outage. In a cluster with N servers, a quorum of at least (N/2)+1 servers must be available for the cluster to function; for example, a cluster of 5 servers can tolerate the loss of at most 2.

When the auto scaling group terminates a Consul server instance, the remaining nodes in the Consul cluster cannot know whether the terminated node is really dead or just unreachable due to a networking issue. Nodes that are not reachable will be put in the “failed” state by Consul, but they are still included in the Raft quorum. After a timeout (default 72 hours), Consul will “reap” a failed node, forcing it into the “left” state and removing it from the quorum.

There is a configuration option in Consul called leave_on_terminate which, if enabled, will send a Leave message to the Consul cluster when the node receives a TERM signal. When running Consul in a Docker container this approach however does not seem to work, probably due to how Docker terminates containers upon instance shutdown.
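
For reference, leave_on_terminate is a plain agent configuration option; a minimal config file enabling it (not taken from the template in this post) would look like this:

{
  "leave_on_terminate": true
}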

So, to avoid having to wait for Consul to “reap” failed nodes when auto scaling terminates an instance, we want to tell Consul to remove the node on the terminating instance in a more graceful manner. To accomplish this we can leverage the concept of lifecycle hooks that can be configured for AWS autoscaling groups. More information about autoscaling lifecycle hooks can be found here.

The process to achieve this is:

  1. Configure a lifecycle hook on our autoscaling group that publishes an SNS notification when an instance is about to be terminated. When the lifecycle hook is triggered, the autoscaling group will wait for a callback before proceeding to terminate the instance.
  2. Create an AWS Lambda function that subscribes to the SNS topic.
  3. When invoked, the Lambda function will do two things:
    • Tell Consul to put the node in the “left” state. This is done by sending a force-leave command via the Consul HTTP API (an example request is shown after the list).
    • Signal to the autoscaling group that the lifecycle hook is completed.
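
Both steps can also be exercised by hand, which is useful for testing; host names, ports and identifiers below are placeholders:

# 1. Tell Consul to force-leave the node (the Lambda below issues the same
#    HTTP request; the node name is the EC2 instance id)
curl http://<load_balancer_dns_name>:8080/v1/agent/force-leave/<node_name>

# 2. Signal the autoscaling group that the lifecycle action is complete so
#    that termination can proceed
aws autoscaling complete-lifecycle-action \
  --auto-scaling-group-name <asg_name> \
  --lifecycle-hook-name <hook_name> \
  --lifecycle-action-token <token> \
  --lifecycle-action-result ABANDON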

Lambda function

The code for the Lambda function is outlined below. Note that information about the Consul HTTP API hostname and port is passed in the SNS message as metadata.

const AWS = require('aws-sdk');
const http = require('http');
var as = new AWS.AutoScaling();


exports.handler = (event, context, callback) => {
  var message = JSON.parse(event.Records[0].Sns.Message);
  console.log("SNS message contents. \nMessage:\n", message);
  var params = {
      "AutoScalingGroupName" : message.AutoScalingGroupName,
      "LifecycleHookName" : message.LifecycleHookName,
      "LifecycleActionToken" : message.LifecycleActionToken,
      "LifecycleActionResult" : "ABANDON"
    };

  // Get host name and port from metadata
  var metaData = JSON.parse(message.NotificationMetadata);

  forceLeaveConsulNode(message.EC2InstanceId, metaData.LoadBalancerDNSName, metaData.Port, function() {
    completeLifecycleAction(params);
  });

  function forceLeaveConsulNode(instanceId, host, port, callback) {
     var req = http.get('http://'+host+':'+port+'/v1/agent/force-leave/' + instanceId, function(res) {
         console.log(`Got response: ${res.statusCode}`);
         res.resume();
         callback();
     }).on('error', (e) => {
         console.log(`Got error: ${e.message}`);
         callback();
     });
  }


  function completeLifecycleAction(params) {
      as.completeLifecycleAction(params, function(err, data){
          if (err) {
            console.log("CompleteLifecycleAction lifecycle completion failed.\nDetails:\n", err);
            callback(err);
          } else {
            console.log("CompleteLifecycleAction Successful.\nReported:\n", data);
            callback(null);
          }
        });
  };
};

Cloudformation snippets

Let's have a look at some snippets from the Cloudformation template needed to put it all together.

We begin with the Lambda function. To start with, we need an IAM role that allows the Lambda function to perform the CompleteLifecycleAction call on the autoscaling group.

"LambdaAsgLifecycleHookRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "lambda.amazonaws.com"
                ]
              },
              "Action": [
                "sts:AssumeRole"
              ]
            }
          ]
        },
        "Path": "/",
        "Policies": [
          {
            "PolicyName": "root",
            "PolicyDocument": {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": [
                    "autoscaling:CompleteLifecycleAction"
                  ],
                  "Resource": "*"
                },
                {
                  "Effect": "Allow",
                  "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents"
                  ],
                  "Resource": "arn:aws:logs:*:*:*"
                }
              ]
            }
          }
        ]
      }
    }

Then we can create the Lambda function resource.


"LambdaAsgLifeCycleHookFunction": {
      "Type": "AWS::Lambda::Function",
      "DependsOn": [
        "LambdaAsgLifecycleHookRole"
      ],
      "Properties": {
        "Code": {
          "S3Bucket": {"Ref": "S3Bucket"},
          "S3Key": "limestone/asg-lifecycle-hook-lambda.zip"
        },
        "Role": {
          "Fn::GetAtt": [
            "LambdaAsgLifecycleHookRole",
            "Arn"
          ]
        },
        "Timeout": 10,
        "Handler": "index.handler",
        "Runtime": "nodejs4.3",
        "MemorySize": 128
      }
    }

Now we need an SNS topic, and we make the Lambda function subscribe to that topic.

"ConsulAsgTopic": {
      "Type": "AWS::SNS::Topic",
      "Properties": {
        "Subscription": [
          {
            "Endpoint": {
              "Fn::GetAtt": [
                "LambdaAsgLifeCycleHookFunction",
                "Arn"
              ]
            },
            "Protocol": "lambda"
          }
        ],
        "TopicName": {
          "Fn::Join": [
            "",
            [
              {"Ref": "AWS::StackName"},
              "AsgLifecycleHookTopic"
            ]
          ]
        }
      }
    }

We also need to give the SNS topic permission to invoke the Lambda function.

"LambdaInvokePermission": {
      "Type": "AWS::Lambda::Permission",
      "Properties": {
        "Action": "lambda:InvokeFunction",
        "Principal": "sns.amazonaws.com",
        "SourceArn": {"Ref": "ConsulAsgTopic"},
        "FunctionName": {
          "Fn::GetAtt": [
            "LambdaAsgLifeCycleHookFunction",
            "Arn"
          ]
        }
      }
    }

Now it is time to create the autoscaling lifecycle hook. First we need to give autoscaling permission to publish to the SNS topic, so we need to define a role for this.

"ConsulAsgLifecycleRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "autoscaling.amazonaws.com"
                ]
              },
              "Action": [
                "sts:AssumeRole"
              ]
            }
          ]
        },
        "Path": "/",
        "Policies": [
          {
            "PolicyName": "root",
            "PolicyDocument": {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Resource": "*",
                  "Action": [
                    "sns:Publish"
                  ]
                }
              ]
            }
          }
        ]
      }
    }

…and then the actual lifecycle hook

"ConsulAsgLifecycleHook": {
      "Type": "AWS::AutoScaling::LifecycleHook",
      "DependsOn": [
        "VPCStack"
      ],
      "Properties": {
        "AutoScalingGroupName": {"Ref": "ConsulInstanceAsg"},
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        "NotificationTargetARN": {"Ref": "ConsulAsgTopic"},
        "RoleARN": {
          "Fn::GetAtt": [
            "ConsulAsgLifecycleRole",
            "Arn"
          ]
        },
        "NotificationMetadata" : {
          "Fn::Join": [
            "",
            [
              "{\"LoadBalancerDNSName\":\"",
              {
                "Fn::GetAtt": [
                  "VPCStack",
                  "Outputs.LoadBalancerDNSName"
                ]
              },
              "\",\"Port\":8080}"
            ]
          ]
        }
      }
    }

Note that we define the load balancer DNS name and the Consul HTTP port as metadata to be included in the notification message. This is then used by the Lambda function to invoke the Consul HTTP API to force-leave the terminating node.
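
For a concrete picture, the joined metadata string resolves to a small JSON document in the SNS notification, for example (the DNS name below is just an illustration):

{"LoadBalancerDNSName":"consul-alb-123456789.eu-west-1.elb.amazonaws.com","Port":8080}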
