A (very) high level overview on AWS CloudFormation

What is AWS CloudFormation?

  • AWS CloudFormation is an AWS native service focused on IaC (Infrastructure as Code) and one of the core components of common DevOps practices in the context of service deployments into AWS
  • AWS CloudFormation is meant to handle AWS related infrastructure
  • As with many other services, AWS CloudFormation is a so-called regional service, while services such as R53 are global ones

Things around AWS CloudFormation

  • CloudFormation (aka CF) is the AWS “Swiss army knife” service to codify resources in the AWS cloud
  • Native AWS service that can be used at no extra cost (which obviously does not apply to the services that get deployed with AWS CF)
  • AWS CF shines within the AWS cloud and therefore has no native integration into other clouds
  • IaC outside AWS with CloudFormation is a bit tricky but AWS Lambda (so by building AWS custom resources) could become your friend
  • AWS Quick Starts shares a repository of Third-Party tools
  • Allows you to deploy anything that can be done manually through the Console
    • EC2, VPC, Subnet, RDS, …
  • Native JSON or YAML to write high-level descriptive configuration files
  • Graphical UI to generate deployments by drag-and-drop methodology
  • CF always knows state of a given deployment at any time
  • Deletion protection, or otherwise consistent cleanups with the benefit of not being left with leftover artifacts
    • Though: additions outside the CF template’s deployment may not be deleted 
      (example a deployed service will generate SSM Parameter Store entries – those would not be deleted automatically)
  • Deployment logs are available through CloudWatch
  • For larger deployments, and/or to find the needle in the haystack, AWS Athena can be used to search CloudWatch logs with native SQL
  • Automated roll backs in case of failures
    • Helps to
      • Keep the environment secure, so that half-finished deployments do not compromise security
      • Keep costs low, since a half-finished and non-functioning deployment could still cost money
    • “Nested stacks” with templates as modules
      • An individual template in a nested stack acts as a silo or block to create resources, infrastructure
      • Allows to create fully self-contained infrastructure deployments
        Therefore during removal no artifacts would be left over
      • Simplifies handing over and routing parameters
      • Resources can be deployed by being dependent on other stacks
  • Change sets to modify, change and update given deployments
    (though it depends on the service whether it results in an update or a replacement)
  • Parameters can be handed through other templates at runtime (inside and outside a nested stack)
  • AWS SSM can access AWS CloudFormation parameters at runtime for a given stack
  • AWS AppConfig and/or AWS SSM Parameter Store to host configurations and parameters at an application level in addition to AWS CloudFormation makes it a very powerful combo
  • Re-usable IaC
    • A best practice is to write templates as region-, account- and AZ-agnostic as possible, so that
      • the deployment itself will work as it has no hard bindings
      • the code can be re-used by others
  • CF can be interacted with via the Console, the CLI or the API (a short boto3 sketch follows below)
    … think about the capabilities when mixing the CF API, automation and a ticketing system …
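
To make the API interaction a bit more tangible, here is a minimal boto3 sketch (not taken from the article): it deploys a template and later previews a modification through a change set. The stack name, the template file and the parameter are hypothetical placeholders.

import boto3

cfn = boto3.client("cloudformation", region_name="eu-west-1")

with open("queue-stack.yaml") as f:          # hypothetical template file
    template_body = f.read()

# Initial deployment of the stack
cfn.create_stack(
    StackName="hsl-demo",                    # hypothetical stack name
    TemplateBody=template_body,
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "dev"}],
)
cfn.get_waiter("stack_create_complete").wait(StackName="hsl-demo")

# Later: preview a modification as a change set before executing it
cfn.create_change_set(
    StackName="hsl-demo",
    ChangeSetName="demo-change",
    TemplateBody=template_body,
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "dev"}],
)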

High Speed Logging meets Serverless Design

Previous articles on this blog talked about High Speed Logging as well as designs on how to process HSL, High Speed Logs.

There’s more than one way to skin a cat …

This article and a series of others cover design and architectural ideas and options to turn a self-hosted, AWS EC2 based infrastructure into a fully automated, fully service based deployment by utilizing public Cloud provider offerings such as those AWS provides.

Furthermore it will be shown which benefits the various deployment options provide since – as always – there’s more than one way to skin a cat.

The Architecture

HSL High Speed Log with AWS ELB Logs
ELB Logs on a serverless High Speed Log processing architecture

AWS ELB’s logs are sourced to provide an infrastructure and security type of monitoring. The log format itself looks similar to a web access log. Typically an AWS ELB – Elastic Load Balancer – creates a new log file every 5 minutes. In very high traffic situations this can happen more often.

The Steps:

  • The log file is stored in an AWS S3 bucket as configured within the AWS ELB’s configuration.
  • AWS S3 generates a trigger to an AWS SQS queue which an AWS Lambda function listens to.
  • Whenever new messages arrive in AWS SQS, Lambda picks them up. The individual message contains information about the object’s key (the path to the log file as stored by AWS ELB).
  • The AWS Lambda Function reads the object to break the various log lines into key/values for further processing. In this use case, where data is processed into AWS Elasticsearch, the k/v is translated into a JSON formatted string. This ensures that later in the chain AWS Elasticsearch can consume and index the document easily (a minimal Lambda sketch follows after this list).
  • The AWS Lambda Function code also could enrich data where and if needed to meet business requirements.
  • The next step is to push the data forward, which is where AWS Kinesis joins, as it is highly capable of consuming loads of records in a short amount of time.
  • AWS Firehose acts as the glue in the chain to consume records out of AWS Kinesis and to move those forward into AWS Elasticsearch, the use case scenario in this article. If desired, AWS Firehose can furthermore manipulate records via AWS Lambda.
  • The final step is for AWS Elasticsearch to index and store the data for further consumption, analytics or search in general.
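
To illustrate the Lambda step in the chain, a minimal sketch (not production code) could look as follows. The Kinesis stream name and the field mapping are made up for this example, enrichment and error handling are left out, and ALB access logs would additionally have to be gunzipped before parsing.

import json
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")
STREAM = "hsl-elb-logs"   # hypothetical Kinesis stream name


def handler(event, context):
    records = []
    for message in event["Records"]:           # messages delivered by the SQS event source mapping
        body = json.loads(message["body"])     # each body carries an S3 event notification
        for s3_event in body.get("Records", []):
            bucket = s3_event["s3"]["bucket"]["name"]
            key = s3_event["s3"]["object"]["key"]
            obj = s3.get_object(Bucket=bucket, Key=key)
            for line in obj["Body"].read().decode().splitlines():
                fields = line.split(" ")
                # Simplified k/v translation of a log line into a JSON document
                doc = {"raw": line, "type": fields[0], "timestamp": fields[1]}
                records.append({"Data": json.dumps(doc).encode(), "PartitionKey": key})
    if records:
        # put_records accepts up to 500 records per call; chunking is omitted in this sketch
        kinesis.put_records(StreamName=STREAM, Records=records)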

What does the architecture help to meet?

Increasing traffic patterns

In case of growing traffic volumes more log files are generated. In a different article the processing of ELB logs has been described as an asynchronous and self-learning environment hosted on EC2 instances. While that solution has its pros, it has one hard limitation: throughput at high scale. The architecture described in the schema above is meant to allow high throughput in a given AWS Region with the intention to store data in a search engine.

Automated deployment

All components chosen can be deployed by utilizing AWS CloudFormation, which allows a full abstraction away from traditional deployment strategies and therefore a move towards Infrastructure as Code principles. A different article covers AWS CloudFormation Stacks and StackSets so that a full environment can be turned into a push-button deployment.

All in all, even when deployed manually through the AWS Console or the AWS CLI, the utilization of CloudFormation templates makes a deployment re-usable, so that deployments per region or into various AWS accounts can be done in a repeating pattern while consistency is ensured.

Maintenance

In traditional instance based deployments certain routines have to be taken care of every once in a while. A prominent example certainly is OS security patching to keep an EC2 based deployment up to date on the latest security patches – a procedure which typically requires lots of overhead and attention from application owners, IT teams and the teams handling all maintenance efforts. Each patch cycle also poses a risk to stability, as each patching effort is followed by an instance reboot, which can cause struggles for applications executed on those EC2 instances or for dependent applications that may struggle with their absence during the reboot.

Scale

Components chosen to support the architecture described in this article are meant to scale and to meet traffic patterns as they rise. Monitoring can be implemented to let AWS Kinesis increase or decrease shards when certain thresholds are crossed. Certain traffic patterns could even require incoming events into AWS Kinesis to be throttled down. A variety of options is possible.

Lambda scales alongside the amount of messages waiting in AWS SQS to be picked up. Scaling here is very likely faster compared to a full AWS EC2 deployment including application deployment and configuration. During low traffic windows it is ensured that the setup operates with little to no overhead, whereas AWS EC2 instances could otherwise end up idling.
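
The shard adjustment mentioned above could, for example, be driven by a small routine like the following sketch; the stream name is a placeholder and the decision logic (thresholds, CloudWatch metrics) is intentionally left out.

import boto3

kinesis = boto3.client("kinesis")
STREAM = "hsl-elb-logs"        # hypothetical stream name


def adjust_shards(target_shards: int):
    """Scale the stream to the requested shard count (uniform scaling)."""
    summary = kinesis.describe_stream_summary(StreamName=STREAM)
    current = summary["StreamDescriptionSummary"]["OpenShardCount"]
    if target_shards != current:
        kinesis.update_shard_count(
            StreamName=STREAM,
            TargetShardCount=target_shards,
            ScalingType="UNIFORM_SCALING",
        )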

AWS EC2 metadata

The 169.254.169.254 address

The 169.254.169.254 endpoint can be reached by an EC2 instance locally at runtime. The address provides all sorts of metadata which help to operate, audit and maintain an instance during its lifetime.

The pseudo address can be reached locally only (on the instance):

http://169.254.169.254/latest/meta-data/

Besides metadata such as ami-id, hostname, instance-id and placement (the AZ, availability zone) it also provides security related information.

Example:

curl http://169.254.169.254/latest/meta-data/instance-type

m5.xlarge

If utilized correctly, the information provided through the 169.254.169.254 endpoint is of tremendous help to maintain code more effectively and to reduce the risk of access keys being leaked into code, code repositories (git, svn) or even documentation. Depending on what policy is attached to those keys, this can raise serious security concerns.

Increased security and simplified code

As an IAM profile can be assigned to any EC2 instance (one at a time), it allows security to be increased significantly and, on top of that, makes key handling much easier. The need to share keys, the risk of having them embedded in code, the need to track keys to get them rotated and so on can be eliminated in favor of higher security.

Example:

#curl http://169.254.169.254/latest/meta-data/iam/security-credentials/<role name>

{
  "Code" : "Success",
  "LastUpdated" : "2020-11-19T16:47:54Z",
  "Type" : "AWS-HMAC",
  "AccessKeyId" : "ABCDEFGHIJKLMNOPQRST",
  "SecretAccessKey" : "1234567890ABCDEFGHIJKLMNOPQRST",
  "Token" : "…",
  "Expiration" : "2020-11-19T23:14:05Z"
}

The official documents are provided at AWS here.

Code samples

PHP

GetRegion reads the placement information from the pseudo address at http://169.254.169.254/latest/meta-data/placement. With CredentialProvider::instanceProfile() it allows the configuration handling described above to be avoided.

public function __construct()
{
    // Credentials are pulled from the instance profile, no keys in code or configuration
    $provider = CredentialProvider::instanceProfile();
    $memoizedProvider = CredentialProvider::memoize($provider);

    $this->awsclient = new Client(
        array(
            'credentials' => $memoizedProvider,
            'region'  => $this->GetRegion(),
            'version' => 'latest'
        )
    );
}

Python

import boto3

Region = "eu-west-1"   # example region; it can also be derived from the instance metadata, see below

# Credentials are resolved automatically, e.g. from the instance profile
session = boto3.Session(region_name=Region)
credentials = session.get_credentials()
credentials = credentials.get_frozen_credentials()

Key = credentials.access_key
Secret = credentials.secret_key
Token = credentials.token

client = boto3.client('',                    # placeholder for the service name, e.g. 's3'
            aws_access_key_id=Key,
            aws_secret_access_key=Secret,
            aws_session_token=Token,
            region_name=Region
            )
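
The Region value above can itself be derived from the metadata endpoint instead of being hard-coded. A minimal sketch, assuming IMDSv1 exactly as in the curl examples above (IMDSv2 would additionally require a session token):

import urllib.request

METADATA = "http://169.254.169.254/latest/meta-data"


def get_region() -> str:
    # The placement path returns the AZ (e.g. eu-west-1a); stripping the trailing letter yields the region
    az = urllib.request.urlopen(f"{METADATA}/placement/availability-zone", timeout=2).read().decode()
    return az[:-1]


Region = get_region()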

Cloud native services vs self-maintained deployments

Many articles talk about how to get AWS SQS deployed. This blog also provides an overview on how to do a CloudFormation based deployment of AWS SQS into the AWS Cloud.

However, this article is rather about the benefit and value Cloud native services can provide, and furthermore how they can ease operations and add operational stability – with AWS SQS taken as an example.

What is AWS SQS?

AWS SQS stands for Amazon Simple Queue Service and provides a message queuing service, fully managed by AWS.

Similar to other queuing technologies it helps to design and build decoupled and event driven applications where a queue or broker receives messages from producers and provides those to consumers to process further.

Costs

As with other cloud native services, only what is actually consumed and used has to be paid for. Therefore it allows very cost effective usage since no idle time on resources has to be considered.

Services such as AWS SQS ideally are deployed based on CloudFormation templates, which allow a quick and re-usable pattern up to a fully automated – for example event driven, on-demand driven – setup. Of course a setup through the AWS CLI or the AWS Console can be done as well.

A significant cost factor which has to be taken into consideration is the deployment itself: time and effort have to be included in a cost model to host services such as RabbitMQ, AMQ and similar, and compared with native cloud services such as AWS SQS in this example. Once deployed, follow-up efforts have to be considered in a full cost model as well, such as OS patching, service updates/upgrades, service health monitoring and so forth, which again cost time and effort. Surrounding frameworks to install and maintain a service, such as Chef or Puppet, which themselves need maintenance, uptime, monitoring and so forth, have to be taken into a full 360° view as well. Very quickly it can end up in a hairball of dependencies where various teams have to interact accordingly to get a service deployed and maintained.

Conclusion: The chain of dependencies influences how complex and time consuming – and, all in all, how costly – a self-hosted deployment for a given service can become. These efforts are a significant parameter which has to be taken into consideration when Cloud native services are compared with self-maintained deployments.

Performance and Scale

AWS SQS can handle messages in a FIFO pattern in case the order of messages is a key requirement – however, FIFO has limits regarding message handling throughput. The standard message handling is built for scale and throughput at maximum levels. While an SQS FIFO queue can handle up to 3,000 messages per second (with batching), a standard SQS queue can handle a nearly unlimited number of transactions per second – the so-called TPS value.

While this article takes AWS SQS specifically as an example to compare Cloud native services vs self-maintained deployments, the performance and scale described above for AWS SQS can typically be expected from Cloud native services in general. They scale up or down to meet ongoing needs, which also leads to an ability to effectively manage performance and costs.

Deployment

Since AWS SQS is a fairly non-complex service from a usage perspective, the deployment experience itself is similarly easy. A little attention should be paid to the producers and consumers and how they handle messages, specifically how often a read/pull is planned for. Each call to the SQS API to receive a message will be charged – no matter whether the returned answer contains an empty body or not.

AWS SQS configuration parameters such as the receive message wait time help to control costs and to lower the number of empty replies. On the application side, added logic which enforces a slow down when an empty queue is discovered is an additional method to control costs for empty calls. Since only what is used is paid for, it makes sense to avoid empty calls as much as possible so that costs stay as low as possible.
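
A minimal boto3 sketch of both ideas, long polling at queue level plus a consumer-side slow down on empty replies; the queue name and the sleep value are arbitrary examples.

import time
import boto3

sqs = boto3.client("sqs")

# Long polling at queue level: a ReceiveMessage call waits up to 20 seconds for messages
queue_url = sqs.create_queue(
    QueueName="hsl-work-queue",                          # hypothetical queue name
    Attributes={"ReceiveMessageWaitTimeSeconds": "20"},
)["QueueUrl"]

while True:
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,        # long polling can also be requested per call
    )
    messages = response.get("Messages", [])
    if not messages:
        time.sleep(30)             # slow down while the queue is empty to avoid paid empty calls
        continue
    for message in messages:
        # ... process the message ...
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])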

Operations, Maintenance

An alternative to a cloud managed service is a self-deployed, hosted and managed service. Earlier in this article the impact a self-maintained and hosted service can have was reflected from a cost and effort perspective. Even though queuing services such as RabbitMQ or AMQ are easy to deploy, the ongoing efforts to maintain and monitor them can become cumbersome.

Typically a certain uptime and reliability is required, which may require failover and may end up in a clustered setup to support business needs. Rather simple tasks such as OS patching can cause all sorts of challenges: when to patch, what is the rollback path, how long can a given service be taken offline, failover options, what if patching is ignored and potential vulnerabilities are not closed, … – the list can be endless. In highly regulated environments certain actions – such as OS patching – have to be followed.

Given the shared responsibility model in a Cloud environment, questions and tasks such as patching, environmental stability and so forth are handled by the Cloud provider – such as Amazon AWS, Google GCP or Microsoft Azure, to name just a few.

What does AWS SQS provide from an operations and maintenance perspective?

Basically AWS ensures high reliability. While SQS is a regional service, AWS ensures that a compute node failure, a network outage or even an AZ (availability zone) failure does not impact the SQS service itself.

Typically, in a self-hosted and self-maintained environment this level of reliability is not reached. If it is, it comes with a number of efforts such as containerization to abstract the dependency on compute, or cluster functionality to ensure High-Availability and resiliency.

In case producers and consumers rely on a queuing service with an ongoing, always available, never slow down experience – so a true 24×7 availability – a self-hosted queue service will hit limitations.

  • What if the underlying compute needs to be patched from an OS perspective?
  • Who can be reached at any time (24×7) if the service needs maintenance?
  • What are the impacts and follow-up costs and efforts if an internally hosted queue service becomes unresponsive?
  • Which efforts does it take to ensure High-Availability?
  • Will the queue be able to store messages in a reliable pattern?
  • What compute needs to be provided to ensure average but also peak loads?
  • What if producers keep sending messages but consumers cannot process?
  • How large a backlog needs to be considered?

Use case experience

After a couple of years on a self-maintained environment, running a queuing service (RabbitMQ in this case) and handling all the implications described throughout this article, a transition was made to integrate AWS SQS as a replacement.

The adaptation from a code and software perspective was straightforward – no issues at all, with just a little planning and design beforehand.

Overall conclusion

While AWS SQS was taken specifically as an example for this article the overall outcome is obvious and likely not a secret:

Utilization of Cloud native services can outrun self-hosted, self-maintained services on various levels. Specifically when it comes to deployment, maintenance and operations, Cloud native services can provide a significant advantage and should always be kept in mind as valid alternatives and options.

If planning is done wisely, costs can be controlled very precisely and predictably. Overhead costs and dependencies on various layers can be removed entirely.

Ad-hoc changes in requirements can be fulfilled very easily, up to a fully automated fashion.

Last but not least, with the shared responsibility model a public cloud provider such as AWS helps to take away the burden of operating and maintaining a given service.

Processing of High Speed Logs

The article HSL – High Speed Logging talked about High Speed Logging itself and how this type of data, which is purely technical and has an infrastructure source, can be used to support SLA reporting.

This article provides an insight into how to process and basically ship that amount of data from its source into a data store. On purpose the design does not utilize Cloud native services, to keep it re-usable with various deployment strategies.

Source data generation on premises vs in the Cloud

In an on-premises environment the data generation and shipping has to be handled in a different way than in a Cloud based deployment.

Log data collected with a Load Balancer has to be shipped to a central syslog host. Often some data manipulation comes on top (such as date format translation, fields which have to be manipulated, information to be added for later processing). Logstash’s syslog input plugin can be used to create a central syslog message endpoint. With Logstash’s filters data can then be massaged as needed. Elsewhere in this blog the so-called ELK stack is reflected in more depth.

In a Cloud based deployment the data collection changes entirely. Basically the Cloud provider takes care of delivering the logs into storage from where they can be picked up. AWS for this purpose uses the Simple Storage Service aka S3.

Taking AWS as an example: AWS ELBs can store their respective logs into an S3 bucket from where the logs (actually objects located in a bucket) can be pulled individually for further processing.

S3 Location for ELB logs

Components

Each component in the stack is chosen to allow it to be exchanged for a different solution and/or deployment strategy. Connections and communication between the components are handled in a loosely coupled pattern to avoid a hard wired setup.

Redis in this design is used to provide a flexible, fast acting, data structure store used as an in-memory database.

RabbitMQ in this design is used to provide a broker mechanism to handle the actual objects (logs).

Containers are used to ensure reliability, availability and scalability for the individual microservices. A single container image is designed where an environment parameter switches the container’s behavior during startup. Docker Swarm Service provides an easy to use environment to handle scalability as well as the ability to recover a microservice.

3 software components

  • Listing – Creation of a library of available, configured ELBs in a Region. Basically the listing function sets the stage for the next two components (see the sketch after this list).
  • Receiving – Sort of an orchestration job which creates work packages for processing.
  • Parsing – The actual workhorse in the design, with a need to scale up and down, which translates the ELB logs into JSON arrays and enriches data for later analytics.
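
As an illustration of the Listing component, a minimal sketch could look as follows; it assumes classic ELBs in a single region, a locally reachable Redis and made-up key names.

import time
import boto3
import redis

elb = boto3.client("elb", region_name="eu-west-1")   # classic ELBs in one region
store = redis.Redis(host="localhost", port=6379)


def list_elbs():
    for page in elb.get_paginator("describe_load_balancers").paginate():
        for lb in page["LoadBalancerDescriptions"]:
            name = lb["LoadBalancerName"]
            attrs = elb.describe_load_balancer_attributes(LoadBalancerName=name)
            access_log = attrs["LoadBalancerAttributes"]["AccessLog"]
            # Library entry: logging state, target bucket and when the ELB was last reviewed
            store.hset(f"elb:{name}", mapping={
                "logging_enabled": int(access_log["Enabled"]),
                "bucket": access_log.get("S3BucketName", ""),
                "last_seen": int(time.time()),
            })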

Logistics

  • Region based deployment, such as us-east-1, eu-west-1
  • Self-learning system which identifies ELBs in a given region. This eliminates ongoing hands-on tasks to maintain log processing for newly added ELBs
  • Automated discovery whether an ELB’s configuration has logging disabled, for reporting and alerting purposes
  • The system tracks when an ELB was first seen vs last reviewed vs when its objects were last processed
  • Message brokerage to handle work packages to process (download and unpack) chunks of objects (logs)
  • Automated removal of objects in the various S3 buckets after processing, which turns the individual S3 bucket into a buffer. Furthermore it adds a layer to allow fail-safe operations
  • In case a process dies inside a container it recovers itself after a moment to ensure highly available processing mechanics.

HSL – High Speed Logging

Monitoring at the Edge

While traditional monitoring based on CPU, load, used memory etc. is still a key component from an infrastructure health perspective, it does not fully tell whether or how requests sent by customers were fulfilled. That’s where HTTP response codes and processing times become important metrics from a monitoring perspective.

In short – everything could look healthy from a system’s perspective but customers may have a different experience. And if customers (such as end users using a store to make a purchase) get a low performing experience, it turns them away, which obviously can become an impact from a revenue perspective.

To get to a more holistic and business supporting level of monitoring, the classic set of metrics can be extended by a functionality called HSL, also known as High Speed Logging.

What is HSL – High Speed Logging?

A search for HSL or High Speed Logging suggests an infrastructure and security type of monitoring. While that’s totally true, traffic monitoring at a network’s edge or gateway level also turns an infrastructure type of monitoring into an SLA and business type of monitoring, up to the point that such metrics can be used to feed BI solutions to perform further analytics on the collected data.

The gateway’s log syntax for sure looks unique per vendor (such as F5, Cisco etc.) and system. However, AWS’ ELB log syntax as an example can be looked up in the AWS documentation: ELB Log example log entry. In some way it looks like a web server’s log file with extended information.

HSL does not care about the time and performance outside a given network. Therefore HSL typically does not care about the time it takes to deliver a response back to a requestor once it left the inner 4 walls. High Speed Logging does, however, know how long an instance took to get an answer processed. Furthermore HSL does not care whether a request was rendered successfully on a customer’s device. That would be something for the Boomerang project, a JavaScript library to help measure page load times and user experience – in general called Real User Measurement (RUM).

In short – what is HSL able to answer?

  • What was requested? (URL, …)
  • How was it answered? (HTTP status response code)
  • Where did the request come from? (IP address)
  • How long did it take until an answer was returned? (therefore how much time was spent within the inner 4 walls)

As mentioned earlier in this article HSL in that context does not care about the time it would take to deliver a response through the Internet back to a customer’s device since all measuring is done inside a given network’s 4 walls (well nowadays those are virtual walls).

SLA – Service Level Agreements

While metrics generated by HSL are purely technical and at an infrastructure level, they can nevertheless support getting in line with contract wording such as Uptime, Unavailability and Performance. HSL is able to support SLA reporting.

Questions which can be answered from SLA and SLO perspectives:

  • How often did a vendor fail to fulfill a request?
  • How long did it take to fulfill requests of a certain type?
  • What is the consecutive time systems had been unavailable?
  • What is the correlation of certain events in relation to requests, endpoints and similar?
  • How’s a given endpoint performing?
  • How does a code release or change impact a given endpoint?

Uptime and Unavailability in general can be defined precisely since it is all based on HTTP status code classifications. At a very high level (and leaving a few other things aside) it could be said that any status code is recorded – user as well as server side – however only those where a server side status code is identified (so a 5xx status) are reflected in the Uptime and Unavailability calculations.
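
As a small, made-up illustration of how server side status codes feed such a calculation:

# Hypothetical hourly aggregation: total requests vs. server side (5xx) responses
requests_total = 1_200_000
responses_5xx = 240

unavailability = responses_5xx / requests_total
availability = 1 - unavailability

print(f"Availability: {availability:.5%}")   # -> Availability: 99.98000%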

What is done with all the data and how can it be collected and processed?

That answer for sure starts with an ‘it depends on …’. It depends on the volume that needs to be processed per unit of time. In any case, and as good practice in general, the so-called CIA triad may be considered a best practice to handle data wisely.

For sure a challenging key component is the amount of data which has to be processed and stored. This function has to be highly scalable, robust and self-healing to avoid hard dependencies. A design and solution is talked about in the Processing of High Speed Logs article in this blog which prefers a loose composition of components to ensure each part is changeable and scalable at each level.

Typically a near real-time requirement comes on top, which requires the data to become searchable right after it has been processed. A solution stack that is able to keep up the pace is the so-called ELK stack, which is talked about elsewhere in this blog.

AWS SSM Parameter Store

What is the AWS SSM Parameter Store?

The SSM Parameter store belongs to a fleet of services under the AWS Systems Manager umbrella.

The AWS Systems Manager Parameter Store provides a service to maintain k/v parameters. Data can be stored in plain or encrypted form.

A huge benefit is the option to create a hierarchy of parameters – such as a parent-child relation. Since parameters can be pulled either by a dedicated key (parameter name) or by path, it allows parameters to be grouped logically.

During initial architecture and design it may make sense to invest a bit of time to outline how parameters should be organized. While it helps to keep things clean and in order, it also helps to reduce interactions with the AWS SSM Parameter API, which effectively results in less on-air traffic and fewer API calls – so less budget spent overall. Therefore an investment of time and planning at the very beginning helps to avoid increased budget needs later on.
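
A minimal boto3 sketch of such a hierarchy; the path layout and values are made up for illustration.

import boto3

ssm = boto3.client("ssm")

# Hypothetical hierarchy: /<application>/<environment>/<key>
ssm.put_parameter(Name="/hsl/prod/es_endpoint", Value="https://search.example.internal",
                  Type="String", Overwrite=True)
ssm.put_parameter(Name="/hsl/prod/api_token", Value="s3cr3t",
                  Type="SecureString", Overwrite=True)

# One call pulls the whole branch instead of one API interaction per key
response = ssm.get_parameters_by_path(Path="/hsl/prod", Recursive=True, WithDecryption=True)
config = {p["Name"]: p["Value"] for p in response["Parameters"]}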

Why AWS SSM Parameter Store?

The AWS SSM Parameter Store is an option for providing configuration to applications. Applications can also interact with the AWS SSM Parameter Store to not just read but also write and update parameters where and whenever needed.

The AWS SSM Parameter Store is a regional service. Example: whatever parameter (k/v) is stored in eu-west-1 (Ireland) is not automatically replicated to other regions which may be in use.

Use case

While the number of use cases where and how to involve the Parameter Store is countless, there is an example in this blog to illustrate an automated, programmatic usage of the AWS SSM Parameter Store:

https://into-the-cloud.mechmann.com/2020/06/09/aws-ec2-automated-docker-swarm-deployment/

The logic that is used to run the Docker Swarm cluster utilizes the SSM Parameter Store to maintain the various roles of nodes within a cluster and to automatically outline which role a newly joined node has to take on.

Docker Swarm Service

How to place containers within a Docker Swarm deployment.

Use Cases

Constraints can be helpful to maintain a cluster environment for various needs.

  • Place resource hungry services so that they are not deployed to the same cluster nodes, therefore avoiding an overlap of resource demanding services.
  • Services which cannot be moved around to random hosts within a cluster (i.e. due to underlying IP etc)
  • Deployment of services to a specific group of hosts based on infrastructure needs (i.e. certain volume assignments, memory, CPU etc)

Node constraints

When a swarm cluster is created and starts to provide services, those will be started on any available node within the given cluster.

Constraints help to label/tag nodes within a Docker Swarm cluster in a way that allows a given subset of services to be placed onto a certain group of cluster nodes.

Nodes may be given a single label/tag to deploy services efficiently. Various services though can be deployed utilizing the same constraint. Therefore it is a 1:N relation.

A label is independent of the cluster node’s role. A worker as well as a leader/manager node can be set with the same labels.

Setting constraints in a cluster

Generate a list of node IDs with docker node ls

b4y5fxmnqaw6za652bgrwg413 *   host1  Ready               Active              Reachable           18.06.1-ce
v6brbtejckon3nja78h66wlu0     host2   Ready               Active              Leader              18.06.1-ce
….

A label can then be assigned easily via the docker CLI

  docker node update --label-add example1=true b4y5fxmnqaw6za652bgrwg413
  docker node update --label-add example2=true v6brbtejckon3nja78h66wlu0

Verification of the set labels can be done with a ‘docker inspect’ for a given cluster host.

Service deployment with constraints set

docker service create --constraint node.labels.example1==true --replicas 2 --name service1 imageName
docker service create --constraint node.labels.example2==true --replicas 1 --name service2 imageName

A ‘docker service ps service1’ then shows the cluster nodes where ‘service1’ is deployed to. Nodes which were not labeled with ‘example1’ will not show service1 as a running service.

AWS EC2 Maintenance Announcements

About the need to automate the AWS EC2 Maintenance Announcement’s tasks

As a common practice in cloud environments, the cloud provider has to go through infrastructure maintenance to keep the underlying data center infrastructure current and up to date.
Those maintenances typically are live updates with little to no impact, but some may impact services used by clients – such as EC2 instances in an AWS context.

AWS typically sends an email to the account holder’s email address, and the maintenance is also announced in the AWS Console. Both require either access to the email account that receives the note or a recurring check of the AWS Console to become aware of an upcoming maintenance schedule. Both options do not scale, do not provide an ability to automate, require hands-on work to not miss an announcement which might impact a main or core business function … and many more. Specifically in larger organizations, where hundreds if not thousands of EC2 instances are deployed and utilized across a number of teams and organizations to support business needs and functions, an automated process to handle a maintenance announcement becomes not just a help but a real need.

Overall it is as simple as this: a full instance stop and start is required to ensure the instance announced in the maintenance notification is moved off the underlying hardware it is currently deployed and executed on. Therefore either AWS will do a stop and start of those announced EC2 instances if no action is taken, or someone with sufficient access rights to the AWS Console or the AWS EC2 API has to execute a stop and start to initiate an EC2 relocation to a different underlying host in the AWS infrastructure.

A more actionable and automated process to let EC2 instances be stopped and started automatically can be implemented in various ways.

Automation based on Resource Tags

Resource Tags are tremendously helpful if used properly; for example, a tag could be set to identify the environment an instance belongs to – such as Development vs Production vs Testing.

A few thoughts around Resource Tags are collected in this post: https://into-the-cloud.mechmann.com/2020/06/24/resource-tags-in-cloud-deployments/

Monitoring driven

Monitoring applications such as Zabbix, which collect instance level metrics through an agent, can be used to read/pull the instance metadata.

A so-called Template with an Item is created. Alongside the Agent a so-called UserParameter is created which returns the EC2 metadata as JSON formatted output:

curl http://169.254.169.254/latest/meta-data/events/maintenance/scheduled

The UserParameter is assigned to the Item, where a schedule is set (e.g. it is executed once a day). A so-called Trigger can be created to initiate an automated execution through the AWS EC2 API, pre-checked by validating a given and correctly set resource tag.
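
The check behind such a UserParameter could be as small as the following sketch; it assumes IMDSv1 as in the curl call above and simply prints 1 or 0 for the monitoring item.

import json
import urllib.request

URL = "http://169.254.169.254/latest/meta-data/events/maintenance/scheduled"


def maintenance_scheduled() -> bool:
    # The endpoint returns a JSON array of scheduled events; an empty array means nothing is scheduled
    data = urllib.request.urlopen(URL, timeout=2).read().decode()
    events = json.loads(data) if data.strip() else []
    return len(events) > 0


if __name__ == "__main__":
    print(1 if maintenance_scheduled() else 0)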

Since this post is about the AWS EC2 Maintenance Announcement, the monitoring driven approach is only described at a high level.

Asset Management driven

In case a repository of all deployed instances is collected throughout the day to maintain a library of workloads and how/where those are deployed, the AWS EC2 Describe API can provide similar information as described above in the monitoring driven approach.

The daily data collection could be done via an AWS Lambda function. However the overall approach would be very similar to the above mentioned solution.

Event driven via AWS Cloudwatch

Last but not least, an event driven option which is fully cloud native is described in this paragraph.

Event Rules in CloudWatch can be used to initiate an action towards a target. In case of an instance retirement announcement an event rule can help to automate the proper and required actions.

The eventTypeCode is called: AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED

Within the Targets an automation through SSM can be selected, so that the event described above initiates an EC2 instance restart automatically.
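
A hedged boto3 sketch of that wiring; the rule name, the automation document ARN and the IAM role are placeholders, and passing the affected instance ID into the automation input (e.g. via an input transformer) is omitted here.

import json
import boto3

events = boto3.client("events")

# Match AWS Health events announcing a persistent instance retirement
pattern = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["EC2"],
        "eventTypeCode": ["AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED"],
    },
}

events.put_rule(Name="ec2-retirement-scheduled", EventPattern=json.dumps(pattern))

events.put_targets(
    Rule="ec2-retirement-scheduled",
    Targets=[{
        "Id": "restart-automation",
        "Arn": "arn:aws:ssm:eu-west-1::automation-definition/AWS-RestartEC2Instance",  # placeholder ARN
        "RoleArn": "arn:aws:iam::123456789012:role/events-to-ssm",                     # placeholder role
    }],
)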

Conclusion

Many options can be utilized to allow automated processing whenever AWS maintenance is announced. For sure there is not only one path, which helps to keep creativity moving so that business needs can be met most appropriately.

Resource Tags

This page is about tags, also known as resource tags, and why they are essential no matter what infrastructure is in use. Arguably tags are even more important in a hybrid setup.

Tags are not tied to a certain cloud infrastructure. Actually tags should be used in any deployment – no matter whether it is AWS, GCP or an on-premises deployment such as VMware, OpenStack etc.

Why tags?

Certainly a strategy to tag and name given workloads correctly and in a meaningful way takes time and effort. However, the payoff is worth establishing a company wide standard on tags.

Typical questions seen when it comes to identifying certain workloads and deployments:

  • What all does belong to an environment named A, or B, C?
  • How many resources are consumed by client A vs client B?
  • How much infrastructure does the company need to support the core business itself? (so corporate, shared services, ..)
  • If I have to patch/move/tier down (..) environment ABC – which consumer may be impacted by doing so?
  • In which AZ and Region is client A located from an application perspective?
  • Where is environment ABCD located at?
  • What hosts (instance types, storage etc) are connected to ABCD?
  • Team A needs AWS API access to maintain (i.e. stop, start, bounce) ABCD, GHI*PRD, … – to which resource tag group do these instances belong?
  • Which ELB is used by ABCD?
  • Which security group belongs to CDEF?
  • What is AWS RDS XYZ used for?
  • What can be turned off during the weekend, off hours?
  • … and so forth

With cloud deployed instances, historically used dedicated hostnames become meaningless. Even more important, if cloud native services are used (i.e. AWS RDS, AWS DynamoDB, AWS ElastiCache etc.) an option for a hostname simply does not exist. Consequently tagging quickly becomes a key need in order to identify a given workload – why it exists, by whom it is used and for what purpose.

The better and more unified a tagging concept is, the better and more precisely various questions (such as those above, a typical sample) can be answered.

Any automated processes to maintain, monitor and handle workloads will benefit from a good tagging strategy.

Clean tagging

Some tags might not be needed and therefore only add overhead to maintain and handle them. For example, adding a tag to note which availability zone a given AWS EC2 instance belongs to is obsolete, since that information is available through various options (AWS CLI, API, ec2-metadata and so on).

Need for conventions

Before a service or instance gets tag’d it is recommended to establish a naming convention which is generic enough to allow future (not yet known) abbreviations and mutations.

It is also necessary to figure out the tag word syntax and structure:

  • word total length
  • special characters?
  • word patterns and abbreviations such like prod vs prd
  • human readable and programmatically usable

It typically makes sense to create a library or dictionary that helps to re-use and assign tags.

Anything to be tag’d

When an instance is deployed – such as an AWS EC2 instance – it is likely kept in mind to assign this deployment proper tags for future reference.

However, there are more components and functions in actual use to make the instance usable: Security Group, Subnet, EBS, Launch Template, Auto Scaling Group and so forth. All components included should be assigned proper tags to ensure a full 360° view on a given deployment.

To ensure everything gets assigned a proper set of tags, a policy might be introduced which enforces proper tags.

  • AWS IAM policies could help to enforce a tagging strategy
  • AWS Lambda could be invoked automatically during instance deployment to check whether tags are set correctly and, if not, correct them behind the scenes automatically (see the sketch below)
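
A minimal sketch of the Lambda idea from the last bullet; it assumes the function is triggered by a CloudWatch event for the RunInstances API call, and the required tag set is made up.

import boto3

ec2 = boto3.client("ec2")

REQUIRED_TAGS = {"Environment": "unassigned", "Owner": "unassigned"}   # hypothetical defaults


def handler(event, context):
    # Instance IDs are taken from the CloudTrail-based RunInstances event detail (assumption)
    instance_ids = [item["instanceId"] for item in
                    event["detail"]["responseElements"]["instancesSet"]["items"]]

    for instance_id in instance_ids:
        described = ec2.describe_tags(
            Filters=[{"Name": "resource-id", "Values": [instance_id]}])
        existing = {tag["Key"] for tag in described["Tags"]}
        missing = [{"Key": key, "Value": value} for key, value in REQUIRED_TAGS.items()
                   if key not in existing]
        if missing:
            ec2.create_tags(Resources=[instance_id], Tags=missing)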

What are tags good for?

Beside the obvious benefit of being able to know at a later point in time what some deployment was made for and why, tags help with cost assignments. The AWS Cost Explorer for example is a great and powerful tool to help understand where budgets are spent. The more precisely a given infrastructure is tag’d, the better company internal cost distributions can be handled at cost review time and the better cost optimizations can be discussed. In short – an ability for meaningful cost analytics.

Asset management becomes easier to handle, and OS patching via AWS services such as AWS SSM becomes much less of a headache and rather an easy, foreseeable event, alongside the ability to fully automate.