Monitoring at the Edge
While traditional monitoring based on CPU, load, memory usage etc. is still a key component from an infrastructure health perspective, it does not fully tell whether or how the requests sent by customers were fulfilled. That is where HTTP response codes and processing times become important monitoring metrics.
In short – everything could look healthy from a system's perspective while customers have a different experience. And if customers (such as end users making a purchase in a store) run into a poorly performing experience, it will turn them away, which obviously can have an impact on revenue.
To get to a more holistic, business-supporting level of monitoring, the classic set of metrics can be extended by a functionality called HSL, also known as High Speed Logging.
What is HSL – High Speed Logging?
A search for HSL or High Speed Logging suggests an infrastructure and security type of monitoring. While that is true, monitoring traffic at a network's edge, at the gateway level, also turns infrastructure monitoring into SLA and business monitoring – up to the point that such metrics can be used to feed BI solutions which perform further analytics on the collected data.
The gateway's log syntax of course differs per vendor (such as F5, Cisco etc.) and system. AWS' ELB log syntax, as an example, can be looked up in the AWS documentation: ELB Log example log entry. In a way it looks like a web server's log file with extended information.
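To make that concrete, here is a minimal sketch in Python that splits one such log line into named fields. The field names and their order follow the classic ELB access-log format from the AWS documentation referenced above; an Application Load Balancer writes additional fields, so the list would need to be extended there.

```python
# Minimal sketch: split one classic ELB access-log line into named fields.
import shlex

ELB_FIELDS = [
    "timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time",
    "response_processing_time", "elb_status_code", "backend_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol",
]

def parse_elb_line(line: str) -> dict:
    # shlex keeps the quoted "request" and "user_agent" fields in one piece
    entry = dict(zip(ELB_FIELDS, shlex.split(line)))
    for key in ("request_processing_time", "backend_processing_time",
                "response_processing_time"):
        # -1 means the load balancer never got an answer from a backend
        entry[key] = float(entry[key])
    return entry
```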
HSL does not care about time and performance outside a given network. It therefore typically does not measure the time it takes to deliver a response back to the requestor once it has left the inner 4 walls. High Speed Logging does, however, know how long an instance took to produce an answer. Furthermore, HSL does not care whether a request was rendered successfully on a customer's device. That would be the domain of something like the Boomerang project, a JavaScript library that helps to measure page load times and user experience – in general called Real User Measurement (RUM).
In short – what is HSL able to answer?
- What was requested? (URL, …)
- How was it answered? (HTTP response status code)
- Where did the request come from? (IP address)
- How long did it take until an answer was returned? (i.e. how much time was spent within the inner 4 walls)
As mentioned earlier in this article, HSL in that context does not care about the time it takes to deliver a response through the Internet back to a customer's device, since all measuring is done inside a given network's 4 walls (which nowadays are virtual walls).
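To illustrate the four questions above, the following sketch reuses the parse_elb_line function from earlier; the sample line is purely hypothetical and the field names are again those of the classic ELB format.

```python
# Hypothetical sample entry in the classic ELB format
raw_line = ('2023-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 '
            '10.0.0.1:80 0.000073 0.001048 0.000057 200 200 0 29 '
            '"GET http://shop.example.com:80/checkout HTTP/1.1" "curl/7.38.0" - -')

entry = parse_elb_line(raw_line)

what  = entry["request"]              # What was requested?
how   = entry["backend_status_code"]  # How was it answered?
where = entry["client"].split(":")[0] # Where did the request come from? (IP only)
# How long did it take within the 4 walls? The sum of the three processing
# times; the trip back over the Internet to the customer's device is not part of it.
inner_walls_seconds = (entry["request_processing_time"]
                       + entry["backend_processing_time"]
                       + entry["response_processing_time"])
```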
SLA – Service Level Agreements
While the metrics generated by HSL are purely technical and at an infrastructure level, they can help to stay in line with contract wording such as Uptime, Unavailability and Performance. In other words, HSL is able to support SLA reporting.
Questions which can be answered from an SLA and SLO perspective:
- How often did a vendor fail to fulfill a request?
- How long did it take to fulfill requests of a certain type?
- For how long were systems consecutively unavailable?
- How do certain events correlate with requests, endpoints and the like?
- How is a given endpoint performing? (see the sketch after this list)
- How does a code release or change impact a given endpoint?
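As a sketch for the failure-rate and endpoint-performance questions, the following works on entries parsed with the parse_elb_line function from above. Grouping by the raw request path is an assumption for illustration; a real report would normalise IDs, query strings and the like.

```python
from collections import defaultdict
from statistics import quantiles
from urllib.parse import urlparse

def endpoint_report(entries):
    """Failure rate and p95 processing time per (method, path)."""
    by_endpoint = defaultdict(list)
    for e in entries:
        # request looks like: GET http://shop.example.com:80/checkout HTTP/1.1
        method, url, _protocol = e["request"].split(" ", 2)
        by_endpoint[(method, urlparse(url).path)].append(e)

    report = {}
    for endpoint, hits in by_endpoint.items():
        latencies = [h["request_processing_time"]
                     + h["backend_processing_time"]
                     + h["response_processing_time"] for h in hits]
        failed = sum(1 for h in hits if h["backend_status_code"].startswith("5"))
        report[endpoint] = {
            "requests": len(hits),
            "failure_rate": failed / len(hits),
            # 95th percentile of the in-network processing time
            "p95_seconds": (quantiles(latencies, n=20)[18]
                            if len(latencies) > 1 else latencies[0]),
        }
    return report
```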
Uptime and Unavailability in general can be defined precisely, since everything is based on HTTP status code classifications. At a very high level (and leaving a few other things aside) it could be said that every status code is recorded – client side as well as server side – but only requests answered with a server-side status code (so a 5xx status) are reflected in the Uptime and Unavailability calculations.
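A hedged sketch of how such a classification could feed an Unavailability calculation: entries are bucketed per minute, and a minute counts as unavailable when the share of 5xx responses reaches a threshold. The one-minute bucket and the 50% threshold are assumptions for illustration, not values taken from any particular SLA.

```python
from collections import defaultdict

def unavailable_minutes(entries, threshold=0.5):
    """Minutes in which the share of 5xx responses reached the threshold."""
    buckets = defaultdict(lambda: [0, 0])   # minute -> [total, 5xx count]
    for e in entries:
        minute = e["timestamp"][:16]        # e.g. '2023-05-13T23:39'
        buckets[minute][0] += 1
        if e["backend_status_code"].startswith("5"):
            buckets[minute][1] += 1
    return [minute for minute, (total, errors) in sorted(buckets.items())
            if errors / total >= threshold]
```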
What is done with all the data and how can it be collected and processed?
That answer certainly starts with an 'it depends on …'. It depends on the volume that has to be processed per unit of time. In any case, and as a general good practice, the so-called CIA triad (confidentiality, integrity, availability) should be considered in order to handle the data wisely.
A challenging key component is certainly the amount of data which has to be processed and stored. This function has to be highly scalable, robust and self-healing to avoid hard dependencies. A design and solution is discussed in the Processing of High Speed Logs article in this blog, which prefers a loose composition of components to ensure each part remains changeable and scalable at every level.
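As an illustration only (and not necessarily the composition described in that article): one common way to keep the parts loosely coupled is to put a message broker between the gateway's log stream and the downstream processing, so that parsers, aggregators and storage can be scaled or replaced independently. The broker address and topic name below are made-up placeholders.

```python
# Fire-and-forget shipping of raw log lines into a Kafka topic; parsing,
# aggregation and storage happen in separate consumers further downstream.
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(bootstrap_servers="broker.internal:9092")  # placeholder host

def ship(raw_line: str) -> None:
    producer.send("hsl-raw", raw_line.encode("utf-8"))  # placeholder topic
```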
Typically a near real-time requirement comes on top, which means the data has to become searchable right after it has been processed. A solution stack that is able to keep up that pace is the so-called ELK stack, which is discussed elsewhere in this blog.
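A minimal sketch of that 'make it searchable right away' step, assuming the official Elasticsearch Python client; index name and host are placeholders, and in a full ELK setup Logstash or a Beats shipper would typically sit in front of this step.

```python
from elasticsearch import Elasticsearch, helpers  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

def index_entries(entries):
    # Bulk-index parsed log entries so they become searchable in near real time
    actions = ({"_index": "hsl-access-logs", "_source": e} for e in entries)
    helpers.bulk(es, actions)
```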