SOA: Responsibilities of Service Providers (Part 4)

This blog post is one in a series. An overview and general outline of this series is linked here.

Responsibility 3:  Publish and commit to a defined level of service (SLA). Publish planned vs. actual performance and availability metrics.

Background

"You manage things; you lead people." -Grace Hopper, computing pioneer

"If you cannot measure it, you cannot improve it." -Lord Kelvin, British scientist

It is difficult to effectively manage anything significant without metrics.  Managing performance and availability should be treated no differently, especially when very aggressive requirements are in play.  In a Service Oriented Architecture, services are the "things" to manage.

Ineffectively managing performance and availability can create problems.  A couple of examples:
  • Critical business functions may be negatively impacted when availability and performance levels are unknown.  When interfaces are created from clients to services that cannot support the required service level, the result can be business and/or direct customer impact.
  • Lack of confidence in a service.  This can result from clients/consumers experiencing unknown, sporadic outages and performance issues.  When confidence is lacking, it becomes a barrier to adopting shared services and achieving reusability.

Considerations

Historically at FedEx, Customer-Supplier Alignments have been a useful tool to synchronize needs and expectations between groups that are dependent upon one another.  This parallels a Service Level Agreement (SLA) in concept.  With either, the bottom line is effective communication and aligning expectations between constituent groups.  Much has been written about SLAs by others.  I will only hit on a few key points.

Service level management effectively starts as a design-time activity and requires run-time enforcement.  (I will cover the details in a later post.)  Your focus, and mileage, may vary depending on how critical the service is in terms of the availability and performance requirements it must meet.

Capacity Planning

When a service is being developed, analysis is required to determine its requirements.  This typically starts by analyzing the types of business processes that will be supported.  A service supporting customers placing orders via the web or an 800 number can be very different from one supporting back-office batch processing.

Each client application (service consumer) must quantify the performance and availability it requires.  I typically like to quantify requirements using:
  • Requests per second (average and max during peak hour throughput)
  • Response time per request (average and max tolerable response time)
  • Minutes/hours downtime tolerable per hour/day/week/etc.
  • Business impact of not meeting above requirements
Formal capacity planning/analysis is done to ensure the requirements can be met as-is, by adding hardware, or by taking other measures.  The more aggressive the performance and availability requirements, the more formalized the planning activity should be.
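The requirement categories above might be captured in a simple record, along these lines. This is an illustrative sketch only; the field names and numbers are hypothetical, not from any real system.

```python
from dataclasses import dataclass

# Hypothetical record of one consumer's service-level requirements.
# Field names and values are illustrative only.
@dataclass
class SlaRequirements:
    client_id: str
    avg_requests_per_sec: float       # average throughput during peak hour
    max_requests_per_sec: float       # maximum throughput during peak hour
    avg_response_time_ms: float       # target average response time
    max_response_time_ms: float       # maximum tolerable response time
    max_downtime_min_per_week: float  # tolerable outage minutes per week
    business_impact: str              # consequence of missing the above

order_entry = SlaRequirements(
    client_id="web-order-entry",
    avg_requests_per_sec=40.0,
    max_requests_per_sec=120.0,
    avg_response_time_ms=250.0,
    max_response_time_ms=1000.0,
    max_downtime_min_per_week=5.0,
    business_impact="Customers cannot place orders; direct revenue loss",
)
```

Having each consumer fill in a record like this makes the capacity-planning inputs explicit and comparable across clients.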

Capturing Metrics

At run-time, the service should be instrumented to capture actual performance metrics.  These are most useful when captured per client/consumer; that level of detail can always be rolled up to an overall number.  Useful metrics to capture are requests per second and average response time for a given interval.
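A minimal sketch of this kind of per-client instrumentation might look like the following. The class and its interval-based roll-up are my own illustration, not a specific product's API.

```python
from collections import defaultdict

class ClientMetrics:
    """Capture request counts and response times per client over a fixed interval."""

    def __init__(self, interval_sec=60):
        self.interval_sec = interval_sec
        self.counts = defaultdict(int)        # requests per client
        self.total_time = defaultdict(float)  # summed response time per client

    def record(self, client_id, response_time_sec):
        self.counts[client_id] += 1
        self.total_time[client_id] += response_time_sec

    def requests_per_sec(self, client_id):
        return self.counts[client_id] / self.interval_sec

    def avg_response_time(self, client_id):
        n = self.counts[client_id]
        return self.total_time[client_id] / n if n else 0.0

    def rolled_up(self):
        # Per-client detail rolled up to overall throughput and response time.
        total = sum(self.counts.values())
        avg = sum(self.total_time.values()) / total if total else 0.0
        return total / self.interval_sec, avg

m = ClientMetrics(interval_sec=60)
m.record("web-order-entry", 0.2)
m.record("web-order-entry", 0.4)
m.record("batch-billing", 1.0)
```

The per-client breakdown is what makes troubleshooting and SLA enforcement possible; the roll-up comes for free.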

Alerting when performance falls outside the expected variance is considered a best practice, especially when measures can be taken to address the degradation.
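A variance check of this kind can be quite simple. The sketch below compares actual response time against the planned figure and flags deviations past a threshold; the 10% warning level is an arbitrary illustrative choice.

```python
def check_variance(actual_ms, planned_ms, warn_pct=10.0):
    """Return an alert string when actual response time exceeds the plan
    by more than warn_pct percent; return None otherwise."""
    if planned_ms <= 0:
        raise ValueError("planned_ms must be positive")
    deviation = (actual_ms - planned_ms) / planned_ms * 100.0
    if deviation > warn_pct:
        return f"ALERT: response time {actual_ms:.0f} ms is {deviation:.0f}% over plan"
    return None

alert = check_variance(300.0, 250.0)  # 20% over plan -> alert raised
```

In practice the alert would feed whatever monitoring channel the operations team already watches.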

Services hosted by eBay and Twitter, for example, implement "rate limiting" features to prevent a runaway client's unplanned volume from impacting the ability to meet service commitments for other clients.  Typically this is implemented in a very simple way to enforce the SLA at run-time.  I prefer two levels per client:  a warning threshold, and an absolute ceiling that results in requests being turned away for a specified interval.  (I will cover the details in a later post.)
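The two-level scheme could be sketched as a fixed-window limiter like the one below. This is my own illustration of the idea, not how eBay or Twitter actually implement it, and the limits shown are arbitrary.

```python
import time
from collections import defaultdict

class RateLimiter:
    """Fixed-window, per-client rate limiter with two levels:
    a warning threshold (alert but serve) and an absolute ceiling (reject)."""

    def __init__(self, warn_per_min, ceiling_per_min, window_sec=60):
        self.warn = warn_per_min
        self.ceiling = ceiling_per_min
        self.window_sec = window_sec
        self.window_start = defaultdict(float)
        self.counts = defaultdict(int)

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        # Start a fresh window for this client when the old one expires.
        if now - self.window_start[client_id] >= self.window_sec:
            self.window_start[client_id] = now
            self.counts[client_id] = 0
        self.counts[client_id] += 1
        if self.counts[client_id] > self.ceiling:
            return "reject"   # over the ceiling: turn the request away
        if self.counts[client_id] > self.warn:
            return "warn"     # serve it, but raise a warning alert
        return "ok"

rl = RateLimiter(warn_per_min=2, ceiling_per_min=4)
results = [rl.allow("client-a", now=100.0) for _ in range(5)]
```

With these limits, the first two requests in a window pass cleanly, the next two trigger warnings, and the fifth is rejected, which is exactly the warning-then-ceiling behavior described above.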

Planned vs. actual performance and availability metrics should be published.  This is a critical input to formal capacity planning.  And while alerting on outage conditions is a must, keeping metrics for total outage minutes, for both planned and unplanned events, is considered a best practice.  These details are very useful for building confidence with current and future consumers.  Measurements such as these also drive improvements to meet business needs, depending on the criticality of the business processes supported.
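Turning outage minutes into a published planned-vs.-actual figure is simple arithmetic. The numbers below are hypothetical, chosen only to show the calculation.

```python
def availability_pct(period_min, outage_min):
    """Availability over a period, given total outage minutes
    (planned plus unplanned)."""
    return (period_min - outage_min) / period_min * 100.0

# Hypothetical 30-day month: 12 planned + 7 unplanned outage minutes.
period = 30 * 24 * 60  # 43,200 minutes
actual = availability_pct(period, 12 + 7)
report = {
    "planned_pct": 99.95,            # the published commitment
    "actual_pct": round(actual, 3),  # what was actually delivered
}
```

Publishing both numbers side by side, month after month, is what builds (or erodes) consumer confidence.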

Details regarding service security will not be outlined in this post, but I don't wish to minimize its importance.  A critical success factor is being able to reliably identify each client uniquely.  This assists troubleshooting, enables metrics to be captured reliably at the client level, and allows the client's SLA to be enforced at run-time.  Allowing anonymous or rogue clients to invoke a service can skew metrics and cause other manageability issues.  (I will cover the details in a later post.)

Level of Rigor May Vary

The level of rigor applied can vary with the criticality of the service.  When developing and managing a service that demands high performance and high availability, it is difficult to imagine taking on the challenge without considering the key elements of this principle.