Service Level Management: Best Practices White Paper

Document ID:

Contents

Introduction
Service Level Management Overview
Critical Success Factors
Performance Indicators
Service Level Management Process Flow
Implementing Service Level Management
Defining Network Service Levels
Creating and Maintaining SLAs
Service Level Management Performance Indicators
Documented Service Level Agreement or Service Level Definition
Performance Indicator Metrics
Service Level Management Review
Service Level Management Summary
Related Information

Introduction

This document describes service level management and service level agreements (SLAs) for high availability networks. It includes critical success factors for service level management and indicators to help evaluate success. The document also provides significant detail for SLAs that follow best practice guidelines identified by the high availability service team.

Service Level Management Overview

Network organizations have historically met expanding network requirements by building solid network infrastructures and working reactively to handle individual service issues. When an outage occurred, the organization would build new processes, management capabilities, or infrastructure to prevent that particular outage from occurring again. However, due to a higher change rate and increasing availability requirements, we now need an improved model to proactively prevent unplanned downtime and quickly repair the network. Many service provider and enterprise organizations have attempted to better define the level of service required to achieve business goals.

Critical Success Factors

Critical success factors for SLAs define the key elements for successfully building attainable service levels and for maintaining SLAs. To qualify as a critical success factor, a process or process step must improve the quality of the SLA and benefit network availability in general.
The critical success factor should also be measurable so the organization can determine how successful it has been relative to the defined procedure. See Implementing Service Level Management for more details.

Performance Indicators

Performance indicators provide the mechanism by which an organization measures critical success factors. You typically review these on a monthly basis to ensure that service level definitions or SLAs are working well. The network operations group and the necessary tools groups can produce the following metrics.

Note: For organizations without SLAs, we recommend you perform service level definitions and service level reviews in addition to metrics.

Performance indicators include:

- Documented service level definition or SLA that includes availability, performance, reactive service response time, problem resolution goals, and problem escalation.
- Monthly networking service level review meeting to review service level compliance and implement improvements.
- Performance indicator metrics, including availability, performance, service response time by priority, time to resolve by priority, and other measurable SLA parameters.

See Implementing Service Level Management for more information.

Service Level Management Process Flow

The high level process flow for service level management contains two major groups:

1. Defining network service levels
2. Creating and maintaining SLAs

Implementing Service Level Management

Implementing service level management consists of sixteen steps divided into the following two main categories:

- Defining network service levels
- Creating and maintaining SLAs

Defining Network Service Levels

Network managers need to define the major rules by which the network is supported, managed, and measured. Service levels provide goals for all network personnel and can be used as a metric for the quality of the overall service.
You can also use service level definitions as a tool for budgeting network resources and as evidence for the need to fund higher QoS. They also provide a way to evaluate vendor and carrier performance.

Without a service level definition and measurement, the organization does not have clear goals. Service satisfaction may be governed by users with little differentiation between applications, server/client operations, or network support. Budgeting can be more difficult because the end result is not clear to the organization. Finally, the network organization tends to be more reactive than proactive in improving the network and support model.

We recommend the following steps for building and supporting a service level model:

1. Analyze technical goals and constraints.
2. Determine the availability budget.
3. Create application profiles detailing network characteristics of critical applications.
4. Define availability and performance standards and define common terms.
5. Create a service level definition that includes availability, performance, service response time, mean time to resolve problems, fault detection, upgrade thresholds, and escalation path.
6. Collect metrics and monitor the service level definition.

Step 1: Analyze Technical Goals and Constraints

The best way to start analyzing technical goals and constraints is to brainstorm or research technical goals and requirements. Sometimes it helps to invite other IT technical counterparts into this discussion because these individuals have specific goals related to their services. Technical goals include availability levels, throughput, jitter, delay, response time, scalability requirements, new feature introductions, new application introductions, security, manageability, and even cost. The organization should then investigate constraints to achieving those goals given the available resources. You can create worksheets for each goal with an explanation of constraints. Initially, it may seem as if most of the goals are not achievable.
Then start prioritizing the goals, or lowering expectations to a level that can still meet business requirements. For example, you might have an availability goal of 99.999 percent, or 5 minutes of downtime per year. There are numerous constraints to achieving this goal, such as single points of failure in hardware, mean time to repair (MTTR) broken hardware in remote locations, carrier reliability, proactive fault detection capabilities, high change rates, and current network capacity limitations. As a result, you may adjust the goal to a more achievable level. The availability model in the next section can help you set realistic goals. You may also think about providing higher availability in certain areas of the network that have fewer constraints.

When the networking organization publishes service standards for availability, business groups within the organization may find the level unacceptable. This is then a natural point to begin SLA discussions or funding/budgeting models that can achieve the business requirements.

Work to identify all constraints or risks involved in achieving the technical goal. Prioritize constraints in terms of the greatest risk or impact to the desired goal. This helps the organization prioritize network improvement initiatives and determine how easily each constraint can be addressed. There are three kinds of constraints:

- Network technology, resiliency, and configuration
- Life cycle practices, including planning, design, implementation, and operation
- Current traffic load or application behavior

Network technology, resiliency, and configuration constraints are any limitations or risks associated with the current technology, hardware, links, design, or configuration. Technology limitations cover any constraint posed by the technology itself. For example, no current technology allows for sub-second convergence times in redundant network environments, which may be critical for sustaining voice connections across the network.
Another example is the raw speed at which data can traverse terrestrial links, which is approximately 100 miles per millisecond.

Network hardware resiliency risk investigations should concentrate on hardware topology, hierarchy, modularity, redundancy, and MTBF along defined paths in the network. Network link constraints should focus on network links and carrier connectivity for enterprise organizations. Link constraints may include link redundancy and diversity, media limitations, wiring infrastructures, local loop connectivity, and long distance connectivity. Design constraints relate to the physical or logical design of the network and include everything from available space for equipment to scalability of the routing protocol implementation. All protocol and media designs should be considered in relation to configuration, availability, scalability, performance, and capacity. Network services such as Dynamic Host Configuration Protocol (DHCP), Domain Name System (DNS), firewalls, protocol translators, and network address translators should also be considered as constraints.

Life cycle practices define the processes and management of the network used to consistently deploy solutions, detect and repair problems, prevent capacity or performance problems, and configure the network for consistency and modularity. You need to consider this area because expertise and process are typically the largest contributors to non-availability. The network life cycle refers to the cycle of planning, design, implementation, and operations. Within each of these areas, you must understand network management functionality such as performance management, configuration management, fault management, and security. A network life cycle assessment is available from Cisco NSA high availability services (HAS), showing current network availability constraints associated with network life cycle practices.

Current traffic load or application constraints simply refer to the impact of current traffic and applications.
Unfortunately, many applications have significant constraints that require careful management. Jitter, delay, throughput, and bandwidth requirements for current applications typically have many constraints. The way the application was written may also create constraints. Application profiling helps you better understand these issues; the next section covers this feature. Investigating current availability, traffic, capacity, and overall performance also helps network managers to understand current service level expectations and risks. This is typically accomplished with a process called network baselining, which helps to define network performance, availability, or capacity averages for a defined time period, normally about one month. This information is normally used for capacity planning and trending, but can also be used to understand service level issues.

The following worksheet uses the above goal/constraint method for the example goal of preventing a security attack or denial of service (DoS) attack. You can also use this worksheet to help determine service coverage for minimizing security attacks.

Risk or Constraint | Type of Constraint | Potential Impact
Available DoS detection tools cannot detect all types of DoS attacks. | Technology/resiliency | High
The organization does not have the required staff and process to react to alerts. | Life cycle practices | High
Current network access policies are not in place. | Life cycle practices | Medium
The current lower bandwidth Internet connection may be a factor if bandwidth congestion is used for attack. | Network capacity | Medium
The current security configuration to help prevent attacks may not be thorough. | Technology/resiliency | Medium

Step 2: Determine the Availability Budget

An availability budget is the expected theoretical availability of the network between two defined points. Accurate theoretical information is useful in several ways. The organization can use it as a goal for internal availability, so that deviations can be quickly identified and remedied.
Network planners can also use the information to determine the availability of the system and help ensure the design will meet business requirements.

Factors that contribute to non-availability or outage time include hardware failure, software failure, power and environmental issues, link or carrier failure, network design, human error, or lack of process. You should closely evaluate each of these parameters when evaluating the overall availability budget for the network.

If the organization currently measures availability, you may not need an availability budget. Use the availability measurement as a baseline to estimate the current service level used for a service level definition. However, you may be interested in comparing the two to understand potential theoretical availability compared to the actual measured result.

Availability is the probability that a product or service will operate when needed. See the following definitions:

1. Availability = 1 - (total connection outage time) / (total in-service connection time)
                = 1 - [Σ (number of connections affected in outage i × duration of outage i)] / (number of connections in service × operating time)
2. Unavailability = 1 - Availability, or total connection outage time due to hardware failure, software failure, environmental and power issues, link or carrier failure, network design, or user error and process failure

Hardware Availability

The first area to investigate is potential hardware failure and its effect on unavailability. To determine this, the organization needs to understand the MTBF of all network components and the MTTR for hardware problems for all devices in a path between two points. If the network is modular and hierarchical, the hardware availability will be the same between almost any two points. MTBF information is available for all Cisco components and can be requested from a local account manager.
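As a rough illustration of the availability definition and the hardware discussion above, the following Python sketch computes connection-weighted availability and a simple path estimate from MTBF and MTTR. The function names are hypothetical. The per-device formula MTBF / (MTBF + MTTR) and the multiplication of device availabilities along a non-redundant path are standard reliability assumptions, not formulas stated in this document, and the sketch does not model module, chassis, or path redundancy.

```python
from typing import Iterable, Tuple


def connection_availability(
    outages: Iterable[Tuple[int, float]],
    connections_in_service: int,
    operating_time: float,
) -> float:
    """Availability = 1 - sum(connections affected x outage duration)
    / (connections in service x operating time).

    Each outage is a (connections_affected, duration) pair; durations and
    operating_time must use the same time unit.
    """
    total_connection_time = connections_in_service * operating_time
    if total_connection_time <= 0:
        raise ValueError("connections_in_service and operating_time must be positive")
    outage_connection_time = sum(count * duration for count, duration in outages)
    return 1.0 - outage_connection_time / total_connection_time


def device_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Assumed steady-state availability of one device: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def serial_path_availability(devices: Iterable[Tuple[float, float]]) -> float:
    """Assumed availability of a non-redundant path: every device must be up,
    so the (mtbf_hours, mttr_hours) availabilities are multiplied together."""
    result = 1.0
    for mtbf_hours, mttr_hours in devices:
        result *= device_availability(mtbf_hours, mttr_hours)
    return result
```

For example, one outage that affects 10 of 100 connections for 2 hours during 100 hours of operation gives `connection_availability([(10, 2.0)], 100, 100.0)`, which is 1 - 20/10000. Shortening the MTTR passed to `serial_path_availability` shows directly how a sparing plan changes the hardware portion of the budget.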
The Cisco NSA HAS program also uses a tool to help determine hardware availability along network paths, even when module redundancy, chassis redundancy, and path redundancy exist in the system. One major factor in hardware reliability is the MTTR. Organizations should evaluate how quickly they can repair broken hardware. If the organization has no sparing plan and relies on a standard Cisco SMARTnet agreement, then the potential average replacement time is approximately 24 hours. In a typical LAN environment with core redundancy and no access redundancy, the approximate availability is percent with a 4 hour MTTR.

Software Availability

The next area for investigation is software failures. For measurement purposes, Cisco defines software failures as device coldstarts due to software error. Cisco has made significant progress toward understanding software availability; however, newer releases take time to measure and are considered less available than general deployment software. General deployment software, such as IOS version 11.2(18), has been measured at over percent availability. This is calculated based on actual coldstarts on Cisco routers using six minutes as the repair time (the time for the router to reload). Organizations with a variety of versions are expected to have slightly lower availability because of added complexity, interoperability issues, and increased troubleshooting times. Organizations with the latest software versions are expected to have higher non-availability. The distribution for the non-availability is also fairly wide, meaning that customers could experience either significant non-availability or availability close to a general deployment release.

Environmental and Power Availability

You must also consider environmental and power issues in availability. Environmental issues relate to the breakdown of cooling systems needed to keep equipment at a specified operating temperature.
Many Cisco devices will simply shut down when they are considerably out of specification rather than risking damage to all hardware. For the purpose of an availability budget, power will be used because it is the leading cause of non-availability in this area.

Although power failures are an important aspect of determining network availability, this discussion is limited because theoretical power analysis cannot be done accurately. What an organization must evaluate is an approximate measurement of power availability to its devices based on experience in its geographic area, power backup capabilities, and the processes implemented to ensure consistent quality power to all devices. For a conservative evaluation, we can say that an organization with backup generators, uninterruptible power supply (UPS) systems, and quality power implementation processes may experience six 9s of availability, or 99.9999 percent, whereas organizations without these systems may experience availability at percent, or approximately 36 minutes of downtime annually. Of course, you can adjust these values to more realistic values based on the organization's perception or actual data.

Link or Carrier Failure

Link and carrier failures are major factors concerning availability in WAN environments. Keep in mind that WAN environments are simply other networks that are subject to the same availability issues as the organization's network, including hardware failure, software failure, user error, and power failure.

Many carrier networks have already performed an availability budget on their systems, but getting this information may be difficult. Carriers also frequently have availability guarantee levels that have little or no basis in an actual availability budget. These guarantee levels are sometimes simply marketing and sales methods used to promote the carrier. In some cases, these networks also publish availability statistics that appear extremely good.
Keep in mind that these statistics may apply only to completely redundant core networks and do not factor in non-availability due to local loop access, which is a major contributor to non-availability in WAN networks.

An estimate of availability for WAN environments should be based on actual carrier information and the level of redundancy for WAN connectivity. If an organization has multiple building entrance facilities, redundant local loop providers, Synchronous Optical Network (SONET) local access, and redundant long distance carriers with geographic diversity, WAN availability will be considerably enhanced.

Phone service provides a fairly accurate availability budget for non-redundant network connectivity in WAN environments. End-to-end connectivity for phones has an approximate availability budget of percent using an availability budget methodology similar to the one described in this section. This methodology has been used successfully in data environments with only slight variation, and is currently being used as a target in the packet cable specification for service provider cable networks. If we apply this value to a completely redundant system, we can assume that WAN availability will be close to percent available. Of course, very few organizations have completely redundant, geographically dispersed WAN systems because of the expense and availability, so use proper judgement regarding this capability.

Link failures in a LAN environment are less likely. However, planners may want to assume a small amount of downtime due to broken or loose connectors. For LAN networks, a conservative estimate is approximately 99.9999 percent availability, or about 30 seconds per year.

Network Design

Network design is another major contributor to availability. Non-scalable designs, design errors, and network convergence time all negatively affect availability.

Note: For the purposes of this document, non-scalable design or design errors are included in the following section.
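The availability percentages quoted throughout this availability budget are easier to compare once they are converted to annual downtime, and the effect of a fully redundant WAN can be approximated in the same way. The Python sketch below is only an illustration: the function names are hypothetical, a 365-day year is assumed, and the redundancy estimate assumes that the redundant paths fail independently, which real carrier paths sharing a local loop or building entrance may not.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def annual_downtime_minutes(availability: float) -> float:
    """Convert an availability fraction (for example 0.99999) to minutes of
    downtime per year."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def redundant_availability(single_path_availability: float, paths: int = 2) -> float:
    """Estimate availability of fully redundant paths.

    Assumption: paths fail independently, so the system is down only when
    every path is down at the same time.
    """
    if paths < 1:
        raise ValueError("paths must be at least 1")
    return 1.0 - (1.0 - single_path_availability) ** paths
```

For instance, `annual_downtime_minutes(0.99999)` gives a little over 5 minutes per year, and `redundant_availability(0.999, 2)` shows how two independent three-nines paths combine to roughly six nines. Measured carrier data should replace these illustrative inputs wherever it is available.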
Network design is then limited to a measurable value based on software and hardware failure in the network causing traffic rerouting. This value is typically called system switchover time and is a factor of the self-healing protocol capabilities within the system. Calculate availability by simply using the same methods for system calculations. However, this is not valid unless the network switchove