By Eric Burgener, Senior VP of Marketing, InMage

Taking steps to prepare for Disaster Recovery

In today's climate, most enterprises maintain some form of business continuity plan. Business continuity plans provide a way for an enterprise to continue functioning in the event of a catastrophic disaster that shuts down business operations at one or more primary locations. Business continuity plans cover information technology (IT) infrastructure recovery, human capital issues that arise when business operations must be restarted at a remote location, and physical infrastructure issues, such as re-establishing communications, ensuring physical security, and providing appropriate work areas at remote locations. IT infrastructure recovery, sometimes referred to as disaster recovery (DR), addresses the issues involved with recovering computing equipment (servers, storage, etc.), data, and application services. DR provides a necessary foundation for business continuity plans but is not a substitute for them. An effective DR plan is especially important for online foreign exchange (FX) brokers, whose trade servers are absolutely essential to their businesses.

Any downtime event, such as a server failure, virus, or natural disaster, significantly affects a broker's ability to complete transactions, generate revenue, and service its customers. FX brokers also operate in an increasingly strict regulatory environment. The National Futures Association (NFA) requires all members to adopt a DR plan reasonably designed to enable them to continue operating, re-establish operations, or transfer their business to other members with minimal disruption to their customers, other members, and the commodity futures markets. Additionally, all downtime events must be reported. This article focuses on the key elements of creating an effective DR plan, and then provides a short case study of how a leading provider of foreign exchange trading services, Interbank FX (IBFX), implemented an updated DR plan to meet its requirements.

Market Forces Driving Change

The conventional approach to DR was to periodically ship copies of backup tapes to remote locations, where they were often stored for years, to ensure data recovery in the event of a catastrophic disaster that shuts down a primary site. In the world of DR, two key metrics govern recovery capabilities: recovery point objective (RPO) and recovery time objective (RTO). RPO defines the maximum acceptable amount of data loss per recovery event, expressed as a span of time (e.g. no more than 24 hours, no more than 4 hours, etc.), while RTO defines the maximum acceptable time to recovery (e.g. data and/or applications restored and running within 8 hours, etc.).
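To make the two metrics concrete, here is a minimal Python sketch that treats both as upper bounds and checks a single recovery event against them. The class name, function name, timestamps, and the 4-hour/8-hour objectives are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # most data loss the business can tolerate, as a time span
    rto: timedelta  # longest the business can wait for service to return


def evaluate_recovery(objectives, last_good_copy, outage_start, service_restored):
    """Return (data_loss, downtime, rpo_met, rto_met) for one recovery event."""
    data_loss = outage_start - last_good_copy      # work done since the last usable copy
    downtime = service_restored - outage_start     # how long the service was unavailable
    return (data_loss, downtime,
            data_loss <= objectives.rpo, downtime <= objectives.rto)


# Hypothetical event: the last usable copy is 18 hours old, service returns after 6 hours.
objectives = RecoveryObjectives(rpo=timedelta(hours=4), rto=timedelta(hours=8))
data_loss, downtime, rpo_met, rto_met = evaluate_recovery(
    objectives,
    last_good_copy=datetime(2024, 1, 9, 18, 0),
    outage_start=datetime(2024, 1, 10, 12, 0),
    service_restored=datetime(2024, 1, 10, 18, 0),
)
print(f"data loss {data_loss}, RPO met: {rpo_met}; downtime {downtime}, RTO met: {rto_met}")
```

In this made-up event the recovery finishes inside the RTO but loses far more data than the RPO allows, which shows why the two objectives have to be evaluated separately.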

Remote recoveries from tape generally exhibit lax RPO and RTO. By the time tapes are stored at a remote location, the data may already be several days to a week old, and recovery can easily require several days to a week. Data is growing at unprecedented rates, and evolving business and regulatory mandates are driving ever more stringent recovery requirements. For most critical application environments, a tape-based DR approach just can't meet these requirements, exposing businesses to lost revenue, poor customer service, and, in extreme cases, a threat to overall business viability. These market forces are driving many FX brokers to re-evaluate how they plan for DR.

Planning for Effective DR

There are four critical planning steps that foreign exchange brokers must take in either setting up a DR plan for the first time or re-evaluating their pre-existing plans:

Step 1: Understand business priorities

While FX brokers have a number of business processes, certain ones are more critical than others. Generally, any business processes that are directly related to revenue generation or customer support are deemed critical. To focus in on areas for which a recovery plan must truly exist, it helps to: 1) understand the time-sensitivity of recovery and how it relates to business priorities; and 2) identify the impacts of failures, quantifying them in terms of dollar amounts (e.g. revenue lost per hour, etc.) where possible. Create a prioritized list that includes all major business process areas, and then map those business processes to the relevant supporting IT infrastructure. The end goal of this exercise is to have a list of applications, servers, and storage that must be available to support each business process.
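One way to capture the output of this exercise is a small inventory that ranks business processes by estimated hourly loss and lists the infrastructure each depends on. The sketch below is illustrative only: the process names, dollar figures, and server names are invented, and ranking purely by revenue lost per hour is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessProcess:
    name: str
    revenue_loss_per_hour: float                          # hypothetical dollar estimate
    infrastructure: list = field(default_factory=list)    # apps, servers, storage


def prioritize(processes):
    """Order processes from most to least costly per hour of downtime."""
    return sorted(processes, key=lambda p: p.revenue_loss_per_hour, reverse=True)


# Invented example data for a broker.
processes = [
    BusinessProcess("Monthly reporting", 500.0, ["report-db"]),
    BusinessProcess("Trade execution", 50_000.0, ["trade-server-1", "trade-db", "san-a"]),
    BusinessProcess("Customer support", 5_000.0, ["crm-app", "crm-db"]),
]

ranked = prioritize(processes)
for rank, p in enumerate(ranked, start=1):
    print(f"{rank}. {p.name}: ${p.revenue_loss_per_hour:,.0f}/hour "
          f"-> {', '.join(p.infrastructure)}")
```

A real assessment would also weigh impacts that are harder to express in dollars, such as regulatory exposure and customer trust.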

Step 2: Assess your recovery requirements

Once major business process areas have been prioritized in terms of their criticality to the business, you will know which ones need to be focused on first. The next step is to determine the business impact of longer versus shorter recovery times for these key business processes. Recovery tiering is an approach that is often used when evaluating the recovery requirements associated with various business processes. Instead of evaluating and setting recovery requirements individually for all major business process areas, a small number of recovery tiers is defined. Each tier has a set of recovery performance metrics (e.g. RPO, RTO) that are associated with all application environments within that tier. For example, you may define three tiers: the highest tier for your most critical business processes without which you cannot run your business, a middle tier for applications that are not critical but still important, and a lower tier for all other applications.
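A three-tier scheme like the one described can be written down as a simple lookup table. In the sketch below the tier numbers, the RPO/RTO values, and the application names are assumptions made up for illustration; actual values come out of the business impact analysis.

```python
from datetime import timedelta

# Hypothetical recovery tiers: every application in a tier shares its RPO and RTO.
TIERS = {
    1: {"rpo": timedelta(minutes=5), "rto": timedelta(minutes=30)},   # business-critical
    2: {"rpo": timedelta(hours=4), "rto": timedelta(hours=8)},        # important
    3: {"rpo": timedelta(hours=24), "rto": timedelta(hours=72)},      # everything else
}

# Hypothetical assignment of applications to tiers.
APPLICATION_TIER = {
    "trade-execution": 1,
    "crm": 2,
    "intranet-wiki": 3,
}


def objectives_for(application):
    """Return (tier, rpo, rto) for an application based on its tier assignment."""
    tier = APPLICATION_TIER[application]
    return tier, TIERS[tier]["rpo"], TIERS[tier]["rto"]


for app in APPLICATION_TIER:
    tier, rpo, rto = objectives_for(app)
    print(f"{app}: tier {tier}, RPO {rpo}, RTO {rto}")
```

Keeping the objectives on the tier rather than on each application means a new application only needs a tier assignment, not its own recovery analysis.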

Keep in mind that it's not just data recovery you'll need to focus on. When you have to recover from a major outage, you'll likely need to recover both data and applications. Many enterprises implement a DR plan for just data, assuming that servers and application environments will be manually rebuilt and recovered if they need to be. By relying on manual recovery processes for applications, you are putting your business at additional risk. Automated application recovery will be more reliable and perform more predictably because it will not be as dependent upon the skill of the administrators that are actually performing the recovery (your best trained administrators may not always be available when a real disaster hits).

Step 3: Match the right solutions to your recovery requirements

Once you've determined the key recovery metrics of RPO and RTO, you'll need to consider just what type of IT infrastructure you need to meet them. Understand the recovery capabilities that various technologies deliver. Tape has low storage costs, but it supports only very lax RPO and RTO and requires a lot of administrative overhead during recoveries. For lower recovery tiers, that may still be enough to meet your requirements. If you need better RPO and RTO performance, you may want to consider disk. Disk has higher storage costs, but it can support very stringent RPO/RTO, requires significantly less administrative overhead for recoveries, and opens the door to next-generation data protection technologies, such as continuous data protection (CDP), asynchronous replication, and WAN optimization, that solve recovery problems tape cannot. Finally, don't just consider data recovery technologies; for your highest recovery tier application environments, look for technologies that can help automate application recovery as well.
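The matching step can be sketched as picking the least expensive option whose achievable RPO and RTO fit within what a tier requires. The capability figures and relative cost ranks below are rough placeholders, not vendor data, and should be replaced with numbers measured in your own environment.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Technology:
    name: str
    achievable_rpo: timedelta
    achievable_rto: timedelta
    relative_cost: int  # 1 = cheapest; illustrative ranking only


# Placeholder capabilities for illustration.
CANDIDATES = [
    Technology("offsite tape", timedelta(days=3), timedelta(days=3), 1),
    Technology("disk-based backup", timedelta(hours=4), timedelta(hours=8), 2),
    Technology("CDP with asynchronous replication",
               timedelta(minutes=1), timedelta(minutes=30), 3),
]


def cheapest_fit(required_rpo, required_rto, candidates=CANDIDATES):
    """Return the cheapest technology meeting both objectives, or None if none does."""
    fits = [t for t in candidates
            if t.achievable_rpo <= required_rpo and t.achievable_rto <= required_rto]
    return min(fits, key=lambda t: t.relative_cost) if fits else None


lowest_tier = cheapest_fit(timedelta(days=7), timedelta(days=7))
highest_tier = cheapest_fit(timedelta(minutes=5), timedelta(hours=1))
print(lowest_tier.name, "|", highest_tier.name)
```

Under these assumed numbers, tape remains a reasonable choice for the lowest tier, while only the replication-based option satisfies the highest tier.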

Step 4: Test your DR plans

There is a big difference between theory and reality. We've probably all heard the story about the bumblebee. Scientists evaluating the aerodynamics of the bumblebee, given what we know about aeronautics, would have to conclude that it could not fly. And yet it does.

To be sure your DR plan will work as expected, you have to test it regularly. Replication setups used to meet stringent DR requirements are complex configurations that can evolve and degrade over time in unexpected ways. You want no surprises; your DR plan should work predictably. Newer technologies like server virtualization and application failover/failback can help make DR testing non-disruptive to production environments and much less expensive than it has been in the past. Regular testing also helps you fine-tune and improve your recovery capabilities, evolving them over time as your own recovery requirements evolve.
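Tracking the recovery time measured in each test against the RTO is one simple way to spot a configuration that is degrading. The following sketch assumes a hypothetical 30-minute RTO and invented quarterly test results.

```python
from datetime import timedelta


def failed_tests(test_runs, rto):
    """Return the test runs whose measured recovery time exceeded the RTO."""
    return [run for run in test_runs if run["recovery_time"] > rto]


# Invented results from three quarterly DR tests.
test_runs = [
    {"date": "2024-01-15", "recovery_time": timedelta(minutes=20)},
    {"date": "2024-04-15", "recovery_time": timedelta(minutes=28)},
    {"date": "2024-07-15", "recovery_time": timedelta(minutes=41)},
]

misses = failed_tests(test_runs, rto=timedelta(minutes=30))
for run in misses:
    print(f"{run['date']}: recovery took {run['recovery_time']}, over the RTO")
```

Here the measured time creeps upward from test to test, and the third run misses the objective, which is the kind of drift that only regular testing reveals.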

Illustration of a DR Plan

Headquartered in Salt Lake City, Utah, Interbank FX (IBFX) is a leading provider of online foreign exchange trading services that serves over 35,000 clients across more than 140 countries. IBFX maintains two data centers: a main production center in Salt Lake City that houses all of its business-critical trade servers, and a remote data center in New York.

With data growth rates skyrocketing, IBFX was looking to maintain compliance while at the same time improving their recovery capabilities. The main production center had a variety of heterogeneous servers and storage, and IBFX was looking for a solution that would provide the flexibility to accommodate all of them. Of particular concern were minimizing data loss on recovery, shortening recovery times, and solution scalability.

"We needed technology that would enable us to fully recover our data center in the event of a catastrophe, without any gaps," said Paxton Powers, IT Infrastructure Manager, IBFX. "Our idea was for a completely virtual DR site. Real-time replication from physical to virtual machines would be the fastest way to transfer data from the Salt Lake City data center to the New York DR site."

After evaluating their requirements, IBFX came to the conclusion that tape-based infrastructure could not meet their highest recovery tier requirements, and that the business impacts of excessive downtime justified an investment in newer technology. Candidate technologies to meet IBFX's recovery requirements included CDP, asynchronous replication, recovery automation, and disk-based recovery.

InMage Systems provided a software-based recovery solution that integrated local (backup) and remote (DR) recovery capabilities into a single solution designed to support heterogeneous environments. InMage's foundation technologies, which included CDP, asynchronous replication, application failover/failback, WAN optimization, and disk-based recovery, were a good fit for IBFX's needs. CDP helped minimize the impact of data protection operations on trading servers, helping them to maintain high performance, and provided options to minimize data loss on recovery while meeting very short RTOs. Asynchronous replication, combined with WAN optimization, allowed IBFX to maintain very current copies of their production data sets at their remote data center in New York while keeping bandwidth costs to a minimum. Application failover and failback extended the solution's abilities beyond just recovering data, helping IBFX to automate application-level recovery operations to make them faster and more reliable.

"Our main objective was to be able to recover our data center and remain operational during downtime events," said Powers. "InMage gave us that ability. We've got our production trade servers being replicated between our two data centers, which is a huge win. Additionally, we can meet near-zero recovery time objectives, enabling us to come back online very quickly after a problem. In the trading business, time literally is money and every minute of downtime counts. We have peace of mind now that we've minimized the risk of impacting customers or revenue due to server downtime, whether it's a simple failure or a natural disaster."
