Saturday, April 30, 2011

Back up and running

I blame Virgin. Up until late last year everything in IT seemed hunky-dory. However, ever since the Virgin reservation system went belly-up for a few days late last year, there seems to have been a succession of IT failures that have caught the attention of the mainstream press. ATMs at major banks have either been offline or have permitted unauthorised withdrawals. Elsewhere, a financial institution in Queensland was off air for several days in January when both its production and backup data centres were affected by the Queensland floods. To the layman all this seems inconceivable.

IT is at the epicentre of most organisations today. If your IT is down then so is your business. This is clearly something regulators have recognised. Organisations like the Australian Prudential Regulation Authority (APRA) certify the operational procedures of Australian financial institutions, and one area they pay particular attention to is disaster recovery. They do this because they recognise that the integrity of any finance company is closely aligned with the robustness of its IT operations. I suspect this is also why disaster recovery arrangements are a key element of SOX compliance. Similarly, I know disaster recovery is an area covered by emerging IT governance frameworks like ISO/IEC 38500. So why, with all this focus on compliance in recent years, have these IT disasters eventuated?

My colleague Kevin McIsaac eloquently summed it up for me the other day. He regards disaster recovery as a no-win topic for most IT executives. Like other protective expenditure, such as insurance, it is a grudge investment. You do it because you have to, but you hope you never have to use it. As a result, even the best businesses can tend to underinvest in this area. The business is not enthused about it because it questions how expenditure here will help the bottom line. IT executives are loath to highlight deficiencies because they know they will only get lukewarm support from the business, and they have other things to worry about. Consequently, Kevin believes many organisations take a wing-and-a-prayer attitude to disaster recovery.

Disaster recovery will be the topic for the next Coalface session. The presenter will be the General Manager of IT at a second-tier bank. One Saturday afternoon four years ago, an electrician who was working to increase the capacity of the company’s data centre inadvertently plugged the wrong device into a socket. In so doing he fused the electrics and plunged the data centre into darkness. Pretty soon the room was full of smoke. The IT executive now had a golden opportunity to activate the company’s rigorously tested, thoroughly documented, independently audited, regulatory-compliant disaster recovery (DR) and business continuity (BCP) plans. He confidently reported to his management that he expected things would be back to normal in about four hours. Seventeen hours later, as Saturday evening rolled into Sunday morning, he began to understand the deficiencies in these arrangements.

While the IT team eventually got the production data centre back up and running, it was clear that the episode could easily have been a disaster. The following Monday the executive reported his concerns to his CEO, who immediately wrote him a cheque for several million dollars and gave him a mandate to fix it. The CEO had done a rough, back-of-the-envelope calculation of just how much the unavailability of the IT systems for several days would cost the organisation. He realised that it would be foolhardy not to address these deficiencies in the DR arrangements.

Since 2007 the IT executive has been diligently improving the IT infrastructure in the business to enhance the effectiveness of its DR arrangements. Moreover, he has engaged external consultants to help him document the necessary processes, and the personnel responsible for each of them, on an easily read A3 sheet that can act as a key reference in an emergency. One of the lessons the company learned from its disaster in 2007 was that a detailed 250-page disaster recovery plan is not much use in a real-life crisis. More recently, he has evolved his operations to an active-active arrangement across two remote data centres. This includes the ability to cluster a production database across the two data centres and concurrently write transactional data at both sites while retaining full data consistency.

I think he has a great story to tell. Yet I have to say I am putting on this session with some trepidation. I have had quite a bit of interest from a number of members. Some have told me that they realise their company needs to do more in the area of DR, something that is highlighted by the fact that it often takes them two weeks to prepare for a DR test. Yet several other members have been quite dismissive of this topic as a session. They have told me they think their disaster arrangements are tested frequently and are well understood. They regard them as a strong point in their IT operations. As such, they doubt whether they can learn anything from someone else. Moreover, they believe the problems experienced by the bank in question are a reflection that its DR arrangements were inadequate in the first place.

Nevertheless, on balance I have decided to use this case study as the next Coalface session. I believe the growing number of high-profile and prestigious organisations that have experienced significant issues with their DR and BCP plans in recent months is evidence that more needs to be done in this area. The chance to hear from a local counterpart who has had to address a near-death experience with DR strikes me as an invaluable learning experience for others, and one that is very much in keeping with the ethos behind The Coalface Community.

However, there was also a personal factor in this. Last weekend I had my own near-death experience. Driving on the wide suburban streets of Charters Towers in outback Queensland, I inadvertently missed a STOP sign and ploughed headlong into a car on the primary road. Thanks to modern safety standards in cars, my brother, the lady I struck and I all walked away unharmed from what was a horrific accident. For me the episode was both a humiliating experience, given that my own incompetence put three people’s lives at risk, and, perhaps more importantly, a wake-up call. Never again will I deride the importance of seat belts and air bags in cars. I’m sure they saved my life. I suppose I’m living evidence that you need to live through a potential disaster before you truly appreciate the safeguards you need to apply.
