April 20th, 2010 @ Gareth Bult
We suffered a major outage today at our Data Center, or rather our Data Center suffered an outage which left all our kit inaccessible from the outside world. As far as we can tell, the core router supplying the data center blew all of its redundant power supplies simultaneously, knocking all connectivity off-line. The problem was compounded by a number of associated communications systems (PABXs etc.) also being taken off-line, thus limiting on-site communications.
Here’s the feedback from the Engineers:
Around 11:16pm the primary power supply in our Cisco core failed, and shortly after its second PSU also blew, taking out the breaker on the comms cabinet in room one. We tried to repair the PSU at circuit level but alas, were unsuccessful and declared it a lost cause at approx 1am. We tried to introduce our Foundry router to the THN links but a fault light was on the BT Openreach WES box, so we called the BT NOC at around 1:45am to report a fault on the circuits. We received a reply at around 2:15am: “no fault found”, “customer equipment faulty”. Following this, we spent the next few hours attempting a workaround with three different switches, LH/LX optics etc., all with the same result.
At 8am, Hardware.com despatched two replacement PSUs by dedicated delivery driver; these only arrived just after 11am. The switch was up and running and, as logs show, customer racks were reconnected at around 11:15am, but our main fibre and redundant pair were both *still* showing faults on the NTE unit. BT Openreach were called again at around 12 noon, and we demanded site presence. The engineer arrived onsite first at 3pm, then 4pm, having completed “end to end loopback” tests on both fibre pairs – no fault found. I then took it upon myself to close the front gate and take hostage the BT Openreach engineer, who was determined he was finishing his shift and would return the following day – naturally, this wasn’t going to happen. After quizzing him repeatedly regarding whether his colleague at Star in Barnwood had ruled out the line card or done a reboot, I finally convinced him to pull the card at the remote end for a cold reboot.
At around 5:30pm the first fibre pair became active, green lights, and we began exchanging packets of data with Telehouse North once again.
Many thanks to the guys at the Data Center who stuck with the fault from 11:16pm last night until 5:45pm today when BT managed to get the circuits back on-line. Time for a well-earned rest. Following on, this is the work currently underway to mitigate the effects of any such future problems:
We have placed on order two Cisco 6509-E chassis, two Sup 720 3B cards, and a plethora of Cisco line cards, with spare optics etc. Shortly we’ll have a pair of fully redundant cores, powered from a pair of separate Riello UPS systems, with A + B fed power supplies in each unit. Our Gas Turbines and new 11kV feed are already onsite and ready for commissioning at the beginning of next month, and we’ll be posting various photographs of the hardware, power circuitry, and our new suite.
The Data Center’s website (Saxon Data) can be found here.
Saxon Data Incident Report
Date of Incident : 2010-04-19
Incident Type : Communications Loss
At 23:30 (BST) on Monday, 19th April 2010, we experienced a multiple power supply failure on our Cisco 6509 router, directly impacting connectivity to Saxon House. Replacement PSUs were immediately sourced from our hardware vendor but were unavailable until the following morning.
At 10:24, the replacement PSUs arrived and were fitted by on-site engineers. The router booted successfully but failed to establish a layer 1 connection to our Telehouse router located in London, Docklands. Upon further investigation, we found that BT’s remote NTE, which carries the first leg of layer 1 fibre connectivity to London, appeared to have reset during the outage.
A BT engineer was dispatched to the local Barnwood exchange to diagnose and correct the outstanding fault. At 17:30, full connectivity was restored and services resumed.
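As a rough check on the length of the outage, the timestamps quoted in this post can be subtracted directly. The short Python sketch below is illustrative only: the function and constant names are made up for this example, and it assumes that every quoted time is local (BST) and that the engineers' "11:16pm" refers to 19th April.

```python
from datetime import datetime, timedelta

# Times as quoted in the text. Assumption: all are local time (BST),
# so no timezone conversion is applied.
REPORT_START = datetime(2010, 4, 19, 23, 30)    # incident report: PSU failure
REPORT_END = datetime(2010, 4, 20, 17, 30)      # incident report: connectivity restored
ENGINEER_START = datetime(2010, 4, 19, 23, 16)  # engineers' account: primary PSU failed
ENGINEER_END = datetime(2010, 4, 20, 17, 45)    # blog summary: circuits back on-line


def outage_duration(start: datetime, end: datetime) -> timedelta:
    """Return the elapsed time between the start and end of an outage."""
    if end < start:
        raise ValueError("outage cannot end before it starts")
    return end - start


def format_duration(duration: timedelta) -> str:
    """Format a duration as whole hours and minutes, e.g. '18h 29m'."""
    total_minutes = int(duration.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


if __name__ == "__main__":
    print("Incident report:", format_duration(outage_duration(REPORT_START, REPORT_END)))
    print("Engineers' account:", format_duration(outage_duration(ENGINEER_START, ENGINEER_END)))
```

On these figures the incident report describes an outage of 18 hours, and the engineers' account one of 18 hours 29 minutes; the two sources differ slightly on both the start and end times.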
Obviously, our customers cannot tolerate this length of outage and we must make sure that we minimise the risk of anything similar happening again. To that end, we are bringing forward the acquisition of another Cisco router, which should be installed by the time this weekend is out. For the technical people out there, we will be running our new cores on the following rig:
Router 1
Cisco 6509 Chassis
2 x 2500W Power supplies
1 x High speed fan tray (plus 1 spare for each router)
1 x WS-SUP720-3B-GR3 – Supervisor fabric (handling main inbound dual fibre feeds)
1 x WS-X6748-GE-TX-GR3 – 48 port Gigabit line card for customer racks
1 x WS-X6724-SFP-GR3 – 24 Port SFP fibre card for customer racks
Router 2
Cisco 6509 Chassis
2 x 1300W Power supplies
1 x High speed fan tray (plus 1 spare for each router)
1 x WS-SUP720-GR3
1 x 16 port classic fibre card
1 x 48 port Ethernet card
We’ll be running dual OSPF sessions (1 per router) to the KCN rack in Telehouse North, over the existing BT NTE equipment and also the second, redundant fibre. This will mean that if any part of a router fails, we have a spare onsite, and if a complete catastrophic failure occurs, it’s a matter of several minutes’ work by hand to swap everyone over to the spare box.
We’re currently holding over 150 pieces of spare SFP optics – SX and LH/LX – so a trivial failure of a single port is also catered for.
Each router will be powered from independent N+1 UPS feeds, and this will also be an available option to rack customers soon.
In addition, we have a 2 hour service contract arranged with Hardware.com who are based in Cirencester – 20 minutes from us.
We’re also bringing forward investment in our power generation and UPS infrastructure to maintain a high degree of redundancy. As an example, we have taken delivery of three Riello Master Plus units to be run in parallel, and have another one on the way, which will bring us up to a total of 400 kVA 2N.
All units have output transformers, input rectifiers and 8 minutes of battery autonomy at full load. We have arranged to light the second fibre in the BT Openreach blowpipe, and to have it bypass the NTE equipment at both ends of the circuit, ensuring that a failing NTE can no longer cause failure of the data path.
We are working to ensure that an off-site copy of the ticketing system will be available in the future in the event that the data centre suffers a communications outage.
As we expand we will install a second, diverse fibre feed to avoid further downtime due to events such as fibre damage during civil engineering work. We have meetings with two major communications providers in the next month, both of whom are eager to discuss POPing our building with their own Dark Fibre feeds. Naturally, this would be a massive advantage to both the facility and its customers, so we’ll keep you all updated.
On Thursday 6th we are looking to receive the last of the parts for the new router and will have fresh multimode cable into each rack. The new router will at first be placed in line with the old one and all customer VLANs will continue to be fed from the old system. Customers will be notified of the exact time of this emergency changeover. As the configuration will be prepared beforehand, this should result in an outage of less than 10 seconds whilst the fibres are swapped over.
We’ll let everyone know once this phase is complete and invite service tickets, in which you will get the chance to specify the exact time of day we switch you over to the new core. We plan on leaving the old fibres into each rack for extra redundancy and to save time should a cabling fault ever affect you.
Once everyone is over to the new router, we’ll be looking to pull the old 6509 chassis, replace it with the new spare unit / Sup card / PSUs etc., and set up the second fibre link.
Naturally, we’ll keep everyone informed throughout the whole process by email and if anyone has any questions, please feel free to call / email.
Please accept our most sincere and deepest apologies and feel free to copy this document to any of your clients who may be asking questions of you. You can all rest assured that the process of upgrading connectivity, power and UPS / generation is completely underway and will result in a highly resilient DC infrastructure in a very short while.
Thanks
Saxon Data technical support team.