Overview
This appendix outlines the key processes to enable Lamplight (the Service) to recover from various disaster scenarios. Our aim is to ensure Lamplight downtime is minimised and data integrity maintained.
A secondary focus is ensuring that Lamplight’s internal business systems (the Business) can recover from disaster scenarios.
The backup and disaster recovery procedures are subject to change. This is the current, authoritative version covering customers on AWS servers.
Scope
The disaster recovery plans cover:
1. Service: Catastrophic server failure
2. Service: Hosting provider business failure
3. Service: Data corruption
4. Business: IT failure or loss
They do not cover physical risks to the server infrastructure. Our servers are provided by a third party, Amazon AWS. Amazon AWS have a number of certifications demonstrating their compliance with a range of security and availability standards, including ISO 27001, ISO 27017, ISO 27018, SOC 1, SOC 2, SOC 3, and Cyber Essentials Plus. We consider the risk mitigations our suppliers have in place to be sufficient.
Assets
The core Lamplight assets that would require recovery are:
1. Service: Customer data
2. Service: Application code
3. Business: Development code and tooling
4. Business: Business critical data
Backups
The following backup processes are in place:
| Asset | Process | Frequency |
|---|---|---|
| Customer data | Data replication | Continuous |
| Customer data | On-site backups | Daily |
| Application code | Off-site backups | Daily |
| Development code and tooling | Off-site backups | Daily |
| Lamplight business data | Off-site backups | Daily |
Lamplight uses the backup and recovery functionality provided by AWS. Central to this is the concept of Regions and Availability Zones (AZs).
Data is held in the London region. Within that region, two Availability Zones are available. These are engineered to operate independently, so that problems in one AZ do not impact services in the other.
We do not maintain hot backups in other locations or with other providers. We judge the risk of Amazon AWS closing down without notice to be minimal, and do not consider it necessary to prepare for. We judge the risk of an outage of the entire London region to be higher, but still very low. Amazon do not provide a second UK-based region in which to maintain a DR capability. We do not believe customers would find it acceptable for us to maintain DR facilities outside the UK, and the cost of maintaining a DR capability with a separate provider within the UK would be excessive compared to the benefit of doing so.
Databases
Data is stored on Amazon Aurora instances, which keep six copies of the data across both AZs. Each database server has a replica on standby. If faults are detected in the primary server, AWS will promote the replica while the faulty server heals. This happens automatically and rapidly (usually in under a minute).
Encrypted snapshots of each entire database are taken every evening and retained for 21 days. These snapshots can be restored to new servers if necessary, which protects against an entire Aurora instance becoming irretrievably corrupted.
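The 21-day retention period determines how far back a restore can reach. The following is a minimal sketch of that arithmetic, not part of the actual backup tooling. The function name is hypothetical, and the exact expiry boundary (a snapshot kept up to and including its 21st day of age) is an assumption rather than something stated by AWS or by this appendix.

```python
from datetime import date

RETENTION_DAYS = 21  # retention period stated in this appendix


def is_snapshot_retained(snapshot_date: date, today: date,
                         retention_days: int = RETENTION_DAYS) -> bool:
    """Return True if a snapshot taken on snapshot_date should still exist on today.

    Assumption: a snapshot is kept up to and including its 21st day of age;
    the real expiry boundary may differ by a day.
    """
    age_days = (today - snapshot_date).days
    # Snapshots dated in the future (negative age) cannot exist yet.
    return 0 <= age_days <= retention_days
```

For example, under this assumption a restore requested on 22 March could use the snapshot from 1 March, but not one from the day before that.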
The maintenance window for database servers is one hour, starting at 1am GMT each Monday. Amazon carry out updates to the underlying database software at this time, which may result in short periods of downtime.
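When investigating a short outage, it can help to check whether it fell inside the maintenance window. The sketch below is illustrative only: the function name is hypothetical, and it assumes the window runs from 01:00 up to (but not including) 02:00 on Mondays, with GMT treated as UTC.

```python
from datetime import datetime, timezone


def in_maintenance_window(moment: datetime) -> bool:
    """Return True if moment falls in the assumed Monday 01:00-02:00 GMT window."""
    if moment.tzinfo is None or moment.utcoffset() is None:
        # A naive datetime is ambiguous, so refuse to guess its time zone.
        raise ValueError("moment must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    # weekday() == 0 is Monday; hour == 1 covers 01:00:00 to 01:59:59.
    return utc.weekday() == 0 and utc.hour == 1
```

Because the check converts to UTC first, a timestamp recorded in British Summer Time (for example 02:30 BST on a Monday) is still matched correctly.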
File Data
Customer files uploaded to Lamplight are stored, encrypted, on S3. This provides extremely high availability, and multiple copies of files are stored automatically by AWS.
Application Servers
Application servers are provisioned from a pre-built image, with the latest application code deployed to them automatically. These servers are spread equally across the two London AZs. Server capacity can be added or removed automatically at any time, depending on traffic and load.
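To illustrate what an equal spread means as capacity changes, the sketch below divides a server count across a list of zones. It is a simplified model, not a description of how AWS actually places instances. The function name is hypothetical and the zone names in the usage note are used only as example labels.

```python
def spread_across_zones(server_count: int, zones: list[str]) -> dict[str, int]:
    """Divide server_count as evenly as possible across zones.

    When the count does not divide evenly, the earlier zones in the list
    each receive one extra server.
    """
    if server_count < 0 or not zones:
        raise ValueError("need a non-negative server count and at least one zone")
    base, extra = divmod(server_count, len(zones))
    return {zone: base + (1 if index < extra else 0)
            for index, zone in enumerate(zones)}
```

With two zones labelled `eu-west-2a` and `eu-west-2b`, four servers would be placed two in each, and five servers would be placed three and two. In this model, losing one AZ leaves roughly half the capacity running until replacements are added in the surviving zone.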
Other Services
Lamplight also uses Application Load Balancers, DynamoDB, Auto Scaling, and other services, all of which are engineered to scale in response to demand and to fail over seamlessly in the event of faults in any one component.
Monitoring
We carry out detailed monitoring of the service health of various components, response times, and other key metrics to ensure service responsiveness and health.
Key Contacts
Matt Parker
Amazon AWS https://status.aws.amazon.com/
Communications
Our external system status site is at https://sites.google.com/a/lamplight3.co.uk/status/. This should be updated as soon as possible in a DR scenario. Updates should also be posted to Twitter (@LamplightDb).