Building a Disaster Recovery Solution on AWS for SaaS
Discover how we built a fully functional pilot light DR environment to protect the client's SaaS infrastructure from downtime.
Executive Summary
Creating a Disaster Recovery Infrastructure
Our Customer
Pragma IT therapyBOSS is a comprehensive web and mobile SaaS platform for agencies and clinicians that allows its users to manage the administrative and clinical aspects of home health therapy (Early Intervention, Physical Therapy, Speech Therapy, Skilled Nursing, etc.). It helps healthcare providers be more efficient and compliant, saves time, cuts costs and streamlines the operations involved in treating patients at home.
The Obstacles They Faced
The client’s main workloads run on-premises. To meet US healthcare compliance requirements, reduce restore time along with the recovery time objective (RTO) and recovery point objective (RPO), minimize the interruption of critical processes and safeguard business operations, they needed a trustworthy and sustainable disaster recovery (DR) infrastructure.
How We Helped
Romexsoft worked on the therapyBOSS on-premises environments and software components and built a fully functional pilot light DR environment that protects the client’s SaaS infrastructure from prolonged downtime and thereby safeguards vital business operations.
The Challenges
Balancing Fast Recovery with Cost Efficiency
The challenge was to find the DR option for the SaaS that best balances the fastest feasible restoration of the platform against the cost-effectiveness of the disaster recovery infrastructure itself.
Negative events that can affect on-premises environments include a hardware or software failure, a network or power outage, physical damage caused by fire or flooding, human error, or any other significant disaster that harms business continuity.
The Solution
Cloud-Based Disaster Recovery Model
How the application is built
The therapyBOSS application is written in Java and has a microservices-based, containerized architecture. The microservices communicate through REST APIs and event-driven messaging, with Apache Kafka as the distributed event streaming platform. Galera Cluster for MySQL and MongoDB are used as data storage solutions.
How the DR infrastructure is designed
After several workshops with the customer, Romexsoft suggested building a pilot light DR infrastructure in the US East (Ohio) AWS Region, far from the on-premises data center. This decision was driven by the client’s specific RTO, RPO and total cost of ownership (TCO) requirements for the application, as well as by the need to recover critical IT systems faster after any event that harms the Pragma IT business.
The pilot light disaster recovery approach was delivered by configuring and running the most critical core elements of the customer’s system in AWS. When the time for recovery comes, a full-scale production environment is rapidly provisioned in AWS around that critical core.
Ensuring data relevance and synchronization
To keep the data in the DR environment current, one Galera read replica always runs on an AWS EC2 instance and remains synchronized with the main cluster in the data center. A similar approach is used for the MongoDB cluster. Additional Galera and MongoDB replicas will be provisioned on EC2 instances and synchronized as well.
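One way to confirm that the AWS-side node is keeping up is to inspect its Galera `wsrep` status variables. The Python sketch below is an illustration, not the tooling used on the project. It assumes the AWS-side instance is a regular Galera cluster node and that the status values have already been read (for example with `SHOW GLOBAL STATUS LIKE 'wsrep_%'`) into a dictionary. The function name and the queue threshold are hypothetical.

```python
def galera_node_in_sync(status, max_recv_queue=10):
    """Return (ok, problems) for a dict of Galera wsrep status values.

    `status` is assumed to map variable names to string values.
    `max_recv_queue` is a hypothetical threshold, not a value from the
    case study.
    """
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not in the primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node state is not Synced")
    if status.get("wsrep_ready") != "ON":
        problems.append("node is not ready to accept queries")
    # A long receive queue means write sets arrive faster than they are applied.
    recv_queue = int(status.get("wsrep_local_recv_queue", 0))
    if recv_queue > max_recv_queue:
        problems.append(f"receive queue too long: {recv_queue}")
    return (not problems, problems)


# Example status snapshot with made-up values.
healthy = {
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_ready": "ON",
    "wsrep_local_recv_queue": "0",
}
ok, problems = galera_node_in_sync(healthy)
```

A check like this could run on a schedule and raise an alert when `ok` is false, so that a lagging replica is noticed before a recovery depends on it.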
Data synchronization between on-premises and AWS runs over AWS Site-to-Site VPN. All other AWS components, such as the applications running on AWS Fargate, Amazon MSK, the Jenkins server and the bastion host, remain idle. At the moment of disaster, the idle part of the AWS infrastructure is provisioned using the infrastructure as code (IaC) approach with Terraform.
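To illustrate how the idle part could be brought up from a runbook script, here is a minimal Python sketch that assembles and runs a `terraform apply` command. It is an assumption-based example, not the project's actual automation: the working directory `infra/dr` and the variable `dr_active` are hypothetical names, and only standard Terraform CLI options (`-chdir`, `-input=false`, `-auto-approve`, `-var`) are used.

```python
import subprocess


def build_terraform_apply(workdir, variables):
    """Build a non-interactive `terraform apply` command as an argument list."""
    cmd = ["terraform", f"-chdir={workdir}", "apply", "-input=false", "-auto-approve"]
    # Sort the variables so the generated command is deterministic.
    for name, value in sorted(variables.items()):
        cmd += ["-var", f"{name}={value}"]
    return cmd


def activate_dr(workdir, variables, run=subprocess.run):
    """Run the apply; `run` is injectable so the call can be dry-run tested."""
    return run(build_terraform_apply(workdir, variables), check=True)


# Hypothetical directory and variable names, for illustration only.
command = build_terraform_apply("infra/dr", {"dr_active": "true"})
print(" ".join(command))
```

Keeping the command construction separate from its execution lets the monthly drill and a real failover share the same code path, while the command itself can be reviewed without touching any infrastructure.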
How the DR infrastructure is maintained
We have agreed with the customer to perform disaster recovery exercises for the staging environment on a monthly basis. This activity ensures:
- confidence that the DR infrastructure functions properly at all times
- consistency of the DR environment with the application as it evolves
- tracking of, and compliance with, the agreed time range for restoring replicas of the on-premises infrastructure
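The result of a drill can be checked against the recovery target with a few lines of code. The Python sketch below is illustrative only. The function name and the timestamps are hypothetical, and the one-hour default reflects the approximate RTO cited in this case study rather than a contractual value.

```python
from datetime import datetime, timedelta


def evaluate_drill(started_at, restored_at, rto=timedelta(hours=1)):
    """Compare the measured restore time of a DR drill with the RTO target."""
    elapsed = restored_at - started_at
    if elapsed < timedelta(0):
        raise ValueError("restored_at must not be earlier than started_at")
    return {
        "elapsed": elapsed,
        "within_rto": elapsed <= rto,
        "margin": rto - elapsed,  # negative when the drill overran the target
    }


# Hypothetical drill timestamps, for illustration only.
result = evaluate_drill(
    started_at=datetime(2024, 1, 15, 9, 0),
    restored_at=datetime(2024, 1, 15, 9, 47),
)
print(result["within_rto"], result["elapsed"])
```

Recording the margin for every monthly drill makes it visible when restore times start drifting toward the target as the application grows.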
Disaster Recovery Solution for Healthcare SaaS Architecture Diagram
Amazon Web Services Utilized
Verified by AWS
This case study is validated by AWS. Experts and professional auditors from AWS reviewed this case study and verified that we, Romexsoft, have built a functional infrastructure and efficient cloud solution.
It shows that Romexsoft, a certified AWS Advanced Tier Services Partner, delivers cloud solutions according to AWS standards and best practices.
The Results
Minimized Downtime with Compliant DR Architecture
The AWS-based DR infrastructure designed and developed by Romexsoft holds the critical core of the customer’s SaaS, around which all other infrastructure pieces can be quickly provisioned to restore the complete system when the time comes.
With the implemented solution we also achieved:
- compliance with US healthcare requirements
- minimal interruption of critical processes
- safeguarding of vital business operations
- cost-effectiveness of the whole DR infrastructure
- assurance that systems and services can be restored in a short period of time (a recovery time objective (RTO) of about one hour and a recovery point objective (RPO) of seconds)
Why Romexsoft
Expert in Pilot Light Disaster Recovery
Romexsoft is an AWS-certified Consulting Partner, trusted Software Development Company and Managed Service Provider, founded in 2004. We help customer-centric companies build, run, and optimize their cloud systems on AWS with creative, elegant, and cost-efficient solutions.
Our key values
- Delivery of quality solutions
- Customer satisfaction
- Long-term partnership
We have successfully delivered 100+ projects and have a proven track record in FinTech, HealthCare, AdTech, and Media industries.
Romexsoft holds a 5-star rating on Clutch thanks to its strong expertise, responsiveness, and commitment. 60% of our clients have been working with us for over 4 years.
Disaster Recovery Solution on AWS FAQ
An AWS pilot light strategy involves maintaining a minimal version of a system, with core components always running. In the event of a disaster, the system can be quickly scaled up to become fully operational, using the most recent data. Because this core system is always on standby, recovery times are significantly shorter than with traditional methods such as restoring from backups, which keeps business disruption to a minimum.
AWS disaster recovery solutions offer several advantages over traditional DR methods. They scale flexibly, allowing for cost-effective solutions that can be tailored to specific business needs. They also support high availability, with multiple Regions and Availability Zones to choose from, which helps preserve data integrity and availability even in the event of regional outages. Additionally, the pay-as-you-go model of AWS allows for cost savings, as businesses only pay for the resources they use.
Integrating infrastructure as code tools, such as Terraform, into the DR process allows for the automated and consistent provisioning of AWS resources. This helps keep the DR environment in sync with the production environment, minimizing potential discrepancies during recovery. Because IaC definitions are kept under version control, any changes to the infrastructure are tracked and can be rolled back if necessary. This level of automation and consistency makes the DR process smoother and more reliable.
Continuous data synchronization is crucial in an AWS pilot light strategy because it keeps the standby environment updated with the most recent data from the primary system. In the event of a disaster, the recovery process then restores the most up-to-date version of the system, minimizing data loss and supporting business continuity. Continuous synchronization also reduces the risk of discrepancies between the primary and DR environments, which makes the recovery process smoother.