Dozens of unique tasks and runbooks run daily
Flip is a fintech company based in Indonesia that lets users send money securely at a low cost. In December 2021, Flip raised $48M in Series B funding, and it now provides a range of financial services to over eight million users. Flip uses Airplane as its primary platform for runbook automation.
Heavy reliance on SRE led to bottlenecks
Over the last year and a half, Flip has added over 250 team members to provide financial services for millions of people. As they grew, so did their reliance on technical people, specifically their Site Reliability Engineers (SREs).
The SRE team at Flip handles a number of critical technical operations, including service reliability, infrastructure management, system administration, and issue patching, and they need to be able to solve problems quickly.
When Flip's business went through hyper-growth, the team had to grow rapidly to keep up with demand. As a result, the volume of requests to SREs became unmanageable. The SREs became a bottleneck for work such as resolving on-call support tickets. Tasks took longer to resolve because the SREs were constantly context switching and dividing their attention across a number of high-priority tasks.
They needed something that would free up SRE bandwidth and let them move more quickly. Henry Suryawirawan, Flip's Head of Engineering, first started using Airplane to automate SRE toil. Once the SRE team realized that Airplane could help streamline operations, they started using Airplane tasks for a number of other engineering and support use cases.
Airplane: bridging gaps across the company
There are a number of use cases where Airplane has been instrumental at Flip:
- On-call support: Flip uses a number of tasks and runbooks to handle production-related on-call requests. Customer requests received by the operations team often require access to the production database to resolve. The SRE team created predefined tasks for these recurring use cases in Airplane and configured approval flows on them. Now, an on-call support person requests the task through Airplane, and a manager approves and runs it. Prior to Airplane, a small number of engineers had to SSH in and run a command, which was tedious and error-prone. With Airplane, Flip has safely opened up access to these operations without granting production access unnecessarily.
- Supporting maintenance: When stack maintenance requires downtime, SREs use an Airplane task to switch on "maintenance mode" in a quick and reliable way.
- Restarting jobs: SREs often deal with jobs becoming unresponsive. In many cases, these jobs can be safely retried, and a restart resolves the problem. Using Airplane, SREs can easily specify which VMs and which jobs to restart if a job is timing out.
- Reporting: Flip uses Datadog to monitor query health. Prior to Airplane, they would manually identify slow queries and remediate. Using Airplane schedules, they automatically generate a daily report that fetches slow queries from Datadog and sends a "Top Slowest Queries" report to their engineering Slack channel every morning.
- Feature flags: Flip often needs to enable feature flags for customers on a one-off basis. Because Airplane tasks can be triggered both manually and on a schedule, Flip can roll out features to specific customers during off-hours and can also schedule rollouts of new features.
- Rolling updates to prod deployments: Flip uses Airplane to easily pre-warm their Kubernetes cluster rather than using kubectl to do it manually. Previously, only devs who had access to Flip's Kubernetes cluster were able to make updates. Airplane broadens access to these operations, reduces risk of errors, and reduces the time spent running these operations.
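To make the reporting use case above more concrete, here is a minimal sketch in Python of the ranking and formatting step behind a "Top Slowest Queries" message. It is an illustration only, not Flip's actual task: the `QueryStat` record, its fields, and the `top_slowest` and `format_report` helpers are all hypothetical, and fetching the data from Datadog and posting the result to Slack are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class QueryStat:
    """One slow-query record. The fields are hypothetical, not Datadog's schema."""
    statement: str
    avg_duration_ms: float
    calls: int


def top_slowest(stats, limit=5):
    """Return up to `limit` records, ordered from slowest to fastest."""
    return sorted(stats, key=lambda s: s.avg_duration_ms, reverse=True)[:limit]


def format_report(stats, limit=5):
    """Build the plain-text body of a "Top Slowest Queries" message."""
    lines = ["Top Slowest Queries"]
    for rank, stat in enumerate(top_slowest(stats, limit), start=1):
        lines.append(
            f"{rank}. {stat.avg_duration_ms:.0f} ms avg "
            f"over {stat.calls} calls: {stat.statement}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data standing in for whatever the monitoring query returns.
    sample = [
        QueryStat("SELECT a", 120.4, 10),
        QueryStat("SELECT b", 950.0, 3),
        QueryStat("SELECT c", 30.0, 100),
    ]
    print(format_report(sample, limit=2))
```

A scheduled task would run a function like `format_report` once a day and hand the resulting text to whatever sends the Slack message, so nobody has to hunt for slow queries by hand.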
Airplane provides value at Flip in a few key ways:
- Limiting the number of interruptions that SREs experience.
- Simplifying one-off manual tasks and toil, such as enabling feature flags and restarting jobs.
- Acting as an easy, maintenance-free scheduler.
- Promoting security. Predefining these operations with approval flows makes auditing easier and the operations themselves safer.
Today, Flip regularly uses dozens of unique tasks and runbooks. "Airplane not only saves us time, but provides security across our teams. The audit logging and governance we get with Airplane has been key to building out our internal processes."