Scaling Payment Syncing & Improving DevOps on the Production Engineering team at Drop

Modifying system architecture to improve inter-service communication, and establishing a framework for executing long-running scripts

January 1, 2024
by George Shao


What is Drop?

Drop is a fintech startup based in Toronto, Canada, with over 5 million members and over $70 million raised from investors.

Users can link their debit/credit cards to the mobile app, and earn rewards when spending at partner brands including Amazon, Starbucks, and Uber.

Drop has appeared on LinkedIn's Top Startups in Canada list multiple times, ranking 6th in 2019, 2nd in 2020, 10th in 2021, and 9th in 2022.

This article is based on what I learned during my time as a Software Engineering Intern on the Production Engineering team there.

This article does not include any confidential information that is not already publicly available. Some technical details have been simplified, and exact details may be different from what is described here, as I am writing some of this from memory. Much of the info presented here can be found on the Drop Engineering Blog.

Scaling Payment Syncing

Since Drop's launch, members have spent over $155 billion, which translates into an enormous number of transactions that need to be synced from our data providers to our databases.

The Main Cluster

As a simplification, Drop's infra was composed of a Ruby on Rails app on EC2 interfacing with various AWS products including RDS, S3, CloudFront, ELB, Redis, Memcached, SES, and Elasticsearch. Essentially, all traffic was eventually routed to a Kubernetes cluster responsible for hosting our Rails monolith and most of its related services, including our admin portal, Puma web server, Sidekiq background job workers, feature flag service, and more.

As Drop grew, we needed to scale our infrastructure to handle the increased traffic.

We already had Kubernetes pod replicas and horizontal pod autoscalers configured, so we looked deeper and found that 67% of incoming traffic to the main Kubernetes cluster was from third party APIs related to payment syncing.

The Payment Syncing Cluster

We decided to move payment syncing to a separate Kubernetes cluster, which would allow us to scale it independently from the rest of our infrastructure.

So we had two Kubernetes clusters: a primary cluster for the main Rails app, and a secondary cluster for handling all payment syncing related tasks. But third party API webhook payloads were still hitting the main cluster, so we temporarily forwarded them from the main cluster to the payment syncing cluster.
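
To give a sense of what this stopgap looked like, the forwarding can be thought of as a thin proxy on the main cluster that re-posts each webhook payload to the same path on the payment syncing cluster. The sketch below is a hypothetical minimal version, not the actual implementation; the controller name, route, and environment variable are made up.

```ruby
require "net/http"

class PaymentWebhooksController < ApplicationController
  # Hypothetical internal address of the payment syncing cluster,
  # e.g. an internal load balancer in front of its web pods.
  PAYMENT_SYNC_CLUSTER_URL = ENV.fetch("PAYMENT_SYNC_CLUSTER_URL", "http://payment-sync.internal")

  def create
    # Re-post the raw webhook payload to the same path on the other cluster
    # and mirror its status code back to the third party API.
    uri = URI.join(PAYMENT_SYNC_CLUSTER_URL, request.path)
    response = Net::HTTP.post(uri, request.raw_post, "Content-Type" => request.content_type)
    head response.code.to_i
  end
end
```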

This was obviously inefficient, but it allowed us to move quickly and get the new cluster up and running. Unfortunately, it wasn't simple to change the third party API webhook endpoints, as they were configured on a per-user/per-card basis, and we had millions of users, many with multiple linked credit/debit cards.

In order to point the third party API webhooks to the new payment syncing cluster instead of the main cluster, we needed to make a separate API call for each user and each of their cards. At the very least, we'd have to make nearly a million API calls to third party APIs, and we didn't have a good way to run large migrations like this.

Improving DevOps

Our existing method of running large migrations was to manually access the Kubernetes cluster and run the migration script on a single pod.
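
For context, a one-off migration like the webhook repointing might have looked something like the sketch below, run by hand inside a single pod. The model association and ProviderClient API client are hypothetical stand-ins for the real third party integration.

```ruby
# Hypothetical one-off script, run manually on a single pod (e.g. via a Rails console).
new_endpoint = "https://payments.example.com/webhooks/transactions"

User.find_each do |user|
  # `linked_cards` and ProviderClient.update_webhook_url are illustrative names,
  # not the real association or client.
  user.linked_cards.each do |card|
    ProviderClient.update_webhook_url(card_id: card.provider_card_id, url: new_endpoint)
  end
end
```

If the pod running this loop is recycled partway through, there is no record of which users were already processed, which is exactly the kind of problem the features below address.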

We wanted to improve this process by making it easier to orchestrate long-running scripts. When searching for a solution, we looked for a few key features, including:

  • Persistence Across Deployments
  • Error Handling and Retries
  • Parallelization
  • Logging and Metrics Collection

To solve this problem, we set up Maintenance Tasks, a Rails engine by Shopify for queueing and managing data migrations.
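
Tasks built on this gem live under app/tasks/maintenance/ and subclass MaintenanceTasks::Task, defining a collection to iterate over and a process method that handles one element at a time. Roughly, our webhook migration could be expressed like the sketch below; the model association and ProviderClient client are hypothetical stand-ins.

```ruby
# app/tasks/maintenance/update_webhook_endpoints_task.rb
module Maintenance
  class UpdateWebhookEndpointsTask < MaintenanceTasks::Task
    NEW_ENDPOINT = "https://payments.example.com/webhooks/transactions"

    # The gem iterates over `collection`, calls `process` once per element,
    # and checkpoints its position so an interrupted run can resume later.
    def collection
      User.all
    end

    def process(user)
      # `linked_cards` and ProviderClient are illustrative, not the real names.
      user.linked_cards.each do |card|
        ProviderClient.update_webhook_url(card_id: card.provider_card_id, url: NEW_ENDPOINT)
      end
    end
  end
end
```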

Persistence Across Deployments

Based on some benchmarks I performed, I found that our original method would take 240 hours to make all the necessary API calls to update the third party API webhook endpoints.

We needed a way to persist the state of the script across deployments, so that we could resume the script from where it left off if it was interrupted.

Our original method of running the script on a single pod would not work, as the pod would be destroyed whenever the deployment was updated. Without Shopify's Maintenance Tasks gem, we would have been blocked from deploying to production for two weeks, slowing down our development process and wasting dev time.

Parallelization

Unfortunately, the Maintenance Tasks gem did not support instantiating multiple instances of the same task to run in parallel.

We considered a few options, including monkey patching, but ultimately settled on duplicating the script and adding parameters to specify a range of user ids to process, allowing us to divide the work across multiple tasks running in parallel.
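
The gem supports task parameters declared as Active Model attributes (the sketch below assumes a version with that support), which is what made the ID-range approach workable. One of the duplicated, parameterized copies might look like this; the attribute names are illustrative.

```ruby
# app/tasks/maintenance/update_webhook_endpoints_range_task.rb
module Maintenance
  class UpdateWebhookEndpointsRangeTask < MaintenanceTasks::Task
    NEW_ENDPOINT = "https://payments.example.com/webhooks/transactions"

    # Each copy of the task is started from the dashboard with its own id range,
    # so several runs can work through disjoint slices of users in parallel.
    attribute :min_user_id, :integer
    attribute :max_user_id, :integer
    validates :min_user_id, :max_user_id, presence: true

    def collection
      User.where(id: min_user_id..max_user_id)
    end

    def process(user)
      user.linked_cards.each do |card|
        ProviderClient.update_webhook_url(card_id: card.provider_card_id, url: NEW_ENDPOINT)
      end
    end
  end
end
```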

Looking at the Maintenance Tasks dashboard, we could see that multiple instances of the script were running in parallel:

[Image: Maintenance Tasks dashboard showing multiple runs of the migration task executing in parallel]

This reduced the time to complete the migration from 240 hours to 72 hours, a 70% improvement.

Error Handling and Retries

While running our migration script using the Maintenance Tasks framework, we encountered numerous errors:

  • degraded service / downtime with third party APIs
  • intermittent networking issues
  • errors when restarting Kubernetes pods
  • database contention issues

Unfortunately, there was no built-in or default behaviour to retry the task or otherwise resolve the error. There were, however, some built-in features we could use to build this functionality ourselves, including the after_error callback and the throttling mechanism.

To resolve these issues, we implemented our own custom error handling logic: we rescued (caught) the errors, retried the process with linear backoff, and logged the error if it persisted. This helped ensure that we could run the task continuously without human intervention for as long as possible, preventing intermittent issues from halting task progress, while still keeping track of what was causing errors.
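
A minimal sketch of that retry logic, sitting inside the task's process method, is shown below. The ProviderClient::ApiError class and update_webhooks_for helper are hypothetical, and the attempt count and backoff step are illustrative.

```ruby
# Inside the task class. Retries use linear backoff (the wait grows by a fixed
# step on every attempt) before the error is logged and re-raised.
MAX_ATTEMPTS = 5
BACKOFF_STEP = 10 # seconds; waits are 10s, 20s, 30s, ...

def process(user)
  attempts = 0
  begin
    update_webhooks_for(user) # hypothetical helper wrapping the per-card API calls
  rescue ProviderClient::ApiError, Net::OpenTimeout, ActiveRecord::LockWaitTimeout => e
    attempts += 1
    if attempts < MAX_ATTEMPTS
      sleep(BACKOFF_STEP * attempts)
      retry
    end
    Rails.logger.error("Webhook migration failed for user #{user.id}: #{e.message}")
    raise
  end
end
```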

Logging and Metrics Collection

The Maintenance Tasks gem provided a dashboard for viewing the status of tasks, but it did not provide any logging or metrics collection.

[Image: the Maintenance Tasks dashboard]

We wanted to increase observability, so we added logging using our logging provider LogDNA/Mezmo and metrics collection with DataDog and StatsD.

This allowed us to determine how many users had been processed, how many API calls had been made, and how many errors had occurred.
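
The metrics side can be as simple as a few StatsD counters emitted from the task via the dogstatsd-ruby client; the sketch below uses illustrative metric names that mirror the questions above (users processed, API calls made, errors seen).

```ruby
require "datadog/statsd"

# Illustrative namespace and metric names, not the real dashboard's.
STATSD = Datadog::Statsd.new("localhost", 8125, namespace: "webhook_migration")

def process(user)
  user.linked_cards.each do |card|
    ProviderClient.update_webhook_url(card_id: card.provider_card_id, url: NEW_ENDPOINT)
    STATSD.increment("api_calls")
  end
  STATSD.increment("users_processed")
rescue StandardError => e
  STATSD.increment("errors", tags: ["error:#{e.class.name}"])
  raise
end
```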

It also allowed us to observe whether our migration script was working properly, with new payment syncing webhook payloads from third party APIs being sent to the payment syncing cluster instead of the main cluster.

[Image: DataDog dashboard comparing payment syncing webhook payloads received by the main cluster and the payment syncing cluster]

The image above is from a DataDog dashboard I created. The light blue bars represent the number of payment syncing webhook payloads received by the main cluster, and the dark blue bars represent the number of payment syncing webhook payloads received by the payment syncing cluster.

At the beginning, requests are being sent to the main cluster, then forwarded to the payment syncing cluster, so both bars are the same height.

As we run the migration script over the course of 72 hours, repointing each user's and each card's third party API webhook endpoint to the payment syncing cluster instead of the main cluster, we see the share of requests sent to the main cluster drop to nearly 0% and the share sent to the payment syncing cluster climb to nearly 100%.

Impact

In the end, I was able to make a significant impact on Drop's infrastructure and DevOps processes.

By moving payment syncing to a separate Kubernetes cluster and running this large webhook endpoint migration, we were able to scale it independently from the rest of our infrastructure, reduce the load on the main Kubernetes cluster by 68%, and reduce webhook response time by 84%.

By establishing a framework for executing long-running scripts, we were able to allow developers to run efficient and observable migrations without blocking the production deployment pipeline.

This was just one of the two major projects I worked on during my 4-month internship at Drop, and overall I'm proud of the impact I made.