Feb 14, 2017 - By Stephen Baynham

Simple and Reliable Database Migrations

In a previous blog post, our manager of Database Performance and Reliability, Aaron Brashears, wrote about Twitch’s database topology, particularly for a massive website database that receives over 300,000 transactions per second.

One of DPR’s primary goals for 2017 is to begin the work of decommissioning that database by migrating data out into specialized datastores managed by the teams which own the services that use the data. Our team of Database Reliability Engineers is hard at work building self-service tools to help other Twitch engineers migrate, provision, and manage their datastores.

Jacqueline Xu at Stripe recently wrote about the difficulties involved in migrating their own datastores. We have many of the same requirements Stripe does:

  • Scale — Our table of follow relationships alone contains over 1.5 billion records!

  • Uptime — Twitch is live and streaming video 24 hours per day, 7 days per week, 365 days per year (no OpieOP)

  • Accuracy — Nobody likes it when their favorite streamer disappears from their follows page — the data post-migration needs to be identical to the data pre-migration

However, because we are looking to dismantle our database, not just improve it, we have some additional requirements:

  • Datastore Support — We aren’t just splitting up our database into smaller databases — the engineering teams taking ownership of the data are now able to find datastores which best meet their products’ requirements. Many teams are choosing to move data into DynamoDB, Aurora, or even a non-persistent cache. Our tooling needs to support all of these options, even if that means less power

  • Coder Focused — The teams taking over the data frequently do not have Data Engineers. We provide consults to these teams, but in order to meet our goals, these teams will need to be able to perform and validate migrations with little input from us. So, our tools need to be focused on software engineers without a strong operations or data engineering background

These needs are slightly complicated by the fact that most new code at Twitch is written in Go, a language that does not have templating (generics) at this time. As a result, it’s not easy to natively build service code that assists backend engineers with arbitrary migrations. By using generated glue code to connect engineer-written data interfaces, via migration decorators, to our migration system, we can make integrating the migration process mostly seamless.

The Process

Our process mirrors writes to the new database from the start, but because our services don’t use full upserts to write changes, we have a difficult bootstrapping problem on our hands: we have to bulk-migrate the database, then replay the writes made while the bulk migration was ongoing, and only then cut over to simple mirroring once the new database has caught up to the old. Here is the general sketch of the process:

  1. In the service, initialize a generated decorator around the old datastore’s writer which mirrors operations to an AWS Kinesis stream

  2. Bulk-migrate the contents of the source database. This can be as simple as a pg_dump/pg_restore of select tables for a Postgres-to-Postgres migration, or as complicated as a custom bulk importer written by the service’s engineers

  3. Run a small bit of code that initializes a generated decorator around the new datastore’s writer and replays the mirrored operations (a rough sketch of this step follows the list)

  4. Monitor result validation from our migration dashboard — verify that no write mismatches are recorded, and that the Kinesis stream is being consumed faster than it’s being written to. Issues with validation may indicate a bug in the new datastore’s connector code. Provisioned migration Kinesis streams can hold data for up to 7 days, so it’s easy to rebuild the target datastore and start the replay process from the beginning after correcting any issues

  5. Once the dashboard indicates that replay has caught up, cut the service over to a system that uses a generated decorator to mirror results to the new database in real time. Reads should also be mirrored to allow the contents of the new database to be validated over time. The migration dashboard will record mismatches and errors

  6. After a period of days or weeks, when the engineer is confident that the new datastore is correct and their connector is functioning, change the mirroring behavior to treat the new datastore as the primary source of truth. Eventually, remove the generated decorators and decommission the old datastore
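The replay step is essentially a loop that drains the stream and applies each recorded operation to the new writer. The following is a minimal sketch under assumed names: the operationRecord format, the followsWriter interface, and the channel standing in for the Kinesis consumer are all illustrative, not Twitch’s actual generated code.

```go
package replay

import "context"

// operationRecord is an assumed wire format for a mirrored write pulled
// off the Kinesis stream.
type operationRecord struct {
	Method string
	Args   []string
}

// followsWriter stands in for the engineer's write interface to the new
// datastore; the method set is hypothetical.
type followsWriter interface {
	CreateFollow(ctx context.Context, fromUserID, toUserID string) error
	DeleteFollow(ctx context.Context, fromUserID, toUserID string) error
}

// replayOperations drains mirrored operations and applies them in order
// to the new datastore. A real consumer would read from Kinesis via the
// AWS SDK; a channel stands in for the stream to keep the sketch small.
func replayOperations(ctx context.Context, records <-chan operationRecord, w followsWriter) error {
	for rec := range records {
		var err error
		switch rec.Method {
		case "CreateFollow":
			err = w.CreateFollow(ctx, rec.Args[0], rec.Args[1])
		case "DeleteFollow":
			err = w.DeleteFollow(ctx, rec.Args[0], rec.Args[1])
		}
		if err != nil {
			return err
		}
	}
	return nil
}
```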

The Generated Decorators

An engineer begins with their service’s data connection. These connection structs should be split into writer and reader objects that expose interfaces the rest of the service accepts (this is already considered best practice at Twitch). The interfaces can then be tagged with instructions to generate decorators for them. We use an open source tool called gen, which allows us to easily write generation code that reads the engineer’s interface through a reflection-like API and generates decorator code files. A tagged interface looks like:
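As a rough illustration of the shape of such an interface (the mirror typewriter name and the FollowsWriter method set are hypothetical, not Twitch’s actual annotations):

```go
package follows

import "context"

// FollowsWriter is the write-side interface the rest of the service
// accepts. The +gen line below asks gen to run a typewriter over this
// type; "mirror" is an assumed typewriter name used for illustration.
// +gen mirror
type FollowsWriter interface {
	CreateFollow(ctx context.Context, fromUserID, toUserID string) error
	DeleteFollow(ctx context.Context, fromUserID, toUserID string) error
}
```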

When gen is run, the decorator code files are constructed and refreshed into place. The generated files look like this:
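Continuing the hypothetical follows package above, a heavily simplified sketch of what a generated mirroring decorator might look like (the record format and the stream abstraction are assumptions, and real generated code would do more error handling and reporting):

```go
package follows

import (
	"context"
	"encoding/json"
)

// operationRecord is an assumed wire format for mirrored writes.
type operationRecord struct {
	Method string   `json:"method"`
	Args   []string `json:"args"`
}

// streamPublisher abstracts the Kinesis stream; a real implementation
// would wrap the AWS SDK's PutRecord call.
type streamPublisher interface {
	Publish(ctx context.Context, data []byte) error
}

// FollowsWriterMirror forwards each call to the wrapped writer and, on
// success, mirrors the operation to the stream.
type FollowsWriterMirror struct {
	Inner  FollowsWriter
	Stream streamPublisher
}

func (m *FollowsWriterMirror) CreateFollow(ctx context.Context, fromUserID, toUserID string) error {
	if err := m.Inner.CreateFollow(ctx, fromUserID, toUserID); err != nil {
		return err
	}
	rec, err := json.Marshal(operationRecord{Method: "CreateFollow", Args: []string{fromUserID, toUserID}})
	if err != nil {
		return err
	}
	return m.Stream.Publish(ctx, rec)
}

func (m *FollowsWriterMirror) DeleteFollow(ctx context.Context, fromUserID, toUserID string) error {
	if err := m.Inner.DeleteFollow(ctx, fromUserID, toUserID); err != nil {
		return err
	}
	rec, err := json.Marshal(operationRecord{Method: "DeleteFollow", Args: []string{fromUserID, toUserID}})
	if err != nil {
		return err
	}
	return m.Stream.Publish(ctx, rec)
}
```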

There are additional generation flags, not detailed here, that allow engineers to permit “non-catastrophic” errors to be returned from the backend and mirrored, to mark some interface methods as non-mirrored simple passthroughs to the inner data connector, and so on. It’s also worth noting that the specialized code above that handles context parameters and error return values is generated dynamically based on the signature of each method.

The engineer then creates implementations for the new target datastore (if necessary) and configures their service to decorate their data connectors with the generated decorators. That might look like the following:
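The wiring snippets below are schematic fragments rather than code from our actual services: the postgres connector constructor, the NewService constructor, and the kinesisPublisher value are hypothetical names, and the decorator is the FollowsWriterMirror sketched above.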

Old:
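```go
// Pre-migration: the service talks directly to the Postgres-backed
// writer (constructor names are hypothetical).
writer := postgres.NewFollowsWriter(db)
svc := follows.NewService(writer)
```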

New:
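```go
// During migration: the same writer is wrapped in the generated
// decorator so every successful write is also published to the
// migration Kinesis stream. At cutover, the identical pattern is used
// to mirror to the new datastore's writer instead. Names hypothetical.
inner := postgres.NewFollowsWriter(db)
writer := &follows.FollowsWriterMirror{Inner: inner, Stream: kinesisPublisher}
svc := follows.NewService(writer)
```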

The code is now plugged into our framework and ready for mirroring. Mirroring to multiple datastores for advanced testing is possible, and developers can write their own comparison preprocessors in cases where a simple 1:1 comparison of return values is inappropriate (a sketch of one follows). By placing the write/replay/mirror functionality at the interface level, we can support any target datastore that our service engineers are willing to implement connectors for. There are also unexplored applications for live debugging.
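For example, a read that returns follows in no particular order should not register a mismatch just because two backends order rows differently. A hypothetical preprocessor for that case might look like:

```go
package follows

import "sort"

// compareFollowLists is a hypothetical comparison preprocessor: it
// treats two result sets as equal if they contain the same follows
// regardless of order, so backends with different natural orderings
// don't produce false mismatches.
func compareFollowLists(oldRows, newRows []string) bool {
	if len(oldRows) != len(newRows) {
		return false
	}
	a := append([]string(nil), oldRows...)
	b := append([]string(nil), newRows...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}
```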

The Dashboard

The migration code is heavily instrumented so that progress can be monitored from Grafana. We track everything from Kinesis replay progress, to mirrored data correctness validation, to the relative runtime between the old and new datastores, which has allowed us to explore different data solutions by making live performance comparisons. This simple tactical readout means our service engineers are alerted immediately when something is wrong with their migration and can consult with us on solutions.
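To give a feel for the shape of that instrumentation, here is a sketch continuing the hypothetical follows package; the statsClient interface and the metric names are assumptions, not a real metrics library’s API:

```go
package follows

import (
	"context"
	"time"
)

// statsClient stands in for whatever metrics client feeds Grafana.
type statsClient interface {
	Timing(name string, d time.Duration)
	Inc(name string)
}

// mirroredCreateFollow shows the idea: time both backends and count
// result mismatches, so the dashboard can graph correctness alongside
// relative performance.
func mirroredCreateFollow(ctx context.Context, oldW, newW FollowsWriter, stats statsClient, from, to string) error {
	start := time.Now()
	oldErr := oldW.CreateFollow(ctx, from, to)
	stats.Timing("migration.follows.old.create_follow", time.Since(start))

	start = time.Now()
	newErr := newW.CreateFollow(ctx, from, to)
	stats.Timing("migration.follows.new.create_follow", time.Since(start))

	// A write that succeeds on one backend but fails on the other is a
	// mismatch worth surfacing on the dashboard.
	if (oldErr == nil) != (newErr == nil) {
		stats.Inc("migration.follows.create_follow.mismatch")
	}
	return oldErr
}
```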

The Future

As more service teams begin to adopt this technology, we hope to improve it even further, building out our library of bulk-migration tools, simplifying the process of debugging data mismatches, and making it easier to restart a failed replay from scratch when bugs are discovered. This is just one of many projects we’re undertaking now to provide quality data management tools and education to our service teams at Twitch. If you’d like to help our team in our quest to make Twitch more performant and reliable, apply to join us as a Database Reliability Engineer!
