Sep 16, 2021 - By Mitchell Lloyd

The Quest for Fresher Stream Summaries

The Creator Analytics team owns the dashboards that show streamers metrics about their channel’s performance (like average viewers, follows, subscriptions) and other insights they can use to adjust their streaming activity and grow their communities. 

In this post we’ll focus on a migration from a legacy batch processing system to streaming analytics for generating creators’ stream summaries. This change hit production in April 2020.

Before this change, creators waited up to 1.5 hours after streaming to receive the Stream Summary email or see stats on their Stream Summary page. Today, we send emails to 99.99% of creators within 20 minutes of when they stop streaming, and creators can view up-to-the-minute stats about their streams. More importantly, we eliminated the ongoing reports of incorrect, overlapping, and missing stream summaries that plagued the legacy batch process.

These features were powered by a custom Go service that orchestrated a sequence of queries every 45 minutes. We gained familiarity with the legacy systems and the domain as we migrated this batch orchestration to Airflow, an open-source job scheduler. This migration helped us tease out the dependencies between queries so we could parallelize more of the work.

This screenshot shows all of the batch tasks used to generate creators' metrics after the move to Airflow.

The previous orchestration system ran all of its queries every 45 minutes and, if a run exceeded 45 minutes, two runs could overlap and overload the database. When backfilling, developers followed instructions to temporarily edit code, shut down ongoing queries, and initiate a process from their laptops. Switching to Airflow allowed us to run our batch queries every 30 minutes! Each run would wait for the previous one to finish before it began querying the database, and, using Airflow's resource pools to control parallelism, we could now backfill data with a few clicks.

However, the create_sessions query remained buggy. These bugs mostly affected streamers with long sessions, causing overlapping or missing stream summaries. We worked with data scientists to address these problems but weren’t able to make progress. The 100 lines of SQL for this query contained complicated windowing logic that we couldn’t reason through and used a table with 250 integers (named “Integers”) in an unrecognizable order.

At the end of 2019, Twitch’s data infrastructure team was shutting down the system that powered our analytics batch process. Switching from this sunsetting system to the primary Twitch data lake would add 30 minutes to 1.5 hours of delay to our existing analytics process. This gave us the business case we needed to prioritize a streaming solution for creator analytics. Our team looked for open-source tools that could address our needs and settled on Druid to aggregate metrics from Kinesis (a topic for another blog post) and Flink to generate stream sessions.

So what’s a stream session anyway? For each minute that a creator’s broadcasting software sends data to Twitch, our ingest system emits a minute_broadcast tracking event for their channel. The rules for creating a stream session are:

  1. When a minute_broadcast event first arrives, begin a session for that channel.
  2. When minute_broadcast events stop for 15 minutes, end the session for that channel.
  3. If a session exceeds 24 hours, begin breaking it into 24-hour segments starting at UTC midnight.

The resulting session records contain channel IDs and the start and end time of streams. 

{
   "channel_id": "123456",
   "start_time": "2021-03-05 14:13:00",
   "end_time": "2021-03-05 18:20:00",
   "id": "123456:2021-03-05 14:13:00"
}

Using Flink allowed us to ditch our SQL query in favor of a more flexible programming language (Java). Flink’s session window abstraction is a perfect fit for short stream sessions delineated by 15-minute gaps, but our 24-hour stream rule requires a custom trigger to fire when streams cross the 24-hour mark and a custom ProcessWindowFunction to decide when to begin 24-hour segmentation. Although these rules are nuanced, unit tests allowed us to demonstrate the job’s behavior in many scenarios and ship a solution that we knew eliminated the bugs from our legacy SQL job.
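To make that concrete, here is a minimal sketch of the session-window portion of such a job. It is illustrative rather than our production code: the MinuteBroadcast and StreamSession classes, their field names, and the watermark settings are assumptions, and the 24-hour segmentation (the custom trigger described above) is omitted.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;

public class StreamSessionJob {

    // Hypothetical event shape: one record per channel per broadcast minute.
    public static class MinuteBroadcast {
        public String channelId;
        public long eventTimeMillis;
    }

    // Mirrors the session record shown above (times kept as epoch millis here).
    public static class StreamSession {
        public String channelId;
        public long startTime;
        public long endTime;

        public StreamSession(String channelId, long startTime, long endTime) {
            this.channelId = channelId;
            this.startTime = startTime;
            this.endTime = endTime;
        }
    }

    // Collapse one window of minute_broadcast events into a single session record.
    public static class BuildSession
            extends ProcessWindowFunction<MinuteBroadcast, StreamSession, String, TimeWindow> {
        @Override
        public void process(String channelId, Context ctx,
                            Iterable<MinuteBroadcast> events, Collector<StreamSession> out) {
            long start = Long.MAX_VALUE;
            long end = Long.MIN_VALUE;
            for (MinuteBroadcast e : events) {
                start = Math.min(start, e.eventTimeMillis);
                end = Math.max(end, e.eventTimeMillis);
            }
            out.collect(new StreamSession(channelId, start, end));
        }
    }

    public static DataStream<StreamSession> buildSessions(DataStream<MinuteBroadcast> events) {
        return events
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<MinuteBroadcast>forBoundedOutOfOrderness(Duration.ofMinutes(1))
                                .withTimestampAssigner((event, ts) -> event.eventTimeMillis))
                .keyBy(e -> e.channelId)
                // Sessions end after a 15-minute gap in minute_broadcast events.
                .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
                .process(new BuildSession());
    }
}

In the real job, the custom trigger fires when a window crosses the 24-hour mark so the window function can emit 24-hour segments aligned to UTC midnight.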

Having spent two years triaging issues with long stream sessions, we wanted to make sure that the process for generating new stream sessions was idempotent. In the event of an upstream failure with missing or incorrect data, we wanted to regenerate a day of stream sessions without any manual cleanup work. We use Airflow to unload data from our data lake to S3 and then trigger a finite Flink job to generate the sessions. This daily job runs code that is nearly identical to the streaming job except that it consumes CSV files from S3 rather than compressed JSON via Kinesis.
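Here is a sketch of how the two entry points can share that logic, reusing buildSessions from the previous sketch. The stream name, region, parsing helpers, and output are illustrative assumptions; the real jobs write sessions to DynamoDB rather than printing them.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

import java.util.Properties;

public class SessionEntryPoints {

    // Streaming path: unbounded job consuming minute_broadcast events from Kinesis.
    public static void runStreaming(StreamExecutionEnvironment env) throws Exception {
        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(AWSConfigConstants.AWS_REGION, "us-west-2");

        DataStream<StreamSessionJob.MinuteBroadcast> events = env
                .addSource(new FlinkKinesisConsumer<>(
                        "minute-broadcast-events",       // hypothetical stream name
                        new SimpleStringSchema(),        // payload decompression omitted here
                        consumerConfig))
                .map(SessionEntryPoints::parseJson);

        StreamSessionJob.buildSessions(events).print();
        env.execute("stream-sessions");
    }

    // Backfill path: finite job consuming a day of CSV files unloaded to S3 by Airflow.
    public static void runBackfill(StreamExecutionEnvironment env, String s3Prefix) throws Exception {
        DataStream<StreamSessionJob.MinuteBroadcast> events = env
                .readTextFile(s3Prefix)                  // e.g. a dated S3 prefix
                .map(SessionEntryPoints::parseCsv);

        StreamSessionJob.buildSessions(events).print();
        env.execute("stream-sessions-backfill");
    }

    // Hypothetical parsers; JSON deserialization is omitted from this sketch.
    private static StreamSessionJob.MinuteBroadcast parseJson(String line) {
        throw new UnsupportedOperationException("JSON parsing omitted from this sketch");
    }

    private static StreamSessionJob.MinuteBroadcast parseCsv(String line) {
        String[] parts = line.split(",");
        StreamSessionJob.MinuteBroadcast event = new StreamSessionJob.MinuteBroadcast();
        event.channelId = parts[0];
        event.eventTimeMillis = Long.parseLong(parts[1]);
        return event;
    }
}

Because everything downstream of the source is shared, the backfill exercises the same session logic that runs in production every day.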

Ultimately, our sessions end up in a DynamoDB table partitioned by ChannelID, so we needed to add a global secondary index to the table to let us query all of the sessions starting on a particular day. This allows us to reconcile newly generated sessions against the existing sessions and delete any invalid sessions during backfill. Because we want to make sure this process works reliably and isn’t just an emergency fallback procedure, we run this backfill once a day as soon as the data lake has enough data to cover a day of stream sessions.
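As a rough sketch of that reconciliation access pattern with the AWS SDK for Java v2 (the table, index, and attribute names are placeholders, and the key schema used for deletes is an assumption):

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.DeleteItemRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

import java.util.Map;

public class SessionReconciler {

    private final DynamoDbClient dynamo = DynamoDbClient.create();

    // Table, index, and attribute names here are illustrative, not our production schema.
    private static final String TABLE = "stream-sessions";
    private static final String START_DATE_INDEX = "start_date-index";

    // Query the GSI for every session that started on the given UTC date, e.g. "2021-03-05".
    // A single page is shown for brevity; a real reconciler would follow LastEvaluatedKey.
    public QueryResponse sessionsStartingOn(String utcDate) {
        return dynamo.query(QueryRequest.builder()
                .tableName(TABLE)
                .indexName(START_DATE_INDEX)
                .keyConditionExpression("start_date = :d")
                .expressionAttributeValues(
                        Map.of(":d", AttributeValue.builder().s(utcDate).build()))
                .build());
    }

    // Remove a session the backfill found to be invalid (assumed partition/sort key names).
    public void deleteSession(String channelId, String sessionId) {
        dynamo.deleteItem(DeleteItemRequest.builder()
                .tableName(TABLE)
                .key(Map.of(
                        "channel_id", AttributeValue.builder().s(channelId).build(),
                        "id", AttributeValue.builder().s(sessionId).build()))
                .build());
    }
}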

Looking Back

Moving from a batch to a streaming process for stream sessions is a big win for our creators, who can review metrics while their stream is fresh in their minds, and for our engineers, who no longer have to deal with manual backfills and bugs.

However, this streaming setup is more complicated than running a batch process. Along the way we needed to tune our Kinesis streams and deal with memory leaks in our Flink job. We now manage a Zookeeper cluster and maintain custom infrastructure for deployment. Going forward, we plan to migrate our streaming and backfill jobs to Kinesis Analytics (AWS-managed Flink) to lower our maintenance costs.

This project taught our team some new tricks and lessons:

  • Always be backfilling: We love that our backfill runs every day rather than only as an ad-hoc process. This means that for many upstream outages we don’t need any intervention from the on-call engineer and we’re confident that the backfill will work every time. We look for ways to incorporate this “no fallback” principle into every new project so that correction mechanisms run automatically and continuously.
  • CDK is great: Along with a few other big infrastructure projects, our team used this project to level up on CDK. During this time, our productivity in setting up new infrastructure, monitoring, alarming, and automating processes with custom resources dramatically increased. I went on to write CDK’s aws-kinesisanalytics-flink module to make deploying Flink on Kinesis Analytics a breeze.
  • Flink in our toolbox: We got comfortable using Flink to analyze and transform streams of data. We see opportunities at Twitch to help product teams deliver real-time analytics features that used to be too expensive to build and maintain.
  • Queries as (imperative) code: Big SQL queries can be tough to change. It’s nice to manage query complexity in code with unit tests and common programming constructs.

This project added a critical piece of our stream-based analytics infrastructure. Stay tuned as we share more components of this system and our learnings along the way.


Want to Join Our Quest to empower live communities on the internet? Find out more about what it is like to work at Twitch on our Career Site, LinkedIn, and Instagram, or check out our Job Openings and apply.
