How to monitor your app retention in Fabric

By Shobhit Chugh, Product Manager


A common misconception in the mobile world is that the number of app downloads is the strongest indicator of success. But what if you have a ton of users, yet they rarely interact with your app? What if people download your app and then churn the next day? Looking at your total number of app users or installs in isolation doesn’t paint an accurate picture of your app’s health - you also need to pay attention to retention. Retention tells you how often people return to your app. It’s important to measure because if your hard-earned users aren’t sticking around and regularly engaging with your app, you cannot build a sustainable mobile business.

In this blog post, we’ll show you how to track your retention over time through Fabric’s new retention page (which is part of our new dashboard).

 

Measuring retention from three angles

To give you a holistic view of how strong your app retention is, we focus on three things: active users, activity segments, and new user retention.

Let’s review how each angle helps you better understand retention.

1. Fluctuations in active users

The first sign of how well you’re retaining users is the number of active users you have on a daily, weekly, and monthly basis - and whether these numbers are trending up or down over time.

When you navigate to the retention page, you’ll see these metrics in the top two graphs:

  • Daily active users (DAUs - how many people have had at least one session with your app today)

  • Weekly active users (WAUs - how many people have had at least one session with your app in the last seven days)

  • Monthly active users (MAUs - how many people have had at least one session with your app in the last 30 days)

The pulsating DAUs graph gives you real-time insight into how many people have used your app so far today, compared to this time last week. The second graph provides an additional lens by highlighting changes in weekly and monthly active users.

Steady, consistent growth in active users is a good signal that your retention is strong.
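To make the three definitions above concrete, here is a minimal sketch of how DAUs, WAUs, and MAUs could be computed from raw session data. It is an illustration only, not how Fabric computes these numbers: the `active_users` helper, the `(user_id, session_date)` record shape, and the choice to count today as part of each window are all assumptions.

```python
from datetime import date, timedelta

def active_users(sessions, today, window_days):
    """Count distinct users with at least one session in the
    `window_days`-day window ending on (and including) `today`."""
    start = today - timedelta(days=window_days - 1)
    return len({user for user, day in sessions if start <= day <= today})

# Hypothetical session log: (user_id, session_date)
today = date(2017, 3, 30)
sessions = [
    ("ana", today),
    ("ana", today - timedelta(days=1)),
    ("ben", today - timedelta(days=3)),
    ("cam", today - timedelta(days=12)),
    ("dee", today - timedelta(days=45)),
]

dau = active_users(sessions, today, 1)    # only ana was active today
wau = active_users(sessions, today, 7)    # ana and ben
mau = active_users(sessions, today, 30)   # ana, ben, and cam
print(dau, wau, mau)  # 1 2 3
```

Each user is counted once per window no matter how many sessions they had, which is why these metrics measure reach rather than usage volume.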

 

2. Changes in activity segments

The middle section of the retention page is centered around activity segments. Based on session data, activity segments group users into buckets, ranging from inactive users (people who have not launched your app in more than a week) all the way to high activity users (people who have used your app almost every single day in the past seven days).

Activity segments provide a deeper look at your retention by revealing how engaged your active users are, how many are at risk of abandoning your app, and how people flow from one segment to another.

This graph can tell you a few interesting things about retention. First off, look at how people are transitioning between states. For example, if you see a large and healthy flow of users moving from “low activity” to “medium or high activity”, their engagement level is changing in a positive way - meaning that retention is improving.

Another interesting thing to monitor is the correlation between the number of new users and the growth in each segment. For instance, if you’re earning thousands of new users every week, but you’re only seeing a corresponding bump in the low activity segment - this means your new users are not deeply committed to your app. In this case, consider improving your onboarding flow to showcase the value of your app to new users.
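The idea of segments and the flows between them can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `segment` function and its thresholds (1-2 active days is "low", 3-5 is "medium", 6-7 is "high") are hypothetical, since the exact cutoffs Fabric uses are not described here.

```python
from collections import Counter

def segment(active_days_last_7):
    """Bucket a user by how many of the last seven days they were active.
    The thresholds are assumptions chosen for illustration."""
    if active_days_last_7 == 0:
        return "inactive"
    if active_days_last_7 <= 2:
        return "low"
    if active_days_last_7 <= 5:
        return "medium"
    return "high"

# Hypothetical data: active days per user for two consecutive weeks
last_week = {"ana": 1, "ben": 2, "cam": 6, "dee": 0}
this_week = {"ana": 4, "ben": 2, "cam": 7, "dee": 1}

# Count how many users moved from each segment to each other segment
flows = Counter(
    (segment(last_week[user]), segment(this_week[user])) for user in last_week
)
for (before, after), count in sorted(flows.items()):
    print(f"{before} -> {after}: {count}")
```

In this sample, one user moves from "low" to "medium" and one moves from "inactive" to "low", which is the kind of upward flow that signals improving retention.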

Pro Tip: Move the slider at the bottom of this graph to see your activity segments at different times during the last 30 days. You can also use this slider to compare how active your users are on weekdays versus weekends.

 

3. New user retention rate

Finally, the last graph on the retention page shows you what percentage of new users are continuing to interact with your app after one day, seven days, and thirty days. This graph helps you see whether or not new users are still active after their first session at key time intervals. For instance, the day one metric means that X% of people who installed and used your app for the first time yesterday also used it today.

The higher these percentages are, the stronger your app retention is because it means that a large number of new users are turning into loyal, habitual users. If we notice any irregularities (e.g. an unusual increase or decrease in your new user retention), we’ll flag it so you can dig into what happened on that day.
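As a rough sketch of the day one, day seven, and day thirty metrics, the calculation below takes the cohort of users first seen on a given day and checks what share of them had a session exactly N days later. The `retention_rate` helper and the data shapes are assumptions for illustration, and "exactly N days later" is one possible reading of the metric rather than a confirmed definition.

```python
from datetime import date, timedelta

def retention_rate(first_seen, sessions, cohort_day, n):
    """Percentage of users first seen on `cohort_day` who had a session
    exactly `n` days later. Returns None when the cohort is empty."""
    cohort = {user for user, day in first_seen.items() if day == cohort_day}
    if not cohort:
        return None
    target = cohort_day + timedelta(days=n)
    returned = {user for user, day in sessions if user in cohort and day == target}
    return 100.0 * len(returned) / len(cohort)

# Hypothetical data
cohort_day = date(2017, 3, 1)
first_seen = {
    "ana": cohort_day, "ben": cohort_day, "cam": cohort_day, "dee": cohort_day,
    "eli": cohort_day - timedelta(days=5),  # not part of this cohort
}
sessions = [
    ("ana", cohort_day + timedelta(days=1)),
    ("ben", cohort_day + timedelta(days=1)),
    ("ana", cohort_day + timedelta(days=7)),
    ("eli", cohort_day + timedelta(days=1)),
]

for n in (1, 7, 30):
    print(f"day {n}: {retention_rate(first_seen, sessions, cohort_day, n)}%")
```

Here two of the four new users return on day one (50%), one returns on day seven (25%), and none return on day thirty.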

 

From understanding retention to improving it

Fabric’s new retention page helps you measure retention from three different angles: active users (how many people are using my app?), activity segments (how engaged are my users?), and new user retention rate (how often do new users come back to my app?).

Armed with this insight, you’ll develop a baseline understanding of your retention, be able to recognize when it becomes a problem, and act quickly to combat churn. 

If you’re already a Fabric customer, click here to check out your retention page.

If you’re not currently a Fabric customer, get started by signing up and installing Crashlytics.

Migrating to Druid: how we improved the accuracy of our stability metrics

by Max Lord, Software Engineer

Stability metrics are one of the most critical parts of Crashlytics because they show you which issues are having the biggest impact on your apps. We know that you rely on this data to prioritize your time and make key decisions about what to fix, so our job is to ensure these metrics are as accurate as possible.  

In an effort to strengthen the reliability of these numbers, we spent the last few months overhauling the system that gathers and calculates the stability metrics that power Crashlytics. Now, all of our stability metrics are being served out of a system built on Druid. Now that the migration is complete, we wanted to step back, reflect on how it went, and share some lessons with the rest of the engineering community.

Why migrate?

In the very early days of Crashlytics, we simply wrote every crash report we received to a Mongo database. Once we were processing thousands of crashes per second, that database couldn't keep up. We developed a bespoke system based on Apache Storm and Cassandra that served everyone well for the next few years. This system pre-computed all of the metrics that it would ever need to serve, which meant that end-user requests were always very fast. However, its primary disadvantage was that it was cumbersome for us to develop new features, such as new filtering dimensions. Additionally, we occasionally used sampling and estimation techniques to handle the flood of events from our larger customers, but these estimation techniques didn't always work perfectly for everyone.

We wanted to improve the accuracy of metrics for all of our customers and introduce a richer set of features on our dashboard. However, we were approaching the limits of what we could build with our existing architecture. Any solution we built on top of it would still be restricted to pre-computing metrics and subject to sampling and estimation. This was our cue to explore other options.

Discovering Druid

We learned that the analytics start-up Metamarkets had found themselves in a similar position, and the solution that they open-sourced, Druid, looked like a good fit for us as well. Druid belongs to the column-store family of OLAP databases, purpose-built to efficiently aggregate metrics from a large number of data points. Unlike most other analytics-oriented databases, Druid is optimized for very low latency queries. This characteristic makes it ideally suited for serving data to an exploratory, customer-facing dashboard.

We were doubtful that any column store could compete with the speed of serving pre-computed metrics from Cassandra, but our experimentation demonstrated that Druid's performance is phenomenal. After spending a bit of time tweaking our schema and cluster configuration, we were easily able to achieve latencies comparable to (and sometimes even better than!) our prior system.  We were satisfied that this technology would unlock an immense amount of flexibility and scale, so our next challenge was to swap it in without destabilizing the dashboard for our existing customers.

Migrating safely

As with all major migrations, we had to come up with a plan to keep the firehose of crash reports running while still serving up all of our existing dashboard requests. We didn’t want errors or discrepancies to impact our customers, so we enlisted a tool by GitHub called Scientist. With Scientist, we were able to run all of the metrics requests that support our dashboard through Druid, issuing the exact same query to both the old system and the new system, and comparing the results. We expected to see a few discrepancies, but we were excited to see that when there were differences, Druid generally produced more accurate results. This gave us the confidence that Druid would provide the functionality we needed, but we still needed to scale it up to support all of our dashboard traffic.
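The comparison pattern described above can be sketched as follows. Scientist itself is a Ruby library, so this Python version is only an illustration of the idea and not its API: `run_experiment`, the `mismatches` list, and the two stand-in query functions are all hypothetical names.

```python
mismatches = []  # stand-in for a real logging or metrics sink

def run_experiment(name, control, candidate, *args):
    """Run the old (control) and new (candidate) code paths with the same
    arguments, record any difference, and always return the control result
    so the caller never sees the experimental path."""
    control_result = control(*args)
    try:
        candidate_result = candidate(*args)
    except Exception as exc:
        # A failing candidate must never affect the response
        mismatches.append((name, args, control_result, repr(exc)))
        return control_result
    if candidate_result != control_result:
        mismatches.append((name, args, control_result, candidate_result))
    return control_result

# Hypothetical stand-ins for the old and new metrics systems
def old_crash_count(app_id):
    return {"app_a": 100, "app_b": 40}[app_id]

def new_crash_count(app_id):
    return {"app_a": 100, "app_b": 42}[app_id]

print(run_experiment("crash-count", old_crash_count, new_crash_count, "app_a"))  # 100
print(run_experiment("crash-count", old_crash_count, new_crash_count, "app_b"))  # 40
print(mismatches)  # [('crash-count', ('app_b',), 40, 42)]
```

Because the control result is always what gets returned, discrepancies can be collected and studied for as long as needed without any customer-visible change.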

To insulate our customers from a potential failure as we tuned it to support all of our traffic, we implemented a library called Trial.  This gave us an automatic fallback to the old system. After running this for a few weeks we were able to gradually scale up and cut over all of our traffic to the new system.
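A fallback with a gradual cutover might look roughly like the sketch below. It does not reproduce the Trial library, whose interface is not described here. The `in_rollout` and `fetch_metrics` functions, the percentage-based rollout, and the CRC32 bucketing are assumptions chosen to illustrate the general approach.

```python
import zlib

def in_rollout(app_id, percent):
    """Deterministically place an app in one of 100 buckets so the same
    app always takes the same path at a given rollout percentage."""
    return zlib.crc32(app_id.encode("utf-8")) % 100 < percent

def fetch_metrics(app_id, percent, new_system, old_system):
    """Try the new system for apps inside the rollout and fall back to
    the old system if the new one fails or the app is outside it."""
    if in_rollout(app_id, percent):
        try:
            return new_system(app_id)
        except Exception:
            pass  # fall through to the old system
    return old_system(app_id)

# Hypothetical stand-ins for the two backends
def druid_backend(app_id):
    raise TimeoutError("new cluster overloaded")

def legacy_backend(app_id):
    return {"app": app_id, "crashes": 7, "source": "legacy"}

# Even at 100% rollout, a failure in the new path is invisible to the caller
print(fetch_metrics("app_a", 100, druid_backend, legacy_backend))
```

Raising `percent` step by step from 0 to 100 moves traffic to the new system while the fallback keeps every request answerable.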

How we use Druid for Crashlytics

On busy days, Crashlytics can receive well over a billion crash reports from mobile devices all over the world. Our crash processing pipeline processes most crashes within seconds, and developers love that they can see those events on their dashboards in very close to real time.

To introduce a minimum of additional processing time, we make extensive use of Druid's real-time ingestion capabilities. Our pipeline publishes every processed crash event to a Kafka cluster that facilitates fanout to a number of other systems in Fabric that consume crash events. We use a Heron topology to stream events to Druid through a library called Tranquility. Part of the Druid cluster called the "indexing service" receives each event and can immediately service queries over that data. This path enables us to serve an accurate, minute by minute picture of events for each app for the last few hours.  
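To give a sense of what a minute-by-minute query over recent data looks like, here is a sketch that builds a Druid native timeseries query as JSON. The query structure follows Druid's documented native query format, but the data source name `crash_events`, the `app_id` dimension, and the `count` metric are hypothetical and do not reflect the actual Crashlytics schema.

```python
import json
from datetime import datetime, timedelta

def crashes_per_minute_query(app_id, end, hours=3):
    """Build a Druid native timeseries query that sums a pre-aggregated
    `count` metric per minute for one app over the last `hours` hours."""
    start = end - timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # timestamps are assumed to be UTC
    return {
        "queryType": "timeseries",
        "dataSource": "crash_events",  # hypothetical data source name
        "granularity": "minute",
        "intervals": [f"{start.strftime(fmt)}/{end.strftime(fmt)}"],
        "filter": {"type": "selector", "dimension": "app_id", "value": app_id},
        "aggregations": [
            {"type": "longSum", "name": "crashes", "fieldName": "count"}
        ],
    }

query = crashes_per_minute_query("app_a", datetime(2017, 3, 30, 12, 0, 0))
print(json.dumps(query, indent=2))
```

A query like this would be sent as the body of an HTTP POST to a Druid broker, which merges results from the real-time and historical parts of the cluster.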

However, calculating metrics over weeks or months of data requires a different approach. To accomplish this, Druid periodically moves data from its indexing service to another part of the cluster made up of "historical" nodes. Historical nodes store immutable chunks of highly compressed, indexed data called "segments" in Druid parlance and are optimized to service and cache queries against them. In our cluster, we move data to the historical nodes every six hours. Druid knows how to combine data from both types of nodes, so a query for a week of data may scan 27 of these segments plus the very latest one currently being built in the indexing service.
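The figure of 27 segments follows from simple arithmetic: a week is 168 hours, which divides into 28 six-hour windows, and the most recent window is still being built. The sketch below works through that calculation. It assumes the query aligns with segment boundaries and that each six-hour window maps to exactly one segment, and the `segment_breakdown` helper is a hypothetical name.

```python
def segment_breakdown(query_days, hours_per_segment=6):
    """Return (historical_segments, in_progress_segments) for a query
    covering the most recent `query_days` days."""
    total_windows = (query_days * 24) // hours_per_segment
    in_progress = 1  # the newest window still lives in the indexing service
    return total_windows - in_progress, in_progress

historical, realtime = segment_breakdown(7)
print(historical, realtime)  # 27 1
```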

The results

Our Druid-based system now allows us to ingest 100% of the events we receive, so we are happy to report that we are no longer sampling crash data from any of our customers. The result is more accurate metrics that you can trust to triage stability issues, no matter how widely installed your app is.

While nothing is more important to us than working to ensure you have the most reliable information possible, we also strive to iterate and improve the Crashlytics experience. In addition to helping us improve accuracy, Druid has unlocked an unprecedented degree of flexibility and richness in what we can show you about the stability issues impacting your users. Since the migration, you may have noticed a steady stream of design tweaks, new features, and performance enhancements on our dashboard. For example, here are a few heavily-requested features that we’ve recently rolled out:  

  • You can now view issues across multiple versions of your app at the same time
  • You can view individual issue metrics for any time range
  • You can now filter your issues by device model and operating system

This is just the beginning. We're looking forward to what else we can build to help developers ship stable apps to their customers.

P.S. We're building a mobile platform to help teams create bold new app experiences. Want to join us? Check out our open positions!
