7 tips for getting the most out of Crashlytics

By Jason St. Pierre, Product Manager

For many years, developers and app teams have relied on Crashlytics to improve their app stability. By now, you’re probably familiar with the main parts of the Crashlytics UI; perhaps you even glance at crash-free users, crash-free sessions, and the issues list multiple times a day (you wouldn’t be the only one!).

In this post, we want to share 7 pro-tips that will help you get even more value out of Crashlytics, which is now part of the new Fabric dashboard, so you can track, prioritize, and solve issues faster.


1. Speed up your troubleshooting by checking out crash insights

In July, we officially released crash insights out of beta. Crash insights helps you understand your crashes better by giving you more context and clarity on why those crashes occurred. When you see a green lightning bolt appear next to an issue in your issues list, click on it to see potential root causes and troubleshooting resources.


2. Mark resolved issues as “closed” to track regressions

Debugging and troubleshooting crashes is time-consuming, hard work. As developers ourselves, we understand the urge to sign off and return to more exciting tasks (like building new app features) as soon as you resolve a pesky issue - but don’t forget to mark this issue as “closed” in Crashlytics! When you formally close out an issue, you get enhanced visibility into that issue’s lifecycle through regression detection. Regression detection alerts you when a previously closed issue reoccurs in a new app version, which is a signal that something else may be awry and you should pay close attention to it.


3. Close and lock issues you want to ignore and declutter your issue list

As a general rule of thumb, you should close issues so you can monitor regressions. However, you can also close and lock issues that you don’t want to be notified about because you’re unlikely to fix or prioritize them. These could be low-impact, obscure bugs or issues that are beyond your control because the problem isn’t in your code. Closing and locking these issues keeps them out of view and declutters your Crashlytics charts. By taking advantage of this “ignore functionality,” you can fine-tune your stability page so only critical information that needs action bubbles up to the top.


4. Use wildcard builds as a shortcut for adding build versions manually

Sometimes, you may have multiple builds of the same version. These build versions start with the same number, but the tail end contains a unique identifier (such as 9.12 (123), 9.12 (124), 9.12 (125), etc.). If you want to see crashes for all of these versions, don’t manually type them into the search bar. Instead, use a wildcard to group similar versions together much faster. You can do this by simply adding a star (an asterisk) at the end of your version prefix (e.g. 9.12*). For example, if you use APK Splits on Android, a wildcard build will quickly show you crashes for the combined set of builds.
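
To make the matching rule concrete, here is a minimal Python sketch of how a trailing asterisk selects builds that share a version prefix. It only illustrates the idea of wildcard matching and is not how the Crashlytics dashboard is implemented; the matching_builds helper and the sample build list are hypothetical.

```python
from fnmatch import fnmatchcase


def matching_builds(builds, pattern):
    """Return the builds whose version string matches a wildcard pattern."""
    return [build for build in builds if fnmatchcase(build, pattern)]


# Hypothetical build list; the 9.12 entries mirror the example in the text.
builds = ["9.11 (120)", "9.12 (123)", "9.12 (124)", "9.12 (125)", "9.13 (130)"]

selected = matching_builds(builds, "9.12*")
print(selected)
```

With the pattern 9.12*, only the three 9.12 builds are selected, which is the grouping the wildcard gives you without typing each build by hand.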


5. Pin your most important builds to keep them front and center

As a developer, you probably deploy a handful of builds each day. As a development team, that number can shoot up to tens or hundreds of builds. The speed and agility with which mobile teams ship is impressive and awesome. But you know what’s not awesome? Wasting time having to comb through your numerous builds to find the one (or two, or three, etc.) that matter the most. That’s why Crashlytics allows you to “pin” key builds so that they appear at the top of your builds list. Pinned builds allow you to find your most important builds faster and keep them front and center, for as long as you need. Plus, this feature makes it easier to collaborate with your teammates on fixing crashes because pinned builds will automatically appear at the top of their builds list too.


6. Pay attention to velocity alerts to stay informed about critical stability issues

Stability issues can pop up anytime - even when you’re away from your workstation. Crashlytics intelligently monitors your builds to check if one issue has caused a statistically significant number of crashes. If so, we’ll send you a velocity alert so you know that you may need to ship a hotfix for your app. Velocity alerts are proactive alerts that appear right in your crash reporting dashboard when an issue suddenly increases in severity or impact. We’ll send you an email too, but you should also install the Fabric mobile app, which will send you a push notification so you can stay in the loop even on the go. Keep an eye out for velocity alerts and you’ll never miss a critical crash, no matter where you are!


7. Use logs, keys, and non-fatals in the right scenarios

The Crashlytics SDK lets you instrument logs, keys, non-fatals, and custom events, which provide additional information and context on why a crash occurred and what happened leading up to it. However, logs, keys, non-fatals, and custom events are designed to track different things so let’s review the right way to use them.

Logs: You should instrument logs to gather important information about user activity before a crash. This could range from user behavior (e.g. user went to download screen, clicked on download button) to details about the user’s action (e.g. image downloaded, image downloaded from). Basically, logs are breadcrumbs that show you what happened prior to a crash. When a crash occurs, we take the contents of the log and attach it to the crash to help you debug faster. Here are instructions for instrumenting logs for iOS, Android, and Unity apps.

Keys: Keys are key-value pairs, which provide a snapshot of information at one point in time. Unlike logs, which record a timeline of activity, keys record only the last known value, which is overwritten each time it changes. Because of this, you should use keys for anything where you only want the last known value. For example, use keys to track the last level a user completed, the last step a user completed in a wizard, what image the user looked at last, and what the last custom settings configuration was. Keys are also helpful in providing a summary or “roll-up” of information. For instance, if your log shows “login, retry, retry, retry” your key would show “retry count: 3.” To set up keys, follow these instructions for iOS, Android, and Unity apps.
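
The difference between logs and keys can be sketched with a toy model in Python. This is not the Crashlytics SDK: the CrashContext class, its method names, and the log size cap are all assumptions made purely to illustrate "timeline" versus "last known value".

```python
from collections import deque


class CrashContext:
    """Toy model of the context attached to a crash report (not the real SDK)."""

    def __init__(self, max_log_lines=64):
        # Logs: an ordered timeline. The size cap is an assumption for this sketch.
        self.logs = deque(maxlen=max_log_lines)
        # Keys: only the most recent value per key is kept.
        self.keys = {}

    def log(self, message):
        self.logs.append(message)

    def set_key(self, key, value):
        self.keys[key] = value


ctx = CrashContext()
retries = 0
ctx.log("login")
for _ in range(3):
    retries += 1
    ctx.log("retry")                     # every attempt stays in the timeline
    ctx.set_key("retry count", retries)  # each write replaces the previous value

print(list(ctx.logs))  # the full breadcrumb trail
print(ctx.keys)        # only the last known value per key
```

The log keeps all four entries in order, while the key holds a single rolled-up value, matching the “login, retry, retry, retry” versus “retry count: 3” example.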

Non-fatals: While Crashlytics captures crashes automatically, you can also record non-fatal events. Non-fatal events mean that your app is experiencing an error, but not actually crashing.

For example, a good scenario to log a non-fatal is if your app has deep links, but fails to navigate to them. A broken link isn’t something that will necessarily crash your app, but it’s something you’d want to track so you can fix the link. A bad scenario to log a non-fatal is if an image fails to load in your app due to a network failure because this isn’t actionable or specific.

You should set up non-fatal events for something you want the stack trace for so you can triage and troubleshoot the issue.

If you simply want to count the number of times something happens (and don’t need the stack trace), we’d recommend checking out custom events.

These 7 tips will help you get the most out of Crashlytics. If you have other pro-tips that have helped you improve your app stability with Crashlytics, tweet them at us! We can’t wait to learn more about how you use Crashlytics.
 

Learn from app makers: How Levi Bostian used Fabric & fastlane to scale

By Todd Burner, Developer Advocate

In this series, we feature customers that have used our platform in an innovative way. For this installment, we chatted with freelancer and indie app developer Levi Bostian, who uses Fabric and fastlane to scale his one-person development shop.

Levi Bostian is a native Android and iOS app developer from Cedar Rapids, Iowa, who spends his days building apps for a rotating set of external clients. As a one-man app development shop, he needs to set-up a streamlined development process so he can focus his attention on client work without being bogged down with tedious tasks. But since Levi is extremely busy running his business, he doesn’t have much time to learn complicated new tools. Let’s learn more about Levi, his work, and his experience streamlining his development process.

Levi's workstation

What types of apps do you work on?

"I am a freelancer building native Android and iOS apps with Node.js APIs for startups. I’m also an indie developer building my own apps like Your Circle, a virtual support group app for cancer patients. In the past, I have built social media, mobile banking, beauty, manufacturing, and music streaming apps as well as apps that connect to Arduinos. I have also worked with Google, Salesforce, Jack Henry and Associates, and a dozen startups on their apps. I love the variety of apps I get to work on."

What challenges have you run into as a freelance developer?

"Because I work with external clients, I distribute a lot of builds for many separate apps. My biggest pain point has been distributing beta apps as I make incremental changes and add testers. Every time I need to beta test a new build, I have to update the provisioning profile, update beta testing devices, build and sign the app and then distribute it - which is a lot of steps. This gets repetitive and is very error prone especially since I don’t have access to the users’ devices for troubleshooting. I need to get everything right the first time."


Finding fastlane

Levi was looking for a way to automate the distribution of these daily beta builds to each of his clients. For a while, he tried to script the signing and distribution steps on his own, but it required too much maintenance to keep up and running. After talking to some friends, he learned about fastlane. fastlane is an open source toolset that automates app deployment. In other words, it does the heavy lifting of streamlining code signing, distribution, and more. While Levi was already using Fabric to distribute beta builds and monitor app stability, he decided to try using fastlane to speed up distribution of his native Android and iOS apps to his external customers.

When did you first start using fastlane?

“I was using Crashlytics Beta, which is part of the Fabric platform, for my Android and iOS beta testing because it makes adding new testers and managing versions very easy. I had been manually uploading builds using the Fabric plugin, but with my business growing, I also needed a way to automate the build, signing, and deployment process. It was getting really hard to manage all of the provisioning profiles and devices registered to each app.

After many headaches with code signing, I decided I was going to get fastlane set up and I did. Whoa! It handles the code signing and building steps and also lets me automate the submission of the beta app after those are done. It saves me tons of time, which I can now use to focus on building apps! I’m so glad I picked it up!”

What was your initial reaction and experience?

“I was blown away by how easy it was to get up and running. The documentation on GitHub is very thorough and you can generate everything you need on the Fabric site. The first time I ran 'fastlane run crashlytics' it distributed my app with a few simple inputs. That’s when I knew I had found something great!”


Growing & scaling with fastlane

Now that Levi has used fastlane in more than 10 shipped projects, he's uncovered new, creative ways to use our toolset to scale his business and make working with external clients easier.

Now that you’ve been using fastlane for 8 months, how has it impacted your development process?

“It’s been saving me so much time. I’ve been able to take on more work and do more side projects. I love it! And in addition to distributing my beta apps, I’ve been gradually adopting other fastlane tools too.

For example, I just started using match. Match saves me the headache of creating and syncing a huge number of provisioning profiles. I really dislike dealing with those manually, and using match in fastlane to manage all your apps, devices, certs, and profiles via a git repo is magic. I want to give fastlane a big kiss for all it does for me there.

I’m also delighted by how I can configure fastlane to be a hands-free build system. I have my iOS and Android Fastfiles set up with parameters so that when I run a fastlane action, it doesn't require any work on my end. It just runs in the background and does all the hard work for me. I can spend my time working on features instead!”

Did anything else about fastlane surprise you?

“The amount of Android support. Initially, my friend told me it was just for iOS but after poking around, I realized I can do so many things with it on Android too! Outside of distributing betas, I’m able to manage Play Store metadata and APK uploads and I even use it to run gradle tasks. I use it for complex tasks, such as building versions of my app via gradle and releasing those builds to Crashlytics, as well as simple tasks, such as running gradle clean. It’s much easier to type "fastlane install" rather than "./gradlew installDevelopmentDebug."”

What resources do you use to get help with fastlane?

“There is a very active GitHub community for support and bugs. I’ve thought about contributing code to the project as well, but fastlane has everything that I need at the moment. When the time comes, I will make sure to do so - and I would be happy to be a part of it. The product is so rock solid and it would be an honor to help out.”


Looking forward

By using fastlane and Fabric together, Levi has been able to successfully scale his business without needing to hire additional help. He’s quickly become a power user of fastlane and wants to explore the many more fastlane tools.

How has your use of Fabric & fastlane evolved over time?

“I started by only using Fabric to distribute beta builds, then added fastlane into the mix. Now, I’ve been gradually adding a bunch of other fastlane tools to my flow. I use match for iOS and the gradle action heavily for Android.

Today, I install fastlane right away on every project that I build (or app that I make updates to). I distribute 5 - 8 beta releases a week for both Android and iOS apps. By using fastlane in conjunction with Crashlytics Beta, I have way fewer headaches because I have the Fastfile fully configured to ask for no input. One command and within 3 - 8 minutes a build is out to my client.

One of the products I mentioned earlier, Your Circle, is a white labeled app. All of its builds share the same code base with nice theming added to each. I have fastlane set up so that I can build and release updates to all of the white labeled apps with 1 command. Plus, I use fastlane to generate all of the separate icons for each white labeled app target. So awesome! All commands, certificates, and device lists are synced via Git with the project. I have no idea what I would have done if I didn’t have fastlane to distribute this app.

Your Circle app

Since adding fastlane into my workflow, my client relationships have greatly improved because I’m able to sync everything I need into my git repo so that it’s contained in the project. This helps me keep track of my code and metadata in the same place, makes it easier to communicate updates to clients, and speeds up releases.”

Are you thinking about trying any other fastlane tools?

“Taking screenshots for the app stores is something I hope to try soon. I plan to hook fastlane up to a CI system so it can automatically take screenshots for me in the background and make it a hands-off process. I’m excited about the potential there.”

What advice would you give to new fastlane users?

“Come into fastlane with an idea of what you want the toolset to do for you. Do not get overwhelmed by the dozens of plugins and actions it provides; come in with a game plan. If that is to sync your provisioning profiles with your team, start there. It's easy to add to your configuration at any time, but just start with one problem to solve and get to it.”
 

Learn from app leaders: How Doodle redesigned their app using Fabric & Firebase

By Todd Burner, Developer Advocate

In this new series, we feature customers that have used our platform in an innovative way. For this installment, we chatted with the app team at Doodle who used the Fabric and Firebase platforms together to redesign their app to be more user-centric. If you want to participate in this series, please email support@fabric.io.

Recently, we sat down with Alexander Thiele who is a senior Android engineer at Doodle, a company that helps you find the best date and time to meet other people. As early adopters of Fabric’s Crashlytics and Firebase Remote Config, his team has expert familiarity with our platforms. The focus of our conversation was on how they redesigned their mobile app using analytics and crash data from their Fabric and Firebase dashboards.


Q. How did you approach the redesign?

“The redesign is a complete overhaul of our app. We started by updating our onboarding flow to help people understand the best ways to use Doodle. We wanted to show users how they can poll each other to quickly find the best meeting time. We divided the redesign into three phases: first improving stability with Fabric’s Crashlytics, then A/B testing our poll creation feature with Firebase Remote Config, and finally measuring the results of our tests and production rollout by monitoring our app metrics in Fabric and Firebase analytics.”


Phase 1: Finding tricky crashes

The team at Doodle wanted to understand how stability was impacting their app quality. That’s why they first focused on getting their crash-free user rate as close to 100% as possible.

Fabric’s Crashlytics helped them track crashes and prioritize them so they could improve their crash-free user rate. One feature they found particularly useful is adding logs and keys to crash reports.

Q. How did Fabric’s Crashlytics help you improve your app stability? 

“Crashlytics saved us a ton of time by surfacing crashes and helping us pinpoint their cause. I remember one really rare crash which we couldn't find the source of ourselves. We also don't have many crashes so we were really eager to find it. We then started to log everything that could be related to this crash, like page visits and the current internal database size. We also recorded those instances when our database couldn’t find something. After a few releases with custom logs, we found the bug. It happened in a really rare case where the user went to specific screens and used some specific features. Without custom logs, we wouldn’t have been able to find this bug.”

The team at Doodle also logs all crashes that they manually catch in the code to Crashlytics as non-fatals. This gives them more insight about what's going on in the app. By taking advantage of these unique Crashlytics features, Doodle has been able to move faster and add new features into production with less anxiety.


Phase 2: User-centric design

 The second phase of the app redesign was focused on updating the user experience and design. The goal of this phase was to refresh the look and feel of the app and introduce streamlined flows so users could accomplish their tasks faster (and with fewer screens/steps).

Q. What types of UI changes did you make in the redesign?

“The changes we made during this stage included everything from changing the color palette to introducing new screens and adding new app functionality. By monitoring the 7 day retention metrics in Fabric and Firebase, we saw that some new users didn’t understand the concept of Doodle immediately - so they didn’t return to our app. That’s why we changed the whole onboarding process to make Doodle easier to understand and use from the first time it’s installed.”

Q. How did you test your changes?

“We used Firebase Remote Config. We tested our user onboarding and the flow users go through when creating a poll. We tried 4 different kinds of flows, which we tested using Remote Config. In the end, the data showed that one flow resulted in more polls being created than the others. Our key performance indicator for the A/B test was the number of polls created by users, and we tracked this KPI with Google Analytics for Firebase.”

Q. Did you use Remote Config for other things? 

“We also used Remote Config to test feature switches. For example, a few months ago, we implemented banner ads on our scheduling screen and enabled them through Remote Config. We noticed that these ads didn’t perform well so we turned them off easily with Remote Config. Then, we tried inserting native ads into a few other places in our app. Through Remote Config, we were able to discover the right placement for ads in our app without disrupting our users or requiring them to update their app to see the changes.”

By tracking crashes and non-fatals with Fabric and deploying changes with Firebase Remote Config, the team at Doodle didn’t have to depend on the app store release cycles to understand their users and update their app accordingly. They could see user behavior change in real-time and make appropriate changes to their app before problems arose.


Phase 3: Measuring and going forward with Firebase and Fabric

The team at Doodle plans to keep using both the Fabric and Firebase platforms to monitor and improve their app - and display their metrics throughout every stage! 

Q. Now that the redesign is live, what dashboards do you find yourself using the most?

 “Our most important metrics are how many polls a user creates and how many people participate in a poll. We monitor these in the Fabric events dashboard and in Google Analytics for Firebase by logging events.

I’m also a big fan of the new TV Mode for Fabric. We have a big conference room and we put up our dashboard on the TV during launches so the whole team can see how we’re doing. The new Crashlytics dashboard looks nice too, especially device and OS filtering. We keep an eye on most of the dashboards daily.”

Q. What Fabric and Firebase features do you plan to adopt next?

“Over the next few weeks, we have plans to adopt Firebase Dynamic Links and to set up more in-depth Fabric custom events. By using Dynamic Links, we’ll be able to make it even easier to share polls. For example, our users will be able to invite other people to participate in polls via SMS and deep link right to the relevant app screen (even if the people they are inviting to the poll haven’t installed the app yet). We’ll track more events, like content views, to understand where our users find value in our app.”

Q. What advice do you have for other app teams who are considering redesigning their app?

“Two things: test ideas constantly and put your app users first. By combining Crashlytics’ real-time crash reporting with the ability to deploy remote changes to a subset of users through Firebase Remote Config, you can learn how valuable a new feature is, identify potential issues, and take action immediately.”

Q. Can you share some results of the redesign?

“This redesign greatly improved our in-app poll creation process so users could create polls faster and more easily. We measured the success of this redesign by looking at our daily active users (DAUs) in Fabric and our retention numbers in Firebase/Fabric, which have risen beyond our expectations!”

How to monitor your app retention in Fabric

By Shobhit Chugh, Product Manager

A common misconception in the mobile world is that the number of app downloads is the strongest indicator of success. But what if you have a ton of users, yet they rarely interact with your app? What if people download your app and then churn the next day? Looking at your total number of app users or installs in isolation doesn’t paint an accurate picture of your app’s health - you also need to pay attention to retention. Retention helps you understand how often people return to your app. It’s important to measure retention because if your hard-earned users aren’t sticking around and regularly engaging with your app, you cannot build a sustainable mobile business.

In this blog post, we’ll show you how to track your retention over time through Fabric’s new retention page (which is part of our new dashboard).

 

Measuring retention from three angles

To give you a holistic view of how strong your app retention is, we focus on three things: active users, activity segments, and new user retention.

Let’s review how each angle helps you better understand retention.

1. Fluctuations in active users

The first sign of how well you’re retaining users is the number of active users you have on a daily, weekly, and monthly basis - and if these numbers are trending up or down over time.

When you navigate to the retention page, you’ll see these metrics in the top two graphs:

  • Daily active users (DAUs - how many people have had at least one session with your app today)

  • Weekly active users (WAUs - how many people have had at least one session with your app in the last seven days)

  • Monthly active users (MAUs - how many people have had at least one session with your app in the last 30 days)

The pulsating DAUs graph gives you real-time insight into how many people have used your app so far today, compared to this time last week. The second graph provides an additional lens by highlighting changes in weekly and monthly active users.

Steady, consistent growth in active users is a good signal that your retention is strong.
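
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration that assumes sessions are available as (user, date) pairs and that each window includes today; the active_users helper and the sample data are hypothetical, and the sketch does not describe how Fabric computes these numbers internally.

```python
from datetime import date, timedelta


def active_users(sessions, today, window_days):
    """Count distinct users with at least one session in the last
    `window_days` days, counting today as the first day of the window."""
    start = today - timedelta(days=window_days - 1)
    return len({user for user, day in sessions if start <= day <= today})


today = date(2017, 9, 30)
# Hypothetical session records: (user id, day of the session).
sessions = [
    ("ana", today),
    ("ana", today - timedelta(days=1)),
    ("ben", today - timedelta(days=3)),
    ("cy", today - timedelta(days=20)),
    ("dee", today - timedelta(days=45)),
]

dau = active_users(sessions, today, 1)
wau = active_users(sessions, today, 7)
mau = active_users(sessions, today, 30)
print(dau, wau, mau)
```

A user is counted once per window no matter how many sessions they had, which is why DAU can never exceed WAU and WAU can never exceed MAU.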

 

2. Changes in activity segments

The middle section of the retention page is centered around activity segments. Based on session data, activity segments group users into buckets, ranging from inactive users (people who have not launched your app in more than a week) all the way to high activity users (people who have used your app almost every single day in the past seven days).

Activity segments provide a deeper look at your retention by revealing how engaged your active users are, how many are at risk of abandoning your app, and how people flow from one segment to another.

This graph can tell you a few interesting things about retention. First off, look at how people are transitioning between states. For example, if you see a large and healthy flow of users moving from “low activity” to “medium or high activity”, their engagement level is changing in a positive way - meaning that retention is improving.

Another interesting thing to monitor is the correlation between the number of new users and the growth in each segment. For instance, if you’re earning thousands of new users every week, but you’re only seeing a corresponding bump in the low activity segment - this means your new users are not deeply committed to your app. In this case, consider improving your onboarding flow to showcase the value of your app to new users.
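
As a rough illustration of how users flow between segments, the Python sketch below buckets users by how many of the past seven days they were active and then counts week-over-week transitions. The cut-offs for low, medium, and high activity are assumptions chosen for the example (the text only describes the two extremes), and the segment and transitions helpers are hypothetical.

```python
from collections import Counter


def segment(active_days_last_7):
    """Map a count of active days in the past week to a segment.
    The low/medium/high cut-offs are assumptions for illustration."""
    if active_days_last_7 == 0:
        return "inactive"
    if active_days_last_7 <= 2:
        return "low"
    if active_days_last_7 <= 5:
        return "medium"
    return "high"


def transitions(last_week, this_week):
    """Count how many users moved from each segment to each segment."""
    moves = Counter()
    for user, days_now in this_week.items():
        before = segment(last_week.get(user, 0))
        moves[(before, segment(days_now))] += 1
    return moves


# Hypothetical data: user id -> number of active days in that week.
last_week = {"ana": 1, "ben": 2, "cy": 6, "dee": 0}
this_week = {"ana": 4, "ben": 7, "cy": 6, "dee": 0}

moves = transitions(last_week, this_week)
print(moves)
```

In this sample, two users move from low activity into medium or high activity, which is the kind of positive flow described above.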

Pro Tip: Move the slider at the bottom of this graph to see your activity segments at different times during the last 30 days. You can also use this slider to compare how active your users are during the weekday versus the weekend.

 

3. New user retention rate

Finally, the last graph on the retention page shows you what percentage of new users are continuing to interact with your app after one day, seven days, and thirty days. This graph helps you see whether or not new users are still active after their first session at key time intervals. For instance, the day one metric means that X% of people who installed and used your app for the first time yesterday, also used it today.

The higher these percentages are, the stronger your app retention is because it means that a large number of new users are turning into loyal, habitual users. If we notice any irregularities (e.g. an unusual increase or decrease in your new user retention), we’ll flag it so you can dig into what happened on that day.
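
The day one metric can be worked through with a small Python sketch. It assumes you know each user's first-use date and the set of days they were active; the retention_rate helper and the sample cohort are hypothetical and are only meant to show the arithmetic behind the percentage.

```python
from datetime import date, timedelta


def retention_rate(first_seen, activity, cohort_day, offset_days):
    """Share of users who first used the app on `cohort_day` and were
    active again exactly `offset_days` later. Returns None for an empty cohort."""
    cohort = [user for user, day in first_seen.items() if day == cohort_day]
    if not cohort:
        return None
    target = cohort_day + timedelta(days=offset_days)
    returned = sum(1 for user in cohort if target in activity.get(user, set()))
    return returned / len(cohort)


yesterday = date(2017, 9, 29)
today = yesterday + timedelta(days=1)

# Hypothetical cohort: four users who all started yesterday.
first_seen = {"ana": yesterday, "ben": yesterday, "cy": yesterday, "dee": yesterday}
activity = {
    "ana": {yesterday, today},
    "ben": {yesterday},
    "cy": {yesterday, today},
    "dee": {yesterday},
}

day1 = retention_rate(first_seen, activity, yesterday, 1)
print(f"Day 1 retention: {day1:.0%}")
```

Two of the four new users came back the next day, so day one retention for this cohort is one half. The same function gives day seven and day thirty retention by changing offset_days.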

 

From understanding retention to improving it

Fabric’s new retention page helps you measure retention from three different angles: active users (how many people are using my app?), activity segments (how engaged are my users?), and new user retention rate (how often do new users come back to my app?).

Armed with this insight, you’ll develop a baseline understanding of your retention, be able to recognize when it becomes a problem, and act quickly to combat churn. 

If you’re already a Fabric customer, click here to check out your retention page.

If you’re not currently a Fabric customer, get started by signing up and installing Crashlytics.

Migrating to Druid: how we improved the accuracy of our stability metrics

By Max Lord, Software Engineer

Stability metrics are one of the most critical parts of Crashlytics because they show you which issues are having the biggest impact on your apps. We know that you rely on this data to prioritize your time and make key decisions about what to fix, so our job is to ensure these metrics are as accurate as possible.  

In an effort to strengthen the reliability of these numbers, we spent the last few months overhauling the system that gathers and calculates the stability metrics that power Crashlytics. Now, all of our stability metrics are being served out of a system built on Druid. Since the migration has ended, we wanted to step back, reflect on how it went, and share some lessons and learnings with the rest of the engineering community.

Why migrate?

In the very early days of Crashlytics, we simply wrote every crash report we received to a Mongo database. Once we were processing thousands of crashes per second, that database couldn't keep up. We developed a bespoke system based on Apache Storm and Cassandra that served everyone well for the next few years. This system pre-computed all of the metrics that it would ever need to serve, which meant that end-user requests were always very fast. However, its primary disadvantage was that it was cumbersome for us to develop new features, such as new filtering dimensions. Additionally, we occasionally used sampling and estimation techniques to handle the flood of events from our larger customers, but these estimation techniques didn't always work perfectly for everyone.

We wanted to improve the accuracy of metrics for all of our customers, and introduce a richer set of features on our dashboard. However, we were approaching the limits of what we could build with our current architecture. Any solution we invented would be restricted to pre-computing metrics and subject to sampling and estimation. This was our cue to explore other options.

Discovering Druid

We learned that the analytics start-up MetaMarkets had found themselves in a similar position and the solution that they open-sourced, Druid, looked like a good fit for us as well. Druid belongs to the column-store family of OLAP databases, purpose-built to efficiently aggregate metrics from a large number of data points. Unlike most other analytics-oriented databases, Druid is optimized for very low latency queries. This characteristic makes it ideally suited for serving data to an exploratory, customer-facing dashboard.

We were doubtful that any column store could compete with the speed of serving pre-computed metrics from Cassandra, but our experimentation demonstrated that Druid's performance is phenomenal. After spending a bit of time tweaking our schema and cluster configuration, we were easily able to achieve latencies comparable to (and sometimes even better than!) our prior system. We were satisfied that this technology would unlock an immense amount of flexibility and scale, so our next challenge was to swap it in without destabilizing the dashboard for our existing customers.

Migrating safely

As with all major migrations, we had to come up with a plan to keep the firehose of crash reports running while still serving up all of our existing dashboard requests. We didn’t want errors or discrepancies to impact our customers so we enlisted a tool by GitHub called Scientist. With Scientist, we were able to run all of the metrics requests that support our dashboard through Druid, issuing the exact same query to both the old system and the new system, and comparing the results. We expected to see a few discrepancies, but we were excited to see that when there were differences, Druid generally produced more accurate results. This gave us the confidence that Druid would provide the functionality we needed, but we still needed to scale it up to support all of our dashboard traffic.
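
The comparison pattern described above can be sketched in Python as follows. This is not the Scientist library (which is written in Ruby) and not our production code; the run_experiment helper, the two query functions, and the numbers they return are hypothetical stand-ins used only to show the idea of serving the old result while recording disagreements.

```python
def run_experiment(name, control, candidate, mismatches):
    """Run both code paths, always return the control result,
    and record any disagreement for later review."""
    expected = control()
    try:
        observed = candidate()
    except Exception as exc:  # a failing candidate must never affect users
        mismatches.append((name, expected, repr(exc)))
        return expected
    if observed != expected:
        mismatches.append((name, expected, observed))
    return expected


# Hypothetical stand-ins for the old and new metrics back ends.
def old_crash_count():
    return 980   # made-up value, e.g. an estimate from sampled data


def new_crash_count():
    return 1000  # made-up value, e.g. an exact count


mismatches = []
served = run_experiment("crash_count", old_crash_count, new_crash_count, mismatches)
print(served, mismatches)
```

The dashboard keeps serving the old system's answer, while the list of mismatches gives engineers a record of every query where the two systems disagreed.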

To insulate our customers from a potential failure as we tuned Druid to support all of our traffic, we implemented a library called Trial. This gave us an automatic fallback to the old system. After running this for a few weeks, we were able to gradually scale up and cut over all of our traffic to the new system.
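
The fallback-plus-gradual-cutover idea can be sketched in a few lines of Python. This is not the Trial library itself, whose interface is not shown in this post; query_with_fallback, rollout_fraction, and rng are hypothetical names chosen for the example.

```python
import random
from typing import Any, Callable


def query_with_fallback(
    new_system: Callable[[], Any],
    old_system: Callable[[], Any],
    rollout_fraction: float,
    rng: Callable[[], float] = random.random,
) -> Any:
    """Send a fraction of requests to the new system, falling back on failure.

    rollout_fraction is the share of traffic (0.0 to 1.0) routed to the new
    system. Any exception from the new system triggers the old one instead.
    """
    if rng() < rollout_fraction:
        try:
            return new_system()
        except Exception:
            # Automatic fallback: the caller never sees the failure.
            return old_system()
    return old_system()


def unavailable() -> str:
    raise TimeoutError("new system unavailable")


# Usage: even at 100% rollout, a failing new system is covered by the old one.
result = query_with_fallback(unavailable, lambda: "old result", 1.0)
```

Raising rollout_fraction step by step from 0.0 to 1.0 corresponds to scaling up and cutting over traffic, with the old system acting as a safety net the whole time.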

How we use Druid for Crashlytics

On busy days, Crashlytics can receive well over a billion crash reports from mobile devices all over the world. Our crash processing pipeline processes most crashes within seconds, and developers love that they can see those events on their dashboards in very close to real time.

To introduce a minimum of additional processing time, we make extensive use of Druid's real-time ingestion capabilities. Our pipeline publishes every processed crash event to a Kafka cluster that facilitates fanout to a number of other systems in Fabric that consume crash events. We use a Heron topology to stream events to Druid through a library called Tranquility. Part of the Druid cluster called the "indexing service" receives each event and can immediately service queries over that data. This path enables us to serve an accurate, minute-by-minute picture of events for each app for the last few hours.
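
To make the "minute-by-minute picture" concrete, here is a sketch of what a per-app query could look like in Druid's native JSON query format, built as a Python dictionary. The data source name (crash_events), the dimension (app_id), and the metric (count) are assumptions for illustration; our actual schema is not shown in this post.

```python
import json
from datetime import datetime, timedelta, timezone

ISO = "%Y-%m-%dT%H:%M:%SZ"


def minute_timeseries_query(app_id: str, end: datetime, hours: int = 3) -> dict:
    """Build a Druid native timeseries query bucketed by minute.

    The dataSource, dimension, and metric names below are hypothetical.
    """
    start = end - timedelta(hours=hours)
    return {
        "queryType": "timeseries",
        "dataSource": "crash_events",  # hypothetical data source
        "granularity": "minute",
        "intervals": [f"{start.strftime(ISO)}/{end.strftime(ISO)}"],
        "filter": {"type": "selector", "dimension": "app_id", "value": app_id},
        "aggregations": [
            # Sums a pre-aggregated event count; assumes a "count" metric exists.
            {"type": "longSum", "name": "crashes", "fieldName": "count"}
        ],
    }


# Usage: serialize the query; it would be POSTed to a Druid broker.
query = minute_timeseries_query(
    "example-app", datetime(2017, 8, 1, 12, 0, tzinfo=timezone.utc)
)
payload = json.dumps(query)
```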

However, calculating metrics over weeks or months of data requires a different approach. To accomplish this, Druid periodically moves data from its indexing service to another part of the cluster made up of "historical" nodes. Historical nodes store immutable chunks of highly compressed, indexed data called "segments" in Druid parlance and are optimized to service and cache queries against them. In our cluster, we move data to the historical nodes every six hours. Druid knows how to combine data from both types of nodes, so a query for a week of data may scan 27 of these segments plus the very latest one currently being built in the indexing service.
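
The segment count above follows from simple arithmetic: a week split into six-hour chunks is 28 chunks, of which the newest is still being built. A small Python sketch, assuming one segment per six-hour chunk and a query window that ends now (segments_scanned is a hypothetical helper, not a Druid function):

```python
from datetime import timedelta
from typing import Tuple


def segments_scanned(window: timedelta, segment_length: timedelta) -> Tuple[int, int]:
    """Return (historical_segments, in_progress_segments) for a query window.

    Assumes the window ends now, so the newest chunk is still in the
    indexing service and every older chunk lives on historical nodes.
    """
    # Ceiling division: a partially covered chunk still has to be scanned.
    total = -(-window // segment_length)
    return total - 1, 1


historical, in_progress = segments_scanned(timedelta(days=7), timedelta(hours=6))
print(historical, in_progress)  # prints "27 1"
```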

The results

Our Druid-based system now allows us to ingest 100% of the events we receive, so we are happy to report that we are no longer sampling crash data from any of our customers. The result is more accurate metrics that you can trust to triage stability issues, no matter how widely installed your app is.

While nothing is more important to us than working to ensure you have the most reliable information possible, we also strive to iterate and improve the Crashlytics experience. In addition to helping us improve accuracy, Druid has unlocked an unprecedented degree of flexibility and richness in what we can show you about the stability issues impacting your users. Since the migration, you may have noticed a steady stream of design tweaks, new features, and performance enhancements on our dashboard. For example, here are a few heavily requested features that we’ve recently rolled out:

  • You can now view issues across multiple versions of your app at the same time
  • You can view individual issue metrics for any time range
  • You can now filter your issues by device model and operating system

This is just the beginning. We're looking forward to what else we can build to help developers ship stable apps to their customers.

P.S. We're building a mobile platform to help teams create bold new app experiences. Want to join us? Check out our open positions!
