Migrating to Druid: how we improved the accuracy of our stability metrics

by Max Lord, Software Engineer

Stability metrics are one of the most critical parts of Crashlytics because they show you which issues are having the biggest impact on your apps. We know that you rely on this data to prioritize your time and make key decisions about what to fix, so our job is to ensure these metrics are as accurate as possible.  

In an effort to strengthen the reliability of these numbers, we spent the last few months overhauling the system that gathers and calculates the stability metrics that power Crashlytics. Now, all of our stability metrics are being served out of a system built on Druid. Now that the migration is complete, we want to step back, reflect on how it went, and share some lessons with the rest of the engineering community.

Why migrate?

In the very early days of Crashlytics, we simply wrote every crash report we received to a Mongo database. Once we were processing thousands of crashes per second, that database couldn't keep up. We developed a bespoke system based on Apache Storm and Cassandra that served everyone well for the next few years. This system pre-computed all of the metrics that it would ever need to serve, which meant that end-user requests were always very fast. However, its primary disadvantage was that it was cumbersome for us to develop new features, such as new filtering dimensions. Additionally, we occasionally used sampling and estimation techniques to handle the flood of events from our larger customers, but these estimation techniques didn't always work perfectly for everyone.

We wanted to improve the accuracy of metrics for all of our customers, and introduce a richer set of features on our dashboard.  However, we were approaching the limits of what we could build with our current architecture.  Any solution we invented would be restricted to pre-computing metrics and subject to sampling and estimation. This was our cue to explore other options.
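The trade-off between pre-computing metrics and aggregating at query time can be illustrated with a toy example. This is only a sketch: the event fields, the `count_where` helper, and the sample data are made up for illustration and do not reflect the actual Crashlytics schema.

```ruby
# A handful of made-up crash events.
events = [
  { app: "app-1", version: "1.0", device: "iPhone 6", os: "10.1" },
  { app: "app-1", version: "1.0", device: "iPhone 7", os: "10.2" },
  { app: "app-1", version: "1.1", device: "iPhone 7", os: "10.2" }
]

# Pre-computed approach: counters are keyed by dimensions chosen up front.
# Lookups are fast, but a question about a dimension that is not part of
# the key (for example, device model) cannot be answered.
precomputed = Hash.new(0)
events.each { |e| precomputed[[e[:app], e[:version]]] += 1 }

# Query-time approach: keep the events and aggregate by whatever
# combination of dimensions the request asks for.
def count_where(events, filters)
  events.count { |e| filters.all? { |dimension, value| e[dimension] == value } }
end

puts precomputed[["app-1", "1.0"]]
puts count_where(events, { device: "iPhone 7", os: "10.2" })
```

Adding a new filter in the first approach means changing the counter keys and recomputing history; in the second it is just another entry in the `filters` hash.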

Discovering Druid

We learned that the analytics start-up Metamarkets had found itself in a similar position, and the solution that it open-sourced, Druid, looked like a good fit for us as well. Druid belongs to the column-store family of OLAP databases, purpose-built to efficiently aggregate metrics from a large number of data points. Unlike most other analytics-oriented databases, Druid is optimized for very low latency queries. This characteristic makes it ideally suited for serving data to an exploratory, customer-facing dashboard.

We were doubtful that any column store could compete with the speed of serving pre-computed metrics from Cassandra, but our experimentation demonstrated that Druid's performance is phenomenal. After spending a bit of time tweaking our schema and cluster configuration, we were easily able to achieve latencies comparable to (and sometimes even better than!) our prior system.  We were satisfied that this technology would unlock an immense amount of flexibility and scale, so our next challenge was to swap it in without destabilizing the dashboard for our existing customers.

Migrating safely

As with all major migrations, we had to come up with a plan to keep the firehose of crash reports running while still serving up all of our existing dashboard requests. We didn’t want errors or discrepancies to impact our customers, so we enlisted a tool by GitHub called Scientist. With Scientist, we were able to run all of the metrics requests that support our dashboard through Druid, issuing the exact same query to both the old system and the new system, and comparing the results. We expected to see a few discrepancies, but we were excited to see that when there were differences, Druid generally produced more accurate results. This gave us the confidence that Druid would provide the functionality we needed, but we still needed to scale it up to support all of our dashboard traffic.
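The comparison pattern described above can be sketched in a few lines of plain Ruby. This is not Scientist's actual API; `compare_systems`, `legacy_metrics`, and `druid_metrics` are hypothetical names, and the numbers are made up. The important property is that the caller always receives the old system's answer, while mismatches and failures in the new system are only recorded.

```ruby
# Run the same request against the old and new systems, return the old
# system's answer, and record any mismatch for later analysis.
def compare_systems(name, control:, candidate:, mismatches:)
  control_result = control.call

  begin
    candidate_result = candidate.call
    if candidate_result != control_result
      mismatches << { experiment: name, control: control_result, candidate: candidate_result }
    end
  rescue StandardError => e
    # A failure in the new system must never affect what the customer sees.
    mismatches << { experiment: name, control: control_result, error: e.message }
  end

  control_result
end

# Hypothetical stand-ins for the two metrics backends.
legacy_metrics = ->(app_id) { { app_id: app_id, crashes: 980 } }
druid_metrics  = ->(app_id) { { app_id: app_id, crashes: 1000 } }

mismatches = []
served = compare_systems(
  "crash-count",
  control:    -> { legacy_metrics.call("app-1") },
  candidate:  -> { druid_metrics.call("app-1") },
  mismatches: mismatches
)

puts served.inspect
puts mismatches.length
```

Reviewing the recorded mismatches offline is what makes it possible to judge which system is closer to the truth before any traffic is switched.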

To insulate our customers from a potential failure as we tuned it to support all of our traffic, we implemented a library called Trial.  This gave us an automatic fallback to the old system. After running this for a few weeks we were able to gradually scale up and cut over all of our traffic to the new system.
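An automatic fallback of this kind might look roughly like the following. This is not the Trial library's interface, which the post does not describe; `with_fallback`, the one-second budget, and the two lambdas are assumptions made for illustration.

```ruby
require "timeout"

# Try the new system first. If it raises or exceeds the time budget,
# serve the request from the old system instead.
def with_fallback(primary:, fallback:, timeout_seconds: 1.0)
  Timeout.timeout(timeout_seconds) { primary.call }
rescue StandardError, Timeout::Error => e
  warn "primary failed (#{e.class}: #{e.message}); serving from fallback"
  fallback.call
end

# Hypothetical backends: the new system is failing, the old one is healthy.
failing_druid = -> { raise IOError, "broker unavailable" }
legacy        = -> { { crashes: 980 } }

result = with_fallback(primary: failing_druid, fallback: legacy)
puts result.inspect
```

A production version would more likely rely on timeouts in the database client itself rather than `Timeout.timeout`, but the control flow is the same.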

How we use Druid for Crashlytics

On busy days, Crashlytics can receive well over a billion crash reports from mobile devices all over the world. Our crash processing pipeline processes most crashes within seconds, and developers love that they can see those events on their dashboards in very close to real time.

To introduce a minimum of additional processing time, we make extensive use of Druid's real-time ingestion capabilities. Our pipeline publishes every processed crash event to a Kafka cluster that facilitates fanout to a number of other systems in Fabric that consume crash events. We use a Heron topology to stream events to Druid through a library called Tranquility. Part of the Druid cluster called the "indexing service" receives each event and can immediately service queries over that data. This path enables us to serve an accurate, minute-by-minute picture of events for each app for the last few hours.
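For a sense of what a minute-by-minute request could look like, here is a rough sketch of a Druid native timeseries query, built as a Ruby hash and serialized to JSON. The data source name `crash_events`, the dimension `app_id`, and the metric `count` are assumptions; the post does not describe the actual schema, and the timestamps are arbitrary.

```ruby
require "json"
require "time"

# Build a timeseries query that sums crash counts per minute for one app.
def minute_by_minute_query(app_id:, from:, to:)
  {
    queryType:   "timeseries",
    dataSource:  "crash_events",                       # hypothetical data source
    granularity: "minute",
    intervals:   ["#{from.utc.iso8601}/#{to.utc.iso8601}"],
    filter:      { type: "selector", dimension: "app_id", value: app_id },
    aggregations: [
      { type: "longSum", name: "crashes", fieldName: "count" }
    ]
  }
end

to    = Time.utc(2017, 1, 1, 12, 0, 0)
from  = to - 3 * 60 * 60 # the three hours before `to`
query = minute_by_minute_query(app_id: "app-1", from: from, to: to)

puts JSON.pretty_generate(query)
```

A query like this would typically be sent to a Druid broker, which merges results from the real-time and historical parts of the cluster.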

However, calculating metrics over weeks or months of data requires a different approach. To accomplish this, Druid periodically moves data from its indexing service to another part of the cluster made up of "historical" nodes. Historical nodes store immutable chunks of highly compressed, indexed data called "segments" in Druid parlance and are optimized to service and cache queries against them. In our cluster, we move data to the historical nodes every six hours. Druid knows how to combine data from both types of nodes, so a query for a week of data may scan 27 of these segments plus the very latest one currently being built in the indexing service.
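The segment count follows directly from the six-hour handoff interval. As a quick check, assuming exactly one segment per six-hour window (real clusters may split a window into several shards):

```ruby
hours_per_segment = 6
hours_per_week    = 7 * 24

# Number of six-hour windows needed to cover one week.
total_segments = hours_per_week / hours_per_segment

# The most recent window is still being built by the indexing service;
# the rest have already been handed off to historical nodes.
in_progress_segments = 1
historical_segments  = total_segments - in_progress_segments

puts "total: #{total_segments}, historical: #{historical_segments}, in progress: #{in_progress_segments}"
```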

The results

Our Druid-based system now allows us to ingest 100% of the events we receive, so we are happy to report that we are no longer sampling crash data from any of our customers. The result is more accurate metrics that you can trust to triage stability issues, no matter how widely installed your app is.

While nothing is more important to us than working to ensure you have the most reliable information possible, we also strive to iterate and improve the Crashlytics experience. In addition to helping us improve accuracy, Druid has unlocked an unprecedented degree of flexibility and richness in what we can show you about the stability issues impacting your users. Since the migration, you may have noticed a steady stream of design tweaks, new features, and performance enhancements on our dashboard. For example, here are a few heavily-requested features that we’ve recently rolled out:  

  • You can now view issues across multiple versions of your app at the same time
  • You can view individual issue metrics for any time range
  • You can now filter your issues by device model and operating system

This is just the beginning. We're looking forward to what else we can build to help developers ship stable apps to their customers.

P.S. We're building a mobile platform to help teams create bold new app experiences. Want to join us? Check out our open positions!

Get Crashlytics

Filter crashes by activity segments: improve stability for your most valuable users

by Jason St. Pierre, Product Manager

Here’s the conundrum mobile app teams face: all apps crash, but it’s impossible to fix every single stability issue. In a world with limited time and resources, the million-dollar question is: which crashes should you tackle first? Our mission is to help you identify and prioritize issues that have the biggest impact on your app quality.

Since launch, Crashlytics has given you visibility into the downstream effects crashes have on your business by highlighting their severity and prevalence. Today, we’re unveiling an additional lens to show you how stability varies among different user activity segments.

Want to know which crashes are preventing new users from turning into loyal, engaged ones? Want to resolve crashes that are blocking active users from completing key in-app actions? The combined power of Crashlytics and activity segments will show you which crashes are affecting your most valuable users.

Deliver a fantastic new user experience

New users have low tolerance for buggy apps – and a glitchy session can deter them from ever coming back. In fact, 1 out of 4 mobile apps is abandoned after just one use, and one of the top reasons for abandonment is technical flaws. Crashlytics will help you make a great first impression by isolating stability issues impacting new users so you can fix them, fast. To see these issues, just select “New Users” on the User Activity tab on your dashboard.

Keep active users happy and engaged

You can also use this filter to improve stability for people who interact with your app on a daily basis. By clicking on “Active Users” or “Highly Active Users,” you’ll be able to see which crashes are affecting your most engaged users. These people love your app and keep coming back to it, so reward them with a stellar, stable experience!

Make smarter stability decisions

Crashes by activity segments combines the power of crash reporting with user engagement to surface issues affecting your most valuable users. Now, you’ll be able to see if certain issues are especially prevalent in a specific segment and better prioritize your time.

To see it in action, just hop over to the Crashlytics dashboard and give it a try. We can’t wait to hear what you think!


Supercharge Beta by Crashlytics with fastlane: automate app testing to save more time

By Hemal Shah, Product Manager

Feedback that comes early and often is the best way to improve your app experience. Your users are your greatest source of learning (what can I do better?) and inspiration (what should I build next?), which is why it’s such a shame that beta testing is a tedious, tiring, and hair-pulling process.

We built Beta by Crashlytics to fix this by making it easy and straightforward to distribute beta builds to your users. Simplified beta distribution is awesome, but imagine how much more time you’d save if you could also automate it in seconds. Now, all iOS apps using Beta by Crashlytics can harness the power of fastlane to automate beta deployment with a simple, guided setup process!

Less tedious work, more time to innovate

A successful beta release involves a lot of steps – you have to bump version numbers, compile the build, append the release notes, distribute it to your testers, alert your team that a build went out, and the list goes on and on. And if that wasn’t enough, sometimes only one person knows how to distribute a beta build, which means you’re tied to their schedule and there is a bottleneck for delivering value to your app users.

Beta by Crashlytics takes the pain out of this process and makes it intuitive and efficient. By installing fastlane, an award-winning mobile deployment toolset, you can automate the tedious tasks that slow down your app development. Say “goodbye” to manual work and spend more time innovating, while still getting a healthy stream of user feedback.

Beta distribution takes only seconds using fastlane:

[Image: Beta distribution with fastlane]

Instant access to all the power and benefits of fastlane

As an added bonus, with fastlane, you’ll also unlock new tools to help you customize your build process and effortlessly release your app to the App Store. Plus, fastlane is fully extensible, so you can create and consume actions that automate all aspects of your mobile development, whether it’s working with Git or posting to Slack. There are 170 built-in actions and 50 third-party plugins to choose from!

Here’s an example of fastlane’s extensibility: We noticed that developers are increasingly forced to use multiple beta distribution services to address different needs. With fastlane, you can use the same process to automate beta deployment to Beta by Crashlytics and stage your build on iTunes Connect. fastlane integrates with the services you already use and love, so everything will still work exactly as you designed it – it’ll just be light years faster and you’ll have more flexibility and control.


lane :beta do
  # Bump the build number before building
  increment_build_number
  gym                  # Build your app
  testflight           # Upload to TestFlight
end

lane :appstore do
  snapshot             # Generate screenshots for the App Store
  gym                  # Build your app
  deliver              # Upload the screenshots and the binary to iTunes
  slack                # Let your team-mates know the new version is live
end

The fastest set-up process, ever

We know that one of the biggest hesitations to adopting a new tool, no matter how incredible it looks, is the fear of a terrible and time-consuming onboarding experience. Nobody wants to spend days connecting everything and making sure it’s set up properly. That’s why we worked hard to make it easy as pie to enhance Beta by Crashlytics with fastlane. This will be the smoothest onboarding process you’ve ever seen.

Want to see for yourself? All you need to do is open up the Fabric Mac app and click on the “fastlane” tab in the top menu. Then, click the “Automate your beta” link and we’ll use existing information about your app to auto-generate your single instruction file (called a Fastfile). From there, just build your app and you’re good to go.

You can even use the same Fastfile to automate taking screenshots and deploying to app stores. And once fastlane is set up, everyone on your team can use it out of the box (no extra knowledge transfer needed!) and it works seamlessly with your CI server.

[Image: fastlane + beta onboarding process]

A better beta experience awaits

User feedback can reveal app issues and opportunities that even the most astute development team may miss – it is the key to meaningful iteration. We care about your success, which is why we invested time and resources to drastically improve the beta testing process. With the combined power of Beta by Crashlytics and fastlane, you can get all the feedback you need in a fraction of the time and effort. And it only takes a few seconds to save precious hours. Upgrade to fastlane today and let us know what you’re building with all of your extra time!

For more information about doing beta deployment through fastlane, check out our quick setup guide.

Introducing OOM reporting: a new dimension to app quality

by Sean Curran, Software Engineer

Stability issues can derail the success of even the best apps – glitchy software repels people. We know that app quality is one of your top priorities and crashes are your worst nightmare, which is why Crashlytics will always alert you when issues arise. We’ll even help you pinpoint their root cause so you can fix issues fast – we’ve got your back!

Today, we’re extending our crash coverage to include out-of-memory (OOM) reporting on iOS. Now, you can see stability from a whole new angle by understanding the impact OOM events have on your app experience.

What’s an OOM event and when does it occur?

Unexpected app terminations degrade your app experience and interrupt your user’s session. One type of app termination you’re probably familiar with is crashes, but there’s another unexpected termination that warrants your attention called an OOM event.

An OOM event is an app termination that occurs when a mobile device runs out of memory. All apps need memory to work, but there is only a finite amount available on each device. When an app needs more memory and there isn’t any available, the operating system terminates the app session. To your users, this looks like any other crash, but in reality, this is an OOM event.

Our approach to solving a hard problem: intelligent heuristics

OOM events are difficult to report because iOS doesn’t provide any direct mechanism to detect them, and they can be caused both by your app’s own memory usage and by factors beyond your app’s environment and control. But because of how important it is to understand your rate of OOM events, we took on this challenge and added OOM reporting to Crashlytics for iOS devices.

Here’s how it works: When you enable the Crashlytics and Answers Kits available on Fabric, we get a stream of live data about your app’s performance. Then, we apply a server-side process of elimination to this data stream to detect OOM events in your app. This detection is based on an intelligent heuristic, inspired by the work of two talented engineers. In other words, we analyze your app’s event stream to come up with an explanation of why it terminated, and if we can’t match it up to a known reason for termination, we count it as an OOM event. And, since no changes to the Answers SDK were required, you will automatically get OOM reporting without needing to do any work (as long as you have both Crashlytics and Answers installed).

This is an example of how your app analytics and stability kits can work together to unlock powerful new insight – something neither kit could do alone.
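The process of elimination described above can be sketched roughly as follows. The list of known reasons used here (a reported crash, an app or OS upgrade between launches, an intentional exit) is an assumption for illustration; the post does not enumerate the signals Crashlytics actually uses, and every field name is hypothetical.

```ruby
# Returns the first known explanation for why the previous session ended,
# or :oom if none of the known reasons applies.
def classify_termination(session)
  return :crash       if session[:crash_reported]
  return :app_upgrade if session[:app_version] != session[:next_launch_app_version]
  return :os_upgrade  if session[:os_version] != session[:next_launch_os_version]
  return :user_exit   if session[:exited_intentionally]

  :oom
end

# A made-up session that ended without any recorded explanation.
previous_session = {
  crash_reported:          false,
  app_version:             "2.1.0",
  next_launch_app_version: "2.1.0",
  os_version:              "10.2",
  next_launch_os_version:  "10.2",
  exited_intentionally:    false
}

puts classify_termination(previous_session)
```

Because OOM is the leftover category, the quality of the result depends on how completely the other reasons are covered: any termination cause the heuristic does not know about would be counted as an OOM event.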

“At Pinterest, we're shipping iOS app updates every two weeks to millions of people. Using the new OOM insights, we've been able to track our memory optimizations and have confidence in the stability of each release.”

- Scott Goodson, Head of Core Experience


New line of sight into the impact of OOM events on app quality

Our new OOM reporting dashboard will help you understand if OOM events are a problem for your app. Now, you’ll be able to see the overall percentage of app sessions that were unaffected by OOM events across your builds. You can even drill down and see the percentage of OOM-free sessions for individual builds. This will help you answer important questions like, “Is my app being terminated more on one build than another?” and “Is my app seeing more or fewer OOM-free sessions over the last week?”
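As a worked example of the metric, the percentage of OOM-free sessions can be computed per build as shown below. This sketch assumes the metric is simply sessions without an OOM event divided by total sessions; the build names and counts are made up.

```ruby
# Percentage of sessions that did not end in an OOM event.
# Returns nil when there are no sessions to measure.
def oom_free_percentage(total_sessions, oom_sessions)
  return nil if total_sessions.zero?

  ((total_sessions - oom_sessions) * 100.0 / total_sessions).round(2)
end

# Made-up data: build => [total sessions, sessions ended by an OOM event]
builds = {
  "1.4.0 (120)" => [50_000, 400],
  "1.5.0 (131)" => [20_000, 60]
}

builds.each do |build, (sessions, ooms)|
  puts "#{build}: #{oom_free_percentage(sessions, ooms)}% OOM-free"
end
```

Comparing percentages rather than raw counts keeps builds with very different session volumes comparable.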

We’ll also give you a sortable, daily device breakdown so you can compare the raw OOM counts and the percent of OOM-free sessions across different iOS devices. Once you know which iOS device experiences the most OOM events, you can better triage and spend your time investigating problems on your most used devices. After you roll out a fix, you can monitor the number of OOM events on that device to see if your solution helped reduce them. You may even learn that there are some low-end devices that you simply can’t support.

This additional information paints a clearer picture of your app stability across devices and provides clues for troubleshooting.

Finally, we’ll show you the total number (i.e., the raw counts) of OOM events across all of your builds versus your top three builds. Use this graph to understand the prevalence and magnitude of OOM events.

Don’t let OOM events crash your app party 🎉

Our mission is to ensure that there are no more sad, unstable apps. By adding OOM reporting into Crashlytics, we’re giving you even more insight into the quality of your app. Don’t be blindsided by OOM issues that disrupt your users’ app experiences (and maybe even cause them to flee!). Monitor your OOM-free sessions, promptly identify when OOM events become a problem, and get valuable direction on where to start your troubleshooting by going to the OOMs page from your Crashlytics dashboard. Check it out and let us know what you think!

Get Crashlytics Now

Introducing enhanced dSYM tools: stay on top of stability

by Jason St. Pierre, Product Manager

Missing crashes is frustrating, but not knowing the cause is even worse. App users have low tolerance for buggy apps, which is why stable apps retain more engaged users.

When we launched Crashlytics four years ago, we set out to solve the problem of mobile crash reporting through an easy-to-use, automated experience. We invested in tools that automated the dSYM upload process for iOS apps, so you never had to worry about missing a crash.

As the mobile ecosystem evolves, there are complex situations (e.g., Bitcode support or the use of dynamic libraries) that can prevent us from locating the right dSYM to symbolicate your crashes. That's why today, we're excited to unveil our new command line dSYM uploader. Now, you'll never miss a crash report and you'll also have more transparency and the highest quality crash data, even in the most complex situations!

Real-time alerts when crashes can’t be processed

It isn’t always obvious when you’re impacted by a symbolication issue. Now, any crash that is missing a dSYM will immediately prompt a banner alert in your Crashlytics dashboard. We’ll even show you the aggregate number of unsymbolicated crashes and the UUIDs of the missing dSYMs, meaning you’ll never waste time guessing which ones to upload.
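Conceptually, the numbers in that alert come down to comparing the binary UUIDs referenced by incoming crashes with the UUIDs of the dSYMs already uploaded. The sketch below illustrates the idea only; `missing_dsym_report` and the placeholder UUIDs are hypothetical and say nothing about how Crashlytics implements it.

```ruby
require "set"

# Given the binary UUID of each incoming crash and the UUIDs of all
# uploaded dSYMs, report how many crashes cannot be symbolicated and
# which dSYMs are missing.
def missing_dsym_report(crash_uuids, uploaded_dsym_uuids)
  uploaded = uploaded_dsym_uuids.to_set
  counts   = Hash.new(0)

  crash_uuids.each do |uuid|
    counts[uuid] += 1 unless uploaded.include?(uuid)
  end

  { unsymbolicated_crashes: counts.values.sum, missing_uuids: counts.keys }
end

# Placeholder UUIDs, shortened for readability.
crash_uuids    = %w[AAAA-1111 BBBB-2222 BBBB-2222 CCCC-3333]
uploaded_dsyms = %w[AAAA-1111]

report = missing_dsym_report(crash_uuids, uploaded_dsyms)
puts report.inspect
```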

More control over dSYM uploads

dSYMs are essential for crash reporting because they allow us to group crashes into larger issues, isolate the root cause of the error, and provide context around its severity. We continue to invest in flexible yet powerful tools that help you automate this process, since we know how critical the retrieval and submission of dSYMs is to your stability insight.

That’s why we built upload-symbols, a command line tool, written in Swift, that ships within the Fabric Mac app and our Fabric CocoaPod. It’s incredibly scripting-friendly - give it a try yourself!

Automate with fastlane and skip iTunes Connect

When you submit iOS apps to the App Store with Bitcode enabled, your app gets recompiled on Apple’s servers. This means Crashlytics doesn’t have access to the debug information it needs to symbolicate your crashes. Luckily, you do, via a dSYM download facility in Xcode.

Instead of manually downloading these files from Xcode and submitting them to Crashlytics, you can use fastlane to automate this tedious process after every release.
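As a rough sketch, a Fastfile lane for this could chain three of fastlane's built-in actions. The lane name `refresh_dsyms` is arbitrary, and the example assumes your app identifier, Apple credentials, and Crashlytics API token are already configured for fastlane; check the fastlane documentation for the options each action accepts.

```ruby
lane :refresh_dsyms do
  download_dsyms                  # Download the dSYM files Apple generated for your builds
  upload_symbols_to_crashlytics   # Upload them to Crashlytics
  clean_build_artifacts           # Delete the local copies afterwards
end
```

With a lane like this in place, running `fastlane refresh_dsyms` after each release replaces the manual download and upload steps.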

Learn more about how to save time by setting up fastlane automation.

Generous crash reprocessing window for peace of mind

Don’t have time to upload missing dSYMs right away? No problem. We know you’re busy, so we’ll hold onto unprocessed crash reports for seven days. Once we receive the missing dSYMs, we’ll process the unsymbolicated crashes. Rest assured, you won’t lose any of the information you need to solve the most critical issues.

Highest crash reporting quality

We built our dSYM uploader to stay ahead of the curve. Even though managing debug symbols is getting more complicated, we’ve got you covered with our new Crashlytics updates.

In addition to increased transparency on the number of unsymbolicated crashes and missing dSYMs, we’re now giving you more control over uploading them. And with the power of fastlane, we’ve made dSYM management as smooth and hassle-free as possible for developers who take advantage of Bitcode. Best of all, our seven-day crash reprocessing window means you don’t need to drop what you’re doing to find a missing dSYM - you have time to retrieve them without risking crash data loss.

Try our new dSYM uploader to get even more insight into your crashes and become the master of your app’s stability. If you have more ideas for how we can make dSYM processing even better, we’re all ears!