Migrating to Druid: how we improved the accuracy of our stability metrics

by Max Lord, Software Engineer

Stability metrics are one of the most critical parts of Crashlytics because they show you which issues are having the biggest impact on your apps. We know that you rely on this data to prioritize your time and make key decisions about what to fix, so our job is to ensure these metrics are as accurate as possible.  

In an effort to strengthen the reliability of these numbers, we spent the last few months overhauling the system that gathers and calculates the stability metrics that power Crashlytics. Now, all of our stability metrics are being served out of a system built on Druid. Now that the migration is complete, we wanted to step back, reflect on how it went, and share some lessons learned with the rest of the engineering community.

Why migrate?

In the very early days of Crashlytics, we simply wrote every crash report we received to a Mongo database. Once we were processing thousands of crashes per second, that database couldn't keep up. We developed a bespoke system based on Apache Storm and Cassandra that served everyone well for the next few years. This system pre-computed all of the metrics that it would ever need to serve, which meant that end-user requests were always very fast. However, its primary disadvantage was that it was cumbersome for us to develop new features, such as new filtering dimensions. Additionally, we occasionally used sampling and estimation techniques to handle the flood of events from our larger customers, but these estimation techniques didn't always work perfectly for everyone.

We wanted to improve the accuracy of metrics for all of our customers, and introduce a richer set of features on our dashboard.  However, we were approaching the limits of what we could build with our current architecture.  Any solution we invented would be restricted to pre-computing metrics and subject to sampling and estimation. This was our cue to explore other options.

Discovering Druid

We learned that the analytics start-up Metamarkets had found themselves in a similar position and the solution that they open-sourced, Druid, looked like a good fit for us as well. Druid belongs to the column-store family of OLAP databases, purpose-built to efficiently aggregate metrics from a large number of data points. Unlike most other analytics-oriented databases, Druid is optimized for very low latency queries. This characteristic makes it ideally suited for serving data to an exploratory, customer-facing dashboard.
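
To make the trade-off concrete, here is a minimal sketch in plain Ruby contrasting counters that are pre-computed for a fixed set of dimensions with aggregation over raw rows at query time. The event fields and helper names are assumptions for illustration only; this is neither the old Storm/Cassandra pipeline nor Druid's API.

```ruby
# Hypothetical crash events; the field names are assumptions for illustration.
EVENTS = [
  { app: "a", version: "1.0", device: "iPhone 7", os: "iOS 10" },
  { app: "a", version: "1.0", device: "Pixel",    os: "Android 7" },
  { app: "a", version: "1.1", device: "iPhone 7", os: "iOS 10" },
  { app: "a", version: "1.1", device: "iPhone 7", os: "iOS 9" }
].freeze

# Pre-computation: counters are fixed at write time, keyed only by the
# dimensions chosen up front.
def precompute(events, dimensions)
  events.each_with_object(Hash.new(0)) do |event, counts|
    counts[event.values_at(*dimensions)] += 1
  end
end

# Query-time aggregation: keep the raw rows and filter on any column
# at the moment the question is asked.
def count_where(events, filters)
  events.count { |event| filters.all? { |key, value| event[key] == value } }
end

# Counters keyed by app and version cannot answer a question about devices.
PRECOMPUTED = precompute(EVENTS, [:app, :version])
```

With the pre-computed counters, adding a device filter means choosing new keys and rebuilding every counter. With `count_where(EVENTS, { device: "iPhone 7", os: "iOS 10" })`, the new filter is just another column in the query.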

We were doubtful that any column store could compete with the speed of serving pre-computed metrics from Cassandra, but our experimentation demonstrated that Druid's performance is phenomenal. After spending a bit of time tweaking our schema and cluster configuration, we were easily able to achieve latencies comparable to (and sometimes even better than!) our prior system.  We were satisfied that this technology would unlock an immense amount of flexibility and scale, so our next challenge was to swap it in without destabilizing the dashboard for our existing customers.

Migrating safely

As with all major migrations, we had to come up with a plan to keep the firehose of crash reports running while still serving up all of our existing dashboard requests. We didn’t want errors or discrepancies to impact our customers, so we enlisted a tool by GitHub called Scientist. With Scientist, we were able to run all of the metrics requests that support our dashboard through Druid, issuing the exact same query to both the old system and the new system, and comparing the results. We expected to see a few discrepancies, but we were excited to see that when there were differences, Druid generally produced more accurate results. This gave us the confidence that Druid would provide the functionality we needed, but we still needed to scale it up to support all of our dashboard traffic.
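
The side-by-side comparison can be sketched in a few lines of plain Ruby. This mirrors the idea behind Scientist but deliberately does not use or reproduce that library's API; the class name, the lambdas standing in for the two backends, and the shape of the results are all assumptions.

```ruby
# Minimal side-by-side experiment: run the same query against both
# systems, record differences, and always answer from the old system.
class MetricsExperiment
  attr_reader :mismatches

  def initialize(control:, candidate:)
    @control = control      # callable for the existing system
    @candidate = candidate  # callable for the system under evaluation
    @mismatches = []
  end

  # Always returns the control result, so callers never see candidate output.
  def run(query)
    expected = @control.call(query)
    begin
      observed = @candidate.call(query)
      if observed != expected
        @mismatches << { query: query, control: expected, candidate: observed }
      end
    rescue StandardError => e
      # A failing candidate must not affect the response.
      @mismatches << { query: query, control: expected, error: e.message }
    end
    expected
  end
end

# Hypothetical stand-ins for the two backends.
old_system = ->(_query) { { crashes: 100 } }
new_system = ->(query) { { crashes: query[:app] == "big-app" ? 104 : 100 } }

experiment = MetricsExperiment.new(control: old_system, candidate: new_system)
experiment.run({ app: "small-app" })  # results agree, nothing recorded
experiment.run({ app: "big-app" })    # results differ, mismatch recorded
```

Reviewing the recorded mismatches offline is what lets you decide which system was closer to the truth before any customer sees the new results.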

To insulate our customers from a potential failure as we tuned it to support all of our traffic, we implemented a library called Trial.  This gave us an automatic fallback to the old system. After running this for a few weeks we were able to gradually scale up and cut over all of our traffic to the new system.
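
The fallback behavior can be illustrated with a small wrapper. This is a generic pattern sketched in plain Ruby under our own naming, not the interface of the Trial library, which is not described here.

```ruby
# Try the new backend first and fall back to the old one on any error.
class FallbackQuery
  attr_reader :fallbacks

  def initialize(primary:, fallback:)
    @primary = primary    # callable for the new system
    @fallback = fallback  # callable for the old system
    @fallbacks = 0
  end

  def call(query)
    @primary.call(query)
  rescue StandardError
    @fallbacks += 1       # a useful signal while tuning the new cluster
    @fallback.call(query)
  end
end
```

Counting fallbacks gives a simple measure of how often the new system fails to answer, which helps decide when it is safe to shift more traffic.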

How we use Druid for Crashlytics

On busy days, Crashlytics can receive well over a billion crash reports from mobile devices all over the world. Our crash processing pipeline processes most crashes within seconds, and developers love that they can see those events on their dashboards in very close to real time.

To introduce a minimum of additional processing time, we make extensive use of Druid's real-time ingestion capabilities. Our pipeline publishes every processed crash event to a Kafka cluster that facilitates fanout to a number of other systems in Fabric that consume crash events. We use a Heron topology to stream events to Druid through a library called Tranquility. Part of the Druid cluster called the "indexing service" receives each event and can immediately service queries over that data. This path enables us to serve an accurate, minute-by-minute picture of events for each app for the last few hours.

However, calculating metrics over weeks or months of data requires a different approach. To accomplish this, Druid periodically moves data from its indexing service to another part of the cluster made up of "historical" nodes. Historical nodes store immutable chunks of highly compressed, indexed data called "segments" in Druid parlance and are optimized to service and cache queries against them. In our cluster, we move data to the historical nodes every six hours. Druid knows how to combine data from both types of nodes, so a query for a week of data may scan 27 of these segments plus the very latest one currently being built in the indexing service, for 28 six-hour chunks in total.
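
The segment count follows from simple arithmetic. The sketch below assumes the query window lines up with segment boundaries; the helper name is ours.

```ruby
HOURS_PER_SEGMENT = 6  # hand-off interval stated above

# Number of six-hour chunks needed to cover a query window.
def segments_for(days:, hours_per_segment: HOURS_PER_SEGMENT)
  (days * 24.0 / hours_per_segment).ceil
end

total = segments_for(days: 7)  # 7 * 24 / 6 = 28 chunks
historical = total - 1         # the newest chunk is still in the indexing service
puts "#{historical} historical segments + 1 in-progress segment"
```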

The results

Our Druid-based system now allows us to ingest 100% of the events we receive, so we are happy to report that we are no longer sampling crash data from any of our customers. The result is more accurate metrics that you can trust to triage stability issues, no matter how widely installed your app is.

While nothing is more important to us than working to ensure you have the most reliable information possible, we also strive to iterate and improve the Crashlytics experience. In addition to helping us improve accuracy, Druid has unlocked an unprecedented degree of flexibility and richness in what we can show you about the stability issues impacting your users. Since the migration, you may have noticed a steady stream of design tweaks, new features, and performance enhancements on our dashboard. For example, here are a few heavily-requested features that we’ve recently rolled out:  

  • You can now view issues across multiple versions of your app at the same time.
  • You can view individual issue metrics for any time range.
  • You can now filter your issues by device model and operating system.

This is just the beginning. We're looking forward to what else we can build to help developers ship stable apps to their customers.

P.S. We're building a mobile platform to help teams create bold new app experiences. Want to join us? Check out our open positions!

Building an energy-efficient analytics SDK for iOS

By Stephen Panaro, Software Engineer

The primary goal of any analytics SDK is to gather accurate data. Answers goes a step further by putting an equal focus on timeliness. Correct analytics are one thing, but tracking them in real time is a superpower. Is your latest release stable and well-adopted? Is your 24-hour sale driving more purchases? We’ll give you the answer immediately so you can take appropriate action. To deliver this real-time insight, we have to pay special attention to how we design Answers for iOS, tvOS, and macOS.

Low power as a feature

We could have built Answers to make a network request for every app event as it happens. This would have been the most timely SDK, but it would have also drained battery life. Or, we could have designed Answers to collect large batches of events and sacrifice all timeliness for great battery life. Instead, we took a balanced approach in between these two extremes. It has served Answers well, but we wanted to revisit our implementation to see how much better we could make it.

With Answers 1.3 for Apple platforms, we introduced several new power optimizations throughout the SDK. We prioritized application performance and user battery life while keeping latency as low as we could. In some cases, these optimizations produced significant power gains. This allowed us to have a large impact on battery life with a negligible aggregate impact on Answers’ latency. In the next section, we’ll walk through some of the improvements we made to enable these gains.

Limit networking

First, we turned our focus to Answers’ networking. Networking can have a substantial power impact, which made it a great place to look for optimizations. A simple way to reduce its impact is by limiting the number of requests you make. Answers has always done this by sending data in batches. For our latest release, we try even harder to ensure that app events that occur close in time to each other are sent in the same batch. This is most beneficial for heavy users of Answers Events, especially when events are logged in bursts. We also tuned our retry policy to help prevent us from making requests that are unlikely to succeed.
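
As a language-neutral illustration of the batching idea, here is a small Ruby sketch that groups events occurring close together in time so each group can go out in a single request. The five-second window and the event shape are assumptions; the SDK's actual batching rules are not described here.

```ruby
# Group timestamped events so that an event arriving within `window`
# seconds of the previous one joins the same batch.
def batch_by_time(events, window:)
  events
    .sort_by { |event| event[:at] }
    .slice_when { |earlier, later| later[:at] - earlier[:at] > window }
    .to_a
end

# Hypothetical burst of events (timestamps in seconds).
events = [0, 1, 2, 30, 31, 90].map { |t| { name: "tap", at: t } }
batches = batch_by_time(events, window: 5)
puts batches.map(&:size).inspect
```

Six events become three requests here, and a burstier workload benefits even more.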

Background uploads

It’s impossible to talk about power-efficient networking without mentioning NSURLSession’s background uploading capability. We’ve used this capability for several years in crash reporting and seen really good power and reliability wins, so we wanted to bring those to Answers. Unfortunately, we found some issues with the background uploading API in iOS 10. Because of that, it is currently only suitable for extremely low volume networking. We hope to enable it in a future release when these issues have been addressed.

Low power mode

Next, we took advantage of NSProcessInfo’s lowPowerModeEnabled property, which reports whether a device is in low power mode (the accompanying NSProcessInfoPowerStateDidChangeNotification tells apps when that state changes). Introduced in iOS 9, this is a very strong indication that the user wants battery life to be maximized. When devices are in low power mode, Answers retries network requests less frequently. On macOS, we take similar action when thermal conditions are elevated. This is an easy way to further reduce our power impact, and is particularly effective when networking conditions are poor. We also have plans to expand our adoption of low power mode to other parts of the SDK.
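
One way to picture "retrying less frequently" is a backoff schedule that stretches when the device is conserving power. The Ruby sketch below is only an illustration: the constants are invented, and the low power flag is a plain boolean standing in for what NSProcessInfo would report on a device.

```ruby
# All constants are assumptions chosen for illustration.
BASE_DELAY = 30            # seconds before the first retry
MAX_DELAY = 3600           # never wait longer than an hour
LOW_POWER_MULTIPLIER = 4   # stretch retries while in low power mode

# Exponential backoff, lengthened when low power mode is on.
def retry_delay(attempt:, low_power:)
  delay = BASE_DELAY * (2**attempt)
  delay *= LOW_POWER_MULTIPLIER if low_power
  [delay, MAX_DELAY].min
end
```

For example, the first retry would wait 30 seconds normally and 120 seconds in low power mode under these assumed values.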

Quality of service

In addition to the low power mode API, we also adopted two easy-to-use APIs to better inform the OS of the priority of Answers’ work. First, we made sure to set the qualityOfService property of our NSOperationQueues. This was a one-line change and makes sure that Answers always defers to your app’s needs. We also made extensive use of NSProcessInfo’s activity APIs to help the OS understand what we’re doing and how our work should be prioritized. As an added benefit, this makes it more obvious what Answers’ threads are doing if you happen to catch them in a debugger.

Optimized timers

Finally, we wanted to improve our usage of timers. Answers has always relied on timers because we have to make sure we periodically relay any events back to our system. In this release, we updated our timers to fire less frequently. We also now choose a timer’s tolerance value based on its duration to help the OS schedule work more efficiently. On macOS, we’ve adopted NSBackgroundActivityScheduler. This is similar to a timer but takes into account even more system conditions when scheduling work. We discovered this API while reading Apple’s Energy Efficiency guidelines, which have many other useful tips.
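
Choosing a tolerance from a timer's duration can be as simple as taking a fraction of the interval. The Ruby sketch below uses 10%, which matches the general rule in Apple's timer documentation for repeating timers; the one-minute cap and the function name are assumptions, and the values Answers actually uses are not stated here.

```ruby
# Tolerance (in seconds) as a fraction of the timer's interval, capped so
# that very long timers do not drift too far. The cap is an assumption.
def timer_tolerance(interval, fraction: 0.1, max: 60.0)
  [interval * fraction, max].min
end
```

A larger tolerance gives the OS more room to coalesce wake-ups with other scheduled work, which is where the energy savings come from.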

Same Answers, more battery

When we started to plan for Answers 1.3, we knew improving energy efficiency and satisfying Answers’ design goals would be challenging. Being accurate and being real-time are inherently power-hungry, but they’re essential to Answers so we were determined to find a solution.

We stepped back, took a holistic look at Answers and identified our best opportunities for improvement. Fortunately, there was a large overlap between the problems we wanted to solve and the problems that Apple supplies a solution for as a part of iOS. This was fantastic on two fronts: it let us write less code and in many cases, it unlocked optimizations that wouldn’t be possible otherwise.

We’re thrilled with how this approach turned out and are even more excited to share these changes with you in the latest Answers release. We hope that this makes it easier for you to build the best apps with the lowest impact on your users’ battery.

If you’re interested in further reducing your app’s impact on battery life, we encourage you to incorporate these best practices into your development workflow. Most are simple to learn and implement and the more apps that adopt them, the greater impact they will have. Plus, no one wants their app to be at the top of the battery section as the worst offender in Settings!

Fabric October Update

By Brian Lynn, Sr. Product Marketing Manager

Halloween is a great time to munch on candy, carve pumpkins, and partake in other fall festivities. Over here at Fabric, we shipped some treats in October to keep you ghoulishly happy. Recap below!

Automate app testing to save more time

Beta testing is an important part of building an app users love, which is why it’s such a shame that the beta process is tedious and tiring. Although you can already use Beta by Crashlytics to simplify that process, imagine how much more time you’d save if you could also automate it in seconds! That’s why in October, we made it easy to supercharge Beta by Crashlytics with fastlane. Now, you can spend less time testing/releasing your app and more time building features that users love. See the integration in action here.

Track your onboarding performance in real-time

Your login funnel is one of your app’s most important conversion paths. New users want a simple onboarding flow so they can start using your app right away. That’s why on top of the fastlane integration, we also launched Digits' login funnels: a new real-time view into how many users begin the onboarding process, how many finish it, and where people fall out. This new level of granularity will show you the biggest drop-off points in your funnel so you can take action to fix them.

See the original announcement here!

Easily organize and find your in-app events

As users sign up or log in to your app, Answers Events lets you track their in-app actions and understand how they’re engaging with your app. In October, we made it even easier to organize and find these events/key performance indicators (KPIs) on your events dashboard. Now, you can scroll through your entire events list (instead of just the top 10) and sort it by name, frequency or category. Check it out on your latest dashboard.

Improve stability for your most valuable users

New users tend to abandon apps that are buggy, and returning users need your app to continue to be stable. That’s why in October, we also released crashes by activity segments, a new filter that shows you which stability issues are affecting new and active users. Want to know which crashes are causing new users to leave or which issues are blocking active users from completing key in-app actions? Hop over to your Crashlytics dashboard to find out.

Learn more from the original announcement.

Grow your revenue with engaging rewarded video ads

For those of you generating revenue with ads, we launched rewarded video ads on the MoPub Marketplace, a new way to monetize your app. Rewarded video ads allow you to offer users a meaningful reward (like in-game currency) in exchange for watching a video ad. This is a win-win because you can now encourage user engagement and earn revenue at the same time. More on this from the original announcement.

We hope you and your team find these features helpful. Here’s to another great month!

Supercharge Beta by Crashlytics with fastlane: automate app testing to save more time

By Hemal Shah, Product Manager

Feedback that comes early and often is the best way to improve your app experience. Your users are your greatest source of learning (what can I do better?) and inspiration (what should I build next?), which is why it’s such a shame that beta testing is a tedious, tiring, and hair-pulling process.

We built Beta by Crashlytics to fix this by making it easy and straightforward to distribute beta builds to your users. Simplified beta distribution is awesome, but imagine how much more time you’d save if you could also automate it in seconds. Now, all iOS apps using Beta by Crashlytics can harness the power of fastlane to automate beta deployment with a simple, guided setup process!

 

Less tedious work, more time to innovate

A successful beta release involves a lot of steps – you have to bump version numbers, compile the build, append the release notes, distribute it to your testers, alert your team that a build went out, and the list goes on and on. And if that wasn’t enough, sometimes only one person knows how to distribute a beta build, which means you’re tied to their schedule and there is a bottleneck for delivering value to your app users.

Beta by Crashlytics takes the pain out of this process and makes it intuitive and efficient. By installing fastlane, an award-winning mobile deployment toolset, you can automate the tedious tasks that slow down your app development. Say “goodbye” to manual work and spend more time innovating, while still getting a healthy stream of user feedback.

Beta distribution takes only seconds using fastlane:

[Image: Beta distribution with fastlane]

Instant access to all the power and benefits of fastlane

As an added bonus, with fastlane, you’ll also unlock new tools to help you customize your build process and effortlessly release your app to the App Store. Plus, fastlane is fully extensible so you can create and consume actions that automate all aspects of your mobile development – whether it’s working with Git, posting to Slack, etc. – there are 170 built-in actions and 50 third-party plugins to choose from!

Here’s an example of fastlane’s extensibility: We noticed that developers are increasingly forced to use multiple beta distribution services to address different needs. With fastlane, you can use the same process to automate beta deployment to Beta by Crashlytics and stage your build on iTunes Connect. fastlane integrates with the services you already use and love, so everything will still work exactly as you designed it – it’ll just be light years faster and you’ll have more flexibility and control.


lane :beta do
  increment_build_number
  gym                  # Build your app
  testflight           # Upload to TestFlight
end

lane :appstore do
  snapshot             # Generate screenshots for the App Store
  gym                  # Build your app
  deliver              # Upload the screenshots and the binary to iTunes
  slack                # Let your team-mates know the new version is live
end
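
The lanes above upload to TestFlight only. To also distribute the same build through Beta by Crashlytics, a lane might look like the following sketch. It relies on fastlane's built-in crashlytics action; the lane name, the environment variable names, and the tester group are placeholders rather than values from this post.

```ruby
# Sketch of a lane that ships one build to both services.
# Environment variable names and the tester group are placeholders.
lane :beta_everywhere do
  increment_build_number
  gym                                  # Build your app
  crashlytics(
    api_token: ENV["CRASHLYTICS_API_TOKEN"],
    build_secret: ENV["CRASHLYTICS_BUILD_SECRET"],
    groups: "internal-testers"
  )                                    # Distribute through Beta by Crashlytics
  testflight                           # Stage the same build on iTunes Connect
end
```

Keeping the credentials in environment variables means the Fastfile can be committed and shared with the whole team without exposing secrets.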

 

The fastest set-up process, ever

We know that one of the biggest hesitations to adopting a new tool, no matter how incredible it looks, is the fear of a terrible and time-consuming onboarding experience. Nobody wants to spend days connecting everything and making sure it’s set up properly. That’s why we worked hard to make it easy as pie to enhance Beta by Crashlytics with fastlane. This will be the smoothest onboarding process you’ve ever seen.

Want to see for yourself? All you need to do is open up the Fabric Mac app and click on the “fastlane” tab in the top menu. Then, click the “Automate your beta” link and we’ll use existing information about your app to auto-generate your single instruction file (called a Fastfile). From there, just build your app and you’re good to go.

You can even use the same Fastfile to automate taking screenshots and deploying to app stores. And once fastlane is set up, everyone on your team can use it out of the box (no extra knowledge transfer needed!) and it seamlessly works with your CI server.

[Image: fastlane + beta onboarding process]

A better beta experience awaits

User feedback can reveal app issues and opportunities that even the most astute development team may miss – it is the key to meaningful iteration. We care about your success, which is why we invested time and resources to drastically improve the beta testing process. With the combined power of Beta by Crashlytics and fastlane, you can get all the feedback you need in a fraction of the time and effort. And it only takes a few seconds to save precious hours. Upgrade to fastlane today and let us know what you’re building with all of your extra time!

For more information about doing beta deployment through fastlane, check out our quick setup guide.

Fabric September Update

by Brian Lynn, Sr. Product Marketing Manager

Now that fall is here and school’s back in session, we’re studying up on how to make life easier for you and your app development team. This month, we released three new features to help you understand your users and strengthen your app quality:

Your key performance indicators — now available on the go

Many of you are already tracking Answers Events and your key performance indicators (KPIs) via your Fabric dashboard. Now, with the latest version of our Fabric iOS app, you can easily track those KPIs and monitor user actions even when you’re away from your desk. By combining these additional insights with your adoption and stability metrics (e.g., DAU, MAU), you’ll know exactly where to focus your app improvement efforts.

Get the latest update:

(Android coming soon! Follow us on Twitter so you don’t miss it.)



See the most critical issues across builds or by device/OS

To help you fix crashes even faster, we added a new build selector as well as device and OS filters to your Crashlytics dashboard. Now, you can easily find and prioritize issues that are affecting multiple builds and increase stability across the spectrum.

On top of the new build selector, you can even filter crashes by device or OS. Want to focus in on crashes happening on iPhone 7 or Android 7, or compare crashes on iOS 10 vs. iOS 9? These new filters make it easy to search, compare and address your app's stability.

See these in action on your Crashlytics dashboard!

New fastlane docs: automate your app releases

Earlier this month, we launched docs.fastlane.tools, our new docs website for fastlane. This new site will walk you through fastlane’s set-up process and show you how to streamline tedious work when releasing your app, like taking screenshots, beta distribution, code signing, and more. Check it out!

Sidenote: we’re also happy to share that fastlane has now surpassed 500 contributors on GitHub. We continue to be humbled by your support, and we can’t wait to keep moving fastlane forward!

Here’s our internal changelog:

Fabric

  • iOS

    • Update upload-symbols for macOS Sierra compatibility

    • Warn on projects built for ≤ iOS 6 and ≤ macOS 10.7 that Fabric compatibility with those versions is deprecated

  • Android

    • Update the Fabric dependency to update Crashlytics Core

Crashlytics

  • Android

    • Improved support for Android M & N

    • Facilitated improved NDK support on Android M & N

    • Updated Crashlytics Core dependency

    • Fixed issue which prevented sending crash reports in the rare case battery level info is not available

Answers

  • Android

    • Fixed a bug that caused Answers to undercount the number of Daily New Users. The Daily New Users count may be temporarily higher when you first launch a version of your app with this SDK

    • Updated Answers dependency

Digits

  • Android

    • Fixed error inflating StateButton when requesting email

    • Introduced new logger events to obtain errors while submitting phone number or confirmation code

Twitter Kit

  • Android

    • Bump dependencies

    • Added translations

    • Removed pseudo locales from translations

    • Updated proguard rules for Okhttp3 and Retrofit2

    • Moved TwitterCollection from internal package to models

    • Minor bug fixes

MoPub