Building Fabric mission control with GraphQL and Relay

By Fin Hopkins, Software Engineer

A few weeks ago, we shipped mission control, a new dashboard for Fabric that shows you the most important information across all of your apps on the same page. To build it, we turned to two fairly new technologies, GraphQL and Relay, because of the flexibility they bring to fetching data. Although investigating and productionizing these new technologies added an upfront cost to the project, we found that they paid off over the course of development and we’re excited by their potential as we develop more for Fabric.

What does it take to responsibly dive headfirst into a new technology and framework? Here’s how we went about it:

Bringing together the data

Just from the description of mission control (“all your apps, all their important information”), we knew that we were going to faceplant into the limitations of our existing API architecture. Before mission control, pages on fabric.io were limited to showing one kit’s worth of information for one app at a time, and our RESTful APIs and backend data providers reflected that design. If we were going to show you the most important data for any of your kits, for all of your apps, in under 500 network requests from the browser, we’d need to build something new.

That “new” could have been mission control–specific endpoints in our Ruby on Rails API layer, but, as we’ll see, that would have been forcing an architecture we already weren’t satisfied with to do something it wasn’t suited for. Instead, we took this as an opportunity to vet GraphQL, a technology that, on its surface, was tailor-made to our problem. If this experiment went well, we could see ourselves adopting GraphQL for more features, bringing its benefits to all Fabric teams.

Rendering a page with the information diversity and density we were going for requires making multiple API calls to a variety of backend systems, such as those for Answers, Crashlytics, and Beta. A RESTful public endpoint that combined those sources would actually spend most of its time waiting for those API responses, something our existing Ruby on Rails setup is decidedly poor at. Making the API calls in serial would take far too long, but parallel requests in Rails would require bringing in a concurrency library like EventMachine, and even then we’d still be tying up one of our limited Unicorn processes for the duration of the request.

GraphQL for a better server

GraphQL, developed at Facebook, is a query language for, well, graphs of data. You start with a strongly-typed schema of your data, which is defined principally as objects and their fields. A user Account might have strings for name and email address, but it might also have a “connection” to a list of Apps. Those Apps will have their own fields, like name, icon, or current number of active users. An App might also have a connection back to a list of member Accounts. GraphQL lets you start at a “root” in your schema — say, the current account — and query for the subset of fields you’d like returned. In our case, one query might be for the account’s starred Apps, their names, icons, and the timestamp / value pairs of their daily and monthly active user data.
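As a sketch, a query like the one just described might look like this (the field names here are illustrative, not Fabric’s actual schema):

```graphql
# Hypothetical query: start at the current account root and select
# only the fields the page needs.
query {
  currentAccount {
    starredApps {
      name
      icon
      dailyActiveUsers { timestamp value }
      monthlyActiveUsers { timestamp value }
    }
  }
}
```

The server returns JSON shaped exactly like the query, and nothing more.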

It’s important to note that the queries come from the client. For mission control, that’s a webapp written in JavaScript. The server is a general engine that can fulfill any query that matches the schema. That flexibility will be very important later on.

By using the reference GraphQL implementation, graphql-js, we would have access to Node’s native concurrency, giving us the “wait fast” behavior we want. Backend API requests could happen in parallel, and running out of Unicorn workers during those waits wouldn’t be a concern. This has since been proven in production: loading the mission control details for an app can approach two dozen internal API requests but still return in only a few hundred milliseconds to the browser.
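As a sketch of that “wait fast” behavior, a resolver helper in Node can fan the backend calls out in parallel with Promise.all; the backend client names here are hypothetical, not our actual internal APIs:

```javascript
// Hypothetical resolver helper: fire the backend requests concurrently and
// wait for all of them, so total latency tracks the slowest call, not the sum.
async function fetchAppDetails(appId, backends) {
  const [metrics, issueCounts, latestBuild] = await Promise.all([
    backends.answers.metricsFor(appId),         // hypothetical Answers client
    backends.crashlytics.issueCountsFor(appId), // hypothetical Crashlytics client
    backends.beta.latestBuildFor(appId),        // hypothetical Beta client
  ]);
  return { metrics, issueCounts, latestBuild };
}
```

Because Node’s event loop is free while those requests are in flight, waiting on two dozen internal calls doesn’t tie up a worker the way it would in our Unicorn setup.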

Relay for a better process

Adopting GraphQL would also give us the opportunity to adopt Relay, another Facebook library that integrates GraphQL and React (we love React). Relay essentially annotates React components with a GraphQL query that describes the data they need for rendering. Defining the query next to the rendering code allows a component to be developed and maintained in isolation, rather than in concert with server-side API changes. That means it’s faster and easier to make changes during development.

We knew we would be rapidly iterating on mission control in response to usability studies and beta tester feedback, so anything that reduced what we needed to change to bring in new data would pay off and speed those cycles up. Having new data available with just a tweak to the GraphQL query and a hot reload in the browser changed how we were able to work. We iterated on mission control in mob sessions with our product manager and designer, and Relay let us instantly respond to their feedback and see changes working live.

“Go slow to go fast”

After kicking the tires of graphql-js and Relay (“Does it do what it says on the tin? Can we do development / testing / logging / monitoring / deploying / &c. up to our standards?”), we got a thumbs up from other Fabric engineers and buy-in from our product manager to give it a shot for mission control. We were upfront that this was a “go slow to go fast” strategy: getting a GraphQL server and Relay frontend infrastructure up-and-running was going to take some initial time that wouldn’t be spent shipping new features to our customers, but once it was in place, we were betting we’d be able to write our feature (and hopefully many more) much more quickly than we could have on the old stack.

Adopting graphql-js

Though we’ll have more to say later about specific productionization and hardening steps we took, putting graphql-js on top of our JSON-returning backend services was straightforward. Any unification of previously-disparate data sources requires some smoothing over of inconsistencies, but on the whole we were able to craft a solid schema out of the most important Crashlytics, Answers, and Beta data.

Our best advice to new graphql-js users is to build your schema out slowly, so you can see how your data model will get consumed by a frontend client, and also to lean heavily on DataLoader to dedupe, cache, and adapt multiple one-off requests to use bulk endpoints. We also have become fans of marking any field we can as “not-null,” for reasons we’ll explain in a future post. (Spoiler: you can automatically generate valid Relay data for testing and development.)
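To illustrate the batching half of what we lean on DataLoader for, here is a stripped-down sketch of the idea (DataLoader itself does much more, including per-key caching and deduping; this is not its implementation):

```javascript
// Minimal sketch of batch loading: collect the single-key loads made during
// the current tick, then satisfy them all with one bulk request.
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn; // (keys) => Promise of values in the same order
    this.queue = [];
  }

  load(key) {
    return new Promise((resolve, reject) => {
      if (this.queue.length === 0) {
        // Flush after every synchronous load in this tick has been queued.
        process.nextTick(() => this.flush());
      }
      this.queue.push({ key, resolve, reject });
    });
  }

  flush() {
    const batch = this.queue;
    this.queue = [];
    this.batchFn(batch.map((item) => item.key)).then(
      (values) => batch.forEach((item, i) => item.resolve(values[i])),
      (err) => batch.forEach((item) => item.reject(err))
    );
  }
}
```

With a loader like this sitting in front of a bulk endpoint, resolvers can naively call load() per app and still generate a single backend request per tick.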

Speed bumps in production

Development was going swimmingly against development backends and fixture data, but once we got mission control loaded up in production we saw that it actually took quite a long time to see any apps on the page. Despite all of the parallel fetches and concurrent processing in our GraphQL server, we’d see the loading spinner for sometimes as long as 10 seconds! It was a terrible experience for what was due to become the new fabric.io home page.

In that first iteration, our by-the-book use of Relay and GraphQL led us to an extreme that was ill-suited to our backend’s characteristics. Relay works by gathering up the data requests for all the (transitive) sub-components of a particular root component and combining them into a single query to the backend. That meant that we were making a single request for all of the mission control data at once: metrics, issue counts, velocity alerts, and so on, for our entire first page of 25 apps. To fulfill such a query, our GraphQL server was making scores of requests to our backend services, and, even though they were mostly in parallel, it couldn’t return anything to the browser until they had all completed.

In an ideal world, these requests would all be consistently performant, but in our situation the backend services would occasionally hit a snag that would add a few hundred milliseconds or even a couple of seconds to their response times, delaying the rendering of anything in the browser. As we added to mission control and the GraphQL query grew, the number of internal requests — and the chance of this happening — grew along with it.

An even worse case would happen when one of the dozens of internal requests timed out or had an otherwise transient error. This would add an error to the response — which Relay interprets as a failure — requiring a second attempt from the client while the loader still spun. Of course, there was no guarantee that this next attempt wouldn’t have its own transient errors, leading to a sort of starvation by flakiness.

Though Relay’s mega-query was an efficient use of client-side network connections, it was trading that against the time-to-first-render of the page. Luckily, while Relay and GraphQL were leading us into making these giant, flaky queries, they also provided the tools we needed to fix them.

Smaller batches of data

Our solution was easy enough: split up the queries. We knew we could request the basic information about your apps in a small number of bulk requests to backends that had very predictable performance. With that data, we could start rendering rows for each app and make the page functional for navigation while we brought in the metrics and highlights information. Separating each app’s detailed data into its own request would isolate a slow response to just a single app, not the whole page.

To accomplish this in Relay, we made our AppRow component take “app” and “appDetails” props, each with its own fragment. The former fragment requested the bare-bones data and was referenced by our page’s top-level RootContainer. We then wrapped each AppRow in its own RootContainer to load the “appDetails” fragment (while passing through the “app” prop).

// MissionControl.jsx
render() {
  return <Relay.RootContainer route={appListRoute} Component={AppList}/>;
}

// AppList.jsx
render() {
  return <div>
    {this.props.account.apps.map((app) =>
      <AppDetailsLoader appId={app.id} key={app.id}>
        <App app={app} appDetails={null}/>
      </AppDetailsLoader>
    )}
  </div>;
}

// AppDetailsLoader.jsx
render() {
  const appView = React.Children.only(this.props.children);

  return <Relay.RootContainer
    route={new DetailsRoute({appId: this.props.appId})}
    Component={App}
    renderLoading={() => appView}
    renderFetched={(data) => React.cloneElement(appView, {appDetails: data.app})}
    />;
}


It’s important to note that this refactor of the queries was done entirely client-side. Because of GraphQL’s flexibility, we were free to make the connection / responsiveness tradeoff that was appropriate to the experience we wanted. If we had resigned ourselves to making this change with a legacy-style, fixed API, we would have had to split out the code on the server and then maintain two endpoints instead of one. Furthermore, a different frontend client (on mobile, perhaps) wouldn’t have been able to choose its own performance tradeoff without writing yet another endpoint to the same data.

GraphQL is not a magic wand

This was an important lesson for us and one that anyone adopting GraphQL will have to pay attention to: combining data from multiple sources is a big advantage of using GraphQL, but the more distinct backends or internal requests that go into servicing a single query, the more you will have to consider the case where one of them fails.

This experience also caused us to rethink how we wanted to handle retries and backoff. We were hoping to keep all retry logic on the client for operational simplicity, but, if just one internal request had a transient failure, a client-side retry would not only waste all the work that did succeed from the first request, it would also be at risk for its own transient failure. Since most of our backend failures and timeouts were transient, rather than consistently slow, we added a retrying wrapper around request-promise in our GraphQL server to re-do those requests. This simple change eliminated 99% of the errors when resolving queries, at the expense of a second or two of extra latency in cases where a retry occurred after a timeout.
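Our wrapper is internal, but the shape of the idea is simple; a minimal sketch, assuming the backend call is a promise-returning function (our production version also distinguishes retryable errors and applies backoff):

```javascript
// Minimal retry sketch: re-run a promise-returning request on failure,
// up to a fixed number of attempts, and only surface the last error.
function withRetries(requestFn, maxAttempts = 3) {
  return async function (...args) {
    let lastError;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return await requestFn(...args);
      } catch (err) {
        lastError = err; // transient failure: try again
      }
    }
    throw lastError;
  };
}
```

Retrying server-side keeps the already-successful portions of a big query, where a client-side retry would throw all of that work away.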

Going fast

While we can’t compare our GraphQL / Relay implementation to mission control built with RESTful endpoints on Ruby on Rails, we’re satisfied with both the pace and the experience of developing mission control. We’re confident that “go slow to go fast” paid off. Even if the “go slow” made things take longer than just following our old API pattern, we can now be in “go fast” mode from here on out, as can everyone else doing client development at Fabric. GraphQL has let us build a platform for ourselves that is delightful to build upon.

We took Relay to production with eyes open about its pre-release status. We expected a few rough edges, and we found some: pulling in connections a page at a time was awkward; after the page was open for a while polling for updates, running a mutation (starring an app, for example) could generate a query with 30K of duplicate fields. We were able to write workarounds in both cases (a mixin for paging in connections, and an alternate query tracker that continuously collapses), and we’ve found the maintainers at Facebook are very responsive and helpful when we run into things.

For a project like mission control, GraphQL and Relay were a near-perfect solution, and the cost of building it any other way justified the investment. Now that we have these pieces in place, we see ourselves using them again and again.

Stay tuned for our next post, where Sam Neubardt is going to share more about how we extended graphql-js to make a productionized GraphQL server.


We're helping our customers build the best apps in the world. Want to be a part of the Fabric team and build awesome stuff? Check out our open positions!