How digging into an on-call issue led to an unlikely culprit
What in tarnation?
During one of my on-call rotations for our internal tools team, we got a report that Chrome was crashing for users of Gusto’s internal software. This was interrupting our normal customer service. Gusto employees in the middle of answering customer emails or phone calls might suddenly lose the visibility into customers’ accounts that they needed to do their jobs.
This was fairly far outside the usual scope of our on-call issues. Our team is generally well-insulated by other teams from issues like browser compatibility, so I didn’t know the first thing about browser debugging. Where would I even begin? I leaned on a more experienced teammate, our product infrastructure team, and our IT team.
The first clues
We started by trying to find out what the affected users had in common.
We learned rather quickly that:
- Not all Gusto employee users were affected.
- Our customer-facing software did not appear to be affected.
- Other internal software webpages appeared to be fine.
- Crashing did not occur consistently. Users who experienced it could reload the same page multiple times, and sometimes it would crash and sometimes it would not.
- Not all of our internal pages crashed, but more than one of them did, including our most heavily-trafficked pages.
- The issue was not occurring with Safari or Firefox.
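A comparison like this can be made systematic in a few lines of Ruby. The following is an illustrative sketch only: the user names and traits are hypothetical, and it assumes you can collect a list of traits (browser version, extensions, installed software) for each user.

```ruby
# Hypothetical survey results: the traits each user reported.
affected = {
  "user_a" => ["chrome_old", "extension_x", "desktop_app_y"],
  "user_b" => ["chrome_new", "desktop_app_y"]
}
unaffected = {
  "user_c" => ["chrome_old", "extension_x"],
  "user_d" => ["chrome_new"]
}

# Traits that every affected user has.
shared_by_affected = affected.values.reduce(:&)

# Traits that at least one unaffected user has.
seen_in_unaffected = unaffected.values.flatten.uniq

# Whatever remains is worth investigating first.
suspects = shared_by_affected - seen_in_unaffected
puts suspects.inspect # prints ["desktop_app_y"]
```

The result is only as good as the traits you think to collect. A trait nobody records, such as an unrelated desktop application, will never show up as a suspect.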
Hunch #1: A bad Chrome version
Our first hunch was that maybe it was the Chrome version. We had one affected user update their Chrome version, and early signs looked promising that the issue was resolved. However, we soon learned that although installing the new Chrome version decreased the frequency of crashing, it didn’t eliminate the issue.
We added the following new clues to our list:
- The Chrome version in question had already been out for a while.
- We had affected users on different versions of Chrome.
- We had affected users and unaffected users on the same version of Chrome.
- Upgrading/downgrading the Chrome version for affected users did not fix the issue.
Hunch #2: A bad Chrome extension
Okay, maybe it was a Chrome extension? At first it looked as though the crashing stopped when one of our users disabled three core extensions. But when we tried to reproduce the fix with a guest profile (which has no extensions), we still saw crashes. Back to the drawing board, then.
Trouble reproducing the bug
Our Infrastructure team put out a call to all of engineering to ask if anyone could reproduce the issue. Frustratingly, although many of our customer service representatives were affected, none of our engineering teams reported any crashes at all except for two engineers in Turkey. With precious little overlap in available pairing time due to time zone differences, we slowly learned the following over several weeks:
- For security reasons, we do not have Chrome crash reporting enabled.
- Checking out code from several weeks prior did not fix the issue, which indicated that the cause was not a recent change.
- Loading a static HTML version of a crashing page did not cause crashing.
- Using open-source Chromium instead of Chrome did not cause crashes, so we couldn’t see what Chrome code was failing either.
- Several different internal applications were causing Chrome to crash, not just one.
- Removing all of the page-specific content from the page did not fix the issue.
- Disabling our in-house font did not fix the issue.
As urgency waned because our users were using other browsers as a workaround, progress on this bug slowed to make way for other priorities. We didn’t have much left to go on without being able to reproduce the bug. However, we wanted to resolve it since users had bookmarks/settings/preferences in Chrome. We believed that we shouldn’t have to ask our users to avoid the world’s most popular browser, and we were also still getting periodic pings from various users asking whether we had made any progress on this bug.
A stroke of good luck
One day out of the blue, one of our Denver engineers reported being newly affected. The only change she had made was downloading the Grammarly desktop app.
Wait, really? We had to see if we could reproduce the bug.
- I downloaded the Grammarly desktop app too. Instant reproduction of the issue (at last!).
- I deleted Grammarly. The issue didn’t go away. Only after I restarted my computer did the crashing stop.
- I reinstalled Grammarly. Chrome started crashing again.
We also confirmed with many of our affected users that they had Grammarly installed on their computers. Now we were cooking with fire!
With our ability to debug now greatly improved, we started making tedious headway: make a change, reload the page ten times or until it crashes.
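Because the crash was intermittent, a clean run of reloads is only weak evidence that a change fixed anything. Here is a minimal sketch of the "reload up to ten times" loop, plus the arithmetic behind how much to trust a clean run. The helper names are hypothetical, and the 30% crash rate is an assumed number chosen for illustration, not a measurement.

```ruby
# Runs the block up to `attempts` times and stops at the first crash.
# The block should return true when the page load crashed.
def reproduces?(attempts: 10)
  attempts.times.any? { |attempt| yield(attempt) }
end

# Chance of seeing no crash at all in `attempts` loads, assuming each load
# crashes independently with probability `crash_rate` (an assumed model).
def false_negative_chance(crash_rate, attempts: 10)
  (1 - crash_rate)**attempts
end

# Stand-in for a real page load: this fake crashes on the third load.
loads = 0
crashed = reproduces?(attempts: 10) do |attempt|
  loads += 1
  attempt == 2
end

puts "crashed: #{crashed} after #{loads} loads"
puts format("chance of 10 clean loads at a 30%% crash rate: %.4f", false_negative_chance(0.3))
```

Under that assumed 30% rate, ten clean reloads in a row would still happen about 2.8% of the time, so an occasional "fix" that later turns out not to be one is expected.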
Our main internal application is built on ActiveAdmin, but newer parts of it use React without the same ActiveAdmin framework as the rest of the application. The pages that are built in React did not crash. Hmm, so maybe our ActiveAdmin code was causing the crashes?
We learned earlier that removing all page-specific content did not fix the issue, so we started looking at parts of the code that were common to multiple pages, like the main navigation header and sidebar. Notably, our React pages do not use the same navigation header.
The code for our main navigation bar has a fair amount of metaprogramming, and chasing down threads here was often more confusing than not. We eventually figured out how to comment out pieces of the navigation bar one at a time, and pinpointed one line that, when commented out, stopped Chrome from crashing:
dropdown 'My History', , turbo: true, src: '/navigation/my_history'
This section is called “My History” and it differs from the rest of the main navigation in that instead of being more-or-less the same for all users, it is customized for each user, displaying the handful of pages that each user has visited most recently. We discovered that even when the page loaded successfully, hovering over the “My History” section could cause Chrome to immediately crash.
Hunch #3: Turbo
Then we looked at turbo: true. Could that be causing the issue? Turbo is a gem we added to speed up our Rails application, but it turned out to be a red herring. It was introduced only after the bug had already been reported, and the engineer who introduced it had been experiencing these Chrome crashes for months before Turbo was introduced and for months before the bug was escalated to us.
Okay, so where was the dropdown being defined? We use a framework called Arbre to metaprogram HTML from this type of method. To navigate the internal plumbing, I turned to one of our engineers with deep Rails knowledge. In this case, the relevant code (once we finally found it) looked like this:
This code generated HTML that looked something like this:
We replaced the call to dropdown with the generated HTML, then removed pieces of the new HTML until we zeroed in on the culprit.
When I removed loader-spinner.gif, the placeholder we display while the menu options load, the page stopped crashing. Eureka! It’s the gif! We swapped in a different gif and the page did not crash.
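The narrowing-down described above is essentially a bisection. A minimal sketch follows, assuming exactly one piece is at fault and that you have a reliable way to tell whether a given subset of pieces crashes the page. Apart from loader-spinner.gif, the piece names are hypothetical, and the block stands in for rendering the subset and reloading it in the browser.

```ruby
# Returns the single piece whose presence makes the block return true.
# Assumes exactly one culprit and a reliable crash check.
def find_culprit(pieces)
  candidates = pieces
  while candidates.size > 1
    middle = candidates.size / 2
    first_half = candidates.first(middle)
    second_half = candidates.drop(middle)
    # Keep whichever half still reproduces the crash.
    candidates = yield(first_half) ? first_half : second_half
  end
  candidates.first
end

# Hypothetical pieces of the generated dropdown markup.
pieces = ["wrapper", "menu_title", "loader-spinner.gif", "menu_list"]

# Stand-in crash check: in practice, render only `subset` and reload the page.
culprit = find_culprit(pieces) { |subset| subset.include?("loader-spinner.gif") }
puts culprit
```

With an intermittent crash like this one, each check would need to be repeated several times before a half is declared clean, which is why the process was so tedious in practice.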
We downloaded the image file and dragged the file into the browser window. With Shakespearean melodrama, the page immediately crashed. My pair and I both audibly gasped.
We also found out that:
- Opening the file in Safari did not cause a crash.
- After uninstalling Grammarly and restarting the computer, the gif loaded in Chrome without crashing.
At this point, we notified our Design Systems team of the very peculiar fact that this gif was causing Chrome to crash, and they promptly replaced it with a new one.
Why did this particular gif crash Chrome when Grammarly was installed? Unfortunately, with access to neither the Chrome source code nor the Grammarly source code, we can only guess. In the time since we replaced the gif, either Grammarly or Chrome or both have fixed this issue, because the original gif no longer causes Chrome to crash.
I would never, ever have guessed that the treasure at the end of the debugging rainbow was an animated gif.
Even though the priority of this bug changed over time as we found workarounds, relentless curiosity won out in the end. None of us had all of the knowledge needed to solve this bug alone, but with persistence and collaboration, we were able to figure it out together.
If you also enjoy collaborating with relentlessly curious people, we are hiring!
Hats off to the many people who collaborated to investigate and fix this issue: Iain McGinniss, Lucy Fox, Gregor MacDougall, Oguzhan Ince, Can Gençler, Daniel Flynn, Eric Nagy, Lijie Zhou, Harry Seeber, Nathaniel Strauss, and Steve Konves.