Obligatory TL;DR with spoilers at the end of the post.

Background

MusicButler is the first web-app I ever built. It notifies its users about new music by artists found in their libraries. It’s also how I learned a lot of what I know about programming and most of what I know about web-development. I believe it played a huge part in how I got to make a career shift to a software developer in my 30s.

Before the dawn of Dockerized apps, one way a layman like me would get an app to production was using Heroku’s Procfiles. MusicButler went live in August 2018. Today it’s a nice side-project that I’m proud to stand behind.

The app bakcend is built on Django and relies heavily on the Celery library for carrying out millions of background and scheduled tasks each day. In fact, all Heroku dynos except for the web one are Celery dynos:1

web: daphne musicbutler.asgi:application --port $PORT --bind 0.0.0.0
celerybeat: celery -A musicbutler beat
celerybackgroundworker1: celery -A musicbutler worker -Q regular
celerybackgroundworker2: celery -A musicbutler worker -Q regular
celeryimportantworker: celery -A musicbutler worker -Q important

The dyno of interest here is celerybeat: the Celery scheduler” responsible for assigning scheduled tasks to other Celery workers.

April 8: first duplicate email received

One of the oldest pieces of code in MusicButler is the one that sends out music-drop emails to thousands of users each day, once a day.

I was surprised when I saw that my test user had received two emails titled New Music for April 8, 2023” at the same time. The emails were identical. This had never happened before.

Upon checking my email delivery provider dashboard, I confirmed that MusicButler had sent twice its average volume of emails that day. Heroku’s logs showed that celerybeat had dispatched the scheduled task twice:

celerybeat.1 [2023-04-08 21:35:00,000: INFO/MainProcess] Scheduler: Sending due task send_music_drops (send_music_drops)
celerybeat.1 [2023-04-08 21:35:00,014: INFO/MainProcess] Scheduler: Sending due task send_music_drops (send_music_drops)

Ok. This isn’t the email provider’s fault. It’s mine, or Celery’s. Probably the former.

I was busy that night so I utilized a Celery companion library called Celery Once which ensures only one instance of a task can run concurrently.

April 12: First user complaint

Over the next few days some personal matters came up that steered my attention away. It was not until a long-time user complained about receiving duplicate emails that I realized the issue was still persisting. Apparently Celery Once doesn’t handle locking for scheduled tasks.

The music drops is one the core features of the app, so I needed to fix it quickly. I had approximately 24 hours until the next batch of emails was due to be sent out.

I could implement the use of a distributed lock myself, or make the email-sending task truly idempotent, but I wanted understand why something was broken in the first place. This code has been running flawlessly for years.

Unfortunate coincidence

Nothing stood out when I ran the service locally. A quick scan of the code didn’t bring up any suspects either.

I inspected Heroku’s logs again and saw that it wasn’t just this specific task that was being dispatched twice, all of them were:

celerybeat.1 [2023-04-12 18:22:00,000: INFO/MainProcess] Scheduler: Sending due task refresh_user_library (refresh_user_library)
celerybeat.1 [2023-04-12 18:22:00,015: INFO/MainProcess] Scheduler: Sending due task refresh_user_library (refresh_user_library)

Luckily most of these tasks are truly idempotent so little damage was done there.

Here’s where a mix of Impostor Syndrome and mandated developer humility sent me looking in the wrong directions.

In the weeks before, I’ve written some new code that was also supposed to be scheduled by celerybeat. Did I muck up something there? since this is Python, you can screw yourself in the foot in a variety of ways. Maybe I’ve imported the scheduler configuration code twice, maybe I inadvertedly put some code in an __init__ file, like this Stack Overflow answer I came across suggests.

I start to delete the new code. At first gradually, and then frantically. With each deployment that failed to mitigate the issue, I was running out of time and nerves.

It’s Celery’s fault”

Using Heroku’s log drains again, I managed to pinpoint the exact time the issue first manifested. The celerybeat dyno started sending the same tasks twice on April 6, 2023 at 22:16:00 UTC:

# once every two minutes
celerybeat.1 [2023-04-06 22:10:00,003: INFO/MainProcess] Scheduler: Sending due task refresh_user_library (refresh_user_library)
celerybeat.1 [2023-04-06 22:12:00,001: INFO/MainProcess] Scheduler: Sending due task refresh_user_library (refresh_user_library)
celerybeat.1 [2023-04-06 22:14:00,001: INFO/MainProcess] Scheduler: Sending due task refresh_user_library (refresh_user_library)
# twice every two minutes
celerybeat.1 [2023-04-06 22:16:00,000: INFO/MainProcess] Scheduler: Sending due task refresh_user_library (refresh_user_library)
celerybeat.1 [2023-04-06 22:16:00,015: INFO/MainProcess] Scheduler: Sending due task refresh_user_library (refresh_user_library)

22:16:00 is the exact time a new build of the app was deployed. This was a revelation, but not a very useful one by the time I discovered it: I had already commented out all new code that could remotely be connected to the bug”.

So is it Celery then? This awesome library has its fair share of complaints about duplicated task execution — most are due to misconfiguration. I scanned dozens of Stack Overflow threads and GitHub issues in Celery’s repository. Very few were relevant to my case, and none of the fixes were.

April 13: You didn’t just do that, Heroku

The last thread I read and the one that would lead me to the shocking discovery included a brute-force suggestion: delete Celery’s celerybeat-schedule file, where celerybeat keeps its schedule.

There’s no reason this would happen on a Serverless platform like Heroku, I thought to myself. But, at this point nothing makes sense anymore, I also thought to myself.

I decided to do something different just before; I renamed the dyno celerybeat to celerybeatnew in the Procfile and deployed:

web: daphne musicbutler.asgi:application --port $PORT --bind 0.0.0.0
celerybeatnew: celery -A musicbutler beat
celerybackgroundworker1: celery -A musicbutler worker -Q regular
celerybackgroundworker2: celery -A musicbutler worker -Q regular
celeryimportantworker: celery -A musicbutler worker -Q important

After the deployment was over I checked the logs again; now everything became clear:

👇
celerybeatnew.1 [2023-04-13 19:42:00,000: INFO/MainProcess] Scheduler: Sending due task refresh_user_library (refresh_user_library)
👇
celerybeat.1 [2023-04-13 19:42:00,015: INFO/MainProcess] Scheduler: Sending due task refresh_user_library (refresh_user_library)

See, at this point the celerybeat dyno shouldn’t even exist. It was nowhere to be found on my list of dynos. But here it is, alive, well, and scheduling tasks.

So what happened on that April 6 deployment is that Heroku either spun up two celerybeat dynos instead of one, or just never killed the old one. It’s not that celerybeat was misbehaving, it’s just that there were now two of them. It was virtually impossible to get to this conclustion until I changed the dyno’s name in the Procfile.

Heroku’s support lives up to its name

I actually know what happened, but not thanks to Heroku’s support. I contacted them on April 13 and as of April 17, their only response is we’re looking into this”. I haven’t heard from them since, and the old celerybeat dyno is still up with nothing I can do to stop it.2

Update: Herkou finally terminated the zombie dyno on April 17 at 14:00UTC. They confirmed there was nothing I could’ve done to terminate it myself.

In hindsight, it was a good decision to not simply rewrite the task’s code to utilize locks. That would have prevented new instances of the app from executing tasks twice, but Heroku is running an old, zombie dyno with outdated code. I know this because I removed some tasks from celerybeatnew and witnessed how they’re still running, courtesy of Heroku’s on-the-house worker. If Heroku was simply spinning up a new extra dyno with each deployment, these tasks should’ve stopped executing enitrely.

Writing on the wall

I’ve been reading on Heroku’s slow demise on developer communities for years now. Just weeks earlier, I’d read a horror story on Twitter about Heroku just flat-out deleting someone’s account, including production apps.

And still. I don’t know what I could’ve done differently here: when I list an internet-taught software developer, a talented OSS team, and a multi-gazillion dollar corporation, my instinct is to look into them in this same order.

Next steps

Truly idempotent tasks

This could have happened on any platform, and for other reasons. If a task should never ever be executed twice, or concurrently, it shouldn’t count on a scheduler to prevent that. In fact, I’ve followed Celery’s own guide on ensuring a task is only executed one at a time. It’s just that I didn’t retroactively apply that to a 4 year-old piece of code.

No more vendor lock

In 2018 Heroku was pretty much the only option for a newbie like me to get started with web-development. The landscape is very different now with providers like Render, Railway, and more offering a simillar DX.

I’ve just finished deploying a Celery worker on Railway. The exprience was OK. It would cost a bit more than what I currently pay for a Heroku dyno. Railway ivented their own deployment-file format, Nixpacks, which allows one to build Docker images. Render utilizes render.yaml files. Wherever I decide to go, I’ll be Dockerizing MusicButler so I can move between vendors less painfully.

When you’re a solo-developer doing something on the side you need to prioritize your time ruthlessly. I’ve always prioritized new features and user feedback over infra work. Looks like it’s time to start paying attention to the latter.

TL;DR

  • Heroku has been running a 2nd copy of my scheduler instance since April 6, 2023 and I have zero control over it.
  • All scheduled tasks were carried out twice, causing disturbance to users and unnecessary high load.
  • Given how Heroku works and how they present their logs, I had no way to detect this early on, or reason to suspect that’s what happened.
  • I discovered the root cause April 13 and contacted Heroku. The zombie instance is still running as of April 17 at 18:45 UTC.

Read More