Beware the hackney default pool

How monitoring broke our app.

Today, a client of mine experienced outages of major core features in their app. Common denominator: file uploads to S3.

Symptom

A likely culprit was soon identified:

> :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 50, free_count: 0, queue_count: 0]

The hackney default connection pool was drained.

ex_aws uses hackney by default. Thus any operation trying to upload files to S3 was starved for connections and eventually timed out.
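To check for this condition in code instead of reading the output by eye, a minimal sketch could look like the following. PoolCheck is a hypothetical module name, not part of hackney or the app. It only assumes that the stats arrive as the keyword list shown above.

```elixir
defmodule PoolCheck do
  @moduledoc "Hypothetical helper, not part of hackney or the app described here."

  # Takes the keyword list returned by :hackney_pool.get_stats/1 and
  # reports whether every connection in the pool is checked out.
  def exhausted?(stats) do
    Keyword.fetch!(stats, :in_use_count) >= Keyword.fetch!(stats, :max)
  end
end

# The stats captured during the outage:
stats = [name: :default, max: 50, in_use_count: 50, free_count: 0, queue_count: 0]
IO.inspect(PoolCheck.exhausted?(stats), label: "default pool exhausted")
```

In a running system you would pass it :hackney_pool.get_stats(:default) directly, for example from a periodic health check.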

Workaround

Since ex_aws ships with a req adapter and Finch was already in use in the app, this was the 1-line workaround:

config :ex_aws, http_client: ExAws.Request.Req, req_opts: [finch: MyApp.Finch]

Cause

An analysis of mix.lock revealed another user of hackney: appsignal.

I had recently added AppSignal cron Check-ins to the app to monitor execution of a few vital Oban cron jobs.

There seems to be a bug (update: fixed) in the AppSignal check-in implementation: hackney connections remain checked out even after the check-in is finished.

> :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 0, free_count: 0, queue_count: 0]

> Appsignal.CheckIn.cron("my_check_in")

> Process.sleep(10_000)
> :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 1, free_count: 0, queue_count: 0]

> Appsignal.CheckIn.cron("my_check_in", fn -> :ok end)

> Process.sleep(10_000)
> :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 2, free_count: 0, queue_count: 0]

The result: every check-in leaks one connection, so the pool drains slowly until it is completely empty.

Lessons learned

Don't use the hackney default pool, as it is a central point of failure. In this case, monitoring caused core features to fail.

This coupling was not all that obvious:

  • ex_aws by default uses hackney and its default pool. ex_aws documents this and offers alternative adapters.
  • appsignal uses hackney and its default pool. This is undocumented and not configurable.
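For a library that stays on hackney and lets you set its options, a dedicated pool is one way to get out of the default pool. The sketch below is not what we did in this app. It assumes that ex_aws passes its :hackney_opts setting through to hackney and that hackney 1.x is in use. The pool name :ex_aws_pool and all sizes are made up for illustration.

```elixir
# config/config.exs
import Config

# Assumption: ex_aws hands :hackney_opts to hackney unchanged, so the
# `pool` option selects the pool. :ex_aws_pool is a made-up name.
config :ex_aws, :hackney_opts,
  pool: :ex_aws_pool,
  recv_timeout: 30_000

# lib/my_app/application.ex, inside start/2: start the named pool under the
# supervisor (timeout and size are illustrative):
#
#   children = [
#     :hackney_pool.child_spec(:ex_aws_pool, timeout: 15_000, max_connections: 50)
#   ]
```

A leak in some other library would then drain the default pool without touching this one.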

We will keep the "workaround" in place and continue using the req adapter, which uses finch under the hood by default. Finch provides better decoupling of external services:

When using HTTP/1, Finch will parse the passed in URL into a {scheme, host, port} tuple, and maintain one or more connection pools for each {scheme, host, port} you interact with.
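As a sketch of what that looks like in practice, pools can also be sized per host when starting Finch. MyApp.Finch is the name from the config line earlier in this post. The S3 endpoint and the sizes are assumptions, not values from the app.

```elixir
# lib/my_app/application.ex (sketch)
children = [
  {Finch,
   name: MyApp.Finch,
   pools: %{
     # Fallback for every host without an explicit entry.
     :default => [size: 50],
     # Hypothetical S3 endpoint; use the one your bucket actually lives on.
     "https://s3.eu-central-1.amazonaws.com" => [size: 25, count: 2]
   }}
]

Supervisor.start_link(children, strategy: :one_for_one)
```

Because each {scheme, host, port} gets its own pool, a client that leaks connections to one host cannot starve requests to another.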

Further analysis of the app's mix.lock file revealed more libraries that use hackney. We categorized those usages as safe:

  • packmatic -> httpoison: safe, because it doesn't use connection pooling (pool: false)
  • swoosh: safe, because it is already configured to use finch instead of hackney
  • tzdata: safe, because it updates its data on application start, before the pool can be drained, and a failed update would not impact users
  • wallaby -> web_driver_client: only used for tests
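If you want to run the same check on your own project, here is a rough sketch that lists the packages in mix.lock which depend directly on hackney. LockScan is a hypothetical helper. It assumes the current Hex lock format, where the sixth element of each entry is the dependency list. It reports direct dependents only, so for a chain like packmatic -> httpoison you would see httpoison and need to follow it up yourself.

```elixir
defmodule LockScan do
  @moduledoc "Hypothetical helper for listing packages that depend on a given package."

  # `lock` is the map that mix.lock evaluates to. Hex entries are tuples whose
  # sixth element is the dependency list: [{app, requirement, opts}, ...].
  # Git and path entries are shorter tuples and are skipped.
  def dependents(lock, target) do
    names =
      for {name, entry} <- lock,
          is_tuple(entry) and tuple_size(entry) >= 6 and elem(entry, 0) == :hex,
          Enum.any?(elem(entry, 5), fn dep -> elem(dep, 0) == target end) do
        name
      end

    Enum.sort(names)
  end
end

# Run from the project root; does nothing if there is no mix.lock.
if File.exists?("mix.lock") do
  {lock, _bindings} = Code.eval_file("mix.lock")
  IO.inspect(LockScan.dependents(lock, :hackney), label: "depends on hackney")
end
```

Each hit still needs a manual look at whether, and how, that library uses the default pool.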

Over to you: best check your dependencies and make sure you're not using the hackney default pool!