Beware the hackney default pool
How monitoring broke our app.
Today, a client of mine experienced outages of major core features in their app. Common denominator: file uploads to S3.
Symptom
A likely culprit was soon identified:
> :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 50, free_count: 0, queue_count: 0]
The hackney default connection pool was drained.
ex_aws uses hackney by default. Thus any operation trying to upload files to S3 was starved for connections and eventually timed out.
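To catch this state before users do, the stats above can be turned into a simple check. The following is a minimal sketch: PoolCheck is a hypothetical module (not part of hackney), and it only assumes the keyword-list shape returned by :hackney_pool.get_stats/1 as shown above.

```elixir
defmodule PoolCheck do
  # Hypothetical helper, not part of hackney.
  # `stats` is the keyword list returned by :hackney_pool.get_stats/1.
  # The pool counts as exhausted once every connection is checked out.
  def exhausted?(stats) do
    Keyword.fetch!(stats, :in_use_count) >= Keyword.fetch!(stats, :max)
  end
end

# The stats captured during the outage:
stats = [name: :default, max: 50, in_use_count: 50, free_count: 0, queue_count: 0]
PoolCheck.exhausted?(stats)
# => true
```

In a running system the argument would come from :hackney_pool.get_stats(:default), for example inside a periodic job that logs or alerts when the check returns true.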
Workaround
Since ex_aws ships with a req adapter by default and Finch was already in use in the app, this was the 1-line workaround:
config :ex_aws, http_client: ExAws.Request.Req, req_opts: [finch: MyApp.Finch]
Cause
An analysis of mix.lock revealed another user of hackney: appsignal.
I had recently added AppSignal cron Check-ins to the app to monitor execution of a few vital Oban cron jobs.
There seems to be a bug (update: fixed) in the AppSignal check-in implementation: hackney connections remain checked out even after the check-in is finished.
> :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 0, free_count: 0, queue_count: 0]
> Appsignal.CheckIn.cron("my_check_in")
> Process.sleep(10_000)
> :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 1, free_count: 0, queue_count: 0]
> Appsignal.CheckIn.cron("my_check_in", fn -> :ok end)
> Process.sleep(10_000)
> :hackney_pool.get_stats(:default)
[name: :default, max: 50, in_use_count: 2, free_count: 0, queue_count: 0]
The result: every check-in leaks one connection, so the pool drains slowly until all 50 connections are checked out.
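The manual before/after comparison from the session above can be wrapped in a small helper to probe other libraries for the same kind of leak. This is a sketch only: LeakProbe is a hypothetical module, and the stats function is injected so that the helper does not depend on hackney itself.

```elixir
defmodule LeakProbe do
  # Hypothetical helper: reports how many additional pool connections are
  # checked out after `fun` has run, compared to before.
  #
  # `stats_fun` is any zero-arity function returning a keyword list with an
  # :in_use_count key, e.g. fn -> :hackney_pool.get_stats(:default) end
  def leaked(stats_fun, fun) when is_function(stats_fun, 0) and is_function(fun, 0) do
    before = Keyword.fetch!(stats_fun.(), :in_use_count)
    fun.()
    Keyword.fetch!(stats_fun.(), :in_use_count) - before
  end
end
```

The result is only meaningful on an otherwise idle pool, since concurrent requests also change in_use_count. As in the session above, the probed function should wait a few seconds before returning so that asynchronous work has time to finish.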
Lessons learned
Don't use the hackney default pool, as it is a central point of failure.
In this case, monitoring caused core features to fail.
This coupling was not all that obvious:
- ex_aws by default uses hackney and its default pool. ex_aws documents this and offers alternative adapters.
- appsignal uses hackney and its default pool. This is undocumented and not configurable.
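For a library that has to stay on hackney but lets you pass request options, a dedicated pool is one way to get out of the default pool. The sketch below is an assumption-laden example, not what this app ended up doing: :s3_pool is a hypothetical pool name, the sizes are illustrative, and it relies on ex_aws handing its :hackney_opts setting through to hackney as request options.

```elixir
# config/config.exs
import Config

# :s3_pool is a hypothetical pool name. With the hackney adapter, ex_aws
# passes :hackney_opts along as request options, so `pool:` selects the
# pool that S3 requests check connections out of.
config :ex_aws, :hackney_opts, pool: :s3_pool, recv_timeout: 30_000

# The pool itself can be started once at application start, for example in
# the application's start/2 callback (sizes are illustrative):
#
#   :ok = :hackney_pool.start_pool(:s3_pool, timeout: 15_000, max_connections: 50)
```

This does not help with libraries like appsignal, where the pool is not configurable.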
We will keep the "workaround" in place and continue using the req adapter. This uses finch under the hood (by default).
This provides better decoupling of external services:
When using HTTP/1, Finch will parse the passed in URL into a {scheme, host, port} tuple, and maintain one or more connection pools for each {scheme, host, port} you interact with.
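Because pools are keyed per host, Finch also allows sizing them per host. A minimal sketch of a supervision tree entry, assuming the MyApp.Finch name from the workaround above; the S3 endpoint URL and all sizes are hypothetical and only there for illustration:

```elixir
# In the application's supervision tree (sketch)
children = [
  {Finch,
   name: MyApp.Finch,
   pools: %{
     # Settings for every host that has no entry of its own
     :default => [size: 50],
     # Hypothetical endpoint: a separately sized set of pools for S3 traffic
     "https://s3.eu-central-1.amazonaws.com" => [size: 25, count: 2]
   }}
]
```

Even without any explicit pools entry, each {scheme, host, port} gets its own pool, so a leak towards one service cannot starve requests to another.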
Further analysis of the app's mix.lock file revealed more libraries that use hackney.
But we categorized those usages as safe:
- packmatic -> httpoison: safe, because it doesn't use connection pooling (pool: false)
- swoosh: safe, because it is already configured to use finch instead of hackney
- tzdata: updates data on application start, before the pool can be drained; also, failure to update would not impact users
- wallaby -> web_driver_client: only used for tests
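The mix.lock analysis can be partly automated. The following is a rough sketch: LockAudit is a hypothetical module, and it assumes the usual shape of Hex entries in mix.lock (a tuple starting with :hex whose sixth element is the dependency list). The sample lock data is made up for illustration, with placeholder versions and hashes.

```elixir
defmodule LockAudit do
  # Hypothetical helper. `lock` is the map that mix.lock evaluates to.
  # Returns the sorted names of all packages that list `dep` as a direct
  # dependency.
  def direct_users(lock, dep) do
    lock
    |> Enum.filter(fn
      {_name, entry} when is_tuple(entry) and tuple_size(entry) >= 6 ->
        elem(entry, 0) == :hex and
          Enum.any?(elem(entry, 5), fn d -> elem(d, 0) == dep end)

      _other ->
        false
    end)
    |> Enum.map(fn {name, _entry} -> name end)
    |> Enum.sort()
  end
end

# Made-up lock data in the assumed shape
hackney_dep = {:hackney, "~> 1.0", [hex: :hackney, repo: "hexpm", optional: false]}

lock = %{
  appsignal: {:hex, :appsignal, "x.y.z", "hash", [:mix], [hackney_dep], "hexpm", "hash"},
  httpoison: {:hex, :httpoison, "x.y.z", "hash", [:mix], [hackney_dep], "hexpm", "hash"},
  finch: {:hex, :finch, "x.y.z", "hash", [:mix], [], "hexpm", "hash"}
}

LockAudit.direct_users(lock, :hackney)
# => [:appsignal, :httpoison]
```

Against a real project, the map can be loaded with `{lock, _binding} = Code.eval_file("mix.lock")`. The helper only finds direct users; indirect ones such as packmatic -> httpoison still need a look at the output of mix deps.tree.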
Over to you: best check your dependencies and make sure you're not using the hackney default pool!