A Simple Erlang Application, with Prometheus

November 23, 2016

A simple HTTP server built with Elli that exposes metrics for Prometheus.

Face with a fiery torch in the background. class=
© 2016 Vinicius Amano for Unsplash

The Code

The code is identical to the code in A Simple Erlang Application, with the following modifications:

You can find the code for this blog entry here.

New dependency on Erlang client for Prometheus.io


	(cd ${SRC}; git clone ${PROMETHEUS_GIT})
	(cd ${PROMETHEUS}; git checkout tags/${PROMETHEUS_TAG})
	(cd ${PROMETHEUS} ; rebar3 compile)

This new section in the Makefile downloads and compiles the Erlang client for prometheus. The ${PROMETHEUS_APP}/ebin is then added as a dependency for compiling the Erlang code in this project.

New handler to log metrics and expose an endpoint for Prometheus

%% @doc Exposes HTTP responses and timings as Prometheus metrics.
%% Defines two metrics:
%%        * the counter http_requests_total, with the label statuscode.
%%        * the histogram http_response_microseconds
%% The duration measures the time, in microseconds, between when
%% the request line was received and the response was sent. This
%% is as close we can get to the actual time of the request as
%%  seen by the user.




-export([handle/2, handle_event/3]).

% Expose /metrics for Prometheus to pull.
handle(Req, _Args) ->
    handle(Req#req.method, elli_request:path(Req), Req).

handle(’GET’, [<<"metrics">>], _Req) ->
    {ok, [], prometheus_text_format:format()};
% All other requests are passed to normal handler.
handle(_Verb, _Path, _Req) -> ignore.

% Return path, minus any query string, as a binary.
rawpath(#req{raw_path = Path}) ->
    case binary:split(Path, [<<"?">>]) of
      [P, _] -> P;
      [P] -> P

             [Req, ResponseCode, _ResponseHeaders, _ResponseBody,
             _Config) ->
                           [ResponseCode, rawpath(Req)]),
    Start = proplists:get_value(request_start, Timings),
    End = proplists:get_value(request_end, Timings),
    Delta = timer:now_diff(End, Start),
    io:format("handle_event: ~p  ~p  ~p~n",
               erlang:convert_time_unit(Delta, native,
    % The "_microseconds" suffix in the metric name is magic.
    % Prometheus.erl converts the Erlang native time difference to microseconds.
                                 [rawpath(Req)], Delta),
             [Req, ResponseCode, ResponseHeaders, _ClosingEnd, Timings],
             Config) ->
                 [Req, ResponseCode, ResponseHeaders, <<>>, Timings],
handle_event(Event, [Req, _Exc, _Stack], _Config)
    when Event =:= request_throw;
         Event =:= request_exit;
         Event =:= request_error;
         Event =:= request_parse_error;
         Event =:= bad_request;
         Event =:= client_closed;
         Event =:= client_timeout ->
                           [Event, rawpath(Req)]),
handle_event(elli_startup, [], _Config) ->
    prometheus_counter:new([{name, http_requests_total},
                            {help, "Total HTTP requests"},
                            {labels, [status, path]}]),
                              {labels, [path]},
                               [10, 100, 1000, 10000, 100000, 300000, 500000,
                                750000, 1000000, 1500000, 2000000, 3000000]},
                               "Microseconds between request receipt "
                               "and response send."}]),
handle_event(_, _, _) ->
    %% Future-proof.

Lots of code here. Some things of note:

  1. This is a normal elli callback module, just like es_callback.erl in the last blog entry. As shown in the next section, the es_sup module stacks the handlers when configuring elli_middleware.
  2. If a callback module returns the atom ignore, the request is passed to the next handler in the stack.
  3. The elli_startup event (the last handle_event in the listing above) is where you define the metrics.
  4. A metric label let’s you create subtotals. For example, http_requests_total has subtotals by status and resource.
  5. The _microseconds suffix on the metric name is “magic”; that is, it tells the client library to convert native time to microseconds. See full list of magic suffixes at lines 27 - 37 of prometheus_time.erl
  6. This handler exposes the /metrics resource, which is the default endpoint that Prometheus looks for.
  7. This handler is decoupled from all other handlers.

I figured out this code by looking at other elli middleware (for example, access logging) and then finding the elli-lib/elli_prometheus example.

New child spec to call elli_middleware





start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    MiddlewareConfig = [{mods,
                         [{elli_prometheus, []}, {es_callback, []}]}],
    ElliOpts = [{callback, elli_middleware},
                {callback_args, MiddlewareConfig}, {port, 3000}],
    ElliSpec = {es_http, {elli, start_link, [ElliOpts]},
                permanent, 5000, worker, [elli]},
    {ok, {{one_for_one, 5, 10}, [ElliSpec]}}.

The difference from the last blog post is the MiddlewareConfig, which stacks the handlers and ElliOpts, which tells Elli to use the middleware (which itself is a handler).


  scrape_interval:     1s

    monitor: ’elli-monitor’

  - job_name: ’prometheus’
      - targets: [’localhost:9090’]

  - job_name: ’erlsrv’
      - targets: [’’]

This configures prometheus. Change to directory that holds this file and start prometheus.


elli: version 1.0.5 versus version 2.0.0

The elli_lib project on GitHub has released a 2.0.0 version of elli with these changes (among others):

  1. tracks response size as well as timings
  2. uses Erlang’s monotonic clock
  3. sprinkles lots of spec definitions throughout the code base
  4. changes the semantics of request_start event

The last change was giving me request durations that were over second seconds in some cases, which is why I stayed with version 1.0.5.

Built in metrics that come with prometheus.erl

By simply including the prometheus.erl client, I automatically get a ton of metrics exposed on the /metrics endpoint, including:

You can find the entire list in the README for https://github.com/deadtrickster/prometheus.erl in the section “Erlang VM & OTP Collectors.”

Prometheus Memory Consumption

I plan to run Prometheus on a Raspberry Pi 3, which only has 1G of RAM. Some notes on prometheus memory usage:

  1. stores metric data in “chunks”
  2. each chunk is 1024 bytes
  3. keeps all “currently used” chunks in memory
  4. limit on in-memory chunks set by storage.local.memory-chunks config, default value is 1048576 bytes (1 MB)
  5. don’t hit limit, as metric data is dropped (“throttled ingestion”)
  6. storage.local.retention config defines how long to store metric history on disk

Rule of thumb:

Reference: https://prometheus.io/docs/operating/storage/.


Starting up the application

$ ./run.sh
Erlang/OTP 19 [erts-8.0.1] [source-ca40008] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V8.0.1  (abort with ^G)
1> application:start(erlsrv).

and hitting it with some traffic

#! /bin/sh -e
# Lay siege to Elli.

siege --log=siege.log -c $CLIENTS -t $DURATION

siege --log=siege.log -c $CLIENTS -t $DURATION

siege --log=siege.log -c $CLIENTS -t $DURATION

loads Prometheus with some data. Here are a examples of Prometheus graphs.


Graph of http requests against time looks sigmoid-ish

Prometheus graphs metrics by label and job. I chose the same metric name as prometheus uses internally, so there is a name collision; prometheus uses different labels than I did, so the first four rows in the legend have the labels code, handler, instance, and method (the Prometheus labels), and the last three have instance, job and path (the erlsrv labels).

rate(response_time_in_microseconds_sum[15s]) / rate(response_time_in_microseconds_count[15s])

Prometheus supports expressions; this one computes average response time over last 15 seconds.

The /metrics page hovers under six microseconds, and /hello/world is well under one microsecond.


System memory up over 16M, process memory just over 4M, both stable. Not surprisingly, the “load” I put on (200 requests/second for a tiny response body with no backend work) doesn’t register.


This one kept climbing to the right of this graph after I shut off erlsrv, until it plateaued at 7.5K.