How to read a UTF-8 encoded file with Erlang

January 18, 2017

In this HOWTO, we create a file with UTF-8 encoded text, read it with an Erlang program, and verify that the bytes Erlang stores in memory match what is on disk.

A capital R with a right single quote painted on a brick wall. © 2016 Daria Nepriakhina for Unsplash

Tools used in this tutorial:

Erlang/OTP 19
dc

All of the code used in this post can be found at https://github.com/mbucc/markbucciarelli.com/tree/master/sandbox/utf8.

Step 1. Create a file with UTF-8 encoding.

The Unix utility dc is a reverse-Polish calculator that provides a nice, concise way of printing bytes to a file. (According to Wikipedia, when Bell Labs received a PDP-11, dc, written in B, was the first language to run on the new computer, even before an assembler.) Here, we use it to output the string “quoted” to stdout:


    #! /bin/sh -e
    
    # "quoted" (but with curly quotes)
    #
    # http://unix.stackexchange.com/a/189810
    #
    #  character          byte(s), hex
    #  -----------------  ------------
    #  left_curly_quote   e2  80  9c
    #  q                  71
    #  u                  75
    #  o                  6F
    #  t                  74
    #  e                  65
    #  d                  64
    #  right_curly_quote  e2  80  9d
    
    dc<<EOF
    16i0
    $(printf %sP E2 80 9C 71 75 6F 74 65 64 E2 80 9D)
    EOF

The 16i tells dc to interpret the input that follows as base-16 numbers, and each P pops a number and writes it out as a raw byte. With this, we create our UTF-8 encoded file.


    $ ./makeutf8.sh
    “quoted”$
    $ ./makeutf8.sh > utf8.txt
    $ cat utf8.txt
    “quoted”$
    $
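
If dc is not installed, the same twelve bytes can be written with printf and octal escapes. This is only a sketch, assuming a POSIX shell; the file name utf8_printf.txt is made up for this example so that it does not overwrite utf8.txt.

```shell
#! /bin/sh -e

# Same twelve bytes as makeutf8.sh, written as octal escapes:
#   hex e2 80 9c = octal 342 200 234  (left curly quote)
#   hex e2 80 9d = octal 342 200 235  (right curly quote)
# utf8_printf.txt is a hypothetical file name used only in this sketch.
printf '\342\200\234quoted\342\200\235' > utf8_printf.txt

# Count the bytes: two 3-byte quotes plus six ASCII letters.
wc -c < utf8_printf.txt
```

The count should come out as 12 bytes for 8 characters. If you have both files, cmp utf8.txt utf8_printf.txt should report no difference.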

Step 2. Read the file with an Erlang program.

The use of io:format to dump the hex value of the bytes Erlang has stored in memory is courtesy of Hynek -Pichi- Vychodil on Stack Overflow (http://stackoverflow.com/a/3771421).


    -module(file_read_file).
    
    -export([start/0]).
    
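    %% Print each byte of Bin as two uppercase hex digits, one byte per line.
    dump(Bin) ->
        io:format("~s",
                  [[io_lib:format("~2.16.0B~n", [X]) || <<X:8>> <= Bin]]).
    
    %% Read the whole file into a binary, then dump its bytes.
    start() ->
        {ok, Bin} = file:read_file("utf8.txt"),
        dump(Bin).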

Step 3. Verify the bytes Erlang stores in-memory match the file contents.


    $ erlc file_read_file.erl
    $ erl -pa . -s file_read_file
    Erlang/OTP 19 [erts-8.0.1] [source-ca40008] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]
    
    E2
    80
    9C
    71
    75
    6F
    74
    65
    64
    E2
    80
    9D
    Eshell V8.0.1  (abort with ^G)
    1> q().
    ok
    2> $

The output matches the sequence of bytes input to dc above, so Erlang’s file:read_file/1 reads in the bytes exactly as they were written to disk and thus can be used to read UTF-8 encoded files.
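
Step 3 compares Erlang’s dump with the bytes we fed to dc. To compare against the file itself, od can produce the same one-byte-per-line layout. A minimal sketch, assuming a POSIX shell with od, tr and sed; hexlines is a helper name made up for this example.

```shell
#! /bin/sh -e

# Print stdin as two uppercase hex digits per line, one byte per line,
# which is the layout dump/1 prints from Erlang.
# hexlines is a hypothetical helper name, not part of the original code.
hexlines() {
    od -An -v -tx1 |
        tr -s ' ' '\n' |
        sed '/^$/d' |
        tr 'abcdef' 'ABCDEF'
}

# Demo on the three bytes of the left curly quote.
printf '\342\200\234' | hexlines

# On the real file: hexlines < utf8.txt
```

The demo prints E2, 80 and 9C on separate lines. Running hexlines < utf8.txt gives a listing you can check line by line against the Erlang output above.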

Notes

file:read_file/1 works for any byte sequence.

There is nothing special about the UTF-8 encoding here. The Erlang function file:read_file/1 reads in whatever byte sequence is in the file, without decoding or validating it.

If we read a Mac OS Roman encoded file, which uses the byte 0xD2 for the left double quote and 0xD3 for the right double quote, the bytes Erlang prints match the bytes in the file.


    $ cp mac_os_roman.txt utf8.txt
    $ cat utf8.txt |od -v -An -t x1
               d2  71  75  6f  74  65  64  d3  0a
    
    $ erl -pa . -s file_read_file -s init stop
    Erlang/OTP 19 [erts-8.0.1] [source-ca40008] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]
    
    D2
    71
    75
    6F
    74
    65
    64
    D3
    0A
    Eshell V8.0.1  (abort with ^G)
    1> $
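
The session above assumes mac_os_roman.txt already exists. One way to recreate it, as a sketch based only on the bytes od shows above (the original file may have been made differently):

```shell
#! /bin/sh -e

# Mac OS Roman: left double quote = hex d2 (octal 322),
# right double quote = hex d3 (octal 323).
# The trailing \n is the 0a that od shows at the end.
printf '\322quoted\323\n' > mac_os_roman.txt

# Dump the bytes to confirm: d2 71 75 6f 74 65 64 d3 0a
od -v -An -t x1 mac_os_roman.txt
```

The column spacing of the od output varies between systems, but the nine byte values should be the same.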

I’m sticking with binaries.

It looks like file:read_file/1 hands the file back as a binary, which is just a sequence of bytes. That is how C treats strings, and I know that works just fine with encoded strings. I also know that appending to binaries in Erlang is fast. So until I learn more about how Erlang strings work, I’ll keep my text data in binary form because I know that works.

Tags: erlang