How to read a UTF-8 encoded file with Erlang

January 18, 2017

In this HOWTO, we create a file with UTF-8 encoded text, read it with an Erlang program, and verify that the bytes Erlang stores in memory match what is on disk.

A capital R with a right single quote painted on a brick wall.
© 2016 Daria Nepriakhina for Unsplash

Tools used in this tutorial:

  1. Erlang/OTP 19
  2. dc

All of the code used in this blog can be found at https://github.com/mbucc/markbucciarelli.com/tree/master/sandbox/utf8.

Step 1. Create a file with UTF-8 encoding.

The Unix utility dc is a reverse-Polish calculator that provides a concise way of printing bytes to a file. (Wikipedia notes that when its home Bell Labs received a PDP-11, dc, written in B, was the first language to run on the new computer, even before an assembler.) Here, we use it to output the string “quoted” to stdout:

#! /bin/sh -e

# "quoted" (but with curly quotes)
#
# http://unix.stackexchange.com/a/189810
#
#  character          byte(s), hex
#  -----------------  ------------
#  left_curly_quote   e2  80  9c
#  q                  71
#  u                  75
#  o                  6F
#  t                  74
#  e                  65
#  d                  64
#  right_curly_quote  e2  80  9d

dc<<EOF
16i0
$(printf %sP E2 80 9C 71 75 6F 74 65 64 E2 80 9D)
EOF

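The 16i at the start of 16i0 sets dc's input radix to 16, so the numbers that follow are read as hexadecimal; each P then prints the number before it as a raw byte. With this, we create our UTF-8 encoded file.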

$ ./makeutf8.sh
“quoted”$
$ ./makeutf8.sh > utf8.txt
$ cat utf8.txt
“quoted”$
$
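
If dc is not available, the same file can be produced from Erlang itself. The following is a minimal sketch, not part of the original sandbox code: the module name make_utf8 and the functions bytes/0 and write/1 are made up for illustration. It relies only on binary syntax and file:write_file/2.

```erlang
%% Hypothetical helper module; the name and functions are illustrative only.
-module(make_utf8).

-export([bytes/0, write/1]).

%% The same 12 bytes the dc script prints:
%% left curly quote (E2 80 9C), "quoted", right curly quote (E2 80 9D).
bytes() ->
    <<16#E2, 16#80, 16#9C, "quoted", 16#E2, 16#80, 16#9D>>.

%% Write the bytes to Path; crashes with a badmatch if the write fails.
write(Path) ->
    ok = file:write_file(Path, bytes()).
```

Calling make_utf8:write("utf8.txt") from the Erlang shell should leave a file with the same contents as ./makeutf8.sh > utf8.txt.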

Step 2. Read the file with an Erlang program.

The use of io:format to dump the hex value of the bytes Erlang has stored in memory is courtesy of Hynek -Pichi- Vychodil on Stack Overflow: http://stackoverflow.com/a/3771421

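-module(file_read_file).

-export([start/0]).

%% Print each byte of Bin as two uppercase hex digits, one byte per line.
dump(Bin) ->
    io:format("~s",
              [[io_lib:format("~2.16.0B~n", [X]) || <<X:8>> <= Bin]]).

%% Read the whole file into a binary, then dump its bytes.
start() ->
    {ok, Bin} = file:read_file("utf8.txt"), dump(Bin).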

Step 3. Verify the bytes Erlang stores in memory match the file contents.

$ erlc file_read_file.erl
$ erl -pa . -s file_read_file
Erlang/OTP 19 [erts-8.0.1] [source-ca40008] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]

E2
80
9C
71
75
6F
74
65
64
E2
80
9D
Eshell V8.0.1  (abort with ^G)
1> q().
ok
2> $

The output matches the sequence of bytes we fed to dc above, so Erlang’s file:read_file/1 reads the bytes exactly as they were written to disk and thus can be used to read UTF-8 encoded files.
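
Instead of comparing the hex dump by eye, the check can be done in code. This is a rough sketch under the assumption that the file should hold exactly the twelve bytes listed in Step 1; the module name verify_bytes and the function check/1 are hypothetical.

```erlang
%% Hypothetical helper module; the name and function are illustrative only.
-module(verify_bytes).

-export([check/1]).

%% Compare the bytes in the file at Path with the bytes from Step 1.
%% Returns match, or {mismatch, Actual} with the bytes that were read.
check(Path) ->
    Expected = <<16#E2, 16#80, 16#9C, "quoted", 16#E2, 16#80, 16#9D>>,
    {ok, Actual} = file:read_file(Path),
    case Actual of
        Expected -> match;       % Expected is already bound, so this compares
        _ -> {mismatch, Actual}
    end.
```

With the file from Step 1 in place, verify_bytes:check("utf8.txt") should return match.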

Notes

file:read_file/1 works for any byte sequence.

There is nothing special about the UTF-8 encoding here. The Erlang function file:read_file/1 reads in whatever byte sequence is in the file, without interpreting it.

If we read a Mac OS Roman encoded file, which uses the byte 0xD2 to represent the left double quote and 0xD3 for the right double quote, the bytes Erlang prints match the bytes in the file.

$ cp mac_os_roman.txt utf8.txt
$ cat utf8.txt |od -v -An -t x1
           d2  71  75  6f  74  65  64  d3  0a

$ erl -pa . -s file_read_file -s init stop
Erlang/OTP 19 [erts-8.0.1] [source-ca40008] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]

D2
71
75
6F
74
65
64
D3
0A
Eshell V8.0.1  (abort with ^G)
1> $
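
Because file:read_file/1 accepts any bytes, checking whether a binary is actually valid UTF-8 is a separate step. One way to sketch that check is with unicode:characters_to_list/2 from the standard library; the module name utf8_decode and the function codepoints/1 are hypothetical, and the error atoms are my own choice.

```erlang
%% Hypothetical helper module; the name and function are illustrative only.
-module(utf8_decode).

-export([codepoints/1]).

%% Try to decode Bin as UTF-8.
%% Returns {ok, ListOfCodePoints} or {error, Reason}.
codepoints(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) ->
            {ok, Chars};
        {error, _Decoded, _Rest} ->
            {error, invalid_utf8};      % a byte sequence that is not UTF-8
        {incomplete, _Decoded, _Rest} ->
            {error, incomplete_utf8}    % input ends in the middle of a character
    end.
```

For the bytes of utf8.txt from Step 1 this returns {ok,[8220,113,117,111,116,101,100,8221]}, where 8220 and 8221 are the code points U+201C and U+201D. For the Mac OS Roman bytes above it returns an error tuple, since 0xD2 followed by 0x71 is not a valid UTF-8 sequence.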

I’m sticking with binaries.

It looks like file:read_file/1 hands back the file as a binary, which is just a sequence of bytes. That is how C treats strings, and I know that works just fine with encoded strings. I also know that appending binaries in Erlang is fast. So until I learn more about how Erlang strings work, I’ll keep my text data in binary form because I know that works.
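
As a small illustration of working with text in binary form, here is a sketch that wraps a UTF-8 binary in curly quotes using binary syntax. The module name bin_quote and the function wrap/1 are hypothetical; the /utf8 type specifier is standard Erlang and encodes a code point as its UTF-8 bytes.

```erlang
%% Hypothetical helper module; the name and function are illustrative only.
-module(bin_quote).

-export([wrap/1]).

%% Build a new binary: U+201C, then Text, then U+201D, all UTF-8 encoded.
%% Text is assumed to already be a UTF-8 encoded binary.
wrap(Text) when is_binary(Text) ->
    <<16#201C/utf8, Text/binary, 16#201D/utf8>>.
```

Calling bin_quote:wrap(<<"quoted">>) yields the same twelve bytes that are in utf8.txt.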


If you see an error or something that could be improved, please let me know. This is a blog about me learning, so I expect I will get some stuff wrong. The best way to reach me is by email: mkbucc1234@gmail.com (after deleting all the numbers).
