How to read a UTF-8 encoded file with Erlang
January 18, 2017
In this HOWTO, we create a file with UTF-8 encoded text, read it with an Erlang program, and verify that the bytes Erlang stores in memory match what is on disk.
Tools used in this tutorial:
- Erlang/OTP 19
- dc
All of the code used in this blog can be found at https://github.com/mbucc/markbucciarelli.com/tree/master/sandbox/utf8.
Step 1. Create a file with UTF-8 encoding.
The Unix utility dc is a reverse-polish calculator that provides a nice, concise way of printing bytes to a file. Here, we use it to output the string “quoted” to stdout:
“When its home Bell Labs received a PDP-11, dc—written in B—was the first language to run on the new computer, even before an assembler.” (Wikipedia)
#! /bin/sh -e
# "quoted" (but with curly quotes)
#
# http://unix.stackexchange.com/a/189810
#
# character          byte(s), hex
# -----------------  ------------
# left_curly_quote   e2 80 9c
# q                  71
# u                  75
# o                  6F
# t                  74
# e                  65
# d                  64
# right_curly_quote  e2 80 9d
dc<<EOF
16i0
$(printf %sP E2 80 9C 71 75 6F 74 65 64 E2 80 9D)
EOF
The 16i in 16i0 sets dc's input radix to 16, so the numbers that follow are read as hexadecimal, and the P after each number prints it as a raw byte. With this, we create our UTF-8 encoded file.
$ ./makeutf8.sh
“quoted”$
$ ./makeutf8.sh > utf8.txt
$ cat utf8.txt
“quoted”$
$
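For comparison, the same twelve bytes can be written from Erlang itself. This is only a sketch and not part of the original sandbox code: the module name make_utf8 and its functions are hypothetical, and it assumes that file:write_file/2 writes a binary to disk unchanged, just as the dc script prints its bytes unchanged.

```erlang
-module(make_utf8).
-export([bytes/0, write/1]).

%% Hypothetical helper (not from the original post).
%% The same 12 bytes the dc script prints: "quoted" in curly quotes, as UTF-8.
bytes() ->
    <<16#E2, 16#80, 16#9C,     % left curly quote, U+201C
      "quoted",                % q u o t e d, one byte each
      16#E2, 16#80, 16#9D>>.   % right curly quote, U+201D

%% Write the bytes to Path as-is; returns ok or {error, Reason}.
write(Path) ->
    file:write_file(Path, bytes()).
```

Calling make_utf8:write("utf8.txt") from the Erlang shell should leave a file with the same contents as the one created by makeutf8.sh.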
Step 2. Read the file with an Erlang program.
The use of io:format to dump the hex value of the bytes Erlang has stored in memory is courtesy of Hynek -Pichi- Vychodil on Stack Overflow: http://stackoverflow.com/a/3771421
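-module(file_read_file).
-export([start/0]).

%% Print each byte of Bin as two uppercase hex digits, one byte per line.
dump(Bin) ->
    io:format("~s",
              [[io_lib:format("~2.16.0B~n", [X]) || <<X:8>> <= Bin]]).

%% Read the whole file into a binary and dump its bytes.
start() ->
    {ok, Bin} = file:read_file("utf8.txt"),
    dump(Bin).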
Step 3. Verify the bytes Erlang stores in-memory match the file contents.
$ erlc file_read_file.erl
$ erl -pa . -s file_read_file
Erlang/OTP 19 [erts-8.0.1] [source-ca40008] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]
E2
80
9C
71
75
6F
74
65
64
E2
80
9D
Eshell V8.0.1 (abort with ^G)
1> q().
ok
2> $
The output matches the sequence of bytes input to dc above, so Erlang’s file:read_file/1 reads in bytes exactly as written to disk and thus can be used to read UTF-8 encoded files.
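Having the right bytes in memory is not the same as having characters. As a rough sketch of the next step, the standard unicode module can decode the binary into a list of Unicode code points. The module name utf8_decode and its functions are hypothetical names chosen for this example, and sizes/1 assumes its argument is valid UTF-8.

```erlang
-module(utf8_decode).
-export([code_points/1, sizes/1]).

%% Hypothetical helper (not from the original post).
%% Decode a UTF-8 binary into a list of Unicode code points.
%% If Bin is not valid UTF-8, the error or incomplete tuple from
%% unicode:characters_to_list/2 is returned unchanged.
code_points(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin, utf8).

%% Return {NumberOfBytes, NumberOfCodePoints}.
%% Assumes Bin is valid UTF-8; crashes otherwise.
sizes(Bin) when is_binary(Bin) ->
    {byte_size(Bin), length(code_points(Bin))}.
```

For the file above, the twelve bytes decode to eight code points: 8220 (U+201C) for the left curly quote, the six ASCII letters, and 8221 (U+201D) for the right curly quote.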
Notes
file:read_file/1 works for any byte sequence.
There is nothing special about the UTF-8 encoding here. The Erlang function file:read_file/1 reads in whatever byte sequence is in the file.
If we read a Mac OS Roman encoded file, which uses the byte 0xD2 to represent the left double quote and 0xD3 for the right double quote, the bytes output match what is input.
$ cp mac_os_roman.txt utf8.txt
$ cat utf8.txt |od -v -An -t x1
d2 71 75 6f 74 65 64 d3 0a
$ erl -pa . -s file_read_file -s init stop
Erlang/OTP 19 [erts-8.0.1] [source-ca40008] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]
D2
71
75
6F
74
65
64
D3
0A
Eshell V8.0.1 (abort with ^G)
1> $
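Because file:read_file/1 does not check the encoding, a program that needs UTF-8 has to validate the bytes itself. One way to sketch that check, assuming unicode:characters_to_list/2 is an acceptable validator, is below. The module name utf8_check and the function is_valid_utf8/1 are hypothetical.

```erlang
-module(utf8_check).
-export([is_valid_utf8/1]).

%% Hypothetical helper (not from the original post).
%% True if Bin decodes cleanly as UTF-8, false otherwise.
is_valid_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) -> true;   % decoded completely
        {error, _Decoded, _Rest} -> false;   % invalid byte sequence
        {incomplete, _Decoded, _Rest} -> false  % truncated sequence at the end
    end.
```

With the Mac OS Roman bytes above, the check fails at the first byte: 0xD2 looks like the start of a two-byte UTF-8 sequence, but the 0x71 that follows is not a continuation byte.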
I’m sticking with binaries.
It looks like file:read_file/1 hands back the file as an Erlang binary, which is just a sequence of bytes. That is how C treats strings, and I know that works fine with encoded text. I also know that appending to binaries in Erlang is fast. So until I learn more about how Erlang strings work, I’ll keep my text data in binary form because I know that works.
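As a small illustration of working with text in binary form, here is a sketch that builds up a binary by appending chunks to an accumulator. The module name bin_append and the function join/1 are hypothetical. The sketch only shows the append syntax and makes no measured claim about speed.

```erlang
-module(bin_append).
-export([join/1]).

%% Hypothetical helper (not from the original post).
%% Concatenate a list of binaries by appending each chunk to the end
%% of an accumulator, starting from the empty binary.
join(Chunks) when is_list(Chunks) ->
    lists:foldl(fun(Chunk, Acc) -> <<Acc/binary, Chunk/binary>> end,
                <<>>,
                Chunks).
```

Because the chunks are appended byte for byte, joining the three UTF-8 pieces of the test string (left quote, "quoted", right quote) gives back the same twelve bytes that are in utf8.txt.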
Tags: erlang