August 17, 2018

Scanning text with a UTF-8 BOM using Go

Using bufio.NewScanner for reading text files

There are many ways to read files in Go/Golang, but the one I prefer is this:

myFile, err := os.Open(filename)
check(err)
defer myFile.Close()

scanner := bufio.NewScanner(myFile)
scanner.Split(bufio.ScanLines)  // default; shown for clarity

for scanner.Scan() {
    line := scanner.Text()
    // do something with line ...
}
check(scanner.Err())  // Scan also returns false on a read error, not just at EOF

This pattern has the advantages of being:

  1. easy to read and understand
  2. memory efficient as it doesn’t load the whole file at once
  3. flexible as you can specify how to split (e.g. bufio.ScanLines)
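To illustrate the third point, here is a minimal sketch that swaps the split function to bufio.ScanWords so the same loop walks the input word by word. The countWords helper and the sample input are my own illustration, not part of the tool described in this post.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countWords is a hypothetical helper: it reuses the scanner loop but
// splits on whitespace-separated words instead of lines.
func countWords(text string) int {
	scanner := bufio.NewScanner(strings.NewReader(text))
	scanner.Split(bufio.ScanWords)

	count := 0
	for scanner.Scan() {
		count++
	}
	return count
}

func main() {
	fmt.Println(countWords("#GET Lamp\n#GET Key\n")) // prints 4
}
```

Only the Split call changes; the loop itself is identical to the line-by-line version.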

The problem with UTF-8 and BOM

I have a variety of text processing tools of my own devising, and one of them reads the definition of an interactive fiction book and produces a range of outputs from it (ebooks, RTF, etc.).

To do this it understands directives within the file (e.g. #GET Lamp), but I was finding that any such directives on the first line of the file were not recognised.

I use Visual Studio Code on the Mac for my Go development (with some great plugins) as it’s faster and lighter than Atom or GoLand. When I opened my text file (which I started under Windows), VS Code showed that it was UTF-8 with BOM (Byte Order Mark).

A BOM is a special prefix in a text file which is allowed (but not recommended) by the relevant IETF spec. In UTF-8 it is the three-byte encoding of the code point U+FEFF: 239, 187, 191 or, in hex, 0xEF, 0xBB, 0xBF. As these bytes sit at the start of the first line, that line no longer starts with my directive.
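The effect is easy to reproduce in a few lines. This is a small sketch, assuming a directive is detected with a simple prefix test; startsWithDirective is a hypothetical stand-in for whatever check the real tool performs.

```go
package main

import (
	"fmt"
	"strings"
)

// utf8BOM is U+FEFF, which Go encodes in a string as 0xEF, 0xBB, 0xBF.
const utf8BOM = "\uFEFF"

// startsWithDirective is a hypothetical check: a directive line is
// assumed to begin with '#'.
func startsWithDirective(line string) bool {
	return strings.HasPrefix(line, "#")
}

func main() {
	firstLine := utf8BOM + "#GET Lamp"

	fmt.Println([]byte(utf8BOM))                  // prints [239 187 191]
	fmt.Println(startsWithDirective(firstLine))   // prints false
	fmt.Println(startsWithDirective("#GET Lamp")) // prints true
}
```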

Whilst Go has other file handling methods that know about the BOM, I like the scanner way. But how do you get it to work with these files?

Stripping the UTF-8 BOM whilst scanning the file

The easy solution is to check if you are on the first line and, if so, remove any BOM:

myFile, err := os.Open(filename)
check(err)
defer myFile.Close()

scanner := bufio.NewScanner(myFile)
scanner.Split(bufio.ScanLines)

lineNumber := 1
BOM := string([]byte{239, 187, 191})  // UTF-8 specific

for scanner.Scan() {
    line := scanner.Text()
    if lineNumber == 1 {
        line = strings.TrimPrefix(line, BOM)
    }

    // do something ...

    lineNumber++
}
check(scanner.Err())

On line 1, strip the BOM; if there is none, strings.TrimPrefix leaves the line unchanged. With this small change, the directives are now recognised on the first line.
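An alternative, if you would rather keep the loop free of line counting, is to drop the BOM before the scanner ever sees it. The sketch below is one possible way to do that with bufio.Reader's Peek and Discard; skipBOM, readLines and readLinesFromString are hypothetical helper names, not something from my tool or the standard library.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// skipBOM is a hypothetical helper. It wraps r in a bufio.Reader and
// discards a leading UTF-8 BOM if one is present.
func skipBOM(r io.Reader) (io.Reader, error) {
	br := bufio.NewReader(r)
	head, err := br.Peek(3)
	if err != nil && err != io.EOF { // io.EOF just means fewer than 3 bytes
		return nil, err
	}
	if bytes.Equal(head, []byte{0xEF, 0xBB, 0xBF}) {
		if _, err := br.Discard(3); err != nil {
			return nil, err
		}
	}
	return br, nil
}

// readLines scans r line by line after any BOM has been skipped.
func readLines(r io.Reader) ([]string, error) {
	clean, err := skipBOM(r)
	if err != nil {
		return nil, err
	}
	var lines []string
	scanner := bufio.NewScanner(clean)
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	return lines, scanner.Err()
}

// readLinesFromString is a convenience wrapper for trying this out.
func readLinesFromString(text string) ([]string, error) {
	return readLines(strings.NewReader(text))
}

func main() {
	lines, err := readLinesFromString("\uFEFF#GET Lamp\nsecond line\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", lines) // prints ["#GET Lamp" "second line"]
}
```

In real use you would pass the opened file to readLines instead of a string. The trade-off is an extra buffered reader in exchange for a loop that no longer needs to know which line it is on.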

It was a very specific use case for me, but the general principle applies surprisingly often. For example, systemd service files will usually fail to run with a BOM at the start.

Note that if you re-output the now BOM-less text, the standards recommend that you round-trip the BOM, as other software may need it to work correctly. My workflow doesn’t require it, so I can simply remove it.
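If you do need to preserve it, one approach is to remember whether the input had a BOM and write it back before the first line of output. This is only a sketch under my own assumptions: roundTrip is a hypothetical function, and upper-casing each line stands in for whatever processing you actually do.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

const utf8BOM = "\uFEFF" // encoded as 0xEF, 0xBB, 0xBF

// roundTrip is a hypothetical example: it upper-cases every line and
// re-emits the BOM only if the input started with one.
func roundTrip(input string) (string, error) {
	var out bytes.Buffer
	scanner := bufio.NewScanner(strings.NewReader(input))

	first := true
	for scanner.Scan() {
		line := scanner.Text()
		if first {
			first = false
			if strings.HasPrefix(line, utf8BOM) {
				line = strings.TrimPrefix(line, utf8BOM)
				out.WriteString(utf8BOM) // put the BOM back on the output
			}
		}
		out.WriteString(strings.ToUpper(line))
		out.WriteString("\n")
	}
	return out.String(), scanner.Err()
}

func main() {
	out, err := roundTrip("\uFEFFlamp\nkey\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(strings.HasPrefix(out, utf8BOM)) // prints true
}
```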