Scanning text with a UTF-8 BOM using Go.

17th August 2016

Using bufio.NewScanner for reading text files.

There are many ways to read files in Go/Golang, but the one I prefer is this:

myFile, err := os.Open(filename)
check(err)
defer myFile.Close()

scanner := bufio.NewScanner(myFile)
scanner.Split(bufio.ScanLines)  // default; shown for clarity

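for scanner.Scan() {
    line := scanner.Text()

    // do something with line ...

}
// Scan returns false on a read error (or an over-long line) as well as at end of file.
check(scanner.Err())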

This pattern has the advantages of being:

  1. easy to read and understand
  2. memory efficient as it doesn’t load the whole file at once
  3. flexible as you can specify how to split (e.g. bufio.ScanLines)

The problem with UTF-8 and BOM.

I have a variety of text processing tools of my own devising, and one of them reads the definition of an interactive fiction book and produces a range of outputs from it (ebooks, RTF etc).

To do this it understands directives within the file (e.g. #GET Lamp), but I was finding that any directive on the first line of the file was not recognised.

I use Visual Studio Code on the Mac for my Go development (with some great plugins) as it’s faster and lighter than Atom and WebStorm/PyCharm/Gogland. When I opened my text file (which I started under Windows) VS Code showed that it was UTF-8 with BOM - a Byte Order Mark.

A BOM is a special prefix in a text file which is allowed (but not recommended) by the relevant IETF spec, RFC 3629. In UTF-8 it’s the three-byte encoding of the character U+FEFF: 239, 187, 191 or, in hex, 0xEF, 0xBB, 0xBF. As these bytes were at the start of the first line, that line no longer started with my directive.
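The effect is easy to reproduce in a few lines. This is only a sketch: isDirective is a hypothetical stand-in for whatever check the real tool performs, and I'm assuming directives are detected by a leading "#".

```go
package main

import (
	"fmt"
	"strings"
)

// utf8BOM is the three-byte UTF-8 encoding of U+FEFF.
const utf8BOM = "\xEF\xBB\xBF"

// isDirective is a hypothetical stand-in for the tool's real check;
// it assumes a directive is any line beginning with '#'.
func isDirective(line string) bool {
	return strings.HasPrefix(line, "#")
}

func main() {
	first := utf8BOM + "#GET Lamp" // first line of a "UTF-8 with BOM" file

	fmt.Println([]byte(first[:3]))                                // [239 187 191]
	fmt.Println(isDirective(first))                               // false
	fmt.Println(isDirective(strings.TrimPrefix(first, utf8BOM))) // true
}
```

The BOM is invisible in most editors, which is why the first line looks correct on screen yet fails the prefix test.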

Whilst Go has other file handling methods that know about the BOM, I like the scanner way. But how do you get it to work with these files?

Stripping the UTF-8 BOM whilst scanning the file.

The easy solution is to check if you are on the first line and, if so, remove any BOM:

myFile, err := os.Open(filename)
check(err)
defer myFile.Close()

scanner := bufio.NewScanner(myFile)
scanner.Split(bufio.ScanLines)

lineNumber := 1
BOM := string([]byte{239, 187, 191})  // UTF-8 specific

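for scanner.Scan() {
    line := scanner.Text()
    if lineNumber == 1 {
        // TrimPrefix returns the line unchanged when there is no BOM
        line = strings.TrimPrefix(line, BOM)
    }

    // do something ...

    lineNumber++
}
// Scan returns false on a read error (or an over-long line) as well as at end of file.
check(scanner.Err())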

If it is line 1 then strip the BOM. If there is no BOM the line is not affected. With this small change, the directives are now recognised on the first line.

Note that if you then write that text back out, the standard recommends that you round-trip the BOM, as other software may need it to work correctly. My workflow doesn’t require it, so I can simply remove it.