Remove BOM from string read from UTF-8 file

Published: Wed, 26 Dec 2007 23:19:00 GMT; 4 comments

Hi,

I read some text from a utf-8 encoded text file like this:

text = codecs.open('example.txt','r','utf8').read()

If I pass this text to a COM object, I can see that there is still the BOM
in the file, which marks the file as UTF-8. Simply removing the first
character in the string is not OK, because the BOM is optional. So I tried
something like this:

if text.startswith(codecs.BOM_UTF8):
    print "found BOM"

but then I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

What's the right way to remove the BOM from the string?

regards,

Achim

All Comments

  • 4 Comments
    • >>>>> "Achim Domma" <domma.python.todaysummary.com.procoders.net> (AD) wrote:

      AD> Hi,

      AD> I read some text from a utf-8 encoded text file like this:

      AD> text = codecs.open('example.txt','r','utf8').read()

      AD> If I pass this text to a COM object, I can see that there is still the BOM

      AD> in the file, which marks the file as utf-8. Simply removing the first

      AD> character in the string is not ok, because the BOM is optional. So I tried

      AD> something like this:

      The BOM is in the file, but not in the string 'text'. text is a unicode
      string, which consists of Unicode characters, and the BOM is not a
      Unicode character. Check text[0] and len(text) to verify.

      Moreover, BOM_UTF8 is a (non-ASCII) byte string, not a Unicode string;
      that is the reason for the complaint.

      --

      Piet van Oostrum <piet.python.todaysummary.com.cs.uu.nl>

      URL: http://www.cs.uu.nl/~piet [PGP]

      Private email: P.van.Oostrum.python.todaysummary.com.hccnet.nl

      #1; Wed, 26 Dec 2007 23:20:00 GMT
    • "Piet van Oostrum" <piet.python.todaysummary.com.cs.uu.nl> wrote in message

      news:wzoerkinig.fsf.python.todaysummary.com.Ordesa.local...

      > Check text[0] and len(text) to verify.

      That's what I did. The file contains 24 Chinese characters and len(text)
      is 25. And 0xef is the hex code for the BOM if I'm not completely wrong.

      Achim

      #2; Wed, 26 Dec 2007 23:21:00 GMT
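
The observation above (24 characters, but a length of 25) can be reproduced without a real file. The following is a minimal sketch, assuming Python 3 semantics for bytes and text; the sample content is hypothetical. It suggests that `codecs.BOM_UTF8` is a three-byte byte string, and that decoding it with the plain `utf-8` codec leaves a single character, U+FEFF, at the start of the text.

```python
import codecs

# codecs.BOM_UTF8 is a byte string: the three bytes EF BB BF.
bom_bytes = codecs.BOM_UTF8

# Hypothetical file content: a UTF-8 BOM followed by 24 Chinese characters.
raw = bom_bytes + (u'\u4e2d' * 24).encode('utf-8')

# The plain 'utf-8' codec does not drop the BOM; it decodes it to U+FEFF.
text = raw.decode('utf-8')

# Decode the BOM bytes first so that text is compared with text.
bom_char = bom_bytes.decode('utf-8')
has_bom = text.startswith(bom_char)
```

Comparing the decoded text against the decoded BOM character avoids mixing byte strings and Unicode strings, which is what triggered the error in the original attempt.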
    • I found myself often needing to read text files that might be utf-8, unicode

      or ansi, without knowing beforehand which, so I wrote a single function to

      do it. I don't know if this is the correct way to handle this situation,

      but I couldn't find any function that would simply open a file with the

      appropriate codec automatically, so I use this (it doesn't handle all cases,

      but just the ones I've needed so far):

      import os, codecs

      #------------------------------------------------------------------------
      # OpenTextFile()
      #
      # Opens a file correctly whether it is unicode or ansi. If the file
      # doesn't exist or has no BOM, the encoding argument is used as given.
      #
      # Python documentation of the codecs module is pretty weak; for instance
      # there are all these:
      # BOM
      # BOM_BE
      # BOM_LE
      # BOM_UTF8
      # BOM_UTF16
      # BOM_UTF16_BE
      # BOM_UTF16_LE
      # BOM_UTF32
      # BOM_UTF32_BE
      # BOM_UTF32_LE
      # but no explanation of how they map to the encodings like 'utf-16'. Some
      # can be inferred, but some are not so clear.
      #------------------------------------------------------------------------
      def OpenTextFile(filename, mode='r', encoding=None):
          if os.path.isfile(filename):
              # Read just the first four bytes, enough for any BOM.
              f = open(filename, 'rb')
              header = f.read(4)
              f.close()
              # Don't change this to a map, because it is ordered!!!
              # UTF-32 comes first because its little-endian BOM starts
              # with the UTF-16 little-endian BOM.
              encodings = [ ( codecs.BOM_UTF32, 'utf-32' ),
                            ( codecs.BOM_UTF16, 'utf-16' ),
                            ( codecs.BOM_UTF8, 'utf-8' ) ]
              for h, e in encodings:
                  if header.startswith(h):
                      encoding = e
                      break
          return codecs.open(filename, mode, encoding)

      #3; Wed, 26 Dec 2007 23:22:00 GMT
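
The BOM-sniffing idea in the comment above can be sketched with explicit byte orders. `codecs.BOM_UTF16` and `codecs.BOM_UTF32` are the native-endian values for the machine running the code, so a file written with the other byte order would not match them. The sketch below is one possible variant, not the commenter's function: the name `sniff_encoding` and the table `_BOM_TABLE` are hypothetical, and it returns `utf-8-sig` for UTF-8 on the assumption that the caller wants the BOM dropped while decoding.

```python
import codecs

# UTF-32 is tested before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# starts with the UTF-16 LE BOM (FF FE).
_BOM_TABLE = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_encoding(header, default=None):
    """Return a codec name based on the BOM at the start of header.

    header is the first few bytes of a file; default is returned
    when no known BOM is found.
    """
    for bom, name in _BOM_TABLE:
        if header.startswith(bom):
            return name
    return default
```

The generic `utf-16` and `utf-32` codecs read the BOM to pick the byte order and remove it from the decoded text, so one codec name covers both byte orders. A file without a BOM still needs an explicit fallback encoding.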
    • >>>>> "Achim Domma" <domma.python.todaysummary.com.procoders.net> (AD) wrote:

      AD> "Piet van Oostrum" <piet.python.todaysummary.com.cs.uu.nl> wrote in message

      AD> news:wzoerkinig.fsf.python.todaysummary.com.Ordesa.local...

      >> Check text[0] and len(text) to verify.

      AD> That's what I did. The file contains 24 Chinese characters and len(text) is

      AD> 25. And 0xef is the hex code for the BOM if I'm not completely wrong.

      Sorry, I was wrong.

      You have to check for text.startswith(u'\ufeff')

      --

      Piet van Oostrum <piet.python.todaysummary.com.cs.uu.nl>

      URL: http://www.cs.uu.nl/~piet [PGP]

      Private email: P.van.Oostrum.python.todaysummary.com.hccnet.nl

      #4; Wed, 26 Dec 2007 23:23:00 GMT
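
The check suggested in the last comment can be turned into a small helper. This is a minimal sketch, assuming Python 3 (where the built-in `open` accepts an `encoding` argument); the helper name `strip_bom` and the temporary sample file are hypothetical. It also shows the standard `utf-8-sig` codec, which drops a leading BOM during decoding if one is present.

```python
import codecs
import os
import tempfile

BOM = u'\ufeff'


def strip_bom(text):
    """Return text without a leading U+FEFF, if one is present."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text


# Write a small sample file that starts with a UTF-8 BOM.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF8 + u'h\u00e9llo'.encode('utf-8'))

# Option 1: decode as plain UTF-8, then strip the BOM character.
with open(path, 'r', encoding='utf-8') as f:
    stripped = strip_bom(f.read())

# Option 2: let the 'utf-8-sig' codec drop the BOM while decoding.
with open(path, 'r', encoding='utf-8-sig') as f:
    via_sig = f.read()

os.remove(path)
```

Both options leave text without a BOM untouched, so they are safe when the BOM is optional.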