
minidom xml & non ascii / unicode & files


4 Comments; published: Sun, 30 Dec 2007 22:56:00 GMT

lo all,

some of the questions i'll ask below have most certainly been discussed

already, i just hope someone's kind enough to answer them again to help

me out..

so i started a python 2.3 script that grabs some web pages from the web,

regex-parses the data and stores it locally in an xml file for further use..

at first i had no problem using python minidom and everything concerning

my regex/xml processing works fine, until i tested my tool on some

french page with "non ascii" chars and my script started to throw errors

all over the place..

I've looked into the matter and discovered the unicode / string encoding

processes implied when dealing with non ascii texts and i must say i

almost lost my mind.. I'm losing it actually..

so here are the few questions i'd like to have answers for :

1. when fetching a web page from the net, how am i supposed to know how

it's encoded.. And can i decode it to unicode and encode it back to a

byte string so i can use it in my code, with the charsets i want, like

utf-8.. ?

2. in the same idea could anyone try to post the few lines that would

actually parse an xml file, with non ascii chars, with minidom

(parseString i guess).

Then convert a string grabbed from the net so parts of it can be

inserted in that dom object into new nodes or existing nodes.

And finally write that dom object back to a file in a way it can be used

again later with the same script..

I've been trying to do that for a few days with no luck..

I can do each separate part of the job, not that i'm quite sure how i

decode/encode stuff in there, but as soon as i try to do everything at

the same time i get encoding errors thrown all the time..

3. in order to help me understand what's going on when doing

encodes/decodes could you please tell me if in the following example, s

and backToBytes are actually the same thing ??

s = "hello normal string"

u = unicode( s, "utf-8" )

backToBytes = u.encode( "utf-8" )

i know they both are bytestrings but i doubt they have actually the same

content..

4. I've also tried to set the default encoding of python for my script

using the sys.setdefaultencoding('utf-8') but it keeps telling me that

this module does not have that method.. i'm left no choice but to edit

the site.py file manually to change "ascii" to "utf-8", but i won't be

able to do that on the client computers so..

Anyways i don't know if it would help my script at all..

any help will be greatly appreciated

thx

Marc

All Comments


  • 4 Comments
    • webdev wrote:

      > lo all,

      > some of the questions i'll ask below have most certainly been discussed

      > already, i just hope someone's kind enough to answer them again to help

      > me out..

      > so i started a python 2.3 script that grabs some web pages from the web,

      > regex-parses the data and stores it locally in an xml file for further use..

      > at first i had no problem using python minidom and everything concerning

      > my regex/xml processing works fine, until i tested my tool on some

      > french page with "non ascii" chars and my script started to throw errors

      > all over the place..

      > I've looked into the matter and discovered the unicode / string encoding

      > processes implied when dealing with non ascii texts and i must say i

      > almost lost my mind.. I'm losing it actually..

      The general idea is:

      - convert everything that's coming in (from the net, database, files) into

      unicode

      - do all your processing with unicode strings

      - encode the strings to your preferred/the required encoding when you write

      it to the net/database/file
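
A minimal sketch of that decode / process / encode pattern. It is written for Python 3, which is an assumption of this example rather than something from the thread: there `bytes` plays the role of the byte strings discussed here and `str` the role of unicode objects. The `process` function and its sample input are hypothetical.

```python
def process(raw, source_encoding, target_encoding="utf-8"):
    """Decode on input, work on text, encode only on output."""
    text = raw.decode(source_encoding)    # bytes -> text at the boundary
    text = text.strip().upper()           # all processing happens on text
    return text.encode(target_encoding)   # text -> bytes only when writing out

# "café" surrounded by spaces, as it might arrive from an iso-8859-1 page
latin1_bytes = b" caf\xe9 "
result = process(latin1_bytes, "iso-8859-1")
print(result)
```

The only places that mention an encoding are the first and last lines of `process`; everything in between is encoding-agnostic.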

      > so here are the few questions i'd like to have answers for :

      > 1. when fetching a web page from the net, how am i supposed to know how

      > it's encoded.. And can i decode it to unicode and encode it back to a

      > byte string so i can use it in my code, with the charsets i want, like

      > utf-8.. ?

      First look at the HTTP 'Content-Type' header. If it has a parameter

      'charset', that's the encoding to use, e.g.

      Content-Type: text/html; charset=iso-8859-1

      If there's no encoding specified in the header, look at the <?xml .. ?>

      prolog, if you have an XHTML document at hand (and it's present). Look below

      for the syntax.

      The last fallback is the <meta http-equiv="Content-Type" content="..."> tag.

      The content attribute has the same format as the HTTP header.
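
The lookup order described above (HTTP header, then XML prolog, then meta tag) could be sketched roughly as follows. This is Python 3 and only an illustration: the function name `sniff_charset`, the 2048-byte window, the regular expressions and the `default` fallback are all assumptions of this sketch, and it does none of the guessing that browsers do for badly declared pages.

```python
import re

def sniff_charset(content_type, body, default="iso-8859-1"):
    """Return the declared charset: header first, then XML prolog, then meta tag.

    content_type: value of the HTTP Content-Type header, or None
    body:         the raw page as bytes
    default:      fallback chosen for this sketch when nothing is declared
    """
    # 1. charset parameter of the HTTP Content-Type header
    if content_type:
        m = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', content_type, re.I)
        if m:
            return m.group(1).lower()
    head = body[:2048]  # declarations sit near the top of the document
    # 2. <?xml version="1.0" encoding="..."?> prolog
    m = re.match(rb'\s*<\?xml[^>]*encoding\s*=\s*["\']([\w.:-]+)["\']', head)
    if m:
        return m.group(1).decode("ascii").lower()
    # 3. <meta http-equiv="Content-Type" content="...; charset=...">
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    if m:
        return m.group(1).decode("ascii").lower()
    return default

print(sniff_charset("text/html; charset=iso-8859-1", b""))
```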

      But you can still run into UnicodeDecodeErrors, because many websites just

      don't get their encoding issues right. Browsers do some (more or less)

      educated guesses and often manage to display the document as intended.

      You should probably use htmlData.decode(encoding, "ignore") or

      htmlData.decode(encoding, "replace") to work around these problems (but

      lose some characters).
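
To see what the "ignore" and "replace" error handlers do when a page's bytes don't match its declared charset, here is a small Python 3 sketch (the sample bytes are made up: an iso-8859-1 "é" inside data that is being decoded as utf-8):

```python
# bytes that are supposed to be utf-8 but contain a stray iso-8859-1 byte (0xE9)
html_data = b"caf\xe9 ouvert"

try:
    html_data.decode("utf-8")          # strict decoding is the default
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

replaced = html_data.decode("utf-8", "replace")  # bad byte becomes U+FFFD
ignored = html_data.decode("utf-8", "ignore")    # bad byte is silently dropped

print(strict_ok, repr(replaced), repr(ignored))
```

Either handler keeps the script running, at the price of losing the character that could not be decoded.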

      And, as said above: don't encode the unicode string into bytestrings and

      process the bytestrings in your program - that's a bad idea. Defer the

      encoding until it's absolutely necessary (usually file.write()).

      > 2. in the same idea could anyone try to post the few lines that would

      > actually parse an xml file, with non ascii chars, with minidom

      > (parseString i guess).

      The parser determines the encoding of the file from the <?xml..?> line. E.g.

      if your file is encoded in utf-8, add the line

      <?xml version="1.0" encoding="utf-8"?>

      at the top of it, if it's not already present.

      The parser will then decode everything into unicode strings - all TextNodes,

      attributes etc. should be unicode strings.

      When writing the manipulated DOM back to disk, use toxml() which has an

      encoding argument.
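
A rough sketch of that parse / modify / serialize cycle with minidom. It assumes Python 3 (where the parsed text comes back as `str`), and the element names and sample text are invented for the example:

```python
from xml.dom.minidom import parseString

# utf-8 encoded XML with an encoding declaration and a non-ASCII character
xml_text = '<?xml version="1.0" encoding="utf-8"?><pages><page>Cin\u00e9ma</page></pages>'
xml_bytes = xml_text.encode("utf-8")

dom = parseString(xml_bytes)   # the parser reads the declaration itself
first = dom.getElementsByTagName("page")[0].firstChild.data  # decoded text

# add a node from text that has already been decoded
new_page = dom.createElement("page")
new_page.appendChild(dom.createTextNode("S\u00e9ries"))
dom.documentElement.appendChild(new_page)

out_bytes = dom.toxml(encoding="utf-8")  # encoded bytes, ready to be written
print(out_bytes)
```

Nothing between `parseString` and `toxml` needs to know which encoding the file uses.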

      > Then convert a string grabbed from the net so parts of it can be

      > inserted in that dom object into new nodes or existing nodes.

      > And finally write that dom object back to a file in a way it can be used

      > again later with the same script..

      Just insert the unicode strings.

      > I've been trying to do that for a few days with no luck..

      > I can do each separate part of the job, not that i'm quite sure how i

      > decode/encode stuff in there, but as soon as i try to do everything at

      > the same time i get encoding errors thrown all the time..

      > 3. in order to help me understand what's going on when doing

      > encodes/decodes could you please tell me if in the following example, s

      > and backToBytes are actually the same thing ??

      > s = "hello normal string"

      > u = unicode( s, "utf-8" )

      > backToBytes = u.encode( "utf-8" )

      > i know they both are bytestrings but i doubt they have actually the same

      > content..

      Why not try it yourself?

      "hello normal string" is just US-ASCII. The utf-8 encoded version of the

      unicode string u"hello normal string" will be identical to the ASCII byte

      string "hello normal string".
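
Trying it takes only a few lines. The sketch below uses Python 3 spelling (an assumption of this example; `s.decode("utf-8")` corresponds to `unicode(s, "utf-8")` in the question) and adds one non-ASCII character to show where encodings start to differ:

```python
s = b"hello normal string"           # byte string, plain ASCII
u = s.decode("utf-8")                # text ("unicode") version
back_to_bytes = u.encode("utf-8")    # and back to bytes
same = (s == back_to_bytes)

# with a non-ASCII character the chosen encoding does matter
accented = "\u00e9"                      # the letter é
utf8 = accented.encode("utf-8")          # two bytes
latin1 = accented.encode("iso-8859-1")   # one byte

print(same, utf8, latin1)
```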

      > 4. I've also tried to set the default encoding of python for my script

      > using the sys.setdefaultencoding('utf-8') but it keeps telling me that

      > this module does not have that method.. i'm left no choice but to edit

      > the site.py file manually to change "ascii" to "utf-8", but i won't be

      > able to do that on the client computers so..

      > Anyways i don't know if it would help my script at all..

      There was a discussion of setdefaultencoding() just recently on various

      pythonistic blogs, e.g.

      http://blog.ianbicking.org/python-u...eally-suck.html

      > any help will be greatly appreciated

      > thx

      > Marc

      --

      Benjamin Niemann

      Email: pink at odahoda dot de

      WWW: http://www.odahoda.de/

      #1; Sun, 30 Dec 2007 22:57:00 GMT
    • webdev wrote:

      > 1. when fetching a web page from the net, how am i supposed to know how

      > it's encoded.. And can i decode it to unicode and encode it back to a

      > byte string so i can use it in my code, with the charsets i want, like

      > utf-8.. ?

      It depends on the content type. If the HTTP header declares a charset=

      attribute for content-type, then use that (beware: some web servers

      report the content type incorrectly. To deal with that gracefully,

      you have to implement very complex algorithms, which are part of

      any recent web browser).

      If there is no charset= attribute, then

      - if the content type is text/html, look at a meta http-equiv tag

      in the content. If that declares a charset, use that.

      - if the content type is xml (plain, or xhtml+xml), look at the

      XML declaration. Alternatively, pass it to your XML parser.

      > 2. in the same idea could anyone try to post the few lines that would

      > actually parse an xml file, with non ascii chars, with minidom

      > (parseString i guess).

      doc = xml.dom.minidom.parse("foo.xml")

      > Then convert a string grabbed from the net so parts of it can be

      > inserted in that dom object into new nodes or existing nodes.

      doc.documentElement.setAttribute("bar", text_from_net.decode("koi8-r"))

      > And finally write that dom object back to a file in a way it can be used

      > again later with the same script..

      open("/tmp/foo.txt","wb").write(doc.toxml(encoding="utf-8"))

      > I've been trying to do that for a few days with no luck..

      > I can do each separate part of the job, not that i'm quite sure how i

      > decode/encode stuff in there, but as soon as i try to do everything at

      > the same time i get encoding errors thrown all the time..

      It would help if you would state what precise code you are using,

      and what precise error you are getting (for what precise input).

      > 3. in order to help me understand what's going on when doing

      > encodes/decodes could you please tell me if in the following example, s

      > and backToBytes are actually the same thing ??

      > s = "hello normal string"

      > u = unicode( s, "utf-8" )

      > backToBytes = u.encode( "utf-8" )

      > i know they both are bytestrings but i doubt they have actually the same

      > content..

      They do have the same content. There is nothing to a byte string except

      for the bytes. If the byte string is meant to represent characters,

      they are the same "thing" only if the assumed encoding is the same.

      Since the assumed encoding is "utf-8" for both s and backToBytes,

      they are the same thing.
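
The "assumed encoding" point can be seen by decoding the same bytes two ways. A short Python 3 sketch with made-up sample bytes:

```python
raw = b"\xc3\xa9"   # two bytes; what they mean depends on the assumed encoding

as_utf8 = raw.decode("utf-8")          # one character: é
as_latin1 = raw.decode("iso-8859-1")   # two characters: Ã©

print(len(as_utf8), len(as_latin1))
```

The bytes are identical in both cases; only the interpretation changes, which is why the declared charset of the source has to be right before decoding.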

      > 4. I've also tried to set the default encoding of python for my script

      > using the sys.setdefaultencoding('utf-8') but it keeps telling me that

      > this module does not have that method.. i'm left no choice but to edit

      > the site.py file manually to change "ascii" to "utf-8", but i won't be

      > able to do that on the client computers so..

      Don't do that. It's meant as a last resort for backwards compatibility,

      and shouldn't be used for new code.

      Regards,

      Martin

      #2; Sun, 30 Dec 2007 22:58:00 GMT
    • Thx Martin for your comments.

      indeed the charset of the web document is set in the meta tag, it's

      iso-8859-1 so i'll decode it to unicode using something like:

      html = html.decode('iso-8859-1')

      html then contains the unicode version of the html document

      As i've finally managed to make this work i'll post here my comments on

      the few things i still don't understand, maybe you can explain why it

      works that way with more technical terms than i can provide myself..

      So the whole thing is to regex parse some html document, and store the

      results inside an xml file that can be parsed again by python minidom

      for further use..

      ############### CODE START ###############

      import urllib, string, codecs, types
      import sys, traceback, os.path, re, shutil
      import cachedhttp
      from xml.dom.minidom import parse, parseString

      NODE_ELEMENT = 1
      NODE_ATTRIBUTE = 2
      NODE_TEXT = 3
      NODE_CDATA_SECTION = 4

      # Some xml helper functions (defined first so the code below can call them)

      # GetNodeText returns a unicode object
      def GetNodeText(node):
          dout = u''
          for tnode in node.childNodes:
              if tnode.nodeType in (NODE_TEXT, NODE_CDATA_SECTION):
                  dout = dout + tnode.nodeValue
          return dout

      # GetNodeValue returns a unicode object or None
      def GetNodeValue(node, tag=None):
          if tag is None:
              return GetNodeText(node)
          nattr = node.attributes.getNamedItem(tag)
          if nattr is not None:
              return nattr.value
          for child in node.childNodes:
              if child.nodeName == tag:
                  return GetNodeText(child)
          return None

      httpFetcher = cachedhttp.CachedHTTP()

      # Fetch Menu Links Page. httpFetcher is from the cachedhttp lib developed
      # by someone for another script; it returns a bytestring from the local
      # cached file, once downloaded from the internet, using a simple
      # f = open(file,'r') & f.read()
      data = httpFetcher.urlopen('http://www.canalplus.fr/pid6.htm')
      data = data.decode('iso-8859-1')
      # at that point i have my html document in unicode

      # utf8bin.xml is a utf-8 encoded xml file, "bin" is because of the way
      # i have to use to save it back to file, see at bottom
      dom = parse('utf8bin.xml')

      # find the data we need from the html document
      # title contains the text and so some special chars
      x = re.compile(
          r'<li[^>]*>[^<]*<a href="http://www.canalplus.fr/(?P<url>[^"]+)"[^>]*>'
          r'(?:<b>)?(?P<title>[^<]+)(?:</b>)?</a>[^<]*</li>',
          re.DOTALL|re.IGNORECASE|re.UNICODE)

      for match in x.finditer(data):
          urlid = match.group('url')
          url = match.expand(r'http://www.canalplus.fr/\g<url>')
          title = match.expand(r'\g<title>')
          # everything here is still unicode objects

          # look for an existing page node with the same title
          # ("found" instead of reusing the name "match" of the regex match)
          found = None
          for node in dom.getElementsByTagName('page'):
              if GetNodeValue(node, 'title') == title:
                  print 'Found Match: ' + title + ' == ' + GetNodeValue(node, 'title')
                  found = node
                  break

          if found is None:
              # create page node and set attributes
              newnode = dom.createElement('page')
              newnode.setAttribute('id', urlid)
              # create title childnode and set CDATA section
              vnode = dom.createElement('title')
              newnode.appendChild(vnode)
              vnode.appendChild(dom.createCDATASection(title))
              # create value childnode and set CDATA section
              vnode = dom.createElement('value')
              newnode.appendChild(vnode)
              vnode.appendChild(dom.createCDATASection(url))
              dom.documentElement.appendChild(newnode)

      # toxml(encoding=...) returns an encoded bytestring: write it as is, in binary mode
      f = open('utf8bin.xml', 'wb')
      f.write(dom.toxml(encoding="utf-8"))
      f.close()

      # just to make sure we can still parse our xml file
      print '\nParsing utf8bin.xml and Printing titles'
      dom = parse('utf8bin.xml')
      for node in dom.getElementsByTagName('page'):
          print GetNodeValue(node, 'title')

      ############### CODE END ###############

      Now the comments :

      so what i understood of all this, is that once you're using unicode

      objects you're safe !

      At least as long as you don't use statements or operators that will

      implicitly try to convert the unicode object back to bytestring using

      your default encoding (ascii) which will most certainly result in codec

      Errors...

      Also, minidom seems to use unicode objects, which was not really documented

      in the python 2.3 doc i've read about it..

      so passing the unicode object from my regex matches to minidom elements

      will make minidom behave nicely..

      If you start to pass encoded bytestrings to minidom elements it may fail

      when you call "toxml()".. I know i managed to do that once or twice. I

      don't remember exactly what kind of bytestrings i passed to the minidom

      element, but one thing's for sure: it made "toxml()" fail whatever

      encoding you specify..

      So if you stick to unicode, it will then encode all that unicode content

      to whatever encoding you've specified when calling

      "dom.toxml(encoding="utf-8")"

      then you just have to store the output of that as it is without any

      further encoding

      As a matter of fact using the following sequence will most certainly fail :

      f = codecs.open('utf8codecs.xml', 'w', 'utf-8')

      f.write(dom.toxml(encoding="utf-8"))

      f.close()

      then again maybe this will work, i just thought of it..

      f = codecs.open('utf8codecs.xml', 'w', 'utf-8')

      f.write(dom.toxml())

      f.close()

      I didn't understand at first that once you're using unicode object and

      as long as you've properly decoded your bytestring source, then unicode

      is unicode and you can forget about encodings "ascii", "iso-", "utf-"..

      The next important thing is to make sure to use functions and objects

      that support unicode all the way, like minidom seems to do..

      my original script has another function "FindDataNode" that will do a

      more sophisticated loop over the dom object you provide, in order to

      check if there's already a node with the same title, and i use there

      some .lower() methods and another "Sanitize" function that replaces a

      few chars.. So i guess i'll have to make sure that none of those

      manipulations converts my unicode object back to bytestrings..

      Thx for reading, let me know if you see really really weird (bad?)

      things in my code, or if you have further comments to add on the unicode

      topic..

      Marc


      #3; Sun, 30 Dec 2007 22:59:00 GMT
    • > so what i understood of all this, is that once you're using unicode

      > objects you're safe !

      > At least as long as you don't use statements or operators that will

      > implicitly try to convert the unicode object back to bytestring using

      > your default encoding (ascii) which will most certainly result in codec

      > Errors...

      Correct.

      > Also, minidom seems to use unicode objects, which was not really documented

      > in the python 2.3 doc i've read about it..

      It might be somewhat hidden:

      http://docs.python.org/lib/dom-type-mapping.html

      "DOMString defined in the recommendation is mapped to a Python string or

      Unicode string. Applications should be able to handle Unicode whenever a

      string is returned from the DOM."

      http://docs.python.org/lib/minidom-and-dom.html

      "The type DOMString maps to Python strings. xml.dom.minidom supports

      either byte or Unicode strings, but will normally produce Unicode

      strings. Values of type DOMString may also be None where allowed to have

      the IDL null value by the DOM specification from the W3C."

      In principle, you should fill Unicode strings into DOM trees all the

      time, but it will work with byte strings as well as long as they are

      ASCII.

      > As a matter of fact using the following sequence will most certainly fail :

      > f = codecs.open('utf8codecs.xml', 'w', 'utf-8')

      > f.write(dom.toxml(encoding="utf-8"))

      > f.close()

      Correct. A codecs.StreamWriter expects Unicode objects, whereas toxml

      returns byte strings (at least if you pass an encoding - because of a

      bug, it might return a Unicode string otherwise)

      > then again maybe this will work, i just thought of it..

      > f = codecs.open('utf8codecs.xml', 'w', 'utf-8')

      > f.write(dom.toxml())

      > f.close()

      Yeah, toxml() returned Unicode because of a bug - but for backwards

      compatibility, this cannot be changed. People should explicitly pass

      an encoding.
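
For comparison, a sketch of the two write paths in Python 3, which is an assumption of this example and behaves differently from the Python 2.3 discussed here: in Python 3, toxml() without an encoding returns text by design, and toxml(encoding=...) returns bytes. The file names and the temporary directory are hypothetical.

```python
import os
import tempfile
from xml.dom.minidom import parse, parseString

dom = parseString("<pages><page>\u00e9t\u00e9</page></pages>")
tmpdir = tempfile.mkdtemp()

# 1. toxml(encoding=...) is already encoded: write the bytes untouched
bin_path = os.path.join(tmpdir, "utf8bin.xml")
with open(bin_path, "wb") as f:
    f.write(dom.toxml(encoding="utf-8"))

# 2. toxml() is text: let the file object do the encoding, exactly once
text_path = os.path.join(tmpdir, "utf8text.xml")
with open(text_path, "w", encoding="utf-8") as f:
    f.write(dom.toxml())

titles = [parse(p).getElementsByTagName("page")[0].firstChild.data
          for p in (bin_path, text_path)]
print(titles)
```

Both files parse back to the same text; the failure described above comes from applying an encoding step twice, or not at all.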

      > The next important thing is to make sure to use functions and objects

      > that support unicode all the way, like minidom seems to do..

      Indeed, there are still many functions in the standard library which

      don't work with Unicode strings, but should. Some functions, of course,

      are only meaningful for byte strings (like networking API).

      Regards,

      Martin

      #4; Sun, 30 Dec 2007 23:00:00 GMT