Examine a UTF-8 encoded XML string and raise a pywbem.ParseError
exception if the response contains Bytes that are invalid UTF-8
sequences (incorrectly encoded or ill-formed) or that are invalid XML
characters.
This function works in both "wide" and "narrow" Unicode builds of Python
and supports the full range of Unicode characters from U+0000 to U+10FFFF.
This function is just a workaround for the bad error handling of Python's
xml.dom.minidom package. It replaces the not very informative
ExpatError "not well-formed (invalid token): line: x, column: y" with a
pywbem.ParseError providing more useful information.
Notes on Unicode support in Python:
- For internally representing Unicode characters in the unicode type, a
"wide" Unicode build of Python uses UTF-32, while a "narrow" Unicode
build uses UTF-16. The difference is visible to Python programs for
Unicode characters assigned to code points above U+FFFF: The "narrow"
build uses 2 characters (a surrogate pair) for them, while the "wide"
build uses just 1 character. This affects all position- and
length-oriented functions, such as
len() or string slicing.
- In a "wide" Unicode build of Python, the Unicode characters assigned to
code points U+10000 to U+10FFFF are represented directly (using code
points U+10000 to U+10FFFF) and the surrogate code points
U+D800...U+DFFF are never used; in a "narrow" Unicode build of Python,
the Unicode characters assigned to code points U+10000 to U+10FFFF are
represented using pairs of the surrogate code points U+D800...U+DFFF.
Notes on the Unicode code points U+D800...U+DFFF ("surrogate code points"):
These code points have no corresponding Unicode characters assigned,
because they are reserved for surrogates in the UTF-16 encoding.
The UTF-8 encoding can technically represent the surrogate code points.
ISO/IEC 10646 defines that a UTF-8 sequence containing the surrogate
code points is ill-formed, but it is technically possible that such a
sequence is in a UTF-8 encoded XML string.
The Python escapes \u and \U used in literal strings can
represent the surrogate code points (as well as all other code points,
regardless of whether they are assigned to Unicode characters).
The Python str.encode() and str.decode() functions successfully
translate the surrogate code points back and forth for encoding UTF-8.
For example, '\xed\xb0\x80'.decode("utf-8") = u'\udc00'.
Because Python supports the encoding and decoding of UTF-8 sequences
also for the surrogate code points, the "narrow" Unicode build of
Python can be (mis-)used to transport each surrogate unit separately
encoded in (ill-formed) UTF-8.
For example, code point U+10122 can be (illegally) created from a
sequence of code points U+D800,U+DD22 represented in UTF-8:
'\xED\xA0\x80\xED\xB4\xA2'.decode("utf-8") = u'\U00010122'
while the correct UTF-8 sequence for this code point is:
u'\U00010122'.encode("utf-8") = '\xf0\x90\x84\xa2'
Notes on XML characters:
The legal XML characters are defined in W3C XML 1.0 (Fith Edition):
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
These are the code points of Unicode characters using a non-surrogate
representation.
- Parameters:
utf8_xml (string) - The UTF-8 encoded XML string to be examined.
meaning (string) - Short text with meaning of the XML string, for messages in exceptions.
Exceptions:
TypeError , if invoked with incorrect Python object type for utf8_xml .
pywbem.ParseError, if utf8_xml contains Bytes that are invalid UTF-8
sequences (incorrectly encoded or ill-formed) or invalid XML characters.
|