<dt><span class="section"><a href="loading.html#manual.loading.w3c"> Conformance to W3C specification</a></span></dt>
</dl></div>
<p>
pugixml provides several functions for loading XML data from various places
- files, C++ iostreams, memory buffers. All functions use an extremely fast
non-validating parser. This parser is not fully W3C conformant - it can load
any valid XML document, but does not perform some well-formedness checks. While
considerable effort is made to reject invalid XML documents, some validation
is not performed for performance reasons. Also, some XML transformations
(e.g. EOL handling or attribute value normalization) can impact parsing speed
and thus can be disabled. However, for the vast majority of XML documents there
is no performance difference between different parsing options. Parsing options
also control whether certain XML nodes are parsed; see <a class="xref" href="loading.html#manual.loading.options" title="Parsing options"> Parsing options</a> for
more information.
</p>
<p>
XML data is always converted to the internal character format (see <a class="xref" href="dom.html#manual.dom.unicode" title="Unicode interface"> Unicode interface</a>)
before parsing. pugixml supports all popular Unicode encodings (UTF-8, UTF-16
(big and little endian), UTF-32 (big and little endian); UCS-2 is naturally
supported since it's a strict subset of UTF-16) and handles all encoding conversions
automatically. Unless an explicit encoding is specified, loading functions perform
automatic encoding detection based on the first few characters of XML data, so
in almost all cases you do not have to specify the document encoding. Encoding
conversion is described in more detail in <a class="xref" href="loading.html#manual.loading.encoding" title="Encodings"> Encodings</a>.
</p>
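<p>
To make "detection based on the first few characters" concrete, here is a minimal
sketch that recognizes an encoding from a byte order mark only. It is an illustration
under stated assumptions, not pugixml's actual detection code: the names
<code class="computeroutput">bom_encoding</code> and <code class="computeroutput">detect_bom</code>
are hypothetical, and a real detector would also need to handle data without a BOM.
</p>

```cpp
#include <cstddef>

// Hypothetical enumeration for this sketch only; it is not part of pugixml.
enum class bom_encoding { unknown, utf8, utf16_le, utf16_be, utf32_le, utf32_be };

// Guess the encoding from a byte order mark at the start of the buffer.
// Returns bom_encoding::unknown when no BOM is present.
bom_encoding detect_bom(const unsigned char* data, std::size_t size)
{
    // UTF-32 is tested before UTF-16 because the UTF-32 LE mark (FF FE 00 00)
    // begins with the UTF-16 LE mark (FF FE).
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
        return bom_encoding::utf32_be;
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
        return bom_encoding::utf32_le;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return bom_encoding::utf16_be;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return bom_encoding::utf16_le;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return bom_encoding::utf8;

    return bom_encoding::unknown;
}
```

<p>
The ordering of the checks is the main design point: testing the longer marks first
avoids misreading UTF-32 little endian data as UTF-16.
</p>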
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="manual.loading.file"></a><a class="link" href="loading.html#manual.loading.file" title="Loading document from file"> Loading document from file</a>
arguments, which specify parsing options (see <a class="xref" href="loading.html#manual.loading.options" title="Parsing options"> Parsing options</a>) and
input data encoding (see <a class="xref" href="loading.html#manual.loading.encoding" title="Encodings"> Encodings</a>). The path uses the target
operating system's format: it can be relative or absolute, it should
use the delimiters of the target system, and it should have the exact case if target
<a name="manual.loading.memory"></a><a class="link" href="loading.html#manual.loading.memory" title="Loading document from memory"> Loading document from memory</a>
All functions accept a buffer, which is represented by a pointer to XML
data, <code class="computeroutput"><span class="identifier">contents</span></code>, and the data
size in bytes. There are also two optional arguments, which specify parsing
options (see <a class="xref" href="loading.html#manual.loading.options" title="Parsing options"> Parsing options</a>) and input data encoding (see <a class="xref" href="loading.html#manual.loading.encoding" title="Encodings"> Encodings</a>).
The buffer does not have to be zero-terminated.
</p>
<p>
The <code class="computeroutput"><span class="identifier">load_buffer</span></code> function works
with an immutable buffer - it never modifies the buffer. Because of this
restriction it has to create a private buffer and copy XML data to it before
parsing (applying encoding conversions if necessary). This copy operation
carries a performance penalty, so inplace functions are provided - <code class="computeroutput"><span class="identifier">load_buffer_inplace</span></code> and <code class="computeroutput"><span class="identifier">load_buffer_inplace_own</span></code>
store the document data in the buffer, modifying it in the process. For
the document to stay valid, you have to make sure that the buffer's lifetime
exceeds that of the tree if you're using inplace functions. In addition to
<pre class="programlisting"><span class="comment">// You can use load_buffer_inplace to load document from mutable memory block; the block's lifetime must exceed that of document
</span></pre>
<pre class="programlisting"><span class="comment">// You can use load_buffer_inplace_own to load document from mutable memory block and to pass the ownership of this block
</span><span class="comment">// The block has to be allocated via pugixml allocation function - using e.g. operator new here is incorrect
</span></pre>
<a name="manual.loading.stream"></a><a class="link" href="loading.html#manual.loading.stream" title="Loading document from C++ IOstreams"> Loading document from C++ IOstreams</a>
</h3></div></div></div>
<a name="xml_document::load_stream"></a><p>
For additional interoperability pugixml provides functions for loading a document
from any object which implements the C++ <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">istream</span></code>
interface. This allows you to load documents from any standard C++ stream
(e.g. a file stream) or any third-party compliant implementation (e.g. Boost
Iostreams). There are two functions, one works with narrow character streams,
<code class="computeroutput"><span class="identifier">load</span></code> with <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">istream</span></code>
argument loads the document from the stream, from the current read position to
the end, treating the stream contents as a byte stream of the specified encoding
(with encoding autodetection as necessary). Thus calling <code class="computeroutput"><span class="identifier">xml_document</span><span class="special">::</span><span class="identifier">load</span></code>
on an opened <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">ifstream</span></code> object is equivalent to calling
<code class="computeroutput"><span class="identifier">load</span></code> with <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">wstream</span></code>
argument treats the stream contents as a wide character stream (encoding
is always <code class="computeroutput"><span class="identifier">encoding_wchar</span></code>).
Because of this, using <code class="computeroutput"><span class="identifier">load</span></code>
with wide character streams requires careful (usually platform-specific)
stream setup (e.g. using the <code class="computeroutput"><span class="identifier">imbue</span></code>
function). Use of wide streams is generally discouraged; however, it gives
you the ability to load documents from non-Unicode encodings, e.g. you can
load Shift-JIS encoded data if you set the correct locale.
</p>
<p>
This is a simple example of loading XML document from file using streams
All document loading functions return the parsing result via an <code class="computeroutput"><span class="identifier">xml_parse_result</span></code> object. It contains the parsing
status, the offset of the last successfully parsed character from the beginning
of the source stream, and the encoding of the source stream:
<a name="status_ok"></a><code class="literal">status_ok</code> means that no error was encountered
during parsing; the source stream represents a valid XML document which
was fully parsed and converted to a tree. <br><br>
</li>
<li class="listitem">
<a name="status_file_not_found"></a><code class="literal">status_file_not_found</code> is only
returned by the <code class="computeroutput"><span class="identifier">load_file</span></code>
function and means that the file could not be opened.
</li>
<li class="listitem">
<a name="status_io_error"></a><code class="literal">status_io_error</code> is returned by the <code class="computeroutput"><span class="identifier">load_file</span></code> function and by <code class="computeroutput"><span class="identifier">load</span></code> functions with <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">istream</span></code>/<code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">wstream</span></code> arguments; it means that some
I/O error has occurred during reading the file/stream.
</li>
<li class="listitem">
<a name="status_out_of_memory"></a><code class="literal">status_out_of_memory</code> means that
there was not enough memory during some allocation; any allocation failure
during parsing results in this error.
</li>
<li class="listitem">
<a name="status_internal_error"></a><code class="literal">status_internal_error</code> means that
something went horribly wrong; currently this error does not occur. <br><br>
</li>
<li class="listitem">
<a name="status_unrecognized_tag"></a><code class="literal">status_unrecognized_tag</code> means
that parsing stopped due to a tag with either an empty name or a name
which starts with an incorrect character, such as <code class="literal">#</code>.
</li>
<li class="listitem">
<a name="status_bad_pi"></a><code class="literal">status_bad_pi</code> means that parsing stopped
due to an incorrect document declaration/processing instruction
<a name="status_bad_doctype"></a><code class="literal">status_bad_doctype</code> and <a name="status_bad_pcdata"></a><code class="literal">status_bad_pcdata</code>
mean that parsing stopped due to an invalid construct of the respective
type
</li>
<li class="listitem">
<a name="status_bad_start_element"></a><code class="literal">status_bad_start_element</code> means
that parsing stopped because the starting tag either had no closing <code class="computeroutput"><span class="special">&gt;</span></code> symbol or contained some incorrect
symbol
</li>
<li class="listitem">
<a name="status_bad_attribute"></a><code class="literal">status_bad_attribute</code> means that
parsing stopped because there was an incorrect attribute, such as an
attribute without value or with value that is not quoted (note that
means that parsing stopped because the closing tag did not match the
opening one (e.g. <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">node</span><span class="special">&gt;&lt;/</span><span class="identifier">nedo</span><span class="special">&gt;</span></code>) or because some tag was not closed
member function can be used to convert parsing status to a string; the returned
message is always in English, so you'll have to write your own function if
you need a localized string. However, please note that the exact messages
returned by the <code class="computeroutput"><span class="identifier">description</span><span class="special">()</span></code>
function may change from version to version, so any complex status handling
should be based on the <code class="computeroutput"><span class="identifier">status</span></code>
value.
</p>
<p>
If parsing failed because the source data was not valid XML, the resulting
tree is not destroyed - despite the fact that the load function returns an error,
you can use the part of the tree that was successfully parsed. Obviously,
the last element may have an unexpected name/value; for example, if the attribute
value does not end with the necessary quotation mark, as in the <code class="literal">&lt;node
attr="value&gt;some data&lt;/node&gt;</code> example, the value of
attribute <code class="computeroutput"><span class="identifier">attr</span></code> will contain
the string <code class="computeroutput"><span class="identifier">value</span><span class="special">&gt;</span><span class="identifier">some</span> <span class="identifier">data</span><span class="special">&lt;/</span><span class="identifier">node</span><span class="special">&gt;</span></code>.
</p>
<a name="xml_parse_result::offset"></a><p>
In addition to the status code, the parsing result has an <code class="computeroutput"><span class="identifier">offset</span></code>
member, which contains the offset of the last successfully parsed character if
parsing failed because of an error in source data; otherwise <code class="computeroutput"><span class="identifier">offset</span></code> is 0. For parsing efficiency reasons,
pugixml does not track the current line during parsing; this offset is in
units of <code class="computeroutput"><span class="identifier">pugi</span><span class="special">::</span><span class="identifier">char_t</span></code> (bytes for character mode, wide
characters for wide character mode). Many text editors support a 'Go To Position'
feature - you can use it to locate the exact error position. Alternatively,
if you're loading the document from memory, you can display the error chunk
along with the error description (see the example code below).
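<p>
Since the parser reports only an offset, an application that wants a line and column
has to compute them from the source text itself. The following is a minimal sketch of
that conversion, assuming character mode (offsets in bytes) and that the original text
is still available in memory; <code class="computeroutput">text_position</code> and
<code class="computeroutput">locate</code> are hypothetical helper names, not part of pugixml.
</p>

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type for this sketch: a 1-based line and column.
struct text_position
{
    std::size_t line;
    std::size_t column;
};

// Convert a 0-based character offset into a 1-based line/column pair.
// Lines are assumed to end with '\n'; offsets past the end are clamped.
text_position locate(const std::string& text, std::size_t offset)
{
    if (offset > text.size())
        offset = text.size();

    text_position pos = {1, 1};

    for (std::size_t i = 0; i < offset; ++i)
    {
        if (text[i] == '\n')
        {
            ++pos.line;
            pos.column = 1;
        }
        else
        {
            ++pos.column;
        }
    }

    return pos;
}
```

<p>
The scan is linear in the offset, which is acceptable here because it only runs once,
after a parsing failure.
</p>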
All document loading functions accept an optional parameter, <code class="computeroutput"><span class="identifier">options</span></code>. This is a bitmask that customizes
the parsing process: you can select the node types that are parsed and various
transformations that are performed with the XML text. Disabling certain transformations
can improve parsing performance for some documents; however, the code for
all transformations is very well optimized, and thus the majority of documents
won't get any performance benefit. As a rule of thumb, only modify parsing
flags if you want to get some nodes in the document that are excluded by
You should use the usual bitwise arithmetic to manipulate the bitmask:
to enable a flag, use <code class="computeroutput"><span class="identifier">mask</span> <span class="special">|</span> <span class="identifier">flag</span></code>;
to disable a flag, use <code class="computeroutput"><span class="identifier">mask</span> <span class="special">&amp;</span> <span class="special">~</span><span class="identifier">flag</span></code>.
<a name="parse_declaration"></a><code class="literal">parse_declaration</code> determines if the XML
document declaration (node with type <a class="link" href="dom.html#node_declaration">node_declaration</a>)
is to be put in the DOM tree. If this flag is off, it is not put in the
tree, but is still parsed and checked for correctness. This flag is
<span class="bold"><strong>off</strong></span> by default. <br><br>
</li>
<li class="listitem">
<a name="parse_pi"></a><code class="literal">parse_pi</code> determines if processing instructions
(nodes with type <a class="link" href="dom.html#node_pi">node_pi</a>) are to be put
in the DOM tree. If this flag is off, they are not put in the tree, but are
still parsed and checked for correctness. Note that <code class="computeroutput"><span class="special">&lt;?</span><span class="identifier">xml</span> <span class="special">...?&gt;</span></code>
(document declaration) is not considered to be a PI. This flag is <span class="bold"><strong>off</strong></span> by default. <br><br>
</li>
<li class="listitem">
<a name="parse_comments"></a><code class="literal">parse_comments</code> determines if comments
(nodes with type <a class="link" href="dom.html#node_comment">node_comment</a>) are
to be put in the DOM tree. If this flag is off, they are not put in the tree,
but are still parsed and checked for correctness. This flag is <span class="bold"><strong>off</strong></span> by default. <br><br>
</li>
<li class="listitem">
<a name="parse_cdata"></a><code class="literal">parse_cdata</code> determines if CDATA sections
(nodes with type <a class="link" href="dom.html#node_cdata">node_cdata</a>) are to
be put in the DOM tree. If this flag is off, they are not put in the tree,
but are still parsed and checked for correctness. This flag is <span class="bold"><strong>on</strong></span> by default. <br><br>
</li>
<li class="listitem">
<a name="parse_ws_pcdata"></a><code class="literal">parse_ws_pcdata</code> determines if PCDATA
nodes (nodes with type <a class="link" href="dom.html#node_pcdata">node_pcdata</a>)
that consist only of whitespace characters are to be put in the DOM tree.
Often whitespace-only data is not significant for the application, and
the cost of allocating and storing such nodes (both memory and speed-wise)
can be significant. For example, after parsing the XML string <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">node</span><span class="special">&gt;</span> <span class="special">&lt;</span><span class="identifier">a</span><span class="special">/&gt;</span> <span class="special">&lt;/</span><span class="identifier">node</span><span class="special">&gt;</span></code>, the <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">node</span><span class="special">&gt;</span></code>
element will have three children when <code class="computeroutput"><span class="identifier">parse_ws_pcdata</span></code>
is set (a child with type <code class="computeroutput"><span class="identifier">node_pcdata</span></code>
and value <code class="computeroutput"><span class="string">" "</span></code>,
a child with type <code class="computeroutput"><span class="identifier">node_element</span></code>
and name <code class="computeroutput"><span class="string">"a"</span></code>, and
another child with type <code class="computeroutput"><span class="identifier">node_pcdata</span></code>
and value <code class="computeroutput"><span class="string">" "</span></code>),
and only one child when <code class="computeroutput"><span class="identifier">parse_ws_pcdata</span></code>
is not set. This flag is <span class="bold"><strong>off</strong></span> by default.
</li>
</ul></div>
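<p>
The node-type flags listed above are combined with ordinary bitwise operators. The
sketch below shows the pattern with stand-in constants; the names
<code class="computeroutput">flag_*</code> and their numeric values are assumptions
made for this illustration, and real code would use the library's own
<code class="computeroutput">parse_*</code> constants instead.
</p>

```cpp
// Hypothetical flag values for this sketch only; they are not pugixml's constants.
const unsigned int flag_pi        = 0x01;
const unsigned int flag_comments  = 0x02;
const unsigned int flag_cdata     = 0x04;
const unsigned int flag_ws_pcdata = 0x08;

// Stand-in default mask: of the four flags above, only CDATA parsing is on by default.
const unsigned int flag_default = flag_cdata;

// Enable a flag: mask | flag.
inline unsigned int enable(unsigned int mask, unsigned int flag)
{
    return mask | flag;
}

// Disable a flag: mask & ~flag.
inline unsigned int disable(unsigned int mask, unsigned int flag)
{
    return mask & ~flag;
}

// Check whether a flag is present in the mask.
inline bool is_set(unsigned int mask, unsigned int flag)
{
    return (mask & flag) != 0;
}
```

<p>
For instance, <code class="computeroutput">enable(flag_default, flag_comments)</code> keeps
CDATA parsing on and additionally requests comment nodes, while
<code class="computeroutput">disable(mask, flag_cdata)</code> leaves every other bit untouched.
</p>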
<p>
These flags control the transformation of tree element contents:
<a name="parse_escapes"></a><code class="literal">parse_escapes</code> determines if character
and entity references are to be expanded during the parsing process.
Character references have the form <code class="literal">&amp;#...;</code> or
<code class="literal">&amp;#x...;</code> (<code class="literal">...</code> is the Unicode numeric
representation of the character in either decimal (<code class="literal">&amp;#...;</code>)
or hexadecimal (<code class="literal">&amp;#x...;</code>) form), entity references
are <code class="literal">&amp;lt;</code>, <code class="literal">&amp;gt;</code>, <code class="literal">&amp;amp;</code>,
<code class="literal">&amp;apos;</code> and <code class="literal">&amp;quot;</code> (note
that as pugixml does not handle DTD, the only allowed entities are predefined
ones). If a character/entity reference cannot be expanded, it is left
as is, so you can do additional processing later. Reference expansion
is performed in attribute values and PCDATA content. This flag is <span class="bold"><strong>on</strong></span> by default. <br><br>
</li>
<li class="listitem">
<a name="parse_eol"></a><code class="literal">parse_eol</code> determines if EOL handling (that
is, replacing sequences <code class="computeroutput"><span class="number">0x0d</span> <span class="number">0x0a</span></code> by a single <code class="computeroutput"><span class="number">0x0a</span></code>
character, and replacing all standalone <code class="computeroutput"><span class="number">0x0d</span></code>
characters by <code class="computeroutput"><span class="number">0x0a</span></code>) is to
be performed on input data (that is, comment contents, PCDATA/CDATA
contents and attribute values). This flag is <span class="bold"><strong>on</strong></span>
if attribute value normalization should be performed for all attributes.
This means that whitespace characters (new line, tab and space) are
replaced with space (<code class="computeroutput"><span class="char">' '</span></code>).
New line characters are always treated as if <code class="computeroutput"><span class="identifier">parse_eol</span></code>
is set, i.e. <code class="computeroutput"><span class="special">\</span><span class="identifier">r</span><span class="special">\</span><span class="identifier">n</span></code>
is converted to a single space. This flag is <span class="bold"><strong>on</strong></span>
<a name="parse_minimal"></a><code class="literal">parse_minimal</code> has all options turned
off. This option mask means that pugixml does not add declaration nodes,
PI nodes, CDATA sections and comments to the resulting tree and does
not perform any conversion for input data, so theoretically it is the
fastest mode. However, as discussed above, in practice <code class="computeroutput"><span class="identifier">parse_default</span></code> is usually equally fast.
<br><br>
</li>
<li class="listitem">
<a name="parse_default"></a><code class="literal">parse_default</code> is the default set of flags,
i.e. it has all options set to their default values. It includes parsing
CDATA sections (comments/PIs are not parsed), performing character and
entity reference expansion, replacing whitespace characters with spaces
in attribute values and performing EOL handling. Note that PCDATA sections
consisting only of whitespace characters are not parsed (by default)
for performance reasons.
</li>
</ul></div>
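<p>
The EOL handling and attribute whitespace replacement described in the flag
descriptions above can be sketched as two small string transformations. This is an
illustration of the documented behavior only, not pugixml's implementation (which
works in place on its own buffers); the function names are hypothetical.
</p>

```cpp
#include <cstddef>
#include <string>

// EOL handling: "\r\n" becomes "\n", and any remaining lone '\r' becomes '\n'.
std::string normalize_eol(const std::string& in)
{
    std::string out;
    out.reserve(in.size());

    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] == '\r')
        {
            out += '\n';

            // Swallow the '\n' of a "\r\n" pair so it is not emitted twice.
            if (i + 1 < in.size() && in[i + 1] == '\n')
                ++i;
        }
        else
        {
            out += in[i];
        }
    }

    return out;
}

// Attribute whitespace replacement: EOL handling is applied first,
// then every new line and tab is replaced with a single space.
std::string convert_attribute_whitespace(const std::string& in)
{
    std::string out = normalize_eol(in);

    for (std::size_t i = 0; i < out.size(); ++i)
    {
        if (out[i] == '\n' || out[i] == '\t')
            out[i] = ' ';
    }

    return out;
}
```

<p>
Applying EOL handling first is what makes a <code class="computeroutput">\r\n</code> pair
collapse into one space rather than two.
</p>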
<p>
This is an example of using different parsing options (<a href="../samples/load_options.cpp" target="_top">samples/load_options.cpp</a>):
The current behavior for Unicode conversion is to skip all invalid UTF
sequences during conversion. This behavior should not be relied upon; moreover,
in case no encoding conversion is performed, the invalid sequences are
not removed, so you'll get them as is in node/attribute contents.
</p></td></tr>
</table></div>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="manual.loading.w3c"></a><a class="link" href="loading.html#manual.loading.w3c" title="Conformance to W3C specification"> Conformance to W3C specification</a>
</h3></div></div></div>
<p>
pugixml is not fully W3C conformant - it can load any valid XML document,
but does not perform some well-formedness checks. While considerable effort
is made to reject invalid XML documents, some validation is not performed
for performance reasons.
</p>
<p>
There is only one non-conformant behavior when dealing with valid XML documents:
pugixml does not use the information supplied in the document type declaration for
parsing. This means that entities declared in DOCTYPE are not expanded, and
all attribute/PCDATA values are always processed in a uniform way that depends
only on parsing options.
</p>
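<p>
To illustrate what "not expanded" means in practice, the sketch below expands only the
five predefined entity references and copies any other reference through unchanged,
which is how a DOCTYPE-declared entity would appear in the resulting text. It is a
simplified model, not pugixml's code: the function name is hypothetical and numeric
character references are deliberately left out.
</p>

```cpp
#include <cstddef>
#include <string>

// Expand the five predefined entity references.
// Any other "&...;" sequence is copied to the output as is.
std::string expand_predefined(const std::string& in)
{
    static const struct { const char* name; char value; } table[] = {
        {"&lt;", '<'}, {"&gt;", '>'}, {"&amp;", '&'}, {"&apos;", '\''}, {"&quot;", '"'}
    };

    std::string out;
    std::size_t i = 0;

    while (i < in.size())
    {
        bool matched = false;

        if (in[i] == '&')
        {
            for (const auto& entry : table)
            {
                const std::string name(entry.name);

                // compare() clamps the range, so a reference cut off by the
                // end of the input simply fails to match.
                if (in.compare(i, name.size(), name) == 0)
                {
                    out += entry.value;
                    i += name.size();
                    matched = true;
                    break;
                }
            }
        }

        if (!matched)
            out += in[i++];
    }

    return out;
}
```

<p>
Because the input is consumed in a single pass, the output of one expansion is never
rescanned, so <code class="computeroutput">&amp;amp;lt;</code> yields the text
<code class="computeroutput">&amp;lt;</code> rather than a <code class="computeroutput">&lt;</code> character.
</p>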
<p>
As for rejecting invalid XML documents, there are a number of incompatibilities