Given an unsorted sequence, remove_duplicates used to sort it by the pointer value of the attributes/nodes and then remove consecutive duplicates. This was problematic because it made the result of XPath queries dependent on the memory allocation pattern. While it is technically incorrect to rely on that order, this results in easy-to-miss bugs. This is particularly common when XPath queries use union operators, although remove_duplicates is also called in other cases.

This change reworks the code to use a hash set instead, using the same hash function we use for compact storage. To make sure it performs well, we allocate enough buckets for count * 1.5 (assuming all elements are unique); since each bucket is a single pointer, unlike xpath_node which is two pointers, we need somewhere between size * 0.75 and size * 1.5 temporary storage.

The resulting filtering is stable (we remove elements that we have seen before but we don't change the order) and is actually significantly faster than sorting was. With a large union operation, removing duplicates took ~56 ms per 100 query invocations before this change and takes ~20 ms after it.

Fixes #254.
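The hash-set filtering described in the change note above can be sketched roughly as follows. This is a minimal standalone illustration, not pugixml's actual code: `Node`, `hash_pointer`, `bucket_count_for`, `insert_unique` and `remove_duplicates_stable` are hypothetical names, the hash is a generic 64-bit mixer rather than the one used for compact storage, and rounding the bucket count up to a power of two is an assumption (it would be consistent with the 0.75 to 1.5 range quoted for temporary storage).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an XPath node: two pointers, as the text describes.
struct Node
{
    const void* node;
    const void* attribute;
};

// Illustrative pointer hash (a generic 64-bit mixer); an assumption, not pugixml's hash.
static std::size_t hash_pointer(const void* p)
{
    std::uint64_t h = static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return static_cast<std::size_t>(h);
}

// Smallest power of two that is at least count * 1.5 (assumed rounding policy).
static std::size_t bucket_count_for(std::size_t count)
{
    std::size_t buckets = 1;
    while (buckets < count + count / 2) buckets *= 2;
    return buckets;
}

// Inserts key into an open-addressing table; returns true if the key was not present.
static bool insert_unique(std::vector<const void*>& table, const void* key)
{
    std::size_t mask = table.size() - 1;
    std::size_t bucket = hash_pointer(key) & mask;

    // Linear probing; the sizing above guarantees that every insertion terminates.
    while (table[bucket])
    {
        if (table[bucket] == key) return false;
        bucket = (bucket + 1) & mask;
    }

    table[bucket] = key;
    return true;
}

// Stable in-place filter: keeps the first occurrence of each node and preserves order.
static void remove_duplicates_stable(std::vector<Node>& nodes)
{
    std::vector<const void*> table(bucket_count_for(nodes.size()), nullptr);

    std::size_t write = 0;
    for (std::size_t read = 0; read < nodes.size(); ++read)
    {
        // Identify an attribute node by its attribute pointer, any other node by its node pointer.
        const void* key = nodes[read].attribute ? nodes[read].attribute : nodes[read].node;

        // Entries with a null key are dropped in this sketch.
        if (key && insert_unique(table, key))
            nodes[write++] = nodes[read];
    }

    nodes.resize(write);
}
```

Under these assumptions the table holds between 1.5 and 3 one-pointer buckets per input element, which is 0.75 to 1.5 times the size of the two-pointer input. Because the first occurrence of each key is kept in place, the output order depends only on the input order and not on where the nodes were allocated.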
pugixml

pugixml is a C++ XML processing library, which consists of a DOM-like interface with rich traversal/modification capabilities, an extremely fast XML parser which constructs the DOM tree from an XML file/buffer, and an XPath 1.0 implementation for complex data-driven tree queries. Full Unicode support is also available, with Unicode interface variants and conversions between different Unicode encodings (which happen automatically during parsing/saving).
pugixml is used by a lot of projects, both open-source and proprietary, for its performance and easy-to-use interface.
Documentation
Documentation for the current release of pugixml is available online as two separate documents:
- Quick-start guide, which aims to provide enough information to start using the library;
- Complete reference manual, which describes all features of the library in detail.
You’re advised to start with the quick-start guide; however, many important library features are either not described in it at all or only mentioned briefly, so if you require more information you should read the complete manual.
Example
Here's an example of how code using pugixml looks; it opens an XML file, goes over all Tool nodes and prints tools that have a Timeout attribute greater than 0:
    #include "pugixml.hpp"
    #include <iostream>

    int main()
    {
        // Parse the file into a DOM tree; bail out if loading or parsing fails.
        pugi::xml_document doc;
        pugi::xml_parse_result result = doc.load_file("xgconsole.xml");
        if (!result)
            return -1;

        // Visit every Tool element under /Profile/Tools.
        for (pugi::xml_node tool: doc.child("Profile").child("Tools").children("Tool"))
        {
            int timeout = tool.attribute("Timeout").as_int();

            if (timeout > 0)
                std::cout << "Tool " << tool.attribute("Filename").value() << " has timeout " << timeout << "\n";
        }
    }
And the same example using XPath:
    #include "pugixml.hpp"
    #include <iostream>

    int main()
    {
        // Parse the file into a DOM tree; bail out if loading or parsing fails.
        pugi::xml_document doc;
        pugi::xml_parse_result result = doc.load_file("xgconsole.xml");
        if (!result)
            return -1;

        // Select only the Tool elements whose Timeout attribute is greater than 0.
        pugi::xpath_node_set tools_with_timeout = doc.select_nodes("/Profile/Tools/Tool[@Timeout > 0]");

        for (pugi::xpath_node node: tools_with_timeout)
        {
            pugi::xml_node tool = node.node();
            std::cout << "Tool " << tool.attribute("Filename").value() <<
                " has timeout " << tool.attribute("Timeout").as_int() << "\n";
        }
    }
License
This library is available to anybody free of charge, under the terms of the MIT License (see LICENSE.md).