Biswajit Banerjee

### Reading XML files containing gzipped data in C++

How to read particle input files created with R in XML format

#### Introduction

We saw how to create an XML file containing compressed particle data in the article “XML format for particle input files”. Let us now explore how to read in that data in our C++ particle simulation code.

#### Recap

Recall that the compressed base64 XML file contains data of the form shown below. We would like to convert these data back into numerical values that can be used by the simulation.
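The original listing is omitted here, but the data have roughly the following shape. This fragment is illustrative only: the tag and attribute names are assumptions based on the companion article, and the payloads are placeholders rather than real data.

```xml
<!-- Illustrative only: tag and attribute names are assumptions,
     and the payloads are placeholders, not real data. -->
<Particles>
  <id numComponents="1" encoding="base64" compression="gzip">
    (base64-encoded, gzip-compressed particle IDs)
  </id>
  <position numComponents="3" encoding="base64" compression="gzip">
    (base64-encoded, gzip-compressed x, y, z triples)
  </position>
</Particles>
```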

We use a ParticleFileReader object to read the file. The declaration of the object is listed below. Particle data are stored in an array of pointers to Particle objects, called ParticlePArray.
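The exact declaration is not reproduced here; a rough sketch of the interface, using the method names discussed in this article (the XML-node types and other details are elided), might look like:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

class Particle;  // the simulation's particle type, defined elsewhere

// Particle data are stored in an array of pointers to Particle objects.
using ParticleP = std::shared_ptr<Particle>;
using ParticlePArray = std::vector<ParticleP>;

class ParticleFileReader {
public:
  // Read the XML file and fill the particle array.
  void read(const std::string& fileName, ParticlePArray& particles);

private:
  // Extract one tag's worth of encoded data and convert it to values.
  template <typename T>
  bool readParticleValues(const std::string& tag,
                          std::vector<T>& values) const;

  // Base64-decode and zlib-inflate the payload, then convert.
  template <typename T>
  bool decodeAndUncompress(const std::string& encoded, int numComponents,
                           std::vector<T>& values) const;

  // Convert one whitespace-separated token group into a value of type T.
  template <typename T>
  T convert(const std::string& str) const;
};
```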

The main workhorse methods in this class are the templated functions readParticleValues, decodeAndUncompress, and convert. Templates are used because similar logic is used for different variable types.

#### The implementation

Let us now look at the implementations of these functions. We will ignore any checks that are necessary to make sure that the XML file is readable and contains the right data.

##### The read function

The point of entry is the read function:

The particle data associated with each tag is an array containing either 1 or 3 components. We use the explicitly instantiated templated function readParticleValues<T> to read in the data into arrays.
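As a toy illustration of the explicit-instantiation pattern (the names here are invented; the real code instantiates readParticleValues&lt;T&gt; for the particle property types it needs):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One templated reader handles all value types with the same logic.
template <typename T>
void readValues(const std::string& text, std::vector<T>& out) {
  std::istringstream in(text);
  T value;
  while (in >> value) {
    out.push_back(value);
  }
}

// Explicit instantiations: the template body is compiled for exactly
// these types, so the definition can live in a .cpp file.
template void readValues<int>(const std::string&, std::vector<int>&);
template void readValues<double>(const std::string&, std::vector<double>&);
```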

##### The readParticleValues templated function

Let us now look at the readParticleValues function that does the extraction and conversion of the compressed and encoded data.

The function just extracts the encoded data from the XML file and the number of components in the data (1 or 3). It then passes these on to the actual decode and uncompress code.
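As a crude, self-contained stand-in for that extraction step (the real code walks an XML DOM, and the numComponents attribute name is an assumption), the idea is simply to pull out the attribute and the encoded payload:

```cpp
#include <cassert>
#include <string>

// Hypothetical holder for one tag's worth of encoded data.
struct EncodedField {
  int numComponents = 0;
  std::string encodedData;
};

// Find the numComponents attribute and the text between <tag ...> and
// </tag>. String searching stands in for a real XML parser here.
EncodedField extractField(const std::string& xml, const std::string& tag) {
  EncodedField field;
  auto attrPos = xml.find("numComponents=\"");
  if (attrPos != std::string::npos) {
    field.numComponents = std::stoi(xml.substr(attrPos + 15));
  }
  auto open = xml.find('>', xml.find("<" + tag));
  auto close = xml.find("</" + tag + ">");
  if (open != std::string::npos && close != std::string::npos) {
    field.encodedData = xml.substr(open + 1, close - open - 1);
  }
  return field;
}
```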

##### The decodeAndUncompress templated function

This is where the main work is done. For decoding the data into binary form, we use the cppcodec library. For decompression we use zlib. To make sure that the cppcodec library is available in the repository where our code is stored, we add it as a submodule using

git submodule add https://github.com/tplgy/cppcodec.git cppcodec


For the zlib library to be available to our CMake build system, we add the following to our CMakeLists.txt file:

#-------------------------------------------------------
# Add requirements for Zlib compression library
#-------------------------------------------------------
find_package(ZLIB REQUIRED)
if (ZLIB_FOUND)
  message(STATUS "Zlib compression library found")
  include_directories(${ZLIB_INCLUDE_DIRS})
else()
  set(ZLIB_DIR "")
  set(ZLIB_LIBRARIES "")
  set(ZLIB_INCLUDE_DIRS "")
endif()


The code for the decodeAndUncompress function is listed below.

The main complication here arises during inflation of the compressed data. We don’t know the size of the output buffer beforehand, so we have to inflate repeatedly, chunk by chunk, until the entire input buffer has been consumed. After each chunk has been read into the out vector, we insert the data into uncompressed and continue the process.

After the entire stream has been uncompressed, we convert the string into values of the correct type using the convert<T> function. Notice that this function is implicitly instantiated via output.push_back(convert<T>(str)). Template specialization is needed at this stage to make sure the right work is done during the conversion of each type. To see why this is not always a good idea, see the article “Why Not Specialize Function Templates?”. Care is needed to make sure that we don’t try to explicitly instantiate convert<T> elsewhere; modern compilers will probably throw an error if that is attempted.

##### The convert<T> template specializations

We will define two specializations here; the first function deals with properties such as particle ID while the second deals with vector properties such as position and force.
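A minimal sketch of the two specializations, assuming a hypothetical three-component Vec type in place of the simulation's own vector class:

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical stand-in for the simulation's own 3-component vector type.
struct Vec {
  double x = 0.0, y = 0.0, z = 0.0;
};

// Primary template is declared but never defined: every use must match
// one of the explicit specializations below, or the build fails to link.
template <typename T>
T convert(const std::string& str);

// Scalar properties such as the particle ID.
template <>
std::size_t convert<std::size_t>(const std::string& str) {
  std::istringstream in(str);
  std::size_t id = 0;
  in >> id;
  return id;
}

// Vector properties such as position and force (three components).
template <>
Vec convert<Vec>(const std::string& str) {
  std::istringstream in(str);
  Vec v;
  in >> v.x >> v.y >> v.z;
  return v;
}
```

Leaving the primary template undefined turns a missing specialization into a link-time error rather than silently wrong behavior.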

That completes the implementation. To see a version of this approach in action, look at ParticleFileReader.cpp.

#### Remarks

We can see that the process of decoding and unzipping the data in the XML file is quite straightforward, though it takes a bit more effort than reading a formatted text file. However, if our data include millions of particles, and these particles have to be broadcast to several nodes of a multiprocessor system, compression can save us not only a lot of communication time during simulations but also disk space.

In the next article, we will explore some more aspects of our particle simulation code.