Clustering Documentation - a code documentation for the clustering package

Generate Clusters via Free Energy Screening

Introduction

In this step all geometrically connected regions are identified. Those are determined for all points below a free energy threshold \(\tilde{G}\). This is repeated consequently for all values threshold specified with \(\mathtt{\mbox{-}T}\).

Execution

             
clustering density -f coordinate_file
                   -T from_fe step_fe [to_fe]
                   -r radius
                   -d/-D free_energy_file
                   -b/-B nearest_neighbor_file
                   -o output_basename
                   -n N
                   -v

Parameters

Input Parameters

Parameter	Description
\(\mathtt{\mbox{-}f :}\)	The name(path) of the input coordinates. The dimension should not exceed 10.
\(\mathtt{\mbox{-}T :}\)	The screening of the free energy landscape. format: FROM STEP TO; e.g.: \(\mathtt{\mbox{-}T\ 0.1\ 0.1\ 11.1}\). Set \(\mathtt{\mbox{-}T\ \mbox{-}1}\) for default values: FROM \(=0.1\), STEP \(=0.1\), TO \(=G_\text{max}\). If TO is not given, e.g.: '\(\mathtt{\mbox{-}T\ 0.2\ 0.4}\)', than TO \(=G_\text{max}\) to start at \(0.2\) and go to \(G_\text{max}\) at steps of \(0.4\).
\(\mathtt{\mbox{-}r :}\)	The cluster radius \(R\) (same unit as input) used for calculating the free energy and the nearest neighbor with lower energy. If this flag is not set the lumping radius \(d_\text{lump}\) will be used, so \(R=d_\text{lump}\). This will be ignored if used together with \(\mathtt{\mbox{-}D}\).
\(\mathtt{\mbox{-}D :}\)	Filename to read the free energies from. This should be used with \(\mathtt{\mbox{-}B}\) and a nearest neighbor file generated with the same cluster radius, otherwise use \(\mathtt{\mbox{-}b}\). If used, \(\mathtt{\mbox{-}r}\) is ignored.
\(\mathtt{\mbox{-}B :}\)	Filename to read the nearest neighbors from. This should be used with \(\mathtt{\mbox{-}D}\) and a nearest neighbor file generated with the same clustering radius, or together with \(\mathtt{\mbox{-}r}\) and \(\mathtt{\mbox{-}d}\). This option is ignored if \(\mathtt{\mbox{-}R}\) is set.

Output Parameters

Parameter	Description
\(\mathtt{\mbox{-}o}\)	This is the basename of the cluster output files. For each free energy threshold a file with the found clusters is saved, e.g. for \(\Delta \tilde{G}=2.6\) a file named \(\mathtt{basename.2.60}\) is generated. \(0\) corresponds to frames with a higher free energy than the current threshold \(\Delta \tilde{G}\).
\(\mathtt{\mbox{-}p}\)	Filename for the output file of the populations for the given radius. This option is ignored if not given together with \(\mathtt{\mbox{-}d}\).
\(\mathtt{\mbox{-}d}\)	Filename for the output file of the free energies. The same as \(\mathtt{\mbox{-}D}\), but generates the free energies for the given radius.
\(\mathtt{\mbox{-}b}\)	Filename for the output file of the nearest neighbors. The same as \(\mathtt{\mbox{-}B}\), but the nearest neighbors get generated for the given radius.

Miscellaneous Parameters

Parameter	Description
\(\mathtt{\mbox{-}n}\)	The number of parallel threads to use (for SMP machines). This is ignored if CUDA is used.
\(\mathtt{\mbox{-}v}\)	Verbose mode with some output.

Detailed Description

Before we explain the procedure, we want to note here that clusters do not correspond to states. Clusters just separate the space into geometrically connected regions. Those are per construction separated at the highest available free energy, while states are separated at all barriers. Therefore, gathers the major cluster normally up to \(90\%\) of the total population.

Given the readily computed free energies (\(\mathtt{\mbox{-}d}\)) and neighborhoods (\(\mathtt{\mbox{-}b}\)) per frame for the selected radius, we continue the clustering by screening the free energy landscape. This means, we choose a (initially low) free energy cutoff \(\tilde{G}\), select all frames below and combine them into clusters, if they are geometrically connected with respect to the lumping distance \(d_\text{lump}=2\sigma\), which is automatically computed for the given data set. \(\sigma\) denotes here the standard deviation of the nearest neighbor distances. An illustration is provided in the following figures. If the threshold raises previous clusters can be merged.

This is done repeatedly, with increasing free energy cutoffs. The end result will be multiple plain-text files, one for each cutoff, that define cluster trajectories. Their format is a single column of integers, which act as a membership id of the given frame (identified by the row number) to a certain cluster. For each file, the microstate ids start with the value 1 and are numbered incrementally. All frames above the given free energy threshold are denoted by \(0\), which acts as a kind of melting pot for frames with unknown affiliation.

Do not worry if there are still frames denoted by \(0\) in the cluster file of highest free energy. The clustering program rounds the maximum free energy to 2 decimal places for screening, which may result in a cutoff that is slightly lower than the true maximum. Thus there may be frames left, which are not assigned to a cluster since they are not below the cutoff.

However, since we are going to generate states from selecting the basins of the clusters and 'filling up' the free energy landscape from bottom to top, these states are not interesting on their own, anyhow, and will be assigned to a distinct state in the next step.