Clustering Documentation - a code documentation for the clustering package

Generate Free Energy and Nearest Neighbors

Introduction

Other than k-means, this density-based clustering algorithms estimates the optimal seeds of the microstates, based on the free energy landscape. Regions with local high density representing local minima in the free energy landscape. Latter is calculated with $$ G_{R}(\vec{r})= - \ln (P_R(\vec{r})/P_{R}^{\text{max}}), $$ where $P_{R}(\vec{r})$ is the local population at the point $\vec{r}$ with respect to the sphere of radius $R$.

Execution

             
clustering density -f coordinate_file
                   -r/-R radius/radii
                   -p population_file
                   -d/-D free_energy_file
                   -b/-B nearest_neighbor_file
                   -n N
                   -v

Parameters

Input Parameters

Parameter	Description
$\mathtt{\mbox{-}f :}$	The name(path) of the input coordinates. The dimension should not exceed 10.
$\mathtt{\mbox{-}r :}$	The cluster radius $R$ (same unit as input) used for calculating the free energy and the nearest neighbor with lower energy. If this flag is not set the lumping radius $d_\text{lump}$ will be used, so $R=d_\text{lump}$. This will be ignored if used together with $\mathtt{\mbox{-}D}$.
$\mathtt{\mbox{-}R :}$	Same as $\mathtt{\mbox{-}r}$ but takes several radii and add those to the output names $\mathtt{\mbox{-}p}$, $\mathtt{\mbox{-}d}$. With this option $\mathtt{\mbox{-}B}$ and $\mathtt{\mbox{-}b}$ are ignored!
$\mathtt{\mbox{-}D :}$	Filename to read the free energies from. This should be used with $\mathtt{\mbox{-}B}$ and a nearest neighbor file generated with the same cluster radius, otherwise use $\mathtt{\mbox{-}b}$. If used, $\mathtt{\mbox{-}r}$/$\mathtt{\mbox{-}R}$ are ignored.
$\mathtt{\mbox{-}B :}$	Filename to read the nearest neighbors from. This should be used with $\mathtt{\mbox{-}D}$ and a nearest neighbor file generated with the same clustering radius, or together with $\mathtt{\mbox{-}r}$ and $\mathtt{\mbox{-}d}$. This option is ignored if $\mathtt{\mbox{-}R}$ is set.

Output Parameters

Parameter	Description
$\mathtt{\mbox{-}p}$	Filename for the output file of the populations for the given radius.
$\mathtt{\mbox{-}d}$	Filename for the output file of the free energies. The same as $\mathtt{\mbox{-}D}$, but generates the free energies for the given radius.
$\mathtt{\mbox{-}b}$	Filename for the output file of the nearest neighbors. The same as $\mathtt{\mbox{-}B}$, but the nearest neighbors get generated for the given radius.

Miscellaneous Parameters

Parameter	Description
$\mathtt{\mbox{-}n}$	The number of parallel threads to use (for SMP machines). This is ignored if CUDA is used.
$\mathtt{\mbox{-}v}$	Verbose mode with some output.

Detailed Description

The algorithm can be split into several steps. The order listed here represents the order in which the code is executed.

1.: $\mathtt{\mbox{-}f}$: Read in the position of each trajectory point in the multidimensional space from the coords file
2.a: $\mathtt{\mbox{-}D}$: If this flag is provided, the free energies are read from the given file and the following points (2.b) are skipped.
2.b.1: $\mathtt{\mbox{-}r}$/$\mathtt{\mbox{-}R}$: If one of this flag is used, it loops over all passed values by setting the value as clustering radius $R$ (same dimension as input coordinates, see -f) and performs 2.b.2-2.b.3. Otherwise, the clustering radius will be set as the lumping radius, so $R=d_\text{lump}$.
2.b.2: $\mathtt{\mbox{-}p}$: For each point $\vec{r}_i$ in the trajectory, count the number of points (neighbourhood population) $P_{R}(\vec{r}_i)$ within a hypersphere with radius $R$ around the point $\vec{r}_i$. This can be expressed as $$ P_{R}(\vec{r}_i)= \sum_{j\neq i}^{N} \theta (R- d(\vec{r}_j ,\vec{r}_i)) $$ where $N$ is the total number of trajectory points, $\theta$ is the heavyside step function, and $d(\vec{r}_j ,\vec{r}_i)$ is the distance between point $\vec{r}_j$ and the center of the hypersphere $\vec{r}_i$ (measured by the N-dimensional euclidean norm).
This generates the population output file, here stored with the given output name $\mathtt{\mbox{-}p}$. It is a single column file with the population = number of points within the hypersphere for each trajectory frame as values.
2.b.3: $\mathtt{\mbox{-}d}$: Calculate the free energy $G_R (\textbf{r})/\text{k}T$ for each point as $$ G_{R}(\textbf{r})= - \ln (P_R(\textbf{r})/P_{R}^{\text{max}}), $$ where $P_{R}^{\text{max}}$ is the highest maximum population of any point in the trajectory; this ensures that the frame with highest point density will be assigned the lowest free energy $F_{R}(\textbf{r})=0$.
This will be stored in the output name given by $\mathtt{\mbox{-}d}$ in a single column file with a free energy value for each frame of the trajectory.
3.a: $\mathtt{\mbox{-}B}$: If this flag is provided, the nearest neighbors and the nearest neighbors with lower free energies are read from the given file and the following point (3.b) is skipped. This option can not be combined with $\mathtt{\mbox{-}R}$.
3.b: $\mathtt{\mbox{-}b}$: For each frame of the trajectory, find the closest point (called nearest neighbor, abbreviated $\text{nn}$) and the squared distance to it $d^{2}_\text{nn}$, as well as the nearest neighbor that has a higher density/ lower free energy (abbreviated $\text{nn}_\text{hd}$) and the distance to it. It should be noted that the latter depends indirectly on the choice of the clustering Radius $R$, due to the dependence on the free energies. The distance is again the euclidian distance in N dimensions. This generates the output file $\mathtt{\mbox{-}b}$, a four column file with the information:

$\text{ID of }\text{nn} \quad\quad d^{2}_\text{nn} \quad\quad \text{ID of }\text{nn}_\text{hd} \quad\quad d^{2}_{\text{nn}_\text{hd}}$
The ID of a frame corresponds to its row in the trajectory file. This option can not be combined with $\mathtt{\mbox{-}R}$.

Parameter	Description
\(\mathtt{\mbox{-}f :}\)	The name(path) of the input coordinates. The dimension should not exceed 10.
\(\mathtt{\mbox{-}r :}\)	The cluster radius \(R\) (same unit as input) used for calculating the free energy and the nearest neighbor with lower energy. If this flag is not set the lumping radius \(d_\text{lump}\) will be used, so \(R=d_\text{lump}\). This will be ignored if used together with \(\mathtt{\mbox{-}D}\).
\(\mathtt{\mbox{-}R :}\)	Same as \(\mathtt{\mbox{-}r}\) but takes several radii and add those to the output names \(\mathtt{\mbox{-}p}\), \(\mathtt{\mbox{-}d}\). With this option \(\mathtt{\mbox{-}B}\) and \(\mathtt{\mbox{-}b}\) are ignored!
\(\mathtt{\mbox{-}D :}\)	Filename to read the free energies from. This should be used with \(\mathtt{\mbox{-}B}\) and a nearest neighbor file generated with the same cluster radius, otherwise use \(\mathtt{\mbox{-}b}\). If used, \(\mathtt{\mbox{-}r}\)/\(\mathtt{\mbox{-}R}\) are ignored.
\(\mathtt{\mbox{-}B :}\)	Filename to read the nearest neighbors from. This should be used with \(\mathtt{\mbox{-}D}\) and a nearest neighbor file generated with the same clustering radius, or together with \(\mathtt{\mbox{-}r}\) and \(\mathtt{\mbox{-}d}\). This option is ignored if \(\mathtt{\mbox{-}R}\) is set.

Parameter	Description
\(\mathtt{\mbox{-}p}\)	Filename for the output file of the populations for the given radius.
\(\mathtt{\mbox{-}d}\)	Filename for the output file of the free energies. The same as \(\mathtt{\mbox{-}D}\), but generates the free energies for the given radius.
\(\mathtt{\mbox{-}b}\)	Filename for the output file of the nearest neighbors. The same as \(\mathtt{\mbox{-}B}\), but the nearest neighbors get generated for the given radius.

Parameter	Description
\(\mathtt{\mbox{-}n}\)	The number of parallel threads to use (for SMP machines). This is ignored if CUDA is used.
\(\mathtt{\mbox{-}v}\)	Verbose mode with some output.