mosaic
MoSAIC is a Python package for analyzing discrete time series data from Molecular Dynamics (MD) simulations. Its purpose is to identify collective coordinates that describe the same biomolecular process.
The package calculates similarity measures between coordinates, such as linear correlation and mutual information, and applies clustering algorithms to the resulting similarity matrix. The clusters are groups of coordinates that move in a concerted manner, that is, groups that collectively describe the same process in the MD simulation. This makes large and complex datasets easier to interpret.
MoSAIC can be used as a stand-alone analysis tool or as a preprocessing step for feature selection for subsequent Markov state modeling. It is structured into the following submodules:
- similarity: This submodule provides a class for calculating similarity measures based on different correlation metrics, such as the absolute value of the Pearson correlation or different normalizations of mutual information. The result is always a similarity matrix with values between 0 and 1. The submodule also supports memory-efficient computation and flexible normalization options for the mutual-information-based measures.
- clustering: This submodule is the central component of MoSAIC. It provides different modes for clustering a correlation matrix, including the Leiden algorithm with different objective functions, linkage clustering, and k-medoids, and supports weighted and unweighted as well as full and sparse graphs. The resulting clusters and labels can be accessed through the attributes of the class.
- gridsearch: This submodule provides a class for performing grid search cross validation. It evaluates different combinations of parameter settings for a clustering model and reports evaluation metrics for each combination. The best combination of parameters and the corresponding model can be retrieved from the provided attributes.
- utils: This submodule provides utility functions for storing and loading data and for providing runtime user information.
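The two central steps, similarity followed by clustering, can be illustrated without MoSAIC itself. The following is a minimal NumPy sketch, not the package's implementation: `abs_pearson_similarity` and `threshold_groups` are hypothetical helper names, and the grouping step uses connected components of a thresholded graph as a deliberately simple stand-in for the Leiden algorithm.

```python
import numpy as np


def abs_pearson_similarity(traj):
    """Return |Pearson correlation| between the columns of traj.

    traj has shape (n_samples, n_features); the result has shape
    (n_features, n_features) with values in [0, 1].
    """
    return np.abs(np.corrcoef(traj, rowvar=False))


def threshold_groups(sim, threshold):
    """Group features via connected components of the graph sim > threshold.

    Stand-in for a real community detection method (not Leiden).
    """
    n_features = sim.shape[0]
    labels = [-1] * n_features
    current = 0
    for start in range(n_features):
        if labels[start] != -1:
            continue  # already assigned to a group
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n_features):
                if labels[j] == -1 and sim[i, j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Four coordinates: two sine-like and two cosine-like signals.
x = np.linspace(0, 20, 1000)
traj = np.array([np.sin(x), np.sin(x + 0.1), np.cos(x), np.cos(x + 0.1)]).T

sim = abs_pearson_similarity(traj)
labels = threshold_groups(sim, threshold=0.5)
print(labels)
```

The two sine-like coordinates end up in one group and the two cosine-like coordinates in another, which is the kind of grouping the clustering submodule is meant to find with more robust algorithms.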
Clustering(*, mode='CPM', weighted=True, n_neighbors=None, resolution_parameter=None, n_clusters=None, seed=None)
Bases: ClusterMixin, BaseEstimator
Class for clustering a correlation matrix.
Parameters:
- mode (str, default: 'CPM') – The clustering mode. It selects either the quality function optimized by the Leiden algorithm ('CPM' or 'modularity') or an alternative clustering method ('linkage' or 'kmedoids'):
    - 'CPM': uses the constant Potts model on the full, weighted graph
    - 'modularity': uses modularity on a knn-graph
    - 'linkage': uses complete-linkage clustering
    - 'kmedoids': uses k-medoids clustering
- weighted (bool, default: True) – If True, the underlying graph has weighted edges. Otherwise, the graph is constructed using the adjacency matrix.
- n_neighbors (int, default: None) – Specifies whether the whole matrix or a knn-graph is used; the latter reduces the required memory. The default depends on the mode:
    - 'CPM': None uses the full graph
    - 'modularity': None uses the square root of the number of features
- resolution_parameter (float, default: None) – Required for modes 'CPM' and 'linkage'. If None, the resolution parameter is set to the third quartile of X for n_neighbors=None, and otherwise to the mean value of the knn graph.
- n_clusters (int, default: None) – Required for mode 'kmedoids'. The number of medoids, which determines the number of resulting clusters.
- seed (int, default: None) – Use an integer to make the randomness of Leidenalg deterministic. If nothing is specified, a random seed is used.
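To make the `n_neighbors` option more concrete, here is a minimal sketch of how a similarity matrix could be reduced to a knn-graph. It is an illustration under stated assumptions, not MoSAIC's code: `knn_graph` is a hypothetical helper, the square root is rounded down, and an edge is kept if either of its two endpoints selects it.

```python
import numpy as np


def knn_graph(sim, n_neighbors=None):
    """Keep, for every feature, only its n_neighbors most similar partners."""
    sim = np.asarray(sim, dtype=float)
    n_features = sim.shape[0]
    if n_neighbors is None:
        # Default described for mode 'modularity': square root of the
        # number of features (rounding down is an assumption).
        n_neighbors = int(np.sqrt(n_features))

    off_diag = sim.copy()
    np.fill_diagonal(off_diag, -np.inf)  # a feature is not its own neighbor

    # Column indices of the n_neighbors largest similarities in each row.
    idx = np.argsort(off_diag, axis=1)[:, -n_neighbors:]
    mask = np.zeros(sim.shape, dtype=bool)
    mask[np.arange(n_features)[:, None], idx] = True
    mask |= mask.T  # symmetrize (assumption: union of both directions)

    # All edges outside the knn-graph get weight zero.
    return np.where(mask, sim, 0.0)


sim = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
graph_k1 = knn_graph(sim, n_neighbors=1)
graph_default = knn_graph(sim)  # uses int(sqrt(4)) = 2 neighbors
```

With one neighbor per feature only the two strongest pairs survive, so the graph has far fewer edges than the full matrix, which is where the memory saving comes from.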
Attributes:
- clusters_ (ndarray of shape (n_clusters, )) – The result of the clustering process. A list of arrays, each containing all indices (features) corresponding to one cluster.
- labels_ (ndarray of shape (n_features, )) – Labels of each feature.
- matrix_ (ndarray of shape (n_features, n_features)) – Permuted matrix according to the determined clusters.
- ticks_ (ndarray of shape (n_clusters, )) – The cumulative number of features belonging to the clusters. May be used as ticks for plotting matrix_.
- permutation_ (ndarray of shape (n_features, )) – Permutation of the input features (corresponds to flattened clusters_).
- n_neighbors_ (int) – Only available when using a knn graph. Indicates the number of nearest neighbors used for constructing the knn-graph.
- resolution_param_ (float) – Only for modes 'CPM' and 'linkage'. Indicates the resolution parameter used for the clustering.
- linkage_matrix_ (ndarray of shape (n_clusters - 1, 4)) – Only for mode 'linkage'. Contains the hierarchical clustering encoded as a linkage matrix, see scipy.cluster.hierarchy.linkage.
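The attributes are closely related: given the clusters, the permutation, ticks, labels, and permuted matrix follow directly. The sketch below derives them with plain NumPy for the small 3x3 matrix used in this class's Examples; it only illustrates the relationships described above and is not taken from the MoSAIC source.

```python
import numpy as np

mat = np.array([[1.0, 0.1, 0.9], [0.1, 1.0, 0.1], [0.9, 0.1, 1.0]])
clusters = [[2, 0], [1]]  # same content as clusters_ in the class example

# permutation_: the flattened clusters
permutation = np.concatenate([np.asarray(c) for c in clusters])

# ticks_: cumulative number of features per cluster
ticks = np.cumsum([len(c) for c in clusters])

# matrix_: rows and columns reordered so that clusters form diagonal blocks
matrix = mat[np.ix_(permutation, permutation)]

# labels_: cluster index of every feature
labels = np.empty(mat.shape[0], dtype=int)
for label, members in enumerate(clusters):
    labels[members] = label

print(permutation, ticks, labels)
print(matrix)
```

The reordered matrix is block-diagonal in the sense that the strongly correlated features 2 and 0 sit next to each other, and the ticks mark where one block ends and the next begins.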
Examples:
>>> import mosaic
>>> import numpy as np
>>> mat = np.array([[1.0, 0.1, 0.9], [0.1, 1.0, 0.1], [0.9, 0.1, 1.0]])
>>> clust = mosaic.Clustering()
>>> clust.fit(mat)
Clustering(resolution_parameter=0.7)
>>> clust.matrix_
array([[1. , 0.9, 0.1],
       [0.9, 1. , 0.1],
       [0.1, 0.1, 1. ]])
>>> clust.clusters_
array([list([2, 0]), list([1])], dtype=object)
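The value `resolution_parameter=0.7` in the output above can be reproduced by hand. The sketch below assumes that the third quartile is taken over the off-diagonal entries of the matrix; this assumption matches the printed value for this example, but it is inferred from the output and not confirmed by the documentation.

```python
import numpy as np

mat = np.array([[1.0, 0.1, 0.9], [0.1, 1.0, 0.1], [0.9, 0.1, 1.0]])

# Assumption: the trivial self-similarities on the diagonal are excluded.
off_diagonal = mat[~np.eye(mat.shape[0], dtype=bool)]
resolution = np.percentile(off_diagonal, 75)

# For comparison: including the diagonal gives a different value.
resolution_with_diagonal = np.percentile(mat, 75)

print(resolution, resolution_with_diagonal)
```

Excluding the diagonal yields approximately 0.7, whereas including it yields 1.0.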
Initialize Clustering class.
Source code in src/mosaic/clustering.py
fit(X, y=None)
Clusters the correlation matrix by Leiden clustering on a graph.
Parameters:
- X (ndarray of shape (n_features, n_features)) – Matrix containing the correlation metric which is clustered. The values should lie in [0, 1], where 1 means completely correlated and 0 no correlation.
- y (Ignored, default: None) – Not used, present for scikit API consistency by convention.
Returns:
- self (object) – Fitted estimator.
Source code in src/mosaic/clustering.py
fit_predict(X, y=None)
Clusters the correlation matrix by Leiden clustering on a graph.
Parameters:
- X (ndarray of shape (n_features, n_features)) – Matrix containing the correlation metric which is clustered. The values should lie in [0, 1], where 1 means completely correlated and 0 no correlation.
- y (Ignored, default: None) – Not used, present for scikit API consistency by convention.
Returns:
- labels (ndarray of shape (n_samples,)) – Cluster labels.
Source code in src/mosaic/clustering.py
score(X, y=None, sample_weight=None)
Estimate silhouette_score of new correlation matrix.
Parameters:
- X (ndarray of shape (n_features, n_features)) – New matrix containing the correlation metric to score. The values should lie in [0, 1], where 1 means completely correlated and 0 no correlation.
- y (Ignored, default: None) – Not used, present for scikit API consistency by convention.
- sample_weight (Optional[ndarray], default: None) – Not used, present for scikit API consistency by convention.
Returns:
- score (float) – Silhouette score of new correlation matrix based on fitted labels.
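For intuition, a silhouette score can be computed from a similarity matrix and a set of labels in a few lines of NumPy. The sketch below is not MoSAIC's implementation: `silhouette_from_similarity` is a hypothetical helper, and converting similarities to distances via `1 - similarity` is an assumption. It requires at least two clusters.

```python
import numpy as np


def silhouette_from_similarity(sim, labels):
    """Mean silhouette coefficient, using 1 - similarity as distance."""
    dist = 1.0 - np.asarray(sim, dtype=float)  # assumption: distance = 1 - sim
    labels = np.asarray(labels)
    unique = np.unique(labels)
    scores = np.zeros(len(labels))
    for i, own in enumerate(labels):
        others_in_own = labels == own
        others_in_own[i] = False  # exclude the feature itself
        if not others_in_own.any():
            continue  # singleton cluster: silhouette is 0 by convention
        # a: mean distance to the other members of the own cluster
        a = dist[i, others_in_own].mean()
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean() for other in unique if other != own)
        denom = max(a, b)
        if denom > 0:
            scores[i] = (b - a) / denom
    return float(scores.mean())


sim = np.array([
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
])
good = silhouette_from_similarity(sim, [0, 0, 1, 1])
bad = silhouette_from_similarity(sim, [0, 1, 0, 1])
print(good, bad)
```

Labels that match the block structure of the matrix give a score close to 1, while labels that split the blocks give a negative score, which is why the score is usable as a selection criterion in a grid search.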
Source code in src/mosaic/clustering.py
GridSearchCV(*, similarity, clustering, param_grid, gridsearch_kwargs={})
Bases: GridSearchCV
Class for grid search cross validation.
Parameters:
- similarity (Similarity) – Similarity instance set up with constant parameters, see mosaic.Similarity for available parameters. low_memory is not supported.
- clustering (Clustering) – Clustering instance set up with constant parameters, see mosaic.Clustering for available parameters.
- param_grid (dict) – Dictionary with parameter names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored.
- gridsearch_kwargs (dict, default: {}) – Dictionary with parameters to be used for the sklearn.model_selection.GridSearchCV class. The parameter estimator is not supported, and param_grid needs to be passed directly to the class.
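How a `param_grid` spans the candidate settings can be sketched with the standard library alone. The helper `expand_grid` below is hypothetical and only mimics the behavior described above; the `clust__` prefix is taken from the output of the class example, and applying it to every key is an assumption.

```python
from itertools import product


def expand_grid(param_grid, prefix="clust__"):
    """Yield one dict per parameter combination.

    Accepts a single dict or a list of dicts; each dict spans its own grid.
    """
    grids = [param_grid] if isinstance(param_grid, dict) else list(param_grid)
    for grid in grids:
        names = sorted(grid)
        # Cartesian product over the value lists of this grid
        for values in product(*(grid[name] for name in names)):
            yield {f"{prefix}{name}": value for name, value in zip(names, values)}


single = list(expand_grid({"resolution_parameter": [0.05, 0.2]}))
combined = list(expand_grid([
    {"mode": ["CPM"], "resolution_parameter": [0.05, 0.2, 0.5]},
    {"mode": ["kmedoids"], "n_clusters": [2, 3]},
]))
print(single)
print(len(combined))
```

A list of dictionaries is useful when parameters only make sense together, for example `resolution_parameter` with mode 'CPM' and `n_clusters` with mode 'kmedoids'.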
Attributes:
- cv_results_ (dict of numpy (masked) ndarrays) – A dict with keys as column headers and values as columns.
- best_estimator_ (estimator) – Estimator that was chosen by the search, i.e. the estimator which gave the highest score (or smallest loss if specified) on the left-out data.
- best_score_ (float) – Mean cross-validated score of the best_estimator.
- best_params_ (dict) – Parameter setting that gave the best results on the hold-out data.
- best_index_ (int) – The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.
- n_splits_ (int) – The number of cross-validation splits (folds/iterations).
Notes
Check out sklearn.model_selection.GridSearchCV for an overview of all available attributes and more detailed description.
Examples:
>>> import mosaic
>>> import numpy as np
>>> # create two correlated data sets
>>> traj = np.array([
...     func(np.linspace(0, 20, 1000))
...     for func in (
...         np.sin,
...         lambda x: np.sin(x + 0.1),
...         np.cos,
...         lambda x: np.cos(x + 0.1),
...     )
... ]).T
>>> search = mosaic.GridSearchCV(
...     similarity=mosaic.Similarity(),
...     clustering=mosaic.Clustering(),
...     param_grid={'resolution_parameter': [0.05, 0.2]},
... )
>>> search.fit(traj)
GridSearchCV(clustering=Clustering(),
             param_grid={'clust__resolution_parameter': [0.05, 0.2]},
             similarity=Similarity())
>>> search.best_params_
{'clust__resolution_parameter': 0.2}
>>> search.best_estimator_
Pipeline(steps=[('sim', Similarity()),
                ('clust', Clustering(resolution_parameter=0.2))])
Initialize GridSearchCV class.
Source code in src/mosaic/gridsearch.py
fit(X, y=None)
Clusters the correlation matrix by Leiden clustering on a graph.
Parameters:
- X (ndarray of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
- y (Ignored, default: None) – Not used, present for scikit API consistency by convention.
Returns:
- self (object) – Fitted estimator.
Source code in src/mosaic/gridsearch.py
Similarity(*, metric='correlation', low_memory=False, normalize_method=None, use_knn_estimator=False)
Bases: BaseEstimator
Class for calculating the similarity measure.
Parameters:
- metric (str, default: 'correlation') – The correlation metric to use for the feature distance matrix:
    - 'correlation' uses the absolute value of the Pearson correlation
    - 'NMI' uses the mutual information normalized by the joint entropy
    - 'GY' uses the Gel'fand and Yaglom normalization[^1]
    - 'JSD' uses the Jensen-Shannon divergence between the joint probability distribution and the product of the marginal probability distributions to calculate their dissimilarity

  Note: 'NMI' is supported only with low_memory=False.
- low_memory (bool, default: False) – If True, the input X of fit needs to be a file name and the correlation is calculated on the fly. Otherwise, an array is assumed as input X.
- normalize_method (str, default: 'geometric') – Only required for metric 'NMI'. Determines the normalization factor for the mutual information:
    - 'joint' is the joint entropy
    - 'max' is the maximum of the individual entropies
    - 'arithmetic' is the mean of the individual entropies
    - 'geometric' is the square root of the product of the individual entropies
    - 'min' is the minimum of the individual entropies
- use_knn_estimator (bool, default: False) – Can only be set for metric 'GY'. If True, the mutual information is estimated by a parameter-free method based on entropy estimation from k-nearest-neighbor distances[^3]. This considerably increases the computational time and is thus only advisable for relatively small data sets.
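The five normalization options differ only in the denominator applied to the mutual information. The following sketch illustrates this with a simple histogram estimate; it is not MoSAIC's estimator, and `normalized_mutual_information`, the choice of 10 bins, and the use of natural logarithms are assumptions made for illustration.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a probability array, ignoring zeros."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def normalized_mutual_information(x, y, bins=10, normalize_method="geometric"):
    """Histogram-based mutual information divided by the chosen entropy."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginal distributions
    hx, hy, hxy = entropy(px), entropy(py), entropy(pxy.ravel())
    mutual_info = hx + hy - hxy
    norm = {
        "joint": hxy,
        "max": max(hx, hy),
        "arithmetic": 0.5 * (hx + hy),
        "geometric": np.sqrt(hx * hy),
        "min": min(hx, hy),
    }[normalize_method]
    return mutual_info / norm


rng = np.random.default_rng(0)
x = rng.normal(size=5000)
noisy = x + 0.5 * rng.normal(size=5000)   # correlated with x
independent = rng.normal(size=5000)       # unrelated to x

methods = ["min", "geometric", "arithmetic", "max", "joint"]
nmi_noisy = [normalized_mutual_information(x, noisy, normalize_method=m) for m in methods]
nmi_self = [normalized_mutual_information(x, x, normalize_method=m) for m in methods]
nmi_indep = normalized_mutual_information(x, independent)
print(dict(zip(methods, nmi_noisy)))
```

Because min ≤ geometric mean ≤ arithmetic mean ≤ max ≤ joint entropy, the normalized values decrease in that order for the same pair of coordinates, while a coordinate compared with itself gives 1 for every option.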
Attributes:
- matrix_ (ndarray of shape (n_features, n_features)) – The correlation-measure-based pairwise distance matrix of the data. Its values lie in [0, 1].
Examples:
>>> import mosaic
>>> import numpy as np
>>> x = np.linspace(0, np.pi, 1000)
>>> data = np.array([np.cos(x), np.cos(x + np.pi / 6)]).T
>>> sim = mosaic.Similarity()
>>> sim.fit(data)
Similarity()
>>> sim.matrix_
array([[1.       , 0.9697832],
       [0.9697832, 1.       ]])
Notes
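The Pearson correlation coefficient is defined as
\[\rho_{X,Y} = \frac{\langle (X - \mu_X)(Y - \mu_Y) \rangle}{\sigma_X \sigma_Y},\]
where \(\mu\) and \(\sigma\) denote the mean and the standard deviation of the respective coordinate.
For the online (low memory) option the Welford algorithm[^2] is used.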
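The idea of the online variant is that means and co-moments are updated sample by sample, so the full trajectory never has to be held in memory. The sketch below is a generic one-pass Welford-style update for the correlation matrix, written for illustration; `online_correlation` is a hypothetical name, and it iterates over rows of an array rather than reading from a file as the low-memory option does.

```python
import numpy as np


def online_correlation(rows):
    """Pearson correlation matrix from an iterable of samples, in one pass."""
    n_seen = 0
    mean = None
    comoment = None  # running sum of (x - mean)(x - mean)^T
    for row in rows:
        x = np.asarray(row, dtype=float)
        if mean is None:
            mean = np.zeros_like(x)
            comoment = np.zeros((x.size, x.size))
        n_seen += 1
        delta = x - mean                       # deviation from the old mean
        mean += delta / n_seen                 # updated mean
        comoment += np.outer(delta, x - mean)  # old deviation times new deviation
    std = np.sqrt(np.diag(comoment))
    return comoment / np.outer(std, std)


rng = np.random.default_rng(1)
data = rng.normal(size=(500, 3))
data[:, 1] += data[:, 0]  # make the first two columns correlated

corr_online = online_correlation(iter(data))
corr_batch = np.corrcoef(data, rowvar=False)
print(np.abs(corr_online - corr_batch).max())
```

The one-pass result agrees with the batch computation up to floating-point error, while only the current sample, the running mean, and the co-moment matrix are stored.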
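The Jensen-Shannon divergence is defined as
\[D_{\text{JS}} = \frac{1}{2} D_{\text{KL}}\big(p(x,y) \,\|\, M\big) + \frac{1}{2} D_{\text{KL}}\big(p(x)p(y) \,\|\, M\big),\]
where \(M = \frac{1}{2} [p(x,y) + p(x)p(y)]\) is an averaged probability distribution and \(D_{\text{KL}}\) denotes the Kullback-Leibler divergence.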
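As an illustration, the Jensen-Shannon divergence between the joint distribution and the product of the marginals can be estimated from a two-dimensional histogram. In the sketch below both distributions are compared against their average M, as in the definition of M above. The helper names, the 10 bins, and the base-2 logarithm (which bounds the result by 1) are assumptions; this is not MoSAIC's estimator.

```python
import numpy as np


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits, ignoring p == 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def jensen_shannon_divergence(x, y, bins=10):
    """JSD between p(x, y) and p(x)p(y), estimated from a 2D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    product = np.outer(pxy.sum(axis=1), pxy.sum(axis=0))  # p(x) p(y)
    m = 0.5 * (pxy + product)                              # averaged distribution M
    return 0.5 * kl_divergence(pxy, m) + 0.5 * kl_divergence(product, m)


rng = np.random.default_rng(2)
x = rng.normal(size=5000)
jsd_dependent = jensen_shannon_divergence(x, x + 0.1 * rng.normal(size=5000))
jsd_independent = jensen_shannon_divergence(x, rng.normal(size=5000))
print(jsd_dependent, jsd_independent)
```

For independent coordinates the joint distribution factorizes and the divergence is close to 0; the stronger the dependence, the larger the value, which is what makes it usable as a similarity measure.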
[^1]: Gel'fand, I.M. and Yaglom, A.M. (1957). "Calculation of amount of information about a random function contained in another such function". American Mathematical Society Translations, series 2, 12, pp. 199–246.
[^2]: Welford algorithm, generalized to correlation. Taken from: Knuth, D.E. (1998). "The Art of Computer Programming", volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.
[^3]: Ross, B.C. (2014). "Mutual Information between Discrete and Continuous Data Sets". PLoS ONE 9(2).
Initialize Similarity class.
Source code in src/mosaic/similarity.py
fit(X, y=None)
Compute the correlation/nmi distance matrix.
Parameters:
- X (ndarray of shape (n_samples, n_features) or str if low_memory=True) – Training data.
- y (Ignored, default: None) – Not used, present for scikit API consistency by convention.
Returns:
- self (object) – Fitted estimator.
Source code in src/mosaic/similarity.py
fit_transform(X, y=None)
Compute the correlation/nmi distance matrix and return it.
Parameters:
- X (ndarray of shape (n_samples, n_features) or str if low_memory=True) – Training data.
- y (Ignored, default: None) – Not used, present for scikit API consistency by convention.
Returns:
- Similarity (ndarray of shape (n_features, n_features)) – Similarity matrix.
Source code in src/mosaic/similarity.py
transform(X)
Compute the correlation/nmi distance matrix and return it.
Parameters:
- X (ndarray of shape (n_samples, n_features) or str if low_memory=True) – Training data.
Returns:
- Similarity (ndarray of shape (n_features, n_features)) – Similarity matrix.
Source code in src/mosaic/similarity.py