mosaic

MoSAIC is an advanced Python package specifically designed for the analysis of discrete time series data from Molecular Dynamics (MD) simulations. It offers a wide range of capabilities to identify collective coordinates that describe the same biomolecular process.

With MoSAIC, researchers and engineers can analyze large and complex datasets to gain insight into the underlying dynamics of biomolecular processes. The package calculates various similarity measures, such as linear correlation and mutual information, and applies different clustering algorithms to find groups of coordinates that move in a concerted manner and therefore collectively describe the same process.

MoSAIC can be used as a stand-alone analysis tool or as a preprocessing step for feature selection for subsequent Markov state modeling. It is structured into the following submodules:

- similarity: This submodule introduces a versatile class for calculating similarity measures based on different correlation metrics. Users can choose from a set of popular metrics, such as the absolute value of the Pearson correlation or different normalizations of mutual information. The result is always a similarity matrix with values between 0 and 1. The submodule also supports efficient memory management and flexible normalization options for mutual-information-based measures.
- clustering: This submodule is the central component of MoSAIC and offers various techniques for analyzing similarity matrices. It provides different modes for clustering a correlation matrix, including the Leiden algorithm with different objective functions and linkage clustering, and supports both weighted and unweighted, as well as full and sparse graphs. The resulting clusters and labels can be accessed through the attributes of the class.
- gridsearch: This submodule provides a class for performing grid search cross validation. It allows users to explore different combinations of parameter settings for a clustering model and provides evaluation metrics for each combination. The best combination of parameters and the corresponding model can be retrieved through the provided attributes.
- utils: This submodule provides utility functions to store and load data and to provide runtime user information.

`Clustering(*, mode='CPM', weighted=True, n_neighbors=None, resolution_parameter=None, seed=None)`

Bases: ClusterMixin, BaseEstimator

Class for clustering a correlation matrix.

Parameters:

mode (str, default: 'CPM' ) –

the mode which determines the quality function optimized by the Leiden algorithm ('CPM' or 'modularity') or linkage clustering.
- 'CPM': will use the constant Potts model on the full, weighted graph
- 'modularity': will use modularity on a knn-graph
- 'linkage': will use complete-linkage clustering
weighted (bool, default: True ) –

If True, the underlying graph has weighted edges. Otherwise, the graph is constructed using the adjacency matrix.
n_neighbors (int, default: None ) –

This parameter specifies whether the whole matrix should be used, or a knn-graph, which reduces the required memory. The default depends on the mode:
- 'CPM': None uses the full graph
- 'modularity': None uses the square root of the number of features
resolution_parameter (float, default: None ) –

Only used for modes 'CPM' and 'linkage'. If None, the resolution parameter is set to the third quartile of X for n_neighbors=None, and otherwise to the mean value of the knn graph.
seed (int, default: None ) –

Use an integer to make the randomness of Leidenalg deterministic. If None, a random seed is used.
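
As a rough illustration of the default resolution parameter described above, the following sketch reproduces the third-quartile rule with plain NumPy for the full-graph case (n_neighbors=None). It mirrors the masking performed in `fit`, where diagonal and zero entries are ignored. The helper name `default_resolution` is hypothetical and not part of MoSAIC.

```python
import numpy as np


def default_resolution(similarity: np.ndarray) -> float:
    """Third quartile of the off-diagonal, nonzero similarities.

    Hypothetical helper, not part of MoSAIC; it only mirrors the
    default used for n_neighbors=None.
    """
    mat = np.array(similarity, dtype=float)  # work on a copy
    mat[mat == 0] = np.nan                   # ignore missing edges
    mat[np.diag_indices_from(mat)] = np.nan  # ignore self-similarity
    return float(np.nanquantile(mat, 0.75))


mat = np.array([[1.0, 0.1, 0.9], [0.1, 1.0, 0.1], [0.9, 0.1, 1.0]])
resolution = default_resolution(mat)
```

For this matrix the off-diagonal values are four times 0.1 and twice 0.9, so the linearly interpolated third quartile is 0.7, which agrees with the `resolution_parameter=0.7` shown in the class example.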

Attributes:

clusters_ (ndarray of shape (n_clusters, )) –

The result of the clustering process. A list of arrays, each containing all indices (features) corresponding to each cluster.
labels_ (ndarray of shape (n_features, )) –

Labels of each feature.
matrix_ (ndarray of shape (n_features, n_features)) –

Permuted matrix according to the determined clusters.
ticks_ (ndarray of shape (n_clusters, )) –

The cumulative number of features belonging to the clusters. May be used as ticks for plotting matrix_.
permutation_ (ndarray of shape (n_features, )) –

Permutation of the input features (corresponds to flattened clusters_).
n_neighbors_ (int) –

Only available when using a knn graph. Indicates the number of nearest neighbors used for constructing the knn-graph.
resolution_param_ (float) –

Only for mode 'CPM' and 'linkage'. Indicates the resolution parameter used for the CPM based Leiden clustering.
linkage_matrix_ (ndarray of shape (n_clusters - 1, 4)) –

Only for mode 'linkage'. Contains the hierarchical clustering encoded as a linkage matrix, see scipy.cluster.hierarchy.linkage.
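
To make the relation between these attributes concrete, the sketch below derives the permutation, the reordered matrix, the ticks and the labels from a given cluster list, using the same NumPy operations as `fit`. The cluster list `[[2, 0], [1]]` is taken from the class example and is simply assumed here rather than computed.

```python
import numpy as np

mat = np.array([[1.0, 0.1, 0.9], [0.1, 1.0, 0.1], [0.9, 0.1, 1.0]])
clusters = [[2, 0], [1]]  # assumed clustering result for this matrix

# permutation_: clusters flattened in order
permutation = np.hstack(clusters)
# matrix_: rows and columns reordered so clusters form diagonal blocks
matrix = mat[np.ix_(permutation, permutation)]
# ticks_: cumulative cluster sizes, i.e. block boundaries in matrix_
ticks = np.cumsum([len(cluster) for cluster in clusters])
# labels_: cluster index of every original feature
labels = np.empty_like(permutation)
for idx, cluster in enumerate(clusters):
    labels[cluster] = idx
```

Here features 0 and 2 share label 0 and feature 1 gets label 1, while `ticks` marks the end of each block in the reordered matrix.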

Examples:

>>> import mosaic
>>> import numpy as np
>>> mat = np.array([[1.0, 0.1, 0.9], [0.1, 1.0, 0.1], [0.9, 0.1, 1.0]])
>>> clust = mosaic.Clustering()
>>> clust.fit(mat)
Clustering(resolution_parameter=0.7)
>>> clust.matrix_
array([[1. , 0.9, 0.1],
       [0.9, 1. , 0.1],
       [0.1, 0.1, 1. ]])
>>> clust.clusters_
array([list([2, 0]), list([1])], dtype=object)

Initialize Clustering class.

Source code in src/mosaic/clustering.py

@beartype
def __init__(
    self,
    *,
    mode: ClusteringModeString = 'CPM',
    weighted: bool = True,
    n_neighbors: Optional[PositiveInt] = None,
    resolution_parameter: Optional[NumInRange0to1] = None,
    seed: Optional[int] = None,
) -> None:
    """Initialize Clustering class."""
    self.mode: ClusteringModeString = mode
    self.n_neighbors: Optional[PositiveInt] = n_neighbors
    self.resolution_parameter: Optional[NumInRange0to1] = (
        resolution_parameter
    )
    self.seed: Optional[int] = seed
    self.weighted: bool = weighted

    if mode == 'linkage' and self.n_neighbors is not None:
        raise NotImplementedError(
            f"mode='{mode}' does not support knn-graphs.",
        )

    if mode in {'CPM', 'linkage'}:
        if not weighted:
            raise NotImplementedError(
                f"mode='{mode}' does not support weighted=False",
            )
    elif resolution_parameter is not None:
        raise NotImplementedError(
            f"mode='{mode}' does not support the usage of the "
            'resolution_parameter',
        )

`fit(X, y=None)`

Clusters the correlation matrix by Leiden clustering on a graph.

Parameters:

X (ndarray of shape (n_features, n_features)) –

Matrix containing the correlation metric which is clustered. The values should go from [0, 1] where 1 means completely correlated and 0 no correlation.
y (Ignored, default: None ) –

Not used, present for scikit API consistency by convention.

Returns:

self ( object ) –

Fitted estimator.

Source code in src/mosaic/clustering.py

@beartype
def fit(self, X: SimilarityMatrix, y: Optional[np.ndarray] = None):
    """Clusters the correlation matrix by Leiden clustering on a graph.

    Parameters
    ----------
    X : ndarray of shape (n_features, n_features)
        Matrix containing the correlation metric which is clustered. The
        values should go from [0, 1] where 1 means completely correlated
        and 0 no correlation.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    self : object
        Fitted estimator.

    """
    self._reset()

    # prepare matrix for graph construction
    mat: FloatMatrix
    if self.mode == 'linkage':
        mat = np.copy(X)
    elif self.mode == 'CPM' and self.n_neighbors is None:
        mat = np.copy(X)
    else:
        mat = self._construct_knn_mat(X)

    if self.mode in {'CPM', 'linkage'}:
        # mask diagonal and zero elements
        mat[mat == 0] = np.nan
        mat[np.diag_indices_from(mat)] = np.nan

        if self.resolution_parameter is None:
            if self.n_neighbors is None:
                third_quartile = 0.75
                self.resolution_parameter = np.nanquantile(
                    mat, third_quartile,
                )
            else:
                self.resolution_parameter = np.nanmean(mat)

        self.resolution_param_: NumInRange0to1 = (
            self.resolution_parameter
        )

    # create graph
    mat[np.isnan(mat)] = 0

    clusters: Object1DArray
    if self.mode == 'linkage':
        clusters = self._clustering_linkage(mat)
    else:  # _mode in {'CPM', 'modularity'}
        graph: ig.Graph = ig.Graph.Weighted_Adjacency(
            list(mat.astype(np.float64)), loops=False,
        )
        clusters = self._clustering_leiden(graph)

    self.clusters_: Object1DArray = _sort_clusters(clusters, X)
    self.permutation_: Index1DArray = np.hstack(self.clusters_)
    self.matrix_: Float2DArray = np.copy(X)[
        np.ix_(self.permutation_, self.permutation_)
    ]
    self.ticks_: Index1DArray = np.cumsum(
        [len(cluster) for cluster in self.clusters_],
    )
    labels: Index1DArray = np.empty_like(self.permutation_)
    for idx, cluster in enumerate(self.clusters_):
        labels[cluster] = idx
    self.labels_: Index1DArray = labels

    return self

`fit_predict(X, y=None)`

Clusters the correlation matrix by Leiden clustering on a graph.

Parameters:

X (ndarray of shape (n_features, n_features)) –

Matrix containing the correlation metric which is clustered. The values should go from [0, 1] where 1 means completely correlated and 0 no correlation.
y (Ignored, default: None ) –

Not used, present for scikit API consistency by convention.

Returns:

labels ( ndarray of shape (n_samples,) ) –

Cluster labels.

Source code in src/mosaic/clustering.py

@beartype
def fit_predict(
    self, X: SimilarityMatrix, y: Optional[np.ndarray] = None,
) -> Index1DArray:
    """Clusters the correlation matrix by Leiden clustering on a graph.

    Parameters
    ----------
    X : ndarray of shape (n_features, n_features)
        Matrix containing the correlation metric which is clustered. The
        values should go from [0, 1] where 1 means completely correlated
        and 0 no correlation.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    labels : ndarray of shape (n_samples,)
        Cluster labels.

    """
    return super().fit_predict(X, y)

`score(X, y=None, sample_weight=None)`

Estimate silhouette_score of new correlation matrix.

Parameters:

X (ndarray of shape (n_features, n_features)) –

New matrix containing the correlation metric to score. The values should go from [0, 1] where 1 means completely correlated and 0 no correlation.
y (Ignored, default: None ) –

Not used, present for scikit API consistency by convention.
sample_weight (Optional[ndarray], default: None ) –

Not used, present for scikit API consistency by convention.

Returns:

score ( float ) –

Silhouette score of new correlation matrix based on fitted labels.

Source code in src/mosaic/clustering.py

@beartype
def score(
    self,
    X: SimilarityMatrix,
    y: Optional[np.ndarray] = None,
    sample_weight: Optional[np.ndarray] = None,
) -> Float:
    """Estimate silhouette_score of new correlation matrix.

    Parameters
    ----------
    X : ndarray of shape (n_features, n_features)
        New matrix containing the correlation metric to score. The
        values should go from [0, 1] where 1 means completely correlated
        and 0 no correlation.
    y : Ignored
        Not used, present for scikit API consistency by convention.
    sample_weight: Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    score : float
        Silhouette score of new correlation matrix based on fitted labels.

    """
    check_is_fitted(self, attributes=['labels_', 'matrix_'])

    n_labels = len(self.labels_)
    n_unique_labels = len(np.unique(self.labels_))

    if n_labels != len(X):
        raise ValueError(
            f'Dimension of X d={len(X):.0f} needs to agree with the '
            f'dimension of the fitted data d={n_labels:.0f}.',
        )

    if n_unique_labels in {1, n_labels}:
        return -1.0
    return silhouette_score(X, labels=self.labels_)
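
The early return of -1.0 covers the two degenerate labelings for which scikit-learn's `silhouette_score` raises a `ValueError`, since it requires at least two clusters and fewer clusters than samples. A minimal stand-alone sketch of that check follows; the helper name `is_degenerate` is hypothetical and not part of MoSAIC.

```python
def is_degenerate(labels) -> bool:
    """True if all features share one cluster or every feature is alone."""
    n_labels = len(labels)
    n_unique = len(set(labels))
    return n_unique in {1, n_labels}


one_cluster = is_degenerate([0, 0, 0, 0])  # a single cluster
singletons = is_degenerate([0, 1, 2, 3])   # every feature its own cluster
proper = is_degenerate([0, 0, 1, 1])       # two non-trivial clusters
```

Mapping both degenerate cases to the worst possible silhouette value keeps a grid search running while making sure such parameter settings are never selected as the best candidate.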

`GridSearchCV(*, similarity, clustering, param_grid, gridsearch_kwargs={})`

Bases: GridSearchCV

Class for grid search cross validation.

Parameters:

similarity (Similarity) –

Similarity instance setup with constant parameters, see mosaic.Similarity for available parameters. low_memory is not supported.
clustering (Clustering) –

Clustering instance setup with constant parameters, see mosaic.Clustering for available parameters.
param_grid (dict) –

Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored.
gridsearch_kwargs (dict, default: {} ) –

Dictionary with parameters to be used for sklearn.model_selection.GridSearchCV class. The parameter estimator is not supported and param_grid needs to be passed directly to the class.

Attributes:

cv_results_ (dict of numpy (masked) ndarrays) –

A dict with keys as column headers and values as columns.
best_estimator_ (estimator) –

Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data.
best_score_ (float) –

Mean cross-validated score of the best_estimator.
best_params_ (dict) –

Parameter setting that gave the best results on the hold out data.
best_index_ (int) –

The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.
n_splits_ (int) –

The number of cross-validation splits (folds/iterations).

Notes

Check out sklearn.model_selection.GridSearchCV for an overview of all available attributes and more detailed description.

Examples:

>>> import mosaic
>>> import numpy as np
>>> # create two correlated data sets
>>> traj = np.array([
...     func(np.linspace(0, 20, 1000))
...     for func in (
...         np.sin,
...         lambda x: np.sin(x + 0.1),
...         np.cos,
...         lambda x: np.cos(x + 0.1),
...     )
... ]).T
>>> search = mosaic.GridSearchCV(
...     similarity=mosaic.Similarity(),
...     clustering=mosaic.Clustering(),
...     param_grid={'resolution_parameter': [0.05, 0.2]},
... )
>>> search.fit(traj)
GridSearchCV(clustering=Clustering(),
             param_grid={'clust__resolution_parameter': [0.05, 0.2]},
             similarity=Similarity())
>>> search.best_params_
{'clust__resolution_parameter': 0.2}
>>> search.best_estimator_
Pipeline(steps=[('sim', Similarity()),
                ('clust', Clustering(resolution_parameter=0.2))])

Initialize GridSearchCV class.

Source code in src/mosaic/gridsearch.py

@beartype
def __init__(
    self,
    *,
    similarity: Similarity,
    clustering: Clustering,
    param_grid: Dict,
    gridsearch_kwargs: Dict = {},
) -> None:
    """Initialize GridSearchCV class."""
    self.similarity: Similarity = similarity
    self.clustering: Clustering = clustering
    self.gridsearch_kwargs: Dict = gridsearch_kwargs

    if 'estimator' in self.gridsearch_kwargs:
        raise NotImplementedError(
            'Custom estimators are not supported. Please use the '
            'sklearn class GridSearchCV directly.',
        )

    if 'param_grid' in self.gridsearch_kwargs:
        raise NotImplementedError(
            "Please pass 'param_grid' directly to the class.",
        )

    if similarity.get_params()['low_memory']:
        raise NotImplementedError(
            "'low_memory' is currently not implemented.",
        )

    if not param_grid:
        raise ValueError(
            'At least a single parameter needs to be provided',
        )

    self.pipeline = Pipeline([
        (self._sim_prefix, self.similarity),
        (self._clust_prefix, self.clustering),
    ])

    self.param_grid: Dict = {}
    for param, values in param_grid.items():
        if param in similarity.get_params():
            self.param_grid[
                f'{self._sim_prefix}__{param}'
            ] = values
        elif param in clustering.get_params():
            self.param_grid[
                f'{self._clust_prefix}__{param}'
            ] = values
        else:
            raise ValueError(
                f"param_grid key '{param}' is not available."
            )

    super().__init__(
        estimator=self.pipeline,
        param_grid=self.param_grid,
        **self.gridsearch_kwargs,
    )
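
The key translation in the constructor can be illustrated in isolation. The sketch below maps plain parameter names to the `step__param` keys expected by a scikit-learn `Pipeline`, assuming the step names `sim` and `clust` that appear in the example output above. The function `prefix_param_grid` and the explicit parameter sets are hypothetical stand-ins for `get_params()`.

```python
def prefix_param_grid(param_grid, sim_params, clust_params):
    """Map plain parameter names to Pipeline-style 'step__param' keys."""
    prefixed = {}
    for param, values in param_grid.items():
        if param in sim_params:
            prefixed[f'sim__{param}'] = values
        elif param in clust_params:
            prefixed[f'clust__{param}'] = values
        else:
            raise ValueError(f"param_grid key '{param}' is not available.")
    return prefixed


grid = prefix_param_grid(
    {'metric': ['correlation', 'JSD'], 'resolution_parameter': [0.05, 0.2]},
    sim_params={
        'metric', 'low_memory', 'normalize_method', 'use_knn_estimator',
    },
    clust_params={
        'mode', 'weighted', 'n_neighbors', 'resolution_parameter', 'seed',
    },
)
```

This is why `best_params_` reports `clust__resolution_parameter` even though the user passed `resolution_parameter`.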

`fit(X, y=None)`

Runs the grid search by fitting the similarity and clustering pipeline for every parameter combination.

Parameters:

X (ndarray of shape (n_samples, n_features)) –

Training vector, where n_samples is the number of samples and n_features is the number of features.
y (Ignored, default: None ) –

Not used, present for scikit API consistency by convention.

Returns:

self ( object ) –

Fitted estimator.

Source code in src/mosaic/gridsearch.py

@beartype
def fit(
    self,
    X: FloatMax2DArray,
    y: Optional[np.ndarray] = None,
):
    """Clusters the correlation matrix by Leiden clustering on a graph.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Training vector, where `n_samples` is the number of samples and
        `n_features` is the number of features.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    self : object
        Fitted estimator.

    """
    return super().fit(X)

`Similarity(*, metric='correlation', low_memory=False, normalize_method=None, use_knn_estimator=False)`

Bases: BaseEstimator

Class for calculating the similarity measure.

Parameters:

metric (str, default: 'correlation' ) –

the correlation metric to use for the feature distance matrix.
- 'correlation' will use the absolute value of the Pearson correlation
- 'NMI' will use the mutual information normalized by joint entropy
- 'GY' uses Gel'fand and Yaglom normalization[^1]
- 'JSD' will use the Jensen-Shannon divergence between the joint probability distribution and the product of the marginal probability distributions to calculate their dissimilarity

Note: 'NMI' is supported only with low_memory=False
low_memory (bool, default: False ) –

If True, the input of fit X needs to be a file name and the correlation is calculated on the fly. Otherwise, an array is assumed as input X.
normalize_method (str, default: 'geometric' ) –

Only required for metric 'NMI'. Determines the normalization factor for the mutual information:
- 'joint' is the joint entropy
- 'max' is the maximum of the individual entropies
- 'arithmetic' is the mean of the individual entropies
- 'geometric' is the square root of the product of the individual entropies
- 'min' is the minimum of the individual entropies
use_knn_estimator (bool, default: False ) –

Can only be set for metric 'GY'. If True, the mutual information is estimated by a parameter-free method based on entropy estimation from k-nearest neighbors distances[^3]. It considerably increases the computational time and is thus only advisable for relatively small data sets.
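
The normalization options for 'NMI' can be compared on a small discrete example. The sketch below computes the mutual information of a 2x2 joint probability table and divides it by each of the listed factors. It is a plain-Python illustration under the assumption that probabilities are already known; how MoSAIC estimates them from trajectories is not shown, and the function names are hypothetical.

```python
import math


def entropy(probs) -> float:
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def normalized_mi(joint, method='geometric') -> float:
    """Mutual information of a discrete joint table, normalized by `method`.

    Assumes both marginal entropies are nonzero.
    """
    h_x = entropy([sum(row) for row in joint])         # marginal of X
    h_y = entropy([sum(col) for col in zip(*joint)])   # marginal of Y
    h_xy = entropy([p for row in joint for p in row])  # joint entropy
    mutual_info = h_x + h_y - h_xy
    norm = {
        'joint': h_xy,
        'max': max(h_x, h_y),
        'arithmetic': (h_x + h_y) / 2,
        'geometric': math.sqrt(h_x * h_y),
        'min': min(h_x, h_y),
    }[method]
    return mutual_info / norm


methods = ('joint', 'max', 'arithmetic', 'geometric', 'min')
identical = [[0.5, 0.0], [0.0, 0.5]]  # Y is a copy of X
partial = [[0.4, 0.1], [0.1, 0.4]]    # X and Y agree 80% of the time
nmi_identical = {m: normalized_mi(identical, m) for m in methods}
nmi_partial = {m: normalized_mi(partial, m) for m in methods}
```

Since the joint entropy is never smaller than either individual entropy, the 'joint' normalization yields the smallest value and 'min' the largest; all of them reach 1 for identical variables.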

Attributes:

matrix_ (ndarray of shape (n_features, n_features)) –

The pairwise similarity matrix of the data, based on the chosen correlation measure. Its values lie in [0, 1].

Examples:

>>> import mosaic
>>> import numpy as np
>>> x = np.linspace(0, np.pi, 1000)
>>> data = np.array([np.cos(x), np.cos(x + np.pi / 6)]).T
>>> sim = mosaic.Similarity()
>>> sim.fit(data)
Similarity()
>>> sim.matrix_
array([[1.       , 0.9697832],
       [0.9697832, 1.       ]])

Notes

The Pearson correlation coefficient is defined as

\[\rho_{X,Y} = \frac{\langle(X -\mu_X)(Y -\mu_Y)\rangle}{\sigma_X\sigma_Y}.\]
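
For reference, the absolute Pearson correlation of the example data can be reproduced with NumPy alone. This is only a cross-check of the formula, not MoSAIC's implementation; the off-diagonal entry should come out close to the 0.9697832 reported in the example above.

```python
import numpy as np

x = np.linspace(0, np.pi, 1000)
# columns are features, rows are samples: shape (n_samples, n_features)
data = np.array([np.cos(x), np.cos(x + np.pi / 6)]).T

# np.corrcoef treats rows as variables unless rowvar=False is given
similarity = np.abs(np.corrcoef(data, rowvar=False))
```

Taking the absolute value maps perfectly anticorrelated coordinates to a similarity of 1 as well, so the matrix entries stay within [0, 1].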

For the online (low memory) option the Welford algorithm[^2] is used.
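
A minimal sketch of a Welford-style single-pass update for the correlation of two streams is given below. It illustrates the general technique only; the class name `OnlineCorrelation` is hypothetical and the actual low-memory code path of MoSAIC may be organized differently.

```python
import math


class OnlineCorrelation:
    """Single-pass Pearson correlation of two streams (Welford-style)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0  # running sum of squared deviations of x
        self.m2_y = 0.0  # running sum of squared deviations of y
        self.c_xy = 0.0  # running sum of co-deviations of x and y

    def update(self, x: float, y: float) -> None:
        self.n += 1
        dx = x - self.mean_x  # deviation from the old mean
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # old deviation times new deviation keeps the sums exact
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def correlation(self) -> float:
        return self.c_xy / math.sqrt(self.m2_x * self.m2_y)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
online = OnlineCorrelation()
for x_val, y_val in zip(xs, ys):
    online.update(x_val, y_val)
rho = online.correlation()
```

Only the counters and running sums are kept in memory, so the data can be read sample by sample from a file instead of being loaded as one array.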

The Jensen-Shannon divergence is defined as

\[D_{\text{JS}} = \frac{1}{2} D_{\text{KL}}(p(x,y)||M) + \frac{1}{2} D_{\text{KL}}(p(x)p(y)||M)\;,\]

where \(M = \frac{1}{2} [p(x,y) + p(x)p(y)]\) is an averaged probability distribution and \(D_{\text{KL}}\) denotes the Kullback-Leibler divergence.
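
The definition above can be evaluated directly for a discrete joint probability table. The sketch below does so with NumPy, using base-2 logarithms so that the divergence is bounded by 1; the choice of base and the function names are assumptions made for this illustration, not a statement about MoSAIC's internals.

```python
import numpy as np


def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Kullback-Leibler divergence in bits, with 0 * log(0) taken as 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def jsd_joint_vs_product(joint: np.ndarray) -> float:
    """Jensen-Shannon divergence between p(x, y) and p(x) p(y)."""
    joint = np.asarray(joint, dtype=float)
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    mixture = 0.5 * (joint + product)  # the averaged distribution M
    return 0.5 * kl_divergence(joint, mixture) + 0.5 * kl_divergence(
        product, mixture,
    )


independent = np.outer([0.3, 0.7], [0.6, 0.4])  # p(x, y) = p(x) p(y)
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])  # y is a copy of x
jsd_independent = jsd_joint_vs_product(independent)
jsd_dependent = jsd_joint_vs_product(dependent)
```

For independent variables the joint distribution equals the product of the marginals and the divergence vanishes, whereas any statistical dependence gives a strictly positive value.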

[^1]: Gel'fand, I.M. and Yaglom, A.M. (1957). "Calculation of amount of information about a random function contained in another such function". American Mathematical Society Translations, series 2, 12, pp. 199–246.
[^2]: Welford algorithm, generalized to correlation. Taken from: Donald E. Knuth (1998). "The Art of Computer Programming", volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.
[^3]: B.C. Ross (2014). "Mutual Information between Discrete and Continuous Data Sets". PLoS ONE 9(2).

Initialize Similarity class.

Source code in src/mosaic/similarity.py

@beartype
def __init__(
    self,
    *,
    metric: MetricString = 'correlation',
    low_memory: bool = False,
    normalize_method: Optional[NormString] = None,
    use_knn_estimator: bool = False,
):
    """Initialize Similarity class."""
    self.metric: MetricString = metric
    self.low_memory: bool = low_memory
    self.use_knn_estimator: bool = use_knn_estimator
    if self.metric == 'NMI':
        if normalize_method is None:
            normalize_method = self._default_normalize_method
    elif normalize_method is not None:
        raise NotImplementedError(
            'Normalize methods are only supported with metric="NMI"',
        )
    self.normalize_method: NormString = normalize_method
    if self.metric != 'GY' and self.use_knn_estimator:
        raise NotImplementedError(
            (
                'The mutual information estimate based on k-nearest '
                'neighbors distances is only supported with metric="GY"'
            ),
        )

`fit(X, y=None)`

Compute the correlation/nmi distance matrix.

Parameters:

X (ndarray of shape (n_samples, n_features) or str if low_memory=True) –

Training data.
y (Ignored, default: None ) –

Not used, present for scikit API consistency by convention.

Returns:

self ( object ) –

Fitted estimator.

Source code in src/mosaic/similarity.py

@singledispatchmethod
@beartype
def fit(
    self,
    X: Union[FloatMax2DArray, str],
    y: Optional[ArrayLikeFloat] = None,
):
    """Compute the correlation/nmi distance matrix.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features) or str if low_memory=True
        Training data.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    self : object
        Fitted estimator.

    """
    raise NotImplementedError('Fatal error, this should never be reached.')

`fit_transform(X, y=None)`

Compute the correlation/nmi distance matrix and return it.

Parameters:

X (ndarray of shape (n_samples, n_features) or str if low_memory=True) –

Training data.
y (Ignored, default: None ) –

Not used, present for scikit API consistency by convention.

Returns:

Similarity ( ndarray of shape (n_features, n_features) ) –

Similarity matrix.

Source code in src/mosaic/similarity.py

@beartype
def fit_transform(
    self,
    X: Union[FloatMax2DArray, str],
    y: Optional[ArrayLikeFloat] = None,
) -> FloatMatrix:
    """Compute the correlation/nmi distance matrix and returns it.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features) or str if low_memory=True
        Training data.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    Similarity : ndarray of shape (n_features, n_features)
        Similarity matrix.

    """
    self.fit(X)
    return self.matrix_

`transform(X)`

Compute the correlation/nmi distance matrix and return it.

Parameters:

X (ndarray of shape (n_samples, n_features) or str if low_memory=True) –

Training data.

Returns:

Similarity ( ndarray of shape (n_features, n_features) ) –

Similarity matrix.

Source code in src/mosaic/similarity.py

@beartype
def transform(
    self,
    X: Union[FloatMax2DArray, str],
) -> FloatMatrix:
    """Compute the correlation/nmi distance matrix and returns it.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features) or str if low_memory=True
        Training data.

    Returns
    -------
    Similarity : ndarray of shape (n_features, n_features)
        Similarity matrix.

    """
    return self.fit_transform(X)
