mosaic

MoSAIC is an advanced Python package specifically designed for the analysis of discrete time series data from Molecular Dynamics (MD) simulations. It offers a wide range of capabilities to identify collective coordinates that describe the same biomolecular process.

With MoSAIC, researchers and engineers can easily analyze large and complex datasets to gain insights into the underlying dynamics of biomolecular processes. The package provides the capability to calculate various similarity measures, such as linear correlation and mutual information, and to apply different clustering algorithms to find groups of coordinates which move in a concerted manner. By doing so, MoSAIC allows researchers and engineers to identify groups of coordinates that collectively describe the same process in MD simulations.

MoSAIC can be used as a stand-alone analysis tool or as a preprocessing step for feature selection for subsequent Markov state modeling. It is structured into the following submodules:

  • similarity This submodule introduces a versatile class that enables the calculation of similarity measures based on different correlation metrics. Users can choose from a set of popular metrics, such as the absolute value of the Pearson correlation, or different normalizations of mutual information. The result is always a similarity matrix with values ranging from 0 to 1. This submodule also supports efficient memory management and flexible normalization options for mutual information-based measures.

  • clustering This submodule is the central component of MoSAIC and offers various techniques for analyzing similarity matrices. It provides different modes for clustering a correlation matrix, including the Leiden algorithm with different objective functions, linkage clustering, and k-medoids, and supports both weighted and unweighted, as well as full and sparse graphs. The resulting clusters and labels can be accessed through the attributes of the class.

  • gridsearch This submodule provides a class for performing grid search cross validation. It allows users to explore different combinations of parameter settings for a clustering model and provides evaluation metrics for each combination. The best combination of parameters and the corresponding model can be easily retrieved using the provided attributes.

  • utils This submodule provides utility functions that can be used to store and load the data or to provide runtime user information.

Clustering(*, mode='CPM', weighted=True, n_neighbors=None, resolution_parameter=None, n_clusters=None, seed=None)

Bases: ClusterMixin, BaseEstimator

Class for clustering a correlation matrix.

Parameters:

  • mode (str, default: 'CPM' ) –

    The clustering mode. It selects the quality function optimized by the Leiden algorithm ('CPM' or 'modularity'), linkage clustering, or k-medoids clustering:
    - 'CPM': will use the constant Potts model on the full, weighted graph
    - 'modularity': will use modularity on a knn-graph
    - 'linkage': will use complete-linkage clustering
    - 'kmedoids': will use k-medoids clustering (deprecated)

  • weighted (bool, default: True ) –

    If True, the underlying graph has weighted edges. Otherwise, the graph is constructed using the adjacency matrix.

  • n_neighbors (int, default: None ) –

    This parameter specifies whether the whole matrix should be used, or a knn-graph, which reduces the required memory. The default depends on the mode:
    - 'CPM': None uses the full graph
    - 'modularity': None uses the square root of the number of features

  • resolution_parameter (float, default: None ) –

    Only used for modes 'CPM' and 'linkage'. If None, the resolution parameter is set to the third quartile of X (ignoring zero entries and the diagonal) for n_neighbors=None, and otherwise to the mean value of the knn graph.

  • n_clusters (int, default: None ) –

    Required for 'kmedoids'. The number of medoids which will constitute the later clusters.

  • seed (int, default: None ) –

    Use an integer to make the randomness of Leidenalg deterministic. By default uses a random seed if nothing is specified.
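The default choice of the resolution parameter on the full graph can be reproduced by hand. The following is a minimal sketch that follows the logic visible in the source of fit: zero entries and the diagonal are masked, then the third quartile of the remaining values is taken. The helper name `default_resolution` is hypothetical and not part of MoSAIC.

```python
import numpy as np


def default_resolution(sim):
    """Hypothetical helper: third quartile of the off-diagonal entries."""
    mat = np.array(sim, dtype=float)  # copy, so the input stays untouched
    mat[mat == 0] = np.nan  # zero entries are ignored
    mat[np.diag_indices_from(mat)] = np.nan  # self-similarity is ignored
    return float(np.nanquantile(mat, 0.75))


sim = np.array([
    [1.0, 0.1, 0.9],
    [0.1, 1.0, 0.1],
    [0.9, 0.1, 1.0],
])
resolution = default_resolution(sim)
```

For this matrix the sketch yields approximately 0.7, which agrees with the fitted estimator shown in the Examples section, Clustering(resolution_parameter=0.7).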

Attributes:

  • clusters_ (ndarray of shape (n_clusters, )) –

    The result of the clustering process. A list of arrays, each containing all indices (features) corresponding to each cluster.

  • labels_ (ndarray of shape (n_features, )) –

    Labels of each feature.

  • matrix_ (ndarray of shape (n_features, n_features)) –

    Permuted matrix according to the determined clusters.

  • ticks_ (ndarray of shape (n_clusters, )) –

    The cumulative number of features belonging to the clusters. May be used as ticks for plotting matrix_.

  • permutation_ (ndarray of shape (n_features, )) –

    Permutation of the input features (corresponds to flattened clusters_).

  • n_neighbors_ (int) –

    Only available when using a knn graph. Indicates the number of nearest neighbors used for constructing the knn-graph.

  • resolution_param_ (float) –

    Only for mode 'CPM' and 'linkage'. Indicates the resolution parameter used for the CPM based Leiden clustering.

  • linkage_matrix_ (ndarray of shape (n_clusters - 1, 4)) –

    Only for mode 'linkage'. Contains the hierarchical clustering encoded as a linkage matrix, see scipy.cluster.hierarchy.linkage.
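The relation between these attributes can be sketched with plain NumPy. The snippet below takes the cluster assignment from the documented example as given and derives the permutation, the permuted matrix, the ticks and the labels in the same way as the source of fit. The local variable names are chosen for illustration only.

```python
import numpy as np

sim = np.array([
    [1.0, 0.1, 0.9],
    [0.1, 1.0, 0.1],
    [0.9, 0.1, 1.0],
])
# Cluster assignment taken from the documented example output.
clusters = [np.array([2, 0]), np.array([1])]

# Flattening the clusters gives the permutation of the features.
permutation = np.hstack(clusters)
# Reorder rows and columns so that clusters form diagonal blocks.
matrix = sim[np.ix_(permutation, permutation)]
# Cumulative cluster sizes mark the block boundaries.
ticks = np.cumsum([len(cluster) for cluster in clusters])
# Each feature gets the index of the cluster it belongs to.
labels = np.empty_like(permutation)
for idx, cluster in enumerate(clusters):
    labels[cluster] = idx
```

Here permutation is [2, 0, 1], ticks is [2, 3] and labels is [0, 1, 0]; the permuted matrix equals the matrix_ shown in the example.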

Examples:

>>> import numpy as np
>>> import mosaic
>>> mat = np.array([[1.0, 0.1, 0.9], [0.1, 1.0, 0.1], [0.9, 0.1, 1.0]])
>>> clust = mosaic.Clustering()
>>> clust.fit(mat)
Clustering(resolution_parameter=0.7)
>>> clust.matrix_
array([[1. , 0.9, 0.1],
       [0.9, 1. , 0.1],
       [0.1, 0.1, 1. ]])
>>> clust.clusters_
array([list([2, 0]), list([1])], dtype=object)

Initialize Clustering class.

Source code in src/mosaic/clustering.py
@beartype
def __init__(
    self,
    *,
    mode: ClusteringModeString = 'CPM',
    weighted: bool = True,
    n_neighbors: Optional[PositiveInt] = None,
    resolution_parameter: Optional[NumInRange0to1] = None,
    n_clusters: Optional[PositiveInt] = None,
    seed: Optional[int] = None,
) -> None:
    """Initialize Clustering class."""
    self.mode: ClusteringModeString = mode
    self.n_clusters: Optional[PositiveInt] = n_clusters
    self.n_neighbors: Optional[PositiveInt] = n_neighbors
    self.resolution_parameter: Optional[NumInRange0to1] = (
        resolution_parameter
    )
    self.seed: Optional[int] = seed
    self.weighted: bool = weighted

    if mode in {'linkage', 'kmedoids'} and self.n_neighbors is not None:
        raise NotImplementedError(
            f"mode='{mode}' does not support knn-graphs.",
        )

    if mode == 'kmedoids':
        warnings.warn(
            "The 'kmedoids' mode is deprecated and will be removed in a "
            "future release.",
            DeprecationWarning,
            stacklevel=2
        )
        if self.n_clusters is None:
            raise TypeError(
                f"mode='{mode}' needs parameter 'n_clusters'",
            )
    elif mode != 'kmedoids' and self.n_clusters is not None:
        raise NotImplementedError(
            f"mode='{mode}' does not support the usage of 'n_clusters'",
        )

    if mode in {'CPM', 'linkage'}:
        if not weighted:
            raise NotImplementedError(
                f"mode='{mode}' does not support weighted=False",
            )
    elif resolution_parameter is not None:
        raise NotImplementedError(
            f"mode='{mode}' does not support the usage of the "
            'resolution_parameter',
        )

fit(X, y=None)

Clusters the correlation matrix by Leiden clustering on a graph.

Parameters:

  • X (ndarray of shape (n_features, n_features)) –

    Matrix containing the correlation metric which is clustered. The values should go from [0, 1] where 1 means completely correlated and 0 no correlation.

  • y (Ignored, default: None ) –

    Not used, present for scikit API consistency by convention.

Returns:

  • self ( object ) –

    Fitted estimator.

Source code in src/mosaic/clustering.py
@beartype
def fit(self, X: SimilarityMatrix, y: Optional[np.ndarray] = None):
    """Clusters the correlation matrix by Leiden clustering on a graph.

    Parameters
    ----------
    X : ndarray of shape (n_features, n_features)
        Matrix containing the correlation metric which is clustered. The
        values should go from [0, 1] where 1 means completely correlated
        and 0 no correlation.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    self : object
        Fitted estimator.

    """
    self._reset()

    # prepare matrix for graph construction
    mat: FloatMatrix
    if self.mode in {'linkage', 'kmedoids'}:
        mat = np.copy(X)
    elif self.mode == 'CPM' and self.n_neighbors is None:
        mat = np.copy(X)
    else:
        mat = self._construct_knn_mat(X)

    if self.mode in {'CPM', 'linkage'}:
        # mask diagonal and zero elements
        mat[mat == 0] = np.nan
        mat[np.diag_indices_from(mat)] = np.nan

        if self.resolution_parameter is None:
            if self.n_neighbors is None:
                third_quartile = 0.75
                self.resolution_parameter = np.nanquantile(
                    mat, third_quartile,
                )
            else:
                self.resolution_parameter = np.nanmean(mat)

        self.resolution_param_: NumInRange0to1 = (
            self.resolution_parameter
        )

    # create graph
    mat[np.isnan(mat)] = 0

    clusters: Object1DArray
    if self.mode == 'linkage':
        clusters = self._clustering_linkage(mat)
    elif self.mode == 'kmedoids':
        clusters = self._clustering_kmedoids(mat)
    else:  # _mode in {'CPM', 'modularity'}
        graph: ig.Graph = ig.Graph.Weighted_Adjacency(
            list(mat.astype(np.float64)), loops=False,
        )
        clusters = self._clustering_leiden(graph)

    self.clusters_: Object1DArray = _sort_clusters(clusters, X)
    self.permutation_: Index1DArray = np.hstack(self.clusters_)
    self.matrix_: Float2DArray = np.copy(X)[
        np.ix_(self.permutation_, self.permutation_)
    ]
    self.ticks_: Index1DArray = np.cumsum(
        [len(cluster) for cluster in self.clusters_],
    )
    labels: Index1DArray = np.empty_like(self.permutation_)
    for idx, cluster in enumerate(self.clusters_):
        labels[cluster] = idx
    self.labels_: Index1DArray = labels

    return self

fit_predict(X, y=None)

Clusters the correlation matrix by Leiden clustering on a graph.

Parameters:

  • X (ndarray of shape (n_features, n_features)) –

    Matrix containing the correlation metric which is clustered. The values should go from [0, 1] where 1 means completely correlated and 0 no correlation.

  • y (Ignored, default: None ) –

    Not used, present for scikit API consistency by convention.

Returns:

  • labels ( ndarray of shape (n_samples,) ) –

    Cluster labels.

Source code in src/mosaic/clustering.py
@beartype
def fit_predict(
    self, X: SimilarityMatrix, y: Optional[np.ndarray] = None,
) -> Index1DArray:
    """Clusters the correlation matrix by Leiden clustering on a graph.

    Parameters
    ----------
    X : ndarray of shape (n_features, n_features)
        Matrix containing the correlation metric which is clustered. The
        values should go from [0, 1] where 1 means completely correlated
        and 0 no correlation.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    labels : ndarray of shape (n_samples,)
        Cluster labels.

    """
    return super().fit_predict(X, y)

score(X, y=None, sample_weight=None)

Estimate silhouette_score of new correlation matrix.

Parameters:

  • X (ndarray of shape (n_features, n_features)) –

    New matrix containing the correlation metric to score. The values should go from [0, 1] where 1 means completely correlated and 0 no correlation.

  • y (Ignored, default: None ) –

    Not used, present for scikit API consistency by convention.

  • sample_weight (Optional[ndarray], default: None ) –

    Not used, present for scikit API consistency by convention.

Returns:

  • score ( float ) –

    Silhouette score of new correlation matrix based on fitted labels.

Source code in src/mosaic/clustering.py
@beartype
def score(
    self,
    X: SimilarityMatrix,
    y: Optional[np.ndarray] = None,
    sample_weight: Optional[np.ndarray] = None,
) -> Float:
    """Estimate silhouette_score of new correlation matrix.

    Parameters
    ----------
    X : ndarray of shape (n_features, n_features)
        New matrix containing the correlation metric to score. The
        values should go from [0, 1] where 1 means completely correlated
        and 0 no correlation.
    y : Ignored
        Not used, present for scikit API consistency by convention.
    sample_weight: Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    score : float
        Silhouette score of new correlation matrix based on fitted labels.

    """
    check_is_fitted(self, attributes=['labels_', 'matrix_'])

    n_labels = len(self.labels_)
    n_unique_labels = len(np.unique(self.labels_))

    if n_labels != len(X):
        raise ValueError(
            f'Dimension of X d={len(X):.0f} needs to agree with the '
            f'dimension of the fitted data d={n_labels:.0f}.',
        )

    if n_unique_labels in {1, n_labels}:
        return -1.0
    return silhouette_score(X, labels=self.labels_)
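
The early return of -1.0 covers labelings for which a silhouette score is not defined: scikit-learn's silhouette_score requires at least 2 and at most n_samples - 1 distinct labels. A minimal sketch of this guard, using the hypothetical helper `silhouette_is_defined`:

```python
import numpy as np


def silhouette_is_defined(labels):
    """Hypothetical helper mirroring the guard used in `score`."""
    n_labels = len(labels)
    n_unique_labels = len(np.unique(labels))
    # A single cluster, or only singleton clusters, is degenerate.
    return n_unique_labels not in {1, n_labels}


checks = {
    'one cluster': silhouette_is_defined([0, 0, 0]),
    'all singletons': silhouette_is_defined([0, 1, 2]),
    'two clusters': silhouette_is_defined([0, 1, 0]),
}
```

Only the last labeling passes the check; for the first two, score returns -1.0 instead of calling silhouette_score.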

GridSearchCV(*, similarity, clustering, param_grid, gridsearch_kwargs={})

Bases: GridSearchCV

Class for grid search cross validation.

Parameters:

  • similarity (Similarity) –

    Similarity instance setup with constant parameters, see mosaic.Similarity for available parameters. low_memory is not supported.

  • clustering (Clustering) –

    Clustering instance setup with constant parameters, see mosaic.Clustering for available parameters.

  • param_grid (dict) –

    Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored.

  • gridsearch_kwargs (dict, default: {} ) –

    Dictionary with parameters to be used for sklearn.model_selection.GridSearchCV class. The parameter estimator is not supported and param_grid needs to be passed directly to the class.

Attributes:

  • cv_results_ (dict of numpy (masked) ndarrays) –

    A dict with keys as column headers and values as columns.

  • best_estimator_ (estimator) –

    Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data.

  • best_score_ (float) –

    Mean cross-validated score of the best_estimator.

  • best_params_ (dict) –

    Parameter setting that gave the best results on the hold out data.

  • best_index_ (int) –

    The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.

  • n_splits_ (int) –

    The number of cross-validation splits (folds/iterations).

Notes

Check out sklearn.model_selection.GridSearchCV for an overview of all available attributes and more detailed description.

Examples:

>>> import numpy as np
>>> import mosaic
>>> # create two correlated data sets
>>> traj = np.array([
...     func(np.linspace(0, 20, 1000))
...     for func in (
...         np.sin,
...         lambda x: np.sin(x + 0.1),
...         np.cos,
...         lambda x: np.cos(x + 0.1),
...     )
... ]).T
>>> search = mosaic.GridSearchCV(
...     similarity=mosaic.Similarity(),
...     clustering=mosaic.Clustering(),
...     param_grid={'resolution_parameter': [0.05, 0.2]},
... )
>>> search.fit(traj)
GridSearchCV(clustering=Clustering(),
             param_grid={'clust__resolution_parameter': [0.05, 0.2]},
             similarity=Similarity())
>>> search.best_params_
{'clust__resolution_parameter': 0.2}
>>> search.best_estimator_
Pipeline(steps=[('sim', Similarity()),
                ('clust', Clustering(resolution_parameter=0.2))])

Initialize GridSearchCV class.

Source code in src/mosaic/gridsearch.py
@beartype
def __init__(
    self,
    *,
    similarity: Similarity,
    clustering: Clustering,
    param_grid: Dict,
    gridsearch_kwargs: Dict = {},
) -> None:
    """Initialize GridSearchCV class."""
    self.similarity: Similarity = similarity
    self.clustering: Clustering = clustering
    self.gridsearch_kwargs: Dict = gridsearch_kwargs

    if 'estimator' in self.gridsearch_kwargs:
        raise NotImplementedError(
            'Custom estimators are not supported. Please use the '
            'sklearn class GridSearchCV directly.',
        )

    if 'param_grid' in self.gridsearch_kwargs:
        raise NotImplementedError(
            "Please pass 'param_grid' directly to the class.",
        )

    if similarity.get_params()['low_memory']:
        raise NotImplementedError(
            "'low_memory' is currently not implemented.",
        )

    if not param_grid:
        raise ValueError(
            'At least a single parameter needs to be provided',
        )

    self.pipeline = Pipeline([
        (self._sim_prefix, self.similarity),
        (self._clust_prefix, self.clustering),
    ])

    self.param_grid: Dict = {}
    for param, values in param_grid.items():
        if param in similarity.get_params():
            self.param_grid[
                f'{self._sim_prefix}__{param}'
            ] = values
        elif param in clustering.get_params():
            self.param_grid[
                f'{self._clust_prefix}__{param}'
            ] = values
        else:
            raise ValueError(
                f"param_grid key '{param}' is not available."
            )

    super().__init__(
        estimator=self.pipeline,
        param_grid=self.param_grid,
        **self.gridsearch_kwargs,
    )
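
The routing of param_grid keys to the two pipeline steps can be sketched without MoSAIC. The helper `prefix_param_grid` below is hypothetical; the prefixes 'sim' and 'clust' are inferred from the example output above, and the parameter names are taken from the documented signatures of Similarity and Clustering.

```python
def prefix_param_grid(param_grid, sim_params, clust_params):
    """Hypothetical helper: prefix each key with its pipeline step name."""
    prefixed = {}
    for param, values in param_grid.items():
        if param in sim_params:
            prefixed[f'sim__{param}'] = values
        elif param in clust_params:
            prefixed[f'clust__{param}'] = values
        else:
            raise ValueError(f"param_grid key '{param}' is not available.")
    return prefixed


# Parameter names as listed in the documented signatures.
sim_params = {'metric', 'low_memory', 'normalize_method', 'use_knn_estimator'}
clust_params = {
    'mode', 'weighted', 'n_neighbors',
    'resolution_parameter', 'n_clusters', 'seed',
}

grid = prefix_param_grid(
    {'metric': ['correlation', 'GY'], 'resolution_parameter': [0.05, 0.2]},
    sim_params,
    clust_params,
)
```

The resulting keys are 'sim__metric' and 'clust__resolution_parameter', which is the naming scheme scikit-learn pipelines expect for nested parameters. A key that belongs to neither estimator raises a ValueError.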

fit(X, y=None)

Run the grid search over the pipeline of similarity and clustering.

Parameters:

  • X (ndarray of shape (n_samples, n_features)) –

    Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (Ignored, default: None ) –

    Not used, present for scikit API consistency by convention.

Returns:

  • self ( object ) –

    Fitted estimator.

Source code in src/mosaic/gridsearch.py
@beartype
def fit(
    self,
    X: FloatMax2DArray,
    y: Optional[np.ndarray] = None,
):
    """Run the grid search over the pipeline of similarity and clustering.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Training vector, where `n_samples` is the number of samples and
        `n_features` is the number of features.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    self : object
        Fitted estimator.

    """
    return super().fit(X)

Similarity(*, metric='correlation', low_memory=False, normalize_method=None, use_knn_estimator=False)

Bases: BaseEstimator

Class for calculating the similarity measure.

Parameters:

  • metric (str, default: 'correlation' ) –

    The correlation metric to use for the feature distance matrix:
    - 'correlation' will use the absolute value of the Pearson correlation
    - 'NMI' will use the mutual information normalized by joint entropy
    - 'GY' uses Gel'fand and Yaglom normalization [1]
    - 'JSD' will use the Jensen-Shannon divergence between the joint probability distribution and the product of the marginal probability distributions to calculate their dissimilarity

    Note: 'NMI' is supported only with low_memory=False.

  • low_memory (bool, default: False ) –

    If True, the input of fit X needs to be a file name and the correlation is calculated on the fly. Otherwise, an array is assumed as input X.

  • normalize_method (str, default: 'geometric' ) –

    Only required for metric 'NMI'. Determines the normalization factor for the mutual information:
    - 'joint' is the joint entropy
    - 'max' is the maximum of the individual entropies
    - 'arithmetic' is the mean of the individual entropies
    - 'geometric' is the square root of the product of the individual entropies
    - 'min' is the minimum of the individual entropies

  • use_knn_estimator (bool, default: False ) –

    Can only be set for metric 'GY'. If True, the mutual information is estimated by a parameter-free method based on entropy estimation from k-nearest neighbors distances [3]. It considerably increases the computational time and is thus only advisable for relatively small data sets.
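
The normalization options listed for normalize_method can be illustrated on a discrete joint probability table. This is a minimal sketch under the assumption that the probabilities are already known; how MoSAIC estimates them from trajectory data is not covered here, and the function names are hypothetical.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (natural log) of a probability array."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) is treated as 0
    return float(-np.sum(p * np.log(p)))


def normalized_mi(pxy, method='geometric'):
    """Mutual information of a joint table, divided by the chosen factor."""
    pxy = np.asarray(pxy, dtype=float)
    hx = entropy(pxy.sum(axis=1))  # entropy of the first variable
    hy = entropy(pxy.sum(axis=0))  # entropy of the second variable
    hxy = entropy(pxy.ravel())  # joint entropy
    mutual_info = hx + hy - hxy
    norm = {
        'joint': hxy,
        'max': max(hx, hy),
        'arithmetic': 0.5 * (hx + hy),
        'geometric': float(np.sqrt(hx * hy)),
        'min': min(hx, hy),
    }[method]
    return mutual_info / norm


dependent = [[0.5, 0.0], [0.0, 0.5]]  # y is fully determined by x
independent = [[0.25, 0.25], [0.25, 0.25]]  # x and y are unrelated
methods = ('joint', 'max', 'arithmetic', 'geometric', 'min')
nmi_dependent = [normalized_mi(dependent, method) for method in methods]
nmi_independent = [normalized_mi(independent, method) for method in methods]
```

For the fully dependent table every normalization gives 1, and for the independent table every normalization gives 0. The options differ only for intermediate cases, where a smaller normalization factor such as 'min' yields a larger value than 'joint'.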

Attributes:

  • matrix_ (ndarray of shape (n_features, n_features)) –

    The correlation-measure-based pairwise distance matrix of the data. It scales from [0, 1].

Examples:

>>> import numpy as np
>>> import mosaic
>>> x = np.linspace(0, np.pi, 1000)
>>> data = np.array([np.cos(x), np.cos(x + np.pi / 6)]).T
>>> sim = mosaic.Similarity()
>>> sim.fit(data)
Similarity()
>>> sim.matrix_
array([[1.       , 0.9697832],
       [0.9697832, 1.       ]])

Notes

The Pearson correlation coefficient is defined as

\[\rho_{X,Y} = \frac{\langle(X -\mu_X)(Y -\mu_Y)\rangle}{\sigma_X\sigma_Y}.\]

For the online (low memory) option the Welford algorithm [2] is used.
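
A single-pass update of this kind can be sketched as follows. This is an illustrative version of a Welford-style update for the correlation of two variables, not MoSAIC's implementation, and the function name `online_correlation` is hypothetical.

```python
import math

import numpy as np


def online_correlation(pairs):
    """Pearson correlation of a stream of (x, y) pairs in a single pass."""
    n = 0
    mean_x = mean_y = 0.0
    m2_x = m2_y = co_moment = 0.0
    for x, y in pairs:
        n += 1
        dx = x - mean_x  # deviation from the previous mean of x
        dy = y - mean_y  # deviation from the previous mean of y
        mean_x += dx / n
        mean_y += dy / n
        # Accumulate sums of squared deviations and the co-moment.
        m2_x += dx * (x - mean_x)
        m2_y += dy * (y - mean_y)
        co_moment += dx * (y - mean_y)
    return co_moment / math.sqrt(m2_x * m2_y)


x = np.linspace(0, np.pi, 1000)
a, b = np.cos(x), np.cos(x + np.pi / 6)
rho_online = online_correlation(zip(a, b))
rho_batch = np.corrcoef(a, b)[0, 1]
```

Both values agree up to floating point error, while the online version never needs the full arrays in memory at once.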

The Jensen-Shannon divergence is defined as

\[D_{\text{JS}} = \frac{1}{2} D_{\text{KL}}(p(x,y)||M) + \frac{1}{2} D_{\text{KL}}(p(x)p(y)||M)\;,\]

where \(M = \frac{1}{2} [p(x,y) + p(x)p(y)]\) is an averaged probability distribution and \(D_{\text{KL}}\) denotes the Kullback-Leibler divergence.
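
The definition can be evaluated directly for a discrete joint probability table. The sketch below assumes known probabilities and uses the base-2 logarithm, which bounds the divergence by 1; both choices are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p = 0 are skipped."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def jensen_shannon(pxy):
    """Divergence between p(x, y) and the product p(x) p(y)."""
    pxy = np.asarray(pxy, dtype=float)
    product = np.outer(pxy.sum(axis=1), pxy.sum(axis=0))  # p(x) p(y)
    mixture = 0.5 * (pxy + product)  # the averaged distribution M
    return (
        0.5 * kl_divergence(pxy, mixture)
        + 0.5 * kl_divergence(product, mixture)
    )


jsd_independent = jensen_shannon([[0.25, 0.25], [0.25, 0.25]])
jsd_dependent = jensen_shannon([[0.5, 0.0], [0.0, 0.5]])
```

Independent variables give a divergence of 0, because the joint distribution equals the product of the marginals. The fully dependent 2x2 table gives 1.5 - 0.75 log2(3), roughly 0.31.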


  1. Gel'fand, I.M. and Yaglom, A.M. (1957). "Calculation of amount of information about a random function contained in another such function". American Mathematical Society Translations, series 2, 12, pp. 199–246.

  2. Welford algorithm, generalized to correlation. Taken from: Knuth, D.E. (1998). "The Art of Computer Programming", volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.

  3. Ross, B.C. (2014). "Mutual Information between Discrete and Continuous Data Sets". PLoS ONE 9(2).

Initialize Similarity class.

Source code in src/mosaic/similarity.py
@beartype
def __init__(
    self,
    *,
    metric: MetricString = 'correlation',
    low_memory: bool = False,
    normalize_method: Optional[NormString] = None,
    use_knn_estimator: bool = False,
):
    """Initialize Similarity class."""
    self.metric: MetricString = metric
    self.low_memory: bool = low_memory
    self.use_knn_estimator: bool = use_knn_estimator
    if self.metric == 'NMI':
        if normalize_method is None:
            normalize_method = self._default_normalize_method
    elif normalize_method is not None:
        raise NotImplementedError(
            'Normalize methods are only supported with metric="NMI"',
        )
    self.normalize_method: NormString = normalize_method
    if self.metric != 'GY' and self.use_knn_estimator:
        raise NotImplementedError(
            (
                'The mutual information estimate based on k-nearest '
                'neighbors distances is only supported with metric="GY"'
            ),
        )

fit(X, y=None)

Compute the correlation/nmi distance matrix.

Parameters:

  • X (ndarray of shape (n_samples, n_features) or str if low_memory=True) –

    Training data.

  • y (Ignored, default: None ) –

    Not used, present for scikit API consistency by convention.

Returns:

  • self ( object ) –

    Fitted estimator.

Source code in src/mosaic/similarity.py
@singledispatchmethod
@beartype
def fit(
    self,
    X: Union[FloatMax2DArray, str],
    y: Optional[ArrayLikeFloat] = None,
):
    """Compute the correlation/nmi distance matrix.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features) or str if low_memory=True
        Training data.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    self : object
        Fitted estimator.

    """
    raise NotImplementedError('Fatal error, this should never be reached.')

fit_transform(X, y=None)

Compute the correlation/nmi distance matrix and return it.

Parameters:

  • X (ndarray of shape (n_samples, n_features) or str if low_memory=True) –

    Training data.

  • y (Ignored, default: None ) –

    Not used, present for scikit API consistency by convention.

Returns:

  • Similarity ( ndarray of shape (n_features, n_features) ) –

    Similarity matrix.

Source code in src/mosaic/similarity.py
@beartype
def fit_transform(
    self,
    X: Union[FloatMax2DArray, str],
    y: Optional[ArrayLikeFloat] = None,
) -> FloatMatrix:
    """Compute the correlation/nmi distance matrix and return it.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features) or str if low_memory=True
        Training data.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    Similarity : ndarray of shape (n_features, n_features)
        Similarity matrix.

    """
    self.fit(X)
    return self.matrix_

transform(X)

Compute the correlation/nmi distance matrix and return it.

Parameters:

  • X (ndarray of shape (n_samples, n_features) or str if low_memory=True) –

    Training data.

Returns:

  • Similarity ( ndarray of shape (n_features, n_features) ) –

    Similarity matrix.

Source code in src/mosaic/similarity.py
@beartype
def transform(
    self,
    X: Union[FloatMax2DArray, str],
) -> FloatMatrix:
    """Compute the correlation/nmi distance matrix and return it.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features) or str if low_memory=True
        Training data.

    Returns
    -------
    Similarity : ndarray of shape (n_features, n_features)
        Similarity matrix.

    """
    return self.fit_transform(X)