
similarity

Class for estimating correlation matrices.

MIT License Copyright © 2021-2022, Daniel Nagel, Georg Diez All rights reserved.

Similarity(*, metric='correlation', low_memory=False, normalize_method=None, use_knn_estimator=False)

Bases: BaseEstimator

Class for calculating the similarity measure.

Parameters:

  • metric (str, default: 'correlation' ) –

    The correlation metric to use for the feature distance matrix.

    - 'correlation' uses the absolute value of the Pearson correlation
    - 'NMI' uses the mutual information normalized by the joint entropy
    - 'GY' uses the Gel'fand and Yaglom normalization[^1]
    - 'JSD' uses the Jensen-Shannon divergence between the joint probability distribution and the product of the marginal probability distributions to calculate their dissimilarity

    Note: 'NMI' is supported only with low_memory=False.

  • low_memory (bool, default: False ) –

    If True, the input of fit X needs to be a file name and the correlation is calculated on the fly. Otherwise, an array is assumed as input X.

  • normalize_method (str, default: 'geometric' ) –

    Only allowed for metric 'NMI'; passing it with any other metric raises NotImplementedError. If left as None with metric 'NMI', the class default is used. Determines the normalization factor for the mutual information:

    - 'joint' is the joint entropy
    - 'max' is the maximum of the individual entropies
    - 'arithmetic' is the mean of the individual entropies
    - 'geometric' is the square root of the product of the individual entropies
    - 'min' is the minimum of the individual entropies

  • use_knn_estimator (bool, default: False ) –

    Can only be set for metric 'GY'; any other metric raises NotImplementedError. If True, the mutual information is estimated reliably by a parameter-free method based on entropy estimation from k-nearest-neighbor distances[^3]. This considerably increases the computation time and is thus only advisable for relatively small datasets.
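
The normalization factors listed for normalize_method can be illustrated on a small discrete joint distribution. The following is a minimal sketch in pure Python, not the library's implementation: the helper names `entropy`, `normalization_factors`, and `normalized_mutual_information` are hypothetical, and the joint distribution is assumed to be given as a nested list of probabilities.

```python
import math


def entropy(probabilities):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def normalization_factors(joint):
    """Return the five normalization factors for a joint distribution p(x, y)."""
    px = [sum(row) for row in joint]        # marginal p(x)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y)
    hx, hy = entropy(px), entropy(py)
    hxy = entropy([p for row in joint for p in row])
    return {
        'joint': hxy,
        'max': max(hx, hy),
        'arithmetic': 0.5 * (hx + hy),
        'geometric': math.sqrt(hx * hy),
        'min': min(hx, hy),
    }


def normalized_mutual_information(joint, method='geometric'):
    """Mutual information I = H(X) + H(Y) - H(X, Y) divided by the chosen factor."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mutual_info = (
        entropy(px) + entropy(py)
        - entropy([p for row in joint for p in row])
    )
    return mutual_info / normalization_factors(joint)[method]


# Hypothetical 2x2 joint distribution of two partially dependent variables.
example_joint = [[0.3, 0.2], [0.1, 0.4]]
factors = normalization_factors(example_joint)
nmi_geometric = normalized_mutual_information(example_joint, 'geometric')
```

Because the factors are ordered min ≤ geometric ≤ arithmetic ≤ max ≤ joint, the 'min' normalization yields the largest value and 'joint' the smallest for the same data.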

Attributes:

  • matrix_ (ndarray of shape (n_features, n_features)) –

    The correlation-measure-based pairwise distance matrix of the data. Its values lie in the interval [0, 1].

Examples:

>>> import mosaic
>>> import numpy as np
>>> x = np.linspace(0, np.pi, 1000)
>>> data = np.array([np.cos(x), np.cos(x + np.pi / 6)]).T
>>> sim = mosaic.Similarity()
>>> sim.fit(data)
Similarity()
>>> sim.matrix_
array([[1.       , 0.9697832],
       [0.9697832, 1.       ]])

Notes

The Pearson correlation coefficient is defined as

\[\rho_{X,Y} = \frac{\langle(X -\mu_X)(Y -\mu_Y)\rangle}{\sigma_X\sigma_Y}.\]

For the online (low memory) option the Welford algorithm[^2] is used.
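
A single-pass update of this kind can be sketched as follows. This is an illustrative pure-Python version of a Welford-style running correlation for two streams, assuming samples arrive one pair at a time; the class name `OnlinePearson` and its methods are hypothetical and do not mirror the library's internals.

```python
import math


class OnlinePearson:
    """Running Pearson correlation of two streams using Welford-style updates."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0  # running sum of squared deviations of x
        self.m2_y = 0.0  # running sum of squared deviations of y
        self.c_xy = 0.0  # running sum of co-deviations of x and y

    def update(self, x, y):
        """Incorporate one sample pair without storing earlier samples."""
        self.n += 1
        dx = x - self.mean_x  # deviation from the previous mean of x
        dy = y - self.mean_y  # deviation from the previous mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Each term pairs an old-mean deviation with a new-mean deviation.
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def correlation(self):
        """Pearson correlation of all pairs seen so far."""
        return self.c_xy / math.sqrt(self.m2_x * self.m2_y)


# Feed a few hypothetical sample pairs one at a time.
acc = OnlinePearson()
for x, y in [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]:
    acc.update(x, y)
rho = acc.correlation()
```

Only the running means and three accumulators are kept, so memory use is independent of the number of samples, which is the point of reading the data on the fly.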

The Jensen-Shannon divergence is defined as

\[D_{\text{JS}} = \frac{1}{2} D_{\text{KL}}(p(x,y)||M) + \frac{1}{2} D_{\text{KL}}(p(x)p(y)||M)\;,\]

where \(M = \frac{1}{2} [p(x,y) + p(x)p(y)]\) is an averaged probability distribution and \(D_{\text{KL}}\) denotes the Kullback-Leibler divergence.
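
To make the definition concrete, the following minimal sketch evaluates it for a small discrete joint distribution. It assumes natural logarithms and a joint distribution given as a nested list of probabilities; the function names are hypothetical, and any rescaling the library applies to map the result onto [0, 1] is not reproduced here.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in nats for flat lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd_joint_vs_marginals(joint):
    """Jensen-Shannon divergence between p(x, y) and p(x) p(y)."""
    px = [sum(row) for row in joint]        # marginal p(x)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y)
    # Flatten both distributions in the same row-major order.
    p_joint = [pxy for row in joint for pxy in row]
    p_product = [pxi * pyj for pxi in px for pyj in py]
    # Averaged distribution M = (p(x, y) + p(x) p(y)) / 2.
    mixture = [0.5 * (a + b) for a, b in zip(p_joint, p_product)]
    return (
        0.5 * kl_divergence(p_joint, mixture)
        + 0.5 * kl_divergence(p_product, mixture)
    )


# Independent variables: the joint equals the product of the marginals.
jsd_independent = jsd_joint_vs_marginals([[0.25, 0.25], [0.25, 0.25]])
# Perfectly dependent variables: all mass on the diagonal.
jsd_dependent = jsd_joint_vs_marginals([[0.5, 0.0], [0.0, 0.5]])
```

The divergence vanishes for independent variables and grows with dependence; with natural logarithms it is bounded above by ln 2.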


  1. Gel'fand, I.M. and Yaglom, A.M. (1957). "Calculation of amount of information about a random function contained in another such function". American Mathematical Society Translations, series 2, 12, pp. 199–246. 

  2. Welford algorithm, generalized to correlation. Taken from: Donald E. Knuth (1998). "The Art of Computer Programming", volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley. 

  3. Ross, B.C. (2014). "Mutual Information between Discrete and Continuous Data Sets". PLoS ONE, 9(2).

Initialize Similarity class.

Source code in src/mosaic/similarity.py
@beartype
def __init__(
    self,
    *,
    metric: MetricString = 'correlation',
    low_memory: bool = False,
    normalize_method: Optional[NormString] = None,
    use_knn_estimator: bool = False,
):
    """Initialize Similarity class."""
    self.metric: MetricString = metric
    self.low_memory: bool = low_memory
    self.use_knn_estimator: bool = use_knn_estimator
    if self.metric == 'NMI':
        if normalize_method is None:
            normalize_method = self._default_normalize_method
    elif normalize_method is not None:
        raise NotImplementedError(
            'Normalize methods are only supported with metric="NMI"',
        )
    self.normalize_method: NormString = normalize_method
    if self.metric != 'GY' and self.use_knn_estimator:
        raise NotImplementedError(
            (
                'The mutual information estimate based on k-nearest '
                'neighbors distances is only supported with metric="GY"'
            ),
        )

fit(X, y=None)

Compute the correlation/nmi distance matrix.

Parameters:

  • X (ndarray of shape (n_samples, n_features) or str if low_memory=True) –

    Training data.

  • y (Ignored, default: None ) –

    Not used, present for scikit API consistency by convention.

Returns:

  • self ( object ) –

    Fitted estimator.

Source code in src/mosaic/similarity.py
@singledispatchmethod
@beartype
def fit(
    self,
    X: Union[FloatMax2DArray, str],
    y: Optional[ArrayLikeFloat] = None,
):
    """Compute the correlation/nmi distance matrix.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features) or str if low_memory=True
        Training data.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    self : object
        Fitted estimator.

    """
    raise NotImplementedError('Fatal error, this should never be reached.')

fit_transform(X, y=None)

Compute the correlation/nmi distance matrix and return it.

Parameters:

  • X (ndarray of shape (n_samples, n_features) or str if low_memory=True) –

    Training data.

  • y (Ignored, default: None ) –

    Not used, present for scikit API consistency by convention.

Returns:

  • Similarity ( ndarray of shape (n_features, n_features) ) –

    Similarity matrix.

Source code in src/mosaic/similarity.py
@beartype
def fit_transform(
    self,
    X: Union[FloatMax2DArray, str],
    y: Optional[ArrayLikeFloat] = None,
) -> FloatMatrix:
    """Compute the correlation/nmi distance matrix and return it.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features) or str if low_memory=True
        Training data.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    Similarity : ndarray of shape (n_features, n_features)
        Similarity matrix.

    """
    self.fit(X)
    return self.matrix_

transform(X)

Compute the correlation/nmi distance matrix and return it.

Parameters:

  • X (ndarray of shape (n_samples, n_features) or str if low_memory=True) –

    Training data.

Returns:

  • Similarity ( ndarray of shape (n_features, n_features) ) –

    Similarity matrix.

Source code in src/mosaic/similarity.py
@beartype
def transform(
    self,
    X: Union[FloatMax2DArray, str],
) -> FloatMatrix:
    """Compute the correlation/nmi distance matrix and return it.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features) or str if low_memory=True
        Training data.

    Returns
    -------
    Similarity : ndarray of shape (n_features, n_features)
        Similarity matrix.

    """
    return self.fit_transform(X)