similarity

Class for estimating correlation matrices.

`Similarity(*, metric='correlation', low_memory=False, normalize_method=None, use_knn_estimator=False)`

Bases: BaseEstimator

Class for calculating the similarity measure.

Parameters:

metric (str, default: 'correlation' ) –

The correlation metric to use for the feature distance matrix.

- 'correlation' uses the absolute value of the Pearson correlation
- 'NMI' uses the mutual information normalized by the joint entropy
- 'GY' uses the Gel'fand and Yaglom normalization[^1]
- 'JSD' uses the Jensen-Shannon divergence between the joint probability distribution and the product of the marginal probability distributions to calculate their dissimilarity

Note: 'NMI' is supported only with low_memory=False.
low_memory (bool, default: False ) –

If True, the input X of fit must be a file name, and the correlation is calculated on the fly. Otherwise, X is assumed to be an array.
normalize_method (str, default: 'geometric' ) –

Only supported for metric 'NMI'; setting it together with any other metric raises a NotImplementedError. Determines the normalization factor for the mutual information:

- 'joint' is the joint entropy
- 'max' is the maximum of the individual entropies
- 'arithmetic' is the mean of the individual entropies
- 'geometric' is the square root of the product of the individual entropies
- 'min' is the minimum of the individual entropies

use_knn_estimator (bool, default: False ) –

Can only be set for metric 'GY'. If True, the mutual information is estimated reliably by a parameter-free method based on entropy estimation from k-nearest-neighbor distances[^3]. This considerably increases the computation time and is therefore only advisable for relatively small data sets.
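
The normalization factors listed for normalize_method can be illustrated with a small histogram-based estimate. This is only a sketch under stated assumptions: the function names, the `bins` parameter, and the use of `np.histogram2d` are illustrative choices and are not taken from the mosaic source, whose estimator may differ.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a probability array, ignoring empty bins."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))


def normalized_mutual_information(x, y, normalize_method='geometric', bins=20):
    """Hypothetical helper: histogram estimate of the normalized MI."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    h_x = entropy(p_xy.sum(axis=1))   # marginal entropy H(X)
    h_y = entropy(p_xy.sum(axis=0))   # marginal entropy H(Y)
    h_xy = entropy(p_xy)              # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy
    # Normalization factors as described for normalize_method.
    norm = {
        'joint': h_xy,
        'max': max(h_x, h_y),
        'arithmetic': 0.5 * (h_x + h_y),
        'geometric': np.sqrt(h_x * h_y),
        'min': min(h_x, h_y),
    }[normalize_method]
    return mutual_info / norm
```

Since min ≤ geometric mean ≤ arithmetic mean ≤ max ≤ joint entropy, the resulting values are ordered the other way round: 'min' gives the largest and 'joint' the smallest normalized value for the same pair of features.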

Attributes:

matrix_ (ndarray of shape (n_features, n_features)) –

The correlation-measure-based pairwise distance matrix of the data. Its values lie in the range [0, 1].

Examples:

>>> import mosaic
>>> import numpy as np
>>> x = np.linspace(0, np.pi, 1000)
>>> data = np.array([np.cos(x), np.cos(x + np.pi / 6)]).T
>>> sim = mosaic.Similarity()
>>> sim.fit(data)
Similarity()
>>> sim.matrix_
array([[1.       , 0.9697832],
       [0.9697832, 1.       ]])

Notes

The Pearson correlation coefficient is defined as

\[\rho_{X,Y} = \frac{\langle(X -\mu_X)(Y -\mu_Y)\rangle}{\sigma_X\sigma_Y}.\]
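
As a plain NumPy illustration of this definition, the absolute Pearson correlation of all feature pairs can be computed as below. The function name `abs_pearson_matrix` is hypothetical and this is not the library's implementation, only a direct transcription of the formula.

```python
import numpy as np


def abs_pearson_matrix(data):
    """Return |rho| for all column pairs of an (n_samples, n_features) array."""
    centered = data - data.mean(axis=0)
    # Covariance <(X - mu_X)(Y - mu_Y)> for every pair of columns.
    cov = centered.T @ centered / len(data)
    std = data.std(axis=0)
    return np.abs(cov / np.outer(std, std))


x = np.linspace(0, np.pi, 1000)
data = np.array([np.cos(x), np.cos(x + np.pi / 6)]).T
matrix = abs_pearson_matrix(data)
```

The result agrees with `np.abs(np.corrcoef(data, rowvar=False))`, which can serve as a cross-check.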

For the online (low memory) option the Welford algorithm[^2] is used.
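
The online code itself is not shown here, so the following is only a generic sketch of a Welford-style update generalized to cross-moments. The name `online_abs_correlation` and the row-by-row iteration are assumptions made for illustration.

```python
import numpy as np


def online_abs_correlation(samples):
    """Accumulate |Pearson correlation| from an iterable of sample rows."""
    n = 0
    mean = None
    comoment = None  # running sum of (x - mean_old)(x - mean_new)^T
    for row in samples:
        row = np.asarray(row, dtype=float)
        if mean is None:
            mean = np.zeros_like(row)
            comoment = np.zeros((row.size, row.size))
        n += 1
        delta = row - mean       # deviation from the previous mean
        mean += delta / n        # update the running mean
        comoment += np.outer(delta, row - mean)
    # The factor 1/n cancels between covariance and standard deviations.
    std = np.sqrt(np.diag(comoment))
    return np.abs(comoment / np.outer(std, std))
```

Only the running mean and one (n_features, n_features) accumulator are kept in memory, so the rows can come from a generator that reads a file line by line.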

The Jensen-Shannon divergence is defined as

\[D_{\text{JS}} = \frac{1}{2} D_{\text{KL}}(p(x,y)||M) + \frac{1}{2} D_{\text{KL}}(p(x)p(y)||M)\;,\]

where \(M = \frac{1}{2} [p(x,y) + p(x)p(y)]\) is an averaged probability distribution and \(D_{\text{KL}}\) denotes the Kullback-Leibler divergence.
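
The definition above can be sketched with a two-dimensional histogram as the probability estimate. The binning, the base-2 logarithm (which bounds the divergence by 1), and the function names are assumptions for illustration; the estimator used by the library may differ.

```python
import numpy as np


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits, skipping bins where p == 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))


def jsd_joint_vs_product(x, y, bins=20):
    """D_JS between p(x, y) and p(x)p(y), estimated from a 2D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    p_prod = np.outer(p_x, p_y)      # product of the marginals
    m = 0.5 * (p_xy + p_prod)        # averaged distribution M
    return 0.5 * kl_divergence(p_xy, m) + 0.5 * kl_divergence(p_prod, m)
```

For independent features the joint distribution is close to the product of the marginals and the divergence is small; for strongly dependent features it approaches its upper bound.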

[^1]: Gel'fand, I.M. and Yaglom, A.M. (1957). "Calculation of amount of information about a random function contained in another such function". American Mathematical Society Translations, series 2, 12, pp. 199–246.
[^2]: Welford algorithm, generalized to correlation. Taken from: Knuth, D.E. (1998). "The Art of Computer Programming", volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.
[^3]: Ross, B.C. (2014). "Mutual Information between Discrete and Continuous Data Sets". PLoS ONE, 9(2).

Initialize Similarity class.

Source code in src/mosaic/similarity.py

@beartype
def __init__(
    self,
    *,
    metric: MetricString = 'correlation',
    low_memory: bool = False,
    normalize_method: Optional[NormString] = None,
    use_knn_estimator: bool = False,
):
    """Initialize Similarity class."""
    self.metric: MetricString = metric
    self.low_memory: bool = low_memory
    self.use_knn_estimator: bool = use_knn_estimator
    if self.metric == 'NMI':
        if normalize_method is None:
            normalize_method = self._default_normalize_method
    elif normalize_method is not None:
        raise NotImplementedError(
            'Normalize methods are only supported with metric="NMI"',
        )
    self.normalize_method: NormString = normalize_method
    if self.metric != 'GY' and self.use_knn_estimator:
        raise NotImplementedError(
            (
                'The mutual information estimate based on k-nearest '
                'neighbors distances is only supported with metric="GY"'
            ),
        )

`fit(X, y=None)`

Compute the correlation/NMI distance matrix.

Parameters:

X (ndarray of shape (n_samples, n_features) or str if low_memory=True) –

Training data.
y (Ignored, default: None ) –

Not used, present for scikit-learn API consistency by convention.

Returns:

self ( object ) –

Fitted estimator.

Source code in src/mosaic/similarity.py

@singledispatchmethod
@beartype
def fit(
    self,
    X: Union[FloatMax2DArray, str],
    y: Optional[ArrayLikeFloat] = None,
):
    """Compute the correlation/nmi distance matrix.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features) or str if low_memory=True
        Training data.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    self : object
        Fitted estimator.

    """
    raise NotImplementedError('Fatal error, this should never be reached.')
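
The stub above is only the fallback of a `singledispatchmethod`; the type-specific implementations are not shown on this page. As a sketch of how such dispatch works in general, the hypothetical class below registers one handler for arrays and one for file names. The class name, the `source_` attribute, and the handler bodies are invented for illustration.

```python
from functools import singledispatchmethod

import numpy as np


class DispatchDemo:
    """Hypothetical class illustrating type-based dispatch of ``fit``."""

    @singledispatchmethod
    def fit(self, X, y=None):
        # Fallback for unregistered input types, analogous to the stub above.
        raise NotImplementedError('Unsupported input type.')

    @fit.register
    def _(self, X: np.ndarray, y=None):
        # Chosen when X is an in-memory array.
        self.source_ = 'array'
        return self

    @fit.register
    def _(self, X: str, y=None):
        # Chosen when X is a file name.
        self.source_ = 'file'
        return self
```

Calling `DispatchDemo().fit(np.zeros((3, 2)))` selects the array handler, while `DispatchDemo().fit('data.txt')` selects the string handler; any other type reaches the fallback and raises.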

`fit_transform(X, y=None)`

Compute the correlation/NMI distance matrix and return it.

Parameters:

X (ndarray of shape (n_samples, n_features) or str if low_memory=True) –

Training data.
y (Ignored, default: None ) –

Not used, present for scikit-learn API consistency by convention.

Returns:

Similarity ( ndarray of shape (n_features, n_features) ) –

Similarity matrix.

Source code in src/mosaic/similarity.py

@beartype
def fit_transform(
    self,
    X: Union[FloatMax2DArray, str],
    y: Optional[ArrayLikeFloat] = None,
) -> FloatMatrix:
    """Compute the correlation/nmi distance matrix and returns it.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features) or str if low_memory=True
        Training data.
    y : Ignored
        Not used, present for scikit API consistency by convention.

    Returns
    -------
    Similarity : ndarray of shape (n_features, n_features)
        Similarity matrix.

    """
    self.fit(X)
    return self.matrix_

`transform(X)`

Compute the correlation/NMI distance matrix and return it.

Parameters:

X (ndarray of shape (n_samples, n_features) or str if low_memory=True) –

Training data.

Returns:

Similarity ( ndarray of shape (n_features, n_features) ) –

Similarity matrix.

Source code in src/mosaic/similarity.py

@beartype
def transform(
    self,
    X: Union[FloatMax2DArray, str],
) -> FloatMatrix:
    """Compute the correlation/nmi distance matrix and returns it.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features) or str if low_memory=True
        Training data.

    Returns
    -------
    Similarity : ndarray of shape (n_features, n_features)
        Similarity matrix.

    """
    return self.fit_transform(X)
