similarity
Class for estimating correlation matrices.
MIT License Copyright © 2021-2022, Daniel Nagel, Georg Diez All rights reserved.
Similarity(*, metric='correlation', low_memory=False, normalize_method=None, use_knn_estimator=False)
Bases: BaseEstimator
Class for calculating the similarity measure.
Parameters:
- metric (str, default: 'correlation') – The correlation metric to use for the feature distance matrix.
    - 'correlation' uses the absolute value of the Pearson correlation.
    - 'NMI' uses the mutual information normalized by the joint entropy.
    - 'GY' uses the Gel'fand and Yaglom normalization[^1].
    - 'JSD' uses the Jensen-Shannon divergence between the joint probability distribution and the product of the marginal probability distributions to calculate their dissimilarity.

    Note: 'NMI' is supported only with low_memory=False.
- low_memory (bool, default: False) – If True, the input X of fit needs to be a file name and the correlation is calculated on the fly. Otherwise, an array is assumed as input X.
- normalize_method (str, default: 'geometric') – Only required for metric 'NMI'. Determines the normalization factor for the mutual information:
    - 'joint' is the joint entropy.
    - 'max' is the maximum of the individual entropies.
    - 'arithmetic' is the mean of the individual entropies.
    - 'geometric' is the square root of the product of the individual entropies.
    - 'min' is the minimum of the individual entropies.
- use_knn_estimator (bool, default: False) – Can only be set for metric 'GY'. If True, the mutual information is estimated reliably by a parameter-free method based on entropy estimation from k-nearest-neighbor distances[^3]. This considerably increases the computational time and is thus only advisable for relatively small datasets.
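The normalization options listed for 'NMI' can be illustrated on a small discrete joint distribution. The following is a minimal sketch in plain Python and not the library's implementation. The helper names `entropy` and `nmi_variants` and the toy probability table are assumptions made for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def nmi_variants(joint):
    """Normalized mutual information for each normalization option.

    `joint` is a nested list with joint[i][j] = p(x_i, y_j).
    """
    px = [sum(row) for row in joint]          # marginal p(x)
    py = [sum(col) for col in zip(*joint)]    # marginal p(y)
    h_x, h_y = entropy(px), entropy(py)
    h_xy = entropy([p for row in joint for p in row])
    mutual_info = h_x + h_y - h_xy            # I(X;Y) = H(X) + H(Y) - H(X,Y)
    factors = {
        'joint': h_xy,
        'max': max(h_x, h_y),
        'arithmetic': 0.5 * (h_x + h_y),
        'geometric': math.sqrt(h_x * h_y),
        'min': min(h_x, h_y),
    }
    return {name: mutual_info / factor for name, factor in factors.items()}


# Hypothetical toy distribution of two correlated binary variables.
joint = [[0.50, 0.20],
         [0.05, 0.25]]
nmi = nmi_variants(joint)
for name, value in nmi.items():
    print(f"{name:>10}: {value:.4f}")
```

Since the minimum entropy is never larger than the geometric mean, the arithmetic mean, the maximum, or the joint entropy (in that order), 'min' yields the largest normalized value and 'joint' the smallest for the same pair of features.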
Attributes:
- matrix_ (ndarray of shape (n_features, n_features)) – The correlation-measure-based pairwise distance matrix of the data. Its values lie in the range [0, 1].
Examples:
>>> import numpy as np
>>> import mosaic
>>> x = np.linspace(0, np.pi, 1000)
>>> data = np.array([np.cos(x), np.cos(x + np.pi / 6)]).T
>>> sim = mosaic.Similarity()
>>> sim.fit(data)
Similarity()
>>> sim.matrix_
array([[1.       , 0.9697832],
       [0.9697832, 1.       ]])
Notes
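The Pearson correlation coefficient is defined as
\[\rho_{X,Y} = \frac{\langle (X - \mu_X)(Y - \mu_Y) \rangle}{\sigma_X \sigma_Y}\]
where \(\mu\) and \(\sigma\) denote the mean and the standard deviation of the respective feature.
For the online (low memory) option the Welford algorithm[^2] is used.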
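The idea behind a Welford-style online correlation can be sketched as follows. This is an illustrative single-pass update for one pair of features, not the library's code. The class name `OnlineCorrelation` is hypothetical, and the test signal simply mirrors the example above.

```python
import math


class OnlineCorrelation:
    """Running Pearson correlation of a stream of (x, y) pairs."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0   # running sum of squared deviations of x
        self.m2_y = 0.0   # running sum of squared deviations of y
        self.c_xy = 0.0   # running sum of co-deviations of x and y

    def update(self, x, y):
        """Fold one new sample into the running statistics."""
        self.n += 1
        dx = x - self.mean_x            # deviation from the previous mean
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def correlation(self):
        return self.c_xy / math.sqrt(self.m2_x * self.m2_y)


# Stream the two signals sample by sample, as if read from a file.
xs = [math.pi * i / 999 for i in range(1000)]
a = [math.cos(x) for x in xs]
b = [math.cos(x + math.pi / 6) for x in xs]

online = OnlineCorrelation()
for ai, bi in zip(a, b):
    online.update(ai, bi)

# Two-pass reference computation for comparison.
mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
var_a = sum((ai - mean_a) ** 2 for ai in a)
var_b = sum((bi - mean_b) ** 2 for bi in b)
reference = cov / math.sqrt(var_a * var_b)
```

Only the running means and three accumulated sums are kept in memory, so the full data set never has to be loaded at once.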
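The Jensen-Shannon divergence is defined as
\[D_{\text{JS}} = \frac{1}{2} D_{\text{KL}}\big(p(x,y) \,\|\, M\big) + \frac{1}{2} D_{\text{KL}}\big(p(x)p(y) \,\|\, M\big)\]
where \(M = \frac{1}{2} [p(x,y) + p(x)p(y)]\) is an averaged probability distribution and \(D_{\text{KL}}\) denotes the Kullback-Leibler divergence.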
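For a discrete joint distribution, the divergence between \(p(x,y)\) and \(p(x)p(y)\) with the mixture \(M\) can be evaluated directly. The sketch below is a plain-Python illustration under the assumption of base-2 logarithms, which bound the result by 1. The function names are hypothetical, and how the library estimates the probabilities or maps the divergence to its matrix entries is not covered here.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits for flat lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd_joint_vs_marginals(joint):
    """Jensen-Shannon divergence between p(x, y) and p(x) p(y).

    `joint` is a nested list with joint[i][j] = p(x_i, y_j).
    """
    px = [sum(row) for row in joint]                 # marginal p(x)
    py = [sum(col) for col in zip(*joint)]           # marginal p(y)
    p_joint = [p for row in joint for p in row]      # row-major flattening
    p_prod = [a * b for a in px for b in py]         # same row-major order
    mix = [0.5 * (a + b) for a, b in zip(p_joint, p_prod)]   # M
    return 0.5 * kl_divergence(p_joint, mix) + 0.5 * kl_divergence(p_prod, mix)


# Independent variables: the joint equals the product of the marginals.
independent = jsd_joint_vs_marginals([[0.28, 0.12],
                                      [0.42, 0.18]])

# Perfectly coupled binary variables.
dependent = jsd_joint_vs_marginals([[0.5, 0.0],
                                    [0.0, 0.5]])
```

The divergence vanishes for independent features and grows as the joint distribution departs from the product of its marginals.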
[^1]: Gel'fand, I.M. and Yaglom, A.M. (1957). "Calculation of amount of information about a random function contained in another such function". American Mathematical Society Translations, series 2, 12, pp. 199–246.
[^2]: Welford algorithm, generalized to correlation. Taken from: Knuth, D.E. (1998). "The Art of Computer Programming", volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.
[^3]: Ross, B.C. (2014). "Mutual Information between Discrete and Continuous Data Sets". PLoS ONE 9(2).
Initialize Similarity class.
Source code in src/mosaic/similarity.py
fit(X, y=None)
Compute the correlation/NMI distance matrix.
Parameters:
- X (ndarray of shape (n_samples, n_features), or str if low_memory=True) – Training data.
- y (Ignored, default: None) – Not used, present for scikit-learn API consistency by convention.
Returns:
- self (object) – Fitted estimator.
fit_transform(X, y=None)
Compute the correlation/NMI distance matrix and return it.
Parameters:
- X (ndarray of shape (n_samples, n_features), or str if low_memory=True) – Training data.
- y (Ignored, default: None) – Not used, present for scikit-learn API consistency by convention.
Returns:
- Similarity (ndarray of shape (n_features, n_features)) – Similarity matrix.
transform(X)
Compute the correlation/NMI distance matrix and return it.
Parameters:
- X (ndarray of shape (n_samples, n_features), or str if low_memory=True) – Training data.
Returns:
- Similarity (ndarray of shape (n_features, n_features)) – Similarity matrix.