similarity
Class for estimating correlation matrices.
MIT License Copyright © 2021-2022, Daniel Nagel, Georg Diez All rights reserved.
Similarity(*, metric='correlation', low_memory=False, normalize_method=None, use_knn_estimator=False)
Bases: BaseEstimator
Class for calculating the similarity measure.
Parameters:
-
metric(str, default:'correlation') –The correlation metric to use for the feature distance matrix:
  - 'correlation' uses the absolute value of the Pearson correlation
  - 'NMI' uses the mutual information normalized by the joint entropy
  - 'GY' uses the Gel'fand and Yaglom normalization[^1]
  - 'JSD' uses the Jensen-Shannon divergence between the joint probability distribution and the product of the marginal probability distributions to calculate their dissimilarity
Note: 'NMI' is supported only with low_memory=False.
-
low_memory(bool, default:False) –If True, the input X of fit must be a file name, and the correlation is computed on the fly. Otherwise, X is assumed to be an array.
-
normalize_method(str, default:'geometric') –Only required for metric 'NMI'. Determines the normalization factor for the mutual information:
  - 'joint' is the joint entropy
  - 'max' is the maximum of the individual entropies
  - 'arithmetic' is the mean of the individual entropies
  - 'geometric' is the square root of the product of the individual entropies
  - 'min' is the minimum of the individual entropies
-
use_knn_estimator(bool, default:False) –Can only be set for metric 'GY'. If True, the mutual information is estimated reliably by a parameter-free method based on entropy estimation from k-nearest-neighbor distances[^3]. This considerably increases the computation time and is thus only advisable for relatively small data sets.
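The normalization factors listed under normalize_method can be illustrated with a small sketch. It is not taken from the mosaic source: the helper `normalization_factor` and the entropy values are hypothetical, and the mutual information is obtained from the standard identity I(X, Y) = H(X) + H(Y) - H(X, Y).

```python
import numpy as np


def normalization_factor(h_x, h_y, h_xy, method="geometric"):
    """Return the factor that divides the mutual information.

    Hypothetical helper for illustration; h_x and h_y are the individual
    entropies, h_xy is the joint entropy.
    """
    factors = {
        "joint": h_xy,
        "max": max(h_x, h_y),
        "arithmetic": 0.5 * (h_x + h_y),
        "geometric": np.sqrt(h_x * h_y),
        "min": min(h_x, h_y),
    }
    return factors[method]


# Made-up entropies (in bits) of a pair of discrete variables.
h_x, h_y, h_xy = 1.0, 2.0, 2.5
mutual_info = h_x + h_y - h_xy  # I(X, Y) = H(X) + H(Y) - H(X, Y)

nmi = {
    method: mutual_info / normalization_factor(h_x, h_y, h_xy, method)
    for method in ("joint", "max", "arithmetic", "geometric", "min")
}
```

Because min ≤ geometric ≤ arithmetic ≤ max ≤ joint for any valid set of entropies, the same mutual information yields the largest normalized value with 'min' and the smallest with 'joint'.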
Attributes:
-
matrix_(ndarray of shape (n_features, n_features)) –The correlation-measure-based pairwise similarity matrix of the features. Its values lie in the range [0, 1].
Examples:
>>> import numpy as np
>>> import mosaic
>>> x = np.linspace(0, np.pi, 1000)
>>> data = np.array([np.cos(x), np.cos(x + np.pi / 6)]).T
>>> sim = mosaic.Similarity()
>>> sim.fit(data)
Similarity()
>>> sim.matrix_
array([[1.       , 0.9697832],
       [0.9697832, 1.       ]])
Notes
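The Pearson correlation coefficient is defined as
\[\rho_{X,Y} = \frac{\langle (X - \mu_X)(Y - \mu_Y) \rangle}{\sigma_X \sigma_Y},\]
where \(\mu\) denotes the mean and \(\sigma\) the standard deviation of the respective feature.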
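As a rough illustration of this definition, the following NumPy sketch computes the absolute Pearson correlation between all feature columns. It is not the mosaic implementation; the function name `abs_pearson` is hypothetical. The input is the same data used in the example above.

```python
import numpy as np


def abs_pearson(data):
    """Absolute Pearson correlation between all columns of `data`.

    Hypothetical helper; `data` has shape (n_samples, n_features).
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Covariance <(X - mu_X)(Y - mu_Y)> for every pair of columns.
    cov = centered.T @ centered / len(data)
    std = np.sqrt(np.diag(cov))
    return np.abs(cov / np.outer(std, std))


x = np.linspace(0, np.pi, 1000)
data = np.array([np.cos(x), np.cos(x + np.pi / 6)]).T
matrix = abs_pearson(data)
```

The result should agree with `np.abs(np.corrcoef(data, rowvar=False))`, since the normalization by the number of samples cancels in the ratio.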
For the online (low memory) option the Welford algorithm[^2] is used.
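A single-pass update in the spirit of the Welford algorithm can be sketched as follows. This is a minimal illustration and not the code used by mosaic: the function `online_abs_correlation` is hypothetical, and the rows are read from any iterable rather than from a file.

```python
import numpy as np


def online_abs_correlation(rows):
    """Absolute Pearson correlation from a single pass over `rows`.

    Hypothetical helper; `rows` is any iterable that yields one sample
    (a 1D sequence of n_features values) at a time.
    """
    n = 0
    mean = None
    comoment = None  # running sum of (x - mean)(x - mean)^T
    for row in rows:
        row = np.asarray(row, dtype=float)
        if mean is None:
            mean = np.zeros_like(row)
            comoment = np.zeros((row.size, row.size))
        n += 1
        delta = row - mean  # deviation from the previous mean
        mean += delta / n   # updated running mean
        # Welford-style update: previous-mean deviation times new-mean deviation.
        comoment += np.outer(delta, row - mean)
    std = np.sqrt(np.diag(comoment))
    return np.abs(comoment / np.outer(std, std))


rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 3))
samples[:, 1] += 0.5 * samples[:, 0]  # make two features correlated
result = online_abs_correlation(iter(samples))
```

Only the running mean and the co-moment matrix are kept in memory, so the cost in memory depends on the number of features and not on the number of samples.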
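The Jensen-Shannon divergence is defined as
\[D_{\text{JS}} = \frac{1}{2} D_{\text{KL}}\big(p(x,y) \,\|\, M\big) + \frac{1}{2} D_{\text{KL}}\big(p(x)p(y) \,\|\, M\big),\]
where \(M = \frac{1}{2} [p(x,y) + p(x)p(y)]\) is an averaged probability distribution and \(D_{\text{KL}}\) denotes the Kullback-Leibler divergence.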
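The Jensen-Shannon divergence between the joint distribution and the product of the marginals can be estimated from a two-dimensional histogram, as in the sketch below. The binning, the use of base-2 logarithms (which bounds the result by 1), and the function names are assumptions made for illustration, not details taken from mosaic.

```python
import numpy as np


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; bins with p == 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def jsd_joint_vs_marginals(x, y, bins=20):
    """Histogram estimate of D_JS between p(x, y) and p(x) p(y).

    Hypothetical helper; `bins` is an assumed discretization.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()                                        # p(x, y)
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))    # p(x) p(y)
    mixture = 0.5 * (joint + product)                           # M
    return (0.5 * kl_divergence(joint, mixture)
            + 0.5 * kl_divergence(product, mixture))


rng = np.random.default_rng(1)
a = rng.normal(size=5000)
b = rng.normal(size=5000)
jsd_independent = jsd_joint_vs_marginals(a, b)  # close to 0
jsd_dependent = jsd_joint_vs_marginals(a, a)    # clearly larger
```

Wherever one of the two distributions is nonzero, the mixture M is nonzero as well, so the logarithms in both Kullback-Leibler terms stay finite.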
-
Gel'fand, I.M. and Yaglom, A.M. (1957). "Calculation of amount of information about a random function contained in another such function". American Mathematical Society Translations, series 2, 12, pp. 199–246.
-
Welford algorithm, generalized to correlation. Taken from: Knuth, D.E. (1998). "The Art of Computer Programming", volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.
-
Ross, B.C. (2014). "Mutual Information between Discrete and Continuous Data Sets". PLoS ONE, 9(2).
Initialize Similarity class.
Source code in src/mosaic/similarity.py
fit(X, y=None)
Compute the correlation/NMI distance matrix.
Parameters:
-
X(ndarray of shape (n_samples, n_features) or str if low_memory=True) –Training data.
-
y(Ignored, default:None) –Not used, present by convention for scikit-learn API consistency.
Returns:
-
self(object) –Fitted estimator.
fit_transform(X, y=None)
Compute the correlation/NMI distance matrix and return it.
Parameters:
-
X(ndarray of shape (n_samples, n_features) or str if low_memory=True) –Training data.
-
y(Ignored, default:None) –Not used, present by convention for scikit-learn API consistency.
Returns:
-
Similarity(ndarray of shape (n_features, n_features)) –Similarity matrix.
transform(X)
Compute the correlation/NMI distance matrix and return it.
Parameters:
-
X(ndarray of shape (n_samples, n_features) or str if low_memory=True) –Training data.
Returns:
-
Similarity(ndarray of shape (n_features, n_features)) –Similarity matrix.