Most efficient way to find the mode in a numpy array

Question 1

I have a 2D array containing integers (both positive and negative). Each row represents the values over time for a particular spatial site, whereas each column represents the values for various spatial sites at a given time.

So if the array is like this:

1 3 4 2 2 7
5 2 2 1 4 1
3 3 2 2 1 1

The result should be

1 3 2 2 2 1

Note that when several values tie for the mode, any one of them (selected randomly) may be reported as the mode.

I can iterate over the columns and find the mode one column at a time, but I was hoping numpy might have a built-in function for this, or that there is a trick to compute it efficiently without looping.
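
For reference, the column-by-column loop described above could look roughly like the sketch below. The helper name column_modes_loop is made up for illustration and is not part of numpy; ties resolve to the smallest value here, which is one of the acceptable answers.

```python
import numpy as np

def column_modes_loop(arr):
    """Return the mode of each column using an explicit Python loop (hypothetical helper)."""
    arr = np.asarray(arr)
    modes = np.empty(arr.shape[1], dtype=arr.dtype)
    for j in range(arr.shape[1]):
        # np.unique returns the sorted distinct values and how often each occurs
        values, counts = np.unique(arr[:, j], return_counts=True)
        # argmax picks the first maximum, i.e. the smallest value among ties
        modes[j] = values[np.argmax(counts)]
    return modes

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])
result = column_modes_loop(a)
print(result)  # [1 3 2 2 1 1]
```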

Question 2

Check out scipy.stats.mode() (inspired by @tom10's comment):

import numpy as np
from scipy import stats

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

m = stats.mode(a)
print(m)

Output:

ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))

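As you can see, it returns both the modes and the counts. The output shown here comes from an older SciPy release; recent releases drop the reduced axis by default, so the arrays come back one-dimensional unless keepdims=True is passed. You can select the modes directly via m[0]: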

print(m[0])

Output:

[[1 3 2 2 1 1]]

Question 3

Update

The scipy.stats.mode function has been significantly optimized since this post and would be the recommended method.

Old answer

This is a tricky problem, since there is not much out there for calculating the mode along an axis. The solution is straightforward for 1-D arrays, where numpy.bincount is handy, along with numpy.unique with the return_counts argument set to True. The most common n-dimensional function I see is scipy.stats.mode, although it is prohibitively slow, especially for large arrays with many unique values. As a solution, I developed this function and use it heavily:

import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        # ravel() also covers 0-d and multi-dimensional single-element arrays
        return (ndarray.ravel()[0], 1)
    elif ndarray.size == 0:
        raise ValueError('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except (IndexError, TypeError):
        raise ValueError('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If the array is 1-D and the numpy version is >= 1.9, numpy.unique will suffice
    # (compare (major, minor) as a tuple so that versions such as 2.0 pass the check)
    version = tuple(int(part) for part in numpy.__version__.split('.')[:2])
    if ndim == 1 and version >= (1, 9):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1

    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    # Index with tuples; newer numpy versions reject lists of index arrays
    index = list(numpy.ogrid[tuple(slices)])
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[tuple(index)], counts[tuple(index)]

Result:

In [2]: a = numpy.array([[1, 3, 4, 2, 2, 7],
                         [5, 2, 2, 1, 4, 1],
                         [3, 3, 2, 2, 1, 1]])

In [3]: mode(a)
Out[3]: (array([1, 3, 2, 2, 1, 1]), array([1, 2, 2, 2, 1, 2]))

Some benchmarks:

In [4]: import scipy.stats

In [5]: a = numpy.random.randint(1,10,(1000,1000))

In [6]: %timeit scipy.stats.mode(a)
10 loops, best of 3: 41.6 ms per loop

In [7]: %timeit mode(a)
10 loops, best of 3: 46.7 ms per loop

In [8]: a = numpy.random.randint(1,500,(1000,1000))

In [9]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 1.01 s per loop

In [10]: %timeit mode(a)
10 loops, best of 3: 80 ms per loop

In [11]: a = numpy.random.random((200,200))

In [12]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 3.26 s per loop

In [13]: %timeit mode(a)
1000 loops, best of 3: 1.75 ms per loop

EDIT: Provided more background and modified the approach to be more memory-efficient.

Question 4

Expanding on this method: it can also be applied to finding the mode of data where you need the index into the actual array, for example to see how far the value is from the center of the distribution.

(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]

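Remember to discard the mode when more than one value reaches the maximum count. Note that np.argmax(counts) returns a single index, so len() cannot be applied to it; a check such as np.count_nonzero(counts == counts.max()) > 1 detects ties instead. To validate whether the mode is actually representative of the central distribution of your data, you can check whether it falls inside your standard deviation interval.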
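
A minimal sketch of both checks follows. The sample data is made up for illustration, and "standard deviation interval" is interpreted here as mean plus or minus one standard deviation, which is an assumption on my part.

```python
import numpy as np

a = np.array([2, 7, 7, 3, 7, 5, 3, 1])  # illustrative 1-D data

values, idx, counts = np.unique(a, return_index=True, return_counts=True)

# The mode is unambiguous only if exactly one value reaches the maximum count
is_unique_mode = np.count_nonzero(counts == counts.max()) == 1

# Index of the first occurrence of the mode in the original array
first_index = idx[np.argmax(counts)]
mode_value = a[first_index]

# Assumed interpretation: the mode should lie within one standard deviation of the mean
mean, std = a.mean(), a.std()
within_one_std = abs(mode_value - mean) <= std

print(mode_value, first_index, is_unique_mode, within_one_std)
```

For this sample the mode is 7 and it is unique, but it lies slightly more than one standard deviation above the mean, so it would not pass the second check.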

Question 5

A neat solution that only uses numpy (neither scipy nor the Counter class):

import numpy as np

A = np.array([[1,3,4,2,2,7], [5,2,2,1,4,1], [3,3,2,2,1,1]])

np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=A)

array([1, 3, 2, 2, 1, 1])
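
One caveat: np.bincount only accepts non-negative integers, while the question mentions negative values. A possible workaround, sketched below, is to shift the data by its minimum before counting and shift the result back afterwards. The function name column_modes_bincount is hypothetical, and the approach allocates one counter per integer between the minimum and maximum, so it only suits a moderate value range.

```python
import numpy as np

def column_modes_bincount(arr):
    """Column-wise mode via np.bincount, tolerating negative integers (hypothetical helper)."""
    arr = np.asarray(arr)
    offset = arr.min()
    # Shift so that the smallest value becomes 0; bincount rejects negative input
    shifted = arr - offset
    modes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, shifted)
    # Undo the shift to recover the original values
    return modes + offset

B = np.array([[-1,  3, 4],
              [-1, -2, 4],
              [ 5, -2, 0]])
print(column_modes_bincount(B))  # [-1 -2  4]
```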

Question 6

If you want to use numpy only:

import numpy as np

x = [-1, 2, 1, 3, 3]
vals, counts = np.unique(x, return_counts=True)

gives

(array([-1,  1,  2,  3]), array([1, 1, 1, 2]))

And extract it:

index = np.argmax(counts)
mode = vals[index]  # 3

Question 7

I think a very simple way would be to use the Counter class. You can then use the most_common() method of the Counter instance.

For 1-D arrays:

import numpy as np
from collections import Counter

nparr = np.arange(10)
nparr[2] = 6
nparr[3] = 6  # 6 is now the mode
mode = Counter(nparr).most_common(1)
# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])

For multi-dimensional arrays (only a small difference):

import numpy as np
from collections import Counter

nparr = np.arange(100)  # 100 elements are needed to fill a (10, 2, 5) array
nparr[2] = 6
nparr[3] = 6
nparr = nparr.reshape((10, 2, 5))  # same thing, but reshaped into a 3-D ndarray
mode = Counter(nparr.flatten()).most_common(1)  # just use the .flatten() method

# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])

This may or may not be an efficient implementation, but it is convenient.
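
Note that flattening gives the mode of the whole array. If the mode is wanted per column, as in the question, one option is to build a Counter for each column; this is only a sketch and loops in Python, so it is unlikely to be fast on large arrays.

```python
import numpy as np
from collections import Counter

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

# a.T iterates over the columns; tolist() converts numpy scalars to plain ints
modes = [Counter(col.tolist()).most_common(1)[0][0] for col in a.T]
print(modes)
```

When several values tie within a column, which of them is reported depends on how Counter orders equal counts.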

Question 8

from collections import Counter

n = int(input())  # number of values (read, but not otherwise used)
data = sorted([int(i) for i in input().split()])

# Sort by value first, then by frequency (descending); ties resolve to the smallest value
mode = sorted(sorted(Counter(data).items()), key = lambda x: x[1], reverse = True)[0][0]

print(mode)

Counter(data) counts the frequencies and returns a Counter, which is a dict subclass. sorted(Counter(data).items()) sorts by the keys, not by the frequency. Finally, you need to sort by frequency with another sorted call using key = lambda x: x[1]. The reverse argument tells Python to sort the frequencies from largest to smallest.
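
Because Python's sort is stable, the two nested sorts return the smallest value among those that share the highest frequency. As a sketch, the same result can be obtained in a single pass with a composite key; the sample data and the variable names two_step and one_step are illustrative only.

```python
from collections import Counter

data = [4, 1, 2, 2, 4, 3]  # 2 and 4 both occur twice
counts = Counter(data)

# Two-step version: sort by value, then stably by descending frequency
two_step = sorted(sorted(counts.items()), key=lambda x: x[1], reverse=True)[0][0]

# Single pass: highest count first, smallest value as the tie-breaker
one_step = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0]

print(two_step, one_step)  # 2 2
```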

Question 9

The simplest way in Python to get the mode of a list or array a:

import statistics
print("mode = " + str(statistics.mode(a)))

That's it.
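
A short sketch of how this behaves with ties, assuming Python 3.8 or newer (older versions raise StatisticsError when there is more than one mode). The statistics module works on one-dimensional data only, so a 2-D array would have to be handled column by column.

```python
import statistics

a = [1, 3, 3, 2, 2, 7]  # 3 and 2 both occur twice

# mode() returns the first mode encountered in the data
single = statistics.mode(a)

# multimode() returns every value that reaches the highest count
all_modes = statistics.multimode(a)

print(single, all_modes)  # 3 [3, 2]
```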