Eine effiziente Datenstruktur, die Insert, Delete und MostFrequent unterstützt

Angenommen, wir haben eine Menge $D$ und jedes Mitglied von $D$ ist ein Daten- und Schlüsselpaar. Wir möchten eine Datenstruktur, die die folgenden Operationen unterstützt:

Setze in , $(d,k)$ $D$
Löschen Sie das Mitglied (es muss nicht gesucht werden, um zu finden , z. B. zeigt auf ein Mitglied in ). $e$ $e$ $e$ $D$
MostFrequent, das ein Mitglied zurückgibt, so dass eine der häufigsten Tasten ist (Beachten Sie, dass die häufigste Schlüssel nicht eindeutig zu sein braucht). $e \in D$ $e.key$ $D$

Was wäre eine effiziente Implementierung dieser Datenstruktur?

My solution is a heap for the keys and their frequencies prioritized by the frequencies plus a hash table where the hash function maps members with the same key to the same slot in the hash table (with pointers from each part to the other).

This can give $\Theta(\lg n)$ for the first two operations and $\Theta(1)$ for the third (worst case running time).

I am wondering if there is more efficient solution? (or a simpler solution with the same efficiency?)

algorithms data-structures abstract-data-types

— Kaveh
quelle

You could use a simple balanced binary search tree instead of a hash table if you want.

— Joe

Hash table uses a lot of unnessesary space, i would propose priority queue. It would give you the same time complexity on insert and delete but memory complexity would be better.

— Bartosz Przybylski

@Joe, using a BST in place of a hash table would make MostFrequent operations less efficient, but that might be a reasonable trade-off for memory.

— Kaveh

If using comparisons only, at least one of Insert/MostFrequent has to be amortized

Ω (\log n)

$\Omega(\log n)$ , because of the lower bounds for element distinctness problem.

— Aryabhata

There are also some interesting structures in the streaming model. springerlink.com/content/t17nhd9hwwry909p

— Joe

Antworten:

In the comparison-based model of computation, you may implement the priority queue using a Fibonacci heap instead of an ordinary heap. This will give you the following bounds: $\mathcal{O}(1)$ amortized time for insert and $\mathcal{O}(\log n)$ amortized time for deletion operations.

If you depart from the comparison-based model and adopt the RAM model where keys are regarded as binary strings, each one contained in one or more machine words, you may implement your priority queue in $o(\log n)$ . Indeed, you can achieve for both insert and delete operations $\mathcal{O}(\sqrt{\log \log n})$ and $\mathcal{O}(1)$ time for the findMin operation. Thorup proved that

If we can sort $n$ keys in time $S(n)$ per key, then we can implement a priority queue supporting find-min in constant time and updates (insert and delete) in $S(n)$ time.

See M. Thorup. Equivalence between priority queues and sorting, 2002. in Proc. FOCS 2002

Since we can sort in $\mathcal{O}(n \sqrt{\log \log n})$ expected time and linear space, as shown by

Y. Han and M. Thorup. Integer sorting in $\mathcal{O}(n \sqrt{\log \log n})$ expected time and linear space. in Proc. FOCS 2002

the bound is proved.

— Massimo Cafaro
quelle

You can do all of these in $O(1)$ expected amortized time. The essential trick is that we don't need the full power of a priority queue, since key frequency only changes by 1 during each insert or delete.

My solution below is really just your solution with an "inefficient" priority queue that happens to work well for this case: a max priority queue implemented as a doubly linked lists of buckets of keys has O(1) insertMin, deleteMax, removeFromBucket, and increaseKey.

Maintain a doubly-linked list of Buckets, where each Bucket has a non-empty hash set of keys (that I'll call a Cohort) and a positive integer (that I'll call the ValCount). In a Bucket b, each key k in the Cohort of b has the same number of unique values associated with it in the set you are maintaining. For example, if your set has the pairs (a,apple), (a,avocado), (b,banana), (c,cucumber), (d,dragon fruit) where the single letters are the keys and the fruits are the values, then you would have two Buckets: One Bucket would have a ValCount of 2 and a Cohort consisting only of one key: a. The other Bucket would have a ValCount of 1 and a Cohort consisting of the three keys b, c, and d.

The doubly-linked list of Bucket should be kept ordered by the ValCount. It will be important that we can find the head and the tail of the list in $O(1)$ time and that we can splice in a new Bucket in $O(1)$ time if we know its neighbors. Unimaginatively, I'll call the list of Buckets the BucketList.

In addition to the BucketList, we'll need a SetMap, which is a hash map mapping keys to ValueBuckets. A ValueBucket is a pair consisting of the ValueSet (a non-empty hash set of values) and a non-null pointer to a Bucket. The ValueSet associated with a key k contains all the unique values associated with k. The Bucket pointer associated with a ValueSet has a Cohort equal to the size of the ValueSet. The Bucket associated with a key k in the SetMap is also associated with the key k in the BucketList.

In C++:

struct Bucket {
    unsigned ValCount;
    unordered_set<Key> Cohort;
    Bucket * heavier;
    Bucket * lighter;
};
Bucket * BucketListHead;
Bucket * BucketListTail;

struct ValueBucket {
  unordered_set<Value> ValueSet;
  Bucket * bucket;
};
unordered_map<Key, ValueBucket> SetMap;

To find a max-frequency key-value pair, we just need to look at the head of the BucketList, find a key in the Cohort, look up that key in the SetMap, and find a value in the ValueSet of its ValueBucket. (phew!)

Inserting and deleting key-value pairs is trickier.

To insert or delete a key-value pair, we first insert or delete it in the SetMap This will change the size of the ValueSet, so we need to modify the Bucket associated with the key. The only Buckets we will need to look at to make this change will be the immediate neighbors of the Bucket the key used to be in. There are several cases here, and they are probably not worth spelling out fully, though I'd be happy to elaborate if you're still having trouble.

— jbapple
quelle

Thanks. Actually we had a similar idea for a solution, but the was a problem with it. I have to check what was the problem as I don't recall it right now. I will check this more carefully next week to see if it avoids the problem that our solution had.

— Kaveh

Here's another way to think about my solution: it's really just your solution with an "inefficient" priority queue that happens to work well for this case. A max priority queue implemented as a doubly linked lists of buckets of keys has O(1) insertMin, deleteMax, removeFromBucket, and increaseKey.

— jbapple

The most efficient way (in terms of worst case) to maintain the mapping from keys for ValueBuckets is probably a search tree.

— Raphael

Raphael - I'm not sure what you're getting at. Are you saying that hash tables are expensive in practice, or that they have bad performance in the worst case, or some third thing?

— jbapple

-3

Worst-case complexity

Insert: $\mathcal{O}(1)$

Find-min: $\mathcal{O}(1)$

Decrease-key: $\mathcal{O}(1)$

Delete: $\mathcal{O}(\log \log n)$

Space: $\mathcal{O}(n)$

Proof

THEOREM 1. We can implement a priority queue that with n integer keys in the range $[0 , N )$ in linear space supporting ﬁnd-min, insert, and dec-key in constant time, and delete in $\mathrm{\mathcal{O}(log\ log\ min \{n, N\})}$ time.

Which is established with a combination of:

LEMMA 3. Let $\tau(n, N)$ be the delete time for a priority queue for up to $n$ integers in the range $[0 , N)$ supporting insert and dec-key in constant time. Then $\tau ( n, N ) \le τ ( N, N)$ . This holds whether $\tau$ is amortized or worst-case.

and:

THEOREM 6. Theorem 6. We can implement a priority queue that with $n$ integer keys in the range $[0 , N)$ in linear space supporting ﬁnd-min, insert, and dec-key in constant time, and delete in $\mathrm{\mathcal{O}(1 + log\ log\ n − log\ log\ q)}$ time for a key of rank $q$ .

Reference

Thorup, Mikkel. “Integer Priority Queues with Decrease Key in Constant Time and the Single Source Shortest Paths Problem.” In Proceedings of the Thirty-fifth Annual ACM Symposium on Theory of Computing, 149–158. STOC ’03. New York, NY, USA: ACM, 2003.

— A T
quelle

Note that in all priority queues it is trivial to move to a structure supporting 'get-min' and 'extract-min' to a structure supporting 'get-max' and 'extract-max'.

— A T

Ping...: @Kaveh

— A T