Discard Trees: a data structure for finding nonintersecting combinations

in #programming6 years ago (edited)

A "discard tree" is a data structure I invented that is keyed by bitvectors representing K-combinations. It permits us to efficiently find all bitvectors that do not intersect with the search bitvector.

Motivation

Suppose we're trying to solve Chinese Poker. There are various ways to set your 13 (or 17) cards. Which one is best may depend on how your opponent is setting their cards. But that's exactly the problem we are trying to solve! So the approach I took is to iteratively improve the hand settings through multiple rounds of evaluation, based on how they were set in the previous round.

It's not possible to store all 52C13 = 634,013,559,600 combinations on an ordinary desktop, and it would be lengthy to run through them all. (Though as a side note, if we store 3 64-bit words per hand to represent its key and its setting, then that is "only" 15.24 TB, and soon we will be able to get flash drives that large. But 52C17 gives 527TB of storage.)
So instead we will resort to statistical sampling. We'll draw a large sample of hands, compare against just that sample, and make our decisions upon the results over the sample.

Two hands can be compared only if they have no cards in common. If player A has the jack of hearts, player B's hands and settings that include the jack of hearts are irrelevant. So, we want to efficiently find the hands in our sample which do not overlap with the hand we are evaluating for A. Each intersection test can be done quickly by representing the cards as a bit-vector (52 bits, one per card) and using logical AND. We seek to reduce the number of intersection tests that have to be done, compared with just testing every hand in the sample.

An example of representing 13-card hands as 52-bit bitvectors, and testing for intersection; figure drawn by the author.

Definition

A discard tree consists of a labelled set of buckets containing bitvector entries.

  • Each bucket contains only bit-vectors that are supersets of the bitvectors on the bucket label. ( x AND label == label for every bitvector x in the bucket )
  • Each bucket may contain either a list of bitvectors, or another discard tree.
  • Every valid key must match the label on some bucket.

The last condition motivates the search for minimal covering sets for combinations, which I wrote about previously. If we are storing bitvectors that all have 17 bits set out of 52 (17-combinations) then we want a label set that efficiently partitions the search into nearly equal-sized buckets.

Insertion: An entry may be placed in any matching bucket. If there is more than one matching bucket, the entry should be placed randomly. If the bucket contains a subtree, recursively insert the entry into that subtree.

Search: In order to find nonintersecting bitvectors for an input X, walk the buckets in the root of the tree (in any order, or in parallel) and perform the intersection test (logical AND) between the label and X. Recursively visit any bucket in which this intersection is empty. If we reach a leaf (a list of bitvectors) perform the intersection test with every entry in the leaf.

Note that the discard tree can't be used to search for intersection efficiently, only non-intersection. We would have to try each of the buckets that might intersect at all, which could be all of them.

Comparison with radix lookup

If we're storing arbitrary bitstrings, then we might as well avoid this structure and use a conventional Radix tree where we can do exact matching at each level of the tree. But when we're storing K-combinations, where each bitstring has K bits set, then this is not very efficient.

Let's say we use three bits at the first level of a radix tree that stores all 5-of-20 combinations, and compute how many entries could live in each subtree.

radix label (exact match, first 3 bits)subtree sizecalculation
00061885 bits remaining, 17C5 possibilities
00123804 bits remaining, 17C4
0102380
0116803 bits remaining, 17C3
1002380
101680
110680
1111362 bits remaining, 17C2

Some subtrees are going to have to store a lot more entries than others. The radix tree lets us just look up the first three bits and go directly to the right bucket, which is great for searching for a single entry. But the cost is that some of those buckets are very full, and some of those buckets are nearly empty. In fact, the first bucket is going to have to be searched every time, because no bitvector intersects with 000, so the first level of the radix tree didn't let us cut down the search very much if at all.

If we use a discard tree instead, then each subtree can be adjusted to be of equal size. The subtree size here is the number of matches, which don't add up because a key may match more than one bucket.

discard label (intersection match)subtree sizecalculation
11000 00000 00000 0000081618 bits left, 3 needed so 18C3
10100 00000 00000 00000816same!
10010 00000 00000 00000816
10001 00000 00000 00000816
01100 00000 00000 00000816
01010 00000 00000 00000816
01001 00000 00000 00000816
00110 00000 00000 00000816
00101 00000 00000 00000816
00011 00000 00000 00000816
00000 11000 00000 00000816
00000 10100 00000 00000816
....8165C2 * 4 = 40 buckets total

The label set here is constructed using the pigeonhole principle as described in the previous article. We partition the 20 bits into 4 groups, at least one group will have two bits set, so all two-combinations within each group covers the entire range of possibilities. Every label test does exactly as much work, and every label excludes a fraction of the possible hands.

Now, we're doing a lot more work at each step (more intersection tests) but we're getting a lot of benefit (smaller subtrees in each bucket.) What is the right balance?

Optimizing for 17-card lookup

While searching for nonintersecting bitstrings we have to do two sets of comparisons: logical ANDs with the label set, and logical ANDs with each entry in the leaves we visit. We want to find an efficient structure that optimizes the total cost of the search. The structure will consist of:

  • a fixed depth of the tree
  • the bit weight (the "L" value in our construction) for the bucket labels at each depth

It would be possible to build a tree that uses an irregular structure instead, but I don't see any benefit to it. But I haven't proven the regular structure is optimal! I also assume that each label is constructed in the way linked to above, rather than some other, irregular scheme.

The good news is that under these assumptions, the fraction of the stored set of bitvectors that we have to search is more or less independent of the depth of the tree, if we keep the number of bits in the label set constant. Here are the results of simulation (from https://markgritter.livejournal.com/607693.html, which I did not re-run for this article.) The set being stored is C(52,17): 17-bit vectors of length 52.

tree depthbit weight at each level of the treefraction of leaf buckets visited
150.12556
160.07891
23, 30.08061
22, 40.08068
32,2,20.08006
170.0503 (expected)
24, 30.04995
25, 20.05054
22, 50.05062
33, 2, 20.05058
180.0313 (expected)
24, 40.03092
25, 30.03142

So, for example, the second row shows that if we index using 6-bit labels and only a single tier of buckets, we will visit about 0.07891 of the buckets. But if we use 3-bit labels, and then another set of 3-bit labels (all of which have the previous 3 bits set too) in a tree of height 2, then we visit 0.08061 of the buckets. The difference is small but probably not significant (and may even be due to randomness in the simulation.)

That means we can focus just on optimizing the cost of the bucket matching, which is improved by going to more layers of the tree. Every search has to visit the labels on the buckets in the root. Using fewer bits means a smaller bucket set in that first layer. So, for every K (every bit weight of our lowest level of buckets) we can calculate an essentially fixed cost for a search. Using that, we calculate the breakeven point on the number of entries, where the reduction in search space pays for that fixed cost.

(Remember, this is all when we're storing 17-combinations out of 52 bits in the data structure; for other choices the optimal structure or breakeven points may be different. We can do this optimization because we have a known workload for the discard tree, not one unknown at the time of creation.)

KSpace reductionBest structureFixed costBreakeven point
20.4487[2]60134 hands
30.2962[3]220743
40.1934[4]12906670
50.1249[3,2]~413033064
60.0797[3,3]~14556182569
70.0503[5,2]~43727869950
80.0313[5,3]~1466054687572
90.0192[6,3]~80283741832660
100.0116[5,3,2]~2565704229718055

For example, if we want to store 200,000 entries in the discard tree (each with 17 bits set) then the optimal structure is 3, 3. The bucket label size for N=52 K=7 L=3 is 220. 8 groups, 4x6 and 4x7, 3*(6C3) + 4*(7C3) = 220.

We have to visit about 0.2962 of the first-level buckets (we can see that on the table above), so our total fixed cost is thus 220 * + 0.2962 * 220 * 220 = 14556. After paying that cost we only have to look at 0.0797 of the original data set, assuming the stored entries are random. So the total number of intersection tests we have to do is 14556 + 0.0797 * 200000 = 30496.

Conclusion

The discard tree was designed to solve a very specific problem: evaluate hand settings in Chinese Poker. It efficiently lets us compare a "test" hand against all the 13- or 17-card hands in a large sample of possible hands and settings. The cost of doing this evaluation is just a small fraction of searching through the entire list, and the structure of the discard tree can be adjusted to any given sample size.

I don't know if the discard tree has other applications. It may not even be the best structure for the job! For example, I did not evaluate memory access patterns or cache behavior, because I assumed that the contents of each leaf was "large enough". If a better label construction is possible, that would improve the discard tree further. It's also possible that an asymmetric or partial radix tree would overcome the limitations described above and provide the same or better performance than the discard tree.

Sort:  

Hello! Your post has been resteemed and upvoted by @ilovecoding because we love coding! Keep up good work! Consider upvoting this comment to support the @ilovecoding and increase your future rewards! ^_^ Steem On!

Reply !stop to disable the comment. Thanks!

Upvoted.

DISCLAIMER: Your post is upvoted based on curation algorithm configured to find good articles e.g. stories, arts, photography, health, etc. This is to reward you (authors) for sharing good content using the Steem platform especially newbies.
If you're a dolphin or whales, and wish not to be included in future selection, please let me know so I can exclude your account. And if you find the upvoted post is inappropriate, FLAG if you must. This will help a better selection of post.

Keep steeming good content.
@Yehey

Posted using https://Steeming.com condenser site.

Coin Marketplace

STEEM 0.35
TRX 0.12
JST 0.039
BTC 70310.43
ETH 3568.34
USDT 1.00
SBD 4.73