Set Inclusion

Programming often involves problems which require checking whether an element has a particular boolean attribute, e.g., in Pokémon whether a move is eligible to be called via Metronome or whether an item is considered a berry for the purposes of Harvest. Alternatively, these problems can be reframed as ones which ask whether the element belongs to the set of elements that have the attribute, i.e. does the move belong to the set of moves that Metronome can proc?

The two most obvious ways of representing this information in code naturally fall out of how the problem is framed. With first framing, the logical way to represent the data would be to add fields to the structure/object that encodes its other properties:

class Item {
  ...
  berry?: boolean;
}

With the second framing an auxiliary data structure is used to encode the information:

const BERRIES = new Set<Item>(...);

While both representations have merit, the latter is often preferable as the former increases the size of every element regardless of its inclusion in the set, increasing memory usage and wasting space in caches. The second representation is almost always better when only a few elements have a particular attribute (note the opposite is also true, as if most elements have a particular property then the definition of the set can be reversed so that it only stores those elements which don’t exhibit the property).

However, despite this alternative framing of checking for the existence of attributes as being a question of set inclusion, a language’s built-in Set data structure is often not what should be reached for. Set data structures are usually designed to guarantee $O(1)$ or $O(\lg{} n)$ lookup performance, though these classifications are only relevant when the size of the set is large as they ignore constant factors (i.e., they characterize the theoretical asymptotic growth rate, not actual practical run-time). In many real tasks the size of the sets in question are small enough that they can be most efficiently queried when instead stored as an array:

const BERRIES: Item[] = [...];

Despite having $O(n)$ lookup performance, a linear scan of an array can be performed very efficiently by modern CPUs which optimize for this with speculative execution and cache prefetching. The best way to determine whether an array behaves better than a set for any given use case is to benchmark, though as a rule-of-thumb sets with ~10 or less elements are almost always better stored as an array. An array may be preferable for even larger sets if a binary search is used instead of a linear scan — the elements in the set can be sorted at compile time to ensure the preconditions of the binary search are met and to guarantee a worst case of $O(\lg{} n)$ lookups, equivalent to many Set data structures that are backed by a binary tree but with better cache locality.

In many cases there is some small numerical identifier (e.g., a handle) that should be stored in the set or array data structure instead of the whole object or its pointer. If this is the case, it’s sometimes possible to do even better — the value of the identifier itself can be chosen to turn a question of set inclusion into one of simply checking if the value falls within a specific range. The pkmn engine makes use of this optimization wherever possible, leveraging “range checking” to test for set inclusion with a small handful of cmp instructions — e.g., the Item enum in Generation II is carefully sorted so that type-enhancing items, items with effects, berries, items without effects, mail, and items not present in Pokémon Showdown are all possible to distinguish between trivially:

pkmn/engine/src/lib/gen2/data/items.zig#L225-L220

/// Whether or not this item is Mail.
pub fn isMail(item: Item) bool {
    assert(item != .None);
    return @intFromEnum(item) > 60 and @intFromEnum(item) <= 70;
}

Sometimes identifiers can’t be reordered or an ordering that satisfies every constraint is impossible, though this strategy can still be employed in the presence of multiple overlapping sets:

pkmn/engine/src/lib/gen1/data/moves.zig#L1616-L1641

/// Whether this effect is considered to "always happen".
pub fn alwaysHappens(effect: Effect) bool {
    return @intFromEnum(effect) > 31 and @intFromEnum(effect) <= 38;
}

/// Whether this effect is handled specially by the engine.
pub fn isSpecial(effect: Effect) bool {
    // NB: isSpecial includes isMulti up to Twineedle
    return @intFromEnum(effect) > 31 and @intFromEnum(effect) <= 46;
}

/// Whether this effect is a multi-hit effect.
pub fn isMulti(effect: Effect) bool {
    return @intFromEnum(effect) > 44 and @intFromEnum(effect) <= 47;
}

/// Whether this effect is has chance of lowering stats.
pub fn isStatDownChance(effect: Effect) bool {
    return @intFromEnum(effect) > 47 and @intFromEnum(effect) <= 51;
}

/// Whether this effect has a secondary chance.
pub fn isSecondaryChance(effect: Effect) bool {
    // NB: isSecondaryChance includes isStatDownChance as well as Twineedle
    return (@intFromEnum(effect) > 46 and @intFromEnum(effect) <= 61);
}

This trick is a perfect example of how careful consideration upfront of data layout can result in effectively “free” performance wins.