diff --git a/src/main/java/dev/morling/onebrc/CalculateAverage_vemana.java b/src/main/java/dev/morling/onebrc/CalculateAverage_vemana.java
index 7673fb5..d4f0a2f 100644
--- a/src/main/java/dev/morling/onebrc/CalculateAverage_vemana.java
+++ b/src/main/java/dev/morling/onebrc/CalculateAverage_vemana.java
@@ -41,55 +41,54 @@ import java.util.stream.Collectors;
 * remain readable for a majority of SWEs. At a high level, the approach relies on a few principles
 * listed herein.
 *
- *
- * [Exploit Parallelism] Distribute the work into Shards. Separate threads (one per core) process + *
[Exploit Parallelism] Distribute the work into Shards. Separate threads (one per core) process * Shards and follow it up by merging the results. parallelStream() is appealing but carries * potential run-time variance (i.e. std. deviation) penalties based on informal testing. Variance * is not ideal when trying to minimize the maximum worker latency. * - *
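As a rough, self-contained illustration of this shard-then-merge shape (not the actual classes in this file; the per-shard work below is a stand-in loop and the worker count simply uses the Runtime's available processors):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch only: split a range of work into one shard per worker, process shards on
    // separate threads, then merge the partial results on the calling thread.
    public class ShardMergeSketch {
        public static void main(String[] args) throws Exception {
            long n = 100_000_000L;
            int workers = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            List<Future<Long>> partials = new ArrayList<>();
            long shard = (n + workers - 1) / workers;
            for (int w = 0; w < workers; w++) {
                long from = w * shard;
                long to = Math.min(n, from + shard);
                partials.add(pool.submit(() -> {
                    long sum = 0;
                    for (long i = from; i < to; i++) {
                        sum += i; // stand-in for real per-shard processing
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> f : partials) {
                total += f.get(); // merge step
            }
            pool.shutdown();
            System.out.println(total);
        }
    }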
- * [Use ByteBuffers over MemorySegment] Each Shard is further divided in Chunks. This would've been - * unnecessary except that Shards are too big to be backed by ByteBuffers. Besides, MemorySegment - * appears slower than ByteBuffers. So, to use ByteBuffers, we have to use smaller chunks. + *
[Use ByteBuffers over MemorySegment] Each Shard is further divided in Chunks. This would've + * been unnecessary except that Shards are too big to be backed by ByteBuffers. Besides, + * MemorySegment appears slower than ByteBuffers. So, to use ByteBuffers, we have to use smaller + * chunks. * - *
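One reason chunks are needed at all: a single ByteBuffer mapping tops out at Integer.MAX_VALUE bytes, so a multi-GB shard cannot be one buffer. A hedged sketch of mapping one chunk (the chunk size here is arbitrary, not the one used by this class):

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Sketch: map one [start, start + length) slice of the input file as a chunk.
    // A single mapping cannot exceed Integer.MAX_VALUE bytes, which is why shards
    // are carved into ByteBuffer-sized chunks.
    public class ChunkMapSketch {
        static MappedByteBuffer mapChunk(FileChannel ch, long start, long length) throws IOException {
            return ch.map(FileChannel.MapMode.READ_ONLY, start, length);
        }

        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Path.of("measurements.txt"), StandardOpenOption.READ)) {
                long chunkSize = 1L << 20; // illustrative size; real chunk sizes are tuned via chunkSizeBits
                MappedByteBuffer chunk = mapChunk(ch, 0, Math.min(chunkSize, ch.size()));
                System.out.println("mapped " + chunk.limit() + " bytes");
            }
        }
    }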
- * [Straggler freedom] The optimization function here is to minimize the maximal worker thread + *
[Straggler freedom] The optimization function here is to minimize the maximal worker thread * completion time. The law of large numbers means that all the threads will end up with similar * amounts of work and similar completion times; however, every so often there could be a bad * sharding and, more importantly, cores are not created equal; some will be throttled more than * others. So, we have a shared {@code LazyShardQueue} that aims to distribute work to minimize the * latest completion time. * - *
- * [Work Assignment with LazyShardQueue] The queue provides each thread with its next big-chunk + *
[Work Assignment with LazyShardQueue] The queue provides each thread with its next big-chunk * until X% of the work remains. Big-chunks belong to the thread and will not be provided to another - * thread. Then, it switches to providing small-chunk sizes. Small-chunks comprise the last X% of + * thread. Then, it switches to providing small-chunk sizes. Small-chunks comprise the last X% of * work and every thread can participate in completing the chunk. Even though the queue is shared * across threads, there's no communication across threads during the big-chunk phase. The queue is * effectively a per-thread queue while processing big-chunks. The small-chunk phase uses an * AtomicLong to coordinate chunk allocation across threads. * - *
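A simplified illustration of that two-phase hand-out (this is not the LazyShardQueue implementation in this file; real chunk boundaries also have to be extended to the next newline, which is omitted here):

    import java.util.concurrent.atomic.AtomicLong;

    // Simplified illustration of the two-phase assignment described above: each thread
    // first consumes big chunks from its own pre-assigned region with no cross-thread
    // coordination, then all threads share an AtomicLong cursor over the common tail.
    public class TwoPhaseQueueSketch {
        private final long[] ownNext;        // per-thread cursor into its own region
        private final long[] ownEnd;         // end of each thread's own region
        private final long bigChunk, smallChunk, fileEnd;
        private final AtomicLong commonNext; // shared cursor for the tail region

        TwoPhaseQueueSketch(long fileSize, int threads, long bigChunk, long smallChunk, double commonFraction) {
            this.bigChunk = bigChunk;
            this.smallChunk = smallChunk;
            this.fileEnd = fileSize;
            long commonStart = (long) (fileSize * (1 - commonFraction));
            this.commonNext = new AtomicLong(commonStart);
            this.ownNext = new long[threads];
            this.ownEnd = new long[threads];
            long perThread = commonStart / threads;
            for (int t = 0; t < threads; t++) {
                ownNext[t] = t * perThread;
                ownEnd[t] = (t == threads - 1) ? commonStart : (t + 1) * perThread;
            }
        }

        /** Returns the next [start, end) range for this thread, or null when all work is done. */
        long[] next(int thread) {
            long start = ownNext[thread];
            if (start < ownEnd[thread]) {              // big-chunk phase: thread-private
                long end = Math.min(ownEnd[thread], start + bigChunk);
                ownNext[thread] = end;
                return new long[]{ start, end };
            }
            long s = commonNext.getAndAdd(smallChunk); // small-chunk phase: shared cursor
            return s < fileEnd ? new long[]{ s, Math.min(fileEnd, s + smallChunk) } : null;
        }
    }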
- * [Chunk processing] Chunk processing is typical. Process line by line. Find a hash function + *
[Chunk processing] Chunk processing is typical. Process line by line. Find a hash function * (polynomial hash fns are slow, but will work fine), hash the city name, resolve conflicts using * linear probing and then accumulate the temperature into the appropriate hash slot. The key * question then is how fast you can identify the hash slot, read the temperature and fold it into * the slot's running statistics (i.e. min, max, count). * - *
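A compact sketch of that slot lookup and update, using open addressing over parallel arrays (an illustrative layout; the table in this class, its hash function and its temperature encoding may differ):

    import java.util.Arrays;

    // Illustrative open-addressing table: hash the city bytes, probe linearly for a
    // matching (or empty) slot, then fold the temperature into that slot's stats.
    public class ProbeSketch {
        static final int SIZE_BITS = 14;                 // 2^14 slots, illustrative only
        static final int MASK = (1 << SIZE_BITS) - 1;
        final byte[][] keys = new byte[1 << SIZE_BITS][];
        final int[] min = new int[1 << SIZE_BITS];
        final int[] max = new int[1 << SIZE_BITS];
        final long[] sum = new long[1 << SIZE_BITS];
        final int[] count = new int[1 << SIZE_BITS];

        ProbeSketch() {
            Arrays.fill(min, Integer.MAX_VALUE);
            Arrays.fill(max, Integer.MIN_VALUE);
        }

        void add(byte[] city, int hash, int tempTimesTen) {  // temperature in tenths of a degree
            int slot = hash & MASK;
            while (keys[slot] != null && !Arrays.equals(keys[slot], city)) {
                slot = (slot + 1) & MASK;                     // linear probing on collision
            }
            if (keys[slot] == null) keys[slot] = city;
            min[slot] = Math.min(min[slot], tempTimesTen);
            max[slot] = Math.max(max[slot], tempTimesTen);
            sum[slot] += tempTimesTen;
            count[slot]++;
        }
    }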
- * [Cache friendliness] 7502P and my machine (7950X) offer 4MB L3 cache/core. This means we can hope - * to fit all our datastructures in L3 cache. Since SMT is turned on, the Runtime's available + *
[Cache friendliness] The eval machine (7502P) and my machine (7950X) offer 4MB L3 cache/core. This means we can + * hope to fit all our data structures in L3 cache. Since SMT is turned on, the Runtime's available * processors will report twice the number of actual cores and so we get 2MB L3 cache/thread. To be * safe, we try to stay within 1.8 MB/thread and size our hashtable appropriately. * - *
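To make the sizing concrete under an assumed (not actual) per-slot footprint of about 32 bytes: 1.8 MB divided by 32 B is roughly 59k slots, which caps the table at 2^15 entries. A tiny sketch of that arithmetic:

    public class TableSizingSketch {
        public static void main(String[] args) {
            // Assumed layout: ~32 bytes of statistics per slot (not the actual layout here).
            long budgetBytes = (long) (1.8 * 1024 * 1024);            // stay under 1.8 MB per thread
            int bytesPerSlot = 32;
            long maxSlots = budgetBytes / bytesPerSlot;               // ~58,982 slots
            int sizeBits = 63 - Long.numberOfLeadingZeros(maxSlots);  // floor(log2) = 15
            System.out.println("hashtableSizeBits <= " + sizeBits);
        }
    }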
- * [Allocation] Since MemorySegment seemed slower than ByteBuffers, backing Chunks by bytebuffers + *
[Native ByteOrder is MUCH better] There was almost a 10% lift from reading ints from ByteBuffers + * using the native byte order. It so happens that both the eval machine (7502P) and my machine (7950X) + * use LITTLE_ENDIAN natively, apparently because x86[-64] is little-endian. But, + * by default, ByteBuffers use BIG_ENDIAN order, which seems a somewhat strange default for + * Java. + * + *
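The change itself is one call on the buffer; ByteOrder.nativeOrder() resolves to LITTLE_ENDIAN on x86-64, whereas a fresh ByteBuffer defaults to BIG_ENDIAN. A minimal demonstration (the same order(...) call applies to the mapped chunk buffers):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // ByteBuffers default to BIG_ENDIAN; switching to the platform's native order
    // avoids byte swapping on every getInt/getLong on little-endian x86-64.
    public class NativeOrderSketch {
        public static void main(String[] args) {
            ByteBuffer buf = ByteBuffer.allocate(16);
            System.out.println("default order: " + buf.order());   // BIG_ENDIAN
            buf.order(ByteOrder.nativeOrder());
            System.out.println("native order:  " + buf.order());   // LITTLE_ENDIAN on x86-64
            buf.putInt(0, 1);
            System.out.println("int read back: " + buf.getInt(0)); // byte layout follows the set order
        }
    }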
[Allocation] Since MemorySegment seemed slower than ByteBuffers, backing Chunks by bytebuffers * was the logical option. Creating one ByteBuffer per chunk was no bueno because the system doesn't * like it (JVM runs out of mapped file handle quota). Other than that, allocation in the hot path * was avoided. * - *
- * [General approach to fast hashing and temperature reading] Here, it helps to understand the + *
[General approach to fast hashing and temperature reading] Here, it helps to understand the * various bottlenecks in execution. One particular thing that I kept coming back to was to * understand the relative costs of instructions (see * https://www.agner.org/optimize/instruction_tables.pdf). It is helpful to think of hardware as a
@@ -102,24 +101,22 @@ import java.util.stream.Collectors;
* endPos" in a tight loop by breaking it into two pieces: one piece where the check will not be * needed and a tail piece where it will be needed. * - *
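One way to read the "two pieces" remark: keep the hot loop free of the per-iteration bounds test by running it only while a whole word is guaranteed to fit, and finish the leftovers in a checked tail. A generic illustration (not the actual parsing loop in this class):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Generic illustration of splitting a bounds-checked loop: the main loop consumes
    // 4 bytes at a time and never tests "pos < endPos" per byte; the tail handles the
    // remaining 0-3 bytes with the check.
    public class SplitLoopSketch {
        static long sumBytes(ByteBuffer buf, int pos, int endPos) {
            long total = 0;
            int wordEnd = endPos - 3;                  // last position where getInt(pos) stays in range
            for (; pos < wordEnd; pos += 4) {
                int w = buf.getInt(pos);               // fast path: no per-byte range test in our code
                total += (w & 0xFF) + ((w >>> 8) & 0xFF) + ((w >>> 16) & 0xFF) + (w >>> 24);
            }
            for (; pos < endPos; pos++) {              // checked tail piece
                total += buf.get(pos) & 0xFF;
            }
            return total;
        }

        public static void main(String[] args) {
            ByteBuffer buf = ByteBuffer.wrap("Hamburg;12.0\n".getBytes()).order(ByteOrder.nativeOrder());
            System.out.println(sumBytes(buf, 0, buf.limit()));
        }
    }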
- * [Understand What Cores like]. Cores like to go straight and loop back. Despite good branch + *
[Understand What Cores like]. Cores like to go straight and loop back. Despite good branch * prediction, performance sucks with mispredicted branches. * - *
- * [JIT] Java performance requires understanding the JIT. It is helpful to understand what the JIT - * likes though it is still somewhat of a mystery to me. In general, it inlines small methods very - * well and after constant folding, it can optimize quite well across a reasonably deep call chain. - * My experience with the JIT was that everything I tried to tune it made it worse except for one - * parameter. I have a new-found respect for JIT - it likes and understands typical Java idioms. + *
[JIT] Java performance requires understanding the JIT. It is helpful to understand what the + * JIT likes though it is still somewhat of a mystery to me. In general, it inlines small methods + * very well and after constant folding, it can optimize quite well across a reasonably deep call + * chain. My experience with the JIT was that everything I tried to tune it made it worse except for + * one parameter. I have a new-found respect for JIT - it likes and understands typical Java idioms. * - *
[Tuning] Nothing was more insightful than actually playing with various tuning parameters. - * I can have all the theories but the hardware and JIT are giant blackboxes. I used a bunch of - * tools to optimize: (1) Command line parameters to tune big and small chunk sizes etc. This was - * also very helpful in forming a mental model of the JIT. Sometimes, it would compile some methods - * and sometimes it would just run them interpreted since the compilation threshold wouldn't be - * reached for intermediate methods. (2) AsyncProfiler - this was the first line tool to understand - * cache misses and cpu time to figure where to aim the next optimization effort. (3) JitWatch - + *
[Tuning] Nothing was more insightful than actually playing with various tuning parameters. I + * can have all the theories but the hardware and JIT are giant black boxes. I used a bunch of tools + * to optimize: (1) Command-line parameters to tune big and small chunk sizes etc. This was also + * very helpful in forming a mental model of the JIT. Sometimes, it would compile some methods and + * sometimes it would just run them interpreted since the compilation threshold wouldn't be reached + * for intermediate methods. (2) AsyncProfiler - this was the first-line tool to understand cache + * misses and CPU time to figure out where to aim the next optimization effort. (3) JitWatch - * invaluable for forming a mental model and attempting to tune the JIT. * *
[Things that didn't work]. This is a looong list and the hit rate is quite low. In general,
@@ -140,12 +137,6 @@ import java.util.stream.Collectors;
*/
public class CalculateAverage_vemana {
- public static void checkArg(boolean condition) {
- if (!condition) {
- throw new IllegalArgumentException();
- }
- }
-
public static void main(String[] args) throws Exception {
// First process in large chunks without coordination among threads
// Use chunkSizeBits for the large-chunk size
@@ -184,18 +175,26 @@ public class CalculateAverage_vemana {
// - hashtableSizeBits = \{hashtableSizeBits}
// """);
- System.out.println(new Runner(
- Path.of("measurements.txt"),
- chunkSizeBits,
- commonChunkFraction,
- commonChunkSizeBits,
- hashtableSizeBits).getSummaryStatistics());
+ System.out.println(
+ new Runner(
+ Path.of("measurements.txt"),
+ chunkSizeBits,
+ commonChunkFraction,
+ commonChunkSizeBits,
+ hashtableSizeBits)
+ .getSummaryStatistics());
}
- public interface LazyShardQueue {
+ public record AggregateResult(Map