Squashing a bunch of commits together. (#428)
Commit #2: ~7% uplift from reading with ByteBuffer's native byte order. Commit #1: minor formatting changes.

Co-authored-by: vemana <vemana.github@gmail.com>
@@ -41,55 +41,54 @@ import java.util.stream.Collectors;
 * remain readable for a majority of SWEs. At a high level, the approach relies on a few principles
 * listed herein.
 *
 * <p>
 * [Exploit Parallelism] Distribute the work into Shards. Separate threads (one per core) process
 * <p>[Exploit Parallelism] Distribute the work into Shards. Separate threads (one per core) process
 * Shards and follow it up by merging the results. parallelStream() is appealing but carries
 * potential run-time variance (i.e. std. deviation) penalties based on informal testing. Variance
 * is not ideal when trying to minimize the maximum worker latency.
 *
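To make the per-core threading concrete, here is a minimal sketch of the pattern described above, with placeholder task and result types; the commit's actual plumbing lives in Runner.getSummaryStatistics further down in this diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: a fixed pool sized to the reported processor count,
// one task per shard, results collected for an explicit merge step afterwards.
class PerCoreSketch {
    static <T> List<T> runSharded(List<Callable<T>> shardTasks) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Callable<T> task : shardTasks) {
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // a real merge would fold these together
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```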
 * <p>
 * [Use ByteBuffers over MemorySegment] Each Shard is further divided in Chunks. This would've been
 * unnecessary except that Shards are too big to be backed by ByteBuffers. Besides, MemorySegment
 * appears slower than ByteBuffers. So, to use ByteBuffers, we have to use smaller chunks.
 * <p>[Use ByteBuffers over MemorySegment] Each Shard is further divided in Chunks. This would've
 * been unnecessary except that Shards are too big to be backed by ByteBuffers. Besides,
 * MemorySegment appears slower than ByteBuffers. So, to use ByteBuffers, we have to use smaller
 * chunks.
 *
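For reference, backing one chunk with a ByteBuffer boils down to something like the sketch below (a hypothetical helper; the commit's actual mapping code is in setByteBufferToRange later in this diff, which also switches the buffer to native byte order).

```java
import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Path;

// Sketch: map one [start, end) chunk of the file as a read-only buffer.
// A single ByteBuffer cannot exceed Integer.MAX_VALUE bytes, which is why
// Shards are subdivided into Chunks in the first place.
class ChunkMappingSketch {
    static MappedByteBuffer mapChunk(Path file, long start, long end) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            MappedByteBuffer mmb = raf.getChannel().map(MapMode.READ_ONLY, start, end - start);
            mmb.order(ByteOrder.nativeOrder()); // the change this commit makes
            return mmb; // the mapping stays valid after the channel is closed
        }
    }
}
```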
 * <p>
 * [Straggler freedom] The optimization function here is to minimize the maximal worker thread
 * <p>[Straggler freedom] The optimization function here is to minimize the maximal worker thread
 * completion. Law of large numbers averaging means that all the threads will end up with similar
 * amounts of work and similar completion times; but every so often there could be a bad
 * sharding and, more importantly, Cores are not created equal; some will be throttled more than
 * others. So, we have a shared {@code LazyShardQueue} that aims to distribute work to minimize the
 * latest completion time.
 *
 * <p>
 * [Work Assignment with LazyShardQueue] The queue provides each thread with its next big-chunk
 * <p>[Work Assignment with LazyShardQueue] The queue provides each thread with its next big-chunk
 * until X% of the work remains. Big-chunks belong to the thread and will not be provided to another
 * thread.  Then, it switches to providing small-chunk sizes. Small-chunks comprise the last X% of
 * thread. Then, it switches to providing small-chunk sizes. Small-chunks comprise the last X% of
 * work and every thread can participate in completing the chunk. Even though the queue is shared
 * across threads, there's no communication across threads during the big-chunk phases. The queue is
 * effectively a per-thread queue while processing big-chunks. The small-chunk phase uses an
 * AtomicLong to coordinate chunk allocation across threads.
 *
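A rough sketch of that two-phase hand-out, with made-up field names and sizes; the real SerialLazyShardQueue additionally pads its per-thread cursors (the nextStarts array uses 16*idx slots to avoid cache-line conflicts) and adjusts range boundaries to newlines, which is omitted here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Two-phase allocation idea: each thread consumes its own contiguous region in
// big chunks; once exhausted, all threads race on a shared atomic cursor over
// the common tail in small chunks.
class LazyQueueSketch {
    final long[] nextStart;      // per-thread cursor into its private region
    final long[] privateEnd;     // end of each thread's private region
    final AtomicLong commonPool; // shared cursor over the common tail
    final long fileSize, bigChunk, smallChunk;

    LazyQueueSketch(long[] nextStart, long[] privateEnd, long commonPoolStart,
                    long fileSize, long bigChunk, long smallChunk) {
        this.nextStart = nextStart;
        this.privateEnd = privateEnd;
        this.commonPool = new AtomicLong(commonPoolStart);
        this.fileSize = fileSize;
        this.bigChunk = bigChunk;
        this.smallChunk = smallChunk;
    }

    /** Returns [start, end) of the next range for this thread, or null when done. */
    long[] take(int threadIdx) {
        long start = nextStart[threadIdx];
        if (start < privateEnd[threadIdx]) {          // big-chunk phase: no sharing
            nextStart[threadIdx] = start + bigChunk;
            return new long[] { start, Math.min(start + bigChunk, privateEnd[threadIdx]) };
        }
        long s = commonPool.getAndAdd(smallChunk);    // small-chunk phase: shared cursor
        return s < fileSize ? new long[] { s, Math.min(s + smallChunk, fileSize) } : null;
    }
}
```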
 * <p>
 * [Chunk processing] Chunk processing is typical. Process line by line. Find a hash function
 * <p>[Chunk processing] Chunk processing is typical. Process line by line. Find a hash function
 * (polynomial hash fns are slow, but will work fine), hash the city name, resolve conflicts using
 * linear probing and then accumulate the temperature into the appropriate hash slot. The key
 * element then is how fast you can identify the hash slot, read the temperature and update the
 * temperature stats in the slot (i.e. min, max, count).
 *
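A simplified, single-threaded sketch of that per-line update; boxed String keys and parallel int arrays stand in for the real byte[]-based table and are an assumption for illustration only.

```java
// Hash the key, linear-probe to the slot, then accumulate min/max/sum/count.
class ProbeSketch {
    final String[] keys;
    final int[] min, max, sum, count;
    final int slotsMask;

    ProbeSketch(int sizeBits) {
        int n = 1 << sizeBits;
        keys = new String[n];
        min = new int[n];
        max = new int[n];
        sum = new int[n];
        count = new int[n];
        slotsMask = n - 1;
    }

    void accumulate(String city, int temperature) {
        int slot = city.hashCode() & slotsMask;
        while (keys[slot] != null && !keys[slot].equals(city)) {
            slot = (slot + 1) & slotsMask;          // linear probe on collision
        }
        if (keys[slot] == null) {                   // first reading for this city
            keys[slot] = city;
            min[slot] = max[slot] = sum[slot] = temperature;
            count[slot] = 1;
        } else {                                    // fold the new reading in
            min[slot] = Math.min(min[slot], temperature);
            max[slot] = Math.max(max[slot], temperature);
            sum[slot] += temperature;
            count[slot]++;
        }
    }
}
```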
 * <p>
 * [Cache friendliness] 7502P and my machine (7950X) offer 4MB L3 cache/core. This means we can hope
 * to fit all our datastructures in L3 cache. Since SMT is turned on, the Runtime's available
 * <p>[Cache friendliness] 7502P and my machine (7950X) offer 4MB L3 cache/core. This means we can
 * hope to fit all our datastructures in L3 cache. Since SMT is turned on, the Runtime's available
 * processors will show twice the number of actual cores and so we get 2MB L3 cache/thread. To be
 * safe, we try to stay within 1.8 MB/thread and size our hashtable appropriately.
 *
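A back-of-the-envelope sizing check along those lines; the 64-bytes-per-slot figure below is an assumption for illustration, not the real slot layout.

```java
// Pick the largest power-of-two slot count whose table stays under a
// 1.8 MB per-thread budget, assuming roughly 64 bytes per slot.
class SizingSketch {
    public static void main(String[] args) {
        long budgetBytes = (long) (1.8 * 1024 * 1024);
        long bytesPerSlot = 64; // assumed: key bytes/reference + min/max/sum/count
        int sizeBits = 0;
        while ((1L << (sizeBits + 1)) * bytesPerSlot <= budgetBytes) {
            sizeBits++;
        }
        System.out.println("hashtableSizeBits = " + sizeBits
                + " (" + ((1L << sizeBits) * bytesPerSlot) + " bytes)");
    }
}
```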
 * <p>
 * [Allocation] Since MemorySegment seemed slower than ByteBuffers, backing Chunks by bytebuffers
 * <p>[Native ByteOrder is MUCH better] There was almost a 10% lift by reading ints from bytebuffers
 * using native byteorder. It so happens that both the eval machine (7502P) and my machine (7950X)
 * use native LITTLE_ENDIAN order, which again apparently is because X86[-64] is little-endian. But,
 * by default, ByteBuffers use BIG_ENDIAN order, which appears to be a somewhat strange default from
 * Java.
 *
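The byte-order point in isolation: a ByteBuffer defaults to BIG_ENDIAN regardless of platform, so getInt() on an x86-64 host generally implies a byte swap unless the order is switched to native. A small standalone demo with arbitrary values:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Default vs. native byte order when reading the same four bytes as an int.
class ByteOrderSketch {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] { 1, 2, 3, 4 });
        System.out.println(buf.order());              // BIG_ENDIAN, Java's default
        int big = buf.getInt(0);                      // 0x01020304

        buf.order(ByteOrder.nativeOrder());           // LITTLE_ENDIAN on x86-64
        int nat = buf.getInt(0);                      // 0x04030201 on little-endian hosts
        System.out.printf("big=%08x native=%08x%n", big, nat);
    }
}
```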
 * <p>[Allocation] Since MemorySegment seemed slower than ByteBuffers, backing Chunks by bytebuffers
 * was the logical option. Creating one ByteBuffer per chunk was no bueno because the system doesn't
 * like it (JVM runs out of mapped file handle quota). Other than that, allocation in the hot path
 * was avoided.
 *
 * <p>
 * [General approach to fast hashing and temperature reading] Here, it helps to understand the
 * <p>[General approach to fast hashing and temperature reading] Here, it helps to understand the
 * various bottlenecks in execution. One particular thing that I kept coming back to was to
 * understand the relative costs of instructions; see
 * https://www.agner.org/optimize/instruction_tables.pdf. It is helpful to think of hardware as a
@@ -102,24 +101,22 @@ import java.util.stream.Collectors;
 * endPos" in a tight loop by breaking it into two pieces: one piece where the check will not be
 * needed and a tail piece where it will be needed.
 *
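A generic illustration of that loop-splitting idea; summing an array stands in for the real per-line loop and is not the author's code.

```java
// Split a hot loop so the main part runs a fixed stride with no per-element
// bounds branch, leaving the boundary handling to a short tail loop.
class LoopSplitSketch {
    static long sum(int[] a) {
        long total = 0;
        int i = 0;
        int mainEnd = a.length - (a.length % 4);
        for (; i < mainEnd; i += 4) {              // main piece: check-free, predictable
            total += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        }
        for (; i < a.length; i++) {                // tail piece: handles the remainder
            total += a[i];
        }
        return total;
    }
}
```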
 * <p>
 * [Understand What Cores like]. Cores like to go straight and loop back. Despite good branch
 * <p>[Understand What Cores like]. Cores like to go straight and loop back. Despite good branch
 * prediction, performance sucks with mispredicted branches.
 *
 * <p>
 * [JIT] Java performance requires understanding the JIT. It is helpful to understand what the JIT
 * likes though it is still somewhat of a mystery to me. In general, it inlines small methods very
 * well and after constant folding, it can optimize quite well across a reasonably deep call chain.
 * My experience with the JIT was that everything I tried to tune it made it worse except for one
 * parameter. I have a new-found respect for JIT - it likes and understands typical Java idioms.
 * <p>[JIT] Java performance requires understanding the JIT. It is helpful to understand what the
 * JIT likes though it is still somewhat of a mystery to me. In general, it inlines small methods
 * very well and after constant folding, it can optimize quite well across a reasonably deep call
 * chain. My experience with the JIT was that everything I tried to tune it made it worse except for
 * one parameter. I have a new-found respect for JIT - it likes and understands typical Java idioms.
 *
 * <p>[Tuning] Nothing was more insightful than actually playing with various tuning parameters.
 * I can have all the theories but the hardware and JIT are giant blackboxes. I used a bunch of
 * tools to optimize: (1) Command line parameters to tune big and small chunk sizes etc. This was
 * also very helpful in forming a mental model of the JIT. Sometimes, it would compile some methods
 * and sometimes it would just run them interpreted since the compilation threshold wouldn't be
 * reached for intermediate methods. (2) AsyncProfiler - this was the first line tool to understand
 * cache misses and cpu time to figure where to aim the next optimization effort. (3) JitWatch -
 * <p>[Tuning] Nothing was more insightful than actually playing with various tuning parameters. I
 * can have all the theories but the hardware and JIT are giant blackboxes. I used a bunch of tools
 * to optimize: (1) Command line parameters to tune big and small chunk sizes etc. This was also
 * very helpful in forming a mental model of the JIT. Sometimes, it would compile some methods and
 * sometimes it would just run them interpreted since the compilation threshold wouldn't be reached
 * for intermediate methods. (2) AsyncProfiler - this was the first line tool to understand cache
 * misses and cpu time to figure where to aim the next optimization effort. (3) JitWatch -
 * invaluable for forming a mental model and attempting to tune the JIT.
 *
 * <p>[Things that didn't work]. This is a looong list and the hit rate is quite low. In general,
@@ -140,12 +137,6 @@ import java.util.stream.Collectors;
 */
public class CalculateAverage_vemana {

    public static void checkArg(boolean condition) {
        if (!condition) {
            throw new IllegalArgumentException();
        }
    }

    public static void main(String[] args) throws Exception {
        // First process in large chunks without coordination among threads
        // Use chunkSizeBits for the large-chunk size
@@ -184,18 +175,26 @@ public class CalculateAverage_vemana {
        // - hashtableSizeBits = \{hashtableSizeBits}
        // """);

        System.out.println(new Runner(
                Path.of("measurements.txt"),
                chunkSizeBits,
                commonChunkFraction,
                commonChunkSizeBits,
                hashtableSizeBits).getSummaryStatistics());
        System.out.println(
                new Runner(
                        Path.of("measurements.txt"),
                        chunkSizeBits,
                        commonChunkFraction,
                        commonChunkSizeBits,
                        hashtableSizeBits)
                                .getSummaryStatistics());
    }

    public interface LazyShardQueue {
  public record AggregateResult(Map<String, Stat> tempStats) {

        ByteRange take(int shardIdx);
    @Override
    public String toString() {
      return this.tempStats().entrySet().stream()
          .sorted(Entry.comparingByKey())
          .map(entry -> "%s=%s".formatted(entry.getKey(), entry.getValue()))
          .collect(Collectors.joining(", ", "{", "}"));
    }
  }

    // Mutable to avoid allocation
    public static class ByteRange {
@@ -267,11 +266,11 @@ public class CalculateAverage_vemana {
    @Override
    public String toString() {
      return STR."""
          ByteRange {
            startInBuf = \{startInBuf}
            endInBuf = \{endInBuf}
          }
          """;
        ByteRange {
          startInBuf = \{startInBuf}
          endInBuf = \{endInBuf}
        }
        """;
    }

        private long nextNewLine(long pos) {
@@ -285,6 +284,7 @@ public class CalculateAverage_vemana {
        private void setByteBufferToRange(long start, long end) {
            try {
                byteBuffer = raf.getChannel().map(MapMode.READ_ONLY, start, end - start);
                byteBuffer.order(ByteOrder.nativeOrder());
            }
            catch (IOException e) {
                throw new RuntimeException(e);
@@ -292,18 +292,22 @@ public class CalculateAverage_vemana {
        }
    }

  public record Result(Map<String, Stat> tempStats) {
    public static final class Checks {

    @Override
    public String toString() {
      return this.tempStats()
                 .entrySet()
                 .stream()
                 .sorted(Entry.comparingByKey())
                 .map(entry -> "%s=%s".formatted(entry.getKey(), entry.getValue()))
                 .collect(Collectors.joining(", ", "{", "}"));
        public static void checkArg(boolean condition) {
            if (!condition) {
                throw new IllegalArgumentException();
            }
        }

        private Checks() {
        }
    }

    public interface LazyShardQueue {

        ByteRange take(int shardIdx);
    }
  }

    public static class Runner {

@@ -314,7 +318,10 @@ public class CalculateAverage_vemana {
        private final int shardSizeBits;

        public Runner(
                      Path inputFile, int chunkSizeBits, double commonChunkFraction, int commonChunkSizeBits,
                      Path inputFile,
                      int chunkSizeBits,
                      double commonChunkFraction,
                      int commonChunkSizeBits,
                      int hashtableSizeBits) {
            this.inputFile = inputFile;
            this.shardSizeBits = chunkSizeBits;
@@ -323,16 +330,12 @@ public class CalculateAverage_vemana {
            this.hashtableSizeBits = hashtableSizeBits;
        }

        Result getSummaryStatistics() throws Exception {
        AggregateResult getSummaryStatistics() throws Exception {
            int processors = Runtime.getRuntime().availableProcessors();
            LazyShardQueue shardQueue = new SerialLazyShardQueue(
                    1L << shardSizeBits,
                    inputFile,
                    processors,
                    commonChunkFraction,
                    commonChunkSizeBits);
                    1L << shardSizeBits, inputFile, processors, commonChunkFraction, commonChunkSizeBits);

            List<Future<Result>> results = new ArrayList<>();
            List<Future<AggregateResult>> results = new ArrayList<>();
            ExecutorService executorService = Executors.newFixedThreadPool(
                    processors,
                    runnable -> {
@@ -345,8 +348,8 @@ public class CalculateAverage_vemana {

            for (int i = 0; i < processors; i++) {
                final int I = i;
                final Callable<Result> callable = () -> {
                    Result result = new ShardProcessor(shardQueue, hashtableSizeBits, I).processShard();
                final Callable<AggregateResult> callable = () -> {
                    AggregateResult result = new ShardProcessor(shardQueue, hashtableSizeBits, I).processShard();
                    finishTimes[I] = System.nanoTime();
                    return result;
                };
@@ -356,7 +359,7 @@ public class CalculateAverage_vemana {
            return executorService.submit(() -> merge(results)).get();
        }

        private Result merge(List<Future<Result>> results)
        private AggregateResult merge(List<Future<AggregateResult>> results)
                throws ExecutionException, InterruptedException {
            Map<String, Stat> output = null;
            boolean[] isDone = new boolean[results.size()];
@@ -374,20 +377,20 @@ public class CalculateAverage_vemana {
                            for (Entry<String, Stat> entry : results.get(i).get().tempStats().entrySet()) {
                                output.compute(
                                        entry.getKey(),
                                        (key, value) -> value == null ? entry.getValue()
                                                : Stat.merge(value, entry.getValue()));
                                        (key, value) -> value == null ? entry.getValue() : Stat.merge(value, entry.getValue()));
                            }
                        }
                    }
                }
            }
            return new Result(output);
            return new AggregateResult(output);
        }

    private void printFinishTimes(long[] finishTimes) {
      Arrays.sort(finishTimes);
      int n = finishTimes.length;
      System.err.println(STR."Finish Delta: \{(finishTimes[n - 1] - finishTimes[0]) / 1_000_000}ms");
      System.err.println(
          STR."Finish Delta: \{(finishTimes[n - 1] - finishTimes[0]) / 1_000_000}ms");
    }
    }

@@ -405,23 +408,29 @@ public class CalculateAverage_vemana {
        private final long[] nextStarts;

        public SerialLazyShardQueue(
                                    long chunkSize, Path filePath, int shards, double commonChunkFraction,
                                    long chunkSize,
                                    Path filePath,
                                    int shards,
                                    double commonChunkFraction,
                                    int commonChunkSizeBits)
                throws IOException {
            checkArg(commonChunkFraction < 0.9 && commonChunkFraction >= 0);
            Checks.checkArg(commonChunkFraction < 0.9 && commonChunkFraction >= 0);
            var raf = new RandomAccessFile(filePath.toFile(), "r");
            this.fileSize = raf.length();

            // Common pool
            long commonPoolStart = Math.min(
                    roundToNearestHigherMultipleOf(chunkSize, (long) (fileSize * (1 - commonChunkFraction))),
                    roundToNearestHigherMultipleOf(
                            chunkSize, (long) (fileSize * (1 - commonChunkFraction))),
                    fileSize);
            this.commonPool = new AtomicLong(commonPoolStart);
            this.commonChunkSize = 1L << commonChunkSizeBits;

            // Distribute chunks to shards
            this.nextStarts = new long[shards << 4]; // thread idx -> 16*idx to avoid cache line conflict
            for (long i = 0, currentStart = 0, remainingChunks = (commonPoolStart + chunkSize - 1) / chunkSize; i < shards; i++) {
            for (long i = 0,
                    currentStart = 0,
                    remainingChunks = (commonPoolStart + chunkSize - 1) / chunkSize; i < shards; i++) {
                long remainingShards = shards - i;
                long currentChunks = (remainingChunks + remainingShards - 1) / remainingShards;
                // Shard i handles: [currentStart, currentStart + currentChunks * chunkSize)
@@ -479,7 +488,7 @@ public class CalculateAverage_vemana {
            this.state = new ShardProcessorState(hashtableSizeBits);
        }

        public Result processShard() {
        public AggregateResult processShard() {
            ByteRange range;
            while ((range = shardQueue.take(threadIdx)) != null) {
                processRange(range);
@@ -497,7 +506,7 @@ public class CalculateAverage_vemana {
            }
        }

        private Result result() {
        private AggregateResult result() {
            return state.result();
        }
    }
@@ -527,30 +536,30 @@ public class CalculateAverage_vemana {
                    x = Integer.reverseBytes(x);
                }

                byte a = (byte) (x >>> 24);
                byte a = (byte) (x >>> 0);
                if (a == ';') {
                    nextPos += 1;
                    break;
                }

                byte b = (byte) (x >>> 16);
                byte b = (byte) (x >>> 8);
                if (b == ';') {
                    nextPos += 2;
                    hash = hash * 31 + ((0xFF000000 & x));
                    hash = hash * 31 + ((0xFF & x));
                    break;
                }

                byte c = (byte) (x >>> 8);
                byte c = (byte) (x >>> 16);
                if (c == ';') {
                    nextPos += 3;
                    hash = hash * 31 + ((0xFFFF0000 & x));
                    hash = hash * 31 + ((0xFFFF & x));
                    break;
                }

                byte d = (byte) (x >>> 0);
                byte d = (byte) (x >>> 24);
                if (d == ';') {
                    nextPos += 4;
                    hash = hash * 31 + ((0xFFFFFF00 & x));
                    hash = hash * 31 + ((0xFFFFFF & x));
                    break;
                }

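For context on the shift changes in this hunk: once the buffer reads ints in native little-endian order, the byte at the lowest file offset lands in the low-order bits of the int, so the extraction shifts flip from 24/16/8/0 to 0/8/16/24 and the partial-name hash masks keep the low bytes instead of the high ones. A standalone illustration with made-up values:

```java
// With little-endian reads, byte k of the 4-byte word sits at bits [8k, 8k+8).
class LittleEndianSketch {
    public static void main(String[] args) {
        int x = 0x3B414243;            // suppose the file bytes were 'C' 'B' 'A' ';'
        byte b0 = (byte) (x >>> 0);    // first byte in the file: 'C' (0x43)
        byte b1 = (byte) (x >>> 8);    // 'B'
        byte b2 = (byte) (x >>> 16);   // 'A'
        byte b3 = (byte) (x >>> 24);   // ';' (0x3B) terminates the city name
        System.out.printf("%c %c %c %c%n", (char) b0, (char) b1, (char) b2, (char) b3);
    }
}
```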
@@ -582,16 +591,12 @@ public class CalculateAverage_vemana {
            }

            linearProbe(
                    cityLen,
                    hash & slotsMask,
                    negative ? -temperature : temperature,
                    mmb,
                    originalPos);
                    cityLen, hash & slotsMask, negative ? -temperature : temperature, mmb, originalPos);

            return nextPos;
        }

        public Result result() {
        public AggregateResult result() {
            int N = stats.length;
            TreeMap<String, Stat> map = new TreeMap<>();
            for (int i = 0; i < N; i++) {
@@ -599,7 +604,7 @@ public class CalculateAverage_vemana {
                    map.put(new String(cityNames[i]), stats[i]);
                }
            }
            return new Result(map);
            return new AggregateResult(map);
        }

        private byte[] copyFrom(MappedByteBuffer mmb, int offsetInMmb, int len) {
@@ -642,6 +647,7 @@ public class CalculateAverage_vemana {
        }
    }

    /** Represents aggregate stats. */
    public static class Stat {

        public static Stat firstReading(int temp) {