* feat(flippingbits): Improve parsing of measurement and few cleanups
* feat(flippingbits): Reduce chunk size to 10MB
* feat(flippingbits): Improve parsing of station names
* chore(flippingbits): Remove obsolete import
* chore(flippingbits): Few cleanups
* Optimize checking for collisions by doing this a long at a time always.
* Use a long at a time scanning for delimiter.
* Minor tuning. Now below 0.80s on Intel i9-13900K.
* Add number parsing code from Quan Anh Mai. Fix name length issue.
* Include suggestion from Alfonso Peterssen for another 1.5%.
* Optimize hash collision check compare for ~4% gain.
* Add perf stats based on latest version.
* isolgpus: fix chunk sizing when not at 8 threads
use as many cores as are available
don't buffer the station name, only use it when we need it.
get rid of the main branch
move variables inside the loop
* isolgpus: optimistically assume we can read a whole int for the station name, but roll back if we get it wrong. This should be very beneficial on a dataset where station names are mostly over 4 chars
---------
Co-authored-by: Jamie Stansfield <jalstansfield@gmail.com>
Runs with standard JDK 21.
On my computers (i5 13500, 20 cores, 32GB ram) my best run is (file fully in page cache):
49.78user 0.69system 0:02.81elapsed 1795%CPU
A bit older version of the code on Mac pro M1 32 GB:
real 0m2.867s
user 0m23.956s
sys 0m1.329s
As I wrote in comments in the code, I have a few different roundings that the reference implementation. I have seend that there is an issue about that, but no specific rule yet.
Main points:
- use MemorySegment, it's faster than ByteBuffer
- split the work in a lot of chunks and distribute to a thread pool
- fast measurement parser by using a lot of domain knowledge
- very low allocation
- visit each byte only once
Things I tried that were in fact pessimizations:
- use some internal JDK code to vectorize the hashCode computation
- use a MemorySegment to represent the keys instead of byte[], to avoid
copying
Hope I won't have a bad surprise when running on the target server 😱
* start
* slower
* still bad
* finally faster than baseline :)
* starting to go fast
* improve
* we ball
* fix race condition an newline
* change threadpool
* ~18sec on my machine
* single thread memory mapped file reader, pool of processors
* cleanup of inner classes of MetricProcessor
* doubles are parsed without external functions, strings are lazily created from byte arrays
* remove load() MappedByteBuffer in memory
* fixed handling of newline
* fix a bug & correct locale used
* MappedByteBuffer size set to 1MB
* fixed rounding
* Do not use ArrayBlockingQueue.offer since it drops elements when queue is full
* MappedByteBuffer size = 32 MB
* Adding kgeri's solution
* parallelizing CalculatorAverage_kgeri
* fixing aggregation bugs, chunk size calc for small files
* removed GC logging
Co-authored-by: Gunnar Morling <gunnar.morling@googlemail.com>
* fix for when there's no newline at end of input
* fix for when the final record ends on the chunk boundary
---------
Co-authored-by: Gunnar Morling <gunnar.morling@googlemail.com>
* improve double reading by eleminating string parsing in between, make calculations over on integer instead of double, parse into double at the end only once
* more improvements, sharing a single StringBuilder to build all toStrings, minor performance gain.
* micro optimizations on reading temperature
* a small skip for redundant traverses, micro optmization
* micro optimization, eleminate some if cases, saves 0.5 seconds more
* micro optimization, calculate key hash ahead eleminates more more loop, saves 0.5 seconds more :)
* optimize key equals and handling the case when a region is larger than max integer size
---------
Co-authored-by: Yavuz Tas <yavuz.tas@ing.com>
* Initial version with multiple ideas
* Added virtual thread implementation based on certain task size
* Removed evaluate file
* Fixed test issues
* Added a custom input split
* first try
* format
* Update calculate_average_imrafaelmerino.sh
Co-authored-by: Gunnar Morling <gunnar.morling@googlemail.com>
* Update src/main/java/dev/morling/onebrc/CalculateAverage_imrafaelmerino.java
Co-authored-by: Gunnar Morling <gunnar.morling@googlemail.com>
---------
Co-authored-by: Rafael Merino García <imrafaelmerino@gmail.com>
Co-authored-by: Gunnar Morling <gunnar.morling@googlemail.com>
* Implementation of 1brc - felix19350
* Added license header
* Fixed failing tests
* Replaced parsing of doubles with a custom parser and integer arithmetic
---------
Co-authored-by: Bruno Felix <bruno.felix@klarna.com>
* Initial version.
* Make PGO feature optional off-by-default. Needs PGO_MODE environment
variable to be set. Add -O3 -march=native tuning flags for better
performance.
* Adjust script to be more quiet.
* Adjust max city length. Fix an issue when accumulating results.
* Tune thomaswue submission.
mmap the entire file, use Unsafe directly instead of ByteBuffer, avoid byte[] copies.
These tricks give a ~30% speedup, over an already fast implementation.
* Optimize parsing of numbers based on specific given constraints.
* Fix for segment calculation for case of very small input.
* Minor shell script fixes.
* Separate out build step into file additional_build_step_thomaswue.sh,
simplify run script and remove PGO option for now.
* Minor corrections to the run script.
---------
Co-authored-by: Alfonso² Peterssen <alfonso.peterssen@oracle.com>
* A solution with Actor Model concurrency and MappedByteBuffer
* fix test cases
* revert back the file name to original
* cache String hashCode calculation via composing with Key object
* fix wrong key caching and eleminate duplicate String creation between actors
* update possible char count in a line, fix calculate_average.sh
* increase possible line length to 256 bytes, much safer to cover 100 chars I hope
---------
Co-authored-by: Yavuz Tas <yavuz.tas@ing.com>
* artpar's attempt
* artpar's attempt
* remove int -> Integer conversions, custom parsing for measurements
* remove allocations by caching station names
also remove Integer and use int instead to remove valueOf calls
* Fix result mismatch errors
* parse int instead of double
* reduce time spent reading the mapped buffer
* cleanup unused memory
* less is faster ? vector addition doesn't look worth it
* backout from virtual threads as well
* Fix breaking tests
Somewhat mixed collection of multiple ideas, mostly based initially
on using the new JDK Vector API for extracting offsets of newlines
and semicolons.
Runs locally in just under 11 seconds on 1B rows of input on a
2020 M1 Macbook Air.
* isolgpus: submission 1
* isolgpus: fix min value bug (breaks if a negative temperature never appears)
* isolgpus: remove unused collector
* isolgpus: fix split on chunk bug
* isolgpus: change name equality algo to a cheaper check.
* isolgpus: fix chunking state to cope with last byte of last chunk
* isolgpus: hash as we go, instead of at the end
* isolgpus: adjust thread count to core count
* isolgpus: change cores to 8 statically
---------
Co-authored-by: Jamie Stansfield <jalstansfield@gmail.com>
* First performance tweaks
* further tweaks
* collect into a treemap
* Tweak JVM options
* Inline rounding into collector
* reduce some operations
* oops, add missing braces
* tweak JVM options
* small fixes
* add min and max to processing
* fix min
* remove compact strings
* replace sumWithCompensation with naive sum implementation
* use UseShenandoahGC
* integrate mmap
* integrate mmap
* Fix messed up array logic
* Set jdk version
* Use Integer calculation instead of double, add unit-test
* Bring back StationIdent optimization
Originally, StationIdent was using byte[] to store names, so the extra
String allocation could be avoided. However, that produced incorrect
sorting.
Sorting is now moved to the result merging step. Here, names are
converted to Strings.
* Implement readStationName with SIMD 256bit
* Rebase and cleanup test code, now that it's in the project
* Fix seijikun formatting
* Fix test failure in specific jobCnt edge-cases
* Also switch to graalvm
In case of key collision broken implementation will likely attribute
measurements to the wrong key and therefore it is better to have
non-zero value to end up with a wrong average value.
When all measurements are zero then averages are also zero even
when attributed to the wrong keys.
Updates #91
* Added tests for endian-calculations (had these in a different class, perhaps handy for others to see as well)
Inlined the hash function, runs locally in 2.4sec now, hopefully endian issues fix
Added equals to support any city name up to 1024 in length, don't rely on hash
* For clarity I've updated the code so endian doesn't change the hashes, easier to debug.
* Fixing bug in array check
Simple is faster
* Also spotted the diff, not just the big exception
Fixed buffer limit issue
Input created via
```sh
bash -c 'for i in {1..10000} ; do echo "id$i;0.0" ; done' >./src/test/resources/samples/measurements-10000-unique-keys.txt
```
and output via baseline implementation.
Keys are short and very similar which improves chances for collision
and hence are good for testing.
Fixes#91
* Use open-addressing scheme to deal with hash table collisions. Reduce concurrency from 16 to 8. Use bit mask rather than mod operator to confine hash code to table range.
* Properly handle file partitions that reside entirely within a line.
* Reorder statements in doProcessBuffer.
Adds test samples that can be used for unit tests or to verify
implementations via:
```bash
for sample in $(ls src/test/resources/samples/*.txt)
do
echo "Validating $sample"
rm -f measurements.txt
ln -s $sample measurements.txt
diff <(./calculate_average.sh) ${sample%.txt}.out
done
rm measurements.txt
```
For #61
Added SWAR (SIMD Within A Register) code to increase bytebuffer processing/throughput
Delaying the creation of the String by comparing hash, segmenting like spullara, improved EOL finding
Co-authored-by: Gunnar Morling <gunnar.morling@googlemail.com>
* Initial implementation using Shenandoah GC and parallel iteration
* Use memory mapped files
* Iterate the buffer once and use BigDecimal parsing instead of Double.parseDouble
* Add information about Graal
* Add sdk use to calculate script
* Memory mapped file, single-pass parsing, custom hash map, fixed thread pool
The threading was a hasty addition and needs work
* Used arraylist instead of treemap to reduce a little overhead
We only need it sorted for output, so only construct a treemap for output
* Attempt to speed up double conversion
* Cap core count for low-core systems
* Fix wrong exponent
* Accumulate measurement value in double, seems marginally faster
Benchmark Mode Cnt Score Error Units
DoubleParsingBenchmark.ourToDouble thrpt 10 569.771 ± 7.065 ops/us
DoubleParsingBenchmark.ourToDoubleAccumulateInToDouble thrpt 10 648.026 ± 7.741 ops/us
DoubleParsingBenchmark.ourToDoubleDivideInsteadOfMultiply thrpt 10 570.412 ± 9.329 ops/us
DoubleParsingBenchmark.ourToDoubleNegative thrpt 10 512.618 ± 8.580 ops/us
DoubleParsingBenchmark.ourToDoubleNegativeAccumulateInToDouble thrpt 10 565.043 ± 18.137 ops/us
DoubleParsingBenchmark.ourToDoubleNegativeDivideInsteadOfMultiply thrpt 10 511.228 ± 13.967 ops/us
DoubleParsingBenchmark.stringToDouble thrpt 10 52.310 ± 1.351 ops/us
DoubleParsingBenchmark.stringToDoubleNegative thrpt 10 50.785 ± 1.252 ops/us