HN Gopher Feed (2017-06-27) - page 1 of 10
___________________________________________________________________
Java and SIMD
75 points by mmastrac
http://prestodb.rocks/code/simd/
___________________________________________________________________
ldargin - 2 hours ago
I asked James Gosling about SIMD (MMX and SSE) back at the 2000
JavaOne conference. He said the answer was method calls, and that
the compiler can use whatever instructions it likes. (I submitted
the question online, and he answered it on stage, and I later saw
the recording; I didn't attend, nor meet him in person. He seemed a
bit annoyed that the question was too simple.)
marmaduke - 1 hour ago
Why not use OpenCL in some form, either raw or via Aparapi?
stusmall - 2 minutes ago
Crossing the JNI barrier is expensive performance-wise. Unless
it's a longer-running, CPU-heavy calculation, it's usually best to
stay in pure Java. There are some workloads where going out to
another high-performance framework will speed things up, but the
bar is a bit higher because of the JNI overhead.
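As a rough illustration (not from the thread): the kind of pure-Java
loop below is what HotSpot's C2 superword pass can often
auto-vectorize on its own, which is why a JNI round-trip rarely pays
off for short operations like this.

    // A simple counted loop over primitive arrays: straight-line,
    // unit-stride arithmetic with no calls in the body, which is
    // the shape HotSpot's auto-vectorizer handles best.
    static void add(float[] a, float[] b, float[] dst) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = a[i] + b[i];
        }
    }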
alkonaut - 1 hour ago
Anyone know how RyuJIT compares to Java 8 and Java 9 in a comparison
like this?
sargun - 1 hour ago
Be very careful about this in a shared environment. AVX512 slows
down the CPU cores because of thermal and voltage throttling; the
instructions "take more work". Intel CPUs take 1 MILLIsecond to
return to normal speed. If you're doing any kind of rapid context
switching, or running multiple workloads, then depending on how your
scheduler is set up, the OTHER workloads will show up as using a
higher percentage of CPU time per work item. It's non-intuitive and
difficult to debug.
stu2010 - 1 hour ago
Now that major cloud vendors are selling VMs with guaranteed
AVX512 support, how are they going to deal with the "noisy
neighbor" problem?
gigatexal - 2 hours ago
This has got to be one of the best blog posts I've read in a while.
It's clean, concise, has a clear use-case and is well benchmarked.
Kudos to the author.
aardvark179 - 2 hours ago
There's room for two approaches in Java really. The JIT can be
smart and use SIMD instructions where it can see they are
applicable, but there's also room for a small DSL-like API that
allows library authors and other very experienced users to express
a suitable algorithm and have it easily translated into the SIMD
instructions available at runtime. Anybody interested in the latter
should take a look at the work being done under project Panama.
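For a sense of what such a DSL-like API might look like, here is a
minimal sketch modelled on the vector API being prototyped under
Panama (jdk.incubator.vector); the exact classes and methods are
still in flux, so treat these names as illustrative:

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorSpecies;

    public class VectorAdd {
        // Widest vector shape the running CPU supports
        // (e.g. 8 floats with AVX2, 16 with AVX-512).
        static final VectorSpecies<Float> SPECIES =
            FloatVector.SPECIES_PREFERRED;

        static void add(float[] a, float[] b, float[] c) {
            int i = 0;
            int upper = SPECIES.loopBound(a.length);
            for (; i < upper; i += SPECIES.length()) {
                FloatVector va = FloatVector.fromArray(SPECIES, a, i);
                FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
                va.add(vb).intoArray(c, i); // one SIMD add per step
            }
            for (; i < a.length; i++) {     // scalar tail
                c[i] = a[i] + b[i];
            }
        }
    }

The point of the species abstraction is that the same source compiles
down to whatever SIMD width the hardware offers at runtime, which is
exactly the "translated into the SIMD instructions available at
runtime" property described above.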
briane80 - 1 hour ago
Annotation-based compiler hints?
_old_dude_ - 2 hours ago
This presentation was posted recently on the general OpenJDK
mailing list:
http://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorization_in_HotSpot_JVM.pdf
faragon - 2 hours ago
Auto-vectorization is hard. Very hard. Even in C/C++, the compiler
(e.g. GCC or MS VC++) is unable to vectorize loops unless you help
it a lot, and in most cases you end up writing SIMD "intrinsics"
(e.g. [1]) in order to get optimal results. In my experience,
although auto-vectorization is better than it was 10 years ago, it
is still very far from optimizing code properly without lots of
tuning (e.g. try building any graphics processing library and look
at the vectorization warnings, i.e. why vectorization was not
possible). Ten years ago, in the SSE2/AltiVec times, I thought it
was only a matter of time before much smarter compilers made
graphics/pixel-processing code much faster, but no. And the case
cannot be better for a JIT, because it is similar: even with runtime
information, the auto-vectorization phase is equivalent. I would
love to see smarter compilers that understand the code, many steps
beyond today's hardwired pattern-matching based optimizations.
[1] https://software.intel.com/sites/landingpage/IntrinsicsGuide...
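As a hedged illustration of the kind of "help" involved (not from
the comment itself): rewriting a data-dependent branch as
straight-line arithmetic is a typical tweak. Whether either form
actually vectorizes depends on the JVM or compiler version and
flags.

    // Often NOT auto-vectorized: the data-dependent branch in the
    // loop body tends to defeat superword-style vectorizers.
    static int sumPositivesBranchy(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > 0) sum += a[i];
        }
        return sum;
    }

    // More vectorizer-friendly: the branch becomes branch-free
    // math, leaving a plain reduction over a unit-stride loop.
    static int sumPositivesBranchless(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.max(a[i], 0);
        }
        return sum;
    }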
YSFEJ4SWJUVU6 - 1 hour ago
A few years back I decided to entertain myself by testing how
smart today's smart compilers really are when it comes to
auto-vectorization. I had this small and simple C application I'd
written years earlier that tried to find inputs whose corresponding
MD5 hashes started with certain bytes. It was a good base because
it was obviously vectorizable. At first, enabling the vectorizer
didn't result in any changes to the binaries. I then correctly
guessed that (potentially) calling the printf function inside a hot
loop might confuse it. After slight refactoring I got the compiler
to output SSE instructions, which resulted in a nice 2.5× testing
speedup over the original (incidentally, even without
auto-vectorization the refactored code resulted in faster binaries,
which is not all that surprising). Anyway, I also rewrote the
application to use intrinsics. I hadn't used them before myself,
but it really didn't take much time at all to familiarize myself
and write the code, and it was indeed quite a bit faster than what
the compiler was capable of: the resulting binary had 12.5× the
speed of the original, or almost 5× compared to what the compiler
could achieve without explicit hints from intrinsics.
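A rough Java analogue of the refactoring described above (all names
here are illustrative, not the commenter's actual C code): keep the
hot loop to pure arithmetic and do the printing in a separate pass,
so calls with side effects don't block vectorization.

    // Hypothetical stand-in for the hash step; pure arithmetic.
    static int cheapHash(int x) {
        return (x * 0x9E3779B9) ^ (x >>> 16);
    }

    static void search(int[] inputs) {
        int[] hashes = new int[inputs.length];
        // Hot loop: no I/O, no calls with side effects.
        for (int i = 0; i < inputs.length; i++) {
            hashes[i] = cheapHash(inputs[i]);
        }
        // Reporting pass kept outside the hot loop.
        for (int i = 0; i < inputs.length; i++) {
            if ((hashes[i] & 0xFF) == 0) {
                System.out.println("candidate: " + inputs[i]);
            }
        }
    }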
foota - 1 hour ago
Would be interesting to see the difference in the compiled
assembly.
[deleted]
evincarofautumn - 26 minutes ago
I've had similar experiences. If I want vectorised code, I just
write it myself using intrinsics or assembly. It's fine if the
compiler can autovectorise something I didn't feel like doing
by hand, but I'm not going to rely on heuristic voodoo to get
the machine code I want for a hot loop. I wouldn't mind a
slightly nicer wrapper API for the intrinsics, though,
something like glsl-sse2 [1]. And that's more or less what I'm
planning to do in a programming language I'm working on,
actually: if you use a SIMD-compatible array type, the compiler
will try to keep it in a vector register, and some operations
will be faster (e.g., "+" on two Float32^4 values will compile
to an addps), but it's up to the programmer to use the
instructions they actually want, or tell the compiler with a
macro "please vectorise this loop or warn me about why you
can't".
[1] https://github.com/LiraNuna/glsl-sse2
faragon - 1 hour ago
That matches my experience. With auto-vectorization you get
some speed-up by helping the compiler (not always obvious, often
requiring +1 increments, etc.), but for full speed you need
handwritten SIMD intrinsics. I would like the compiler to reach
at least 50% of optimal without intrinsics, keeping intrinsics
for the most critical code.