Note that arranging your computation to be cache-aware can yield anywhere from a 5x to a 20x performance improvement in single-threaded code. The problem for portable code is that you have to discover the caching parameters of the host machine, and then have some way to tell the program to reorganize the computation for that platform's caching strategy. As far as I know, this concept is not formally supported anywhere. The only code I know of where this works well has either been hand-tuned to a particular platform's cache, or automatically translated from input source to output source by a program that understands the cache model; someone who did the latter told me that a 20x gain is not uncommon in single-threaded code.
In parallel code, the ways the threads interact in thrashing the cache are vastly more subtle and complex than in single-threaded code. Actually, I'm more amazed that the drop is so small (from 1.9 to 1.5) than that it exists at all.