Java Threads Comment To Sun 1998/08/20

This is a request for a change in the Threads API and behaviour compared to that seen in Sun's reference Solaris JDK 1.1.5, including the SunSoft Solaris JDK 1.1.5 release with JIT and native threads.

Reason For Request

The reason for the request is to make Java-built systems behave better (faster and more scalable) portably in performance-critical I/O- and CPU- bound code.

(Not having access to the JDK source I am having to guess causes from truss output and my Java source and a working knowledge of the underlying threads/pthreads library, thus I may be off the mark.)

Summary of Request

Here is a summary of my request/suggestions in approximate descending order of importance, expanded in more detail below:

A new static int availableConcurrency() member of Thread that might return either the number of available (configured or on-line) CPUs in the system, or possibly the amount of CPU `free' in the system (ie the number on-line, multiplied by the fraction of free CPU time in the system over some short interval such as available from system metrics, rounded [up]). A minimal implementation may always return 1 (1 CPU) or possibly 0 (don't know). A green-threads implementation could always return 1, for example, as it will never be able to exploit more than one CPU.
My suggestion is simply to return the number of configured or on-line CPUs in the system, or 1 if not known and for green-thread-style implementations. (A separate static member might return fraction of unused CPU time in the system for advanced adaptive algorithms.)
The reason for this call is to allow a CPU-bound algorithm, eg a numerical computation, to split its processing into a near-optimal number of pieces to be processed in parallel on separate CPUs.
A new void setResourceIntensive(boolean isIntensive) instance member of Thread that would set an attribute of the thread just like setDaemon(), and would have to be called before start(), which indicates that the thread is going to be either I/O bound (see examples below) or CPU-bound and needs to be scheduled on its own processor if possible. In particular this is a statement that it is not adequate for the whole Java process to make progress with as little as one active thread, but that this new thread should really be able to run (or block) in parallel with others in the process.
In a Solaris implementation this corresponds to binding the thread to its own LWP. In the Microsoft implementations to date that I have used where each Java thread is a kernel thread, this call would have no effect.
The API can state that using this feature for a thread may use more system resources than normal (lighter-weight) threads, so fewer of them may be creatable than normal, and they might be slightly less efficient overall.
A more robust threading implementation. All JVMs I have tried up to and including the SunSoft JDK 1.1.5 on Solaris fall over with some sort of non-Java failure when you try to create more than about 3000 (very simple) threads. The Solaris JVMs hang without explanation or any chance of recovery (eg no OutOfMemoryError thrown). The Microsoft JVM I have tested most thoroughly (4.79.1518, as comes with VJ++1.1) falls over a little sooner with a memory error. While 3000 threads may seem a little extreme it is not impossible nor entirely unreasonable (eg on a busy Web server) and hanging or crashing is not acceptable behaviour!
On the widefp proposal... I suggest that some means be provided to allow the same method to be available with both widefp and non-widefp behaviour so that a decision can be made by the caller (potentially on each call) about which version is appropriate to the particular algorithm being run without needing two copies of the code for each routine with different names and attributes to be maintained by the developer. Preferably this should be achieved without actually incurring the relatively high cost of a run-time conditional; I'm not sure if this can be done without an obtrusive language change. I am not suggesting a static flag that decides behaviour since this is not thread-safe, ugly, and may cause surprises. I also suggest that java.lang.Math() have widefp and non-widefp routines available, and also suggest the inclusion of both widefp and non-widefp versions of the error function (erf()), which is very useful in finance.
Solaris JVM only: better behaviour in some I/O-blocking calls such as the java.io.File methods isFile(), isDirectory(), etc, and their stat() calls that do block, but don't cause another thread to be scheduled, even in native-threads JVMs (see examples below). If my suggested setResourceIntensive() call were implemented this would be less of an issue. This may turn out to need fixing in Solaris itself rather than in the JVM thread support.

Background

I have two main examples, one an I/O-bound process, and one a CPU-bound process. Both are based on real commercial applications for investment banks, or current projects of mine.

The CPU-bound example is the more alluring, but the I/O bound one is important too, because getting it right will allow Java code that aggressively overlaps high-latency I/O operations to routinely get better performance from given hardware and programmer time than investing the same effort in C/C++/etc code where threading is harder. This better handling of slow I/O is entirely in line with handling the practicalities of ``The Network is The Computer'' in large and/or performance-critical systems.

CPU Bound Example

For several years now I have earned the bulk of my consultancy income optimising (and parallelising) algorithms that perform the analytics needed by derivatives traders in the financial markets in London, New York, Tokyo and Hong Kong. Just by improving the code, finding lower-complexity algorithms, improving virtual memory behaviour, etc, I can usually find improvements in the range of x2--x5, with occasional improvements by as much as x100 or even x5000 in one instance. This work requires detailed understanding of how the compilers, OS, hardware (eg caches, VM, etc) work. And all of this primarily improves the performance of single-threaded code, and is essential before even thinking about parallelism. I believe that I get fairly close to the maximum available performance (for one CPU) given the hardware/OS/compiler combination (and sometimes even beat the published machine performance). The code I produce usually has to be portable, between UNIXes including Solaris, and NT.

Naturally, this requires getting `close to the metal', and the benefits of the JVM come with some loss of detailed control, but I think it should be possible to keep most of the benefits, while also enjoying the portability and consistency that Java provides.

So clearly, for typical machines, I can usually get a bigger speedup from algorithmic improvements than by using multiple CPUs, but since I can do both in C/C++ code (and because C/C++ are both quite low-level, I can probably fix these low-level features more precisely than in Java) it is important that I be able to do reasonably well in terms of parallelism portably in Java else Java will not be a competitive solution.

(Side note: I note with interest the widefp proposal, and am certainly prepared to allow some variation in the last digit-or-so of a result to getter better cross-platform performance, since I can at least assume the the calculations done will be accurate, albeit potentially slightly less reproducable than without it. I laud this effort and will be making use of it if it appears. I suggest that some means be provided to allow the same code to be available with both widefp and non-widefp behaviour so that a compile- or run- time decision can be made (potentially on each call) about which version is appropriate to the particular algorithm being run without needing two copies of the code for each routine with different names and attributes to be maintained by the developer, but this is a minor point. I also suggest that java.lang.Math have widefp and non-widefp routines available, and also suggest the inclusion of both widefp and non-widefp versions of the error function (erf()), which is very useful in finance.)

I accept that even with a JIT, Java code is currently about a factor of two slower than highly-optimised C or C++ code, though I anticipate that the gap will get narrower, especially since a JIT or HotSpot should be able to optimise for the precise variant of the CPU it is running on, and for the way the code is actually being used in its current run, not something slightly more generic as real-world production C/C++ code usually has to be compiled for.

I note that for one of my major clients (and I'm sure they are not unique), nearly every new compute server and trader's desktop machine delivered has two CPUs in by default. Given that with the C/C++ environment it is difficult to convert algorithms to run portably in single- and multi- threaded forms safely, I can potentially trade off some Java numerical efficiency now for being able to parallelise many more important routines easily. For now I would probably keep the routines already parallelised in C++, and make JNI calls to them. As Java numerical processing catches up with C++ (and Fortran, which after all has had a 30 year head-start!), I can convert all the numerical routines to a far more portable Java form. The client will be happy for me to do it, and they will then have 100% Pure Java trading calculations!

Two examples of where I have already applied parallelism in C++ in a limited number of routines, and would like to use Java:

Monte Carlo simulation, where the number of runs to be done can be divided amongst the CPUs available (whereas putting each of the typically thousands of runs in its own thread would be inefficient and probably break things).
Parallel generation (and conditioning of) random numbers, by the hundreds of thousands and upwards, to support such algorithms as Monte Carlo.

This will be bleeding-edge high-performance real-world buzzword-compliant (B^>) numerical processing entirely in Java.

However, to do this I need two things from the Java threading API that I currently get from the POSIX threading API and by other means.

I need to be able to find at run-time out how many CPUs are actually available in the system so that I can efficiently partition the task at hand, for minimum overhead and best parallel efficiency. The number does not need to be absolutely accurate, nor even necessarily take account of current system load (though that might be useful), but I need a value within (say) a factor of two, so the difference between 1 CPU and 2 CPUs is as important as between 2 and 4, and the difference between (say) 2 and 3 CPUs available might help partition a problem well.
Thus I propose:
A new static int availableConcurrency() member of Thread that might return either the number of available (configured or on-line) CPUs in the system, or possibly the amount of CPU `free' in the system (ie the number on-line, multiplied by the fraction of free CPU time in the system over some short interval such as available from system metrics, rounded [up]). A minimal implementation may always return 1 (1 CPU) or possibly 0 (don't know). A green-threads implementation could always return 1, for example, as it will never be able to exploit more than one CPU.

Having partitioned my problem into multiple threads I then need to ensure that the threads actually run in parallel, and not let the scheduling be satisfied with ensuring that at least one is running. In the POSIX world this seems best achieved by binding the thread (to a new LWP) and probably cacheing it to reuse it for each new task so that each (bound) thread has a chance to migrate to a separate CPU from the others so that threads do all genuinely have the chance to run in parallel.
Thus I propose:
A new void setResourceIntensive(boolean isIntensive) instance member of Thread that would set an attribute of the thread just like setDaemon(), and would have to be called before start(), which indicates that the thread is going to be either I/O bound or CPU-bound and needs to be scheduled on its own processor if possible. In particular this is a statement that it is not adequate for the whole Java process to make progress with as little as one active thread, but that this new thread should really be able to run in parallel with others in the process.

Note that this is a sufficient hint for both I/O- and CPU- bound threads, though JavaSoft might like to distinguish between these.

If these two API extensions were made it would go a long way towards ensuring that Java could make as good use of its host's and network's resources as a C or C++ program, and without the incompatibilities between hosts (such as lack of POSIX threading support in normal Windows NT applications).

All that said, note that I have not had a chance to try JDK 1.1.5 with native threads on a multi-CPU Solaris machine (B^>) and so may be pleasantly surprised, but the point remains that I want to be able to strongly hint to the scheduling system that a small group of threads is to be treated specially, since I know I have to do it for good performance within C/C++ in a POSIX/Solaris threading environment, and I don't see how the JVM could guess better.

I/O Bound Example

A library/application I wrote many years ago (Solaris 2.3/2.4 time!) for a bank initialised by gathering a list of all the files in a directory hierarchy, not unlike the UNIX find command. Each file corresponded to an instrument in their derivatives portfolio, and there were upwards of 10,000 files in maybe 50 directories. These files were then filtered by name, etc, and used as the basis of various calculations, eg ``find the the NPV (Net Present Value) of all the USD legs of instruments in the `swaps' sub-portfolio.''

Most of the machines running this algorithm were remote from the NFS server where these files were held. Much of the initialisation time of processes using this algorithm was thus spent blocked in the stat() call while the heads skittered across the discs. The fileservers were busy and could not be relied on too much to cache the inode info between programs to respond to the stat() calls from memory, thus a fair amount of disc seeking was needed. We had already beefed up the fileserver to be multi-CPU (2 initially, 4 later), and have the partition where these files were kept striped (with ODS over 2 discs, then later 3). This improved the overall performance for multiple users of the system, but the performance of each individual program was poor. A single-threaded `find' spends most of its time blocked in the `stat()' call. Most of the clients were single processor.

The solution became reasonably obvious, which was to recode the directory-search algorithm to run several threads (4 in fact, which seemed about optimimal in almost all cases for our setup) at once to do a breadth-first search of the directory tree (for reasonable disc/inode behaviour), with the threads being bound to their own LWPs to guarantee that they could schedule their own stat() calls in parallel. If I did not bind the threads, the threads library did not always seem to notice that the stat() had blocked, and would not always bring one of the ready threads onto another LWP to submit its stat() request before blocking, ie the SIGWAITING progress-guarantee mechanism does not seem to be fully effective, and I am still experiencing that problem now with native-threads SunSoft JDK 1.1.5 on Solaris 2.5.1 (patched as required); more below.

With this multi-threaded `find' coded in the program (in SPARCompiler 4.x C++ using the Solaris threads library, on Solaris 2.4 as I recall), the `find' time typically dropped by a factor of 2--3, which made a big difference.

Even though the clients were mostly single processor, it is fairly clear why multithreading should help, providing the threads can actually run concurrently:

Overlapping the stat() calls helps hide the latency of the NFS ATTR RPC over the LAN.
Overlapping the stat() calls allows the client to make use of the several remote CPUs and disc stripes on the fileserver.
Overlapping the stat() calls should allow the fileserver kernel to queue up several seeks at once for the internal seek-optimisation (eg elevator) algorithm, making fewer or at least more efficient seeks on each disc platter, thus improving throughput.

The last of these points should apply even in the apparently much-less promising case where:

The client process and the fileserver are the same machine (so there is no LAN latency to overcome, and the client and kernel (UFS) code are competing for CPU time).
The client/fileserver machine is single processor, so only concurrency is available, not real parallelism.
The target disc partition is not striped, so there is no opportunity for multiple separate simultaneous seeks to be happening to serve multiple separate stat() calls.

This reduced case is the situation I am in with a current Java project, where I am implementing a `find' routine to collect data on thousands of images to be incorporated into a Web site. When I try to run the `find' just collecting filenames and not running stat() on each one (the routine does not recurse, so this is possible), the cost is about 2--3ms per file (on the target host). I read the directories in separate threads (using java.io.File.list(filter), running up to 8 threads at once), and seem to be able to get more-or-less 100% CPU utilisation, even in a green-threads JDK.

If I do anything in the filtering that causes the files to be stat()ed, eg using File.isFile(), the time per file goes up to about 30ms, with utilisation only a little higher than 10% indicating the problem is not CPU shortage (ie the stat() appears not to be very expensive inside the kernel). The disc-operations-per-second kernel stat goes up to about 1600, which indicates that I really have maxed out disc seeking. In the green-thread JDK this is what I expected. Unfortunately, the native-thread JDK does almost no better.

Only after quite a while does a SIGWAITING get generated, and a new LWP get produced, which seems to get a marginal improvement in performance. As far as I am concerned, because the CPU utilisation is low the threads library and the JVM should be producing many more threads (to get more I/O queued in this case, for better I/O overlap). Note that I have never seen more than one extra LWP get created.

The current mechanism seems far from ideal in performance terms.

I therefore suggest:

A new void setResourceIntensive(boolean isIntensive) instance member of Thread that would set an attribute of the thread just like setDaemon(), and would have to be called before start(), which indicates that the thread is going to be either I/O bound or CPU-bound and needs to be scheduled on its own processor if possible. In particular this is a statement that it is not adequate for the whole Java process to make progress with as little as one active thread, but that this new thread should really be able to run in parallel with others in the process.

This is easy to implement on any POSIX-compliant (or Solaris) threading library by simply asking for the new thread to be bound (to a new LWP). This would ensure that the JVM/libthread notion of the amount of concurrency to allow in generate (a la the Solaris thr_setconcurrency() call) need not be disturbed (presumably either set to the old value of 4, or set to nCPUs+x), but individual threads can demand to get the best possible scheduling from the kernel because they want to queue their blocking I/O requests up as fast as possible, even if the kernel would not normally consider the thread blocked (the programmer knows better here). In other words, the programmer is providing a strong hint to the scheduling system that its default mechanism for ensuring `sufficient' progress is not aggressive enough for the work this thread is going to be doing.

As with all hints-based APIs, the JVM and underlying threading mechanisms are free to ignore the hint, either because they cannot do anything with it (the JVM is green-threads, or the underlying OS does not support threads, or the JVM already effectively provides each thread with an LWP as in the Microsoft implementation), or because they are not prepared to implement it, or because resources for a bound (LWP) are not available.

While I have focused on the effect of stat() class as my frequent foe in various applications, this applies equally to non-UNIX systems and other I/O operations where the programmer needs to insist that the JVM not guess when and how many of the eligible threads to schedule, but to at least schedule at once all the resource-intensive ones that are permitted be run by the threading API rules.

Damon Hart-Davis
ExNet Ltd
dhd@exnet.com