Instead of polling nleft[i] (without appropriate memory barriers!) and
using sleep() to detect the exit just call pthread_join() on all threads.
Also replace the use of a mutex that guarding the increments with atomic
fetch_add. This should reduce the runtime of this test on SMP systems.
Finally, remove all the debug printfs unless DEBUG_OUTPUT is set in
the environment.