| Doc. no.: | N2138 |
| --- | --- |
| Date: | 2017-03-13 |
| Reply to: | Clark Nelson |
This is a response to N2131. Each section of this document responds to the section of that document with the same name, but in this document, the sections are ordered differently.
I should say at the outset that I appreciate the attention that OpenMP participants have given this matter.
Let me begin with what is clearly a significant misunderstanding of the purpose of the CPLEX technical specification. Quoting from N2131:
The draft does not support producer/consumer patterns or any form of explicit concurrency control where concurrency, or at least apparently concurrent progress, is required for correctness.
That's very true, and very intentional; that was quite an important non-goal for us, and I'm glad it's clear that we didn't cover it.
For example, please see clause 1 “Scope” (from N2017):
The following are outside the scope of this technical specification:
- Support for writing a concurrent program.
Also see the definition of “concurrent program” (3.4):
program that uses multiple concurrent interacting threads of execution, each with its own progress requirements
EXAMPLE 1 A program that has separate server and client threads is a concurrent program.
So the problem seems to be that we didn't sufficiently emphasize the fact that we had absolutely no intention of inventing new facilities to support this kind of programming. (I don't know; maybe we should have used a larger font. :-)
The difference between parallel programs and concurrent programs is important. For a parallel algorithm (unlike a concurrent algorithm), it is desirable for the implementation to be able to take into account system considerations, like whether there is a processor idle at the moment, or perhaps even how much charge the battery currently has, to decide what kind of resources a program should get. The performance of a parallel algorithm can always gracefully degrade to the serial case – that's pretty much the definition of a parallel algorithm.
For a concurrent algorithm, such as a producer/consumer pair, the situation is very different. Concurrent progress, in this case, is needed for correctness, not just performance. In order for an implementation to take advantage of the difference, the implementation needs to know something about why parallel execution is requested. In other words, there is a concrete disadvantage to expressing a parallel algorithm and a concurrent algorithm in exactly the same terms, even when the implementation techniques may be nearly identical.
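To make the distinction concrete, here is a minimal sketch of my own (not taken from N2017 or N2131) of a producer/consumer pair, written with C11 <threads.h>. If the two functions were forced to run strictly one after the other, with the consumer first, the program would simply hang: both threads must make progress for the program to be correct at all, which is exactly the property a parallel algorithm does not depend on.

```c
/* Hypothetical illustration only: one producer, one consumer, one slot.
 * The names (slot, filled, producer, consumer) are mine, not CPLEX's. */
#include <threads.h>
#include <stdio.h>

static mtx_t lock;
static cnd_t filled;
static int slot;
static int has_item = 0;

static int producer(void *arg)
{
    (void)arg;
    mtx_lock(&lock);
    slot = 42;              /* produce one item */
    has_item = 1;
    cnd_signal(&filled);    /* wake the consumer */
    mtx_unlock(&lock);
    return 0;
}

static int consumer(void *arg)
{
    (void)arg;
    mtx_lock(&lock);
    while (!has_item)                 /* cannot finish until the producer runs: */
        cnd_wait(&filled, &lock);     /* concurrent progress is a correctness   */
    printf("consumed %d\n", slot);    /* requirement, not an optimization       */
    mtx_unlock(&lock);
    return 0;
}

int main(void)
{
    thrd_t p, c;
    mtx_init(&lock, mtx_plain);
    cnd_init(&filled);
    thrd_create(&c, consumer, NULL);
    thrd_create(&p, producer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}
```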
The reason CPLEX is not (or not yet) addressing concurrent programming is not because we believe it to be less important, but simply because it's far less obvious how to improve on the wide variety of existing facilities for writing concurrent programs, including pthreads, Windows threads, OpenMP, GCD, and C11.
It is a major concern that the CPLEX draft currently requires as much, or in some cases more, synchronization as an equivalent OpenMP program, in some cases forcing complete task barriers that are algorithmically unnecessary.
It's not clear to me to what extent the desire/need for absence of synchronization is predicated on the desire to write concurrent programs. To the extent that it is not, I would very much like to know more about the usefulness of non-synchronization in parallel algorithms. It would seem that some amount of synchronization is essential in a parallel algorithm, to control combination of partial results into a final result. If there are counter-examples, it would be good to know about them.
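To illustrate what I mean by essential synchronization: even an embarrassingly parallel reduction needs one synchronization point, namely the join, before its partial results can be combined. Here is a small sketch of my own, using plain C11 threads rather than any CPLEX syntax:

```c
/* Hypothetical illustration: a two-way parallel sum.  The only
 * synchronization is the join, but that join is essential; without it,
 * main() could read partial[0] and partial[1] before the workers had
 * finished writing them. */
#include <threads.h>
#include <stdio.h>

#define N 1000
static int data[N];
static long partial[2];

static int sum_half(void *arg)
{
    int which = *(int *)arg;
    long s = 0;
    for (int i = which * (N / 2); i < (which + 1) * (N / 2); i++)
        s += data[i];
    partial[which] = s;
    return 0;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = i;

    thrd_t t[2];
    int id[2] = { 0, 1 };
    for (int w = 0; w < 2; w++)
        thrd_create(&t[w], sum_half, &id[w]);
    for (int w = 0; w < 2; w++)
        thrd_join(t[w], NULL);          /* the essential synchronization */

    printf("total = %ld\n", partial[0] + partial[1]);
    return 0;
}
```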
(This may not be significant, but I include it for completeness. I don't understand the following sentence from N2131:
Since the semantics require that "execution joins with all child tasks spawned directly or indirectly within the compound statement," Without the implied synchronization, no mechanism could ensure that a program could not observe the lack of the synchronization.
The capital in the middle of the sentence suggests the possibility of an editing glitch, but I also can't make sense, in isolation, of the clause that starts with "no mechanism" and extends through the period; perhaps there are too many negatives for me to untangle.)
The draft's wording about associated task blocks, and the requirement that all task spawn statements have an associated call block, combine to require that any function that contains an escaping task spawn statement, or calls such a function (and so on), must be a task spawning function. That is, the function definition and all calls to the function must be annotated with "_Task _Call".
It is certainly true that when a task spawn appears in a different function than the corresponding task block, there is a viral effect on every function and call between the task block and the task spawn. But when a function continues to run in parallel with its caller, it is by its nature collaborative with it. For such collaboration, the viral annotations are not a heavy burden.
The desired advantage is the ability to instantly recognize when a function might (in some sense) keep running after it returns – which is rather an odd circumstance – and from that, improved ability to reason about what the program is doing, and which parts of the program can run in parallel.
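To make the shape of the requirement concrete, here is a purely illustrative sketch. Only the "_Task _Call" annotation is taken from the draft as quoted above; the spellings I use below for the task block and the spawn statement are assumptions made for the sake of the example, and may not match N2017 exactly.

```c
/* Illustrative sketch only.  _Task _Call appears in the draft; the
 * spellings _Task _Block and _Task _Spawn are assumed here for the
 * task block and the escaping spawn. */

void leaf(int i);

_Task _Call void helper(int i)     /* contains an escaping spawn, so it  */
{                                  /* must be a task-spawning function,  */
    _Task _Spawn leaf(i);          /* visible in its declaration ...     */
}

_Task _Call void middle(int i)     /* ... and so must every function and */
{                                  /* call between the spawn and the     */
    _Task _Call helper(i);         /* enclosing task block.              */
}

void top(void)
{
    _Task _Block {
        for (int i = 0; i < 10; i++)
            _Task _Call middle(i); /* the caller can see that middle()
                                      may keep running after it returns */
    }                              /* all spawned tasks join here */
}
```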
The task-launching defaults for OpenMP and CPLEX are reversed. An OpenMP function can, by default, return with child tasks still running – a deliberate synchronization is needed to prevent that. CPLEX errs on the side of safety: When a function returns, it is assumed to be done (as in normal C) unless it is annotated such that the caller knows it might be asynchronous. This was a deliberate design decision.
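For reference, the OpenMP default being described looks like this (a minimal fragment of my own; work() is just a placeholder):

```c
void work(int i);                  /* placeholder for the child task's work */

void f(int i)
{
    #pragma omp task
    work(i);

    /* No "#pragma omp taskwait" before the return: by default, f() may
     * return while the child task is still outstanding.  CPLEX reverses
     * the default: the join is implied unless the function is annotated
     * so that its caller knows it may still be running. */
}
```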
There's obviously a cost when a task spawn is widely separated from its corresponding task block. At this point, I do not have enough data to know how well the benefit would compensate for the cost – especially in a parallel (i.e. non-concurrent) algorithm. (That's part of the reason for proposing this for a Technical Specification, not a standard: to get more data on the trade-off.)
The loop hint parameters, which we believe are the one aspect that takes inspiration from OpenMP, fail to deliver the capabilities of their OpenMP LOOP construct counterparts.
CPLEX's focus area has primarily been the common subset (i.e. intersection) of Cilk and OpenMP. The intention has been that an implementation of CPLEX could be based on either a Cilk runtime or an OpenMP runtime (or some other runtime). The ability to express the tuning parameters available for an OpenMP parallel loop is intended to give a CPLEX program running on an OpenMP-based implementation a fighting chance to get performance comparable to a similar OpenMP program, but not to guarantee it, so that, for example, major changes to a Cilk-based implementation are not required.
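For context, the kind of per-loop tuning that the quoted concern refers to looks, on the OpenMP side, something like the following (an ordinary OpenMP worksharing loop; I am not reproducing the CPLEX hint syntax here):

```c
void scale(float *a, const float *b, float s, int n)
{
    /* The schedule clause (here, dynamic scheduling with a chunk size
     * of 16) is the sort of per-loop control whose full capabilities
     * N2131 says the CPLEX loop hints do not deliver. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++)
        a[i] = s * b[i];
}
```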
The ability to tune performance reliably is of course of great value, but a programming-language standard usually has no way to talk about performance; the C standard certainly gives no performance guarantees, nor should it. (In what units would a performance guarantee be expressed: Nanoseconds? Milliseconds? Weeks?)
If the purpose of CPLEX were to replace/supplant OpenMP, the concerns expressed in N2131 would be very serious indeed. But that is not and never has been the purpose. The world doesn't need a newer or “better” OpenMP; the one we already have is the best OpenMP imaginable, for the purposes it is intended to serve. But the purposes of CPLEX are different.
Consider the OpenMP decision to use pragma syntax exclusively, even in cases where the OpenMP program semantics differ from those of the same program without the pragmas. For OpenMP's purposes, such a decision makes quite a lot of sense. However, for the purposes of WG14, that would be a troubling direction.
Also, where OpenMP intends to provide the means of achieving something approaching maximal performance, portably or non-portably, with only moderate effort, CPLEX intends to provide the means of achieving scalable performance, portably, with minimal effort.
It should also be kept in mind that the CPLEX study group isn't done yet. If N2017 were all the group ever produced, it would have been a waste of effort. For example, there would be no point in having language extensions to express task parallel loops if we were not also planning to have very similar language extensions to express SIMD parallel loops.