[Cfp-interest 2865] Re: CFP review of TS-4 and TS-5 revisions

Jim Thomas jaswthomas at sbcglobal.net
Sat Aug 26 08:45:13 PDT 2023


Some initial thoughts …

One example would do for the TS. Omit the PRODUCT.

How much to say about double-double?

Are there references for the algorithms used that should be mentioned?

Should the utility for implementing reproducible dot products be mentioned (with reference)?

> On Aug 24, 2023, at 8:29 PM, David Hough CFP <pcfp at oakapple.net> wrote:
> 
> I regret sending a half-baked text which revealed numerous typos as soon 
> as I sent it.
> 
> The following might be as much as 51%-baked -
> 
> EXAMPLE
> 
> The augmented operations are useful for extending precision, particularly
> for implementing "double-double" arithmetic, which provides a faster, though
> less precise and less predictable, alternative to software quadruple precision,
> on systems which lack hardware quadruple precision.
> 
Maybe "which lack hardware support for quadruple precision". (Thinking of Itanium, whose architecture allowed for decently fast quad.)

Aren’t the augmented operations useful for double-double with good performance only if implemented in hardware? This is subtly suggested below. We could just say “the augmented operations can be used to implement extended precision …” and not delve into performance; but if we do discuss performance, we should be up front about the hardware requirement.

> Double-double represents numbers as a pair of doubles, the second much smaller
> in magnitude than the first.    Ideally, in a pair (a.h,a.t), a.h would
> contain the correctly rounded result of computed(a.h+a.t), 
> and a.t would contain the correctly rounded result of computed(a.h+a.t)-a.h.
> Performance considerations often compromise this ideal.

This might be a good place to mention that there is no standard for double-double and details differ among implementations.
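For concreteness, the ideal pairing described above can be sketched in C with Dekker's Fast2Sum, which is exact when |h0| >= |t0|. (The struct and field names here are illustrative, not from any standard; as noted, implementations differ.)

```c
/* Illustrative double-double pair; field names follow the text. */
typedef struct { double h, t; } dd_t;

/* Fast2Sum (Dekker): given |h0| >= |t0|, return a pair whose h is the
   correctly rounded sum h0 + t0 and whose t is the exact error of that
   rounding -- the ideal described in the text.  Real double-double
   libraries often relax this ideal for speed. */
static dd_t dd_renorm(double h0, double t0) {
    dd_t r;
    r.h = h0 + t0;          /* correctly rounded sum */
    r.t = t0 - (r.h - h0);  /* exact rounding error, by Fast2Sum */
    return r;
}
```

The error-term identity is the classical Fast2Sum result and holds in round-to-nearest whenever |h0| >= |t0|.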

> 
> The augmented operations make it possible to implement + - * for
> double-double format, in a way that is commutative and avoids conditional
> branches, yet still yields close to the intended results, and is competitive
> in performance if supported in hardware.
> 
> Let {a.h;a.t} represent a double-double value a intended to contain the value
> a.h+a.t in infinite precision, and {b.h;b.t} a similar double-double value b.
> 
> SUM
> 
> Consider computing the sum {z.h;z.t}, a double-double value z
> intended to contain {a.h+a.t}+{b.h+b.t} as closely as possible.
> 
> Clearly z should be a.h + a.t + b.h + b.t, but this is might be
> a higher precision value that has to be rounded to fit double-double.

Omit “is”.

Maybe just say “but this might have to be rounded to fit double-double.” The precision (as used in C) of double-double isn’t all that determines whether a value fits.
> 
> Using the notation
> 	{x.h;x.t} = v.h+w.h         # augadd
> to mean something like
> 	daug_t x ;
> 	{x.h;x.t} = augadd(v.h, w.h) ;
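Since there is no standard C binding for augadd yet, the notation can be read against a software model: Knuth's branch-free TwoSum computes the same (sum, error) pair, except that IEEE 754-2019 augmentedAddition rounds ties toward zero while TwoSum under the default mode rounds ties to even. A sketch, not the real operation:

```c
typedef struct { double h, t; } daug_t;

/* Software stand-in for augadd: Knuth's branch-free TwoSum.
   h is the rounded sum, t its exact error.  Differs from true
   augmentedAddition (IEEE 754-2019) only on ties, which that
   operation rounds toward zero rather than to even. */
static daug_t augadd(double a, double b) {
    daug_t r;
    double s  = a + b;
    double ap = s - b;            /* a as recovered from s */
    double bp = s - ap;           /* b as recovered from s */
    r.h = s;
    r.t = (a - ap) + (b - bp);    /* exact rounding error of a + b */
    return r;
}
```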
> 
> Let the algorithm be
> 
> 				# infinite result is a.h+a.t+b.h+b.t
> 	{u.h;u.t} = a.h+b.t	# augadd
> 				# infinite result is u.h+u.t+a.t+b.t

u.h+u.t+a.t+b.h?

> 	{v.h;v.t} = a.t+b.t	# augadd

a.t + b.h?

> 				# infinite result is u.h+u.t+v.h+v.t
> 	{w.h;w.t} = u.t+v.t	# augadd
> 				# infinite result is u.h+v.h+w.h+w.t
> 	{x.h;x.t} = v.h+w.h	# augadd
> 				# infinite result is u.h+x.h+x.t+w.t
> 	{z.h;z.t} = u.h+x.h	# augadd
> 				# infinite result is z.h+z.t+x.t+w.t
> 
> Now {z.h;z.t} is a pretty good approximation to the intended result, 
> with error x.t+w.t, and is
> commutative

Why? If it’s not fairly easy to see (requires some error analysis) maybe say “is (or can be) proven to be commutative”.

> and has no conditional branches.    Its edge cases

handling of special cases?

> might be no
> worse than those of any other double-double implementation fast enough to 
> be useful.    Its performance might be acceptable if hardware 
> augadd is no slower than two adds.

Here’s the subtle suggestion that hardware implementation is needed for performance.
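Putting the five steps together, taking u = augadd(a.h, b.h) and v = augadd(a.t, b.t) as the inline "infinite result" comments suggest was intended, the sum fits in a few branch-free lines. augadd is modeled here by TwoSum, so this is a sketch rather than the TS operation:

```c
typedef struct { double h, t; } daug_t;

/* TwoSum model of augadd (true augmentedAddition differs on ties). */
static daug_t augadd(double a, double b) {
    daug_t r;
    double s  = a + b;
    double ap = s - b;
    double bp = s - ap;
    r.h = s;
    r.t = (a - ap) + (b - bp);
    return r;
}

/* Branch-free double-double sum following the quoted algorithm. */
static daug_t dd_add(daug_t a, daug_t b) {
    daug_t u = augadd(a.h, b.h);
    daug_t v = augadd(a.t, b.t);
    daug_t w = augadd(u.t, v.t);
    daug_t x = augadd(v.h, w.h);
    daug_t z = augadd(u.h, x.h);   /* dropped error: x.t + w.t */
    return z;
}
```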

> 
> PRODUCT
> 
> Consider computing the product {z.h;z.t}, a double-double value z 
> intended to contain {a.h+a.t}*{b.h+b.t} as closely as possible.
> 
> Clearly z should be a.h*b.h + a.h*b.t + a.t*b.h + a.t*b.t but this is 
> likely to be a quad-double precision value that has to be rounded to 
> fit double-double.
> 
> Using the notation
> 	{x.h;x.t} = a.h*b.h         # augmul
> to mean something like
> 	daug_t x ;
> 	{x.h;x.t} = augmul(a.h, b.h) ;
> 
> Let the algorithm be
> 
> 				# infinite result is 
> 				# a.h*b.h+a.h*b.t+a.t*b.h+a.t*b.t 
> 	u=a.h*b.t ; v=a.t*b.t	# regular multiplication
> 	w = u + v		# regular addition
> 				# infinite result is 
> 				# a.h*b.h+w+(a.h*b.t+a.t*b.h+a.t*b.t-w)
> 	{x.h;x.t} = a.h*b.h	# augmul
> 				# infinite result is 
> 				# x.h+x.t+w+(a.h*b.t+a.t*b.h+a.t*b.t-w)

Doesn't the possible inexact underflow of augmul affect this?

- Jim Thomas

> 	{y.h;y.t} = x.t+w	# augadd
> 				# infinite result is 
> 				# x.h+y.h+y.t+(a.h*b.t+a.t*b.h+a.t*b.t-w)
> 	{z.h;z.t} = x.h+y.h	# augadd
> 				# infinite result is 
> 				# z.h+z.t+y.t+(a.h*b.t+a.t*b.h+a.t*b.t-w)
> 
> Now {z.h;z.t} is a pretty good approximation to the intended result, 
> with error y.t+(a.h*b.t+a.t*b.h+a.t*b.t-w),
> and is
> commutative and has no conditional branches.    Its edge cases might be no
> worse than those of any other double-double implementation fast enough to 
> be useful.    Its performance might be acceptable if hardware augmul and
> augadd are no slower than two muls or two adds.

