Missing number?

using x=1:53; y=2.^x; floor(z=log2(y-1))

octave:93> x=1:53; y=2.^x; floor(z=log2(y-1))
ans =
 Columns 1 through 26:
    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25
 Columns 27 through 52:
   26   27   28   29   30   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   49   50   51   52
 Column 53:

Should the number ‘48’ really already be missing in the result (when eps=2^-52)?
Log2() accuracy ending point?

octave:112> log2(2^48-1)
ans =  47.99999999999999
octave:113> log2(2^49-1)
ans =  49

Maybe the rounding up makes it loosing the .999999… at some point.
I guess it is correct and rounding for the log2() result occurs maybe before the floor().
It could be nice also if this dec2bin, etc would work for decimals:

octave:126> dec2bin(0.1)
error: dec2base: input must be real non-negative integers
error: called from
    dec2base at line 72 column 5
    dec2bin at line 48 column 7

Screenshot from 2021-09-17 13-32-34

octave:1> dec2bin(pi)
error: dec2base: input must be real non-negative integers
error: called from
    dec2base at line 72 column 5
    dec2bin at line 48 column 7
octave:1> dec2bin(floor(pi))
ans = 11

So I could not directly see the floating point register contents…
Is there some command to show the floating point register contents?

If you are interested in the binary representation of floating point numbers, you could typecast to an integer of the same size and use dec2bin on that value.

>> dec2bin(typecast(log2(2^49-1), 'uint64'))
ans = 100000001001000100000000000000000000000000000000000000000000000

See e.g. Double-precision floating-point format - Wikipedia for interpreting the bits (sign, exponent, mantissa).

The “typeinfo” shows it is scalar, but maybe I’ll try find out…

octave:71> typeinfo(2^49-1)
ans = scalar


octave:75> dec2bin(typecast(pi,'uint64'))
ans = 100000000001001001000011111101101010100010001000010110100011000
octave:82> length(dec2bin(typecast(pi,'uint64')))
ans =  63

it looks difficult to find the pattern “11.001…” here (maybe as 1st ‘1’ is hidden).
The length still is ‘63’, not ‘64’, but maybe no issues. It going to show when it will.
pi = 0 10000000000 1001001000011111101101010100010001000010110100011000
Maybe the 1st zero is missing if I use this type cast? or something like that…
If I can see all the contents of the floating point register (including the sign) and with space separated, it can be more clear.
I hope to know some examples / function / method that can check equation accuracies.
Also it looks the below would work:

octave:76> 2^-1050
ans =   8.289046058458095e-317

even the exponent is outside the IEEE -format…
Below case however, the answer looks too short:

octave:79> dec2bin(typecast(2^-1050,'uint64'))
ans = 1000000000000000000000000
octave:80> length(dec2bin(typecast(2^-1050,'uint64')))
ans =  25

I guess it would work, except for plotting these numbers (cause leading zeroes missing…?)
ok. It seems it is ‘subnormal’ number.
I was wondering though why the ‘supernormal’ numbers are missing in this code (some numbers between 1.797693134862316e+308 and infinity, but up to 10^323).
Maybe if they are hard to calculate than the subnormals?

Can this be possible?

octave:112> eps(2^-1050)
ans =   4.940656458412465e-324

I guess it would not show the actual step between these subnormal numbers as this is same eps as that used for zero.
It could be interesting not only to know this representation but how they are operated (that is how each operation is done in the register(s)).
And as I said how can I find the accuracy of some operation in octave?
Output result accuracy might be dependent of the operation, not only about the input numbers. Maybe I guess that it need some type of “Monte Carlo” simulation to see how much output is affected when input is changed by the input*eps/2 or similar amounts.
Is this how to test it:
i=0:11; x=(1-i*eps/2); y=acos(x), diff(y)
So comparing to
i=0:11; x=(i*eps/2); y=asin(x), diff(y)
would suggest asin(x) is more accuracy with small angles.
i=0:11; x=(i*eps/2); y=cos(x), diff(y)
i=0:11; x=(i*eps/2); y=sin(x), diff(y)
Maybe I can find out how to test.
Below is some testing, but its for simple multiplication function only:

Please, read the Wikipedia article to interpret the bytes correctly. (Especially, the part about subnormal representation.)

Edit: What might have partly caused your doubts: `dec2bin’ omits leading zeros.

Yes. I will check the article if I have time.
Maybe if somehow I could do quad floating point accuracy (just to make a ‘reference’) to check the accuracy of each double floating point function. But could not be sure how that can be checked.
If in this picture of FP16, the subnormals is at the left, should the right side have the ‘super normals’?

Yes. Octave does not do pure math. It does numerical calculations and those are always subject to imperfections like finite precision. Incidentally, Matlab calculates the same result (47.9999…) because it too uses standard IEEE 754 to represent numbers. In general, if a calculation is within the range of machine precision that is the best one can do. This is the case here as shown below

octave:4> y = log2 (2^48 - 1)
y = 48.000
octave:5> y - 48
ans = -7.1054e-15
octave:6> eps (48)
ans = 7.1054e-15

Octave throws an error because the function dec2bin is meant for converting integers, not floating points, to binary values. If you need different functionality then you need to look for a different function. typecast is one way of looking at the representation. Another way is to change the format. For example, format bit will display all answers in binary.

format bit
ans = 0100000000001001001000011111101101010100010001000010110100011000
ans = 00000101
1 Like

I believe this was mentioned before in the cot(0) discussion, but if you desire more precision than comes from the IEEE 754 floating point arithmetic, the symbolic toolbox implements variable precision arithmetic via the VPA() function, and can provide you with whatever level of precision you require.

ok. it looks quite practical, but not exactly what I wanted.

octave:56> format bit
octave:57> 33.439725618828724348
ans = 0100000001000000101110000100100011101101110110000000101010010111
octave:58> 68.932639205776316824
ans = 0100000001010001001110111011000001011100010110011111000110110101
octave:59> 33.439725618828724348*68.932639205776316824
ans = 0100000010100010000000100010110101010101010001100111000100110010
octave:60> 2305.088541222875631758923413225810830752
ans = 0100000010100010000000100010110101010101010001100111000100110011
octave:4> format long
octave:5> 33.439725618828724348
ans = 33.43972561882872
octave:6> 68.932639205776316824
ans = 68.93263920577631
octave:7> format bit
octave:8> 2305.088541222875459202028376926190051223766243960986723447736945
ans = 0100000010100010000000100010110101010101010001100111000100110010

Is here really some bit rounding error in the last digit or what is happening?
It could be hard to say… But it looks like the last bit is flipped. Do you get the same results?
Maybe it is what is supposed to happen. I’m not 100% sure what should happen there.

Should this be:

octave:57> 33.439725618828724348
ans = 0100000001000000101110000100100011101101110110000000101010010111+/-0.01


octave:57> 33.439725618828724348
ans = 0100000001000000101110000100100011101101110110000000101010010111+/-0.1
  actual binary  100001.01110000100100011101101110110000000101010010110111111
octave:3> 68.932639205776316824
ans = 0100000001010001001110111011000001011100010110011111000110110101
  actual binary  1000100.111011101100000101110001011001111100011011010101011111

I could not see how much is input parameters error.
I guess the conversion error between decimal and binary also affects to the results.
I guess after conversion the last bit error is +/-0.1 (in base 2).

The multiplication error seems to be the size of the last bit, which should be normal…
Input arguments error +/-0.1 should become +/-1 after multiplication.

The final conversion from double to decimal maybe adds a most of the error…

I guess using

octave:44> v=33.439725618828724348*68.932639205776316824
v = 2305.088541222875
octave:45> printf("%.44g\n",v)

would give the ‘most exact’ decimal presentation of the double value and the contents for the FP register. So I might not need the binary presentation at all, even if it could be nice also to get values as “1000100.111011101…”.

This looks nice also:

octave:54> 1+eps
ans = 0011111111110000000000000000000000000000000000000000000000000001
octave:55> 1
ans = 0011111111110000000000000000000000000000000000000000000000000000

The eps in this definition is then the ‘last bit’ of the value ‘1’.

FPACC.pdf (55.3 KB)
I try make some type of error estimation for different functions, but it looks more or less complicated.
For example general case for x^y where both are floating point numbers (not integer cases).

What is the tool/function to estimate these input related inaccuracies?
I might need check the manual, but not sure if that exists.