Double-precision floating-point format

From Wikipedia, the free encyclopedia

In computing, double precision is a computer number format that occupies two adjacent storage locations in computer memory. A double-precision number, sometimes simply called a double, may be defined to be an integer, fixed point, or floating point (in which case it is often referred to as FP64).

Modern computers with 32-bit storage locations use two memory locations to store a 64-bit double-precision number (a single storage location can hold a single-precision number). Double-precision floating-point is an IEEE 754 standard for encoding binary or decimal floating-point numbers in 64 bits (8 bytes).

Floating-point precisions
IEEE 754
16-bit: Half (binary16)
32-bit: Single (binary32), decimal32
64-bit: Double (binary64), decimal64
128-bit: Quadruple (binary128), decimal128
Extended precision formats
Other
Minifloat
Arbitrary precision

IEEE 754 double-precision binary floating-point format: binary64

Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost. As with single-precision floating-point format, it lacks precision on integer numbers when compared with an integer format of the same size. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having:

Sign bit: 1 bit
Exponent width: 11 bits
Significand precision: 53 bits (52 explicitly stored)
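
The three fields listed above can be pulled out of a value with a few bit operations. The following is a minimal sketch, assuming a Python float is stored as a binary64 (true on common platforms); the helper name decompose is hypothetical and used only for illustration.

```python
import struct

def decompose(x: float) -> tuple[int, int, int]:
    """Split a float into its (sign, biased exponent, fraction) bit fields."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)    # 52 explicitly stored bits
    return sign, exponent, fraction

print(decompose(1.0))    # (0, 1023, 0)
print(decompose(-2.5))   # (1, 1024, 1125899906842624)
```

For -2.5 = -1.25 × 2¹ the biased exponent is 1 + 1023 = 1024 and the stored fraction is 0.25 × 2⁵² = 2⁵⁰.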

This gives 15 to 17 significant decimal digits of precision. If a decimal string with at most 15 significant digits is converted to IEEE 754 double precision and then converted back to a decimal string with the same number of significant digits, the final string should match the original. Conversely, if an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits and then converted back to double, the final number must match the original [1].
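
Both round-trip directions can be checked directly. This is a small sketch, assuming Python's float is a binary64 and that its decimal conversions are correctly rounded (as in CPython); the sample values are arbitrary.

```python
# Decimal -> double -> decimal: 15 significant digits survive.
s = "0.123456789012345"
print(format(float(s), ".15g") == s)      # True

# Double -> decimal -> double: 17 significant digits recover the value.
x = 0.1 + 0.2
print(format(x, ".17g"))                  # 0.30000000000000004
print(float(format(x, ".17g")) == x)      # True

# 15 digits are not always enough in the double -> decimal -> double direction.
print(float(format(x, ".15g")) == x)      # False
```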

The format is written with the significand having an implicit integer bit of value 1, unless the written exponent is all zeros. With the 52 bits of the fraction significand appearing in the memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, $53\log_{10}(2) \approx 15.955$ ). The bits are laid out as follows:
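
The digit count quoted above is easy to reproduce. A short sketch, assuming the platform float is a binary64, using the parameters that Python exposes in sys.float_info:

```python
import math
import sys

# Significand precision in bits, including the implicit leading bit.
print(sys.float_info.mant_dig)            # 53

# Decimal digits that are always preserved in a decimal -> double -> decimal trip.
print(sys.float_info.dig)                 # 15

# 53 bits expressed in decimal digits: 53 * log10(2).
digits = sys.float_info.mant_dig * math.log10(2)
print(round(digits, 3))                   # 15.955
```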

[Figure: bit layout of the IEEE 754 double-precision floating-point format]

The real value assumed by a given 64-bit double-precision datum with a given biased exponent e and a 52-bit fraction is $(-1)^{\text{sign}}(1.b_{-1}b_{-2}\ldots b_{-52})_2 \times 2^{e-1023}$, or more precisely: $\text{value} = (-1)^{\text{sign}}\left(1 + \sum_{i=1}^{52} b_{-i} 2^{-i}\right)\times 2^{e-1023}$
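
The formula can be evaluated exactly with rational arithmetic and compared against the float it came from. This is a sketch under the assumption that the input is a normal number (exponent field neither all zeros nor all ones); the function name value_from_bits is hypothetical.

```python
import struct
from fractions import Fraction

def value_from_bits(bits: int) -> Fraction:
    """Evaluate (-1)^sign * (1 + fraction / 2^52) * 2^(e - 1023) exactly."""
    sign = bits >> 63
    e = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if e == 0 or e == 0x7FF:
        # Zeros, subnormals, infinities and NaNs follow other rules.
        raise ValueError("only normal numbers are handled in this sketch")
    # fraction / 2^52 equals the sum of b_-i * 2^-i for i = 1..52.
    significand = 1 + Fraction(fraction, 1 << 52)
    return (-1) ** sign * significand * Fraction(2) ** (e - 1023)

# Compare against the exact rational value of the float 0.1.
(bits,) = struct.unpack(">Q", struct.pack(">d", 0.1))
print(value_from_bits(bits) == Fraction(0.1))   # True
```

Fraction(0.1) converts the stored double to an exact rational, so the comparison confirms the formula bit for bit rather than up to rounding.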

Between 2⁵²=4,503,599,627,370,496 and 2⁵³=9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 2⁵³ to 2⁵⁴, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 2⁵¹ to 2⁵², the spacing is 0.5, etc.
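
The changing spacing shows up as soon as a step smaller than the local spacing is added. A minimal sketch, assuming binary64 floats and the default round-to-nearest, ties-to-even rounding mode:

```python
big = 2.0 ** 53

# Above 2**53 only even integers are representable, so adding 1 is lost.
print(big + 1.0 == big)              # True
print(big + 2.0 == big)              # False

# Between 2**52 and 2**53 every integer is representable.
print(2.0**52 + 1.0 == 2.0**52)      # False

# Between 2**51 and 2**52 the spacing is 0.5.
print(2.0**51 + 0.5 == 2.0**51)      # False
print(2.0**51 + 0.25 == 2.0**51)     # True
```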

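The spacing between adjacent representable numbers in the range from 2ⁿ to 2ⁿ⁺¹ is 2ⁿ⁻⁵², which as a fraction of the numbers themselves lies between 2⁻⁵³ and 2⁻⁵². The maximum relative rounding error when rounding a number to the nearest representable one is therefore 2⁻⁵³. This bound is the machine epsilon under one common definition; another convention uses the relative spacing 2⁻⁵² instead.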
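
These quantities can be inspected from Python. A short sketch, assuming binary64 floats and Python 3.9 or later for math.ulp; note that sys.float_info.epsilon follows the 2⁻⁵² convention (the gap between 1.0 and the next float), not the 2⁻⁵³ rounding bound.

```python
import math
import sys
from fractions import Fraction

# Spacing in [2**n, 2**(n+1)) is 2**(n - 52); n = 10 is an arbitrary choice.
n = 10
print(math.ulp(2.0 ** n) == 2.0 ** (n - 52))       # True

# Python reports the gap between 1.0 and the next float as epsilon.
print(sys.float_info.epsilon == 2.0 ** -52)        # True

# Relative error of rounding 1/10 to the nearest double stays within 2**-53.
exact = Fraction(1, 10)
rel_err = abs(Fraction(0.1) - exact) / exact
print(rel_err <= Fraction(1, 2 ** 53))             # True
```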

...

Double-precision floating-point format - Wikipedia, the free encyclopedia:

Connie O'Dell - DV, EDA, jobseeking, life,whatever

Wednesday, September 26, 2012
