Floating Point and Fixed Point 32-bits Quantizers for Quantization of Weights of Neural Networks

2021 
The Floating Point 32-bit (FP32) representation format is defined by the IEEE 754 standard and is widely used in neural networks (NN), signal processing, and numerical computation. The Fixed Point 32-bit format is also widely used for data representation. This paper describes these standard 32-bit formats (Fixed Point 32 and FP32) as quantization schemes, defining quantizers based on them and thereby providing reference points against which other quantization schemes used in neural networks can be compared. Quantization of data with the Laplacian distribution is considered over a wide range of variance. The theoretical results are verified experimentally by applying these quantization schemes to the weights of a neural network.
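As a rough illustration of the idea of treating these formats as quantizers, the sketch below (an assumption for illustration, not taken from the paper) draws Laplacian-distributed samples standing in for network weights, passes them through a uniform fixed-point-style quantizer and through FP32 rounding, and reports the resulting SQNR. The bit allocation, scale, and function names are illustrative choices.

```python
import numpy as np

# Illustrative sketch: compare a fixed-point-style uniform quantizer with
# FP32 rounding on Laplacian-distributed samples (assumed parameters).
rng = np.random.default_rng(0)
weights = rng.laplace(loc=0.0, scale=1.0, size=100_000).astype(np.float64)

def fixed_point_quantize(x, total_bits=32, frac_bits=16):
    """Uniform quantizer emulating a signed fixed-point format (hypothetical split)."""
    step = 2.0 ** (-frac_bits)                      # quantization step size
    max_code = 2 ** (total_bits - 1) - 1            # largest signed integer code
    codes = np.clip(np.round(x / step), -max_code - 1, max_code)
    return codes * step

def sqnr_db(x, xq):
    """Signal-to-quantization-noise ratio in dB."""
    return 10.0 * np.log10(np.mean(x ** 2) / np.mean((x - xq) ** 2))

xq_fixed = fixed_point_quantize(weights)
xq_fp32 = weights.astype(np.float32).astype(np.float64)  # FP32 as a "quantizer"

print(f"Fixed Point 32 SQNR: {sqnr_db(weights, xq_fixed):.2f} dB")
print(f"FP32 SQNR:           {sqnr_db(weights, xq_fp32):.2f} dB")
```

Varying the `scale` of the Laplacian source in this sketch mimics the wide range of variance studied in the paper: the uniform fixed-point quantizer's SQNR degrades as the data variance moves away from the range matched by its step size, whereas FP32 rounding keeps a nearly constant relative error.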