Before we start discussing UTF, we need to know a few basic concepts.
Human-readable text has to be encoded into a machine-readable form, and various encoding systems have been developed to achieve this.
A few well-known encoding systems are listed below:
- ASCII: American Standard Code for Information Interchange (used in the United States)
- ISO 8859-1: Western European languages
- GB 18030 and Big5: Chinese
These encoding systems define character sets for various languages, and all of them evolved before the Unicode standard.
No system is perfect, and these encoding systems have a few flaws as well:
- The same code values correspond to different letters in different language standards.
- The encodings for languages with large character sets have variable length: some common characters are encoded as single bytes, while others require two or more bytes.
To resolve all these problems, a new standard was developed: the Unicode system.
In the original Unicode design, each character occupied 2 bytes, which is why Java also uses 2 bytes per character (Java follows the Unicode encoding system).
Values for Unicode code units:
- Minimum value: \u0000
- Maximum value: \uFFFF
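As a small illustration (a minimal sketch assuming a standard Java environment), the snippet below prints the range of a Java char, which matches the code-unit range described above:

```java
public class CharRange {
    public static void main(String[] args) {
        // A Java char is a 16-bit UTF-16 code unit: \u0000 to \uffff
        char min = Character.MIN_VALUE;   // '\u0000'
        char max = Character.MAX_VALUE;   // '\uffff'
        System.out.printf("min = U+%04X, max = U+%04X%n", (int) min, (int) max);
        System.out.println("bytes per char: " + (Character.SIZE / 8)); // prints 2
    }
}
```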
The ASCII standard was limited to only 128 character definitions, whereas the Unicode standard defines values for over 100,000 characters.
Objective of the Unicode system:
Its objective is to unify all the different encoding schemes so that confusion between computers can be eliminated. It has several character encoding forms.
UTF stands for Unicode Transformation Format.
UTF-8: Its code unit is 1 byte (8 bits). It uses a single byte to encode each English (ASCII) character.
UTF-16: Its code unit is 2 bytes (16 bits); most characters are encoded in a single 16-bit unit.
UTF-32: It uses 4 bytes (32 bits) to encode each character.
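A rough sketch of these widths, using Java's standard getBytes method with the letter "A" (U+0041) and the euro sign "€" (U+20AC) as illustrative inputs (UTF-32 support depends on the charsets installed in the JVM, so the last lookup is an assumption about a typical JDK):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UtfWidths {
    public static void main(String[] args) {
        String[] samples = { "A", "\u20AC" };      // U+0041 and U+20AC (euro sign)
        Charset[] charsets = {
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE,             // BE variant avoids a BOM in the count
            Charset.forName("UTF-32BE")            // assumes UTF-32 is available on this JVM
        };
        for (String s : samples) {
            for (Charset cs : charsets) {
                System.out.printf("%s in %-8s -> %d byte(s)%n",
                        s, cs.name(), s.getBytes(cs).length);
            }
        }
    }
}
```

Running it shows "A" taking 1, 2, and 4 bytes respectively, while "€" takes 3 bytes in UTF-8 but still 2 in UTF-16 and 4 in UTF-32.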
Code Points:
Unicode values are written as hexadecimal numbers, and all of them carry the prefix "U+".
For example: "A" is represented as U+0041 and "a" as U+0061.
These code points are divided into 17 separate sections called "planes".
The first plane, which contains the most commonly used characters, is known as the "Basic Multilingual Plane" (BMP).
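As a minimal sketch using Java's code-point API (the emoji U+1F600 is an extra example I've added to show a character outside the BMP):

```java
public class CodePoints {
    public static void main(String[] args) {
        // Characters inside the Basic Multilingual Plane
        System.out.printf("A -> U+%04X%n", "A".codePointAt(0));   // U+0041
        System.out.printf("a -> U+%04X%n", "a".codePointAt(0));   // U+0061

        // A character outside the BMP (U+1F600) needs two 16-bit code units
        // in Java, but it is still a single code point
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println("length()       = " + smiley.length());                             // 2
        System.out.println("codePointCount = " + smiley.codePointCount(0, smiley.length()));   // 1
    }
}
```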
Basic difference between UTF-8 and UTF-16
Today, almost all content on the internet uses a UTF character encoding. Among these, UTF-8 and UTF-16 are the most commonly used forms.
UTF-8 encodes a character using 1 to 4 bytes. It usually uses 1 byte (8 bits) per character, and characters that do not fit in a single byte are represented by a sequence of two, three, or four bytes.
UTF-8's single-byte range is identical to the ASCII character set, so plain ASCII text is already valid UTF-8.
UTF-16 uses 2 bytes (16 bits) per character for everything in the BMP (characters outside the BMP need two 16-bit units). In this format, part of the space often remains empty (for ASCII text the high byte of every unit is zero), which is an unnecessary waste of memory.
UTF-16 covers the Latin, Cyrillic, Chinese, and Japanese character sets, among others.
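A rough comparison (again assuming Java's standard charset support; the sample strings are just illustrative) makes the size difference visible:

```java
import java.nio.charset.StandardCharsets;

public class Utf8VsUtf16 {
    public static void main(String[] args) {
        String ascii = "Hello";                 // purely ASCII text
        String mixed = "H\u00e9llo \u6f22";     // "Héllo 漢": Latin-1 plus a CJK character

        // For ASCII text, UTF-8 is half the size of UTF-16
        System.out.println("UTF-8  bytes: " + ascii.getBytes(StandardCharsets.UTF_8).length);    // 5
        System.out.println("UTF-16 bytes: " + ascii.getBytes(StandardCharsets.UTF_16BE).length); // 10

        // For non-ASCII text the gap narrows
        System.out.println("UTF-8  bytes: " + mixed.getBytes(StandardCharsets.UTF_8).length);    // 10
        System.out.println("UTF-16 bytes: " + mixed.getBytes(StandardCharsets.UTF_16BE).length); // 14
    }
}
```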
There are three basic variants of UTF-16 and UTF-32, which are as follows:
- BE: Big-endian byte serialization (most significant byte first)
- LE: Little-endian byte serialization (least significant byte first)
- Unmarked: follows big-endian byte serialization by default.
For example: UTF-16, UTF-32, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE.
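A short sketch (again assuming Java's standard charsets) makes the byte-order difference visible; note that Java's unmarked UTF-16 charset writes a byte order mark (BOM) before the big-endian data:

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderDemo {
    static void dump(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ": ");
        for (byte b : bytes) sb.append(String.format("%02X ", b));
        System.out.println(sb);
    }

    public static void main(String[] args) {
        String s = "A";  // U+0041
        dump("UTF-16BE", s.getBytes(StandardCharsets.UTF_16BE)); // 00 41
        dump("UTF-16LE", s.getBytes(StandardCharsets.UTF_16LE)); // 41 00
        dump("UTF-16  ", s.getBytes(StandardCharsets.UTF_16));   // FE FF 00 41 (BOM + big-endian)
    }
}
```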