Tuesday, March 27, 2007

HOW YOUR TEXTUAL CODE IS READ:

Java programs are written using the Unicode character set. The first 128 characters of this set are the same as those of the American Standard Code for Information Interchange (ASCII, ANSI X3.4), which is why you can use plain text editors like Notepad or WordPad, which come bundled with the Windows OS, to write Java programs. Text is represented in UTF-16, as a sequence of 16-bit code units.

Some APIs, primarily the Character class, use 32-bit integers (int) to represent code points as individual entities, because a code point above U+FFFF needs two 16-bit code units.
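The difference between 16-bit code units and 32-bit code points can be seen with a short sketch. The class name CodePointSketch and the method describe are only illustrative; the calls used (Character.toChars, Character.charCount, String.codePointCount, String.codePointAt) are standard since Java 5.

```java
class CodePointSketch {

    // Reports how many UTF-16 code units and how many code points a string holds.
    static String describe(String s) {
        return s.length() + " code units, "
                + s.codePointCount(0, s.length()) + " code points";
    }

    public static void main(String[] args) {
        // U+0055 fits in a single 16-bit code unit.
        System.out.println(describe("U"));

        // U+1D11E (musical G clef) lies above U+FFFF, so UTF-16 needs two code units.
        int clef = 0x1D11E; // the code point held in a 32-bit int
        String s = new String(Character.toChars(clef));
        System.out.println(describe(s));

        // The Character class works on the int value directly.
        System.out.println(Character.charCount(clef));
        System.out.println(Integer.toHexString(s.codePointAt(0)));
    }
}
```

The first string reports one code unit and one code point; the second reports two code units but still a single code point.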

LEXICAL TRANSLATIONS OF TEXT SEQUENCES:

Lexical translation is the process of turning the raw characters of your written program into tokens. It is carried out in this order:

  1. Unicode escapes first. A Unicode escape is of the form \uxxxx, where xxxx are 4 hexadecimal digits giving the value of a UTF-16 code unit. These escapes are translated to the corresponding Unicode character.
  2. The Unicode characters resulting from (1) are then translated into a stream of input characters and line terminators.
  3. The stream from (2) is then translated into a sequence of input elements. Input elements are tokens (identifiers, keywords, literals, separators, operators), whitespace and comments (to be dealt with later). Once whitespace and comments are discarded, the tokens are the terminal symbols of the Java syntax, i.e. the grammar that decides whether a token sequence is a syntactically correct program.

STEP ONE: UNICODE ESCAPES.

Unicode escapes have the following format: \uxxxx, where \ and u are ASCII characters and xxxx are 4 hexadecimal digits.

e.g. \u0055 translates to the uppercase letter U.

Notes:

  1. if a backslash is preceded by an even number of other backslashes (zero counts as even), it is eligible to begin a Unicode escape.
  2. if a backslash is preceded by an odd number of other backslashes, it is not eligible to begin a Unicode escape.
  3. if an eligible backslash is not followed by u, it is treated like every other raw input character.
  4. if an eligible backslash is followed by any number of u(s) but these are not followed by 4 hexadecimal digits, the result is a compile-time error. Only the first 4 digits belong to the escape; any further digits are ordinary characters.

An illustrative UnicodeExamples class:

package examples;

class UnicodeExamples {

    public static void main(String[] args) {

        // A Unicode escape: backslash, u and 4 hex digits. Prints U.
        System.out.println("The first unicode: " + "\u0055");

        // Two backslashes: the second is preceded by one (odd) backslash,
        // so it cannot begin a Unicode escape. Prints a backslash, then u0055.
        System.out.println("The second, even: " + "\\u0055");

        // Three backslashes: the third is preceded by two (even) backslashes,
        // so it does begin a Unicode escape. Prints a backslash, then U.
        System.out.println("The third, odd: " + "\\\u0055");

        // Without the small u this is not a Unicode escape. Inside a string
        // literal it is the octal escape for character 5, followed by '5'.
        System.out.println("No u, raw input: " + "\0055");

        // If an escape has fewer than 4 hexadecimal digits, the
        // compiler should report an invalid Unicode escape.
        // I've removed the example, but try it yourself.
    }
}

Afternote: I compiled the above in the Eclipse IDE.
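The eligibility rule can also be sketched as a small translator that mimics step one on a piece of raw source text. This is only an illustration under simplifying assumptions, not how any real compiler is written: the class EscapeTranslatorSketch and its translate method are hypothetical names, a malformed escape throws an exception instead of producing a compile-time error, and the backslash count is simply reset after each translated escape.

```java
class EscapeTranslatorSketch {

    private static final String HEX_DIGITS = "0123456789abcdefABCDEF";

    // Translates Unicode escapes in raw source text, roughly as step one does.
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int preceding = 0; // contiguous backslashes immediately before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            // A backslash is eligible only if an even number of backslashes precede it.
            boolean eligible = c == '\\' && preceding % 2 == 0;
            if (eligible && i + 1 < raw.length() && raw.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++; // any number of u characters is allowed
                }
                if (j + 4 > raw.length()) {
                    throw new IllegalArgumentException("incomplete escape at " + i);
                }
                String hex = raw.substring(j, j + 4);
                for (int k = 0; k < 4; k++) {
                    if (HEX_DIGITS.indexOf(hex.charAt(k)) < 0) {
                        throw new IllegalArgumentException("bad hex digit at " + (j + k));
                    }
                }
                out.append((char) Integer.parseInt(hex, 16));
                i = j + 4;
                preceding = 0; // simplification: the translated character is not counted
            } else {
                out.append(c);
                preceding = (c == '\\') ? preceding + 1 : 0;
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Each Java literal below doubles the backslashes to build the raw text.
        System.out.println(translate("\\u0055"));       // raw text: one backslash
        System.out.println(translate("\\\\u0055"));     // raw text: two backslashes
        System.out.println(translate("\\\\\\u0055"));   // raw text: three backslashes
        System.out.println(translate("\\uuu0055"));     // raw text: several u characters
    }
}
```

With one or three backslashes the escape is translated to U; with two it is left alone, which matches the three println calls in the example above.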

STEP TWO: LINE TERMINATORS.

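Line terminators are the ASCII characters CR (return), LF (newline), or the pair CR LF, which counts as a single terminator. Note that compilers use line terminators to number the lines of your code, which makes errors easy to locate and edit. Remember the order of lexical translation: Unicode escapes are translated in step one, so \u000a written in your source becomes a real line terminator before any literal is recognized. That is why we use the '\n' (newline) and '\r' (carriage return) char literals for the line terminators in Java (see below!)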
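As a rough illustration of how the three terminators divide text into numbered lines, here is a minimal sketch. The class LineTerminatorSketch and the method splitLines are made-up names, and a real compiler does not use a regular expression for this; the point is only that CR LF is tried first, so it counts as one terminator.

```java
class LineTerminatorSketch {

    // Splits text on CR LF, CR, or LF. CR LF is listed first so the pair
    // is consumed as a single terminator rather than two.
    static String[] splitLines(String text) {
        return text.split("\r\n|\r|\n", -1);
    }

    public static void main(String[] args) {
        String text = "first\r\nsecond\rthird\nfourth";
        String[] lines = splitLines(text);
        // Number the lines the way an editor or compiler message would.
        for (int i = 0; i < lines.length; i++) {
            System.out.println((i + 1) + ": " + lines[i]);
        }
    }
}
```

The sample text contains one terminator of each kind and is reported as four lines.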

STEP THREE: INPUT ELEMENTS AND TOKENS

Input elements are either whitespace, comments or tokens. Whitespace is significant at compile time where operators are involved: it separates tokens, so it cannot appear inside a multi-character operator.

e.g. a = 2; //assigns 2 to a

a += 2; //adds 2 to a, giving 4

but a + = 2; //a syntax error: + and = are read as two separate tokens

In a = b; we say that a is to the left of the operator and b is to the right.
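The same point as a compilable sketch (the class name WhitespaceSketch and the method compute are only illustrative): whitespace between tokens is optional and can be of any length, but it may not split a token such as +=.

```java
class WhitespaceSketch {

    static int compute() {
        int a;
        a = 2;        // assignment: a is now 2
        a += 2;       // "+=" is one token: a is now 4
        a+=2;         // whitespace around a token is optional: a is now 6
        a   +=
            2;        // extra spaces and line terminators between tokens are ignored: a is now 8
        // a + = 2;   // would not compile: "+" and "=" become two separate tokens
        return a;
    }

    public static void main(String[] args) {
        System.out.println(compute());
    }
}
```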

TO BE CONTINUED
