Google Chrome Unicode normalization and য়, ড়, ঢ় problem

If you use Google Chrome and write Bangla, you may have already run into this problem. Every time you send a POST request, Chrome automatically normalizes the Unicode characters in the posted data (it appears to affect POST data only). In Bangla, this normalization affects three characters: য়, ড় and ঢ়.

What does Chrome actually do?

If you look carefully, each of these three characters has a dot (.) underneath. Bangla also has three other characters that look the same but have no dot. So there are actually six characters: ড, ঢ, য, ড়, ঢ়, য়. Unicode gives each of the last three its own code point (U+09DC, U+09DD and U+09DF), but each can also be written as one of the first three followed by a separate dot, the nukta (U+09BC). Every time a request contains one of the last three characters, Chrome converts it to the corresponding base character followed by the dot. This is called normalization. It happens only to data in the HTTP request body. Cookies, headers and the query string are not affected, since they travel in the request line and headers rather than in the body. I suspect it also happens with PUT requests.
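To make the two forms concrete, here is a small sketch that builds each letter both ways and prints the bytes. The function name banglaNuktaForms is my own, made up for illustration. The strings are written as UTF-8 byte escapes so that no editor or browser can normalize them along the way.

[code language="PHP"]
// Hypothetical helper: returns each affected letter in both encodings.
function banglaNuktaForms() {
    $nukta = "\xE0\xA6\xBC"; // U+09BC, the dot
    return array(
        // name => array(single code point, base letter + nukta)
        'RRA (U+09DC)' => array("\xE0\xA7\x9C", "\xE0\xA6\xA1" . $nukta),
        'RHA (U+09DD)' => array("\xE0\xA7\x9D", "\xE0\xA6\xA2" . $nukta),
        'YYA (U+09DF)' => array("\xE0\xA7\x9F", "\xE0\xA6\xAF" . $nukta),
    );
}

foreach (banglaNuktaForms() as $name => $forms) {
    list($single, $pair) = $forms;
    // Same glyph on screen, different bytes on the wire.
    printf("%s: %s vs %s, equal: %s\n", $name, bin2hex($single), bin2hex($pair),
        $single === $pair ? 'yes' : 'no');
}
[/code]

Each line of output shows a 3-byte form next to a 6-byte form, and a plain string comparison treats them as different.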

An Example

Let's say we submit a form whose request method is POST and which has one input field. If you type “গাঢ় সবুজ পেয়াড়া” (a phrase that contains all three problem characters) and submit the form, Chrome sends “গাঢ় সবুজ পেয়াড়া”. The two strings look alike, but they are different! The hex dumps below show the bytes before and after. Extra spaces in the first line keep it aligned with the second, so each gap marks a place where bytes were changed and added.

before: hex(গাঢ় সবুজ পেয়াড়া)= e0a697e0a6bee0a79d      20e0a6b8e0a6ace0a781e0a69c20e0a6aae0a787e0a79f      e0a6bee0a79c      e0a6be
after:  hex(গাঢ় সবুজ পেয়াড়া)= e0a697e0a6bee0a6a2e0a6bc20e0a6b8e0a6ace0a781e0a69c20e0a6aae0a787e0a6afe0a6bce0a6bee0a6a1e0a6bce0a6be
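The change can be reproduced outside the browser. The sketch below is mine and is not Chrome's actual code: it applies the same three substitutions to the "before" bytes and checks the result against the "after" bytes from the dump above. The function name simulateChromeDecompose is hypothetical, and hex2bin() needs PHP 5.4 or later.

[code language="PHP"]
// Hypothetical function: mimics the observed effect for these three letters only.
function simulateChromeDecompose($text) {
    $nukta = "\xE0\xA6\xBC"; // U+09BC
    return strtr($text, array(
        "\xE0\xA7\x9C" => "\xE0\xA6\xA1" . $nukta, // U+09DC => U+09A1 U+09BC
        "\xE0\xA7\x9D" => "\xE0\xA6\xA2" . $nukta, // U+09DD => U+09A2 U+09BC
        "\xE0\xA7\x9F" => "\xE0\xA6\xAF" . $nukta, // U+09DF => U+09AF U+09BC
    ));
}

// The "before" dump with the alignment spaces removed.
$before = hex2bin('e0a697e0a6bee0a79d' . '20e0a6b8e0a6ace0a781e0a69c20e0a6aae0a787e0a79f'
    . 'e0a6bee0a79c' . 'e0a6be');
$after = 'e0a697e0a6bee0a6a2e0a6bc20e0a6b8e0a6ace0a781e0a69c20e0a6aae0a787e0a6afe0a6bce0a6bee0a6a1e0a6bce0a6be';

var_dump(bin2hex(simulateChromeDecompose($before)) === $after);
[/code]

The string grows from 41 to 50 bytes: each of the three letters gains a 3-byte nukta.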

Key points:

  • This normalization applies to any data in the HTTP request body, so only POST and (I suspect) PUT requests are affected. Cookie, header and query string data are unaffected.
  • The inconsistency between the request body and the rest of the request points to this being a Chrome bug.
  • Either the whole HTTP request should be normalized, or none of it.
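One way to observe this inconsistency on the server is to send the same value in both the body and the query string and compare what arrives. The sketch below is only an illustration under that assumption. The function name bodyWasDecomposed and the field name data are made up for this example.

[code language="PHP"]
// Hypothetical check: was the body copy of a value decomposed while the
// query-string copy was left alone?
function bodyWasDecomposed($fromBody, $fromQuery) {
    // Base letter + nukta (U+09BC) => single code point.
    $recompose = array(
        "\xE0\xA6\xA1\xE0\xA6\xBC" => "\xE0\xA7\x9C", // => U+09DC
        "\xE0\xA6\xA2\xE0\xA6\xBC" => "\xE0\xA7\x9D", // => U+09DD
        "\xE0\xA6\xAF\xE0\xA6\xBC" => "\xE0\xA7\x9F", // => U+09DF
    );
    // The two copies differ as raw bytes but match once both are recomposed.
    return $fromBody !== $fromQuery
        && strtr($fromBody, $recompose) === strtr($fromQuery, $recompose);
}

// Assumes the form posts to a URL that repeats the field, e.g. ?data=...
if (isset($_POST['data'], $_GET['data'])) {
    var_dump(bodyWasDecomposed($_POST['data'], $_GET['data']));
}
[/code]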

Solution:

Now that you understand the problem, you know how to solve it: file a bug with the Google Chrome team. Until Google fixes it, you can replace those characters in your web application. Here is a snippet I have written to fix this in PHP. It converts each base letter plus dot back to the original single character.

[code language="PHP"]
class DeNormalOntosteo {
    // Decomposed sequence (base letter + nukta U+09BC) => single code point.
    // Written as UTF-8 byte escapes so nothing can normalize them in transit.
    private static $strmap = array(
        "\xE0\xA6\xAF\xE0\xA6\xBC" => "\xE0\xA7\x9F", // U+09AF U+09BC => U+09DF
        "\xE0\xA6\xA1\xE0\xA6\xBC" => "\xE0\xA7\x9C", // U+09A1 U+09BC => U+09DC
        "\xE0\xA6\xA2\xE0\xA6\xBC" => "\xE0\xA7\x9D", // U+09A2 U+09BC => U+09DD
    );

    public static function replace($data) {
        if (is_array($data)) {
            // array_map() keeps the keys; recursing handles nested form arrays.
            return array_map(array(__CLASS__, 'replace'), $data);
        } elseif (is_string($data)) {
            // Plain substring replacement is enough here; no regex is needed.
            return strtr($data, self::$strmap);
        } else {
            return false;
        }
    }
}
[/code]

Usage

[code lang="PHP"]
// Denormalizing $_POST array
$_POST = DeNormalOntosteo::replace($_POST);
// Denormalizing a string
$_POST['data'] = DeNormalOntosteo::replace($_POST['data']);
[/code]

Update 1:

I have created a page where you can see the bug in action. You must use Google Chrome to browse this page. Just visit it and press submit.

Update 2: 

I have filed a bug with the Chromium team on Google Code. If you are having the same issue, please give them a knock here.