Strategy for defining Unicode ranges by culture


I am new to Unicode. I have been given a requirement to look at some translated text, iterate over all the characters in that translation, and determine whether every character is valid for the target culture (language and location). For example, if I am translating a document from English to Greek, I want to detect any English/ASCII characters in the Greek translation and report them as an error. This could happen with corrupted data from a translation memory.

Is there any existing grouping of Unicode characters by culture? Or is there an existing strategy for developing this kind of grouping? I see that there are some groupings of characters, but at first glance they do not seem sufficient.

Is there anything like "the valid Unicode characters for Spanish-Spain are: [some Unicode category(s)]" or "here are the valid Unicode characters for Russian-Russia: [some Unicode category(s)]", or has someone developed a strategy for defining them?

If this is not the right place to ask this question, I would welcome any pointers to a better place to ask it.

The closest thing to this is CLDR (the Common Locale Data Repository). It is not part of the Unicode Standard, but it is an activity and resource managed by the Unicode Consortium. Its specification defines the format of the locale data, including sets of "main/standard", "auxiliary", "index", and "punctuation" characters.

For Greek, it contains only Greek letters and some basic punctuation marks. Like all such data in CLDR, it is subjective to a great extent. And although the CLDR process is meant to produce well-reviewed data on a consensus basis, the reality is different. It can be argued that Latin letters are not unusual in ordinary Greek texts, especially in technical areas. For example, the international symbol "A" for the ampere is a Latin letter, and the symbol for the kilogram is "kg" in Latin letters, even though the word itself is written in Greek letters in Greek.

Thus, regardless of how you run the analysis, an occurrence of Latin "A" in Greek text can be flagged as potentially suspicious, but it is not necessarily an error.
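To make the "flag, don't reject" idea concrete, here is a minimal Python sketch using only the standard library. The `GREEK_MAIN` set and the `find_suspicious` function are hypothetical names invented for illustration, and the set is a small hand-written stand-in for a locale's main characters, not actual CLDR data. The check looks only at letters, so digits, spaces and punctuation are ignored.

```python
import unicodedata

# Hypothetical, hand-written stand-in for a locale's "main" character set.
# Real data would come from CLDR; this is only the basic Greek alphabet
# plus the common accented vowels. It is normalized to NFC up front.
GREEK_MAIN = set(unicodedata.normalize(
    "NFC",
    "αβγδεζηθικλμνξοπρσςτυφχψω"
    "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ"
    "άέήίόύώϊϋΐΰΆΈΉΊΌΎΏΪΫ",
))


def find_suspicious(text, allowed):
    """Return (index, char, name) for each letter that is not in `allowed`.

    Indices refer to the NFC-normalized form of `text`.
    """
    text = unicodedata.normalize("NFC", text)
    hits = []
    for i, ch in enumerate(text):
        # Only letters (general category L*) are checked.
        if unicodedata.category(ch).startswith("L") and ch not in allowed:
            hits.append((i, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# The Latin "kg" and "A" here are legitimate unit symbols, so the result
# is reported as a warning for a human to review, not as a hard error.
sample = "Βάρος: 5 kg, ρεύμα: 2 A"
for index, ch, name in find_suspicious(sample, GREEK_MAIN):
    print(f"warning: {ch!r} at index {index} ({name})")
```

In practice you would probably also want an allow-list of known exceptions (unit symbols, product names, and so on) so that reviewers are not flooded with warnings.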

There are C/C++ and Java libraries that implement access to CLDR data.

