/*  -*- C++ -*-
    SPDX-FileCopyrightText: 1998 Netscape Communications Corporation <developer@mozilla.org>

    SPDX-License-Identifier: MIT
*/

#ifndef nsHebrewProber_h__
#define nsHebrewProber_h__

#include "nsSBCharSetProber.h"
namespace kencodingprober
{
// This prober doesn't actually recognize a language or a charset.
// It is a helper prober for the use of the Hebrew model probers
class KCODECS_NO_EXPORT nsHebrewProber : public nsCharSetProber
{
public:
    nsHebrewProber(void)
        : mLogicalProb(nullptr)
        , mVisualProb(nullptr)
    {
        Reset();
    }

    ~nsHebrewProber(void) override
    {
    }
    nsProbingState HandleData(const char *aBuf, unsigned int aLen) override;
    const char *GetCharSetName() override;
    void Reset(void) override;

    nsProbingState GetState(void) override;

    float GetConfidence(void) override
    {
        return (float)0.0;
    }
    void SetOpion() override
    {
    }

    void SetModelProbers(nsCharSetProber *logicalPrb, nsCharSetProber *visualPrb)
    {
        mLogicalProb = logicalPrb;
        mVisualProb = visualPrb;
    }

#ifdef DEBUG_PROBE
    void DumpStatus() override;
#endif

protected:
    static bool isFinal(char c);
    static bool isNonFinal(char c);

    int mFinalCharLogicalScore, mFinalCharVisualScore;

    // The two last characters seen in the previous buffer.
    char mPrev, mBeforePrev;

    // These probers are owned by the group prober.
    nsCharSetProber *mLogicalProb, *mVisualProb;
};
}

/**
 * ** General ideas of the Hebrew charset recognition **
 *
 * Four main charsets exist in Hebrew:
 * "ISO-8859-8" - Visual Hebrew
 * "windows-1255" - Logical Hebrew
 * "ISO-8859-8-I" - Logical Hebrew
 * "x-mac-hebrew" - ?? Logical Hebrew ??
 *
 * Both "ISO" charsets use a completely identical set of code points, whereas
 * "windows-1255" and "x-mac-hebrew" are two different proper supersets of
 * these code points. windows-1255 defines additional characters in the range
 * 0x80-0x9F as some misc punctuation marks as well as some Hebrew-specific
 * diacritics and additional 'Yiddish' ligature letters in the range 0xc0-0xd6.
 * x-mac-hebrew defines similar additional code points but with a different
 * mapping.
 *
 * As far as an average Hebrew text with no diacritics is concerned, all four
 * charsets are identical with respect to code points. That is, for the
 * main Hebrew alphabet, all four map the same values to all 27 Hebrew letters
 * (including final letters).
 *
 * The dominant difference between these charsets is their directionality.
 * "Visual" directionality means that the text is ordered as if the renderer is
 * not aware of a BIDI rendering algorithm. The renderer sees the text and
 * draws it from left to right. The text itself when ordered naturally is read
 * backwards. A buffer of Visual Hebrew generally looks like so:
 * "[last word of first line spelled backwards] [whole line ordered backwards
 * and spelled backwards] [first word of first line spelled backwards]
 * [end of line] [last word of second line] ... etc' "
 * Punctuation marks, numbers and English text added to visual text are
 * naturally also "visual" and run from left to right.
 *
 * "Logical" directionality means the text is ordered "naturally" according to
 * the order it is read. It is the responsibility of the renderer to display
 * the text from right to left. A BIDI algorithm is used to place general
 * punctuation marks, numbers and English text in the text.
 *
 * Texts in x-mac-hebrew are almost impossible to find on the Internet.
 * From what little evidence I could find, it seems that its general
 * directionality is Logical.
 *
 * To sum up all of the above, the Hebrew probing mechanism knows about two
 * charsets:
 * Visual Hebrew - "ISO-8859-8" - backwards text - Words and sentences are
 *    backwards while line order is natural. For charset recognition purposes
 *    the line order is unimportant (In fact, for this implementation, even
 *    word order is unimportant).
 * Logical Hebrew - "windows-1255" - normal, naturally ordered text.
 *
 * "ISO-8859-8-I" is a subset of windows-1255 and doesn't need to be
 *    specifically identified.
 * "x-mac-hebrew" is also identified as windows-1255. A text in x-mac-hebrew
 *    that contains special punctuation marks or diacritics is displayed with
 *    some unconverted characters showing as question marks. This problem might
 *    be corrected using another model prober for x-mac-hebrew. Because
 *    x-mac-hebrew texts are so rare, writing another model prober isn't
 *    worth the effort and performance hit.
 *
 * *** The Prober ***
 *
 * The prober is divided between two nsSBCharSetProbers and an nsHebrewProber,
 * all of which are created, managed, fed data, queried and deleted by the
 * nsSBCSGroupProber. The two nsSBCharSetProbers identify that the text is in
 * fact some kind of Hebrew, Logical or Visual. The final decision between the
 * two is made by the nsHebrewProber, which combines its final-letter scores
 * with the scores of the two nsSBCharSetProbers to produce a final answer.
 *
 * The nsSBCSGroupProber is responsible for stripping the original text of HTML
 * tags, English characters, numbers, low-ASCII punctuation characters, spaces
 * and new lines. It reduces any sequence of such characters to a single space.
 * The buffer fed to each prober in the SBCS group prober is pure text in
 * high-ASCII.
 * The two nsSBCharSetProbers (model probers) share the same language model:
 * Win1255Model.
 * The first nsSBCharSetProber uses the model normally as any other
 * nsSBCharSetProber does, to recognize windows-1255, upon which this model was
 * built. The second nsSBCharSetProber is told to make the pair-of-letter
 * lookup in the language model backwards. This in practice exactly simulates
 * a visual Hebrew model using the windows-1255 logical Hebrew model.
 *
 * The nsHebrewProber does not use any language model. All it does is look for
 * final-letter evidence suggesting the text is either logical Hebrew or visual
 * Hebrew. Separated from the model probers, the results of the nsHebrewProber
 * alone are meaningless. nsHebrewProber always returns 0.00 as confidence
 * since it never identifies a charset by itself. Instead, the pointer to the
 * nsHebrewProber is passed to the model probers as a helper "Name Prober".
 * When the Group prober receives a positive identification from any prober,
 * it asks for the name of the charset identified. If the prober queried is a
 * Hebrew model prober, the model prober forwards the call to the
 * nsHebrewProber to make the final decision. In the nsHebrewProber, the
 * decision is made according to the final-letter scores it maintains and both
 * model probers' scores. The answer is returned in the form of the name of the
 * charset identified, either "windows-1255" or "ISO-8859-8".
 *
 */
#endif /* nsHebrewProber_h__ */
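
// Illustrative sketch only, not part of this header or of the actual
// nsHebrewProber implementation. It shows one plausible reading of the
// "final-letter evidence" described above: a word that ends in a final-form
// letter hints at logical order, while a word that starts with a final-form
// letter, or ends in the regular form of such a letter, hints at visual order.
// The namespace `sketch`, the names `FinalLetterScores`, `scoreFinalLetters`
// and `pickCharset`, and both threshold defaults are assumptions made for this
// example. The byte values are the windows-1255 / ISO-8859-8 code points of
// the five Hebrew letters that have a distinct final form.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch // hypothetical namespace, for illustration only
{
// final kaf, final mem, final nun, final pe, final tsadi
inline bool isFinalForm(unsigned char c)
{
    return c == 0xEA || c == 0xED || c == 0xEF || c == 0xF3 || c == 0xF5;
}

// regular kaf, mem, nun, pe, tsadi
inline bool isNonFinalForm(unsigned char c)
{
    return c == 0xEB || c == 0xEE || c == 0xF0 || c == 0xF4 || c == 0xF6;
}

struct FinalLetterScores {
    int logical;
    int visual;
};

// Assumes `text` is already filtered: high-ASCII words separated by single spaces.
inline FinalLetterScores scoreFinalLetters(const std::string &text)
{
    FinalLetterScores scores = {0, 0};
    const std::size_t n = text.size();
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (c == ' ') {
            continue;
        }
        const bool wordStart = (i == 0) || text[i - 1] == ' ';
        const bool wordEnd = (i + 1 == n) || text[i + 1] == ' ';
        if (wordStart && wordEnd) {
            continue; // a one-letter word carries no directional evidence
        }
        if (wordEnd && isFinalForm(c)) {
            ++scores.logical; // final form at the end of a word: natural order
        } else if (wordEnd && isNonFinalForm(c)) {
            ++scores.visual; // regular form where a final form is expected
        } else if (wordStart && isFinalForm(c)) {
            ++scores.visual; // final form at the start of a word: reversed word
        }
    }
    return scores;
}

// Combines the letter scores with the two model probers' confidences.
// The default thresholds are placeholders chosen for this sketch.
inline const char *pickCharset(const FinalLetterScores &s,
                               float logicalConf,
                               float visualConf,
                               int minLetterGap = 5,
                               float minModelGap = 0.01f)
{
    const int letterGap = s.logical - s.visual;
    if (letterGap >= minLetterGap) {
        return "windows-1255";
    }
    if (letterGap <= -minLetterGap) {
        return "ISO-8859-8";
    }
    const float modelGap = logicalConf - visualConf;
    if (modelGap > minModelGap) {
        return "windows-1255";
    }
    if (modelGap < -minModelGap) {
        return "ISO-8859-8";
    }
    return letterGap < 0 ? "ISO-8859-8" : "windows-1255";
}
}
```

// For example, the bytes "\xF9\xEC\xE5\xED" (a word ending in final mem) add
// one point to the logical score, and the same bytes in reverse order add one
// point to the visual score. The real prober also carries the last two
// characters across buffers (mPrev, mBeforePrev), which this sketch omits.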