
Different behavior of an algorithm when working with UTF-8 on different operating systems

The simple code of the algorithm:

#include <iostream>
#include <string>

std::string::size_type GetLengthWithUTF(std::string &sValue);

int main()
{
    std::string sTestValueUTF8 = "\xD0\xB6\xD0\xB6\xD0\xB6";
    std::string sTestValueASCII = "\x67\x67\x67";
    std::string sTestValueMIX = "\x67\x67\x67\xD0\xB6\xD0\xB6\xD0\xB6";
    std::string::size_type iFuncResult = 0;

    std::cout << "=========== START TEST ==========\n\n";

    std::cout << "+TEST UTF8 STRING\n";
    std::cout << "+----+Bytes of string (sTestValueUTF8.length()) = " << sTestValueUTF8.length() << "\n";
    iFuncResult = GetLengthWithUTF(sTestValueUTF8);
    std::cout << "+----+Function result (GetLengthWithUTF(\"" << sTestValueUTF8 << "\")) = " << iFuncResult << "\n\n";

    std::cout << "+TEST ASCII STRING\n";
    std::cout << "+----+Bytes of string (sTestValueASCII.length()) = " << sTestValueASCII.length() << "\n";
    iFuncResult = GetLengthWithUTF(sTestValueASCII);
    std::cout << "+----+Function result (GetLengthWithUTF(\"" << sTestValueASCII << "\")) = " << iFuncResult << "\n\n";

    std::cout << "+TEST MIX STRING\n";
    std::cout << "+----+Bytes of string (sTestValueMIX.length()) = " << sTestValueMIX.length() << "\n";
    iFuncResult = GetLengthWithUTF(sTestValueMIX);
    std::cout << "+----+Function result (GetLengthWithUTF(\"" << sTestValueMIX << "\")) = " << iFuncResult << "\n\n";

    std::cout << "\n=========== END TEST ==========\n\n";
}

std::string::size_type GetLengthWithUTF(std::string &sValue)
{
    std::cout << " +----+START GetLengthWithUTF\n";
    std::cout << " +Input string is: " << sValue << "\n";
    std::string::size_type i;
    std::cout << " +Start cycle\n";
    int iCountUTF8characters = 0;
    for (i = 0; i < sValue.length(); i++)
    {
        std::cout << " +----+Iteration N " << i << "\n";
        std::cout << " +Current character is: " << sValue[i] << ",integer value = " << (int)sValue[i] << "\n";
        if (sValue[i] > 127)
        {
            iCountUTF8characters++;
            std::cout << " +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: " << iCountUTF8characters << "\n";
        }
        else
        {
            std::cout << " +----+If statement (sValue[i] > 127) is false.\n";
        }
    }
    std::cout << " +End cycle\n";
    iCountUTF8characters = iCountUTF8characters / 2;
    std::cout << " +Return sValue.length() - (iCountUTF8characters / 2) ---> " << sValue.length() << " - (" << iCountUTF8characters << " / 2) = " << (sValue.length() - (std::string::size_type)iCountUTF8characters) << "\n";
    std::cout << " +----+ASCIID GetLengthWithUTF\n";
    return (sValue.length() - (std::string::size_type)iCountUTF8characters);
}

Console compile commands:

AIX 6

g++ -o test test.cpp

RHEL Server 6.7 Santiago

g++ -o test test.cpp

Microsoft Windows v10.0.14393

cl /EHsc test.cpp

Results:

AIX 6

=========== START TEST ==========

+TEST UTF8 STRING
+----+Bytes of string (sTestValueUTF8.length()) = 6
 +----+START GetLengthWithUTF
 +Input string is: жжж
 +Start cycle
 +----+Iteration N 0
 +Current character is: Ь integer value = 208
 +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: 1
 +----+Iteration N 1
 +Current character is: ֬ integer value = 182
 +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: 2
 +----+Iteration N 2
 +Current character is: Ь integer value = 208
 +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: 3
 +----+Iteration N 3
 +Current character is: ֬ integer value = 182
 +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: 4
 +----+Iteration N 4
 +Current character is: Ь integer value = 208
 +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: 5
 +----+Iteration N 5
 +Current character is: ֬ integer value = 182
 +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: 6
 +End cycle
 +Return sValue.length() - (iCountUTF8characters / 2) ---> 6 - (3 / 2) = 3
 +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("жжж")) = 3

+TEST ASCII STRING
+----+Bytes of string (sTestValueASCII.length()) = 3
 +----+START GetLengthWithUTF
 +Input string is: ggg
 +Start cycle
 +----+Iteration N 0
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 1
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 2
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +End cycle
 +Return sValue.length() - (iCountUTF8characters / 2) ---> 3 - (0 / 2) = 3
 +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("ggg")) = 3

+TEST MIX STRING
+----+Bytes of string (sTestValueMIX.length()) = 9
 +----+START GetLengthWithUTF
 +Input string is: gggжжж
 +Start cycle
 +----+Iteration N 0
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 1
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 2
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 3
 +Current character is: Ь integer value = 208
 +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: 1
 +----+Iteration N 4
 +Current character is: ֬ integer value = 182
 +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: 2
 +----+Iteration N 5
 +Current character is: Ь integer value = 208
 +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: 3
 +----+Iteration N 6
 +Current character is: ֬ integer value = 182
 +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: 4
 +----+Iteration N 7
 +Current character is: Ь integer value = 208
 +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: 5
 +----+Iteration N 8
 +Current character is: ֬ integer value = 182
 +----+If statement (sValue[i] > 127) is true,value of iCountUTF8characters is: 6
 +End cycle
 +Return sValue.length() - (iCountUTF8characters / 2) ---> 9 - (3 / 2) = 6
 +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("gggжжж")) = 6


=========== END TEST ==========

RHEL Server 6.7 Santiago

=========== START TEST ==========

+TEST UTF8 STRING
+----+Bytes of string (sTestValueUTF8.length()) = 6
 +----+START GetLengthWithUTF
 +Input string is: жжж
 +Start cycle
 +----+Iteration N 0
 +Current character is: Ь integer value = -48
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 1
 +Current character is: ֬ integer value = -74
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 2
 +Current character is: Ь integer value = -48
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 3
 +Current character is: ֬ integer value = -74
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 4
 +Current character is: Ь integer value = -48
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 5
 +Current character is: ֬ integer value = -74
 +----+If statement (sValue[i] > 127) is false.
 +End cycle
 +Return sValue.length() - (iCountUTF8characters / 2) ---> 6 - (0 / 2) = 6
 +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("жжж")) = 6

+TEST ASCII STRING
+----+Bytes of string (sTestValueASCII.length()) = 3
 +----+START GetLengthWithUTF
 +Input string is: ggg
 +Start cycle
 +----+Iteration N 0
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 1
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 2
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +End cycle
 +Return sValue.length() - (iCountUTF8characters / 2) ---> 3 - (0 / 2) = 3
 +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("ggg")) = 3

+TEST MIX STRING
+----+Bytes of string (sTestValueMIX.length()) = 9
 +----+START GetLengthWithUTF
 +Input string is: gggжжж
 +Start cycle
 +----+Iteration N 0
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 1
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 2
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 3
 +Current character is: Ь integer value = -48
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 4
 +Current character is: ֬ integer value = -74
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 5
 +Current character is: Ь integer value = -48
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 6
 +Current character is: ֬ integer value = -74
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 7
 +Current character is: Ь integer value = -48
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 8
 +Current character is: ֬ integer value = -74
 +----+If statement (sValue[i] > 127) is false.
 +End cycle
 +Return sValue.length() - (iCountUTF8characters / 2) ---> 9 - (0 / 2) = 9
 +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("gggжжж")) = 9


=========== END TEST ==========

Microsoft Windows v10.0.14393

=========== START TEST ==========

+TEST UTF8 STRING
+----+Bytes of string (sTestValueUTF8.length()) = 6
 +----+START GetLengthWithUTF
 +Input string is: жжж
 +Start cycle
 +----+Iteration N 0
 +Current character is: Ь integer value = -48
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 1
 +Current character is: ֬ integer value = -74
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 2
 +Current character is: Ь integer value = -48
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 3
 +Current character is: ֬ integer value = -74
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 4
 +Current character is: Ь integer value = -48
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 5
 +Current character is: ֬ integer value = -74
 +----+If statement (sValue[i] > 127) is false.
 +End cycle
 +Return sValue.length() - (iCountUTF8characters / 2) ---> 6 - (0 / 2) = 6
 +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("жжж")) = 6

+TEST ASCII STRING
+----+Bytes of string (sTestValueASCII.length()) = 3
 +----+START GetLengthWithUTF
 +Input string is: ggg
 +Start cycle
 +----+Iteration N 0
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 1
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 2
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +End cycle
 +Return sValue.length() - (iCountUTF8characters / 2) ---> 3 - (0 / 2) = 3
 +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("ggg")) = 3

+TEST MIX STRING
+----+Bytes of string (sTestValueMIX.length()) = 9
 +----+START GetLengthWithUTF
 +Input string is: gggжжж
 +Start cycle
 +----+Iteration N 0
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 1
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 2
 +Current character is: g,integer value = 103
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 3
 +Current character is: Ь integer value = -48
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 4
 +Current character is: ֬ integer value = -74
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 5
 +Current character is: Ь integer value = -48
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 6
 +Current character is: ֬ integer value = -74
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 7
 +Current character is: Ь integer value = -48
 +----+If statement (sValue[i] > 127) is false.
 +----+Iteration N 8
 +Current character is: ֬ integer value = -74
 +----+If statement (sValue[i] > 127) is false.
 +End cycle
 +Return sValue.length() - (iCountUTF8characters / 2) ---> 9 - (0 / 2) = 9
 +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("gggжжж")) = 9


=========== END TEST ==========

The algorithm is supposed to count the number of characters in a string. As you can see from the test results, it works correctly only on AIX.

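I would be very glad if someone could help me understand this absurd behavior of my algorithm on different operating systems. The algorithm was written on AIX. After migrating from AIX to Linux we found the problem, so I ran more extensive tests, the results of which you can see above. My main question is how this damn algorithm works on AIX at all. I cannot explain it in any reasonable way.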

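It appears that the systems differ in how they treat plain char, which the standard allows. Your AIX compiler treats char as unsigned, while the other two systems treat it as signed.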
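If you want to confirm which kind of system you are on, you can ask the compiler directly. The following is only a small sketch: the helper names PlainCharIsSigned, CharCanExceed127 and CheckCharAssumptions are made up for this illustration, and the second assertion assumes 8-bit chars.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// True when plain char is a signed type for this compiler and target.
constexpr bool PlainCharIsSigned()
{
    return std::numeric_limits<char>::is_signed;
}

// True when a plain char can hold a value above 127 at all,
// i.e. when the test sValue[i] > 127 has any chance of succeeding.
constexpr bool CharCanExceed127()
{
    return CHAR_MAX > 127;
}

// Sanity check of the reasoning (assumes CHAR_BIT == 8):
// a signed char tops out at 127, an unsigned one at 255.
inline void CheckCharAssumptions()
{
    assert(PlainCharIsSigned() == (CHAR_MIN < 0));
    assert(CharCanExceed127() == !PlainCharIsSigned());
}
```

Printing PlainCharIsSigned() on each of the three machines should show which group each one falls into.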

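On a system with unsigned char, the condition sValue[i] > 127 behaves exactly as expected. On a system with signed char, however, the same expression can never be true, because a signed 8-bit char cannot hold a value above 127.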

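This is why your code prints negative numbers for bytes with values of 128 and above. For example, 208 becomes -48 when interpreted as a single-byte signed value (208 - 256 = -48).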
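The arithmetic behind those numbers can be checked with a short sketch. The helper names AsSignedByte, AsUnsignedByte and CheckByteArithmetic are hypothetical and exist only for this illustration; two's complement 8-bit bytes are assumed.

```cpp
#include <cassert>

// Interprets a byte value in 0..255 the way an 8-bit signed char stores it
// (two's complement assumed): values from 128 up wrap around by 256.
inline int AsSignedByte(int byteValue)
{
    return byteValue >= 128 ? byteValue - 256 : byteValue;
}

// Reads a char back as a value in 0..255, whatever the signedness of char.
inline int AsUnsignedByte(char c)
{
    return static_cast<unsigned char>(c);
}

// The values printed in the test output above.
inline void CheckByteArithmetic()
{
    assert(AsSignedByte(208) == -48);        // 0xD0, first byte of the Cyrillic letter
    assert(AsSignedByte(182) == -74);        // 0xB6, second byte
    assert(AsSignedByte(103) == 103);        // 'g' is unaffected
    assert(AsUnsignedByte('\xD0') == 208);   // the cast recovers the original value
}
```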

You can fix this by casting to unsigned char, or by using a bit mask to check the eighth bit:

if (sValue[i] & 128) {
    ... // MSB is set
}
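For completeness, here is a hedged sketch of both ideas in a form that does not depend on the signedness of char. IsNonAsciiByte shows the cast-based fix of the original test. CountCodePoints is a different technique from the one in the question: instead of halving the number of high bytes (which only works for two-byte sequences), it counts every byte that is not a UTF-8 continuation byte (10xxxxxx). Both function names are made up for this example, and the input is assumed to be well-formed UTF-8; nothing is validated.

```cpp
#include <cassert>
#include <string>

// Cast-based fix of the original test: true for any byte of a multi-byte sequence.
inline bool IsNonAsciiByte(char c)
{
    return static_cast<unsigned char>(c) > 127;
}

// Counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes well-formed UTF-8; no validation is performed.
inline std::string::size_type CountCodePoints(const std::string &sValue)
{
    std::string::size_type count = 0;
    for (std::string::size_type i = 0; i < sValue.length(); i++)
    {
        const unsigned char byte = static_cast<unsigned char>(sValue[i]);
        if ((byte & 0xC0) != 0x80)
        {
            count++; // ASCII byte or lead byte of a multi-byte sequence
        }
    }
    return count;
}

// The three test strings from the question.
inline void CheckCounts()
{
    assert(CountCodePoints("\xD0\xB6\xD0\xB6\xD0\xB6") == 3);
    assert(CountCodePoints("\x67\x67\x67") == 3);
    assert(CountCodePoints("\x67\x67\x67\xD0\xB6\xD0\xB6\xD0\xB6") == 6);
}
```

With this approach the three test strings should give 3, 3 and 6 on AIX, Linux and Windows alike, since no comparison is ever made on a plain char.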
