Java实现过滤中文乱码

2020-12-28 10:13:07 浏览数 (1)

1. Unicode编码

Unicode编码是一种涵盖了世界上所有语言、标点等字符的编码方式,简单一点说,就是一种通用的世界码;其编码范围:U 0000 .. U 10FFFF。按Unicode硬编码的区间进行划分,Unicode编码被分成若干个block ( Unicode block);每一个Unicode编码专属于唯一的Unicode block,Unicode block之间互不重叠。从码字的本身的属性出发,Unicode编码被分成了若干script ( Unicode script);比如,与中文相关的字符、标点的scriptHan包括block如下:

  • CJK Radicals Supplement
  • Kangxi Radicals
  • CJK Symbols and Punctuation中的15个字符
  • CJK Unified Ideographs Extension A
  • CJK Unified Ideographs
  • CJK Compatibility Ideographs
  • CJK Unified Ideographs Extension B
  • CJK Unified Ideographs Extension C
  • CJK Unified Ideographs Extension D
  • CJK Unified Ideographs Extension E
  • CJK Compatibility Ideographs Supplement

其中,常见的中文字符在CJK Unified Ideographs block;此外,考虑繁体字及不常见字等,CJK还有A、B、C、D、E五个extension。Basic Latin block完整地包含了ASCII码的控制字符、标点字符与英文字母字符。

2. Java的字符编码

JDK完整实现Unicode的block与script:

代码语言:javascript复制
Char c = '☎'
Character.UnicodeBlock ub = Character.UnicodeBlock.of(c)
Character.UnicodeScript uc = Character.UnicodeScript.of(c);</pre>

Java中的字符char内置的编码方式是UTF-16,当char强转成int类型时,其返回值是unicode编码值,只有当getbyte时才返回的是utf-8编码的byte:

<pre>String s = "u00a0";
String.format("\ux", (int) s.charAt(0)) // --> u00a0
import org.apache.commons.codec.binary.Hex;
Hex.encodeHex(s.getBytes()) // --> c2a0

UTF-8是Unicode字符的变长前缀编码的一种实现,二者之间的对应关系在这里.现在我们回到开篇过滤中文乱码的问题,有一个基本解决思路:

UTF-8是Unicode字符的变长前缀编码的一种实现,二者之间的对应关系在这里.现在我们回到开篇过滤中文乱码的问题,有一个基本解决思路:

去掉各种标点字符、控制字符,

计算剩下字符中非中文字符所占的比例,如果超过阈值,则认为该字符串为乱码串

完整代码如下:

代码语言:javascript复制
public class ChineseUtill {

    private static boolean isChinese(char c) {
        Character.UnicodeScript sc = Character.UnicodeScript.of(c);
        if (sc == Character.UnicodeScript.HAN) {
            return true;
        }
        return false;
    }

    public static boolean isPunctuation(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        if (    // punctuation, spacing, and formatting characters
                ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                // symbols and punctuation in the unified Chinese, Japanese and Korean script
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                // fullwidth character or a halfwidth character
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                // vertical glyph variants for east Asian compatibility
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                // vertical punctuation for compatibility characters with the Chinese Standard GB 18030
                || ub == Character.UnicodeBlock.VERTICAL_FORMS
                // ascii
                || ub == Character.UnicodeBlock.BASIC_LATIN
                ) {
            return true;
        } else {
            return false;
        }
    }

    private static Boolean isUserDefined(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        if (ub == Character.UnicodeBlock.NUMBER_FORMS
                || ub == Character.UnicodeBlock.ENCLOSED_ALPHANUMERICS
                || ub == Character.UnicodeBlock.LETTERLIKE_SYMBOLS
                || c == 'ufeff'
                || c == 'u00a0'
                )
            return true;
        return false;
    }

    public static Boolean isMessy(String str)  {
        float chlength = 0;
        float count = 0;
        for(int i = 0; i < str.length(); i  ) {
            char c = str.charAt(i);
            if(isPunctuation(c) || isUserDefined(c))
                continue;
            else {
                if(!isChinese(c)) {
                    count = count   1;
                }
                chlength   ;
            }
        }
        float result = count / chlength;
        if(result > 0.3)
            return true;
        return false;
    }

}

为了得到更为完整的可接受的字符表,定义isUserDefined方法(具体字符表与日志中的字符有关系);加上了Number Forms、Enclosed Alphanumerics、Letterlike Symbols这三个block,以及u00a0(Non-breaking space)字符与ufeff(ZERO WIDTH NO-BREAK SPACE)字符。

0 人点赞