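A Java source file (.java) and the compiled .class file do not necessarily use the same encoding.
A .java source file can be saved in any of several encodings (UTF-8 and GBK are the two that matter below).
That freedom exists because source files are written for people, not for the JVM: the JVM never reads them, so it does not care how they are encoded.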
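When you compile with javac, by default it reads the .java source file using the platform's default encoding and writes the string data in the .class file as UTF-8 (strictly speaking, the class-file format uses a modified UTF-8).
In other words, by default:
on a Unix platform, where the default encoding is usually UTF-8, javac reads the source as UTF-8 and writes the .class as UTF-8;
on a Simplified Chinese Windows platform, where the default encoding is GBK, javac reads the source as GBK and writes the .class as UTF-8.
Since JDK 18 the default charset is UTF-8 on every platform, so this difference applies to older JDKs.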
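So the root cause of garbled Chinese characters is that the .java file was saved (deliberately or not) in an encoding other than the platform default, while javac still reads it using the default.
For example, if you save a source file as UTF-8 on Windows, you must tell javac the file's encoding when you compile, using the -encoding option (javac -encoding UTF-8 followed by the file name). Otherwise the Chinese characters come out garbled.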
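The mismatch described above can be reproduced without javac: encode some text with one charset and decode the bytes with another. This is a minimal sketch, not what javac does internally; the class name `SourceEncodingDemo` and the helper `reinterpret` are made up for illustration, and the fallback to ISO-8859-1 is only an assumption for JDK builds that do not ship GBK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SourceEncodingDemo {
    // Encode text with one charset and decode the bytes with another,
    // mimicking a reader that guesses the file encoding wrong.
    static String reinterpret(String text, Charset savedAs, Charset readAs) {
        byte[] bytes = text.getBytes(savedAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // U+4E2D U+6587, written as escapes so this file compiles under any -encoding.
        String original = "\u4E2D\u6587";

        // Assumption: GBK is an extended charset, so fall back if this JDK lacks it.
        Charset wrong = Charset.isSupported("GBK")
                ? Charset.forName("GBK")
                : StandardCharsets.ISO_8859_1;

        String garbled = reinterpret(original, StandardCharsets.UTF_8, wrong);
        String intact = reinterpret(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println(garbled.equals(original)); // false
        System.out.println(intact.equals(original));  // true
    }
}
```

Writing non-ASCII literals as `\uXXXX` escapes, as done here, is one way to make a source file immune to the encoding guess.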
In Java, the length of a string must not be naively assumed to equal the number of characters it contains!
The following summary and explanation come from Stack Overflow:
A Java char takes always 16 bits.
A Unicode character, when encoded as UTF-16, takes “almost always” (not always) 16 bits: that’s because there are more than 64K unicode characters. Hence, a Java char is NOT a Unicode character (though “almost always” is).
“Almost always”, above, means the 64K first code points of Unicode, range 0x0000 to 0xFFFF (BMP), which take 16 bits in the UTF-16 encoding.
A non-BMP (“rare”) Unicode character is represented as two Java chars (surrogate representation). This applies also to the literal representation as a string: For example, the character U+20000 is written as “\uD840\uDC00”.
Corollary: string.length() returns the number of java chars, not of Unicode chars. A string that has just one “rare” unicode character (eg U+20000) would return length() = 2 . Same consideration applies to any method that deals with char-sequences.
Java has little intelligence for dealing with non-BMP unicode characters as a whole. There are some utility methods that treat characters as code-points, represented as ints eg: Character.isLetter(int ch). Those are the real fully-Unicode methods.
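    String str1 = "\uD840\uDC00";   // U+20000 written as a surrogate pair
    String str2 = "𠀀";             // the same character typed literally

    System.out.println(str2.length());                            // 2: counts chars, not characters
    System.out.println(Character.isLetter(str2.charAt(0)));       // false: a lone high surrogate
    System.out.println(Character.isLetter(str2.codePointAt(0)));  // true: the whole code point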
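To count what a reader would call characters, count code points instead of chars. The sketch below uses the standard `String.codePointCount` and `String.codePoints` methods (the latter needs Java 8 or later); the class name `CodePointLength` and the helper `codePointLength` are illustrative names only.

```java
class CodePointLength {
    // Number of Unicode code points in s; a surrogate pair counts once.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a\uD840\uDC00"; // 'a' followed by U+20000

        System.out.println(s.length());          // 3
        System.out.println(codePointLength(s));  // 2

        // Iterate by code point rather than by char.
        s.codePoints().forEach(cp ->
                System.out.println("U+" + Integer.toHexString(cp).toUpperCase()));
        // prints U+61, then U+20000
    }
}
```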
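CJK is short for Chinese, Japanese, Korean. In practice it refers to the Unicode encoding of the ideographic characters used in the Chinese, Japanese, and Korean writing systems.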
Character set | Unicode range | Block name |
---|---|---|
Radicals supplement | 2E80 - 2EFF | CJK Radicals Supplement |
Kangxi radicals | 2F00 - 2FDF | Kangxi Radicals |
Ideograph structure | 2FF0 - 2FFF | Ideographic Description Characters |
Chinese punctuation | 3000 - 303F | CJK Symbols and Punctuation |
Chinese phonetic symbols (Zhuyin) | 3100 - 312F | Bopomofo |
Zhuyin extension | 31A0 - 31BF | Bopomofo Extended |
Ideograph strokes | 31C0 - 31EF | CJK Strokes |
Extension A | 3400 - 4DBF | CJK Unified Ideographs Extension A |
Basic ideographs | 4E00 - 9FFF | CJK Unified Ideographs |
Compatibility ideographs | F900 - FAFF | CJK Compatibility Ideographs |
Full-width and half-width letters and symbols | FF00 - FFEF | Halfwidth and Fullwidth Forms |
Extension B | 20000 - 2A6DF | CJK Unified Ideographs Extension B |
Extension C | 2A700 - 2B73F | CJK Unified Ideographs Extension C |
Extension D | 2B740 - 2B81F | CJK Unified Ideographs Extension D |
Compatibility supplement | 2F800 - 2FA1F | CJK Compatibility Ideographs Supplement |
The approach is sound; the details can still be adjusted.
    // Decide from the Unicode block whether a char is a Chinese ideograph or punctuation mark
    private static boolean isChinese(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        // Note: a single char can never fall in Extension B (U+20000 and up);
        // that branch only matches when the check is given a full code point (int).
        return ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                || ub == Character.UnicodeBlock.GENERAL_PUNCTUATION;
    }
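Because the isChinese method above takes a char, it sees a supplementary-plane character as two separate surrogates and cannot recognise Extension B. One possible adjustment, sketched here under that assumption, is to pass a code point to the `Character.UnicodeBlock.of(int)` overload. The class name `CjkBlocks` and both method names are illustrative, and the block list is deliberately limited to ideographs (no punctuation).

```java
class CjkBlocks {
    // Code-point variant: unlike a char, an int can hold U+20000 and above.
    static boolean isCjkIdeograph(int codePoint) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(codePoint);
        return ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B;
    }

    // Count the ideographs in s, walking by code point rather than by char.
    static long countIdeographs(String s) {
        return s.codePoints().filter(CjkBlocks::isCjkIdeograph).count();
    }

    public static void main(String[] args) {
        String s = "a\u4E2D\uD840\uDC00"; // 'a', U+4E2D, U+20000

        // Seen as a char, the first half of U+20000 is just a surrogate.
        System.out.println(Character.UnicodeBlock.of(s.charAt(2))
                == Character.UnicodeBlock.HIGH_SURROGATES);   // true

        System.out.println(isCjkIdeograph(s.codePointAt(2))); // true
        System.out.println(countIdeographs(s));               // 2
    }
}
```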
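Space | Unicode escape | Notes |
---|---|---|
No-break space | \u00A0 | Used mainly in Office documents to keep a word from being split at the end of a line; shortcut Ctrl+Shift+Space |
Half-width space (English) | \u0020 | The one normally used in code |
Full-width space (Chinese) | \u3000 | Used in Chinese text |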
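These three spaces look alike but are distinct characters, and `String.trim()` only strips characters up to U+0020, so the other two survive it. A minimal sketch of one way to normalise them before trimming follows; the class name `SpaceNormalizer` and the helper `normalizeSpaces` are illustrative names, not an established API.

```java
class SpaceNormalizer {
    // Replace no-break (U+00A0) and full-width (U+3000) spaces with U+0020.
    static String normalizeSpaces(String s) {
        return s.replace('\u00A0', ' ').replace('\u3000', ' ');
    }

    public static void main(String[] args) {
        String input = "\u3000abc\u00A0";

        // trim() leaves both special spaces in place.
        System.out.println(input.trim().length());                  // 5

        // After normalising, trim() removes them.
        System.out.println(normalizeSpaces(input).trim().length()); // 3
    }
}
```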
    // Replace each Chinese character, then measure the resulting length
    // (requires java.util.regex.Matcher and java.util.regex.Pattern)
    int getInputLen(String str) {
        StringBuffer buf = new StringBuffer();
        Pattern p = Pattern.compile("[\u4e00-\u9fa5]");
        Matcher m = p.matcher(str);
        while (m.find()) {
            String chinese = m.group(); // the Chinese character that matched
            String pinyin = "aa";       // look up the pinyin for it in your own Chinese-to-pinyin mapping
            m.appendReplacement(buf, pinyin);
        }
        m.appendTail(buf);
        return buf.toString().length();
    }
    // Check whether a string consists entirely of Chinese characters
    String str = "这是一个全部由汉字组成的字符串";
    System.out.println(str.matches("[\\u4e00-\\u9fa5]+")); // true
    String str1 = "这是一个全部由汉字组成的字符串吗?";
    System.out.println(str1.matches("[\\u4e00-\\u9fa5]+")); // false: the trailing question mark is not an ideograph
    // Check whether a single character is a Chinese character
    char ch = '哇';
    System.out.println(String.valueOf(ch).matches("[\u4e00-\u9fa5]")); // true
    char ch1 = 'a';
    System.out.println(String.valueOf(ch1).matches("[\u4e00-\u9fa5]")); // false
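The range [\u4e00-\u9fa5] used above covers only part of the basic CJK Unified Ideographs block, so it rejects Extension A and every supplementary-plane ideograph. Java's regex engine also accepts Unicode script classes, and `\p{IsHan}` is one alternative worth considering. The sketch below assumes that "Chinese character" means "any code point in the Han script"; the class name `HanRegex` is illustrative.

```java
class HanRegex {
    // \p{IsHan} matches any code point whose Unicode script is Han.
    static boolean isAllHan(String s) {
        return s.matches("\\p{IsHan}+");
    }

    public static void main(String[] args) {
        String rare = "\uD840\uDC00"; // U+20000, in Extension B

        System.out.println(rare.matches("[\\u4e00-\\u9fa5]+")); // false
        System.out.println(isAllHan(rare));                     // true
        System.out.println(isAllHan("\u3400"));                 // true (Extension A)
        System.out.println(isAllHan("\u4E2Da"));                // false
    }
}
```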
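Limiting an EditText to 32 characters, with each Chinese character counting as 2 (so at most 16 Chinese characters).
This is done in a TextWatcher: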
    @Override
    public void onTextChanged(CharSequence s, int start, int before, int count) {
        int currentLength = s.length();
        Log.d(TAG, "s=" + s + ",currentLength=" + currentLength);
        if (currentLength > 0) {
            int length = 0; // weighted length: Chinese counts 2, anything else 1 (limit 32, i.e. 16 Chinese)
            int index = 0;  // index of the last character examined
            for (int i = 0; i < currentLength; i++) {
                if (isChinese(s.charAt(i))) { // a Chinese character counts as 2
                    length = length + 2;
                } else {                      // anything else counts as 1
                    length = length + 1;
                }
                index = i;
                if (length >= MAX_NAME_LENGTH) { // leave the loop once the weighted length reaches the limit
                    if (length > MAX_NAME_LENGTH) {
                        index = index - 1;       // the last character overshot the limit, so drop it
                    }
                    break;
                }
            }
            int targetLength = index + 1; // number of chars to keep, derived from the index
            if (targetLength != currentLength) {
                CharSequence newS = s.subSequence(0, targetLength);
                Log.d(TAG, "targetLength=" + targetLength + ", newS=" + newS);
                mEditText.setText(newS);              // reset the displayed text
                mEditText.setSelection(targetLength); // move the cursor to the end
            }
        }
    }
    // Decide from the Unicode block whether a char is a Chinese ideograph
    private static boolean isChinese(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        // Note: Extensions B, C and D lie above U+FFFF, so a single char never
        // falls in them; those branches only match for a full code point (int).
        return ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_C
                || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_D;
    }
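The watcher's counting loop walks the text char by char, so a supplementary-plane ideograph would be counted as two non-Chinese chars and could be cut in half. The same weighting rule can be sketched in plain Java, outside Android, by walking code points instead. Everything here is an assumption for illustration: the class `WeightedLimit`, the methods `weight` and `truncate`, and the choice of the Han script as the test for "Chinese".

```java
class WeightedLimit {
    // Han ideographs cost 2 units, everything else costs 1.
    static int weight(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN ? 2 : 1;
    }

    // Longest prefix of s whose total weight fits in maxUnits.
    // Cuts only at code point boundaries, so surrogate pairs stay intact.
    static String truncate(String s, int maxUnits) {
        int units = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int w = weight(cp);
            if (units + w > maxUnits) {
                break; // the next character would exceed the budget
            }
            units += w;
            i += Character.charCount(cp); // 1 for BMP, 2 for a surrogate pair
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        // 'a' + 'b' + U+4E2D costs 4 units; adding U+6587 would cost 6.
        System.out.println(truncate("ab\u4E2D\u6587cd", 5).length());        // 3
        // Two copies of U+20000 with a budget of 3: only the first fits.
        System.out.println(truncate("\uD840\uDC00\uD840\uDC00", 3).length()); // 2
    }
}
```

In the watcher, a helper like this could take the place of the manual index arithmetic, with the result compared against the current text before calling setText.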