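Internally, Java handles characters as Unicode, using UTF-16; when a byte order has to be chosen, it is big-endian (UTF-16BE). As a quick recap, UTF-16 uses two or four bytes per character: code points below 65536 (U+0000 to U+FFFF) take two bytes, and code points above that range take four. BE (big endian) means the high-order byte is written first, followed by the low-order byte, which is the same order Java uses when it writes a multi-byte integer to a stream.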
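The byte layout described above can be checked directly. The following is a minimal sketch; the class name `Utf16BytesDemo` and the helper `toHex` are illustrative names, not part of the original text.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BytesDemo {

    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bmp = "\u8d24";                 // U+8D24, below 65536
        String supplementary = "\uD842\uDFB7"; // U+20BB7, above 65535

        // Big endian: high-order byte first.
        System.out.println(toHex(bmp.getBytes(StandardCharsets.UTF_16BE)));           // 8D 24
        // Little endian: low-order byte first.
        System.out.println(toHex(bmp.getBytes(StandardCharsets.UTF_16LE)));           // 24 8D
        // A code point above U+FFFF takes four bytes.
        System.out.println(toHex(supplementary.getBytes(StandardCharsets.UTF_16BE))); // D8 42 DF B7
    }
}
```

The string literals use `\u` escapes so that the result does not depend on the encoding of the source file.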
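A char is essentially an unsigned integer that always occupies two bytes. Its value is a UTF-16 code unit; for characters below 65536 this is simply the Unicode code point, and the char represents the character assigned to that code point.
Because it is fixed at two bytes, a char can only represent characters whose code points are below 65536, and cannot represent characters beyond that range.
So how are characters outside that range represented? They need two char values, so in practice they are held in a String. For example, the Chinese character "𠮷" has the code point U+20BB7, which is clearly above 65535, so it can only be held in a String. In source code it is written as the two-char sequence "\uD842\uDFB7" (some editors perform this conversion automatically when the character is pasted).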
- public class CharTest {
-
- public static void main(String[] args) {
- char c = '贤';
- System.out.println(c);
- char c1 = 0x8d24;
- System.out.println(c1);
- char c2 = 36132;
- System.out.println(c2);
- char c3 = '\u8d24';
- System.out.println(c3);
- // Does not compile: a char literal holds exactly one two-byte value
- // char c4 = '\uD842\uDFB7';
- String s = "\uD842\uDFB7";
- System.out.println(s);
- }
-
- }
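To work with characters above 65535 without typing the two char values by hand, the code point methods on `String` and `Character` can be used. This is a small sketch; the class name `CodePointDemo` and the helper `fromCodePoint` are made up for illustration.

```java
public class CodePointDemo {

    // Builds a String from a single Unicode code point (one or two chars).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = "\uD842\uDFB7"; // U+20BB7

        System.out.println(s.length());                             // 2 (char values)
        System.out.println(s.codePointCount(0, s.length()));        // 1 (actual character)
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 20bb7
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true

        // The same string, built from the code point.
        System.out.println(fromCodePoint(0x20BB7).equals(s));       // true
    }
}
```

Note that `length()` counts char values, not characters, which matters when truncating or indexing strings that may contain such characters.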
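getBytes(): called without arguments, this method encodes the string with the JVM's default charset, which is determined by the file.encoding system property in effect when the java command runs (for example, -Dfile.encoding=UTF-8).
Under the hood, getBytes() relies on Charset.defaultCharset():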
- public static Charset defaultCharset() {
- if (defaultCharset == null) {
- synchronized (Charset.class) {
- String csn = AccessController.doPrivileged(
- new GetPropertyAction("file.encoding"));
- Charset cs = lookup(csn);
- if (cs != null)
- defaultCharset = cs;
- else
- defaultCharset = forName("UTF-8");
- }
- }
- return defaultCharset;
- }
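Since the default charset depends on how the JVM was started, the result of the no-argument `getBytes()` can differ between machines. A minimal sketch of the difference, using an explicit `StandardCharsets` constant to get a stable result (the class name `DefaultCharsetDemo` and the helper `encodeUtf8` are illustrative):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetDemo {

    // Encodes with an explicit charset, so the result does not depend on file.encoding.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The charset that the no-argument getBytes() uses on this JVM.
        System.out.println(Charset.defaultCharset());

        String str = "\u4f60\u597d"; // the two characters used in the examples of this article
        System.out.println(Arrays.toString(str.getBytes()));  // depends on the default charset
        System.out.println(Arrays.toString(encodeUtf8(str))); // [-28, -67, -96, -27, -91, -67]
    }
}
```

Passing a `Charset` object instead of a charset name also avoids the checked `UnsupportedEncodingException` that `getBytes(String)` declares.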
Example:
- import java.util.Arrays;
-
- public class StringTest {
-
- public static void main(String[] args) {
- System.out.println(System.getProperty("file.encoding"));
- String str = "你好";
- byte[] bytes=str.getBytes();
- System.out.println(Arrays.toString(bytes));
- }
-
- }
Encoding and decoding with an explicitly named charset:
- import java.util.Arrays;
-
- public class StringTest {
-
- public static void main(String[] args) throws Exception {
- String str = "你好";
- byte[] bytes = str.getBytes("UTF-8");
- System.out.println(Arrays.toString(bytes));//[-28, -67, -96, -27, -91, -67]
- byte[] gbks = str.getBytes("GBK");
- System.out.println(Arrays.toString(gbks));//[-60, -29, -70, -61]
-
- byte[] bytes1 = {-28, -67, -96, -27, -91, -67};
- String str1 = new String(bytes1,"UTF-8");
- System.out.println(str1);//你好
-
- byte[] bytes2 = {-60, -29, -70, -61};
- String str2 = new String(bytes2,"GBK");
- System.out.println(str2);//你好
- }
-
- }
Demo: garbled text that can be recovered
- public static void lmknCode() throws Exception {
- String str = "你好";
- byte[] bytes = str.getBytes("GBK");
- System.out.println(Arrays.toString(bytes));
- String str1 = new String(bytes,"UTF-8");
- System.out.println(str1);
- String str2 = new String(bytes,"GBK");
- System.out.println(str2);
- }
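The demo above recovers the text by decoding the original byte array a second time. When only the garbled String is left, recovery is possible only if the wrong decode lost no information. The sketch below assumes the wrong charset was ISO-8859-1, which maps every byte to exactly one char; a wrong UTF-8 decode, by contrast, generally replaces invalid byte sequences and cannot be undone this way. The class name `ReversibleGarbleDemo` and the helper `repair` are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReversibleGarbleDemo {

    // Undoes a wrong decode: turn the garbled text back into the original bytes,
    // then decode those bytes with the charset they were really written in.
    static String repair(String garbled, Charset wrongCharset, Charset realCharset) {
        byte[] originalBytes = garbled.getBytes(wrongCharset);
        return new String(originalBytes, realCharset);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String str = "\u4f60\u597d";
        byte[] bytes = str.getBytes(gbk); // [-60, -29, -70, -61]

        // Wrong decode, but lossless: each byte becomes one Latin-1 char.
        String garbled = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 4

        String repaired = repair(garbled, StandardCharsets.ISO_8859_1, gbk);
        System.out.println(repaired.equals(str)); // true
    }
}
```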
Demo: garbled text that cannot be recovered
- public static void lmbknCode() throws Exception {
- String str = "你好";
- byte[] bytes = str.getBytes("ISO-8859-1");
- System.out.println(Arrays.toString(bytes));//[63, 63]
- String str1 = new String(bytes,"GBK");
- System.out.println(str1);//??
- String str2 = new String(bytes,"UTF-8");
- System.out.println(str2);//??
- }
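In the demo above the information is lost at encoding time: ISO-8859-1 has no mapping for these characters, so `getBytes` substitutes `?` (byte 63) for each of them. One way to detect this before it happens is to ask a `CharsetEncoder`. This is a sketch with illustrative names (`LossyEncodeCheck`, `fitsLatin1`, `encodeStrict`), not a method from the original article.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LossyEncodeCheck {

    // Returns true if every character of s can be represented in ISO-8859-1.
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    // Encodes strictly: throws instead of silently substituting '?'.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String str = "\u4f60\u597d";
        System.out.println(fitsLatin1("hello")); // true
        System.out.println(fitsLatin1(str));     // false
        try {
            encodeStrict(str);
        } catch (CharacterCodingException e) {
            System.out.println("cannot encode without loss: " + e);
        }
    }
}
```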
Listing the charsets supported by the current JVM:
- import java.nio.charset.Charset;
- import java.util.Set;
-
- public class JavaCode {
-
- public static void main(String[] args) {
- Set<String> charsetNames = Charset.availableCharsets().keySet();
- System.out.println("-----JDK1.8 charset is "+charsetNames.size()+"----- ");
- for (String str : charsetNames) {
- System.out.println(str);
- }
- }
- }
Output:
- -----JDK1.8 charset is 170-----
- Big5
- Big5-HKSCS
- CESU-8
- EUC-JP
- EUC-KR
- GB18030
- GB2312
- GBK
- IBM-Thai
- IBM00858
- IBM01140
- IBM01141
- IBM01142
- IBM01143
- IBM01144
- IBM01145
- IBM01146
- IBM01147
- IBM01148
- IBM01149
- IBM037
- IBM1026
- IBM1047
- IBM273
- IBM277
- IBM278
- IBM280
- IBM284
- IBM285
- IBM290
- IBM297
- IBM420
- IBM424
- IBM437
- IBM500
- IBM775
- IBM850
- IBM852
- IBM855
- IBM857
- IBM860
- IBM861
- IBM862
- IBM863
- IBM864
- IBM865
- IBM866
- IBM868
- IBM869
- IBM870
- IBM871
- IBM918
- ISO-2022-CN
- ISO-2022-JP
- ISO-2022-JP-2
- ISO-2022-KR
- ISO-8859-1
- ISO-8859-13
- ISO-8859-15
- ISO-8859-2
- ISO-8859-3
- ISO-8859-4
- ISO-8859-5
- ISO-8859-6
- ISO-8859-7
- ISO-8859-8
- ISO-8859-9
- JIS_X0201
- JIS_X0212-1990
- KOI8-R
- KOI8-U
- Shift_JIS
- TIS-620
- US-ASCII
- UTF-16
- UTF-16BE
- UTF-16LE
- UTF-32
- UTF-32BE
- UTF-32LE
- UTF-8
- windows-1250
- windows-1251
- windows-1252
- windows-1253
- windows-1254
- windows-1255
- windows-1256
- windows-1257
- windows-1258
- windows-31j
- x-Big5-HKSCS-2001
- x-Big5-Solaris
- x-euc-jp-linux
- x-EUC-TW
- x-eucJP-Open
- x-IBM1006
- x-IBM1025
- x-IBM1046
- x-IBM1097
- x-IBM1098
- x-IBM1112
- x-IBM1122
- x-IBM1123
- x-IBM1124
- x-IBM1166
- x-IBM1364
- x-IBM1381
- x-IBM1383
- x-IBM300
- x-IBM33722
- x-IBM737
- x-IBM833
- x-IBM834
- x-IBM856
- x-IBM874
- x-IBM875
- x-IBM921
- x-IBM922
- x-IBM930
- x-IBM933
- x-IBM935
- x-IBM937
- x-IBM939
- x-IBM942
- x-IBM942C
- x-IBM943
- x-IBM943C
- x-IBM948
- x-IBM949
- x-IBM949C
- x-IBM950
- x-IBM964
- x-IBM970
- x-ISCII91
- x-ISO-2022-CN-CNS
- x-ISO-2022-CN-GB
- x-iso-8859-11
- x-JIS0208
- x-JISAutoDetect
- x-Johab
- x-MacArabic
- x-MacCentralEurope
- x-MacCroatian
- x-MacCyrillic
- x-MacDingbat
- x-MacGreek
- x-MacHebrew
- x-MacIceland
- x-MacRoman
- x-MacRomania
- x-MacSymbol
- x-MacThai
- x-MacTurkish
- x-MacUkraine
- x-MS932_0213
- x-MS950-HKSCS
- x-MS950-HKSCS-XP
- x-mswin-936
- x-PCK
- x-SJIS_0213
- x-UTF-16LE-BOM
- X-UTF-32BE-BOM
- X-UTF-32LE-BOM
- x-windows-50220
- x-windows-50221
- x-windows-874
- x-windows-949
- x-windows-950
- x-windows-iso2022jp
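The set of available charsets varies between JVM installations, so rather than relying on a list like the one above, code can check a name at run time. A minimal sketch, with the illustrative names `CharsetLookupDemo` and `canonicalName`:

```java
import java.nio.charset.Charset;

public class CharsetLookupDemo {

    // Returns the canonical name for a charset name, or null if this JVM lacks it.
    static String canonicalName(String name) {
        return Charset.isSupported(name) ? Charset.forName(name).name() : null;
    }

    public static void main(String[] args) {
        // Charset names are not case-sensitive.
        System.out.println(canonicalName("utf-8"));           // UTF-8
        // Present only where the JDK's extended charsets are installed.
        System.out.println(canonicalName("GBK"));
        System.out.println(canonicalName("no-such-charset")); // null
    }
}
```

`Charset.isSupported` still throws `IllegalCharsetNameException` for a syntactically invalid name, so input from users may need an extra guard.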
Printing a byte array as a hex string:
- import ch.qos.logback.core.encoder.ByteArrayUtil; // from logback-core-1.2.10.jar
-
- public class ByteTest {
- public static void main(String[] args) {
- System.out.println(ByteArrayUtil.toHexString(new byte[10]));
- }
- }
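If pulling in logback just for hex output is not desirable, the same kind of string can be produced with the JDK alone. The sketch below uses an illustrative class `HexDemo`; its output is two lowercase hex digits per byte with no separator, which is assumed (not verified here) to resemble what the logback helper prints.

```java
public class HexDemo {

    // Converts each byte to two lowercase hex digits, with no separator.
    static String toHexString(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHexString(new byte[10]));                        // 00000000000000000000
        System.out.println(toHexString(new byte[] {-28, -67, -96}));          // e4bda0
    }
}
```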