• java: Splitting strings with java.util.StringTokenizer

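    1 Introduction

    The java.util package provides StringTokenizer, a utility class for splitting strings. The string helpers of common frameworks, such as Spring's StringUtils, frequently rely on it.

    For example, this method from Spring's StringUtils: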

    public static String[] tokenizeToStringArray(
    		@Nullable String str, String delimiters, boolean trimTokens, boolean ignoreEmptyTokens) {
    
    	if (str == null) {
    		return EMPTY_STRING_ARRAY;
    	}
    
    	StringTokenizer st = new StringTokenizer(str, delimiters);
    	List<String> tokens = new ArrayList<>();
    	while (st.hasMoreTokens()) {
    		String token = st.nextToken();
    		if (trimTokens) {
    			token = token.trim();
    		}
    		if (!ignoreEmptyTokens || token.length() > 0) {
    			tokens.add(token);
    		}
    	}
    	return toStringArray(tokens);
    }
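
    To make the effect of the trimTokens and ignoreEmptyTokens flags concrete, here is a minimal sketch that mirrors the logic above in plain Java, without the Spring dependency. The class name TokenizeSketch and the method name tokenize are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; not part of Spring.
class TokenizeSketch {

    // Mirrors the logic of the method quoted above, using only the JDK.
    static String[] tokenize(String str, String delimiters,
                             boolean trimTokens, boolean ignoreEmptyTokens) {
        if (str == null) {
            return new String[0];
        }
        StringTokenizer st = new StringTokenizer(str, delimiters);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (trimTokens) {
                token = token.trim();
            }
            if (!ignoreEmptyTokens || !token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Trimming turns the whitespace-only token " " into "", which is then dropped.
        System.out.println(Arrays.toString(tokenize("a, b; ,c", ",;", true, true)));   // [a, b, c]
        // Without trimming or filtering, the four raw tokens "a", " b", " ", "c" are kept.
        System.out.println(tokenize("a, b; ,c", ",;", false, false).length);           // 4
    }
}
```

    Note that StringTokenizer itself never returns an empty token; a token can only become empty here after trim() has been applied.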
    

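    Another example is the CronExpression class in the Quartz scheduling framework. Its buildExpression method parses a cron expression, which consists of six mandatory fields plus an optional seventh (year) field, separated by spaces or tabs. The expression string is also split with StringTokenizer, as shown below: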

    protected void buildExpression(String expression) throws ParseException {
        this.expressionParsed = true;
    
        try {
            if (this.seconds == null) {
                this.seconds = new TreeSet();
            }
    
            if (this.minutes == null) {
                this.minutes = new TreeSet();
            }
    
            if (this.hours == null) {
                this.hours = new TreeSet();
            }
    
            if (this.daysOfMonth == null) {
                this.daysOfMonth = new TreeSet();
            }
    
            if (this.months == null) {
                this.months = new TreeSet();
            }
    
            if (this.daysOfWeek == null) {
                this.daysOfWeek = new TreeSet();
            }
    
            if (this.years == null) {
                this.years = new TreeSet();
            }
    
            int exprOn = 0;
    
            for(StringTokenizer exprsTok = new StringTokenizer(expression, " \t", false); exprsTok.hasMoreTokens() && exprOn <= 6; ++exprOn) {
                String expr = exprsTok.nextToken().trim();
                // Field 3 is day-of-month: 'L' cannot be combined with other values.
                if (exprOn == 3 && expr.indexOf('L') != -1 && expr.length() > 1 && expr.contains(",")) {
                    throw new ParseException("Support for specifying 'L' and 'LW' with other days of the month is not implemented", -1);
                }
    
                // Field 5 is day-of-week: 'L' cannot be combined with other values.
                if (exprOn == 5 && expr.indexOf('L') != -1 && expr.length() > 1 && expr.contains(",")) {
                    throw new ParseException("Support for specifying 'L' with other days of the week is not implemented", -1);
                }
    
                // Only one '#' ("nth" day) is allowed in the day-of-week field.
                if (exprOn == 5 && expr.indexOf('#') != -1 && expr.indexOf('#', expr.indexOf('#') + 1) != -1) {
                    throw new ParseException("Support for specifying multiple \"nth\" days is not implemented.", -1);
                }
    
                StringTokenizer vTok = new StringTokenizer(expr, ",");
    
                while(vTok.hasMoreTokens()) {
                    String v = vTok.nextToken();
                    this.storeExpressionVals(0, v, exprOn);
                }
            }
    
            if (exprOn <= 5) {
                throw new ParseException("Unexpected end of expression.", expression.length());
            } else {
                if (exprOn <= 6) {
                    this.storeExpressionVals(0, "*", 6);
                }
    
                TreeSet<Integer> dow = this.getSet(5);
                TreeSet<Integer> dom = this.getSet(3);
                boolean dayOfMSpec = !dom.contains(NO_SPEC);
                boolean dayOfWSpec = !dow.contains(NO_SPEC);
                if ((!dayOfMSpec || dayOfWSpec) && (!dayOfWSpec || dayOfMSpec)) {
                    throw new ParseException("Support for specifying both a day-of-week AND a day-of-month parameter is not implemented.", 0);
                }
            }
        } catch (ParseException var8) {
            throw var8;
        } catch (Exception var9) {
            throw new ParseException("Illegal cron expression format (" + var9.toString() + ")", 0);
        }
    }
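
    The two-level splitting used above (first on whitespace into fields, then on commas into values) can be shown in isolation. The following is only a sketch: the class CronFieldSplitSketch and the method splitFields are hypothetical names, not Quartz APIs, and rejecting more than seven fields is an assumption made here for simplicity, whereas the quoted loop simply stops reading after the seventh field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper for illustration; not part of Quartz.
class CronFieldSplitSketch {

    // Splits a cron expression on spaces and tabs, then splits each field on commas.
    static List<List<String>> splitFields(String expression) {
        List<List<String>> fields = new ArrayList<>();
        StringTokenizer exprsTok = new StringTokenizer(expression, " \t", false);
        while (exprsTok.hasMoreTokens()) {
            StringTokenizer vTok = new StringTokenizer(exprsTok.nextToken(), ",");
            List<String> values = new ArrayList<>();
            while (vTok.hasMoreTokens()) {
                values.add(vTok.nextToken());
            }
            fields.add(values);
        }
        // Assumption of this sketch: accept exactly 6 or 7 fields.
        if (fields.size() < 6 || fields.size() > 7) {
            throw new IllegalArgumentException("Expected 6 or 7 fields but found " + fields.size());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Prints [[0], [0/5], [14, 18], [*], [*], [?]]
        System.out.println(splitFields("0 0/5 14,18 * * ?"));
    }
}
```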
    

    2 Usage

    import java.util.ArrayList;
    import java.util.List;
    import java.util.StringTokenizer;

    /**
     * @author xiaoxu
     * @date 2023-10-18
     * spring_boot:com.xiaoxu.boot.tokenizer.TestStringTokenizer
     */
    public class TestStringTokenizer {

        public static void main(String[] args) {
            print("你 好 吗\t我是 \t你的\t 朋友 \t", " \t", false);
        }

        public static void print(String str, String delimiter, boolean isReturnDelims) {
            System.out.println("Input string: [" + str + "]; delimiters: [" + delimiter + "].");
            List<String> strs = new ArrayList<>();
            // Honor the returnDelims flag supplied by the caller.
            StringTokenizer strToken = new StringTokenizer(str, delimiter, isReturnDelims);
            while (strToken.hasMoreTokens()) {
                String s = strToken.nextToken();
                System.out.println("Token: [" + s + "]");
                // Deliberately leave out the token "吗" to show filtering while iterating.
                if (!s.equals("吗")) {
                    strs.add(s);
                }
            }
            System.out.println("Token list: " + strs);
        }

    }

    Output:

    Input string: [你 好 吗	我是 	你的	 朋友 	]; delimiters: [ 	].
    Token: [你]
    Token: [好]
    Token: [吗]
    Token: [我是]
    Token: [你的]
    Token: [朋友]
    Token list: [你, 好, 我是, 你的, 朋友]
    

    Source code analysis:

    public StringTokenizer(String str, String delim, boolean returnDelims) {
        currentPosition = 0;
        newPosition = -1;
        delimsChanged = false;
        this.str = str;
        maxPosition = str.length();
        delimiters = delim;
        retDelims = returnDelims;
        setMaxDelimCodePoint();
    }
    
    private void setMaxDelimCodePoint() {
        if (delimiters == null) {
            maxDelimCodePoint = 0;
            return;
        }
    
        int m = 0;
        int c;
        int count = 0;
        for (int i = 0; i < delimiters.length(); i += Character.charCount(c)) {
            c = delimiters.charAt(i);
            if (c >= Character.MIN_HIGH_SURROGATE && c <= Character.MAX_LOW_SURROGATE) {
                c = delimiters.codePointAt(i);
                hasSurrogates = true;
            }
            if (m < c)
                m = c;
            count++;
        }
        maxDelimCodePoint = m;
    
        if (hasSurrogates) {
            delimiterCodePoints = new int[count];
            for (int i = 0, j = 0; i < count; i++, j += Character.charCount(c)) {
                c = delimiters.codePointAt(j);
                delimiterCodePoints[i] = c;
            }
        }
    }
    

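    The constructor calls setMaxDelimCodePoint(). As the source shows, this method stores in int maxDelimCodePoint the largest code point among the delimiter characters, purely to speed up delimiter detection. In int scanToken(int startPos), the test is (c <= maxDelimCodePoint) && (delimiters.indexOf(c) >= 0). If the character's code point is less than or equal to maxDelimCodePoint, the character might be a delimiter, so delimiters.indexOf is consulted to find out. If the code point is greater than maxDelimCodePoint, the character cannot be a delimiter, and && short-circuits so that delimiters.indexOf is never called. That skipped lookup is the optimization.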
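
    The pre-check described above can be sketched on its own. This is a simplified illustration that assumes the delimiters contain no surrogate pairs; the class DelimiterCheckSketch and its method names are hypothetical and only loosely follow the JDK fields.

```java
// Hypothetical demo class; a simplified version of the pre-check for
// delimiter strings without surrogate pairs.
class DelimiterCheckSketch {

    // Largest char value that occurs in the delimiter string.
    static int maxDelimCodePoint(String delimiters) {
        int max = 0;
        for (int i = 0; i < delimiters.length(); i++) {
            max = Math.max(max, delimiters.charAt(i));
        }
        return max;
    }

    // The cheap comparison runs first; indexOf is only reached when the
    // character is small enough to possibly be a delimiter.
    static boolean isDelimiter(char c, String delimiters, int maxDelimCodePoint) {
        return c <= maxDelimCodePoint && delimiters.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        String delimiters = " \t";
        int max = maxDelimCodePoint(delimiters);
        System.out.println(max);                                 // 32 (the space character)
        System.out.println(isDelimiter('\t', delimiters, max));  // true
        System.out.println(isDelimiter('A', delimiters, max));   // false, 'A' is 65 > 32
    }
}
```

    With whitespace delimiters, almost every letter or CJK character has a value above 32, so most characters of a typical input are rejected by the integer comparison alone.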

    private int scanToken(int startPos) {
        int position = startPos;
        while (position < maxPosition) {
            if (!hasSurrogates) {
                char c = str.charAt(position);
                if ((c <= maxDelimCodePoint) && (delimiters.indexOf(c) >= 0))
                    break;
                position++;
            } else {
                int c = str.codePointAt(position);
                if ((c <= maxDelimCodePoint) && isDelimiter(c))
                    break;
                position += Character.charCount(c);
            }
        }
        if (retDelims && (startPos == position)) {
            if (!hasSurrogates) {
                char c = str.charAt(position);
                if ((c <= maxDelimCodePoint) && (delimiters.indexOf(c) >= 0))
                    position++;
            } else {
                int c = str.codePointAt(position);
                if ((c <= maxDelimCodePoint) && isDelimiter(c))
                    position += Character.charCount(c);
            }
        }
        return position;
    }
    

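    The scanToken method advances over the characters of a token. As soon as an iteration finds a character that is contained in the delimiter string, position stops incrementing, and that position is returned as the end index of the token. Because String.substring is half-open (start inclusive, end exclusive), this value is the index of the token's last character plus one, so the extracted substring is the complete token with no delimiter attached.

    The skipDelimiters method is similar: when returnDelims is false, it skips over consecutive characters that are contained in the delimiter string and returns the start index of the next token.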

    private int skipDelimiters(int startPos) {
        if (delimiters == null)
            throw new NullPointerException();
    
        int position = startPos;
        while (!retDelims && position < maxPosition) {
            if (!hasSurrogates) {
                char c = str.charAt(position);
                if ((c > maxDelimCodePoint) || (delimiters.indexOf(c) < 0))
                    break;
                position++;
            } else {
                int c = str.codePointAt(position);
                if ((c > maxDelimCodePoint) || !isDelimiter(c)) {
                    break;
                }
                position += Character.charCount(c);
            }
        }
        return position;
    }
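
    Put together, nextToken is essentially "skip delimiters, scan a token, take the substring". The sketch below re-creates that flow under simplifying assumptions: returnDelims is false, there are no surrogate pairs, and the maxDelimCodePoint shortcut is omitted. The class MiniTokenizerSketch is a hypothetical name, not JDK code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; a reduced re-creation of the skip/scan loop.
class MiniTokenizerSketch {

    static List<String> tokenize(String str, String delimiters) {
        List<String> tokens = new ArrayList<>();
        int position = 0;
        int maxPosition = str.length();
        while (position < maxPosition) {
            // skipDelimiters step: move past a run of delimiter characters.
            while (position < maxPosition && delimiters.indexOf(str.charAt(position)) >= 0) {
                position++;
            }
            if (position >= maxPosition) {
                break;
            }
            int start = position;
            // scanToken step: stop at the first delimiter, which becomes the exclusive end index.
            while (position < maxPosition && delimiters.indexOf(str.charAt(position)) < 0) {
                position++;
            }
            tokens.add(str.substring(start, position));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [ab, cd]
        System.out.println(tokenize("  ab\t\tcd ", " \t"));
    }
}
```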
    

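    The analysis above shows that the delimiter string is treated as a set of characters, not as one multi-character separator. Any character of the input that appears in the delimiter string causes a split, regardless of the order of the characters in the delimiter string, and a run of consecutive delimiter characters acts as a single separator.

    A demonstration follows (note that countTokens() should not be called in the loop condition together with nextToken()):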

    // Requires: import java.util.Arrays;
    public static void print2(String str, String delimiter, boolean isReturnDelims) {
        StringTokenizer strTokenizer = new StringTokenizer(str, delimiter, isReturnDelims);
        // Call countTokens() once, before any token is consumed, and reuse the result.
        int count = strTokenizer.countTokens();
        System.out.println("Total tokens: " + count);
        String[] strs = new String[count];
        // Do not write the loop condition as i < strTokenizer.countTokens().
        // countTokens() counts from currentPosition, which every nextToken() call advances,
        // so the bound would shrink on each iteration. The bound must be the fixed total.
        for (int i = 0; i < count; i++) {
            strs[i] = strTokenizer.nextToken();
        }
        System.out.println(Arrays.toString(strs));
    }

    The source of countTokens is as follows:

    public int countTokens() {
        int count = 0;
        int currpos = currentPosition;
        while (currpos < maxPosition) {
            currpos = skipDelimiters(currpos);
            if (currpos >= maxPosition)
                break;
            currpos = scanToken(currpos);
            count++;
        }
        return count;
    }

    Run:

    print2("1a2b3c4ca5bc6ba7abc8acbbaba9", "abc", false);

    The result is:

    Total tokens: 9
    [1, 2, 3, 4, 5, 6, 7, 8, 9]
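
    To see why the countTokens() warning matters, the following sketch runs the discouraged loop next to the usual hasMoreTokens() loop on the same input. The class and method names are hypothetical and exist only for this comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class contrasting the two loop styles.
class CountTokensPitfallSketch {

    // Anti-pattern: countTokens() returns the number of REMAINING tokens,
    // so the bound shrinks by one each time nextToken() is called.
    static List<String> collectWithShrinkingBound(String str, String delimiters) {
        StringTokenizer st = new StringTokenizer(str, delimiters);
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < st.countTokens(); i++) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Usual pattern: loop until the tokenizer reports no more tokens.
    static List<String> collectAll(String str, String delimiters) {
        StringTokenizer st = new StringTokenizer(str, delimiters);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "1a2b3c4ca5bc6ba7abc8acbbaba9";
        System.out.println(collectWithShrinkingBound(input, "abc")); // [1, 2, 3, 4, 5]
        System.out.println(collectAll(input, "abc"));                // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

    In the first loop, i counts up while the remaining-token count goes down (9, 8, 7, 6, 5, 4), so the loop exits once i reaches 5 and the last four tokens are silently lost.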
    
  • Original article: https://blog.csdn.net/a232884c/article/details/133904470