记第一次写groovy脚本cookie排坑，附登录验证码的解决方案（纯原创整理笔记）

我们项目中用到groovy去做爬虫，这也是我第一次学习爬虫和groovy，毕竟jvm语言学习发现和java还是很像的，相关资料网上都能搜到，这里整理了我实际项目中遇到的问题和解决的大致思路（附代码）

我们项目中是要从12315爬取客诉数据，通过jsoup拿到指定html表格页面数据然后通过list打入我们系统内部，这几步很常见就不说了。

然后主要的坑是在登录的场景，登录需要用户名+密码+验证码，登录后拿取cookie，之后爬接口用这个cookie就行了，用户名密码都有了，这么第一步就是获取验证码

在这里插入图片描述

其实就是走get请求拿到图片文件，一开始我用共用httpUtil走获取失败，发现了第一个坑，原因在于postman等都是自动转成文件，代码里直接走get请求行不通，所以我的解决办法是需要走IO获取文件并落地

失败案例

static String imgCodeReload(String cookie) {
        String url = "http://consumer.12315.sh.cn/consumer/validateImgLogin";

        HttpUtil httpUtil = HttpUtil.newHttp();
        httpUtil.addHeaders("Cookie", cookie);
        httpUtil.addHeaders("Accept", "*/*");
        Map<String, String> configs = new HashMap<>();
        HttpUtil.Response response = httpUtil.get(url, configs)
        String content = response.content;
        return content;
    }
1
2
3
4
5
6
7
8
9
10
11

成功转file获取

/**
     * 下载文件
     * @param url
     * @param dest
     * @return
     */
    static String downloadFileAndGetCode(String cookie, String url) {
        //def fos = new FileOutputStream(dest)
        File file = getImgPath();
        def fos = new FileOutputStream(file);
        def out = new BufferedOutputStream(fos)
        //必须带cookie拿到图片文件，否则登录不通过
        out << getInputStream(cookie,url) //<< new URL(url).openStream()
        fos.close();
        out.close()
        String orcText = ""
        int count = 0
        //重试三次，图片识别有误判几率会读出2313.和23o1
        while ((!isNumeric(orcText) || orcText.length() != 4) && count++ < 3) {
            orcText = CommonUtils.getImgOcrText("https://xxxxxxx.com/fileUpload/singleUpload", file, "eng");
            orcText = orcText.replaceAll("\r|\n*", "")
            System.out.println(orcText);
        }
        file.delete();
        return orcText;
    }
    
    static File getImgPath() {
        File dirFile = new File(".jpg");
        if (!dirFile.exists()) {
            dirFile.mkdirs();
        }
        return new File(dirFile, CommonUtils.createId() + ".jpg");
    }
    
	static boolean isNumeric(String str) {
        if (str == "") return false
        Pattern pattern = Pattern.compile("[0-9]*");
        return pattern.matcher(str).matches();
    }

    public static InputStream getInputStream(String cookie,String urlReq) {
        InputStream inputStream = null;
        HttpURLConnection httpURLConnection = null;
        try {
            URL url = new URL(urlReq);//创建的URL
            if (url != null) {
                httpURLConnection = (HttpURLConnection) url.openConnection();//打开链接
                httpURLConnection.setConnectTimeout(3000);//设置网络链接超时时间，3秒，链接失败后重新链接
                httpURLConnection.setDoInput(true);//打开输入流
                httpURLConnection.setRequestMethod("GET");//表示本次Http请求是GET方式
                httpURLConnection.setDoOutput(true)

                DataOutputStream outStream = new DataOutputStream(httpURLConnection.getOutputStream());
                outStream.write(cookie.getBytes());


                int responseCode = httpURLConnection.getResponseCode();//获取返回码
                if (responseCode == 200) {//成功为200
                    inputStream = httpURLConnection.getInputStream();
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return inputStream
    }
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69

CommonUtils.getImgOcrText()是我们图片识别的SDK，这里不方便展示

这一步其实就是落地，相对路径创建目录

static File getImgPath() {
        File dirFile = new File(".jpg");
        if (!dirFile.exists()) {
            dirFile.mkdirs();
        }
        return new File(dirFile, CommonUtils.createId() + ".jpg");
    }
1
2
3
4
5
6
7

这一步做验证，因为12315的验证码场景比较简单，是4位纯数字，这里是做了图片识别会有误判的情况下，做了重试机制，具体处理还是得按照自己的OCR服务来做调整

static boolean isNumeric(String str) {
        if (str == "") return false
        Pattern pattern = Pattern.compile("[0-9]*");
        return pattern.matcher(str).matches();
    }
1
2
3
4
5

这一步是重点，为了能让URL带cookie获取文件，需要使用IO流传入cookie的字节

public static InputStream getInputStream(String cookie,String urlReq) {
        InputStream inputStream = null;
        HttpURLConnection httpURLConnection = null;
        try {
            URL url = new URL(urlReq);//创建的URL
            if (url != null) {
                httpURLConnection = (HttpURLConnection) url.openConnection();//打开链接
                httpURLConnection.setConnectTimeout(3000);//设置网络链接超时时间，3秒，链接失败后重新链接
                httpURLConnection.setDoInput(true);//打开输入流
                httpURLConnection.setRequestMethod("GET");//表示本次Http请求是GET方式
                httpURLConnection.setDoOutput(true)

                DataOutputStream outStream = new DataOutputStream(httpURLConnection.getOutputStream());
                outStream.write(cookie.getBytes());


                int responseCode = httpURLConnection.getResponseCode();//获取返回码
                if (responseCode == 200) {//成功为200
                    inputStream = httpURLConnection.getInputStream();
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return inputStream
    }
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28

这里传入到BufferedOutputStream即可，注意这里最后要把落地的file删除
在这里插入图片描述

文件落地后效果如图，这就可以拿file文件去走图片识别的SDK了，然后如果代码发生异常，文件也会落地，而不会走file.delete()这一步，注意不要让应用内图片越堆越多，一定要file.delete()掉
在这里插入图片描述

上游调用代码，开始登录

static String loginOnAndGetCookie(String cookie,int i) {
        String url = "http://consumer.12315.sh.cn/consumer/logon";
        Map<String, String> args = new HashMap<>()
        args.put("organId", "organId");
        args.put("loginN", "user");
        args.put("loginP", "password");
        args.put("imgcode", downloadFileAndGetCode(cookie, "http://consumer.12315.sh.cn/consumer/validateImgLogin"));

        HttpUtil httpUtil = HttpUtil.newHttp();
        httpUtil.addHeaders("Cookie", cookie);
        Map<String, String> configs = new HashMap<>();
        HttpUtil.Response response = httpUtil.post(url, args, configs);
        String content = response.content;
        System.out.println(content)
        Map<String, String> headers = response.headers;
        String cookieReturn = headers.get("Set-Cookie").toString();
        if (cookieReturn != null)
            cookieReturn = cookieReturn.split("; ")[0];
        System.out.println(cookieReturn)
        if ("rememberMe=deleteMe".equals(cookieReturn)){
            return cookie
        }else if (cookieReturn.contains("consumer_sh=") && i==0) {
            i++;
            return loginOnAndGetCookie(cookieReturn,i)
        }
        return ""
    }
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

这里递归走了两次是因为第一次调用拿到cookie，用于传了拿到验证码图片和登录用，第二次拿到的就是登陆后的cookie。具体判断是通过resp中header做的，属于特有场景

最后的上游方法就是爬取方法了，一个完整逻辑闭环

static def String process(String url, Map<String, String> args, CrawlerTask crawlerTask, org.dom4j.Document context, SpiderClient client) {
    //args.put(HttpUtil.USER_AGENT,"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36");
    //if (url==null){
    //}
    url = "http://consumer.12315.sh.cn/consumer/accuse/queryResult";
    String endTime = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
    Date dtDate = df.parse(endTime);

    //设置抓3天数据，防止job停止漏一天
    Calendar calendar = Calendar.getInstance();
    calendar.setTime(dtDate);
    calendar.add(Calendar.DAY_OF_MONTH, -3);
    dtDate = calendar.getTime();

    String startTime = new SimpleDateFormat("yyyy-MM-dd").format(dtDate);
    args.put("currentPage", "1");
    args.put("cnnAppCondition.startDate", startTime);
    args.put("cnnAppCondition.endDate", endTime);

    def logList = [];
    def rsList = [];
    HttpUtil httpUtil = HttpUtil.newHttp();
    String newCookie = loginOnAndGetCookie("",0)
    if (newCookie=="") logList.add("cookie get fail!!!")
    httpUtil.addHeaders("Cookie", newCookie);
    Map<String, String> configs = new HashMap<>();

    String content = httpUtil.post(url, args, configs).content;
    Document doc = Jsoup.parse(content, url);

    def element = doc.select("table").first();
    System.out.println(element);
    if (element != null) {
        def els = element.select("tr");
        for (i in 1..<els.size()) {
            Elements ele = els.get(i).select("td");
            rsList.add(["numNo" : ele.get(1).text().toString()
                        , "date"    : ele.get(2).text().toString()
                        , "name"    : ele.get(3).text().toString()
                        , "category": ele.get(5).text().toString()]);
        }
    } else {
        logList.add("element null!!!")
    }
    return JSON.toJSONString(["rs": rsList, "log": logList]);
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47

相关阅读:
phpcms头像上传漏洞
 Python中Tkinter模块的Canvas控件绘制jpg图片到指定区域
 HBuilderX打包web网站之wap2app设置底部菜单tabBar
动态规划算法几个小问题
 家电管理系统_家电进销系统（完整版，附带源码可以运行，可以商用）
（SpringBoot）第七章：SpringBoot日志文件
 再看目标检测map指标
 使用枚举Enum代替Integer类型
 unknown type name ‘bool‘ gcc error
应试教育导致学生迷信标准答案惯性导致思维僵化-移动机器人
原文地址：https://blog.csdn.net/Koikoi12/article/details/126779451