• A small multithreaded program for scraping every jar from Maven Central


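    I could no longer stand exporting and importing jar packages between the internal and external networks, so I bit the bullet and wrote a program that scrapes every jar from the Maven Central repository, splitting the work by the 26 letters of the alphabet.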

    The pom.xml file

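    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>org.bullgod</groupId>
        <artifactId>MavenRepoBaLaLa</artifactId>
        <version>1.0-SNAPSHOT</version>
        <properties>
            <maven.compiler.source>17</maven.compiler.source>
            <maven.compiler.target>17</maven.compiler.target>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        </properties>
        <dependencies>
            <dependency>
                <groupId>org.jsoup</groupId>
                <artifactId>jsoup</artifactId>
                <version>1.16.1</version>
            </dependency>
            <dependency>
                <groupId>org.apache.commons</groupId>
                <artifactId>commons-collections4</artifactId>
                <version>4.4</version>
            </dependency>
            <dependency>
                <groupId>commons-io</groupId>
                <artifactId>commons-io</artifactId>
                <version>2.11.0</version>
            </dependency>
            <dependency>
                <groupId>commons-lang</groupId>
                <artifactId>commons-lang</artifactId>
                <version>2.1</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-api</artifactId>
                <version>2.0.5</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-reload4j</artifactId>
                <version>2.0.5</version>
            </dependency>
        </dependencies>
    </project>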

    Source code

    package org.bullgod;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.*;
    import java.net.SocketTimeoutException;
    import java.net.URL;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class MavenBalabalaThread {
        public static void main(String[] args) throws IOException {
            ExecutorService es = Executors.newFixedThreadPool(30);
            // Submit one task per first letter, 'a' through 'z'.
            for (int i = 0; i < 26; i++) {
                String letter = String.valueOf((char) ('a' + i));
                es.submit(new Task(letter));
            }
            // Stop accepting new tasks; the tasks already submitted keep running.
            es.shutdown();
        }
    }

    class Task implements Runnable {
        /** Root URL to crawl. */
        //private static final String ROOT = "https://repo.maven.apache.org/maven2/";
        private static final String ROOT = "https://repo1.maven.org/maven2/";
        /** Root directory on the local disk where files are saved. */
        private static final String DiskROOT = "D:\\maven2\\";
        /** File name of maven-metadata.xml. */
        private static final String MAVEN_METADATA_XML_FILENAME = "maven-metadata.xml";
        /** Index file listing all top-level directories. */
        private static final String indexfilename = "maven2Indexall.txt";
        // "all" crawls everything and is likely to fail; crawling letter by letter (a, b, c, ...) is recommended.
        String firstAlpaca = "all";

        public Task(String args) throws IOException {
            firstAlpaca = args;
        }

        /**
         * Recursively crawls the links under a URL.
         *
         * @param url         the current URL
         * @param sleepMillis milliseconds to sleep before each request
         */
        private static void findSubUrl(String url, int sleepMillis) {
            try {
                Thread.sleep(sleepMillis);
                Document doc = null;
                while (doc == null) {
                    try {
                        doc = Jsoup.connect(url).userAgent("Mozilla").timeout(5000).get();
                    } catch (SocketTimeoutException te) {
                        // Connection timed out: wait 10 seconds, then reconnect.
                        System.out.println("Connection timed out, retrying in 10 seconds");
                        Thread.sleep(10 * 1000);
                    }
                }
                Elements links = doc.select("#contents a");
                for (Element link : links) {
                    String pathorfilename = link.attr("href");
                    if (pathorfilename.equals("../")) {
                        // Parent directory: skip.
                        continue;
                    }
                    // Absolute URL of the link.
                    String absUrl = link.absUrl("href");
                    System.out.println(absUrl);
                    // Path relative to ROOT, reused as the path under DiskROOT.
                    String pathName = absUrl.substring(ROOT.length());
                    String nowtime = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date());
                    System.out.println("[" + nowtime + "]: " + pathName);
                    // An href without "/" is a file; otherwise it is a directory.
                    if (pathorfilename.indexOf("/") == -1) {
                        String saveFile = DiskROOT + pathName;
                        File f1 = new File(saveFile);
                        // Download only if the file does not exist yet; retry until it succeeds.
                        while (!f1.exists()) {
                            try (BufferedInputStream bis = new BufferedInputStream(new URL(absUrl).openStream());
                                 FileOutputStream fos = new FileOutputStream(saveFile)) {
                                byte[] buffer = new byte[1024];
                                int count;
                                while ((count = bis.read(buffer, 0, 1024)) != -1) {
                                    fos.write(buffer, 0, count);
                                }
                                break;
                            } catch (IOException e) {
                                System.out.println("Download failed: " + saveFile);
                                // Delete the partial file so the next attempt starts clean.
                                if (f1.exists()) {
                                    f1.delete();
                                }
                                System.out.println("Retrying the download in 10 seconds");
                                Thread.sleep(10 * 1000);
                            }
                        }
                    } else {
                        // Directory: create it on disk, then recurse into it.
                        String filePath = DiskROOT + pathName;
                        File f2 = new File(filePath);
                        if (!f2.exists() && !f2.mkdirs()) {
                            System.out.println("Failed to create directory: " + filePath);
                        }
                        findSubUrl(absUrl, sleepMillis);
                    }
                }
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        }

        private static void searchdir(String rooturl, String dir, int sleepMillis) {
            String filePath = DiskROOT + dir;
            File f = new File(filePath);
            if (!f.exists() && !f.mkdirs()) {
                System.out.println("Failed to create directory: " + filePath);
            }
            findSubUrl(rooturl + dir, sleepMillis);
        }

        @Override
        public void run() {
            System.out.println("Begin crawler: starting with " + firstAlpaca);
            int sleepMillis = 100;
            String rooturl = ROOT;
            // findSubUrl(rooturl, sleepMillis); // crawl everything directly
            File file = new File(DiskROOT + indexfilename);
            try (BufferedReader br = new BufferedReader(new FileReader(file))) {
                String st;
                while ((st = br.readLine()) != null) {
                    System.out.println(st);
                    String dir = st.trim();
                    // Crawl the directory in "all" mode, or when it starts with this task's letter.
                    if (firstAlpaca.equalsIgnoreCase("all") || dir.toLowerCase().startsWith(firstAlpaca)) {
                        searchdir(rooturl, dir, sleepMillis);
                    }
                }
            } catch (FileNotFoundException e) {
                System.out.println("File not found: " + indexfilename);
            } catch (IOException ie) {
                System.out.println("Failed to read file: " + indexfilename);
            }
            System.out.println("End crawler");
        }
    }
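
The download loop in the listing above retries forever, so a file that can never be fetched (for example a link that keeps returning an error) would stall its thread. One possible alternative is a bounded retry with a fixed delay, sketched below. `RetrySketch`, `withRetry`, `maxAttempts` and `delayMillis` are hypothetical names introduced for illustration and are not part of the original program. The simulated action stands in for a real download so the sketch runs without network access.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {
    /**
     * Runs the action up to maxAttempts times, sleeping delayMillis after each
     * failed attempt except the last. Returns the action's result, or rethrows
     * the last IOException once the attempts are used up.
     * (Hypothetical helper, not part of the original program.)
     */
    public static <T> T withRetry(Callable<T> action, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                last = e;
                System.out.println("Attempt " + attempt + " of " + maxAttempts + " failed: " + e.getMessage());
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated download that fails twice before succeeding.
        int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated timeout");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " calls"); // prints "ok after 3 calls"
    }
}
```

In the crawler, the body of the download loop could be passed to such a helper, with files that still fail after the last attempt logged and skipped so the rest of the directory keeps going. Whether skipping is acceptable depends on how complete the local mirror needs to be.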

  • Original article: https://blog.csdn.net/zxbzhang/article/details/132858444