• How to save data from Hadoop into a database using Python


    I am using Hadoop to process XML files, so I have written a mapper file and a reducer file in Python.

    Suppose the input to be processed is test.xml.

    The mapper.py file:

    import sys
    import cStringIO
    import xml.etree.ElementTree as xml

    if __name__ == '__main__':
        buff = None
        intext = False
        # Read the XML input line by line from standard input.
        for line in sys.stdin:
            line = line.strip()
            if line.find("

            # Emit campaignID and adGroupID for the reducer.
            print '%s%s' % (campaignID, adGroupID)

    The reducer.py file:

    import sys

    if __name__ == '__main__':
        # Identity reducer: pass every line from the mapper through unchanged.
        for line in sys.stdin:
            print line.strip()

    I have run Hadoop with the following command:

    bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
        -file /path/to/mapper.py -mapper /path/to/mapper.py \
        -file /path/to/reducer.py -reducer /path/to/reducer.py \
        -input /path/to/input_file/test.xml \
        -output /path/to/output_folder/to/store/file

    When I run the above command, Hadoop creates an output file in the output path containing the required data, in exactly the format we emit from reducer.py.

    Now what I want to do is this: when I run the above command, I do not want the output data stored in the text file that Hadoop creates by default; instead, I want the data saved into a MySQL database.
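    For illustration only, a streaming reducer that inserts each record straight into MySQL (instead of printing it to standard output) could look roughly like the sketch below. It uses the MySQLdb driver and the Xml_Data database / PerformaceReport table that appear later in this post; the connection credentials, the tab delimiter, and the column names are assumptions, not the author's actual code.

    #!/usr/bin/env python
    # Sketch of a reducer that inserts rows into MySQL instead of printing them.
    import sys
    import MySQLdb  # assumes the MySQL-python driver is installed on every task node

    if __name__ == '__main__':
        # Placeholder credentials; each reduce task opens its own connection.
        conn = MySQLdb.connect(host='localhost', user='root', passwd='',
                               db='Xml_Data')
        cursor = conn.cursor()
        for line in sys.stdin:
            # Assuming the mapper emits campaignID and adGroupID separated by a tab.
            fields = line.strip().split('\t')
            if len(fields) == 2:
                cursor.execute("INSERT INTO PerformaceReport (campaignID, adGroupID)"
                               " VALUES (%s, %s)", fields)
        conn.commit()
        cursor.close()
        conn.close()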

    So I wrote some Python code in reducer.py that writes the data directly to the MySQL database, and tried to run the above command with the output path removed, as shown below:

    bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
        -file /path/to/mapper.py -mapper /path/to/mapper.py \
        -file /path/to/reducer.py -reducer /path/to/reducer.py \
        -input /path/to/input_file/test.xml

    And I got an error like the following:

    12/11/08 15:20:49 ERROR streaming.StreamJob: Missing required option: output

    Usage: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar [options]

    Options:

    -input DFS input file(s) for the Map step

    -output DFS output directory for the Reduce step

    -mapper The streaming command to run

    -combiner The streaming command to run

    -reducer The streaming command to run

    -file File/dir to be shipped in the Job jar file

    -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.

    -outputformat TextOutputFormat(default)|JavaClassName Optional.

    >After all this, my doubt is: how do I save the data into a database after processing the files?

    >In which file (mapper.py or reducer.py?) should we write the code that writes the data into the database?

    >Which command should be used to run Hadoop so that the data is saved into the database? Because when I removed the output folder path from the hadoop command, it showed an error.

    Could anyone please help me with the problems above?

    Edited

    What has been done so far:

    >Created the mapper and reducer files as above, which read the XML file via the hadoop command and create a text file in some folder.

    For example, the text file (the result of processing the XML file with the hadoop command) is in the following folder:

    /home/local/user/Hadoop/xml_processing/xml_output/part-00000

    The XML file here is 1.3 GB, and after processing with Hadoop the text file created is 345 MB.

    Now all I want to do is read the text file at the above path and save the data into the MySQL database as fast as possible.

    I have tried this with plain Python, but it takes 350 seconds to process the text file and save it into the MySQL database.
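    For reference, a plain-Python loader of the kind described above might look like the following sketch. The only speed lever it shows is batching rows with cursor.executemany() instead of issuing one INSERT per line; the batch size, the field delimiter, and the connection details are assumptions, and the table name is taken from the sqoop command further down.

    # Sketch of a batched loader for the part-00000 file; not the author's script.
    import MySQLdb

    INPUT = '/home/local/user/Hadoop/xml_processing/xml_output/part-00000'
    BATCH_SIZE = 10000  # rows per executemany() call (assumed value)

    conn = MySQLdb.connect(host='localhost', user='root', passwd='', db='Xml_Data')
    cursor = conn.cursor()
    sql = "INSERT INTO PerformaceReport (campaignID, adGroupID) VALUES (%s, %s)"

    batch = []
    for line in open(INPUT):
        # Assuming each line holds campaignID and adGroupID separated by a tab.
        fields = line.rstrip('\n').split('\t')
        if len(fields) == 2:
            batch.append(fields)
        if len(batch) >= BATCH_SIZE:
            cursor.executemany(sql, batch)
            batch = []
    if batch:
        cursor.executemany(sql, batch)

    conn.commit()
    cursor.close()
    conn.close()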

    >Now, as suggested by nichole, I downloaded sqoop and extracted it to the following path:

    /home/local/user/sqoop-1.4.2.bin__hadoop-0.20

    I went into the bin folder, typed ./sqoop, and got the following error:

    sh-4.2$./sqoop

    Warning: /usr/lib/hbase does not exist! HBase imports will fail.

    Please set $HBASE_HOME to the root of your HBase installation.

    Warning: $HADOOP_HOME is deprecated.

    Try 'sqoop help' for usage.

    I also tried the following:

    ./sqoop export --connect jdbc:mysql://localhost/Xml_Data --username root --table PerformaceReport --export-dir /home/local/user/Hadoop/xml_processing/xml_output/part-00000 --input-fields-terminated-by ''

    Result:

    Warning: /usr/lib/hbase does not exist! HBase imports will fail.

    Please set $HBASE_HOME to the root of your HBase installation.

    Warning: $HADOOP_HOME is deprecated.

    12/11/27 11:54:57 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

    12/11/27 11:54:57 INFO tool.CodeGenTool: Beginning code generation

    12/11/27 11:54:57 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver

    java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver

    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:636)

    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)

    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:525)

    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:548)

    at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:191)

    at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:175)

    at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:262)

    at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1235)

    at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1060)

    at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)

    at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)

    at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:97)

    at org.apache.sqoop.Sqoop.run(Sqoop.java:145)

    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)

    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)

    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)

    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)

    at org.apache.sqoop.Sqoop.main(Sqoop.java:238)

    at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)

    Is the above sqoop command useful for reading the text file and saving it into the database? I ask because we have to process the text file and then insert it into the database!

  • Original post: https://blog.csdn.net/m0_67391270/article/details/126565676