• Removing duplicated rows


     【Question】

    I have a csv file with duplicate as well as unique data getting added to it on a daily basis, which results in many duplicates. I have to remove the duplicates based on specific columns. For example:

    csvfile1:

    title1 title2 title3 title4 title5
    abcdef 12 13 14 15
    jklmn 12 13 56 76
    abcdef 12 13 98 89
    bvnjkl 56 76 86 96

    Now, based on title1, title2 and title3 I have to remove duplicates and add the unique entries to a new csv file. As you can see, the abcdef row is not unique and repeats based on title1, title2 and title3, so it should be removed and the output should look like:

    Expected Output CSV File:

    title1 title2 title3 title4 title5
    jklmn 12 13 56 76
    bvnjkl 56 76 86 96

    The code I tried is below. CSV input script:

    import csv

    f = open("1.csv", 'a+')
    writer = csv.writer(f)
    writer.writerow(("t1", "t2", "t3"))
    a = [["a", 'b', 'c'], ["g", "h", "i"], ['a', 'b', 'c']]  # This list changes daily, so new and duplicate data get added daily
    for i in range(2):
        writer.writerow(a[i])
    f.close()

    Duplicate removal script:

    import csv

    with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
        seen = set()  # set for fast O(1) amortized lookup
        for line in in_file:
            if line in seen:
                continue  # skip duplicate
            seen.add(line)
            out_file.write(line)

    My Output: 2.csv:

    t1 t2 t3
    a b c
    g h i

    Now, I do not want a b c in 2.csv; based on t1 and t2, only the unique g h i row should be kept.

    Someone offered a solution, but the OP said they could not understand it:

    import csv

    with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
        seen = set()
        seentwice = set()
        reader = csv.reader(in_file)
        writer = csv.writer(out_file)
        rows = []
        for row in reader:
            if (row[0], row[1]) in seen:
                seentwice.add((row[0], row[1]))
            seen.add((row[0], row[1]))
            rows.append(row)
        for row in rows:
            if (row[0], row[1]) not in seentwice:
                writer.writerow(row)
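    The two-pass idea above can also be expressed with `collections.Counter`, which may be easier to follow: first count how often each key occurs, then keep only the rows whose key occurs exactly once. A minimal sketch, keyed on the first two columns as in the code above (the helper name `unique_rows` is ours, not from the original):

    ```python
    from collections import Counter

    def unique_rows(rows, key_cols=(0, 1)):
        """Keep only rows whose key (the selected columns) appears exactly once."""
        counts = Counter(tuple(r[c] for c in key_cols) for r in rows)
        return [r for r in rows if counts[tuple(r[c] for c in key_cols)] == 1]

    rows = [['a', 'b', 'c'], ['g', 'h', 'i'], ['a', 'b', 'c']]
    print(unique_rows(rows))  # only the ['g', 'h', 'i'] row survives
    ```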

    【Answer】

    Just group the records by the first 3 fields, select the groups whose member count equals 1, and concatenate the records of those groups. Unless there are special requirements, structured computations like this are much simpler and easier to understand in SPL:

        A
    1   =file("d:\\source.csv").import@t()
    2   =A1.group(title1,title2,title3).select(~.len()==1).conj()
    3   =file("d:\\result.csv").export@c(A2)

    A1: Read the contents of the file source.csv.

    A2: Group by the first 3 fields, select the groups whose member count equals 1, and concatenate the records of those groups.

    A3: Write the A2 result to the file result.csv.
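    For readers without SPL, the same group-select-concatenate steps can be sketched in plain Python (the helper `keep_unique_groups` and the in-memory sample data are ours; a real script would first read the rows with `csv.reader`):

    ```python
    from collections import defaultdict

    def keep_unique_groups(rows, key):
        """Group rows by key(row); keep only groups with exactly one member."""
        groups = defaultdict(list)
        for row in rows:
            groups[key(row)].append(row)
        # Concatenate the single-member groups, preserving input order.
        return [g[0] for g in groups.values() if len(g) == 1]

    # Mirrors the example data: dedupe on the first three fields.
    data = [
        ['abcdef', '12', '13', '14', '15'],
        ['jklmn',  '12', '13', '56', '76'],
        ['abcdef', '12', '13', '98', '89'],
        ['bvnjkl', '56', '76', '86', '96'],
    ]
    result = keep_unique_groups(data, key=lambda r: tuple(r[:3]))
    # result contains only the jklmn and bvnjkl rows
    ```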

  • Original article: https://blog.csdn.net/raqsoft/article/details/127803867