• df.drop_duplicates(): detailed explanation and usage


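    drop_duplicates()

    1. With no arguments, removes rows that are complete duplicates

    2. Removes rows that are duplicated in specified columns

    Contents

    1. Code example:

    2. Output:

    3. Details:
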

    1. Code example:

    import pandas as pd

    # Sample data: rows 0 and 1 are identical in every column.
    df = pd.DataFrame({
        'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
        'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
        'rating': [4, 4, 3.5, 15, 5]})
    print("---------------------Original data:")
    print(df)
    print("------------------------df.drop_duplicates()")
    print(df.drop_duplicates())
    print("------------------------Drop rows duplicated in the brand column")
    print(df.drop_duplicates(subset='brand'))
    print("------------------------Keep the first occurrence of each duplicated row, drop the rest")
    print(df.drop_duplicates(keep="first"))
    print("----------------------inplace: bool, default False; whether to drop duplicates in the original data or return a copy")
    print("-----------------inplace=False returns a copy with duplicates removed")
    print(df.drop_duplicates(inplace=False))
    print("-------------df1")
    print(df)  # df is unchanged
    print("-----------------inplace=True drops duplicates in the original data")
    print(df.drop_duplicates(inplace=True))  # prints None: nothing is returned
    print("-------------df2")
    print(df)  # df itself has been modified

    2. Output:

    ---------------------Original data:
         brand style  rating
    0  Yum Yum   cup     4.0
    1  Yum Yum   cup     4.0
    2  Indomie   cup     3.5
    3  Indomie  pack    15.0
    4  Indomie  pack     5.0
    ------------------------df.drop_duplicates()
         brand style  rating
    0  Yum Yum   cup     4.0
    2  Indomie   cup     3.5
    3  Indomie  pack    15.0
    4  Indomie  pack     5.0
    ------------------------Drop rows duplicated in the brand column
         brand style  rating
    0  Yum Yum   cup     4.0
    2  Indomie   cup     3.5
    ------------------------Keep the first occurrence of each duplicated row, drop the rest
         brand style  rating
    0  Yum Yum   cup     4.0
    2  Indomie   cup     3.5
    3  Indomie  pack    15.0
    4  Indomie  pack     5.0
    ----------------------inplace: bool, default False; whether to drop duplicates in the original data or return a copy
    -----------------inplace=False returns a copy with duplicates removed
         brand style  rating
    0  Yum Yum   cup     4.0
    2  Indomie   cup     3.5
    3  Indomie  pack    15.0
    4  Indomie  pack     5.0
    -------------df1
         brand style  rating
    0  Yum Yum   cup     4.0
    1  Yum Yum   cup     4.0
    2  Indomie   cup     3.5
    3  Indomie  pack    15.0
    4  Indomie  pack     5.0
    -----------------inplace=True drops duplicates in the original data
    None
    -------------df2
         brand style  rating
    0  Yum Yum   cup     4.0
    2  Indomie   cup     3.5
    3  Indomie  pack    15.0
    4  Indomie  pack     5.0
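
As the output shows, calling the method with `inplace=True` prints `None`, so anything chained onto that call would fail. A common alternative is to keep the returned copy under a new name. The sketch below reuses the article's data; the variable names `deduped` and `result` are chosen here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# Default call: a new DataFrame comes back and df is left untouched.
deduped = df.drop_duplicates()
print(len(df), len(deduped))      # 5 4

# inplace=True: df is modified directly and the call returns None.
result = df.drop_duplicates(inplace=True)
print(result is None, len(df))    # True 4
```

Assigning the result keeps the original data available for comparison, which is why this form is often easier to debug than the in-place variant.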

     

    3. Details:


    drop_duplicates(self, subset: 'Optional[Union[Hashable, Sequence[Hashable]]]' = None, keep: 'Union[str, bool]' = 'first', inplace: 'bool' = False, ignore_index: 'bool' = False)
       

    Returns:

            DataFrame with duplicate rows removed.
        
        Considering certain columns is optional. Indexes, including time indexes
        are ignored.
        
    Parameters:
        ----------
        subset : column label or sequence of labels, optional
            The columns in which to look for duplicates. Only consider
            certain columns for identifying duplicates, by default use
            all of the columns.
        keep : {'first', 'last', False}, default 'first'
            Determines which duplicates (if any) to keep.
            - ``first`` : Drop duplicates except for the first occurrence.
            - ``last`` : Drop duplicates except for the last occurrence.
            - False : Drop all duplicates.
        inplace : bool, default False
            Whether to drop duplicates in place or to return a copy.
            True modifies the original data directly; False leaves the
            original data untouched and returns a copy.
        ignore_index : bool, default False
            If True, the resulting axis will be labeled 0, 1, …, n - 1.
        
            .. versionadded:: 1.0.0
        
        Returns
        -------
        DataFrame or None
            DataFrame with duplicates removed or None if ``inplace=True``.
        
        See Also
        --------
        DataFrame.value_counts: Count unique combinations of columns.
        
        Examples:
        --------
        Consider dataset containing ramen rating.
        
        >>> df = pd.DataFrame({
        ...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
        ...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
        ...     'rating': [4, 4, 3.5, 15, 5]
        ... })
        >>> df
            brand style  rating
        0  Yum Yum   cup     4.0
        1  Yum Yum   cup     4.0
        2  Indomie   cup     3.5
        3  Indomie  pack    15.0
        4  Indomie  pack     5.0
        
        By default, it removes duplicate rows based on all columns.
        
        >>> df.drop_duplicates()
            brand style  rating
        0  Yum Yum   cup     4.0
        2  Indomie   cup     3.5
        3  Indomie  pack    15.0
        4  Indomie  pack     5.0
        
        To remove duplicates on specific column(s), use ``subset``.
        
        >>> df.drop_duplicates(subset=['brand'])
            brand style  rating
        0  Yum Yum   cup     4.0
        2  Indomie   cup     3.5
        
        To remove duplicates and keep last occurrences, use ``keep``.
        
        >>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
            brand style  rating
        1  Yum Yum   cup     4.0
        2  Indomie   cup     3.5
        4  Indomie  pack     5.0
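
The examples above do not exercise `keep=False` or `ignore_index`, both of which appear in the parameter list. The following is a minimal sketch of the two on the same ramen data, assuming pandas 1.0.0 or later (the version in which `ignore_index` was added). The names `no_dups` and `last_per_brand` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# keep=False drops every row that has a duplicate, including the first
# occurrence, so rows 0 and 1 both disappear.
no_dups = df.drop_duplicates(keep=False)
print(no_dups.index.tolist())           # [2, 3, 4]

# Keep the last row for each brand, then relabel the index 0..n-1.
last_per_brand = df.drop_duplicates(subset='brand', keep='last',
                                    ignore_index=True)
print(last_per_brand.index.tolist())    # [0, 1]
print(last_per_brand['rating'].tolist())  # [4.0, 5.0]
```

Without `ignore_index=True`, the second result would keep the original labels 1 and 4.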

  • Original article: https://blog.csdn.net/c_lanxiaofang/article/details/125880941