Reading files with pandas

Reading files

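Reading CSV and TXT files: read_csv()

About CSV files:

CSV (comma-separated values) is a simple file format that lays out tabular data in a fixed structure.
A CSV file stores tabular data, such as a spreadsheet or a database table, as plain text, which makes it a common format for data exchange.
CSV files can be opened in Excel, where each line becomes a row and each delimited value becomes a cell.

Usage:

pandas.read_csv(filepath_or_buffer, sep=",", header=0, names=["first column", "second column", "third column"], encoding="utf-8", usecols=[1, 2, 3])
filepath_or_buffer: path to the file.
sep: the delimiter used in the source file.
header: row number to use as the column names. When names is not passed, the first row is used (equivalent to header=0); header=None means the data has no header row. When names is passed and header is left unset, pandas behaves as if header=None, so an existing header row is read as data; pass header=0 together with names to replace it.
names: column names to assign or rename to. After names is specified, columns can still be selected by the positional indices 0, 1, 2, ... (for example in usecols); indices start at 0, which is the first column of the source data.
usecols: positions or names of the columns to read. Slices are not accepted. Positions also start at 0, which is the first column.
Other parameters work as in read_excel.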

import pandas as pd
# Raw strings keep the backslashes in the Windows path from being read as escape sequences
df = pd.read_csv(r".\study\weather.txt", sep=",")
print("df------\n", df.head())
df1 = pd.read_csv(r".\study\weather.txt", sep=",", names=["a", "b", "c", "d", "e", "f", "g", "h", "i"], usecols=[1, 2, 3, 4, 5])  # select columns by position
print("df1------\n", df1.head())
df2 = pd.read_csv(r".\study\weather.txt", sep=",", names=["a", "b", "c", "d", "e", "f", "g", "h", "i"], usecols=["b", "d", "e"])  # select columns by name
print("df2------\n", df2.head())

df------
           ymd bWendu yWendu tianqi fengxiang fengli  aqi aqiInfo  aqiLevel
0  2018-01-01     3℃    -6℃   晴~多云       东北风   1-2级   59       良         2
1  2018-01-02     2℃    -5℃   阴~多云       东北风   1-2级   49       优         1
2  2018-01-03     2℃    -5℃     多云        北风   1-2级   28       优         1
3  2018-01-04     0℃    -8℃      阴       东北风   1-2级   28       优         1
4  2018-01-05     3℃    -6℃   多云~晴       西北风   1-2级   50       优         1
df1------
         b       c       d          e       f
0  bWendu  yWendu  tianqi  fengxiang  fengli
1      3℃     -6℃    晴~多云        东北风    1-2级
2      2℃     -5℃    阴~多云        东北风    1-2级
3      2℃     -5℃      多云         北风    1-2级
4      0℃     -8℃       阴        东北风    1-2级
df2------
         b       d          e
0  bWendu  tianqi  fengxiang
1      3℃    晴~多云        东北风
2      2℃    阴~多云        东北风
3      2℃      多云         北风
4      0℃       阴        东北风
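In the df1 and df2 output above, the file's own header row (bWendu, yWendu, ...) appears as data row 0. That happens because passing names without header makes read_csv treat the file as having no header row. The sketch below reproduces this on a small in-memory CSV and shows that adding header=0 replaces the existing header instead. The sample text and variable names are made up for illustration and are not the contents of weather.txt.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for weather.txt
csv_text = "ymd,bWendu,yWendu\n2018-01-01,3,-6\n2018-01-02,2,-5\n"

# names only: the file's own header line is kept as the first data row
df_names_only = pd.read_csv(io.StringIO(csv_text), names=["a", "b", "c"])

# header=0 plus names: the file's header line is discarded and replaced
df_renamed = pd.read_csv(io.StringIO(csv_text), header=0, names=["a", "b", "c"])

print(df_names_only)
print(df_renamed)
```

With names only, the numeric columns are also read as text, because the old header strings sit in the same column as the numbers.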

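Reading Excel files: read_excel()

Libraries used

pandas.read_excel() relies internally on the openpyxl and xlrd libraries (openpyxl for .xlsx files, xlrd for legacy .xls files).
So if you only need Excel-related processing, you can also use openpyxl directly.

Usage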

pandas.read_excel(
io,
sheet_name=0,
header=0,
names=None,
index_col=None,
usecols=None,
skiprows=None,
nrows=None
)

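Parameters

io: path to the file.
sheet_name: the sheet or sheets to read, given by position or by name.
- Default is 0. Positions start at 0, so 0 is the first sheet. Returns a DataFrame.
- sheet_name=1: loads the second sheet as a DataFrame.
- sheet_name="Sheet1": loads the sheet named "Sheet1" as a DataFrame.
- sheet_name=[1, 2, "Sheet3"]: loads the second sheet, the third sheet, and the sheet named "Sheet3"; returns a dict whose keys are 1, 2, and "Sheet3".
- sheet_name=None: loads all sheets; returns a dict keyed by sheet name.

header: which row to use as the header; accepts int or list of int.
Default is 0, meaning the first row is used as the header. header=None means the source has no header row, and pandas labels the columns 0, 1, 2, 3, ...
names: custom header names, passed as a list.
index_col: column or columns to use as the row index; accepts int or list of int. Default is None, in which case pandas generates a 0, 1, 2, 3, ... integer index as the row labels.
If a list is passed, the row index becomes a multi-level index.

usecols: columns to parse; accepts str, list-like, or callable. Default is None. String ranges include both endpoints.
- If None, all columns are parsed.
- If str, a comma-separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of both sides.
- If list of int, the column positions to parse.
- If list of str, the column names to parse.

dtype: data type for the given columns, for example {"a": "float64"}. Default is None, which leaves the inferred types unchanged.
skiprows: rows to skip (optional); accepts list-like, int, or callable.
- An int is the number of rows to skip at the start of the sheet, so 1 skips the first row. A list gives the 0-based row numbers to skip.
nrows: number of rows to read, typically used with large files. Type int; default is None, which reads all rows.

converters: applies a function to every value of the given columns. Pass a dict that maps columns to functions; it can be combined with usecols.
- Keys can be column names or column positions; values are functions, either your own named functions or Python lambdas.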

Examples

The source Excel file has two sheets: the first is the student sheet and the second is the vegetables sheet.

# sheet_name demo
import pandas as pd
# With sheet_name=None the result is a dict: keys are sheet names, values are DataFrames
df1 = pd.read_excel(r".\study\test_excel.xlsx", sheet_name=None)
print("df1---------------\n", df1, type(df1["student"]))

# With the default sheet_name, the first sheet is read and a DataFrame is returned
df2 = pd.read_excel(r".\study\test_excel.xlsx")
print("df2---------------\n", df2)

# Sheets requested by number are keyed by that number; sheets requested by name are keyed by the name. 0 is the first sheet
df3 = pd.read_excel(r".\study\test_excel.xlsx", sheet_name=[0, "vegetables"])
print("df3---------------\n", df3)

df1---------------
 {'student':   name  age sex address  score
0   刘一   18   女      上海    100
1   花二   40   男      上海     99
2   张三   25   男      北京     80
3   李四   30   男      西安     40
4   王五   70   男      青岛     70
5   孙六   65   女      泰州     90, 'vegetables':    序号          菜名   单价  产地
0   1     spinach  4.5  崇明
1   2    cucumber  5.0  奉贤
2   3      tomato  6.0  惠南
3   4  green bean  8.0  金山} 
df2---------------
   name  age sex address  score
0   刘一   18   女      上海    100
1   花二   40   男      上海     99
2   张三   25   男      北京     80
3   李四   30   男      西安     40
4   王五   70   男      青岛     70
5   孙六   65   女      泰州     90
df3---------------
 {0:   name  age sex address  score
0   刘一   18   女      上海    100
1   花二   40   男      上海     99
2   张三   25   男      北京     80
3   李四   30   男      西安     40
4   王五   70   男      青岛     70
5   孙六   65   女      泰州     90, 'vegetables':    序号          菜名   单价  产地
0   1     spinach  4.5  崇明
1   2    cucumber  5.0  奉贤
2   3      tomato  6.0  惠南
3   4  green bean  8.0  金山}
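Since test_excel.xlsx itself is not available here, the following is a minimal, self-contained sketch of the same sheet_name behavior. It writes a throwaway two-sheet workbook to a temporary directory and reads it back. It assumes openpyxl is installed; the file name demo.xlsx and the sample rows are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample data for two sheets
students = pd.DataFrame({"name": ["A", "B"], "score": [100, 99]})
vegetables = pd.DataFrame({"item": ["spinach", "tomato"], "price": [4.5, 6.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.xlsx")
    # Write both frames into one workbook, one sheet each
    with pd.ExcelWriter(path) as writer:
        students.to_excel(writer, sheet_name="student", index=False)
        vegetables.to_excel(writer, sheet_name="vegetables", index=False)

    all_sheets = pd.read_excel(path, sheet_name=None)           # dict keyed by sheet name
    mixed = pd.read_excel(path, sheet_name=[0, "vegetables"])   # dict keyed by what was requested
    second = pd.read_excel(path, sheet_name=1)                  # single DataFrame

# Iterate over every sheet in the dict
for sheet, frame in all_sheets.items():
    print(sheet, frame.shape)
print(list(mixed.keys()))
```

Looping over the dict returned by sheet_name=None is a convenient way to process every sheet without knowing the sheet names in advance.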

# header and names demo
# header: which row to use as the header; accepts int or list of int. Default is 0 (the first row). header=None means the header row in the source is not used, and pandas labels the columns 0, 1, 2, 3, ...
# names: custom header names, passed as a list.
import pandas as pd
# header=[0, 1]: rows 1 and 2 are both treated as header rows
df1 = pd.read_excel(r".\study\test_excel.xlsx", header=[0, 1])
print("df1---------------\n", df1)

# header=None: the header row in the source is not used, and pandas labels the columns 0, 1, 2, 3, ...
df2 = pd.read_excel(r".\study\test_excel.xlsx", header=None)
print("df2---------------\n", df2)

# names: custom header names; the number of names must match the number of columns in the sheet
df3 = pd.read_excel(r".\study\test_excel.xlsx", names=["a", "b", "c", "d", "e"])
print("df3---------------\n", df3)

df1---------------
   name age sex address score
    刘一  18   女      上海   100
0   花二  40   男      上海    99
1   张三  25   男      北京    80
2   李四  30   男      西安    40
3   王五  70   男      青岛    70
4   孙六  65   女      泰州    90
df2---------------
       0    1    2        3      4
0  name  age  sex  address  score
1    刘一   18    女       上海    100
2    花二   40    男       上海     99
3    张三   25    男       北京     80
4    李四   30    男       西安     40
5    王五   70    男       青岛     70
6    孙六   65    女       泰州     90
df3---------------
     a   b  c   d    e
0  刘一  18  女  上海  100
1  花二  40  男  上海   99
2  张三  25  男  北京   80
3  李四  30  男  西安   40
4  王五  70  男  青岛   70
5  孙六  65  女  泰州   90
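The header=[0, 1] case is easier to follow on data that really has two header rows. The sketch below uses read_csv, which accepts the same list form for header, on an in-memory sample so that no Excel file is needed. The sample text and labels are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row header: a group label on top, the field name below
csv_text = "person,person,result\nname,age,score\nA,18,100\nB,40,99\n"

df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Columns are now addressed with (top level, second level) tuples
ages = df[("person", "age")].tolist()

print(df.columns.tolist())
print(ages)
```

Each column label becomes a tuple of the two header cells above it, which is why a single-header sheet read with header=[0, 1] ends up with its first data row folded into the column labels, as in the df1 output above.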

# index_col, skiprows, nrows and usecols demo

import pandas as pd
# index_col picks the column(s) used as the row index; index_col=[0] uses the first column
df1 = pd.read_excel(r".\study\test_excel.xlsx", index_col=[0])  # first column becomes the row index
print("df1____1---------\n", df1)

# The first column is now the row index, so rows can be selected by name
print("df1____2---------\n", df1.loc[["刘一", "李四"], "address"])

# index_col defaults to None, so pandas generates a 0, 1, 2, 3, ... row index
df2 = pd.read_excel(r".\study\test_excel.xlsx")
print("df2---------\n", df2)

# index_col is left at its default (None); usecols=[0, 2] reads the first and third columns
df4 = pd.read_excel(r".\study\test_excel.xlsx", usecols=[0, 2])
print("df4---------\n", df4)

# usecols also accepts Excel-style column letters and ranges to select several columns
df5 = pd.read_excel(r".\study\test_excel.xlsx", usecols="A:C,E")
print("df5---------\n", df5)

# Read the first 4 columns and skip rows 1 and 2. Row numbers in skiprows are 0-based and row 0 is the header row, so this drops the first two data rows
df6 = pd.read_excel(r".\study\test_excel.xlsx", usecols=[0, 1, 2, 3], skiprows=[1, 2])
print("df6---------\n", df6)

# Read the first and fourth columns, and only the first 2 data rows
df7 = pd.read_excel(r".\study\test_excel.xlsx", usecols=[0, 3], nrows=2)
print("df7---------\n", df7)

df1____1---------
       age sex address  score
name                        
刘一     18   女      上海    100
花二     40   男      上海     99
张三     25   男      北京     80
李四     30   男      西安     40
王五     70   男      青岛     70
孙六     65   女      泰州     90
df1____2---------
 name
刘一    上海
李四    西安
Name: address, dtype: object
df2---------
   name  age sex address  score
0   刘一   18   女      上海    100
1   花二   40   男      上海     99
2   张三   25   男      北京     80
3   李四   30   男      西安     40
4   王五   70   男      青岛     70
5   孙六   65   女      泰州     90
df4---------
   name sex
0   刘一   女
1   花二   男
2   张三   男
3   李四   男
4   王五   男
5   孙六   女
df5---------
   name  age sex  score
0   刘一   18   女    100
1   花二   40   男     99
2   张三   25   男     80
3   李四   30   男     40
4   王五   70   男     70
5   孙六   65   女     90
df6---------
   name  age sex address
0   张三   25   男      北京
1   李四   30   男      西安
2   王五   70   男      青岛
3   孙六   65   女      泰州
df7---------
   name address
0   刘一      上海
1   花二      上海
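The three forms of skiprows (list, int, callable) behave differently, which is easy to get wrong. As a rough sketch, the example below applies each form to a small in-memory CSV through read_csv, which takes the same skiprows, index_col and nrows arguments. The sample data and variable names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample: line 0 is the header, lines 1-4 are data
csv_text = "name,age\nA,18\nB,40\nC,25\nD,30\n"

# List: skip lines 1 and 2 (the first two data rows); the header on line 0 is kept
by_list = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])

# Int: skip the first two lines, so line 2 ("B,40") is read as the header
by_int = pd.read_csv(io.StringIO(csv_text), skiprows=2)

# Callable: skip every even-numbered line except the header
by_func = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i > 0 and i % 2 == 0)

# index_col and nrows: use "name" as the row index and read only two data rows
indexed = pd.read_csv(io.StringIO(csv_text), index_col="name", nrows=2)

print(by_list, by_int, by_func, indexed, sep="\n\n")
```

Note how skiprows=2 swallows the header line, while skiprows=[1, 2] leaves it in place; the list form is usually what you want when the first row holds the column names.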

# dtype demo
import pandas as pd
df1 = pd.read_excel(r".\study\test_excel.xlsx", names=["a", "b", "c", "d", "e"],
                    dtype={"b": "float64", "e": "float64"})  # note how the types of columns b and e change
print(df1)

    a     b  c   d      e
0  刘一  18.0  女  上海  100.0
1  花二  40.0  男  上海   99.0
2  张三  25.0  男  北京   80.0
3  李四  30.0  男  西安   40.0
4  王五  70.0  男  青岛   70.0
5  孙六  65.0  女  泰州   90.0
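dtype can also keep a column as text, which matters for codes with leading zeros. The following is a small sketch using read_csv on an in-memory sample; the column names code and qty are invented for the example.

```python
import io
import pandas as pd

# Hypothetical sample with an ID column that has leading zeros
csv_text = "code,qty\n010101,5\n000042,7\n"

# Default inference reads "code" as integers and drops the leading zeros
inferred = pd.read_csv(io.StringIO(csv_text))

# dtype keeps "code" as text and forces "qty" to float64
typed = pd.read_csv(io.StringIO(csv_text), dtype={"code": str, "qty": "float64"})

print(inferred["code"].tolist())
print(typed["code"].tolist())
print(typed.dtypes)
```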

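# converters demo
# When a column holds values with leading zeros (such as 010101), the DataFrame returned by pd.read_excel() treats the column as int, so 010101 becomes 10101.
# To keep the data intact in that case, read the column as str.
# A converter function is called once per cell, with each element of the column passed in as its argument.
import pandas as pd
df1 = pd.read_excel(r".\study\test_excel.xlsx", usecols=[0, 2, 4],  # columns 1, 3 and 5 of the original sheet
              converters={0: lambda x: x + "同学",  # 0 is the first of the selected columns [0, 2, 4]; "sex" is original column 2; "score" is original column 4
                          "sex": lambda x: x + "孩子",
                          "score": lambda x: x + 1000
                         })
print("df1---------\n", df1)
print(type(df1.loc[1, "score"]))

def join_str(value):
    # Add 100 to integers; append a suffix to everything else
    if isinstance(value, int):
        result = value + 100
    else:
        result = value + "乖宝"
    return result

df2 = pd.read_excel(r".\study\test_excel.xlsx", usecols=[0, 2, 4],  # columns 1, 3 and 5 of the original sheet
              converters={0: join_str,
                          1: join_str,
                          2: str    # read this column as str
                         })
print("df2---------\n", df2)
print(type(df2.loc[1, "score"]))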

df1---------
    name  sex  score
0  刘一同学  女孩子   1100
1  花二同学  男孩子   1099
2  张三同学  男孩子   1080
3  李四同学  男孩子   1040
4  王五同学  男孩子   1070
5  孙六同学  女孩子   1090

df2---------
    name  sex score
0  刘一乖宝  女乖宝   100
1  花二乖宝  男乖宝    99
2  张三乖宝  男乖宝    80
3  李四乖宝  男乖宝    40
4  王五乖宝  男乖宝    70
5  孙六乖宝  女乖宝    90
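The same converters idea can be tried without an Excel file. The sketch below uses read_csv on an in-memory sample; the column names and the add_suffix helper are hypothetical. One difference to keep in mind: read_csv hands each converter the raw text of the cell, so numeric work needs an explicit int() first, whereas the Excel examples above receive values that are already numbers.

```python
import io
import pandas as pd

# Hypothetical sample with a zero-padded code column
csv_text = "code,name,score\n010101,A,100\n000042,B,99\n"

def add_suffix(value):
    # Appends a fixed suffix to each cell of the column
    return value + "_student"

df = pd.read_csv(
    io.StringIO(csv_text),
    converters={
        "code": str,                       # keep the text as-is, leading zeros included
        "name": add_suffix,                # named function, called once per cell
        "score": lambda s: int(s) + 1000,  # convert the raw text before doing arithmetic
    },
)

print(df)
```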


Original article: https://blog.csdn.net/weixin_48668114/article/details/126107918