A DataFrame can be created from a list, a CSV file, a Series, an empty DataFrame, and more.
DataFrame constructor syntax
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
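To illustrate these parameters, here is a minimal sketch (the data, labels, and dtype are made up for demonstration) that passes them all at once:

import pandas as pd

# All constructor parameters together; dtype coerces values where possible
df = pd.DataFrame(
    data=[[1, 2], [3, 4]],
    index=["row1", "row2"],
    columns=["a", "b"],
    dtype="float64",
    copy=True,  # copy the input data instead of referencing it
)
print(df.dtypes)  # both columns become float64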
Create from a list

import pandas as pd

technologies = [
    ["Spark", 20000, "30days"],
    ["Pandas", 25000, "40days"],
]
df = pd.DataFrame(technologies)
df
Without index and columns specified, pandas generates auto-incrementing integer labels 0/1/2... by default. The example below specifies index and columns explicitly; they can also be assigned to an existing df:
# Custom column names and index
column_names=["Courses","Fee","Duration"]
row_label=["a","b"]
df = pd.DataFrame(technologies, index=row_label, columns=column_names)
df
df.dtypes shows the data type of each column:
df.dtypes
# Courses     object
# Fee          int64
# Duration    object
# dtype: object

# Cast columns to custom data types
types={'Courses': str,'Fee':float,'Duration':str}
df = df.astype(types)
df.dtypes
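Individual columns can also be converted one at a time; a minimal sketch using the df built above:

# Convert a single column; astype returns a new Series
df['Fee'] = df['Fee'].astype('int64')
print(df.dtypes)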
Create from a list of dicts (JSON-like records)

technologies = [
    {'Courses': 'Spark', 'Fee': 20000, 'Duration': '30days'},
    {'Courses': 'Pandas', 'Fee': 25000, 'Duration': '40days'},
]
df = pd.DataFrame(technologies)
df
Create from Series with pd.concat

courses = pd.Series(['Spark', 'Pandas'])
fees = pd.Series([20000, 25000])
duration = pd.Series(['30days', '40days'])

df = pd.concat([courses, fees, duration], axis=1)
df
# Output:
#         0      1       2
# 0   Spark  20000  30days
# 1  Pandas  25000  40days
By default, concat keeps the auto-generated integer index; custom labels can be assigned, and passing a dict of Series names the resulting columns:
index_labels = ['r1', 'r2']
courses.index = index_labels
fees.index = index_labels
duration.index = index_labels

df = pd.concat({"courses": courses, "fees": fees, "duration": duration}, axis=1)
df
# Output:
#    courses   fees duration
# r1   Spark  20000   30days
# r2  Pandas  25000   40days
Create with zip()

Courses = ['Spark', 'Pandas']
Fee = [20000, 25000]
Duration = ['30days', '40days']

df = pd.DataFrame(list(zip(Courses, Fee, Duration)), columns=['courses', 'fees', 'duration'])
df
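One caveat worth noting: zip() stops at the shortest input, so rows are silently dropped when the lists have unequal lengths. A quick sketch with a deliberately shortened Fee list (made up for illustration):

Fee_short = [20000]  # deliberately shorter than Courses and Duration
df_short = pd.DataFrame(list(zip(Courses, Fee_short, Duration)),
                        columns=['courses', 'fees', 'duration'])
print(len(df_short))  # 1 -- zip truncated to the shortest list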
Create an empty DataFrame

df = pd.DataFrame()
print(df)

# Create an empty DataFrame with column labels
df = pd.DataFrame(columns=["Courses", "Fee", "Duration"])
print(df)

# Copy another df
df2 = df.copy()
print(df2)
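copy() takes a deep parameter and the copy is deep by default. A minimal sketch; the exact sharing semantics of deep=False depend on the pandas version and copy-on-write mode:

df2 = df.copy(deep=True)   # default: copies data and indices
df3 = df.copy(deep=False)  # shallow: may share the underlying data with df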
Create from a CSV file

df = pd.read_csv('data_file.csv')
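A few commonly useful read_csv options are sketched below; data_file.csv and its column names are hypothetical placeholders:

import pandas as pd

df = pd.read_csv('data_file.csv', index_col=0)                 # use the first column as the index
df = pd.read_csv('data_file.csv', nrows=100)                   # read only the first 100 rows
df = pd.read_csv('data_file.csv', usecols=['Courses', 'Fee'])  # read only selected columns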
Selecting rows with iloc and loc

iloc[start:stop:step]: takes integer positions, either a single position or a list of positions.
loc[start:stop:step]: takes index labels, either a single label or a list of labels.

Example:
import pandas as pd
import numpy as np
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    'Fee': [20000, 25000, 26000, 22000, 24000, 21000, 22000],
    'Duration': ['30days', '40days', '35days', '40days', np.nan, None, '55days'],
    'Discount': [1000, 2300, 1500, 1200, 2500, 2100, 2000],
}
index_labels=['r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies,index=index_labels)
print(df)

    Courses    Fee Duration  Discount
r1    Spark  20000   30days      1000
r2  PySpark  25000   40days      2300
r3   Hadoop  26000   35days      1500
r4   Python  22000   40days      1200
r5   pandas  24000      NaN      2500
r6   Oracle  21000     None      2100
r7     Java  22000   55days      2000
1. Single position:

# Positions start at 0; this selects the 3rd row
df2 = df.iloc[2]
df2

Courses      Hadoop
Fee           26000
Duration     35days
Discount       1500
Name: r3, dtype: object
2. List of positions:

# Select the 3rd, 4th, and 7th rows
df2 = df.iloc[[2, 3, 6]]

   Courses    Fee Duration  Discount
r3  Hadoop  26000   35days      1500
r4  Python  22000   40days      1200
r7    Java  22000   55days      2000
3. Range of positions:

# Rows 2 through 5: the slice is closed on the left and open on the right, so position 5 (row 6) is excluded
df2 = df.iloc[1:5]

    Courses    Fee Duration  Discount
r2  PySpark  25000   40days      2300
r3   Hadoop  26000   35days      1500
r4   Python  22000   40days      1200
r5   pandas  24000      NaN      2500
4. Other patterns:

df2 = df.iloc[:1]    # select the first row
df2 = df.iloc[:3]    # select the first 3 rows
df2 = df.iloc[-1:]   # select the last row
df2 = df.iloc[-3:]   # select the last 3 rows
df2 = df.iloc[::2]   # select every other row

    Courses    Fee Duration  Discount
r1    Spark  20000   30days      1000
r3   Hadoop  26000   35days      1500
r5   pandas  24000      NaN      2500
r7     Java  22000   55days      2000
loc works like iloc but with labels:

df2 = df.loc['r2']
df2 = df.loc[['r2', 'r3', 'r6']]
df2 = df.loc['r1':'r5']      # label range; unlike iloc, both endpoints are included
df2 = df.loc['r1':'r5':2]    # 2 is the step
df2
Selecting columns

df2 = df[['Courses', 'Fee', 'Duration']]
df2

df.iloc[:, start:stop:step] selects columns by integer position:
import pandas as pd
import numpy as np
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    'Fee': [20000, 25000, 26000, 22000, 24000, 21000, 22000],
    'Duration': ['30days', '40days', '35days', '40days', np.nan, None, '55days'],
    'Discount': [1000, 2300, 1500, 1200, 2500, 2100, 2000],
}
index_labels=['r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies, index=index_labels)

# iloc
df2 = df.iloc[:, [1, 3]]   # select multiple columns by position
df2 = df.iloc[:, 1:4]      # select a range of positions, [1:4)
df2 = df.iloc[:, 2:]       # slice from position 2 to the end
df2 = df.iloc[:, :2]       # slice from the start up to position 2
df2
df.loc[:, start:stop:step] selects columns by name or label:

df2 = df.loc[:, ['Courses', 'Fee']]   # select multiple columns
df2 = df.loc[:, 'Fee':'Discount']     # columns from Fee to Discount, inclusive of Discount
df2 = df.loc[:, 'Fee':]               # slice from Fee to the last column
df2 = df.loc[:, :'Fee']               # slice from the start up to and including Fee
df2 = df.loc[:, ::2]                  # starting from the first column, select every other column
df2
The query() method

df.query() filters rows using an expression (single- or multi-column conditions) and returns a new df. To update an existing df in place, pass inplace=True. Syntax:
DataFrame.query(expr, inplace=False, **kwargs)
# expr: the condition expression
# kwargs: keyword arguments forwarded to eval
Example:
import pandas as pd
import numpy as np
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    'Fee': [20000, 25000, 26000, 22000, 24000, 21000, 22000],
    'Duration': ['30days', '40days', '35days', '40days', np.nan, None, '55days'],
    'Discount': [1000, 2300, 1500, 1200, 2500, 2100, 2000],
}
index_labels = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7']
df = pd.DataFrame(technologies, index=index_labels)

# Select rows where Courses equals "Spark"
df2 = df.query('Courses == "Spark"')
df2
# Query result:
#    Courses    Fee Duration  Discount
# r1   Spark  20000   30days      1000
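As one concrete example of the kwargs that query() forwards to eval, the engine argument selects the evaluator ('numexpr' when that package is installed, or the pure-Python fallback). Reusing the df defined above:

# Force the pure-Python evaluator instead of numexpr
df2 = df.query('Courses == "Spark"', engine='python')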
Updating the existing df (this replaces the current df in place):

# Using a variable in the expression
value = 'Spark'
df2 = df.query('Courses == @value')

# Update the existing df; note that with inplace=True, query() modifies df itself and returns None
df.query('Courses == @value', inplace=True)
df
Not-equals, in, and multiple conditions:

# not equals, in, and multiple conditions
df2 = df.query('Courses != "Spark"')
df2 = df.query('Courses in ("Spark", "Hadoop", "pandas")')

# Using variables
values = ['Spark', 'PySpark']
df2 = df.query('Courses in @values')
df2 = df.query('Courses not in @values')

df2 = df.query("Fee >= 23000")
df2 = df.query("Fee >= 23000 and Discount == 2300")
df2 = df.query("Fee >= 23000 & Discount == 2300")
The apply() method

With the default axis, apply() passes each column to the function; here each column is filtered by a boolean mask on Courses, which effectively returns the matching rows:

# Return rows where Courses is 'Spark' or 'PySpark'
df2 = df.apply(lambda row: row[df['Courses'].isin(['Spark', 'PySpark'])])
Other examples
df2 = df[df['Courses'] == 'Spark']
df2 = df.loc[df['Courses'] != 'Spark']
df2 = df.loc[df['Courses'].isin(['Spark', 'PySpark'])]
df2 = df.loc[~df['Courses'].isin(['Spark', 'PySpark'])]
df2 = df.loc[(df['Discount'] >= 1500) & (df['Discount'] <= 2500)]

# contains
df2 = df[df['Courses'].str.contains('Spark')]

# lowercase the values before matching
df2 = df[df['Courses'].str.lower().str.contains('spark')]

# startswith
df2 = df[df['Courses'].str.startswith('Spa')]
df2
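str.contains() also accepts a regular expression, which is handy for matching several patterns at once; a small sketch reusing the df above:

# Regex alternation; case=False makes the match case-insensitive,
# na=False treats missing values as non-matches
df2 = df[df['Courses'].str.contains('spark|hadoop', case=False, regex=True, na=False)]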
Getting a cell value

Ways to get a single cell value include loc[], iloc[], at[], and iat[].
import pandas as pd
import numpy as np
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "pandas"],
    'Fee': [24000, 25000, 25000, 24000, 24000],
    'Duration': ['30day', '50days', '55days', '40days', '60days'],
    'Discount': [1000, 2300, 1000, 1200, 2500],
}
index_labels = ['r1', 'r2', 'r3', 'r4', 'r5']
df = pd.DataFrame(technologies, index=index_labels)
df

    Courses    Fee Duration  Discount
r1    Spark  24000    30day      1000
r2  PySpark  25000   50days      2300
r3   Hadoop  25000   55days      1000
r4   Python  24000   40days      1200
r5   pandas  24000   60days      2500

# Using loc[]: index by label -- select the row labeled r4, then the Duration column
df.loc['r4']['Duration']   # '40days'
df.loc['r4'][2]            # '40days', position 2 (the third value)

# Using iloc[]
df.iloc[3]['Duration']     # '40days'
df.iloc[3, 2]              # '40days'

# Using at[]
df.at['r4', 'Duration']
df.at[df.index[3], 'Duration']

# Using iat[]
df.iat[3, 2]               # '40days'

# Get a cell value: select the column first, then index by position
df['Duration'].values[3]   # '40days'

# Get a cell value from the last row
df.iloc[-1, 2]                    # '60days'
df.iloc[-1]['Duration']           # '60days'
df.at[df.index[-1], 'Duration']   # '60days'
Summary:

loc[]: gets a cell by row and column labels (column names).
iloc[]: gets a cell by integer positions, from 0 to length-1; -1 is the last.
at[]: gets a cell by row and column labels; better performance, but does not support negative indexing from the last row or column.
iat[]: gets a cell by row and column integer positions; does not support negative positions.
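To back up the claim that at[] is faster for scalar access, here is a minimal timing sketch; the exact numbers will vary by machine and pandas version:

import timeit
import pandas as pd

tdf = pd.DataFrame({'Duration': ['30days'] * 1000})

# at[] skips the more general indexing machinery that loc[] goes through
print(timeit.timeit(lambda: tdf.loc[500, 'Duration'], number=10000))
print(timeit.timeit(lambda: tdf.at[500, 'Duration'], number=10000))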
Adding columns

In pandas, new columns can be added to an existing df. The main methods are:
DataFrame.insert(): inserts a column at a given position, modifying the df in place.
DataFrame.assign(): returns a new df with the added column(s).
Example:
import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1000, 2300, 1000, 1200, 2500],
}
df = pd.DataFrame(technologies)
print(df)

   Courses    Fee  Discount
0    Spark  22000      1000
1  PySpark  25000      2300
2   Hadoop  23000      1000
3   Python  24000      1200
4   Pandas  26000      2500
DataFrame.assign() returns a new df; it can add/append a new column to an existing df.
# Syntax
# DataFrame.assign(**kwargs)

# Add a new column TutorsAssigned
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df2 = df.assign(TutorsAssigned=tutors)
df2

   Courses    Fee  Discount TutorsAssigned
0    Spark  22000      1000        William
1  PySpark  25000      2300          Henry
2   Hadoop  23000      1000        Michael
3   Python  24000      1200           John
4   Pandas  26000      2500          Messi

# Add multiple columns
MNCCompanies = ['TATA','HCL','Infosys','Google','Amazon']
df2 = df.assign(MNCComp=MNCCompanies, TutorsAssigned=tutors)
df2["MNCCompanies"] = MNCCompanies  # direct assignment works too
In most cases, a new column is derived by computing on columns that already exist:
df2 = df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100)
df2
# Output:
   Courses    Fee  Discount  Discount_Percent
0    Spark  22000      1000          220000.0
1  PySpark  25000      2300          575000.0
2   Hadoop  23000      1000          230000.0
3   Python  24000      1200          288000.0
4   Pandas  26000      2500          650000.0

# Add multiple computed columns
df2 = df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100,
                New_Discount=lambda x: x.Discount * 1.3)
df['Discount_Percent'] = df['Fee'] * df['Discount'] / 100  # alternative: direct assignment
Adding constant or empty values

# Add constant or empty columns
df2 = df.assign(A=None, B=0, C="")
df2

   Courses    Fee  Discount     A  B C
0    Spark  22000      1000  None  0
1  PySpark  25000      2300  None  0
2   Hadoop  23000      1000  None  0
3   Python  24000      1200  None  0
4   Pandas  26000      2500  None  0
DataFrame.insert() adds a column at a specific position, in place:

import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1000, 2300, 1000, 1200, 2500],
}
df = pd.DataFrame(technologies)

# Insert the tutors list (defined earlier) as the first column
df.insert(0, 'Tutors', tutors)
df
Adding a dict as a column

If the data to add is a dict, use map() to align it on an existing column's values:

# Add from a dict keyed by course name
tutors = {"Spark": "William", "PySpark": "Henry", "Hadoop": "Michael", "Python": "John", "Pandas": "Messi"}
df["Tutors11"] = df['Courses'].map(tutors)

# Using loc[] with the same mapping
df.loc[:, 'loc_tutors_column'] = df['Courses'].map(tutors)
Renaming columns

Below is the syntax of pandas.DataFrame.rename(), which returns a DataFrame or None. By default it returns a new DataFrame with the updated columns; with inplace=True, it updates the existing DataFrame in place (self) and returns None.
# DataFrame.rename() Syntax
DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')
mapper: dict or function used to rename columns and the index.
index: dict or function to rename the index; when used with the axis parameter it is (mapper, axis=0), equivalent to index=mapper.
columns: dict or function to rename columns; when used with the axis parameter it is (mapper, axis=1), equivalent to columns=mapper.
axis: 0 or 'index' | 1 or 'columns'; default 0.
copy: also copy the underlying data; default True.
inplace: whether to update the current df; default False; when True, the copy attribute is ignored.
level: used with a MultiIndex; takes an int or level name; default None.
errors: accepts 'raise' or 'ignore'. With 'raise', a KeyError is raised when a dict-like mapper, index, or columns contains labels that do not exist in the index being transformed. With 'ignore', existing keys are renamed and extra keys are ignored. Default is 'ignore'.

import pandas as pd
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    'Fee': [20000, 25000, 26000, 22000, 24000, 21000, 22000],
    'Duration': ['30day', '40days', '35days', '40days', '60days', '50days', '55days'],
}
df = pd.DataFrame(technologies)
print(df.columns)  # Index(['Courses', 'Fee', 'Duration'], dtype='object')

# columns cannot be combined with axis
df2 = df.rename(columns={"Courses": "New_Courses"})
df2 = df.rename({"Courses": "New_Courses"}, axis=1)         # equivalent
df2 = df.rename({"Courses": "New_Courses"}, axis='columns')  # equivalent
print(df2.columns) # Index(['New_Courses', 'Fee', 'Duration'], dtype='object')
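The level parameter only applies to a MultiIndex; a minimal sketch with made-up two-level column labels:

import pandas as pd

# Hypothetical two-level column index, for illustration
cols = pd.MultiIndex.from_tuples([('course', 'name'), ('course', 'fee')])
mdf = pd.DataFrame([['Spark', 20000]], columns=cols)

# Rename labels only on the second level (level=1)
mdf2 = mdf.rename(columns={'fee': 'price'}, level=1)
print(mdf2.columns)  # MultiIndex([('course', 'name'), ('course', 'price')])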
Renaming multiple columns

# Rename multiple columns
df3 = df.rename(columns={'Courses':'Courses_List','Fee':'Courses_Fee', 'Duration':'Courses_Duration'})
print(df3.columns) # Index(['Courses_List', 'Courses_Fee', 'Courses_Duration'], dtype='object')
print(df.columns)  # Index(['Courses', 'Fee', 'Duration'], dtype='object')

df.rename(columns={'Courses': 'Courses_List', 'Fee': 'Courses_Fee', 'Duration': 'Courses_Duration'}, inplace=True)
print(df.columns) # Index(['Courses_List', 'Courses_Fee', 'Courses_Duration'], dtype='object')
Note: df.rename() returns a new df; with inplace=True it updates the current df directly and returns None.
# Update a column name directly by position
df.columns.values[2] = 'Courses_Duration'

# Replace columns with a list; the length must match or an error is raised
columns_names = ['Courses', 'Fee', 'Duration']
df.columns = columns_names

# Add a prefix or suffix to column names
# df.columns = ['col_' + str(col) for col in df.columns]  # method 1: updates df directly
df2 = df.add_prefix('col_')  # method 2: returns a new df
df2 = df.add_suffix('_col')
df2

# Rename columns with a lambda expression, updating df directly
df.rename(columns=lambda x: 'col_' + x, inplace=True)
df

# Convert column names to lower/upper case; returns a new df
df2 = df.rename(str.lower, axis='columns')
df2 = df.rename(str.upper, axis='columns')
df2
df2.columns  # Index(['COURSES', 'FEE', 'DURATION'], dtype='object')

# Use set_axis to replace all column labels (its inplace parameter was removed in pandas 2.0)
df = df.set_axis(['Courses_List', 'Course_Fee', 'Course_Duration'], axis=1)
The replace() method

replace() can substitute strings, Series, dicts, lists, numbers, regular expressions, and so on; applied to the column index via str.replace(), it rewrites column names:
df.columns = df.columns.str.replace('Fee', 'FeeCost')  # replace in a single column name
print(df.columns)  # Index(['Courses', 'FeeCost', 'Duration'], dtype='object')

df.columns = df.columns.str.replace('_', ' ')  # replace across all column names
df
By default, rename() does not throw an exception for missing labels; an exception can be requested explicitly:
# Throw an error when renaming a column that does not exist
df.rename(columns = {'Cour': 'Courses_List'}, errors = "raise")
Renaming only columns that exist

# Rename only the columns that exist
d={'Courses':'Courses_List','Fees':'Courses_fees'}
df.rename(columns={k: v for k, v in d.items() if k in df.columns}, inplace=True,errors = "raise")
print(df.columns)
Dropping rows and columns

The df.drop() method removes rows or columns: axis=0 (the default) drops rows; axis=1 or the columns parameter drops columns. It returns a new df; to drop from the existing df, use inplace=True.

Syntax:
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

# labels: the labels to drop, used together with axis
# axis: the axis parameter; default 0, set to 1 to drop columns
# index: the index to drop; accepts a single index value or a list
# columns: the columns to drop; accepts a single column label or a list
# level: optional, int or level name, for a MultiIndex
# inplace: default False, returns a copy (a new df); when True, drops in place and returns None
# errors: whether to raise an exception on missing labels
Note: drop() returns a new df; with inplace=True it drops in place and returns None.
Example:
import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python"],
    'Fee': [20000, 25000, 26000, 22000],
    'Duration': ['30day', '40days', np.nan, None],
    'Discount': [1000, 2300, 1500, 1200],
}
indexes = ['r1', 'r2', 'r3', 'r4']
df = pd.DataFrame(technologies, index=indexes)
df

# Result:
    Courses    Fee Duration  Discount
r1    Spark  20000    30day      1000
r2  PySpark  25000   40days      2300
r3   Hadoop  26000      NaN      1500
r4   Python  22000     None      1200
Dropping rows by index

# The following three are equivalent
# df1 = df.drop(index=['r1'])
# df1 = df.drop(labels=['r1'])
df1 = df.drop(labels=['r1'], axis=0)
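The example above drops rows; for completeness, here is a short sketch of dropping columns with axis=1 or the columns parameter, using the same df:

# Equivalent ways to drop a single column
df2 = df.drop('Fee', axis=1)
df2 = df.drop(columns=['Fee'])
df2 = df.drop(labels=['Fee'], axis=1)

# Drop multiple columns from the existing df
df.drop(columns=['Duration', 'Discount'], inplace=True)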
Reference: sparkbyexamples.com