16. How do I handle missing values in pandas? (video)
In [121]:
url3 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv"
ufo = pd.read_csv(url3)   # read a dataset of UFO reports into a DataFrame
ufo.tail()
Out[121]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 18236 | Grant Park | NaN | TRIANGLE | IL | 12/31/2000 23:00 |
| 18237 | Spirit Lake | NaN | DISK | IA | 12/31/2000 23:00 |
| 18238 | Eagle River | NaN | NaN | WI | 12/31/2000 23:45 |
| 18239 | Eagle River | RED | LIGHT | WI | 12/31/2000 23:45 |
| 18240 | Ybor | NaN | OVAL | FL | 12/31/2000 23:59 |
What does "NaN" mean?
"NaN" is not a string, but rather a special value: numpy.nan. It stands for "Not a Number" and indicates a missing value. read_csv detects missing values (by default) when reading the file, and replaces them with this special value.
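A quick sketch of how numpy.nan behaves (assuming NumPy and pandas are already imported, as in the cells above):

```python
import numpy as np
import pandas as pd

# NaN is a special float value, not a string
print(type(np.nan))          # <class 'float'>

# NaN is not equal to anything, not even itself...
print(np.nan == np.nan)      # False

# ...so pandas provides isna()/isnull() to detect it reliably
print(pd.isna(np.nan))       # True
```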
Documentation for read_csv
In [122]:
# 'isnull' returns a DataFrame of booleans (True if missing, False if not missing)
ufo.isnull().tail()
Out[122]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 18236 | False | True | False | False | False |
| 18237 | False | True | False | False | False |
| 18238 | False | True | True | False | False |
| 18239 | False | False | False | False | False |
| 18240 | False | True | False | False | False |
In [123]:
# 'notnull' returns the opposite of 'isnull' (True if not missing, False if missing)
ufo.notnull().tail()
Out[123]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 18236 | True | False | True | True | True |
| 18237 | True | False | True | True | True |
| 18238 | True | False | False | True | True |
| 18239 | True | True | True | True | True |
| 18240 | True | False | True | True | True |
Documentation for isnull
and notnull
In [124]:
# count the number of missing values in each Series
ufo.isnull().sum()
Out[124]:
City 25
Colors Reported 15359
Shape Reported 2644
State 0
Time 0
dtype: int64
This calculation works because:
1. By default, the sum method of a DataFrame operates on axis=0 (and thus produces column sums).
2. In order to add boolean values, pandas converts **True** to **1** and **False** to **0**.
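The two steps can be sketched on a tiny DataFrame (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [None, None, 6.0]})

# isnull() produces booleans; sum() runs down each column (axis=0),
# treating True as 1 and False as 0
missing = df.isnull().sum()
print(missing['a'])   # 1
print(missing['b'])   # 2
```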
In [125]:
# use the 'isnull' Series method to filter the DataFrame rows
ufo[ufo.City.isnull()].head()
Out[125]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 21 | NaN | NaN | NaN | LA | 8/15/1943 0:00 |
| 22 | NaN | NaN | LIGHT | LA | 8/15/1943 0:00 |
| 204 | NaN | NaN | DISK | CA | 7/15/1952 12:30 |
| 241 | NaN | BLUE | DISK | MT | 7/4/1953 14:00 |
| 613 | NaN | NaN | DISK | NV | 7/1/1960 12:00 |
How to handle missing values depends on the dataset as well as the nature of your analysis. Here are some options:
In [126]:
# examine the number of rows and columns
ufo.shape
Out[126]:
(18241, 5)
In [127]:
# if "any" values are missing in a row, then drop that row
ufo.dropna(how='any').shape
Out[127]:
(2486, 5)
Documentation for dropna
In [128]:
# the 'inplace' parameter of 'dropna' is False by default, so the rows were only dropped temporarily
ufo.shape
Out[128]:
(18241, 5)
In [129]:
# if "all" values are missing in a row, then drop that row (none are dropped in this case)
ufo.dropna(how='all').shape
Out[129]:
(18241, 5)
In [130]:
# restrict 'any' to a subset of columns
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape
Out[130]:
(15576, 5)
In [131]:
# restrict 'all' to a subset of columns
ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape
Out[131]:
(18237, 5)
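The dropna variants above can be compared side by side on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, None, None],
                   'y': [4.0, 5.0, None]})

print(df.dropna(how='any').shape)                # (1, 2): only the fully complete row survives
print(df.dropna(how='all').shape)                # (2, 2): only the all-missing row is dropped
print(df.dropna(subset=['y'], how='any').shape)  # (2, 2): only missing values in 'y' count
```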
In [132]:
# 'value_counts' does not include missing values by default
ufo['Shape Reported'].value_counts().head()
Out[132]:
LIGHT 2803
DISK 2122
TRIANGLE 1889
OTHER 1402
CIRCLE 1365
Name: Shape Reported, dtype: int64
In [133]:
# explicitly include missing values
ufo['Shape Reported'].value_counts(dropna=False).head()
Out[133]:
LIGHT 2803
NaN 2644
DISK 2122
TRIANGLE 1889
OTHER 1402
Name: Shape Reported, dtype: int64
Documentation for value_counts
In [134]:
# fill in missing values with a specified value
ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)
Documentation for fillna
In [135]:
# confirm that the missing values were filled in
ufo['Shape Reported'].value_counts().head()
Out[135]:
VARIOUS 2977
LIGHT 2803
DISK 2122
TRIANGLE 1889
OTHER 1402
Name: Shape Reported, dtype: int64
Working with missing data in pandas
[Back to top]
17. What do I need to know about the pandas index? (Part 1) (video)
In [137]:
drinks = pd.read_csv(url7)
drinks.head()
Out[137]:
|  | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
| 1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
| 4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
In [138]:
# every DataFrame has an index (sometimes called the "row labels")
drinks.index
Out[138]:
RangeIndex(start=0, stop=193, step=1)
In [139]:
# column names are also stored in a special "index" object
drinks.columns
Out[139]:
Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings',
'total_litres_of_pure_alcohol', 'continent'],
dtype='object')
In [140]:
# neither the index nor the column names are included in the shape
drinks.shape
Out[140]:
(193, 6)
In [141]:
# the index and column names default to integers if you don't define them
pd.read_table('http://bit.ly/movieusers', header=None, sep='|').head()
Out[141]:
|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |
What is the index used for?
1. identification
2. selection
3. alignment
In [142]:
# identification: the index stays with each row when filtering the DataFrame
drinks[drinks.continent=='South America']
Out[142]:
|  | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 6 | Argentina | 193 | 25 | 221 | 8.3 | South America |
| 20 | Bolivia | 167 | 41 | 8 | 3.8 | South America |
| 23 | Brazil | 245 | 145 | 16 | 7.2 | South America |
| 35 | Chile | 130 | 124 | 172 | 7.6 | South America |
| 37 | Colombia | 159 | 76 | 3 | 4.2 | South America |
| 52 | Ecuador | 162 | 74 | 3 | 4.2 | South America |
| 72 | Guyana | 93 | 302 | 1 | 7.1 | South America |
| 132 | Paraguay | 213 | 117 | 74 | 7.3 | South America |
| 133 | Peru | 163 | 160 | 21 | 6.1 | South America |
| 163 | Suriname | 128 | 178 | 7 | 5.6 | South America |
| 185 | Uruguay | 115 | 35 | 220 | 6.6 | South America |
| 188 | Venezuela | 333 | 100 | 3 | 7.7 | South America |
In [143]:
# selection: use the index to select a portion of the DataFrame
drinks.loc[23, 'beer_servings']
Out[143]:
245
Documentation for loc
In [144]:
# set an existing column as the index
drinks.set_index('country', inplace=True)
drinks.head()
Out[144]:
| country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|
| Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
| Albania | 89 | 132 | 54 | 4.9 | Europe |
| Algeria | 25 | 0 | 14 | 0.7 | Africa |
| Andorra | 245 | 138 | 312 | 12.4 | Europe |
| Angola | 217 | 57 | 45 | 5.9 | Africa |
Documentation for set_index
In [145]:
# 'country' is now the index
drinks.index
Out[145]:
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
...
'Tanzania', 'USA', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela',
'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
dtype='object', name='country', length=193)
In [146]:
# 'country' is no longer a column name
drinks.columns
Out[146]:
Index(['beer_servings', 'spirit_servings', 'wine_servings',
'total_litres_of_pure_alcohol', 'continent'],
dtype='object')
In [147]:
# the 'country' data is no longer part of the DataFrame contents
drinks.shape
Out[147]:
(193, 5)
In [148]:
# country names can now be used for selection
drinks.loc['Brazil', 'beer_servings']
Out[148]:
245
In [149]:
# the index name is optional
drinks.index.name = None
drinks.head()
Out[149]:
|  | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|
| Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
| Albania | 89 | 132 | 54 | 4.9 | Europe |
| Algeria | 25 | 0 | 14 | 0.7 | Africa |
| Andorra | 245 | 138 | 312 | 12.4 | Europe |
| Angola | 217 | 57 | 45 | 5.9 | Africa |
In [150]:
# restore the index name, and move the index back to a column
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
drinks.head()
Out[150]:
|  | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
| 1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
| 4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
Documentation for reset_index
In [151]:
# many DataFrame methods output a DataFrame
drinks.describe()
Out[151]:
|  | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|
| count | 193.000000 | 193.000000 | 193.000000 | 193.000000 |
| mean | 106.160622 | 80.994819 | 49.450777 | 4.717098 |
| std | 101.143103 | 88.284312 | 79.697598 | 3.773298 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 20.000000 | 4.000000 | 1.000000 | 1.300000 |
| 50% | 76.000000 | 56.000000 | 8.000000 | 4.200000 |
| 75% | 188.000000 | 128.000000 | 59.000000 | 7.200000 |
| max | 376.000000 | 438.000000 | 370.000000 | 14.400000 |
In [152]:
# you can interact with any DataFrame using its index and columns
drinks.describe().loc['25%', 'beer_servings']
Out[152]:
20.0
Indexing and selecting data
[Back to top]
18. What do I need to know about the pandas index? (Part 2) (video)
In [153]:
drinks = pd.read_csv(url7)
drinks.head()
Out[153]:
|  | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
| 1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
| 4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
In [154]:
# every DataFrame has an index
drinks.index
Out[154]:
RangeIndex(start=0, stop=193, step=1)
In [155]:
# every Series also has an index (which carries over from the DataFrame)
drinks.continent.head()
Out[155]:
0 Asia
1 Europe
2 Africa
3 Europe
4 Africa
Name: continent, dtype: object
In [156]:
# set 'country' as the index
drinks.set_index('country', inplace=True)
Documentation for set_index
In [157]:
# the Series index is on the left, and the values are on the right
drinks.continent.head()
Out[157]:
country
Afghanistan Asia
Albania Europe
Algeria Africa
Andorra Europe
Angola Africa
Name: continent, dtype: object
In [158]:
# another example of a Series (output from the 'value_counts' method)
drinks.continent.value_counts()
Out[158]:
Africa 53
Europe 45
Asia 44
North America 23
Oceania 16
South America 12
Name: continent, dtype: int64
Documentation for value_counts
In [159]:
# access the Series index
drinks.continent.value_counts().index
Out[159]:
Index(['Africa', 'Europe', 'Asia', 'North America', 'Oceania',
'South America'],
dtype='object')
In [160]:
# access the Series values
drinks.continent.value_counts().values
Out[160]:
array([53, 45, 44, 23, 16, 12], dtype=int64)
In [161]:
# elements in a Series can be selected by index (using bracket notation)
drinks.continent.value_counts()['Africa']
Out[161]:
53
In [162]:
# any Series can be sorted by its values
drinks.continent.value_counts().sort_values()
Out[162]:
South America 12
Oceania 16
North America 23
Asia 44
Europe 45
Africa 53
Name: continent, dtype: int64
In [163]:
# any Series can be sorted by its index
drinks.continent.value_counts().sort_index()
Out[163]:
Africa 53
Asia 44
Europe 45
North America 23
Oceania 16
South America 12
Name: continent, dtype: int64
Documentation for sort_values
and sort_index
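A minimal sketch of the value_counts/sorting pattern above, using a made-up Series:

```python
import pandas as pd

s = pd.Series(['Africa', 'Asia', 'Africa', 'Europe'])
counts = s.value_counts()

print(counts['Africa'])                    # 2 -- select by index label
print(counts.sort_index().index.tolist())  # ['Africa', 'Asia', 'Europe'] -- sorted by index
print(counts.sort_values().iloc[-1])       # 2 -- ascending sort puts the largest count last
```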
What is the index used for?
1. identification
2. selection
3. alignment
In [164]:
# 'beer_servings' contains the average annual beer servings per person
drinks.beer_servings.head()
Out[164]:
country
Afghanistan 0
Albania 89
Algeria 25
Andorra 245
Angola 217
Name: beer_servings, dtype: int64
In [165]:
# create a Series containing the population of two countries
people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')
people
Out[165]:
Albania 3000000
Andorra 85000
Name: population, dtype: int64
Documentation for Series
In [166]:
# calculate the total annual beer servings for each country
(drinks.beer_servings * people).head()
Out[166]:
Afghanistan NaN
Albania 267000000.0
Algeria NaN
Andorra 20825000.0
Angola NaN
dtype: float64
The two Series were aligned by their indexes. If a value is missing in either Series, the result is marked as NaN. Alignment enables us to easily work with incomplete data.
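Alignment can be demonstrated in isolation with two hand-built Series (the values are taken from the example above, plus a made-up 'Elsewhere' entry):

```python
import pandas as pd

servings = pd.Series([89, 245], index=['Albania', 'Andorra'])
population = pd.Series([3000000, 85000, 200000],
                       index=['Albania', 'Andorra', 'Elsewhere'])

total = servings * population        # multiplied label-by-label, not position-by-position
print(total['Albania'])              # 267000000.0 (float, since the NaN forces a float dtype)
print(pd.isna(total['Elsewhere']))   # True: 'Elsewhere' has no match in 'servings'
```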
In [167]:
# concatenate the 'drinks' DataFrame with the 'population' Series (aligns by the index)
pd.concat([drinks, people], axis=1).head()
Out[167]:
|  | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | population |
|---|---|---|---|---|---|---|
| Afghanistan | 0 | 0 | 0 | 0.0 | Asia | NaN |
| Albania | 89 | 132 | 54 | 4.9 | Europe | 3000000.0 |
| Algeria | 25 | 0 | 14 | 0.7 | Africa | NaN |
| Andorra | 245 | 138 | 312 | 12.4 | Europe | 85000.0 |
| Angola | 217 | 57 | 45 | 5.9 | Africa | NaN |
Documentation for concat
Indexing and selecting data
[Back to top]
19. How do I select multiple rows and columns from a pandas DataFrame? (video)
In [168]:
ufo = pd.read_csv(url3)
ufo.head(3)
Out[168]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
The loc method is used to select rows and columns by label. You can pass it:
- a single label
- a list of labels
- a slice of labels
- a boolean Series
- a colon (which indicates "all labels")
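Each of those input types can be sketched on a small hand-built DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'City': ['Ithaca', 'Holyoke', 'Abilene'],
                   'State': ['NY', 'CO', 'KS']})

print(df.loc[0, 'City'])                          # single labels -> scalar: Ithaca
print(df.loc[[0, 2], 'City'].tolist())            # list of labels: ['Ithaca', 'Abilene']
print(df.loc[0:1, 'City':'State'].shape)          # label slices (inclusive): (2, 2)
print(df.loc[df.State == 'CO', 'City'].tolist())  # boolean Series: ['Holyoke']
```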
In [169]:
# row 0, all columns
ufo.loc[0, :]
Out[169]:
City Ithaca
Colors Reported NaN
Shape Reported TRIANGLE
State NY
Time 6/1/1930 22:00
Name: 0, dtype: object
In [170]:
# rows 0, 1, and 2, all columns
ufo.loc[[0, 1, 2], :]
Out[170]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
In [171]:
# rows 0 through 2 (inclusive), all columns
ufo.loc[0:2, :]
Out[171]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
In [172]:
# this implies "all columns", but explicitly stating "all columns" is better
ufo.loc[0:2]
Out[172]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
In [173]:
# rows 0 through 2 (inclusive), column 'City'
ufo.loc[0:2, 'City']
Out[173]:
0 Ithaca
1 Willingboro
2 Holyoke
Name: City, dtype: object
In [174]:
# rows 0 through 2 (inclusive), columns 'City' and 'State'
ufo.loc[0:2, ['City', 'State']]
Out[174]:
|  | City | State |
|---|---|---|
| 0 | Ithaca | NY |
| 1 | Willingboro | NJ |
| 2 | Holyoke | CO |
In [175]:
# accomplish the same thing using double brackets - but 'loc' is preferred since it's more explicit
ufo[['City', 'State']].head(3)
Out[175]:
|  | City | State |
|---|---|---|
| 0 | Ithaca | NY |
| 1 | Willingboro | NJ |
| 2 | Holyoke | CO |
In [176]:
# rows 0 through 2 (inclusive), columns 'City' through 'State' (inclusive)
ufo.loc[0:2, 'City':'State']
Out[176]:
|  | City | Colors Reported | Shape Reported | State |
|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY |
| 1 | Willingboro | NaN | OTHER | NJ |
| 2 | Holyoke | NaN | OVAL | CO |
In [177]:
# accomplish the same thing using 'head' and 'drop'
ufo.head(3).drop('Time', axis=1)
Out[177]:
|  | City | Colors Reported | Shape Reported | State |
|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY |
| 1 | Willingboro | NaN | OTHER | NJ |
| 2 | Holyoke | NaN | OVAL | CO |
In [178]:
# rows in which 'City' is 'Oakland', column 'State'
ufo.loc[ufo.City=='Oakland', 'State']
Out[178]:
1694 CA
2144 CA
4686 MD
7293 CA
8488 CA
8768 CA
10816 OR
10948 CA
11045 CA
12322 CA
12941 CA
16803 MD
17322 CA
Name: State, dtype: object
In [179]:
# accomplish the same thing using "chained indexing" - but 'loc' is preferred since chained indexing can cause problems
ufo[ufo.City=='Oakland'].State
Out[179]:
1694 CA
2144 CA
4686 MD
7293 CA
8488 CA
8768 CA
10816 OR
10948 CA
11045 CA
12322 CA
12941 CA
16803 MD
17322 CA
Name: State, dtype: object
The iloc method is used to select rows and columns by integer position. You can pass it:
- a single integer position
- a list of integer positions
- a slice of integer positions
- a colon (which indicates "all integer positions")
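The same input types for iloc, sketched on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]})

print(df.iloc[0, 1])                         # single positions -> scalar: 1
print(df.iloc[[0, 2], [0]].values.tolist())  # lists of positions: [[10], [30]]
print(df.iloc[0:2, :].shape)                 # position slices (exclusive of the stop): (2, 2)
```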
In [180]:
# rows in positions 0 and 1, columns in positions 0 and 3
ufo.iloc[[0, 1], [0, 3]]
Out[180]:
|  | City | State |
|---|---|---|
| 0 | Ithaca | NY |
| 1 | Willingboro | NJ |
In [181]:
# rows in positions 0 through 2 (exclusive), columns in positions 0 through 4 (exclusive)
ufo.iloc[0:2, 0:4]
Out[181]:
|  | City | Colors Reported | Shape Reported | State |
|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY |
| 1 | Willingboro | NaN | OTHER | NJ |
In [182]:
# rows in positions 0 through 2 (exclusive), all columns
ufo.iloc[0:2, :]
Out[182]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
In [183]:
# accomplish the same thing - but 'iloc' is preferred since it's more explicit
ufo[0:2]
Out[183]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
The ix method (deprecated) is used to select rows and columns by label or integer position, and should only be used when you need to mix label-based and position-based selection in the same call.
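Note that .ix has since been removed from pandas entirely (as the DeprecationWarnings below anticipate). A sketch of the modern equivalent: translate the position to a label (or vice versa) before calling loc:

```python
import pandas as pd

drinks = pd.DataFrame({'beer_servings': [0, 89]},
                      index=['Afghanistan', 'Albania'])

# old: drinks.ix['Albania', 0] -- a mix of label and position
print(drinks.loc['Albania', drinks.columns[0]])      # 89

# old: drinks.ix[1, 'beer_servings']
print(drinks.loc[drinks.index[1], 'beer_servings'])  # 89
```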
In [184]:
# read a dataset of alcohol consumption into a DataFrame and set 'country' as the index
drinks = pd.read_csv(url7, index_col='country')
drinks.head()
Out[184]:
| country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|
| Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
| Albania | 89 | 132 | 54 | 4.9 | Europe |
| Algeria | 25 | 0 | 14 | 0.7 | Africa |
| Andorra | 245 | 138 | 312 | 12.4 | Europe |
| Angola | 217 | 57 | 45 | 5.9 | Africa |
In [185]:
# row with label 'Albania', column in position 0
drinks.ix['Albania', 0]
c:\users\qq123\anaconda3\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
Out[185]:
89
In [186]:
# row in position 1, column with label 'beer_servings'
drinks.ix[1, 'beer_servings']
Out[186]:
89
Rules for using numbers with ix:
- If the index is strings, numbers are treated as integer positions, and thus slices are exclusive on the right.
- If the index is integers, numbers are treated as labels, and thus slices are inclusive.
In [187]:
# rows 'Albania' through 'Andorra' (inclusive), columns in positions 0 through 2 (exclusive)
drinks.ix['Albania':'Andorra', 0:2]
Out[187]:
| country | beer_servings | spirit_servings |
|---|---|---|
| Albania | 89 | 132 |
| Algeria | 25 | 0 |
| Andorra | 245 | 138 |
In [188]:
# rows 0 through 2 (inclusive), columns in positions 0 through 2 (exclusive)
ufo.ix[0:2, 0:2]
c:\users\qq123\anaconda3\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
Out[188]:
|  | City | Colors Reported |
|---|---|---|
| 0 | Ithaca | NaN |
| 1 | Willingboro | NaN |
| 2 | Holyoke | NaN |
Summary of the pandas API for selection
[Back to top]
20. When should I use the "inplace" parameter in pandas? (video)
In [189]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv(url3)
ufo.head()
Out[189]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 |
| 4 | New York Worlds Fair | NaN | LIGHT | NY | 4/18/1933 19:00 |
In [190]:
ufo.shape
Out[190]:
(18241, 5)
In [191]:
# remove the 'City' column (doesn't affect the DataFrame since inplace=False)
ufo.drop('City', axis=1).head()
Out[191]:
|  | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|
| 0 | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | NaN | OVAL | CO | 2/15/1931 14:00 |
| 3 | NaN | DISK | KS | 6/1/1931 13:00 |
| 4 | NaN | LIGHT | NY | 4/18/1933 19:00 |
In [192]:
# confirm that the 'City' column was not actually removed
ufo.head()
Out[192]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 |
| 4 | New York Worlds Fair | NaN | LIGHT | NY | 4/18/1933 19:00 |
In [193]:
# remove the 'City' column (does affect the DataFrame since inplace=True)
ufo.drop('City', axis=1, inplace=True)
In [194]:
# confirm that the 'City' column was actually removed
ufo.head()
Out[194]:
|  | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|
| 0 | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | NaN | OVAL | CO | 2/15/1931 14:00 |
| 3 | NaN | DISK | KS | 6/1/1931 13:00 |
| 4 | NaN | LIGHT | NY | 4/18/1933 19:00 |
In [195]:
# drop a row if any of its values are missing (doesn't affect the DataFrame since inplace=False)
ufo.dropna(how='any').shape
Out[195]:
(2490, 4)
In [196]:
# confirm that no rows were actually removed
ufo.shape
Out[196]:
(18241, 4)
In [197]:
# use an assignment statement instead of the 'inplace' parameter
ufo = ufo.set_index('Time')
ufo.tail()
Out[197]:
| Time | Colors Reported | Shape Reported | State |
|---|---|---|---|
| 12/31/2000 23:00 | NaN | TRIANGLE | IL |
| 12/31/2000 23:00 | NaN | DISK | IA |
| 12/31/2000 23:45 | NaN | NaN | WI |
| 12/31/2000 23:45 | RED | LIGHT | WI |
| 12/31/2000 23:59 | NaN | OVAL | FL |
In [198]:
# fill in missing values using a "backward fill" strategy (doesn't affect the DataFrame since inplace=False)
ufo.fillna(method='bfill').tail()
Out[198]:
| Time | Colors Reported | Shape Reported | State |
|---|---|---|---|
| 12/31/2000 23:00 | RED | TRIANGLE | IL |
| 12/31/2000 23:00 | RED | DISK | IA |
| 12/31/2000 23:45 | RED | LIGHT | WI |
| 12/31/2000 23:45 | RED | LIGHT | WI |
| 12/31/2000 23:59 | NaN | OVAL | FL |
In [199]:
# compare with a "forward fill" strategy (doesn't affect the DataFrame since inplace=False)
ufo.fillna(method='ffill').tail()
Out[199]:
| Time | Colors Reported | Shape Reported | State |
|---|---|---|---|
| 12/31/2000 23:00 | RED | TRIANGLE | IL |
| 12/31/2000 23:00 | RED | DISK | IA |
| 12/31/2000 23:45 | RED | DISK | WI |
| 12/31/2000 23:45 | RED | LIGHT | WI |
| 12/31/2000 23:59 | RED | OVAL | FL |
[Back to top]
21. How do I make my pandas DataFrame smaller and faster? (video)
In [200]:
drinks = pd.read_csv(url7)
drinks.head()
Out[200]:
|  | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
| 1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
| 4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
In [201]:
# the exact memory usage is unknown because object columns are references elsewhere
drinks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country 193 non-null object
beer_servings 193 non-null int64
spirit_servings 193 non-null int64
wine_servings 193 non-null int64
total_litres_of_pure_alcohol 193 non-null float64
continent 193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.1+ KB
In [202]:
# force pandas to calculate the true memory usage
drinks.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country 193 non-null object
beer_servings 193 non-null int64
spirit_servings 193 non-null int64
wine_servings 193 non-null int64
total_litres_of_pure_alcohol 193 non-null float64
continent 193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 30.4 KB
In [203]:
# calculate the memory usage for each Series (in bytes)
drinks.memory_usage(deep=True)
Out[203]:
Index 80
country 12588
beer_servings 1544
spirit_servings 1544
wine_servings 1544
total_litres_of_pure_alcohol 1544
continent 12332
dtype: int64
Documentation for info
and memory_usage
In [204]:
# use the 'category' data type (new in pandas 0.15) to store the 'continent' strings as integers
drinks['continent'] = drinks.continent.astype('category')
drinks.dtypes
Out[204]:
country object
beer_servings int64
spirit_servings int64
wine_servings int64
total_litres_of_pure_alcohol float64
continent category
dtype: object
In [205]:
# the 'continent' Series appears to be unchanged
drinks.continent.head()
Out[205]:
0 Asia
1 Europe
2 Africa
3 Europe
4 Africa
Name: continent, dtype: category
Categories (6, object): [Africa, Asia, Europe, North America, Oceania, South America]
In [206]:
# strings are now encoded (0 means 'Africa', 1 means 'Asia', 2 means 'Europe', etc.)
drinks.continent.cat.codes.head()
Out[206]:
0 1
1 2
2 0
3 2
4 0
dtype: int8
In [207]:
# memory usage has been drastically reduced
drinks.memory_usage(deep=True)
Out[207]:
Index 80
country 12588
beer_servings 1544
spirit_servings 1544
wine_servings 1544
total_litres_of_pure_alcohol 1544
continent 744
dtype: int64
In [208]:
# repeat this process for the 'country' Series
drinks['country'] = drinks.country.astype('category')
drinks.memory_usage(deep=True)
Out[208]:
Index 80
country 18094
beer_servings 1544
spirit_servings 1544
wine_servings 1544
total_litres_of_pure_alcohol 1544
continent 744
dtype: int64
In [209]:
# memory usage increased because we created 193 categories
drinks.country.cat.categories
Out[209]:
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
...
'United Arab Emirates', 'United Kingdom', 'Uruguay', 'Uzbekistan',
'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
dtype='object', length=193)
The 'category' data type should only be used with a string Series that has a small number of possible values.
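The memory effect can be sketched with a made-up Series containing many repeats of a few values:

```python
import pandas as pd

s = pd.Series(['Asia', 'Europe', 'Africa'] * 1000)
cat = s.astype('category')

# each distinct value is stored once in the categories,
# plus one small integer code per row
print(cat.cat.codes[:3].tolist())   # [1, 2, 0] (categories sort alphabetically: Africa=0)
print(cat.memory_usage(deep=True) < s.memory_usage(deep=True))   # True
```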
In [210]:
# create a small DataFrame from a dictionary
df = pd.DataFrame({'ID':[100, 101, 102, 103], 'quality':['good', 'very good', 'good', 'excellent']})
df
Out[210]:
|  | ID | quality |
|---|---|---|
| 0 | 100 | good |
| 1 | 101 | very good |
| 2 | 102 | good |
| 3 | 103 | excellent |
In [211]:
# sort the DataFrame by the 'quality' Series (alphabetical order)
df.sort_values('quality')
Out[211]:
|  | ID | quality |
|---|---|---|
| 3 | 103 | excellent |
| 0 | 100 | good |
| 2 | 102 | good |
| 1 | 101 | very good |
In [212]:
# define a logical ordering for the categories
df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)
df.quality
c:\users\qq123\anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead
Out[212]:
0 good
1 very good
2 good
3 excellent
Name: quality, dtype: category
Categories (3, object): [good < very good < excellent]
In [213]:
# sort the DataFrame by the 'quality' Series (logical order)
df.sort_values('quality')
Out[213]:
|  | ID | quality |
|---|---|---|
| 0 | 100 | good |
| 2 | 102 | good |
| 1 | 101 | very good |
| 3 | 103 | excellent |
In [214]:
# comparison operators work with ordered categories
df.loc[df.quality > 'good', :]
Out[214]:
|  | ID | quality |
|---|---|---|
| 1 | 101 | very good |
| 3 | 103 | excellent |
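The astype(categories=..., ordered=...) call above is deprecated (as the FutureWarning shows); the modern form passes a CategoricalDtype instead:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'ID': [100, 101, 102, 103],
                   'quality': ['good', 'very good', 'good', 'excellent']})

# define the logical ordering once, as a dtype object
quality_dtype = CategoricalDtype(categories=['good', 'very good', 'excellent'],
                                 ordered=True)
df['quality'] = df['quality'].astype(quality_dtype)

print(df.sort_values('quality')['ID'].tolist())    # [100, 102, 101, 103]
print(df.loc[df.quality > 'good', 'ID'].tolist())  # [101, 103]
```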
Overview of categorical data in pandas
API reference for categorical methods
[Back to top]
22. How do I use pandas with scikit-learn to create kaggle submissions? (video)
In [215]:
# read the training dataset from Kaggle's Titanic competition into a DataFrame
url5 = 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/titanic_train.csv'
train = pd.read_csv(url5)
train.head()
Out[215]:
|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Goal: Predict passenger survival aboard the Titanic based on passenger attributes
Video: What is machine learning, and how does it work?
In [216]:
# Goal: predict passenger survival aboard the Titanic based on passenger attributes
In [217]:
# create a feature matrix 'X' by selecting two DataFrame columns
feature_cols = ['Pclass', 'Parch']
X = train.loc[:, feature_cols]
X.shape
Out[217]:
(891, 2)
In [218]:
# create a response vector 'y' by selecting a Series
y = train.Survived
y.shape
Out[218]:
(891,)
Note: There is no need to convert these pandas objects to NumPy arrays. scikit-learn will understand these objects as long as they are entirely numeric and the proper shapes.
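A quick sketch of the X/y shape convention, using a made-up three-row training frame:

```python
import pandas as pd

train = pd.DataFrame({'Pclass': [3, 1, 3],
                      'Parch': [0, 0, 0],
                      'Survived': [0, 1, 1]})

feature_cols = ['Pclass', 'Parch']
X = train.loc[:, feature_cols]   # 2-D feature matrix (a DataFrame)
y = train.Survived               # 1-D response vector (a Series)

print(X.shape)   # (3, 2)
print(y.shape)   # (3,)
```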
In [219]:
# there is no need to convert these pandas objects to NumPy arrays; scikit-learn will understand them as long as they are entirely numeric and have the proper shapes
In [220]:
# fit a classification model to the training data
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
Out[220]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
Video series: Introduction to machine learning with scikit-learn
In [221]:
# read the testing dataset from Kaggle's Titanic competition into a DataFrame
url6 ="https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/titanic_test.csv"
test = pd.read_csv(url6)
test.head()
Out[221]:
|  | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
In [222]:
# create a feature matrix from the testing data that matches the training data
X_new = test.loc[:, feature_cols]
X_new.shape
Out[222]:
(418, 2)
In [223]:
# use the fitted model to make predictions for the testing set
new_pred_class = logreg.predict(X_new)
In [224]:
# create a DataFrame of the predictions
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).head()
Out[224]:
|  | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 0 |
Documentation for the DataFrame
constructor
In [225]:
# ensure that PassengerId is the first column by setting it as the index
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').head()
Out[225]:
| PassengerId | Survived |
|---|---|
| 892 | 0 |
| 893 | 0 |
| 894 | 0 |
| 895 | 0 |
| 896 | 0 |
In [226]:
# write the DataFrame to a CSV file that can be submitted to Kaggle
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').to_csv('sub.csv')
Documentation for to_csv
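The CSV layout can be checked without touching disk by writing to an in-memory buffer (a sketch; 'sub.csv' above is the real submission file):

```python
import io
import pandas as pd

sub = pd.DataFrame({'PassengerId': [892, 893],
                    'Survived': [0, 0]}).set_index('PassengerId')

buf = io.StringIO()
sub.to_csv(buf)
print(buf.getvalue())
# PassengerId,Survived
# 892,0
# 893,0
```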
In [227]:
# save a DataFrame to disk ("pickle it")
train.to_pickle('train.pkl')
In [228]:
# read a pickled object from disk ("unpickle it")
pd.read_pickle('train.pkl').head()
Out[228]:
|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Documentation for to_pickle
and read_pickle
[Back to top]
23. More of your pandas questions answered! (video)
Question: Could you explain how to read the pandas documentation?
pandas API reference
Question: What is the difference between ufo.isnull()
and pd.isnull(ufo)
?
In [229]:
ufo = pd.read_csv(url3)
ufo.head()
Out[229]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 |
| 4 | New York Worlds Fair | NaN | LIGHT | NY | 4/18/1933 19:00 |
In [230]:
# use 'isnull' as a top-level function
pd.isnull(ufo).head()
Out[230]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | False | True | False | False | False |
| 1 | False | True | False | False | False |
| 2 | False | True | False | False | False |
| 3 | False | True | False | False | False |
| 4 | False | True | False | False | False |
In [231]:
# equivalent: use 'isnull' as a DataFrame method
ufo.isnull().head()
Out[231]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | False | True | False | False | False |
| 1 | False | True | False | False | False |
| 2 | False | True | False | False | False |
| 3 | False | True | False | False | False |
| 4 | False | True | False | False | False |
Documentation for isnull
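The equivalence is easy to verify on a tiny made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan]})

# the top-level function and the method produce identical results
print(pd.isnull(df).equals(df.isnull()))   # True
```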
Question: Why are DataFrame slices inclusive when using .loc
, but exclusive when using .iloc
?
In [232]:
# label-based slicing is inclusive of the start and stop
ufo.loc[0:4, :]
Out[232]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 |
| 4 | New York Worlds Fair | NaN | LIGHT | NY | 4/18/1933 19:00 |
In [233]:
# position-based slicing is inclusive of the start and exclusive of the stop
ufo.iloc[0:4, :]
Out[233]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 |
Documentation for loc
and iloc
In [234]:
# 'iloc' is simply following NumPy's slicing convention...
ufo.values[0:4, :]
Out[234]:
array([['Ithaca', nan, 'TRIANGLE', 'NY', '6/1/1930 22:00'],
['Willingboro', nan, 'OTHER', 'NJ', '6/30/1930 20:00'],
['Holyoke', nan, 'OVAL', 'CO', '2/15/1931 14:00'],
['Abilene', nan, 'DISK', 'KS', '6/1/1931 13:00']], dtype=object)
In [235]:
# ...and NumPy is simply following Python's slicing convention
'python'[0:4]
Out[235]:
'pyth'
In [236]:
# 'loc' includes the stop label because you don't necessarily know what label will come after it
ufo.loc[0:4, 'City':'State']
Out[236]:
|  | City | Colors Reported | Shape Reported | State |
|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY |
| 1 | Willingboro | NaN | OTHER | NJ |
| 2 | Holyoke | NaN | OVAL | CO |
| 3 | Abilene | NaN | DISK | KS |
| 4 | New York Worlds Fair | NaN | LIGHT | NY |
Question: How do I randomly sample rows from a DataFrame?
In [237]:
# sample 3 rows from the DataFrame without replacement
ufo.sample(n=3)
Out[237]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 16158 | Summer Lake | RED ORANGE | SPHERE | OR | 3/28/2000 7:45 |
| 3334 | San Jose | NaN | DISK | CA | 8/1/1976 21:00 |
| 15116 | Dallas | GREEN | FLASH | TX | 11/16/1999 5:00 |
Documentation for sample
In [238]:
# use the 'random_state' parameter for reproducibility
ufo.sample(n=3, random_state=42)
Out[238]:
|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 217 | Norridgewock | NaN | DISK | ME | 9/15/1952 14:00 |
| 12282 | Ipava | NaN | TRIANGLE | IL | 10/1/1998 21:15 |
| 17933 | Ellinwood | NaN | FIREBALL | KS | 11/13/2000 22:00 |
In [239]:
# sample a fraction (75%) of the rows to use as a training set
train = ufo.sample(frac=0.75, random_state=99)
In [240]:
# store the remaining rows in another DataFrame (the testing set)
test = ufo.loc[~ufo.index.isin(train.index), :]
Documentation for isin
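As a sanity check, the sampled rows and the remaining rows should partition the original DataFrame: together they cover every row, with no overlap. A minimal sketch on a toy DataFrame (the column name 'x' is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'x': range(8)})               # toy stand-in for 'ufo'
train = df.sample(frac=0.75, random_state=99)    # 75% of the rows
test = df.loc[~df.index.isin(train.index), :]    # the remaining 25%

# the two pieces are disjoint and together cover every row
assert len(train) + len(test) == len(df)
assert set(train.index).isdisjoint(set(test.index))
```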
[Back to top]
24. How do I create dummy variables in pandas? (video)
In [241]:
train = pd.read_csv(url5)
train.head()
Out[241]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [242]:
# use the 'map' method to create the 'Sex_male' dummy variable
train['Sex_male'] = train.Sex.map({'female':0, 'male':1})
train.head()
Out[242]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 |
Documentation for map
In [243]:
# alternative: use 'get_dummies' to create one column for every possible value
pd.get_dummies(train.Sex).head()
Out[243]:
| | female | male |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 0 | 1 |
Generally speaking:
- If you have "K" possible values for a categorical feature, you only need "K-1" dummy variables to capture all of the information about that feature.
- One convention is to drop the first dummy variable, which defines that level as the "baseline".
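With K=2, a single dummy column already carries all of the information: the dropped level is just the complement of the kept one. A minimal sketch of that claim on a toy Series:

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'female', 'male'])
dummies = pd.get_dummies(sex)   # K=2 columns: 'female' and 'male'

# the 'female' column is fully determined by the 'male' column,
# so dropping it loses no information
assert ((1 - dummies['male']) == dummies['female']).all()
```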
In [245]:
# use the 'iloc' method to drop the first dummy variable ('female')
pd.get_dummies(train.Sex).iloc[:, 1:].head()
Out[245]:
| | male |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |
In [246]:
# add a prefix to identify the source of the dummy variables
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:].head()
Out[246]:
| | Sex_male |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |
In [247]:
# use 'get_dummies' with a feature that has 3 possible values
pd.get_dummies(train.Embarked, prefix='Embarked').head(10)
Out[247]:
| | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 1 | 0 |
| 6 | 0 | 0 | 1 |
| 7 | 0 | 0 | 1 |
| 8 | 0 | 0 | 1 |
| 9 | 1 | 0 | 0 |
In [248]:
# drop the first dummy variable ('C')
pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:].head(10)
Out[248]:
| | Embarked_Q | Embarked_S |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |
| 5 | 1 | 0 |
| 6 | 0 | 1 |
| 7 | 0 | 1 |
| 8 | 0 | 1 |
| 9 | 0 | 0 |
How to translate these values back to the original 'Embarked' value:
- 0, 0 means C
- 1, 0 means Q
- 0, 1 means S
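That decoding rule can be written out explicitly. A minimal sketch (the helper name decode_embarked is made up for illustration):

```python
import pandas as pd

def decode_embarked(q, s):
    """Map the (Embarked_Q, Embarked_S) dummy pair back to the original label."""
    if q == 1:
        return 'Q'
    if s == 1:
        return 'S'
    return 'C'   # 0, 0 means the dropped baseline level

dummies = pd.DataFrame({'Embarked_Q': [0, 1, 0], 'Embarked_S': [1, 0, 0]})
decoded = [decode_embarked(q, s) for q, s in zip(dummies.Embarked_Q, dummies.Embarked_S)]
# decoded == ['S', 'Q', 'C']
```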
In [249]:
# save the DataFrame of dummy variables and concatenate it with the original DataFrame
embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]
train = pd.concat([train, embarked_dummies], axis=1)
train.head()
Out[249]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 | 0 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 0 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 | 0 | 1 |
Documentation for concat
In [250]:
# re-read the DataFrame from the CSV file
train = pd.read_csv(url5)
train.head()
Out[250]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [251]:
# pass the DataFrame to 'get_dummies' and specify which columns to dummy (it drops the original columns)
pd.get_dummies(train, columns=['Sex', 'Embarked']).head()
Out[251]:
| | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 | 0 | 1 | 0 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 1 | 0 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 1 | 0 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 0 | 1 | 0 | 0 | 1 |
In [252]:
# use the 'drop_first' parameter (new in pandas 0.18) to drop the first dummy variable for each feature
pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True).head()
Out[252]:
| | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 1 | 0 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 0 | 0 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 1 | 0 | 1 |
Documentation for get_dummies
[Back to top]
25. How do I work with dates and times in pandas? (video)
In [253]:
ufo = pd.read_csv(url3)
ufo.head()
Out[253]:
| | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 |
| 4 | New York Worlds Fair | NaN | LIGHT | NY | 4/18/1933 19:00 |
In [254]:
# 'Time' is currently stored as a string
ufo.dtypes
Out[254]:
City object
Colors Reported object
Shape Reported object
State object
Time object
dtype: object
In [255]:
# the hour could be accessed using string slicing, but this approach breaks too easily
ufo.Time.str.slice(-5, -3).astype(int).head()
Out[255]:
0 22
1 20
2 14
3 13
4 19
Name: Time, dtype: int32
In [256]:
# convert 'Time' to datetime format
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.head()
Out[256]:
| | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | NaN | TRIANGLE | NY | 1930-06-01 22:00:00 |
| 1 | Willingboro | NaN | OTHER | NJ | 1930-06-30 20:00:00 |
| 2 | Holyoke | NaN | OVAL | CO | 1931-02-15 14:00:00 |
| 3 | Abilene | NaN | DISK | KS | 1931-06-01 13:00:00 |
| 4 | New York Worlds Fair | NaN | LIGHT | NY | 1933-04-18 19:00:00 |
In [257]:
ufo.dtypes
Out[257]:
City object
Colors Reported object
Shape Reported object
State object
Time datetime64[ns]
dtype: object
Documentation for to_datetime
In [258]:
# convenient datetime attributes are now available via the 'dt' accessor
ufo.Time.dt.hour.head()
Out[258]:
0 22
1 20
2 14
3 13
4 19
Name: Time, dtype: int64
In [259]:
ufo.Time.dt.weekday_name.head()
Out[259]:
0 Sunday
1 Monday
2 Sunday
3 Monday
4 Tuesday
Name: Time, dtype: object
In [260]:
ufo.Time.dt.dayofyear.head()
Out[260]:
0 152
1 181
2 46
3 152
4 108
Name: Time, dtype: int64
API reference for datetime properties and methods
In [261]:
# convert a single string to datetime format (outputs a timestamp object)
ts = pd.to_datetime('1/1/1999')
ts
Out[261]:
Timestamp('1999-01-01 00:00:00')
In [262]:
# compare a datetime Series with a Timestamp
ufo.loc[ufo.Time >= ts, :].head()
Out[262]:
| | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 12832 | Loma Rica | NaN | LIGHT | CA | 1999-01-01 02:30:00 |
| 12833 | Bauxite | NaN | NaN | AR | 1999-01-01 03:00:00 |
| 12834 | Florence | NaN | CYLINDER | SC | 1999-01-01 14:00:00 |
| 12835 | Lake Henshaw | NaN | CIGAR | CA | 1999-01-01 15:00:00 |
| 12836 | Wilmington Island | NaN | LIGHT | GA | 1999-01-01 17:15:00 |
In [263]:
# perform mathematical operations with Timestamps (outputs a timedelta object)
ufo.Time.max() - ufo.Time.min()
Out[263]:
Timedelta('25781 days 01:59:00')
In [264]:
# timedelta objects also have attributes you can access
(ufo.Time.max() - ufo.Time.min()).days
Out[264]:
25781
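The same arithmetic works on a pair of individual Timestamps. A minimal sketch:

```python
import pandas as pd

# subtracting two Timestamps yields a Timedelta object
delta = pd.to_datetime('1/1/2000') - pd.to_datetime('1/1/1999')
assert delta.days == 365   # 1999 is not a leap year
```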
In [265]:
# allow plots to appear in the Jupyter notebook
%matplotlib inline
In [266]:
# count the number of UFO reports per year
ufo['Year'] = ufo.Time.dt.year
ufo.Year.value_counts().sort_index().head()
Out[266]:
1930 2
1931 2
1933 1
1934 1
1935 1
Name: Year, dtype: int64
In [267]:
# plot the number of UFO reports per year (line plot is the default)
ufo.Year.value_counts().sort_index().plot()
Out[267]:
<matplotlib.axes._subplots.AxesSubplot at 0x296f1bd1240>
[Back to top]
26. How do I find and remove duplicate rows in pandas? (video)
In [268]:
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table(url2, sep='|', header=None, names=user_cols, index_col='user_id')
users.head()
Out[268]:
| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
In [269]:
users.shape
Out[269]:
(943, 4)
In [270]:
# detect duplicate zip codes: True if an item is identical to a previous item
users.zip_code.duplicated().tail()
Out[270]:
user_id
939 False
940 True
941 False
942 False
943 False
Name: zip_code, dtype: bool
In [271]:
# count the duplicate items (True becomes 1, False becomes 0)
users.zip_code.duplicated().sum()
Out[271]:
148
In [272]:
# detect duplicate DataFrame rows: True if an entire row is identical to a previous row
users.duplicated().tail()
Out[272]:
user_id
939 False
940 False
941 False
942 False
943 False
dtype: bool
In [273]:
# count the duplicate rows
users.duplicated().sum()
Out[273]:
7
Logic for duplicated:
- keep='first' (default): Mark duplicates as True except for the first occurrence.
- keep='last': Mark duplicates as True except for the last occurrence.
- keep=False: Mark all duplicates as True.
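The three keep options are easy to compare side by side on a tiny Series:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a'])

print(s.duplicated(keep='first').tolist())  # [False, False, True, True]
print(s.duplicated(keep='last').tolist())   # [True, False, True, False]
print(s.duplicated(keep=False).tolist())    # [True, False, True, True]
```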
In [275]:
# examine the duplicate rows (ignoring the first occurrence)
users.loc[users.duplicated(keep='first'), :]
Out[275]:
| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 496 | 21 | F | student | 55414 |
| 572 | 51 | M | educator | 20003 |
| 621 | 17 | M | student | 60402 |
| 684 | 28 | M | student | 55414 |
| 733 | 44 | F | other | 60630 |
| 805 | 27 | F | other | 20009 |
| 890 | 32 | M | student | 97301 |
In [276]:
# examine the duplicate rows (ignoring the last occurrence)
users.loc[users.duplicated(keep='last'), :]
Out[276]:
| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 67 | 17 | M | student | 60402 |
| 85 | 51 | M | educator | 20003 |
| 198 | 21 | F | student | 55414 |
| 350 | 32 | M | student | 97301 |
| 428 | 28 | M | student | 55414 |
| 437 | 27 | F | other | 20009 |
| 460 | 44 | F | other | 60630 |
In [277]:
# examine the duplicate rows (including all duplicates)
users.loc[users.duplicated(keep=False), :]
Out[277]:
| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 67 | 17 | M | student | 60402 |
| 85 | 51 | M | educator | 20003 |
| 198 | 21 | F | student | 55414 |
| 350 | 32 | M | student | 97301 |
| 428 | 28 | M | student | 55414 |
| 437 | 27 | F | other | 20009 |
| 460 | 44 | F | other | 60630 |
| 496 | 21 | F | student | 55414 |
| 572 | 51 | M | educator | 20003 |
| 621 | 17 | M | student | 60402 |
| 684 | 28 | M | student | 55414 |
| 733 | 44 | F | other | 60630 |
| 805 | 27 | F | other | 20009 |
| 890 | 32 | M | student | 97301 |
In [278]:
# drop the duplicate rows (inplace=False by default)
users.drop_duplicates(keep='first').shape
Out[278]:
(936, 4)
In [279]:
users.drop_duplicates(keep='last').shape
Out[279]:
(936, 4)
In [280]:
users.drop_duplicates(keep=False).shape
Out[280]:
(929, 4)
Documentation for drop_duplicates
In [281]:
# only consider a subset of columns when identifying duplicates
users.duplicated(subset=['age', 'zip_code']).sum()
Out[281]:
16
In [282]:
users.drop_duplicates(subset=['age', 'zip_code']).shape
Out[282]:
(927, 4)
[Back to top]
27. How do I avoid a SettingWithCopyWarning in pandas? (video)
In [283]:
movies = pd.read_csv(url4)
movies.head()
Out[283]:
| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
| 4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |
In [284]:
# count the missing values in the 'content_rating' Series
movies.content_rating.isnull().sum()
Out[284]:
3
In [285]:
# examine the DataFrame rows that contain those missing values
movies[movies.content_rating.isnull()]
Out[285]:
| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 187 | 8.2 | Butch Cassidy and the Sundance Kid | NaN | Biography | 110 | [u'Paul Newman', u'Robert Redford', u'Katharin... |
| 649 | 7.7 | Where Eagles Dare | NaN | Action | 158 | [u'Richard Burton', u'Clint Eastwood', u'Mary ... |
| 936 | 7.4 | True Grit | NaN | Adventure | 128 | [u'John Wayne', u'Kim Darby', u'Glen Campbell'] |
In [286]:
# examine the unique values in the 'content_rating' Series
movies.content_rating.value_counts()
Out[286]:
R 460
PG-13 189
PG 123
NOT RATED 65
APPROVED 47
UNRATED 38
G 32
PASSED 7
NC-17 7
X 4
GP 3
TV-MA 1
Name: content_rating, dtype: int64
Goal: Mark the 'NOT RATED' values as missing values, represented by 'NaN'.
In [288]:
# first, locate the relevant rows
movies[movies.content_rating=='NOT RATED'].head()
Out[288]:
| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 5 | 8.9 | 12 Angry Men | NOT RATED | Drama | 96 | [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... |
| 6 | 8.9 | The Good, the Bad and the Ugly | NOT RATED | Western | 161 | [u'Clint Eastwood', u'Eli Wallach', u'Lee Van ... |
| 41 | 8.5 | Sunset Blvd. | NOT RATED | Drama | 110 | [u'William Holden', u'Gloria Swanson', u'Erich... |
| 63 | 8.4 | M | NOT RATED | Crime | 99 | [u'Peter Lorre', u'Ellen Widmann', u'Inge Land... |
| 66 | 8.4 | Munna Bhai M.B.B.S. | NOT RATED | Comedy | 156 | [u'Sunil Dutt', u'Sanjay Dutt', u'Arshad Warsi'] |
In [289]:
# then, select the 'content_rating' Series from those rows
movies[movies.content_rating=='NOT RATED'].content_rating.head()
Out[289]:
5 NOT RATED
6 NOT RATED
41 NOT RATED
63 NOT RATED
66 NOT RATED
Name: content_rating, dtype: object
In [290]:
# finally, replace the 'NOT RATED' values with 'NaN' (imported from NumPy)
import numpy as np
movies[movies.content_rating=='NOT RATED'].content_rating = np.nan
c:\users\qq123\anaconda3\lib\site-packages\pandas\core\generic.py:3643: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self[name] = value
Problem: That statement involves two operations, a __getitem__ and a __setitem__. pandas can't guarantee whether the __getitem__ operation returns a view or a copy of the data.
- If __getitem__ returns a view of the data, __setitem__ will affect the 'movies' DataFrame.
- But if __getitem__ returns a copy of the data, __setitem__ will not affect the 'movies' DataFrame.
In [292]:
# the 'content_rating' Series has not changed
movies.content_rating.isnull().sum()
Out[292]:
3
Solution: Use the loc method, which replaces the 'NOT RATED' values in a single __setitem__ operation.
In [294]:
# replace the 'NOT RATED' values with 'NaN' (does not cause a SettingWithCopyWarning)
movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan
In [295]:
# this time, the 'content_rating' Series has changed
movies.content_rating.isnull().sum()
Out[295]:
68
Summary: Use the loc method any time you are selecting rows and columns in the same statement.
More information: Modern Pandas (Part 1)
In [296]:
# create a DataFrame that only contains movies with a high 'star_rating'
top_movies = movies.loc[movies.star_rating >= 9, :]
top_movies
Out[296]:
| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
Goal: Fix the 'duration' for 'The Shawshank Redemption'.
In [298]:
# overwrite the relevant cell with the correct duration
top_movies.loc[0, 'duration'] = 150
c:\users\qq123\anaconda3\lib\site-packages\pandas\core\indexing.py:537: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
Problem: pandas isn't sure whether 'top_movies' is a view or a copy of 'movies'.
In [300]:
# despite the warning, the value was modified anyway
top_movies
Out[300]:
| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 150 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
In [301]:
# but the 'movies' DataFrame has not changed
movies.head(1)
Out[301]:
| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
Solution: Any time you are attempting to create a DataFrame copy, use the copy method.
In [303]:
# explicitly create a copy of 'movies'
top_movies = movies.loc[movies.star_rating >= 9, :].copy()
In [304]:
# pandas now knows that you are updating a copy instead of a view (does not cause a SettingWithCopyWarning)
top_movies.loc[0, 'duration'] = 150
In [305]:
# the 'top_movies' DataFrame has been updated
top_movies
Out[305]:
| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 150 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
Documentation on indexing and selection: Returning a view versus a copy
Stack Overflow: What is the point of views in pandas if it is undefined whether an indexing operation returns a view or a copy?
[Back to top]
28. How do I change display options in pandas? (video)
In [306]:
drinks = pd.read_csv(url7)
In [307]:
# only 60 rows are displayed when printing a DataFrame
drinks
Out[307]:
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
| 1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
| 4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
| 5 | Antigua & Barbuda | 102 | 128 | 45 | 4.9 | North America |
| 6 | Argentina | 193 | 25 | 221 | 8.3 | South America |
| 7 | Armenia | 21 | 179 | 11 | 3.8 | Europe |
| 8 | Australia | 261 | 72 | 212 | 10.4 | Oceania |
| 9 | Austria | 279 | 75 | 191 | 9.7 | Europe |
| 10 | Azerbaijan | 21 | 46 | 5 | 1.3 | Europe |
| 11 | Bahamas | 122 | 176 | 51 | 6.3 | North America |
| 12 | Bahrain | 42 | 63 | 7 | 2.0 | Asia |
| 13 | Bangladesh | 0 | 0 | 0 | 0.0 | Asia |
| 14 | Barbados | 143 | 173 | 36 | 6.3 | North America |
| 15 | Belarus | 142 | 373 | 42 | 14.4 | Europe |
| 16 | Belgium | 295 | 84 | 212 | 10.5 | Europe |
| 17 | Belize | 263 | 114 | 8 | 6.8 | North America |
| 18 | Benin | 34 | 4 | 13 | 1.1 | Africa |
| 19 | Bhutan | 23 | 0 | 0 | 0.4 | Asia |
| 20 | Bolivia | 167 | 41 | 8 | 3.8 | South America |
| 21 | Bosnia-Herzegovina | 76 | 173 | 8 | 4.6 | Europe |
| 22 | Botswana | 173 | 35 | 35 | 5.4 | Africa |
| 23 | Brazil | 245 | 145 | 16 | 7.2 | South America |
| 24 | Brunei | 31 | 2 | 1 | 0.6 | Asia |
| 25 | Bulgaria | 231 | 252 | 94 | 10.3 | Europe |
| 26 | Burkina Faso | 25 | 7 | 7 | 4.3 | Africa |
| 27 | Burundi | 88 | 0 | 0 | 6.3 | Africa |
| 28 | Cote d'Ivoire | 37 | 1 | 7 | 4.0 | Africa |
| 29 | Cabo Verde | 144 | 56 | 16 | 4.0 | Africa |
| ... | ... | ... | ... | ... | ... | ... |
| 163 | Suriname | 128 | 178 | 7 | 5.6 | South America |
| 164 | Swaziland | 90 | 2 | 2 | 4.7 | Africa |
| 165 | Sweden | 152 | 60 | 186 | 7.2 | Europe |
| 166 | Switzerland | 185 | 100 | 280 | 10.2 | Europe |
| 167 | Syria | 5 | 35 | 16 | 1.0 | Asia |
| 168 | Tajikistan | 2 | 15 | 0 | 0.3 | Asia |
| 169 | Thailand | 99 | 258 | 1 | 6.4 | Asia |
| 170 | Macedonia | 106 | 27 | 86 | 3.9 | Europe |
| 171 | Timor-Leste | 1 | 1 | 4 | 0.1 | Asia |
| 172 | Togo | 36 | 2 | 19 | 1.3 | Africa |
| 173 | Tonga | 36 | 21 | 5 | 1.1 | Oceania |
| 174 | Trinidad & Tobago | 197 | 156 | 7 | 6.4 | North America |
| 175 | Tunisia | 51 | 3 | 20 | 1.3 | Africa |
| 176 | Turkey | 51 | 22 | 7 | 1.4 | Asia |
| 177 | Turkmenistan | 19 | 71 | 32 | 2.2 | Asia |
| 178 | Tuvalu | 6 | 41 | 9 | 1.0 | Oceania |
| 179 | Uganda | 45 | 9 | 0 | 8.3 | Africa |
| 180 | Ukraine | 206 | 237 | 45 | 8.9 | Europe |
| 181 | United Arab Emirates | 16 | 135 | 5 | 2.8 | Asia |
| 182 | United Kingdom | 219 | 126 | 195 | 10.4 | Europe |
| 183 | Tanzania | 36 | 6 | 1 | 5.7 | Africa |
| 184 | USA | 249 | 158 | 84 | 8.7 | North America |
| 185 | Uruguay | 115 | 35 | 220 | 6.6 | South America |
| 186 | Uzbekistan | 25 | 101 | 8 | 2.4 | Asia |
| 187 | Vanuatu | 21 | 18 | 11 | 0.9 | Oceania |
| 188 | Venezuela | 333 | 100 | 3 | 7.7 | South America |
| 189 | Vietnam | 111 | 2 | 1 | 2.0 | Asia |
| 190 | Yemen | 6 | 0 | 0 | 0.1 | Asia |
| 191 | Zambia | 32 | 19 | 4 | 2.5 | Africa |
| 192 | Zimbabwe | 64 | 18 | 4 | 4.7 | Africa |
193 rows × 6 columns
In [308]:
# check the current setting for the 'max_rows' option
pd.get_option('display.max_rows')
Out[308]:
60
Documentation for get_option
In [309]:
# overwrite the current setting so that all rows are displayed
pd.set_option('display.max_rows', None)
drinks
Out[309]:
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
| 1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
| 4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
| 5 | Antigua & Barbuda | 102 | 128 | 45 | 4.9 | North America |
| 6 | Argentina | 193 | 25 | 221 | 8.3 | South America |
| 7 | Armenia | 21 | 179 | 11 | 3.8 | Europe |
| 8 | Australia | 261 | 72 | 212 | 10.4 | Oceania |
| 9 | Austria | 279 | 75 | 191 | 9.7 | Europe |
| ... | ... | ... | ... | ... | ... | ... |
| 183 | Tanzania | 36 | 6 | 1 | 5.7 | Africa |
| 184 | USA | 249 | 158 | 84 | 8.7 | North America |
| 185 | Uruguay | 115 | 35 | 220 | 6.6 | South America |
| 186 | Uzbekistan | 25 | 101 | 8 | 2.4 | Asia |
| 187 | Vanuatu | 21 | 18 | 11 | 0.9 | Oceania |
| 188 | Venezuela | 333 | 100 | 3 | 7.7 | South America |
| 189 | Vietnam | 111 | 2 | 1 | 2.0 | Asia |
| 190 | Yemen | 6 | 0 | 0 | 0.1 | Asia |
| 191 | Zambia | 32 | 19 | 4 | 2.5 | Africa |
| 192 | Zimbabwe | 64 | 18 | 4 | 4.7 | Africa |
(output abridged here; with 'display.max_rows' set to None, the notebook displays all 193 rows with no truncation)
In [310]:
# reset the 'max_rows' option to its default
pd.reset_option('display.max_rows')
Documentation for set_option
and reset_option
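If an option change should apply only temporarily, `pd.option_context` scopes it to a `with` block and restores the previous value on exit, so no manual `reset_option` is needed. A minimal sketch:

```python
import pandas as pd

# Record the current value, change it temporarily inside the block,
# and confirm it reverts to the original value afterwards.
before = pd.get_option('display.max_rows')
with pd.option_context('display.max_rows', 5):
    inside = pd.get_option('display.max_rows')
after = pd.get_option('display.max_rows')
```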
In [311]:
# the 'max_columns' option is similar to 'max_rows'
pd.get_option('display.max_columns')
Out[311]:
20
In [312]:
train = pd.read_csv(url5)
train.head()
Out[312]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [313]:
# an ellipsis is displayed in the 'Name' cell of row 1 because of the 'max_colwidth' option
pd.get_option('display.max_colwidth')
Out[313]:
50
In [314]:
# overwrite the current setting so that more characters will be displayed
pd.set_option('display.max_colwidth', 1000)
train.head()
Out[314]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [315]:
# 覆盖''precision' '设置以显示'票价'小数点后的2位数
pd.set_option('display.precision', 2)
train.head()
Out[315]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.92 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.10 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
In [316]:
# add two meaningless columns to the drinks DataFrame
drinks['x'] = drinks.wine_servings * 1000
drinks['y'] = drinks.total_litres_of_pure_alcohol * 1000
drinks.head()
Out[316]:
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | x | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia | 0 | 0.0 |
| 1 | Albania | 89 | 132 | 54 | 4.9 | Europe | 54000 | 4900.0 |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | Africa | 14000 | 700.0 |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | Europe | 312000 | 12400.0 |
| 4 | Angola | 217 | 57 | 45 | 5.9 | Africa | 45000 | 5900.0 |
In [317]:
# use a Python format string to specify a comma as the thousands separator
pd.set_option('display.float_format', '{:,}'.format)
drinks.head()
Out[317]:
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | x | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia | 0 | 0.0 |
| 1 | Albania | 89 | 132 | 54 | 4.9 | Europe | 54000 | 4,900.0 |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | Africa | 14000 | 700.0 |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | Europe | 312000 | 12,400.0 |
| 4 | Angola | 217 | 57 | 45 | 5.9 | Africa | 45000 | 5,900.0 |
In [318]:
# 'y' was affected (but not 'x') because the 'float_format' option affects only floats (not ints)
drinks.dtypes
Out[318]:
country object
beer_servings int64
spirit_servings int64
wine_servings int64
total_litres_of_pure_alcohol float64
continent object
x int64
y float64
dtype: object
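Note that the `'{:,}'` format string above adds the thousands separator but keeps the full float precision. A format spec such as `'{:,.2f}'` combines the comma with a fixed number of decimals. A small sketch, using a made-up two-value frame for illustration:

```python
import pandas as pd

# Combine the thousands separator with 2 decimal places
df = pd.DataFrame({'y': [4900.0, 12400.5]})
pd.set_option('display.float_format', '{:,.2f}'.format)
text = repr(df)  # floats now render as 4,900.00 and 12,400.50
pd.reset_option('display.float_format')
```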
In [319]:
# view the option descriptions (including the default and current values)
pd.describe_option()
compute.use_bottleneck : bool
Use the bottleneck library to accelerate if it is installed,
the default is True
Valid values: False,True
[default: True] [currently: True]
compute.use_numexpr : bool
Use the numexpr library to accelerate computation if it is installed,
the default is True
Valid values: False,True
[default: True] [currently: True]
display.chop_threshold : float or None
if set to a float value, all float values smaller then the given threshold
will be displayed as exactly 0 by repr and friends.
[default: None] [currently: None]
display.colheader_justify : 'left'/'right'
Controls the justification of column headers. used by DataFrameFormatter.
[default: right] [currently: right]
display.column_space No description available.
[default: 12] [currently: 12]
display.date_dayfirst : boolean
When True, prints and parses dates with the day first, eg 20/01/2005
[default: False] [currently: False]
display.date_yearfirst : boolean
When True, prints and parses dates with the year first, eg 2005/01/20
[default: False] [currently: False]
display.encoding : str/unicode
Defaults to the detected encoding of the console.
Specifies the encoding to be used for strings returned by to_string,
these are generally strings meant to be displayed on the console.
[default: UTF-8] [currently: UTF-8]
display.expand_frame_repr : boolean
Whether to print out the full DataFrame repr for wide DataFrames across
multiple lines, `max_columns` is still respected, but the output will
wrap-around across multiple "pages" if its width exceeds `display.width`.
[default: True] [currently: True]
display.float_format : callable
The callable should accept a floating point number and return
a string with the desired format of the number. This is used
in some places like SeriesFormatter.
See formats.format.EngFormatter for an example.
[default: None] [currently: <built-in method format of str object at 0x00000296F1B05880>]
display.height : int
Deprecated.
[default: 60] [currently: 60]
display.html.border : int
A ``border=value`` attribute is inserted in the ``<table>`` tag
for the DataFrame HTML repr.
[default: 1] [currently: 1]
display.html.table_schema : boolean
Whether to publish a Table Schema representation for frontends
that support it.
(default: False)
[default: False] [currently: False]
display.large_repr : 'truncate'/'info'
For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can
show a truncated table (the default from 0.13), or switch to the view from
df.info() (the behaviour in earlier versions of pandas).
[default: truncate] [currently: truncate]
display.latex.escape : bool
This specifies if the to_latex method of a Dataframe uses escapes special
characters.
Valid values: False,True
[default: True] [currently: True]
display.latex.longtable :bool
This specifies if the to_latex method of a Dataframe uses the longtable
format.
Valid values: False,True
[default: False] [currently: False]
display.latex.multicolumn : bool
This specifies if the to_latex method of a Dataframe uses multicolumns
to pretty-print MultiIndex columns.
Valid values: False,True
[default: True] [currently: True]
display.latex.multicolumn_format : bool
This specifies if the to_latex method of a Dataframe uses multicolumns
to pretty-print MultiIndex columns.
Valid values: False,True
[default: l] [currently: l]
display.latex.multirow : bool
This specifies if the to_latex method of a Dataframe uses multirows
to pretty-print MultiIndex rows.
Valid values: False,True
[default: False] [currently: False]
display.latex.repr : boolean
Whether to produce a latex DataFrame representation for jupyter
environments that support it.
(default: False)
[default: False] [currently: False]
display.line_width : int
Deprecated.
[default: 80] [currently: 80]
display.max_categories : int
This sets the maximum number of categories pandas should output when
printing out a `Categorical` or a Series of dtype "category".
[default: 8] [currently: 8]
display.max_columns : int
If max_cols is exceeded, switch to truncate view. Depending on
`large_repr`, objects are either centrally truncated or printed as
a summary view. 'None' value means unlimited.
In case python/IPython is running in a terminal and `large_repr`
equals 'truncate' this can be set to 0 and pandas will auto-detect
the width of the terminal and print a truncated object which fits
the screen width. The IPython notebook, IPython qtconsole, or IDLE
do not run in a terminal and hence it is not possible to do
correct auto-detection.
[default: 20] [currently: 20]
display.max_colwidth : int
The maximum width in characters of a column in the repr of
a pandas data structure. When the column overflows, a "..."
placeholder is embedded in the output.
[default: 50] [currently: 1000]
display.max_info_columns : int
max_info_columns is used in DataFrame.info method to decide if
per column information will be printed.
[default: 100] [currently: 100]
display.max_info_rows : int or None
df.info() will usually show null-counts for each column.
For large frames this can be quite slow. max_info_rows and max_info_cols
limit this null check only to frames with smaller dimensions than
specified.
[default: 1690785] [currently: 1690785]
display.max_rows : int
If max_rows is exceeded, switch to truncate view. Depending on
`large_repr`, objects are either centrally truncated or printed as
a summary view. 'None' value means unlimited.
In case python/IPython is running in a terminal and `large_repr`
equals 'truncate' this can be set to 0 and pandas will auto-detect
the height of the terminal and print a truncated object which fits
the screen height. The IPython notebook, IPython qtconsole, or
IDLE do not run in a terminal and hence it is not possible to do
correct auto-detection.
[default: 60] [currently: 60]
display.max_seq_items : int or None
when pretty-printing a long sequence, no more then `max_seq_items`
will be printed. If items are omitted, they will be denoted by the
addition of "..." to the resulting string.
If set to None, the number of items to be printed is unlimited.
[default: 100] [currently: 100]
display.memory_usage : bool, string or None
This specifies if the memory usage of a DataFrame should be displayed when
df.info() is called. Valid values True,False,'deep'
[default: True] [currently: True]
display.multi_sparse : boolean
"sparsify" MultiIndex display (don't display repeated
elements in outer levels within groups)
[default: True] [currently: True]
display.notebook_repr_html : boolean
When True, IPython notebook will use html representation for
pandas objects (if it is available).
[default: True] [currently: True]
display.pprint_nest_depth : int
Controls the number of nested levels to process when pretty-printing
[default: 3] [currently: 3]
display.precision : int
Floating point output precision (number of significant digits). This is
only a suggestion
[default: 6] [currently: 2]
display.show_dimensions : boolean or 'truncate'
Whether to print out dimensions at the end of DataFrame repr.
If 'truncate' is specified, only print out the dimensions if the
frame is truncated (e.g. not display all rows and/or columns)
[default: truncate] [currently: truncate]
display.unicode.ambiguous_as_wide : boolean
Whether to use the Unicode East Asian Width to calculate the display text
width.
Enabling this may affect to the performance (default: False)
[default: False] [currently: False]
display.unicode.east_asian_width : boolean
Whether to use the Unicode East Asian Width to calculate the display text
width.
Enabling this may affect to the performance (default: False)
[default: False] [currently: False]
display.width : int
Width of the display in characters. In case python/IPython is running in
a terminal this can be set to None and pandas will correctly auto-detect
the width.
Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a
terminal and hence it is not possible to correctly detect the width.
[default: 80] [currently: 80]
html.border : int
A ``border=value`` attribute is inserted in the ``<table>`` tag
for the DataFrame HTML repr.
[default: 1] [currently: 1]
(Deprecated, use `display.html.border` instead.)
io.excel.xls.writer : string
The default Excel writer engine for 'xls' files. Available options:
auto, xlwt.
[default: auto] [currently: auto]
io.excel.xlsm.writer : string
The default Excel writer engine for 'xlsm' files. Available options:
auto, openpyxl.
[default: auto] [currently: auto]
io.excel.xlsx.writer : string
The default Excel writer engine for 'xlsx' files. Available options:
auto, openpyxl, xlsxwriter.
[default: auto] [currently: auto]
io.hdf.default_format : format
default format writing format, if None, then
put will default to 'fixed' and append will default to 'table'
[default: None] [currently: None]
io.hdf.dropna_table : boolean
drop ALL nan rows when appending to a table
[default: False] [currently: False]
io.parquet.engine : string
The default parquet reader/writer engine. Available options:
'auto', 'pyarrow', 'fastparquet', the default is 'auto'
[default: auto] [currently: auto]
mode.chained_assignment : string
Raise an exception, warn, or no action if trying to use chained assignment,
The default is warn
[default: warn] [currently: warn]
mode.sim_interactive : boolean
Whether to simulate interactive mode for purposes of testing
[default: False] [currently: False]
mode.use_inf_as_na : boolean
True means treat None, NaN, INF, -INF as NA (old way),
False means None and NaN are null, but INF, -INF are not NA
(new way).
[default: False] [currently: False]
mode.use_inf_as_null use_inf_as_null had been deprecated and will be removed in a future version.
Use `use_inf_as_na` instead.
[default: False] [currently: False]
(Deprecated, use `mode.use_inf_as_na` instead.)
plotting.matplotlib.register_converters : bool
Whether to register converters with matplotlib's units registry for
dates, times, datetimes, and Periods. Toggling to False will remove
the converters, restoring any converters that pandas overwrote.
[default: True] [currently: True]
In [320]:
# search for specific options by name
pd.describe_option('rows')
display.max_info_rows : int or None
df.info() will usually show null-counts for each column.
For large frames this can be quite slow. max_info_rows and max_info_cols
limit this null check only to frames with smaller dimensions than
specified.
[default: 1690785] [currently: 1690785]
display.max_rows : int
If max_rows is exceeded, switch to truncate view. Depending on
`large_repr`, objects are either centrally truncated or printed as
a summary view. 'None' value means unlimited.
In case python/IPython is running in a terminal and `large_repr`
equals 'truncate' this can be set to 0 and pandas will auto-detect
the height of the terminal and print a truncated object which fits
the screen height. The IPython notebook, IPython qtconsole, or
IDLE do not run in a terminal and hence it is not possible to do
correct auto-detection.
[default: 60] [currently: 60]
Documentation for describe_option
In [321]:
# reset all options to their defaults
pd.reset_option('all')
html.border has been deprecated, use display.html.border instead
(currently both are identical)
use_inf_as_null had been deprecated and will be removed in a future version.
Use `use_inf_as_na` instead.
Options and Settings
[Back to top]
29. How do I create a pandas DataFrame from another object? (video)
In [322]:
# create a DataFrame from a dictionary (keys become column names, values become data)
pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'red']})
Out[322]:
| | color | id |
| --- | --- | --- |
| 0 | red | 100 |
| 1 | blue | 101 |
| 2 | red | 102 |
In [323]:
# optionally specify the order of columns and define the index
df = pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'red']}, columns=['id', 'color'], index=['a', 'b', 'c'])
df
Out[323]:
| | id | color |
| --- | --- | --- |
| a | 100 | red |
| b | 101 | blue |
| c | 102 | red |
Documentation for DataFrame
In [324]:
# create a DataFrame from a list of lists (each inner list becomes a row)
pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'red']], columns=['id', 'color'])
Out[324]:
| | id | color |
| --- | --- | --- |
| 0 | 100 | red |
| 1 | 101 | blue |
| 2 | 102 | red |
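A list of dictionaries works too: each dict becomes one row and the keys become the column names. A quick sketch:

```python
import pandas as pd

# Each dictionary becomes one row; keys become column names
rows = [{'id': 100, 'color': 'red'},
        {'id': 101, 'color': 'blue'},
        {'id': 102, 'color': 'red'}]
df = pd.DataFrame(rows)
```

This is handy when the data arrives record-by-record, e.g. from a JSON API.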
In [325]:
# create a NumPy array with shape (4, 2) and fill it with random numbers between 0 and 1
import numpy as np
arr = np.random.rand(4, 2)
arr
Out[325]:
array([[0.75703155, 0.7684219 ],
[0.06043145, 0.1511115 ],
[0.5908619 , 0.02133966],
[0.78627718, 0.98944222]])
In [326]:
# create a DataFrame from the NumPy array
pd.DataFrame(arr, columns=['one', 'two'])
Out[326]:
| | one | two |
| --- | --- | --- |
| 0 | 0.757032 | 0.768422 |
| 1 | 0.060431 | 0.151112 |
| 2 | 0.590862 | 0.021340 |
| 3 | 0.786277 | 0.989442 |
In [327]:
# create a DataFrame of student IDs (100 through 109) and test scores (random integers between 60 and 100)
pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)})
Out[327]:
| | student | test |
| --- | --- | --- |
| 0 | 100 | 61 |
| 1 | 101 | 84 |
| 2 | 102 | 100 |
| 3 | 103 | 75 |
| 4 | 104 | 97 |
| 5 | 105 | 73 |
| 6 | 106 | 67 |
| 7 | 107 | 82 |
| 8 | 108 | 76 |
| 9 | 109 | 69 |
Documentation for np.arange
and np.random
In [328]:
# 'set_index' can be chained with the DataFrame constructor to select an index
pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)}).set_index('student')
Out[328]:
| student | test |
| --- | --- |
| 100 | 97 |
| 101 | 94 |
| 102 | 82 |
| 103 | 76 |
| 104 | 72 |
| 105 | 86 |
| 106 | 70 |
| 107 | 84 |
| 108 | 84 |
| 109 | 100 |
Documentation for set_index
In [329]:
# create a new Series using the Series constructor
s = pd.Series(['round', 'square'], index=['c', 'b'], name='shape')
s
Out[329]:
c round
b square
Name: shape, dtype: object
Documentation for Series
In [330]:
# concatenate the DataFrame and the Series (use axis=1 to concatenate columns)
pd.concat([df, s], axis=1)
Out[330]:
| | id | color | shape |
| --- | --- | --- | --- |
| a | 100 | red | NaN |
| b | 101 | blue | square |
| c | 102 | red | round |
Notes:
- The Series name became the column name in the DataFrame.
- The Series data was aligned to the DataFrame by its index.
- The 'shape' for row 'a' was marked as a missing value (NaN) because that index was not present in the Series.
Documentation for concat
[Back to top]
In [331]:
"""笔记:
系列名称成为DataFrame中的列名称。
Series数据通过其索引与DataFrame对齐。
行'a'的'形状'被标记为缺失值(NaN),因为该索引在系列中不存在。"""
Out[331]:
"笔记:\n\n系列名称成为DataFrame中的列名称。\nSeries数据通过其索引与DataFrame对齐。\n行'a'的'形状'被标记为缺失值(NaN),因为该索引在系列中不存在。"
30. How do I apply a function to a pandas Series or DataFrame? (video)
In [332]:
# read the training dataset from Kaggle's Titanic competition into a DataFrame
train = pd.read_csv(url5)
train.head()
Out[332]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Goal: Map the existing values of a Series to a different set of values
Method: map
(Series method)
In [333]:
"""目标:将Series的现有值映射到不同的值集
方法:map(Series方法)"""
Out[333]:
'目标:将Series的现有值映射到不同的值集\n\n方法:map(Series方法)'
In [334]:
# map 'female' to 0 and 'male' to 1
train['Sex_num'] = train.Sex.map({'female':0, 'male':1})
train.loc[0:4, ['Sex', 'Sex_num']]
Out[334]:
| | Sex | Sex_num |
| --- | --- | --- |
| 0 | male | 1 |
| 1 | female | 0 |
| 2 | female | 0 |
| 3 | female | 0 |
| 4 | male | 1 |
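One `map` behaviour worth knowing: values with no entry in the mapping dict become NaN rather than raising an error, which also upcasts the result to float. A small sketch with a hypothetical 'unknown' value:

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'unknown'])
sex_num = sex.map({'female': 0, 'male': 1})
# 'unknown' has no entry in the mapping, so it becomes NaN
# (and the dtype is upcast to float64 to hold the NaN)
missing = sex_num.isnull().sum()
```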
Goal: Apply a function to each element in a Series
Method: apply
(Series method)
Note: map
can be substituted for apply
in many cases, but apply
is more flexible and thus is recommended
In [335]:
"""目标:将函数应用于系列中的每个元素
方法:apply(Series方法)
注意:在许多情况下,map可以替换为apply,但是apply更灵活,因此建议使用"""
Out[335]:
'目标:将函数应用于系列中的每个元素\n\n方法:apply(Series方法)\n\n注意:在许多情况下,map可以替换为apply,但是apply更灵活,因此建议使用'
In [336]:
# calculate the length of each string in the 'Name' Series
train['Name_length'] = train.Name.apply(len)
train.loc[0:4, ['Name', 'Name_length']]
Out[336]:
| | Name | Name_length |
| --- | --- | --- |
| 0 | Braund, Mr. Owen Harris | 23 |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 51 |
| 2 | Heikkinen, Miss. Laina | 22 |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 44 |
| 4 | Allen, Mr. William Henry | 24 |
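For common string operations, the vectorized `.str` accessor is an alternative to `apply`; `Series.str.len` computes the same lengths. A sketch using two of the names above:

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])
via_apply = names.apply(len)   # apply a plain Python function
via_str = names.str.len()      # vectorized .str accessor equivalent
```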
In [337]:
# round up each element of the 'Fare' Series to the next integer
import numpy as np
train['Fare_ceil'] = train.Fare.apply(np.ceil)
train.loc[0:4, ['Fare', 'Fare_ceil']]
Out[337]:
| | Fare | Fare_ceil |
| --- | --- | --- |
| 0 | 7.2500 | 8.0 |
| 1 | 71.2833 | 72.0 |
| 2 | 7.9250 | 8.0 |
| 3 | 53.1000 | 54.0 |
| 4 | 8.0500 | 9.0 |
In [338]:
# we want to extract the last name of each person
train.Name.head()
Out[338]:
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
Name: Name, dtype: object
In [339]:
# use a string method to split the 'Name' Series at commas (returns a Series of lists)
train.Name.str.split(',').head()
Out[339]:
0 [Braund, Mr. Owen Harris]
1 [Cumings, Mrs. John Bradley (Florence Briggs ...
2 [Heikkinen, Miss. Laina]
3 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]
4 [Allen, Mr. William Henry]
Name: Name, dtype: object
In [340]:
# define a function that returns an element from a list based on its position
def get_element(my_list, position):
return my_list[position]
In [341]:
# apply the 'get_element' function and pass 'position' as a keyword argument
train.Name.str.split(',').apply(get_element, position=0).head()
Out[341]:
0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
Name: Name, dtype: object
In [342]:
# alternatively, use a lambda function
train.Name.str.split(',').apply(lambda x: x[0]).head()
Out[342]:
0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
Name: Name, dtype: object
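The extraction can also stay entirely inside the `.str` accessor: `.str.get(0)` (or `.str[0]`) pulls the first element out of each split list. A sketch:

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley'])
# split at the comma, then take the first element of each resulting list
last_names = names.str.split(',').str.get(0)
```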
Goal: Apply a function along either axis of a DataFrame
Method: apply
(DataFrame method)
In [343]:
"""目标:沿DataFrame的任一轴应用函数
方法:apply(DataFrame方法)"""
Out[343]:
'目标:沿DataFrame的任一轴应用函数\n\n方法:apply(DataFrame方法)'
In [344]:
drinks = pd.read_csv(url7)
drinks.head()
Out[344]:
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
| 1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
| 4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
In [345]:
# select a subset of the DataFrame to work with
drinks.loc[:, 'beer_servings':'wine_servings'].head()
Out[345]:
| | beer_servings | spirit_servings | wine_servings |
| --- | --- | --- | --- |
| 0 | 0 | 0 | 0 |
| 1 | 89 | 132 | 54 |
| 2 | 25 | 0 | 14 |
| 3 | 245 | 138 | 312 |
| 4 | 217 | 57 | 45 |
In [346]:
# apply the 'max' function along axis 0 to calculate the maximum value in each column
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=0)
Out[346]:
beer_servings 376
spirit_servings 438
wine_servings 370
dtype: int64
In [347]:
# apply the 'max' function along axis 1 to calculate the maximum value in each row
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=1).head()
Out[347]:
0 0
1 132
2 25
3 312
4 217
dtype: int64
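For built-in reductions like `max`, the DataFrame's own methods give the same results more directly than `apply`. A sketch with a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({'beer_servings': [0, 89, 245],
                   'wine_servings': [0, 54, 312]})
col_max = df.max()         # column maxima, like df.apply(max, axis=0)
row_max = df.max(axis=1)   # row maxima, like df.apply(max, axis=1)
```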
In [348]:
# use 'np.argmax' to calculate which column has the maximum value for each row
drinks.loc[:, 'beer_servings':'wine_servings'].apply(np.argmax, axis=1).head()
c:\users\qq123\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:52: FutureWarning: 'argmax' is deprecated. Use 'idxmax' instead. The behavior of 'argmax' will be corrected to return the positional maximum in the future. Use 'series.values.argmax' to get the position of the maximum now.
return getattr(obj, method)(*args, **kwds)
Out[348]:
0 beer_servings
1 spirit_servings
2 beer_servings
3 wine_servings
4 beer_servings
dtype: object
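As the FutureWarning suggests, `idxmax` is the supported way to get the column label of each row's maximum. A sketch reproducing the first three rows above:

```python
import pandas as pd

drinks_subset = pd.DataFrame({'beer_servings': [0, 89, 25],
                              'spirit_servings': [0, 132, 0],
                              'wine_servings': [0, 54, 14]})
# label of the column holding each row's maximum (ties go to the first column)
winners = drinks_subset.idxmax(axis=1)
```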
Goal: Apply a function to every element in a DataFrame
Method: applymap
(DataFrame method)
In [349]:
# convert every DataFrame element to a float
drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float).head()
Out[349]:
| | beer_servings | spirit_servings | wine_servings |
| --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 |
| 1 | 89.0 | 132.0 | 54.0 |
| 2 | 25.0 | 0.0 | 14.0 |
| 3 | 245.0 | 138.0 | 312.0 |
| 4 | 217.0 | 57.0 | 45.0 |
In [350]:
# overwrite the existing DataFrame columns
drinks.loc[:, 'beer_servings':'wine_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float)
drinks.head()
Out[350]:
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Afghanistan | 0.0 | 0.0 | 0.0 | 0.0 | Asia |
| 1 | Albania | 89.0 | 132.0 | 54.0 | 4.9 | Europe |
| 2 | Algeria | 25.0 | 0.0 | 14.0 | 0.7 | Africa |
| 3 | Andorra | 245.0 | 138.0 | 312.0 | 12.4 | Europe |
| 4 | Angola | 217.0 | 57.0 | 45.0 | 5.9 | Africa |
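For a plain type conversion like this one, `astype` is the more idiomatic (and usually faster) choice than element-wise `applymap`. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'beer_servings': [0, 89], 'wine_servings': [0, 54]})
df = df.astype(float)  # whole-frame cast instead of applymap(float)
```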
[Back to top]