1. Create a minimal DataFrame
import pandas as pd
fruits = pd.DataFrame([[30,21]],columns=['Apples','Bananas'])
| | Apples | Bananas |
|---|---|---|
| 0 | 30 | 21 |
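A DataFrame can also be built from a dict of columns, with explicit row labels passed through index. The sketch below uses made-up sales figures and the hypothetical name fruit_sales purely for illustration.

```python
import pandas as pd

# Hypothetical figures, used only to illustrate the constructor.
fruit_sales = pd.DataFrame(
    {'Apples': [35, 41], 'Bananas': [21, 34]},  # one list per column
    index=['2017 Sales', '2018 Sales'],          # custom row labels
)

print(fruit_sales)
```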
2. Create a simple Series
ingredients = pd.Series(['4 cups','1 cup','2 large','1 can'],index=['Flour','Milk','Eggs','Spam'],name='Dinner')
Flour 4 cups
Milk 1 cup
Eggs 2 large
Spam 1 can
Name: Dinner, dtype: object
3. Read a file
reviews = pd.read_csv('../input/wine-reviews/winemag-data_first150k.csv',index_col=0)
index_col=0 uses the first column of the file (column 0) as the index.
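To see what index_col=0 changes without needing the real file, here is a small sketch that reads a hypothetical CSV string through io.StringIO. The names csv_text, with_index and without_index are made up for this illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real file; the first column has no header.
csv_text = ",country,points\n0,Italy,96\n1,France,95\n"

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))     # the unnamed first column became the index
print(list(without_index.columns))  # the unnamed first column is kept as data
```

Without index_col=0, pandas keeps the unnamed first column as an ordinary data column and adds its own default index on top.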
4. Select a column
desc = reviews.description
desc = reviews['description']
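Both forms return the same Series. A minimal sketch on a made-up frame (demo and its 'region 1' column are hypothetical) shows the one practical difference: bracket access also works for names that are not valid Python identifiers.

```python
import pandas as pd

# Hypothetical frame; 'region 1' contains a space on purpose.
demo = pd.DataFrame({'description': ['a', 'b'], 'region 1': ['x', 'y']})

by_attr = demo.description    # attribute access
by_key = demo['description']  # bracket access
same = by_attr.equals(by_key)

spaced = demo['region 1']     # only bracket access can express this name
print(same)
```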
5. Select one or more rows
first_row = reviews.loc[0]
sample_reviews = reviews.loc[[1,2,3,5,8]]
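6. Select rows and columns together
columns = ['country','province','region_1','region_2']  # columns to select
index = [0,1,10,100]  # row labels to select
df = reviews.loc[index,columns]  # loc selects rows by label; iloc selects rows by integer position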
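The label versus position distinction only becomes visible when the index is not simply 0, 1, 2, and so on. A small sketch, assuming a made-up frame whose labels are 10, 20, 30:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
demo = pd.DataFrame(
    {'country': ['Italy', 'France', 'US'], 'points': [96, 95, 90]},
    index=[10, 20, 30],
)

by_label = demo.loc[20, 'country']   # the row labelled 20
by_position = demo.iloc[1, 0]        # second row, first column
subset = demo.loc[[10, 30], ['country', 'points']]  # rows and columns together

print(by_label, by_position)
```

Here demo.loc[1] would raise a KeyError because no row carries the label 1, while demo.iloc[1] is valid.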
7. A detail to watch
columns = ['country','variety']
df = reviews.loc[0:99,columns]
In pandas, the loc slice 0:99 includes label 99, unlike an ordinary Python slice 0:99, which stops before 99.
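The difference is easy to check by counting rows. The frame below is a hypothetical stand-in with a default integer index.

```python
import pandas as pd

demo = pd.DataFrame({'value': range(200)})  # default index 0..199

label_slice = demo.loc[0:99]      # labels 0 through 99, end included
position_slice = demo.iloc[0:99]  # positions 0 through 98, end excluded

print(len(label_slice), len(position_slice))
```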
8. Filter by value
italian_wines = reviews[reviews.country == 'Italy']
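9. Advanced filtering by value
Each condition must be wrapped in parentheses before conditions are combined with boolean operators such as & and |.
.isin checks whether each value appears in the given list.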
top_oceania_wines = reviews[(reviews.country.isin(['Australia','New Zealand']))
& (reviews.points >= 95)]
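The parentheses matter because & binds more tightly than comparisons such as >=, so a & b >= 95 would be parsed as (a & b) >= 95. One way to sidestep this is to name each mask first, as in this sketch on made-up data (reviews_demo, in_oceania and high_score are hypothetical names):

```python
import pandas as pd

# Hypothetical miniature of the reviews table.
reviews_demo = pd.DataFrame({
    'country': ['Australia', 'New Zealand', 'Italy', 'Australia'],
    'points': [96, 94, 97, 95],
})

in_oceania = reviews_demo.country.isin(['Australia', 'New Zealand'])  # boolean Series
high_score = reviews_demo.points >= 95                                # boolean Series

top = reviews_demo[in_oceania & high_score]  # rows where both masks are True
print(top)
```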
10. Summary functions
1. describe
reviews.points.describe()
For numeric columns it reports:
'count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'
For text columns it reports:
'count', 'unique', 'top', 'freq'
2. Mean
reviews.points.mean()
# Median
reviews.points.median()
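3. unique: the distinct values only (here, just the country names)
reviews.country.unique()
4. value_counts: each distinct value with its number of occurrences (here, country plus count)
reviews.country.value_counts()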
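The summary functions above can be tried together on a tiny made-up table. The frame demo and its values are hypothetical; only the pandas calls come from the notes.

```python
import pandas as pd

# Hypothetical miniature of the reviews table.
demo = pd.DataFrame({
    'points': [90, 85, 95, 90],
    'country': ['Italy', 'US', 'Italy', 'France'],
})

numeric_summary = demo.points.describe()  # count, mean, std, min, quartiles, max
text_summary = demo.country.describe()    # count, unique, top, freq
mean_points = demo.points.mean()
median_points = demo.points.median()
distinct = demo.country.unique()          # distinct values, in order of first appearance
counts = demo.country.value_counts()      # distinct values with counts, most frequent first

print(numeric_summary)
print(text_summary)
print(counts)
```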
Map functions
# Basic map: takes a function and applies it to every value
reviews.points.map(lambda p: p - 5)
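Equivalent to reviews.points - 5
# apply: passes each row of reviews to the function as row, runs the function body, and moves through the frame row by row (this function is rather slow)
def remean_points(row):
    row.points = row.points - 3
    return row
reviews.apply(remean_points, axis='columns')
# A simpler operation that yields the same points values as the apply call above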
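A quick check of that equivalence, on a hypothetical three-value Series:

```python
import pandas as pd

points = pd.Series([90, 85, 95])  # made-up scores

mapped = points.map(lambda p: p - 5)  # calls the lambda once per value
vectorized = points - 5               # one arithmetic operation on the whole Series

print(mapped.equals(vectorized))
```

The vectorized form is usually the better default; map is mainly useful when the per-value logic cannot be written as plain arithmetic.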
reviews.points - 3
centered_price = reviews.price - reviews.price.mean()  # mean-centering, common in machine learning
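The sketch below compares the row-wise apply with the column arithmetic on a small made-up frame, and checks that centering leaves a mean of zero. The frame demo is hypothetical, and the explicit row.copy() is an addition here so the function never touches the frame it was given.

```python
import pandas as pd

# Hypothetical miniature of the reviews table.
demo = pd.DataFrame({'points': [90.0, 85.0, 95.0], 'price': [10.0, 20.0, 30.0]})

def remean_points(row):
    row = row.copy()                   # work on a copy of the row
    row['points'] = row['points'] - 3
    return row

via_apply = demo.apply(remean_points, axis='columns')  # returns a new DataFrame
via_vector = demo.points - 3                           # returns a Series

centered_price = demo.price - demo.price.mean()        # subtract the column mean

print(list(via_apply.points) == list(via_vector))
print(centered_price.mean())
```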
# String concatenation
reviews.country + " and " + reviews.region_1
out --> New Zealand and Alsace
# Return the index label of the maximum value
bargain_idx = (reviews.points/reviews.price).idxmax()
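idxmax returns an index label, not the value itself, so it is usually followed by a loc lookup. A sketch on made-up data with string labels (demo, ratio and best are hypothetical names):

```python
import pandas as pd

# Hypothetical wines labelled 'a', 'b', 'c'.
demo = pd.DataFrame(
    {'points': [90, 85, 95], 'price': [30.0, 10.0, 50.0]},
    index=['a', 'b', 'c'],
)

ratio = demo.points / demo.price  # points per unit of price
best = ratio.idxmax()             # label of the row with the highest ratio
best_row = demo.loc[best]         # look the full row up by that label

print(best)
```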
# Count the rows whose description contains 'fruity'
fruity = reviews.description.str.contains('fruity').sum()
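str.contains returns a boolean Series, and summing it counts the True values. The match is case-sensitive by default, as this sketch on made-up descriptions shows:

```python
import pandas as pd

# Hypothetical descriptions.
desc = pd.Series(['A fruity red', 'Dry and crisp', 'Very Fruity finish', 'fruity, light'])

mask = desc.str.contains('fruity')  # True where the substring occurs
count = mask.sum()                  # True counts as 1
count_any_case = desc.str.contains('fruity', case=False).sum()  # ignore case

print(count, count_any_case)
```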
# A case that looks at two columns at once: wines from Canada get 3 stars; otherwise 95 points or more also gets 3 stars, 85 to 94 gets 2 stars, and below 85 gets 1 star
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1
star_ratings = reviews.apply(stars,axis='columns')
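To check each branch of the rule, the same function can be run on a four-row made-up frame with one row per case. The frame demo is hypothetical.

```python
import pandas as pd

def stars(row):
    if row.country == 'Canada':
        return 3          # Canada always gets 3 stars
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

# One hypothetical row per branch of the rule.
demo = pd.DataFrame({
    'country': ['Canada', 'Italy', 'US', 'France'],
    'points': [80, 96, 90, 84],
})

star_ratings = demo.apply(stars, axis='columns')  # one rating per row
print(list(star_ratings))
```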