import numpy as np
import pandas as pd
Week 01 (Data Transformation)
Prerequisites
Pada module ini kita akan coba mememahami package pandas, yang merupakan package inti dalam sains-data. kita akan coba melakukan beberapa transformasi data menggunakan pandas.
sebelum itu, python module
di bawah ini yang akan digunakan selama praktikum.
Series
pandas.Series
sangat mirip dengan array NumPy (bahkan dibangun di atas objek array NumPy). Yang membedakan array NumPy dari sebuah Series adalah bahwa sebuah Series dapat memiliki label index, yang berarti dapat diindeks dengan label, bukan hanya lokasi nomor saja. Selain itu, sebuah Series tidak perlu menyimpan data numerik, ia dapat menyimpan objek Python sembarang.
Membuat pd.Series dengan list
Paling mudah, ktia dapat membuat pd.Series
dengan python list
= ['a','b','c','d','e']
my_index= [1,2,3,4,5]
my_data= pd.Series(data=my_data, index=my_index) my_series
print(my_series)
print(my_series.__class__)
a 1
b 2
c 3
d 4
e 5
dtype: int64
<class 'pandas.core.series.Series'>
Membuat pd.Series dengan dictionary
Kita juga dapat membuat pd.Series
dengan dictionary
# creating a series from a dictionary
= {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
my_dict= pd.Series(my_dict) my_series_dict
print(my_series_dict)
print(my_series_dict.__class__)
a 1
b 2
c 3
d 4
e 5
dtype: int64
<class 'pandas.core.series.Series'>
Operasi pada Series
# Imaginary Sales Data for 1st and 2nd Quarters for Global Company
= {'Japan': 80, 'China': 450, 'India': 200, 'USA': 250}
q1 = {'Brazil': 100,'China': 500, 'India': 210,'USA': 260}
q2
# Creating a Series from a Dictionary q1 and q2
= pd.Series(q1)
q1_series= pd.Series(q2) q2_series
print(q1_series)
Japan 80
China 450
India 200
USA 250
dtype: int64
Kita dapat mengindeks dengan label
# call values of q1_series based on named index
print(q1_series['Japan'])
print(q1_series['China'])
print(q1_series['India'])
80
450
200
kita dapat tetap dapat mengindeks dengan integer
# u can also call values of q1_series based on positional index
print(q1_series[0])
print(q1_series[1])
print(q1_series[2])
80
450
200
hati-hati dalam melakukan indexing dengan label. bisa saja terjadi error jika label tidak ada di dalam pd.series
# remember named index is case sensitive
try:
print(q1_series['japan'])
except:
print('something went wrong')
something went wrong
Operasi aritmatik sederhana pada pd.Series
bersifat broadcasting
# operations with arithmetic on series are broadcasted to all values
print(q1_series*2)
Japan 160
China 900
India 400
USA 500
dtype: int64
print(q1_series+1000)
Japan 1080
China 1450
India 1200
USA 1250
dtype: int64
# operation between series are also broadcasted
print(q1_series+q2_series)
Brazil NaN
China 950.0
India 410.0
Japan NaN
USA 510.0
dtype: float64
print(q1_series.add(q2_series, fill_value=0))
Brazil 100.0
China 950.0
India 410.0
Japan 80.0
USA 510.0
dtype: float64
data frame
Sebuah pd.DataFrame
terdiri dari beberapa pd.Series
yang berbagi nilai indeks.
= np.random.randint(0,100,12).reshape(4,3)
my_data my_data
array([[25, 59, 18],
[75, 54, 65],
[29, 21, 7],
[32, 69, 16]])
Kita akan membuat pd.Dataframe
melalui python list
. Perhatikan bahwa kita dapat memberikan nama pada kolom dan baris
= ["jakarta", "bandung", "surabaya", "medan"]
my_index= ["apple", "orange", "banana"]
my_columns
= pd.DataFrame(data=my_data, index=my_index, columns=my_columns)
df df
apple | orange | banana | |
---|---|---|---|
jakarta | 25 | 59 | 18 |
bandung | 75 | 54 | 65 |
surabaya | 29 | 21 | 7 |
medan | 32 | 69 | 16 |
= pd.DataFrame(data=my_data)
df_2 df_2
0 | 1 | 2 | |
---|---|---|---|
0 | 25 | 59 | 18 |
1 | 75 | 54 | 65 |
2 | 29 | 21 | 7 |
3 | 32 | 69 | 16 |
= pd.DataFrame(data=my_data, columns=my_columns)
df_3 df_3
apple | orange | banana | |
---|---|---|---|
0 | 25 | 59 | 18 |
1 | 75 | 54 | 65 |
2 | 29 | 21 | 7 |
3 | 32 | 69 | 16 |
membaca file csv sebagai pd.DataFrame
Jika berkas .py atau .ipynb Anda berada di lokasi folder yang sama persis dengan berkas .csv yang ingin Anda baca, cukup berikan nama berkas sebagai string, misalnya:
df = pd.read_csv('[some_file.csv')
Berikan s berkas jika Anda berada di direktori yang berbeda. Jalur berkas harus 100% benar agar ini berfungsi. Misalnya:
df = pd.read_csv("C:\\Users\\myself\\files\\some_file.csv")
sebelum itu, kalian dapat mendownload data tersebut melalui link berikut
pwd
'c:\\Users\\user\\Documents\\root\\personal\\github-personal\\sains-data-2023\\main-module'
= pd.read_csv('./data/tips.csv') df_tips
df_tips
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 9.68 | Michael Avila | 5296068606052842 | Sat2657 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 13.59 | Monica Sanders | 3506806155565404 | Sat1766 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 11.34 | Keith Wong | 6011891618747196 | Sat3880 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 8.91 | Dennis Dixon | 4375220550950 | Sat17 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 9.39 | Michelle Hardin | 3511451626698139 | Thur672 |
244 rows × 11 columns
Operasi sederhana pada DataFrame
# mengecek nama kolom
df_tips.columns
Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size',
'price_per_person', 'Payer Name', 'CC Number', 'Payment ID'],
dtype='object')
# mengecek
df_tips.index
RangeIndex(start=0, stop=244, step=1)
5) df_tips.head(
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 |
5) df_tips.tail(
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 9.68 | Michael Avila | 5296068606052842 | Sat2657 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 13.59 | Monica Sanders | 3506806155565404 | Sat1766 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 11.34 | Keith Wong | 6011891618747196 | Sat3880 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 8.91 | Dennis Dixon | 4375220550950 | Sat17 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 9.39 | Michelle Hardin | 3511451626698139 | Thur672 |
df_tips.describe().transpose()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
total_bill | 244.0 | 1.978594e+01 | 8.902412e+00 | 3.070000e+00 | 1.334750e+01 | 1.779500e+01 | 2.412750e+01 | 5.081000e+01 |
tip | 244.0 | 2.998279e+00 | 1.383638e+00 | 1.000000e+00 | 2.000000e+00 | 2.900000e+00 | 3.562500e+00 | 1.000000e+01 |
size | 244.0 | 2.569672e+00 | 9.510998e-01 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
price_per_person | 244.0 | 7.888197e+00 | 2.914234e+00 | 2.880000e+00 | 5.800000e+00 | 7.255000e+00 | 9.390000e+00 | 2.027000e+01 |
CC Number | 244.0 | 2.563496e+15 | 2.369340e+15 | 6.040679e+10 | 3.040731e+13 | 3.525318e+15 | 4.553675e+15 | 6.596454e+15 |
Transformasi data (row-wise)
filtering
5) df_tips.head(
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 |
print(df_tips["size"] ==3)
= df_tips["size"] ==3 conditional_size
0 False
1 True
2 True
3 False
4 False
...
239 True
240 False
241 False
242 False
243 False
Name: size, Length: 244, dtype: bool
df_tips[conditional_size].head()
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
16 | 10.33 | 1.67 | Female | No | Sun | Dinner | 3 | 3.44 | Elizabeth Foster | 4240025044626033 | Sun9715 |
17 | 16.29 | 3.71 | Male | No | Sun | Dinner | 3 | 5.43 | John Pittman | 6521340257218708 | Sun2998 |
18 | 16.97 | 3.50 | Female | No | Sun | Dinner | 3 | 5.66 | Laura Martinez | 30422275171379 | Sun2789 |
= (df_tips["size"]==3) & (df_tips["total_bill"]>20)
conditionalprint(conditional)
0 False
1 False
2 True
3 False
4 False
...
239 True
240 False
241 False
242 False
243 False
Length: 244, dtype: bool
df_tips[conditional].head()
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
19 | 20.65 | 3.35 | Male | No | Sat | Dinner | 3 | 6.88 | Timothy Oneal | 6568069240986485 | Sat9213 |
35 | 24.06 | 3.60 | Male | No | Sat | Dinner | 3 | 8.02 | Joseph Mullins | 5519770449260299 | Sat632 |
39 | 31.27 | 5.00 | Male | No | Sat | Dinner | 3 | 10.42 | Mr. Brandon Berry | 6011525851069856 | Sat6373 |
48 | 28.55 | 2.05 | Male | No | Sun | Dinner | 3 | 9.52 | Austin Fisher | 6011481668986587 | Sun4142 |
"size"]==3) & (df_tips["total_bill"]>20)].head() df_tips[(df_tips[
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
19 | 20.65 | 3.35 | Male | No | Sat | Dinner | 3 | 6.88 | Timothy Oneal | 6568069240986485 | Sat9213 |
35 | 24.06 | 3.60 | Male | No | Sat | Dinner | 3 | 8.02 | Joseph Mullins | 5519770449260299 | Sat632 |
39 | 31.27 | 5.00 | Male | No | Sat | Dinner | 3 | 10.42 | Mr. Brandon Berry | 6011525851069856 | Sat6373 |
48 | 28.55 | 2.05 | Male | No | Sun | Dinner | 3 | 9.52 | Austin Fisher | 6011481668986587 | Sun4142 |
= ["Sun", "Sat"]
weekend= df_tips["day"].isin(weekend)
conditional_in df_tips[conditional_in].head()
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 |
df_tips.head()
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 |
mencari nilai unik
"day"].unique() df_tips[
array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)
"day","time"])[["day","time"]] df_tips.drop_duplicates([
day | time | |
---|---|---|
0 | Sun | Dinner |
19 | Sat | Dinner |
77 | Thur | Lunch |
90 | Fri | Dinner |
220 | Fri | Lunch |
243 | Thur | Dinner |
Transforming Data (Column Wise)
Selecting Columns
print(df_tips["day"])
print("=======")
print(df_tips.day)
0 Sun
1 Sun
2 Sun
3 Sun
4 Sun
...
239 Sat
240 Sat
241 Sat
242 Sat
243 Thur
Name: day, Length: 244, dtype: object
=======
0 Sun
1 Sun
2 Sun
3 Sun
4 Sun
...
239 Sat
240 Sat
241 Sat
242 Sat
243 Thur
Name: day, Length: 244, dtype: object
"day","time"]] df_tips[[
day | time | |
---|---|---|
0 | Sun | Dinner |
1 | Sun | Dinner |
2 | Sun | Dinner |
3 | Sun | Dinner |
4 | Sun | Dinner |
... | ... | ... |
239 | Sat | Dinner |
240 | Sat | Dinner |
241 | Sat | Dinner |
242 | Sat | Dinner |
243 | Thur | Dinner |
244 rows × 2 columns
Mutating (create new column)
"tips_percentage"]= df_tips["tip"]/df_tips["total_bill"]*100
df_tips[
df_tips.head()
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | tips_percentage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 | 5.944673 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 | 16.054159 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 | 16.658734 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 | 13.978041 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 | 14.680765 |
renaming column
={"tips_percentage":"tips_percentage_%"}, inplace=True)
df_tips.rename(columns df_tips.head()
total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | Payment ID | tips_percentage_% | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 | Sun2959 | 5.944673 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 | Sun4608 | 16.054159 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 | Sun4458 | 16.658734 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 | Sun5260 | 13.978041 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 | Sun2251 | 14.680765 |
relocate columns
#relocate tips_percentage_% column to the rightmost
= list(df_tips.columns)
cols= [cols[-1]]+ cols[:-2]
cols
= df_tips[cols] df_tips
df_tips
tips_percentage_% | total_bill | tip | sex | smoker | day | time | size | price_per_person | Payer Name | CC Number | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5.944673 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.49 | Christy Cunningham | 3560325168603410 |
1 | 16.054159 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.45 | Douglas Tucker | 4478071379779230 |
2 | 16.658734 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 7.00 | Travis Walters | 6011812112971322 |
3 | 13.978041 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 11.84 | Nathaniel Harris | 4676137647685994 |
4 | 14.680765 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 6.15 | Tonya Carter | 4832732618637221 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
239 | 20.392697 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 9.68 | Michael Avila | 5296068606052842 |
240 | 7.358352 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 13.59 | Monica Sanders | 3506806155565404 |
241 | 8.822232 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 11.34 | Keith Wong | 6011891618747196 |
242 | 9.820426 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 8.91 | Dennis Dixon | 4375220550950 |
243 | 15.974441 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 9.39 | Michelle Hardin | 3511451626698139 |
244 rows × 11 columns