Week 01 (Data Transformation)

Prerequisites

Pada module ini kita akan coba mememahami package pandas, yang merupakan package inti dalam sains-data. kita akan coba melakukan beberapa transformasi data menggunakan pandas.

sebelum itu, python module di bawah ini yang akan digunakan selama praktikum.

import numpy as np
import pandas as pd

Series

pandas.Series sangat mirip dengan array NumPy (bahkan dibangun di atas objek array NumPy). Yang membedakan array NumPy dari sebuah Series adalah bahwa sebuah Series dapat memiliki label index, yang berarti dapat diindeks dengan label, bukan hanya lokasi nomor saja. Selain itu, sebuah Series tidak perlu menyimpan data numerik, ia dapat menyimpan objek Python sembarang.

Membuat pd.Series dengan list

Paling mudah, ktia dapat membuat pd.Series dengan python list

my_index= ['a','b','c','d','e']
my_data= [1,2,3,4,5]
my_series= pd.Series(data=my_data, index=my_index)
print(my_series)
print(my_series.__class__)
a    1
b    2
c    3
d    4
e    5
dtype: int64
<class 'pandas.core.series.Series'>

Membuat pd.Series dengan dictionary

Kita juga dapat membuat pd.Series dengan dictionary

# creating a series from a dictionary
my_dict= {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
my_series_dict= pd.Series(my_dict)
print(my_series_dict)
print(my_series_dict.__class__)
a    1
b    2
c    3
d    4
e    5
dtype: int64
<class 'pandas.core.series.Series'>

Operasi pada Series

# Imaginary Sales Data for 1st and 2nd Quarters for Global Company
q1 = {'Japan': 80, 'China': 450, 'India': 200, 'USA': 250}
q2 = {'Brazil': 100,'China': 500, 'India': 210,'USA': 260}


# Creating a Series from a Dictionary q1 and q2
q1_series= pd.Series(q1)
q2_series= pd.Series(q2)
print(q1_series)
Japan     80
China    450
India    200
USA      250
dtype: int64

Kita dapat mengindeks dengan label

# call values of q1_series based on named index
print(q1_series['Japan'])
print(q1_series['China'])
print(q1_series['India'])
80
450
200

kita dapat tetap dapat mengindeks dengan integer

# u can also call values of q1_series based on positional index
print(q1_series[0])
print(q1_series[1])
print(q1_series[2])
80
450
200

hati-hati dalam melakukan indexing dengan label. bisa saja terjadi error jika label tidak ada di dalam pd.series

# remember named index is case sensitive
try:
    print(q1_series['japan'])
except:
    print('something went wrong')
something went wrong

Operasi aritmatik sederhana pada pd.Series bersifat broadcasting

# operations with arithmetic on series are broadcasted to all values
print(q1_series*2)
Japan    160
China    900
India    400
USA      500
dtype: int64
print(q1_series+1000)
Japan    1080
China    1450
India    1200
USA      1250
dtype: int64
# operation between series are also broadcasted
print(q1_series+q2_series)
Brazil      NaN
China     950.0
India     410.0
Japan       NaN
USA       510.0
dtype: float64
print(q1_series.add(q2_series, fill_value=0))
Brazil    100.0
China     950.0
India     410.0
Japan      80.0
USA       510.0
dtype: float64

data frame

Sebuah pd.DataFrame terdiri dari beberapa pd.Series yang berbagi nilai indeks.

my_data= np.random.randint(0,100,12).reshape(4,3)
my_data
array([[25, 59, 18],
       [75, 54, 65],
       [29, 21,  7],
       [32, 69, 16]])

Kita akan membuat pd.Dataframe melalui python list. Perhatikan bahwa kita dapat memberikan nama pada kolom dan baris

my_index= ["jakarta", "bandung", "surabaya", "medan"]
my_columns= ["apple", "orange", "banana"]

df= pd.DataFrame(data=my_data, index=my_index, columns=my_columns)
df
apple orange banana
jakarta 25 59 18
bandung 75 54 65
surabaya 29 21 7
medan 32 69 16
df_2= pd.DataFrame(data=my_data)
df_2
0 1 2
0 25 59 18
1 75 54 65
2 29 21 7
3 32 69 16
df_3= pd.DataFrame(data=my_data, columns=my_columns)
df_3
apple orange banana
0 25 59 18
1 75 54 65
2 29 21 7
3 32 69 16

membaca file csv sebagai pd.DataFrame

Jika berkas .py atau .ipynb Anda berada di lokasi folder yang sama persis dengan berkas .csv yang ingin Anda baca, cukup berikan nama berkas sebagai string, misalnya:

df = pd.read_csv('[some_file.csv')

Berikan s berkas jika Anda berada di direktori yang berbeda. Jalur berkas harus 100% benar agar ini berfungsi. Misalnya:

df = pd.read_csv("C:\\Users\\myself\\files\\some_file.csv")

sebelum itu, kalian dapat mendownload data tersebut melalui link berikut

Download

pwd
'c:\\Users\\user\\Documents\\root\\personal\\github-personal\\sains-data-2023\\main-module'
df_tips= pd.read_csv('./data/tips.csv')
df_tips
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251
... ... ... ... ... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3 9.68 Michael Avila 5296068606052842 Sat2657
240 27.18 2.00 Female Yes Sat Dinner 2 13.59 Monica Sanders 3506806155565404 Sat1766
241 22.67 2.00 Male Yes Sat Dinner 2 11.34 Keith Wong 6011891618747196 Sat3880
242 17.82 1.75 Male No Sat Dinner 2 8.91 Dennis Dixon 4375220550950 Sat17
243 18.78 3.00 Female No Thur Dinner 2 9.39 Michelle Hardin 3511451626698139 Thur672

244 rows × 11 columns

Operasi sederhana pada DataFrame

# mengecek nama kolom
df_tips.columns
Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size',
       'price_per_person', 'Payer Name', 'CC Number', 'Payment ID'],
      dtype='object')
# mengecek 
df_tips.index
RangeIndex(start=0, stop=244, step=1)
df_tips.head(5)
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251
df_tips.tail(5)
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
239 29.03 5.92 Male No Sat Dinner 3 9.68 Michael Avila 5296068606052842 Sat2657
240 27.18 2.00 Female Yes Sat Dinner 2 13.59 Monica Sanders 3506806155565404 Sat1766
241 22.67 2.00 Male Yes Sat Dinner 2 11.34 Keith Wong 6011891618747196 Sat3880
242 17.82 1.75 Male No Sat Dinner 2 8.91 Dennis Dixon 4375220550950 Sat17
243 18.78 3.00 Female No Thur Dinner 2 9.39 Michelle Hardin 3511451626698139 Thur672
df_tips.describe().transpose()
count mean std min 25% 50% 75% max
total_bill 244.0 1.978594e+01 8.902412e+00 3.070000e+00 1.334750e+01 1.779500e+01 2.412750e+01 5.081000e+01
tip 244.0 2.998279e+00 1.383638e+00 1.000000e+00 2.000000e+00 2.900000e+00 3.562500e+00 1.000000e+01
size 244.0 2.569672e+00 9.510998e-01 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 6.000000e+00
price_per_person 244.0 7.888197e+00 2.914234e+00 2.880000e+00 5.800000e+00 7.255000e+00 9.390000e+00 2.027000e+01
CC Number 244.0 2.563496e+15 2.369340e+15 6.040679e+10 3.040731e+13 3.525318e+15 4.553675e+15 6.596454e+15

Transformasi data (row-wise)

filtering


df_tips.head(5)
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251
print(df_tips["size"] ==3)
conditional_size = df_tips["size"] ==3
0      False
1       True
2       True
3      False
4      False
       ...  
239     True
240    False
241    False
242    False
243    False
Name: size, Length: 244, dtype: bool
df_tips[conditional_size].head()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
16 10.33 1.67 Female No Sun Dinner 3 3.44 Elizabeth Foster 4240025044626033 Sun9715
17 16.29 3.71 Male No Sun Dinner 3 5.43 John Pittman 6521340257218708 Sun2998
18 16.97 3.50 Female No Sun Dinner 3 5.66 Laura Martinez 30422275171379 Sun2789
conditional= (df_tips["size"]==3) & (df_tips["total_bill"]>20)
print(conditional)
0      False
1      False
2       True
3      False
4      False
       ...  
239     True
240    False
241    False
242    False
243    False
Length: 244, dtype: bool
df_tips[conditional].head()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
19 20.65 3.35 Male No Sat Dinner 3 6.88 Timothy Oneal 6568069240986485 Sat9213
35 24.06 3.60 Male No Sat Dinner 3 8.02 Joseph Mullins 5519770449260299 Sat632
39 31.27 5.00 Male No Sat Dinner 3 10.42 Mr. Brandon Berry 6011525851069856 Sat6373
48 28.55 2.05 Male No Sun Dinner 3 9.52 Austin Fisher 6011481668986587 Sun4142
df_tips[(df_tips["size"]==3) & (df_tips["total_bill"]>20)].head()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
19 20.65 3.35 Male No Sat Dinner 3 6.88 Timothy Oneal 6568069240986485 Sat9213
35 24.06 3.60 Male No Sat Dinner 3 8.02 Joseph Mullins 5519770449260299 Sat632
39 31.27 5.00 Male No Sat Dinner 3 10.42 Mr. Brandon Berry 6011525851069856 Sat6373
48 28.55 2.05 Male No Sun Dinner 3 9.52 Austin Fisher 6011481668986587 Sun4142
weekend= ["Sun", "Sat"]
conditional_in= df_tips["day"].isin(weekend)
df_tips[conditional_in].head()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251
df_tips.head()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251

mencari nilai unik

df_tips["day"].unique()
array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)
df_tips.drop_duplicates(["day","time"])[["day","time"]]
day time
0 Sun Dinner
19 Sat Dinner
77 Thur Lunch
90 Fri Dinner
220 Fri Lunch
243 Thur Dinner

Transforming Data (Column Wise)

Selecting Columns

print(df_tips["day"])
print("=======")
print(df_tips.day)
0       Sun
1       Sun
2       Sun
3       Sun
4       Sun
       ... 
239     Sat
240     Sat
241     Sat
242     Sat
243    Thur
Name: day, Length: 244, dtype: object
=======
0       Sun
1       Sun
2       Sun
3       Sun
4       Sun
       ... 
239     Sat
240     Sat
241     Sat
242     Sat
243    Thur
Name: day, Length: 244, dtype: object
df_tips[["day","time"]]
day time
0 Sun Dinner
1 Sun Dinner
2 Sun Dinner
3 Sun Dinner
4 Sun Dinner
... ... ...
239 Sat Dinner
240 Sat Dinner
241 Sat Dinner
242 Sat Dinner
243 Thur Dinner

244 rows × 2 columns

Mutating (create new column)

df_tips["tips_percentage"]= df_tips["tip"]/df_tips["total_bill"]*100

df_tips.head()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID tips_percentage
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959 5.944673
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608 16.054159
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458 16.658734
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260 13.978041
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251 14.680765

renaming column

df_tips.rename(columns={"tips_percentage":"tips_percentage_%"}, inplace=True)
df_tips.head()
total_bill tip sex smoker day time size price_per_person Payer Name CC Number Payment ID tips_percentage_%
0 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410 Sun2959 5.944673
1 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230 Sun4608 16.054159
2 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322 Sun4458 16.658734
3 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994 Sun5260 13.978041
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221 Sun2251 14.680765

relocate columns

#relocate tips_percentage_% column to the rightmost
cols= list(df_tips.columns)
cols= [cols[-1]]+ cols[:-2]

df_tips= df_tips[cols]
df_tips
tips_percentage_% total_bill tip sex smoker day time size price_per_person Payer Name CC Number
0 5.944673 16.99 1.01 Female No Sun Dinner 2 8.49 Christy Cunningham 3560325168603410
1 16.054159 10.34 1.66 Male No Sun Dinner 3 3.45 Douglas Tucker 4478071379779230
2 16.658734 21.01 3.50 Male No Sun Dinner 3 7.00 Travis Walters 6011812112971322
3 13.978041 23.68 3.31 Male No Sun Dinner 2 11.84 Nathaniel Harris 4676137647685994
4 14.680765 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 4832732618637221
... ... ... ... ... ... ... ... ... ... ... ...
239 20.392697 29.03 5.92 Male No Sat Dinner 3 9.68 Michael Avila 5296068606052842
240 7.358352 27.18 2.00 Female Yes Sat Dinner 2 13.59 Monica Sanders 3506806155565404
241 8.822232 22.67 2.00 Male Yes Sat Dinner 2 11.34 Keith Wong 6011891618747196
242 9.820426 17.82 1.75 Male No Sat Dinner 2 8.91 Dennis Dixon 4375220550950
243 15.974441 18.78 3.00 Female No Thur Dinner 2 9.39 Michelle Hardin 3511451626698139

244 rows × 11 columns

Groupby