📈 Data Preprocessing: Missing Data 📈



What Is Missing Data?

Missing data refers to observations in the data set under study that lack one or more values.


What Are the Types of Missing Data?

Missing Completely at Random (MCAR): Missing values that arise entirely at random, caused neither by other variables nor by a structural problem.

Missing at Random (MAR): Missingness that can arise depending on other variables.

Missing Not at Random (MNAR): Missingness that cannot be ignored and that arises from structural problems.
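
To make the three types concrete, here is a minimal simulation sketch. The `age` and `income` columns, the probabilities, and the thresholds are all made-up assumptions for illustration; they do not come from the data set used later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
# Hypothetical data: "age" and "income" are illustration columns only.
data = pd.DataFrame({
    "age": rng.integers(20, 65, size=n),
    "income": rng.normal(5000, 1500, size=n),
})

# MCAR: every income value has the same 20% chance of being removed.
mcar = data.copy()
mcar.loc[rng.random(n) < 0.2, "income"] = np.nan

# MAR: the chance of a missing income depends on another observed variable (age).
mar = data.copy()
p_mar = np.where(data["age"] < 30, 0.5, 0.05)
mar.loc[rng.random(n) < p_mar, "income"] = np.nan

# MNAR: the chance of a missing income depends on the income value itself.
mnar = data.copy()
p_mnar = np.where(data["income"] < 4000, 0.6, 0.05)
mnar.loc[rng.random(n) < p_mnar, "income"] = np.nan

for name, frame in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, "missing share:", round(frame["income"].isnull().mean(), 3))
```

In the MNAR case the observed incomes are biased upward, because low incomes are the ones that tend to disappear. This is why that type cannot simply be ignored.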


Testing the Randomness of Missing Data

  • Visual techniques
  • Independent two-sample t-test
  • Correlation test
  • Little's MCAR test
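
Two of the tests listed above can be sketched with SciPy. The idea is to split a fully observed variable into two groups by whether another variable is missing, then check whether the groups differ. The toy columns and probabilities below are assumptions for illustration. Little's MCAR test is not shown, since SciPy does not provide it.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 2000
# Hypothetical columns: "age" is fully observed, "income" will get gaps.
toy = pd.DataFrame({
    "age": rng.normal(40, 10, size=n),
    "income": rng.normal(5000, 1500, size=n),
})
# Make income more likely to be missing for younger people (MAR by construction).
toy.loc[rng.random(n) < np.where(toy["age"] < 35, 0.5, 0.05), "income"] = np.nan

# Indicator: True where income is missing, False where it is observed.
is_missing = toy["income"].isnull()

# Independent two-sample t-test (Welch's variant):
# does mean age differ between the "missing" and "observed" groups?
t_stat, p_value = stats.ttest_ind(
    toy.loc[is_missing, "age"], toy.loc[~is_missing, "age"], equal_var=False
)
print("t =", t_stat, "p =", p_value)

# Correlation test between the missingness indicator and age.
r, p_corr = stats.pearsonr(is_missing.astype(int), toy["age"])
print("r =", r, "p =", p_corr)
```

A small p-value suggests that the missingness is related to the other variable, which is evidence against the values being missing completely at random.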


The Informativeness of Missing Data

Missing values can sometimes carry information. For example, in a salary survey, people who earn less than 5k may deliberately leave the salary field blank. Or, in a survey whose results are public, a respondent who holds the minority opinion may avoid answering that question.

In such cases, a separate column should be added to record whether the value was originally NA. For example, if the Maas (salary) column contains NA values, a column named "maas-NA" is created, with F entered for rows that had a value and T for rows that did not (NA). This way the information carried by the missing values is not lost.

The data set in its original form:

The data set after filling the missing values with the mean and adding the new variable:
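
The two states described above can be reproduced with a small sketch. The salary values are made up; only the column names `Maas` and `maas-NA` and the T/F coding follow the text.

```python
import numpy as np
import pandas as pd

# Hypothetical salary data; the values are invented for illustration.
survey = pd.DataFrame({"Maas": [7000.0, np.nan, 9000.0, np.nan, 8000.0]})

# Record which rows were originally NA *before* filling: "T" if missing, "F" otherwise.
survey["maas-NA"] = np.where(survey["Maas"].isnull(), "T", "F")

# Now fill the missing salaries with the mean of the observed ones.
survey["Maas"] = survey["Maas"].fillna(survey["Maas"].mean())

print(survey)
```

The order matters: the indicator has to be created first, because once the column is filled there is no way to tell which values were originally missing.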


How Is the Missing Data Problem Resolved?

  • Deletion methods

Deleting observations or variables

Listwise deletion (Listwise Method)

Pairwise deletion (Pairwise Method)

  • Value assignment methods

Mean, median

Assignment from the most similar unit (hot deck)

Assignment from an external source

  • Prediction-based methods

Machine learning

EM

Multiple imputation
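
The difference between the listwise and pairwise deletion methods listed above is easiest to see on a correlation matrix. The sketch below uses a made-up frame. It relies on the fact that pandas' `DataFrame.corr()` excludes NA values pair by pair, which amounts to pairwise deletion.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with gaps in different rows.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0, 12.0],
    "c": [1.0, 0.0, 1.0, np.nan, 0.0, 1.0],
})

# Listwise deletion: drop every row that has at least one missing value,
# then compute on the rows that remain.
listwise = toy.dropna()
print("rows kept by listwise deletion:", len(listwise))
print(listwise.corr())

# Pairwise deletion: for each pair of columns, use every row where both
# values are present. DataFrame.corr() excludes NA values pair by pair.
pairwise_corr = toy.corr()
print(pairwise_corr)

# Number of rows actually used for each pair of columns under pairwise deletion.
observed = toy.notnull().astype(int)
pair_counts = observed.T @ observed
print(pair_counts)
```

Pairwise deletion keeps more data for each statistic, but every entry of the matrix may then be based on a different subset of rows.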


The Harms of Deleting Missing Data Outright

Removing observations with missing values from the data set outright, without examining whether the missingness is random, reduces the reliability of the statistical inferences and modeling work that follow.

For missing observations to be removed from the data set outright, the missingness must have arisen at random: partially at random in some cases, completely at random in others.

Caution! You need to know whether the missingness in the data set is structural.

For example, consider the missing value below:

Before deleting this value from the data set outright, we need to understand whether the missingness is structural. Looking at the other variables, we find the following:

It is perfectly normal for a person who does not own a credit card to have no credit card spending. This value is therefore not truly missing. (To get rid of the NA, the value 0 can be assigned.)
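
The credit card case can be sketched as follows. The column names `has_credit_card` and `card_spending` and the values are hypothetical, since the original example was shown only as an image.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data; column names are illustrative assumptions.
customers = pd.DataFrame({
    "has_credit_card": [True, False, True, False, True],
    "card_spending": [1200.0, np.nan, np.nan, np.nan, 300.0],
})

# Structural missingness: no card means no spending, so 0 is the true value.
no_card = ~customers["has_credit_card"]
customers.loc[no_card & customers["card_spending"].isnull(), "card_spending"] = 0

# Whatever is still NA belongs to card holders and is genuinely missing.
print(customers)
print("genuinely missing:", customers["card_spending"].isnull().sum())
```

Only the NA values explained by the other variable are set to 0. The remaining NA (a card holder with unknown spending) still needs one of the methods discussed in this tutorial.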




Our Data Set

Required Libraries

In [1]:
import pandas as pd
import numpy as np
import missingno as msno  # To better see the randomness of the missing values


Overview of the Data Set

In [2]:
df = pd.read_csv("http://veribilim.online/data/MoviesOnStreamingPlatforms.csv")
In [3]:
df.head()
Out[3]:
Unnamed: 0 ID Title Year Age IMDb Rotten Tomatoes Netflix Hulu Prime Video Disney+ Type Directors Genres Country Language Runtime
0 0 1 The Irishman 2019 18+ 7.8/10 98/100 1 0 0 0 0 Martin Scorsese Biography,Crime,Drama United States English,Italian,Latin,Spanish,German 209.0
1 1 2 Dangal 2016 7+ 8.4/10 97/100 1 0 0 0 0 Nitesh Tiwari Action,Biography,Drama,Sport India,United States,United Kingdom,Australia,K... Hindi,English 161.0
2 2 3 David Attenborough: A Life on Our Planet 2020 7+ 9.0/10 95/100 1 0 0 0 0 Alastair Fothergill,Jonathan Hughes,Keith Scholey Documentary,Biography United Kingdom English 83.0
3 3 4 Lagaan: Once Upon a Time in India 2001 7+ 8.1/10 94/100 1 0 0 0 0 Ashutosh Gowariker Drama,Musical,Sport India,United Kingdom Hindi,English 224.0
4 4 5 Roma 2018 18+ 7.7/10 94/100 1 0 0 0 0 NaN Action,Drama,History,Romance,War United Kingdom,United States English 52.0
In [4]:
# Remove an unnecessary variable
del df["Unnamed: 0"]


Missing Data Analysis

In [5]:
# How many missing values does each variable have?

df.isnull().sum()
Out[5]:
ID                    0
Title                 0
Year                  0
Age                4177
IMDb                206
Rotten Tomatoes       7
Netflix               0
Hulu                  0
Prime Video           0
Disney+               0
Type                  0
Directors           411
Genres              116
Country             254
Language            313
Runtime             319
dtype: int64
In [6]:
# What percentage of the whole data set is missing?

total_cells = np.prod(df.shape)  # number of rows × number of columns
total_missing = df.isnull().sum().sum()
print("Total missing data: %", (total_missing / total_cells) * 100)
Total missing data: % 3.8117446137677353
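
The overall percentage hides how unevenly the gaps are spread. In the counts above, Age alone has 4177 missing values out of 9515 rows, roughly 44%. A per-column percentage can be sketched as below. A small stand-in frame is used so that the example runs on its own; on the tutorial's data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; with the tutorial's data, `df` would be used instead.
toy = pd.DataFrame({
    "Age": ["18+", np.nan, np.nan, "7+"],
    "IMDb": ["7.8/10", "8.4/10", np.nan, "8.1/10"],
    "Year": [2019, 2016, 2020, 2001],
})

# isnull() gives True/False per cell; the column mean of booleans is the missing share.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```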
In [7]:
# To analyze the randomness

msno.matrix(df);
In [8]:
# High correlations indicate that missing values occur in the same rows.

msno.heatmap(df);
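
For readers who want numbers rather than a plot, a rough pandas-only approximation is sketched below. It assumes that correlating the missing/not-missing flags of each column is close to what the heatmap displays. The stand-in frame is made up.

```python
import numpy as np
import pandas as pd

# Stand-in frame: "Country" and "Language" are missing in the same rows.
toy = pd.DataFrame({
    "Country": ["US", np.nan, "UK", np.nan, "IN", "US"],
    "Language": ["English", np.nan, "English", np.nan, "Hindi", "English"],
    "Runtime": [209.0, 161.0, np.nan, 224.0, 52.0, 99.0],
})

# Turn each cell into a 1/0 missing flag, then correlate the flags column by column.
nullity_corr = toy.isnull().astype(int).corr()
print(nullity_corr)
```

A value near 1 means two columns tend to be missing together. A value near -1 means that when one is missing the other tends to be present.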





Methods for Handling Missing Data


Deleting Missing Data

In [9]:
# Drops every row that contains a missing value (returns a new DataFrame; df itself is unchanged)

df.dropna()
Out[9]:
ID Title Year Age IMDb Rotten Tomatoes Netflix Hulu Prime Video Disney+ Type Directors Genres Country Language Runtime
0 1 The Irishman 2019 18+ 7.8/10 98/100 1 0 0 0 0 Martin Scorsese Biography,Crime,Drama United States English,Italian,Latin,Spanish,German 209.0
1 2 Dangal 2016 7+ 8.4/10 97/100 1 0 0 0 0 Nitesh Tiwari Action,Biography,Drama,Sport India,United States,United Kingdom,Australia,K... Hindi,English 161.0
2 3 David Attenborough: A Life on Our Planet 2020 7+ 9.0/10 95/100 1 0 0 0 0 Alastair Fothergill,Jonathan Hughes,Keith Scholey Documentary,Biography United Kingdom English 83.0
3 4 Lagaan: Once Upon a Time in India 2001 7+ 8.1/10 94/100 1 0 0 0 0 Ashutosh Gowariker Drama,Musical,Sport India,United Kingdom Hindi,English 224.0
5 6 To All the Boys I've Loved Before 2018 13+ 7.1/10 94/100 1 0 0 0 0 Susan Johnson Comedy,Drama,Romance United States English 99.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9459 9460 Big Sur-Wild California 2010 all 6.7/10 40/100 0 0 0 1 0 Sue Houghton Documentary,History United States English 50.0
9462 9463 Justin Morgan Had a Horse 1972 all 6.5/10 39/100 0 0 0 1 0 Hollingsworth Morse Family,Drama,Western United States English 91.0
9464 9465 Richie Rich's Christmas Wish 1998 all 4.1/10 39/100 0 0 0 1 0 John Murlowski Comedy,Family United States English 84.0
9494 9495 Sultan And The Rock Star 1980 all 5.6/10 34/100 0 0 0 1 0 Edward M. Abroms Adventure,Drama,Family United States English 60.0
9495 9496 My Music Story: Yoshiki 2021 16+ 7.3/10 33/100 0 0 0 1 0 Aiji Okazaki,Kentaro Takayanagi Documentary,Music United States,Japan Japanese 47.0

5080 rows × 16 columns

In [10]:
print("Original data set length:", len(df))
print("Data set length after dropping missing values outright:", len(df.dropna()))
Original data set length: 9515
Data set length after dropping missing values outright: 5080
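
Calling dropna() without arguments removes a row if any column is missing. Since Age alone accounts for 4177 of the missing values counted earlier, much of the drop from 9515 to 5080 rows likely comes from that one column. The sketch below, on a made-up frame, shows a few dropna() parameters that give finer control.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Age": [np.nan, "7+", "18+", np.nan, np.nan],
    "Runtime": [209.0, np.nan, 83.0, 224.0, np.nan],
})

# Default: drop any row with at least one missing value.
n_default = len(toy.dropna())

# Only require the columns the analysis actually needs.
n_subset = len(toy.dropna(subset=["Runtime"]))

# Keep rows that have at least 2 non-missing values.
n_thresh = len(toy.dropna(thresh=2))

# Drop columns instead of rows: removes every column that contains an NA.
cols_left = list(toy.dropna(axis=1).columns)

print(n_default, n_subset, n_thresh, cols_left)
```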

Note:

When deleting missing data, it makes sense to look at variable importance. For details, see the "yellowbrick.features" module.


Filling In Missing Data

Filling with the Mean

In [11]:
# Let's work with the Runtime variable.
# How many missing observations does the Runtime variable contain?

df.Runtime.isnull().sum()
Out[11]:
319
In [12]:
# Fill (fillna()) the missing values of the Runtime variable
# with the mean of the Runtime variable (df.Runtime.mean())

df.Runtime.fillna(df.Runtime.mean())
Out[12]:
0       209.000000
1       161.000000
2        83.000000
3       224.000000
4        52.000000
           ...    
9510     95.199435
9511     23.000000
9512     95.199435
9513     95.199435
9514     95.199435
Name: Runtime, Length: 9515, dtype: float64
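
Note that fillna() returns a new Series and leaves the DataFrame untouched, so the cell above only displays the result. A minimal sketch of making the change permanent, on a made-up stand-in column:

```python
import numpy as np
import pandas as pd

# Stand-in for the Runtime column; values are made up.
toy = pd.DataFrame({"Runtime": [209.0, 161.0, np.nan, 224.0, np.nan]})

# fillna() returns a new Series; the original column is unchanged.
filled = toy["Runtime"].fillna(toy["Runtime"].mean())
missing_before = toy["Runtime"].isnull().sum()

# Assign the result back to keep it.
toy["Runtime"] = filled
missing_after = toy["Runtime"].isnull().sum()

print(missing_before, missing_after)
```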

Filling with a Specific Value

In [13]:
# Set our value in advance

deger = 1
In [14]:
# Fill the missing values of the Runtime variable with the value we chose

df.Runtime.fillna(deger)
Out[14]:
0       209.0
1       161.0
2        83.0
3       224.0
4        52.0
        ...  
9510      1.0
9511     23.0
9512      1.0
9513      1.0
9514      1.0
Name: Runtime, Length: 9515, dtype: float64

Filling with SimpleImputer

Filling missing data with SimpleImputer is more convenient, but in principle it is the same as the methods shown above.

In [15]:
# We need to import it first

from sklearn.impute import SimpleImputer
In [16]:
# missing_values: which values to treat as missing and fill; np.nan targets the missing entries
# strategy: which method to use; "mean" fills with the column mean.

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
In [17]:
# Above, we specified the details of the operation we want to perform.
# Now we specify the variable(s) we want to apply it to.

imputer.fit_transform(df[["Runtime"]])
Out[17]:
array([[209.        ],
       [161.        ],
       [ 83.        ],
       ...,
       [ 95.19943454],
       [ 95.19943454],
       [ 95.19943454]])

As you can see, once we create an imputer object, we can fit_transform it on whichever variables we like.
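
One practical advantage over fillna() is that a fitted imputer remembers the statistics it learned and can apply them to other data. The sketch below uses two made-up frames, `train` and `new`, which are assumptions for illustration and not part of the tutorial's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric data; with the tutorial's data, columns of `df` would be used.
train = pd.DataFrame({"Runtime": [100.0, np.nan, 140.0], "Year": [2000.0, 2010.0, np.nan]})
new = pd.DataFrame({"Runtime": [np.nan, 90.0], "Year": [2020.0, np.nan]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit() learns one mean per column; statistics_ stores the learned means.
imputer.fit(train)
print(imputer.statistics_)

# transform() fills other data with the means learned from `train`.
new_filled = pd.DataFrame(imputer.transform(new), columns=new.columns)
print(new_filled)
```

Learning the means on one data set and reusing them on another keeps the fill values consistent, for example between a training set and a test set.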


Prediction-Based Filling of Missing Data

KNN

In [18]:
!pip install ycimpute  
# Import the required library.

from ycimpute.imputer import knnimput
In [19]:
# This step needs numeric data, so we select only the numeric columns of our data set

df_numeric = df.select_dtypes(include="number")
In [20]:
# Save the column names in a variable because we will need them later
# Convert df to an array because the algorithm works on arrays

var_names = list(df_numeric)
n_df = np.array(df_numeric)
In [21]:
# The k parameter is the number of neighbors. (It relates to the KNN algorithm; feel free to look it up.)
# This algorit
# We use the "n_df" variable that we converted to an array above.

dff = knnimput.KNN(k=4).complete(n_df)
/Users/ulcay/opt/anaconda3/lib/python3.8/site-packages/ycimpute/utils/normalizer.py:13: RuntimeWarning: invalid value encountered in true_divide
  x[:,col] = (x[:,col] - min_val)/(max_val - min_val)
Imputing row 1/9515 with 0 missing, elapsed time: 8.808
Imputing row 101/9515 with 0 missing, elapsed time: 8.808
Imputing row 201/9515 with 0 missing, elapsed time: 8.809
...
Imputing row 9501/9515 with 1 missing, elapsed time: 8.849
[KNN] Warning: 9515/76120 still missing after imputation, replacing with 0
In [22]:
# Convert dff, which is an array, back into a DataFrame

dff = pd.DataFrame(dff,columns=var_names)
dff.head()
Out[22]:
    ID    Year  Netflix  Hulu  Prime Video  Disney+  Type  Runtime
0  1.0  2019.0      1.0   0.0          0.0      0.0   NaN    209.0
1  2.0  2016.0      1.0   0.0          0.0      0.0   NaN    161.0
2  3.0  2020.0      1.0   0.0          0.0      0.0   NaN     83.0
3  4.0  2001.0      1.0   0.0          0.0      0.0   NaN    224.0
4  5.0  2018.0      1.0   0.0          0.0      0.0   NaN     52.0
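
In the output above, the Type column is entirely NaN. This appears consistent with the RuntimeWarning in the log: Type looks constant in this data, so min-max normalization would divide by zero on it. That reading is an assumption based on the log, not a confirmed diagnosis. As an alternative, here is a minimal sketch using scikit-learn's KNNImputer (a different implementation from ycimpute) on a made-up frame, with constant columns set aside first.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up numeric data; "Type" is constant, as it appears to be in the output above.
toy = pd.DataFrame({
    "Year": [2019.0, 2016.0, 2020.0, 2001.0, 2018.0, 2017.0],
    "Type": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "Runtime": [209.0, 161.0, np.nan, 224.0, 52.0, np.nan],
})

# Set aside columns with a single distinct value: they carry no information
# for finding neighbors, and min-max scaling would divide by zero on them.
varying = toy.loc[:, toy.nunique(dropna=True) > 1]

# Each missing value is replaced by the mean of that feature over the
# n_neighbors nearest rows that have the feature observed.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(varying), columns=varying.columns)
print(filled)
```

KNNImputer measures distance on the raw feature values, so on real data with features on very different scales it is usually worth scaling the columns before imputing.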