データチェックと基本的なデータの扱い
統計解析前のデータチェック・処理についてのまとめ。
1. エクセルデータの読み込み
readxlパッケージのread_excel関数が高速読み込みでおすすめ。
library(readxl)
dt <- read_excel("heart.xls")
Status | DeathCause | AgeCHD | Sex | AgeStart | Height | Weight | Diastolic | Systolic | MRW | Smoking | AgeDeath | Chol | CholStu | BPStu | WeightStu | SmokingStu |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dead | Other | NA | Female | 29 | 62.50 | 140 | 78 | 124 | 121 | 0 | 55 | NA | NA | Normal | Overweight | Non-smoker |
Dead | Cancer | NA | Female | 41 | 59.75 | 194 | 92 | 144 | 183 | 0 | 57 | 181 | Desirable | High | Overweight | Non-smoker |
Alive | NA | NA | Female | 57 | 62.25 | 132 | 90 | 170 | 114 | 10 | NA | 250 | High | High | Overweight | Moderate (6-15) |
Alive | NA | NA | Female | 39 | 65.75 | 158 | 80 | 128 | 123 | 0 | NA | 242 | High | Normal | Overweight | Non-smoker |
Alive | NA | NA | Male | 42 | 66.00 | 156 | 76 | 110 | 116 | 20 | NA | 281 | High | Optimal | Overweight | Heavy (16-25) |
Alive | NA | NA | Female | 58 | 61.75 | 131 | 92 | 176 | 117 | 0 | NA | 196 | Desirable | High | Overweight | Non-smoker |
2. データの構造確認
変数名、変数の型、データ数をチェック
str(dt)
# tibble [5,209 × 17] (S3: tbl_df/tbl/data.frame)
# $ Status : chr [1:5209] "Dead" "Dead" "Alive" "Alive" ...
# $ DeathCause: chr [1:5209] "Other" "Cancer" NA NA ...
# $ AgeCHD : num [1:5209] NA NA NA NA NA NA NA NA NA NA ...
# $ Sex : chr [1:5209] "Female" "Female" "Female" "Female" ...
# $ AgeStart : num [1:5209] 29 41 57 39 42 58 36 53 35 52 ...
# $ Height : num [1:5209] 62.5 59.8 62.2 65.8 66 ...
# $ Weight : num [1:5209] 140 194 132 158 156 131 136 130 194 129 ...
# $ Diastolic : num [1:5209] 78 92 90 80 76 92 80 80 68 78 ...
# $ Systolic : num [1:5209] 124 144 170 128 110 176 112 114 132 124 ...
# $ MRW : num [1:5209] 121 183 114 123 116 117 110 99 124 106 ...
# $ Smoking : num [1:5209] 0 0 10 0 20 0 15 0 0 5 ...
# $ AgeDeath : num [1:5209] 55 57 NA NA NA NA NA 77 NA 82 ...
# $ Chol : num [1:5209] NA 181 250 242 281 196 196 276 211 284 ...
# $ CholStu : chr [1:5209] NA "Desirable" "High" "High" ...
# $ BPStu : chr [1:5209] "Normal" "High" "High" "Normal" ...
# $ WeightStu : chr [1:5209] "Overweight" "Overweight" "Overweight" "Overweight" ...
# $ SmokingStu: chr [1:5209] "Non-smoker" "Non-smoker" "Moderate (6-15)" "Non-smoker" ...
3. 要約統計量
summary(dt)
# Status DeathCause AgeCHD Sex
# Length:5209 Length:5209 Min. :32.0 Length:5209
# Class :character Class :character 1st Qu.:57.0 Class :character
# Mode :character Mode :character Median :63.0 Mode :character
# Mean :63.3
# 3rd Qu.:70.0
# Max. :90.0
# NA's :3760
# AgeStart Height Weight Diastolic
# Min. :28.00 Min. :51.50 Min. : 67.0 Min. : 50.00
# 1st Qu.:37.00 1st Qu.:62.25 1st Qu.:132.0 1st Qu.: 76.00
# Median :43.00 Median :64.50 Median :150.0 Median : 84.00
# Mean :44.07 Mean :64.81 Mean :153.1 Mean : 85.36
# 3rd Qu.:51.00 3rd Qu.:67.50 3rd Qu.:172.0 3rd Qu.: 92.00
# Max. :62.00 Max. :76.50 Max. :300.0 Max. :160.00
# NA's :6 NA's :6
# Systolic MRW Smoking AgeDeath Chol
# Min. : 82.0 Min. : 67 Min. : 0.000 Min. :36.00 Min. : 96.0
# 1st Qu.:120.0 1st Qu.:106 1st Qu.: 0.000 1st Qu.:63.00 1st Qu.:196.0
# Median :132.0 Median :118 Median : 1.000 Median :71.00 Median :223.0
# Mean :136.9 Mean :120 Mean : 9.367 Mean :70.54 Mean :227.4
# 3rd Qu.:148.0 3rd Qu.:131 3rd Qu.:20.000 3rd Qu.:79.00 3rd Qu.:255.0
# Max. :300.0 Max. :268 Max. :60.000 Max. :93.00 Max. :568.0
# NA's :6 NA's :36 NA's :3218 NA's :152
# CholStu BPStu WeightStu SmokingStu
# Length:5209 Length:5209 Length:5209 Length:5209
# Class :character Class :character Class :character Class :character
# Mode :character Mode :character Mode :character Mode :character
#
#
#
#
4. 変数ごとの集計
データ$変数名でいずれかの列の変数を指定
library(tidyverse)
library(kableExtra)
library(plotly)
p <- ggplot(dt, aes(x=AgeStart))+geom_histogram()+theme_bw()+labs(x="AgeStart 追跡開始時年齢", y="度数")+stat_bin(bins=13)
p
table(dt$DeathCause) %>% knitr::kable() %>% kable_classic(full_width=F, html_font="san-serif")
Var1 | Freq |
---|---|
Cancer | 539 |
Cerebral Vascular Disease | 378 |
Coronary Heart Disease | 605 |
Other | 357 |
Unknown | 112 |
%/%は左のオブジェクトを右の関数の第1引数に引き渡すパイプ演算子
library(knitr)とknitr::は同じ意味
5. 列の追加
dplyrパッケージのmutate関数を使用
試しに開始年齢から平均値を引いた変数stdAgeを作成
dt1 <- dplyr::mutate(dt, stdAge=AgeStart-mean(AgeStart))
head(dt1$stdAge) %>% kable(digits=2, aption = "stdAge") %>% kable_classic(full_width=F , html_font="san-serif")
x |
---|
-15.07 |
-3.07 |
12.93 |
-5.07 |
-2.07 |
13.93 |
6. ソート
dt[order(dt[, 4]),]もしくはdplyr::arrange(dt, Sex)で昇順ソート。
dt[order(dt[, 4]),decreasing=T]もしくはdplyr::arrange(dt, desc(Sex)で降順ソート
dt_sort <- dplyr::arrange(dt, Sex)
head(dt_sort) %>% kable(caption = "性別でソート") %>% kable_classic_2(full_width=F, html_font="san-serif", font_size=10)
Status | DeathCause | AgeCHD | Sex | AgeStart | Height | Weight | Diastolic | Systolic | MRW | Smoking | AgeDeath | Chol | CholStu | BPStu | WeightStu | SmokingStu |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dead | Other | NA | Female | 29 | 62.50 | 140 | 78 | 124 | 121 | 0 | 55 | NA | NA | Normal | Overweight | Non-smoker |
Dead | Cancer | NA | Female | 41 | 59.75 | 194 | 92 | 144 | 183 | 0 | 57 | 181 | Desirable | High | Overweight | Non-smoker |
Alive | NA | NA | Female | 57 | 62.25 | 132 | 90 | 170 | 114 | 10 | NA | 250 | High | High | Overweight | Moderate (6-15) |
Alive | NA | NA | Female | 39 | 65.75 | 158 | 80 | 128 | 123 | 0 | NA | 242 | High | Normal | Overweight | Non-smoker |
Alive | NA | NA | Female | 58 | 61.75 | 131 | 92 | 176 | 117 | 0 | NA | 196 | Desirable | High | Overweight | Non-smoker |
Alive | NA | NA | Female | 36 | 64.75 | 136 | 80 | 112 | 110 | 15 | NA | 196 | Desirable | Normal | Overweight | Moderate (6-15) |