データチェックと基本的なデータの扱い
統計解析前のデータチェック・処理についてのまとめ。
1. エクセルデータの読み込み
readxlパッケージのread_excel関数が高速読み込みでおすすめ。
library(readxl)
dt <- read_excel("heart.xls")
| Status | DeathCause | AgeCHD | Sex | AgeStart | Height | Weight | Diastolic | Systolic | MRW | Smoking | AgeDeath | Chol | CholStu | BPStu | WeightStu | SmokingStu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dead | Other | NA | Female | 29 | 62.50 | 140 | 78 | 124 | 121 | 0 | 55 | NA | NA | Normal | Overweight | Non-smoker |
| Dead | Cancer | NA | Female | 41 | 59.75 | 194 | 92 | 144 | 183 | 0 | 57 | 181 | Desirable | High | Overweight | Non-smoker |
| Alive | NA | NA | Female | 57 | 62.25 | 132 | 90 | 170 | 114 | 10 | NA | 250 | High | High | Overweight | Moderate (6-15) |
| Alive | NA | NA | Female | 39 | 65.75 | 158 | 80 | 128 | 123 | 0 | NA | 242 | High | Normal | Overweight | Non-smoker |
| Alive | NA | NA | Male | 42 | 66.00 | 156 | 76 | 110 | 116 | 20 | NA | 281 | High | Optimal | Overweight | Heavy (16-25) |
| Alive | NA | NA | Female | 58 | 61.75 | 131 | 92 | 176 | 117 | 0 | NA | 196 | Desirable | High | Overweight | Non-smoker |
2. データの構造確認
変数名、変数の型、データ数をチェック
str(dt)
## tibble [5,209 × 17] (S3: tbl_df/tbl/data.frame)
## $ Status : chr [1:5209] "Dead" "Dead" "Alive" "Alive" ...
## $ DeathCause: chr [1:5209] "Other" "Cancer" NA NA ...
## $ AgeCHD : num [1:5209] NA NA NA NA NA NA NA NA NA NA ...
## $ Sex : chr [1:5209] "Female" "Female" "Female" "Female" ...
## $ AgeStart : num [1:5209] 29 41 57 39 42 58 36 53 35 52 ...
## $ Height : num [1:5209] 62.5 59.8 62.2 65.8 66 ...
## $ Weight : num [1:5209] 140 194 132 158 156 131 136 130 194 129 ...
## $ Diastolic : num [1:5209] 78 92 90 80 76 92 80 80 68 78 ...
## $ Systolic : num [1:5209] 124 144 170 128 110 176 112 114 132 124 ...
## $ MRW : num [1:5209] 121 183 114 123 116 117 110 99 124 106 ...
## $ Smoking : num [1:5209] 0 0 10 0 20 0 15 0 0 5 ...
## $ AgeDeath : num [1:5209] 55 57 NA NA NA NA NA 77 NA 82 ...
## $ Chol : num [1:5209] NA 181 250 242 281 196 196 276 211 284 ...
## $ CholStu : chr [1:5209] NA "Desirable" "High" "High" ...
## $ BPStu : chr [1:5209] "Normal" "High" "High" "Normal" ...
## $ WeightStu : chr [1:5209] "Overweight" "Overweight" "Overweight" "Overweight" ...
## $ SmokingStu: chr [1:5209] "Non-smoker" "Non-smoker" "Moderate (6-15)" "Non-smoker" ...
3. 要約統計量
summary(dt)
## Status DeathCause AgeCHD Sex
## Length:5209 Length:5209 Min. :32.0 Length:5209
## Class :character Class :character 1st Qu.:57.0 Class :character
## Mode :character Mode :character Median :63.0 Mode :character
## Mean :63.3
## 3rd Qu.:70.0
## Max. :90.0
## NA's :3760
## AgeStart Height Weight Diastolic
## Min. :28.00 Min. :51.50 Min. : 67.0 Min. : 50.00
## 1st Qu.:37.00 1st Qu.:62.25 1st Qu.:132.0 1st Qu.: 76.00
## Median :43.00 Median :64.50 Median :150.0 Median : 84.00
## Mean :44.07 Mean :64.81 Mean :153.1 Mean : 85.36
## 3rd Qu.:51.00 3rd Qu.:67.50 3rd Qu.:172.0 3rd Qu.: 92.00
## Max. :62.00 Max. :76.50 Max. :300.0 Max. :160.00
## NA's :6 NA's :6
## Systolic MRW Smoking AgeDeath Chol
## Min. : 82.0 Min. : 67 Min. : 0.000 Min. :36.00 Min. : 96.0
## 1st Qu.:120.0 1st Qu.:106 1st Qu.: 0.000 1st Qu.:63.00 1st Qu.:196.0
## Median :132.0 Median :118 Median : 1.000 Median :71.00 Median :223.0
## Mean :136.9 Mean :120 Mean : 9.367 Mean :70.54 Mean :227.4
## 3rd Qu.:148.0 3rd Qu.:131 3rd Qu.:20.000 3rd Qu.:79.00 3rd Qu.:255.0
## Max. :300.0 Max. :268 Max. :60.000 Max. :93.00 Max. :568.0
## NA's :6 NA's :36 NA's :3218 NA's :152
## CholStu BPStu WeightStu SmokingStu
## Length:5209 Length:5209 Length:5209 Length:5209
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
4. 変数ごとの集計
データ$変数名でいずれかの列の変数を指定
library(tidyverse)
library(kableExtra)
library(plotly)
p <- ggplot(dt, aes(x=AgeStart))+geom_histogram()+theme_bw()+labs(x="AgeStart 追跡開始時年齢", y="度数")+stat_bin(bins=13)
p

table(dt$DeathCause) %>% knitr::kable() %>% kable_classic(full_width=F, html_font="san-serif")
| Var1 | Freq |
|---|---|
| Cancer | 539 |
| Cerebral Vascular Disease | 378 |
| Coronary Heart Disease | 605 |
| Other | 357 |
| Unknown | 112 |
# %/%は左のオブジェクトを右の関数の第1引数に引き渡すパイプ演算子
# library(knitr)とknitr::は同じ意味
5. 列の追加
dplyrパッケージのmutate関数を使用
試しに開始年齢から平均値を引いた変数stdAgeを作成
dt1 <- dplyr::mutate(dt, stdAge=AgeStart-mean(AgeStart))
head(dt1$stdAge) %>% kable(digits=2, aption = "stdAge") %>% kable_classic(full_width=F , html_font="san-serif")
| x |
|---|
| -15.07 |
| -3.07 |
| 12.93 |
| -5.07 |
| -2.07 |
| 13.93 |
6. ソート
dt[order(dt[, 4]),]もしくはdplyr::arrange(dt, Sex)で昇順ソート。
dt[order(dt[, 4]),decreasing=T]もしくはdplyr::arrange(dt, desc(Sex)で降順ソート
dt_sort <- dplyr::arrange(dt, Sex)
head(dt_sort) %>% kable(caption = "性別でソート") %>% kable_classic_2(full_width=F, html_font="san-serif", font_size=10)
| Status | DeathCause | AgeCHD | Sex | AgeStart | Height | Weight | Diastolic | Systolic | MRW | Smoking | AgeDeath | Chol | CholStu | BPStu | WeightStu | SmokingStu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dead | Other | NA | Female | 29 | 62.50 | 140 | 78 | 124 | 121 | 0 | 55 | NA | NA | Normal | Overweight | Non-smoker |
| Dead | Cancer | NA | Female | 41 | 59.75 | 194 | 92 | 144 | 183 | 0 | 57 | 181 | Desirable | High | Overweight | Non-smoker |
| Alive | NA | NA | Female | 57 | 62.25 | 132 | 90 | 170 | 114 | 10 | NA | 250 | High | High | Overweight | Moderate (6-15) |
| Alive | NA | NA | Female | 39 | 65.75 | 158 | 80 | 128 | 123 | 0 | NA | 242 | High | Normal | Overweight | Non-smoker |
| Alive | NA | NA | Female | 58 | 61.75 | 131 | 92 | 176 | 117 | 0 | NA | 196 | Desirable | High | Overweight | Non-smoker |
| Alive | NA | NA | Female | 36 | 64.75 | 136 | 80 | 112 | 110 | 15 | NA | 196 | Desirable | Normal | Overweight | Moderate (6-15) |