Data checks and basic data handling

1. Reading Excel data

The read_excel function from the readxl package is recommended because it reads files quickly.

library(readxl)
dt <- read_excel("heart.xls")

Table 1: Heart.xls data
Status	DeathCause	AgeCHD	Sex	AgeStart	Height	Weight	Diastolic	Systolic	MRW	Smoking	AgeDeath	Chol	CholStu	BPStu	WeightStu	SmokingStu
Dead	Other	NA	Female	29	62.50	140	78	124	121	0	55	NA	NA	Normal	Overweight	Non-smoker
Dead	Cancer	NA	Female	41	59.75	194	92	144	183	0	57	181	Desirable	High	Overweight	Non-smoker
Alive	NA	NA	Female	57	62.25	132	90	170	114	10	NA	250	High	High	Overweight	Moderate (6-15)
Alive	NA	NA	Female	39	65.75	158	80	128	123	0	NA	242	High	Normal	Overweight	Non-smoker
Alive	NA	NA	Male	42	66.00	156	76	110	116	20	NA	281	High	Optimal	Overweight	Heavy (16-25)
Alive	NA	NA	Female	58	61.75	131	92	176	117	0	NA	196	Desirable	High	Overweight	Non-smoker
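
read_excel also takes arguments that control which sheet is read and how missing values are coded. The sketch below uses the example workbook bundled with readxl (located with readxl_example) rather than heart.xls, so the file and the sheet name "mtcars" are illustrative assumptions and not part of this tutorial's data.

```r
library(readxl)

# Path to an example workbook shipped with readxl (not heart.xls)
path <- readxl_example("datasets.xlsx")

# List the sheet names before reading
excel_sheets(path)

# Read one sheet, treat empty cells and the string "NA" as missing,
# and keep only the first 3 rows for a quick look
mt <- read_excel(path, sheet = "mtcars", na = c("", "NA"), n_max = 3)
mt
```

Passing na explicitly is worth considering when a workbook stores missing values as text, since otherwise a numeric column may be read as character.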

2. Checking the data structure

Check the variable names, the variable types, and the number of observations.

str(dt)

## tibble [5,209 × 17] (S3: tbl_df/tbl/data.frame)
##  $ Status    : chr [1:5209] "Dead" "Dead" "Alive" "Alive" ...
##  $ DeathCause: chr [1:5209] "Other" "Cancer" NA NA ...
##  $ AgeCHD    : num [1:5209] NA NA NA NA NA NA NA NA NA NA ...
##  $ Sex       : chr [1:5209] "Female" "Female" "Female" "Female" ...
##  $ AgeStart  : num [1:5209] 29 41 57 39 42 58 36 53 35 52 ...
##  $ Height    : num [1:5209] 62.5 59.8 62.2 65.8 66 ...
##  $ Weight    : num [1:5209] 140 194 132 158 156 131 136 130 194 129 ...
##  $ Diastolic : num [1:5209] 78 92 90 80 76 92 80 80 68 78 ...
##  $ Systolic  : num [1:5209] 124 144 170 128 110 176 112 114 132 124 ...
##  $ MRW       : num [1:5209] 121 183 114 123 116 117 110 99 124 106 ...
##  $ Smoking   : num [1:5209] 0 0 10 0 20 0 15 0 0 5 ...
##  $ AgeDeath  : num [1:5209] 55 57 NA NA NA NA NA 77 NA 82 ...
##  $ Chol      : num [1:5209] NA 181 250 242 281 196 196 276 211 284 ...
##  $ CholStu   : chr [1:5209] NA "Desirable" "High" "High" ...
##  $ BPStu     : chr [1:5209] "Normal" "High" "High" "Normal" ...
##  $ WeightStu : chr [1:5209] "Overweight" "Overweight" "Overweight" "Overweight" ...
##  $ SmokingStu: chr [1:5209] "Non-smoker" "Non-smoker" "Moderate (6-15)" "Non-smoker" ...
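
When the full str() listing is more than needed, the same three checks can be run one at a time. A minimal base R sketch on a small made-up data frame (toy is hypothetical and stands in for dt):

```r
# Hypothetical toy data standing in for dt
toy <- data.frame(
  Status   = c("Dead", "Alive", "Alive"),
  AgeStart = c(29, 57, 39),
  stringsAsFactors = FALSE
)

dim(toy)             # number of rows and columns
names(toy)           # variable names
sapply(toy, class)   # type of each variable
```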

3. Summary statistics

summary(dt)

##     Status           DeathCause            AgeCHD         Sex           
##  Length:5209        Length:5209        Min.   :32.0   Length:5209       
##  Class :character   Class :character   1st Qu.:57.0   Class :character  
##  Mode  :character   Mode  :character   Median :63.0   Mode  :character  
##                                        Mean   :63.3                     
##                                        3rd Qu.:70.0                     
##                                        Max.   :90.0                     
##                                        NA's   :3760                     
##     AgeStart         Height          Weight        Diastolic     
##  Min.   :28.00   Min.   :51.50   Min.   : 67.0   Min.   : 50.00  
##  1st Qu.:37.00   1st Qu.:62.25   1st Qu.:132.0   1st Qu.: 76.00  
##  Median :43.00   Median :64.50   Median :150.0   Median : 84.00  
##  Mean   :44.07   Mean   :64.81   Mean   :153.1   Mean   : 85.36  
##  3rd Qu.:51.00   3rd Qu.:67.50   3rd Qu.:172.0   3rd Qu.: 92.00  
##  Max.   :62.00   Max.   :76.50   Max.   :300.0   Max.   :160.00  
##                  NA's   :6       NA's   :6                       
##     Systolic          MRW         Smoking          AgeDeath          Chol      
##  Min.   : 82.0   Min.   : 67   Min.   : 0.000   Min.   :36.00   Min.   : 96.0  
##  1st Qu.:120.0   1st Qu.:106   1st Qu.: 0.000   1st Qu.:63.00   1st Qu.:196.0  
##  Median :132.0   Median :118   Median : 1.000   Median :71.00   Median :223.0  
##  Mean   :136.9   Mean   :120   Mean   : 9.367   Mean   :70.54   Mean   :227.4  
##  3rd Qu.:148.0   3rd Qu.:131   3rd Qu.:20.000   3rd Qu.:79.00   3rd Qu.:255.0  
##  Max.   :300.0   Max.   :268   Max.   :60.000   Max.   :93.00   Max.   :568.0  
##                  NA's   :6     NA's   :36       NA's   :3218    NA's   :152    
##    CholStu             BPStu            WeightStu          SmokingStu       
##  Length:5209        Length:5209        Length:5209        Length:5209       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##
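
For character columns summary() reports only the length and class, and NA counts appear only for numeric columns. One possible way to fill both gaps, sketched on made-up data (toy is hypothetical), is to count missing values per column with colSums(is.na()) and to convert a character column to a factor before summarising it.

```r
# Hypothetical toy data with one character and one numeric column
toy <- data.frame(
  Sex  = c("Female", "Male", "Female", NA),
  Chol = c(181, NA, 250, 242),
  stringsAsFactors = FALSE
)

# Number of missing values in every column, regardless of type
na_counts <- colSums(is.na(toy))
na_counts

# A factor is summarised as counts per level, with NA counted separately
sex_summary <- summary(factor(toy$Sex))
sex_summary
```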

4. Tabulating individual variables

Use data$variable_name to refer to a single column.

library(tidyverse)
library(kableExtra)
library(plotly)
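# Set the number of bins in geom_histogram itself; adding stat_bin() on top would draw a second histogram layer
p <- ggplot(dt, aes(x = AgeStart)) + geom_histogram(bins = 13) + theme_bw() + labs(x = "AgeStart (age at start of follow-up)", y = "Frequency")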
p

table(dt$DeathCause) %>% knitr::kable() %>% kable_classic(full_width=F, html_font="sans-serif")

Var1	Freq
Cancer	539
Cerebral Vascular Disease	378
Coronary Heart Disease	605
Other	357
Unknown	112
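
Note that table() drops missing values by default, so the counts above cover only subjects with a recorded cause of death (they sum to 1,991 of the 5,209 rows). A small sketch with a made-up vector (cause is hypothetical) shows how useNA = "ifany" adds a row for NA:

```r
# Hypothetical vector with two missing values
cause <- c("Cancer", "Other", NA, "Cancer", NA)

# Default: NA values are dropped from the counts
tab_default <- table(cause)
tab_default

# useNA = "ifany": NA gets its own row when any are present
tab_with_na <- table(cause, useNA = "ifany")
tab_with_na
```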

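# %>% is the pipe operator: it passes the object on its left as the first argument of the function on its right
# knitr::kable() calls kable without attaching knitr; after library(knitr), the same function can be called as kable()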
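
To make the two notes above concrete, here is a short sketch using magrittr, the package that provides %>% and is loaded with the tidyverse. The vector x is made up for illustration.

```r
library(magrittr)

x <- c(4, 9, 16)

# x %>% f() is evaluated as f(x), so these two lines compute the same value
piped  <- x %>% sqrt() %>% sum()
nested <- sum(sqrt(x))
identical(piped, nested)

# pkg::fun() calls a function without attaching the package first
med <- x %>% stats::median()
med
```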

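5. Adding a column

Use the mutate function from the dplyr package.
As an example, create a variable stdAge by subtracting the mean from the age at the start of follow-up.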

dt1 <- dplyr::mutate(dt, stdAge=AgeStart-mean(AgeStart))
head(dt1$stdAge) %>% kable(digits=2, caption = "stdAge") %>% kable_classic(full_width=F, html_font="sans-serif")

x
-15.07
-3.07
12.93
-5.07
-2.07
13.93
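
AgeStart has no missing values, so mean(AgeStart) works as written. For a column with missing values, such as Height (6 NAs in the summary above), mean() returns NA unless na.rm = TRUE is set. A minimal sketch on made-up data (toy and the new column names are hypothetical), which also shows that several columns can be created in one mutate call:

```r
library(dplyr)

# Hypothetical toy data with one missing height
toy <- data.frame(Height = c(62.5, NA, 65.5, 64.0))

toy1 <- mutate(
  toy,
  # Centre on the mean, ignoring missing values
  cHeight = Height - mean(Height, na.rm = TRUE),
  # Without na.rm the mean is NA, so every result is NA
  cHeight_bad = Height - mean(Height)
)
toy1
```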

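6. Sorting

dt[order(dt$Sex), ] or dplyr::arrange(dt, Sex) sorts in ascending order.
dt[order(dt$Sex, decreasing = TRUE), ] or dplyr::arrange(dt, desc(Sex)) sorts in descending order. Sex is the fourth column, so dt[[4]] can be used in place of dt$Sex (dt[, 4] on a tibble returns a one-column tibble, not a vector).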

dt_sort <- dplyr::arrange(dt, Sex)
head(dt_sort) %>% kable(caption = "Sorted by sex") %>% kable_classic_2(full_width=F, html_font="sans-serif", font_size=10)

Table 2: Sorted by sex
Status	DeathCause	AgeCHD	Sex	AgeStart	Height	Weight	Diastolic	Systolic	MRW	Smoking	AgeDeath	Chol	CholStu	BPStu	WeightStu	SmokingStu
Dead	Other	NA	Female	29	62.50	140	78	124	121	0	55	NA	NA	Normal	Overweight	Non-smoker
Dead	Cancer	NA	Female	41	59.75	194	92	144	183	0	57	181	Desirable	High	Overweight	Non-smoker
Alive	NA	NA	Female	57	62.25	132	90	170	114	10	NA	250	High	High	Overweight	Moderate (6-15)
Alive	NA	NA	Female	39	65.75	158	80	128	123	0	NA	242	High	Normal	Overweight	Non-smoker
Alive	NA	NA	Female	58	61.75	131	92	176	117	0	NA	196	Desirable	High	Overweight	Non-smoker
Alive	NA	NA	Female	36	64.75	136	80	112	110	15	NA	196	Desirable	Normal	Overweight	Moderate (6-15)
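
Sorting by more than one key follows the same pattern. The sketch below, on made-up data (toy is hypothetical), sorts by Sex in ascending order and then by AgeStart in descending order, once with dplyr::arrange and once with base order, so the two results can be compared.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  Sex      = c("Male", "Female", "Female", "Male"),
  AgeStart = c(42, 29, 57, 39)
)

# dplyr: ascending Sex, then descending AgeStart within each sex
by_dplyr <- arrange(toy, Sex, desc(AgeStart))

# base R: negate a numeric key to reverse its direction
by_base <- toy[order(toy$Sex, -toy$AgeStart), ]

by_dplyr$AgeStart
by_base$AgeStart
```

Negating the key only works for numeric columns; for a character key in base R, order(..., decreasing = TRUE) reverses all keys at once.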