This article was first published on the WeChat official account 医学和生信笔记; for the best reading experience, view it there.
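Introduction to the forcats package
Today we look at forcats, the tidyverse package dedicated to working with factors. It is one of the core tidyverse packages, so library(tidyverse) attaches it automatically (see the startup message below); it can also be installed and loaded on its own.
Factors are how R represents categorical variables. Unlike a plain character vector, a factor has a fixed set of levels stored in a definite order, and that order can be changed. This matters for statistical modeling, data visualization, and more.
Let's start with a simple example of what factors are good for.
Suppose we want to create a variable of months and sort it in calendar order: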
library(tidyverse)
## -- Attaching packages ----------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts -------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(forcats)
months <- c("Dec","Apr","Jan","Mar")
Calling sort() on it does not give the result we want, because character vectors sort alphabetically:
sort(months)
## [1] "Apr" "Dec" "Jan" "Mar"
We can fix this by turning the variable into a factor:
## First define the order we want, then make the variable follow it
month_levels <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
months1 <- factor(months, levels = month_levels)
months1
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
This solves the sorting problem. It also guards against typos: any value that is not among the levels is converted to NA.
x1 <- c("Apr","Mar","Jan","Dee")
factor(x1, levels = month_levels)
## [1] Apr Mar Jan <NA>
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
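As a quick check of both claims, the following sketch sorts the factor and flags the misspelled value. It uses only base R; the variable names `sorted` and `typos` are introduced here purely for illustration.

```r
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Sorting a factor follows the level order, not the alphabet
months1 <- factor(c("Dec", "Apr", "Jan", "Mar"), levels = month_levels)
sorted <- sort(months1)
as.character(sorted)
## [1] "Jan" "Mar" "Apr" "Dec"

# Values missing from the levels become NA, which makes typos easy to find
x1 <- c("Apr", "Mar", "Jan", "Dee")
typos <- x1[is.na(factor(x1, levels = month_levels))]
typos
## [1] "Dee"
```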
If the order in which the values appear in your vector is already the order you want, either of the following works:
factor(months, levels = unique(months))
## [1] Dec Apr Jan Mar
## Levels: Dec Apr Jan Mar
months %>% factor() %>% fct_inorder()
## [1] Dec Apr Jan Mar
## Levels: Dec Apr Jan Mar
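All functions in the forcats package
According to the forcats website, the package's functions fall into 6 groups (really 5 groups of functions plus one dataset):
Change the order of the levels while keeping the values unchanged. Especially useful for modeling, tables, and visualization:
fct_relevel(): reorder levels by hand
fct_inorder()/fct_infreq()/fct_inseq(): reorder by first appearance, by frequency, or by numeric value
fct_reorder()/fct_reorder2()/last2()/first2(): reorder by the values of another variable
fct_shuffle(): randomly permute the levels
fct_rev(): reverse the levels
fct_shift(): shift the levels to the left or right
Change the values of the levels while keeping the original order (as far as possible):
fct_anon(): anonymise the factor levels
fct_collapse(): collapse levels into manually defined groups
fct_lump()/fct_lump_min()/fct_lump_prop()/fct_lump_n()/fct_lump_lowfreq(): lump infrequent levels into "other"
fct_other(): replace specified levels with "other"
fct_recode(): change level values by hand
fct_relabel(): relabel levels automatically with a function, collapsing where necessary
Add or drop levels:
fct_expand(): add levels
fct_explicit_na(): make missing values explicit
fct_drop(): drop unused levels
fct_unify(): unify the levels across a list of factors
Combine multiple factors:
fct_c(): concatenate factors
fct_cross(): combine factors into a new factor of their crossed levels
Other helpers:
as_factor(): convert to a factor
fct_count(): count the entries per level
fct_match(): test whether levels are present in a factor
fct_unique(): keep one value per level
lvls_reorder()/lvls_revalue()/lvls_expand(): low-level functions for manipulating levels
lvls_union(): find all levels in a list of factors
Dataset:
gss_cat: a sample of categorical variables from the General Social Survey
Each function is described in detail below.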
fct_relevel()
## Create a factor
f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))
f
## [1] a b c d
## Levels: b c d a
## Move c and d to the 1st and 2nd positions
fct_relevel(f, c("c", "d"))
## [1] a b c d
## Levels: c d b a
## Move `a` to the 3rd position (after the 2nd level)
fct_relevel(f, "a", after = 2)
## [1] a b c d
## Levels: b c a d
## Move `a` to the last position
fct_relevel(f, "a", after = Inf)
## [1] a b c d
## Levels: b c d a
## Reorder the levels with a function
fct_relevel(f, sort)
## [1] a b c d
## Levels: a b c d
## Note: the function is applied to the levels, i.e. `sort(levels(f))`, not to the values as in `sort(f)`
## Random order (the result changes from run to run)
fct_relevel(f, sample)
## [1] a b c d
## Levels: a b c d
## Reverse the order
fct_relevel(f, rev)
## [1] a b c d
## Levels: a d c b
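The next example looks complicated but is not. It uses the built-in dataset gss_cat, keeping only 2 of its columns; the goal is to move Don't know to the end of the levels in each column.
## First look at the original levels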
df <- forcats::gss_cat[, c("rincome", "denom")]
lapply(df, levels) # apply `levels()` to every column of df
## $rincome
## [1] "No answer" "Don't know" "Refused" "$25000 or more"
## [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
## [9] "$7000 to 7999" "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
## [13] "$3000 to 3999" "$1000 to 2999" "Lt $1000" "Not applicable"
##
## $denom
## [1] "No answer" "Don't know" "No denomination"
## [4] "Other" "Episcopal" "Presbyterian-dk wh"
## [7] "Presbyterian, merged" "Other presbyterian" "United pres ch in us"
## [10] "Presbyterian c in us" "Lutheran-dk which" "Evangelical luth"
## [13] "Other lutheran" "Wi evan luth synod" "Lutheran-mo synod"
## [16] "Luth ch in america" "Am lutheran" "Methodist-dk which"
## [19] "Other methodist" "United methodist" "Afr meth ep zion"
## [22] "Afr meth episcopal" "Baptist-dk which" "Other baptists"
## [25] "Southern baptist" "Nat bapt conv usa" "Nat bapt conv of am"
## [28] "Am bapt ch in usa" "Am baptist asso" "Not applicable"
Each column has a Don't know level. We want to move it to the end, and get some practice with lapply along the way.
# Apply `fct_relevel(..., "Don't know", after = Inf)` to every column of df
df2 <- lapply(df, fct_relevel, "Don't know", after = Inf)
lapply(df2, levels) # "Don't know" is now last in both columns
## $rincome
## [1] "No answer" "Refused" "$25000 or more" "$20000 - 24999"
## [5] "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" "$7000 to 7999"
## [9] "$6000 to 6999" "$5000 to 5999" "$4000 to 4999" "$3000 to 3999"
## [13] "$1000 to 2999" "Lt $1000" "Not applicable" "Don't know"
##
## $denom
## [1] "No answer" "No denomination" "Other"
## [4] "Episcopal" "Presbyterian-dk wh" "Presbyterian, merged"
## [7] "Other presbyterian" "United pres ch in us" "Presbyterian c in us"
## [10] "Lutheran-dk which" "Evangelical luth" "Other lutheran"
## [13] "Wi evan luth synod" "Lutheran-mo synod" "Luth ch in america"
## [16] "Am lutheran" "Methodist-dk which" "Other methodist"
## [19] "United methodist" "Afr meth ep zion" "Afr meth episcopal"
## [22] "Baptist-dk which" "Other baptists" "Southern baptist"
## [25] "Nat bapt conv usa" "Nat bapt conv of am" "Am bapt ch in usa"
## [28] "Am baptist asso" "Not applicable" "Don't know"
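Note that lapply() returns a plain list, so df2 above is no longer a data frame. If you want to keep the tibble, one possible alternative is sketched below. It assumes dplyr 1.0.0 or later (for across()), and the name `df3` is made up for this illustration.

```r
library(dplyr)
library(forcats)

df <- forcats::gss_cat[, c("rincome", "denom")]

# Apply the same fct_relevel() call to every column, keeping the tibble
df3 <- df %>%
  mutate(across(everything(), ~ fct_relevel(.x, "Don't know", after = Inf)))

# Still a data frame, and the last level of each column is "Don't know"
is.data.frame(df3)
last_levels <- sapply(df3, function(col) tail(levels(col), 1))
last_levels
```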
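If a requested level does not exist, fct_relevel() issues a warning (not an error) and leaves the levels unchanged: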
fct_relevel(f, "e")
## Warning: Unknown levels in `f`: e
## [1] a b c d
## Levels: b c d a
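If the levels to move come from user input or another dataset, one way to avoid the warning is to keep only the levels that actually exist before calling fct_relevel(). This is just a sketch; `wanted`, `present`, and `f2` are names invented here.

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))

# "e" is not a level of f, "c" is
wanted <- c("e", "c")
present <- intersect(wanted, levels(f))

# Only the existing level is moved to the front, and no warning is raised
f2 <- fct_relevel(f, present)
levels(f2)
## [1] "c" "b" "d" "a"
```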
fct_inorder()/fct_infreq()/fct_inseq()
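These 3 functions belong to the same family and all reorder levels; they differ in the ordering they use:
fct_inorder(): by order of first appearance
fct_infreq(): by the frequency of each level (most frequent first)
fct_inseq(): by the numeric value of the levels
f <- factor(c("b", "b", "a", "c", "c", "c"))
f # alphabetical by default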
## [1] b b a c c c
## Levels: a b c
fct_inorder(f) # by order of first appearance
## [1] b b a c c c
## Levels: b a c
fct_infreq(f) # by frequency, most frequent first
## [1] b b a c c c
## Levels: c b a
f <- factor(1:3, levels = c("3", "2", "1"))
f
## [1] 1 2 3
## Levels: 3 2 1
fct_inseq(f) # numeric order, even though the levels were defined as "3", "2", "1"
## [1] 1 2 3
## Levels: 1 2 3
Here is an example that is very handy when plotting.
Suppose you draw the following plot:
library(ggplot2)
ggplot(starwars, aes(x = hair_color)) +
geom_bar() +
coord_flip()
It is not quite what you want: you would like the bars ordered by their counts. You can either fix the order before plotting, or do it inline like this:
ggplot(starwars, aes(x = fct_infreq(hair_color))) +
geom_bar() +
coord_flip()
Problem solved!
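With coord_flip(), the first level is drawn at the bottom of the axis, so the most frequent category ends up at the bottom of the chart. If you would rather have it at the top, one option is to wrap the call in fct_rev(). A minimal sketch on a toy factor (not the starwars data):

```r
library(forcats)

f <- factor(c("b", "b", "a", "c", "c", "c"))

# Most frequent first
by_freq <- fct_infreq(f)
levels(by_freq)
## [1] "c" "b" "a"

# Least frequent first, so the most frequent bar is drawn last (on top after flipping)
by_freq_rev <- fct_rev(fct_infreq(f))
levels(by_freq_rev)
## [1] "a" "b" "c"
```

In the plot above this would correspond to aes(x = fct_rev(fct_infreq(hair_color))).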
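fct_reorder()/fct_reorder2()/last2()/first2()
fct_reorder() is useful for 1d displays where the factor is mapped to position; fct_reorder2() is for 2d displays where the factor is mapped to a non-position aesthetic. last2() and first2() are helpers for fct_reorder2(): last2() finds the last value of y when sorted by x, and first2() finds the first value.
## Create a small tibble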
df <- tibble::tribble(
~color, ~a, ~b,
"blue", 1, 2,
"green", 6, 2,
"purple", 3, 3,
"red", 2, 3,
"yellow", 5, 1
)
## Inspect the level order of the color column
df$color <- factor(df$color)
df$color
## [1] blue green purple red yellow
## Levels: blue green purple red yellow
Reorder the levels of color by the values in column a, from smallest to largest. The levels of color have changed:
fct_reorder(df$color, df$a, min)
## [1] blue green purple red yellow
## Levels: blue red purple yellow green
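In the example above each level has a single value, so the summary function makes no difference. When a level has several values, the third argument (`.fun`, which defaults to median) decides the order. The toy vectors `g` and `v` below are made up to show this:

```r
library(forcats)

g <- factor(c("a", "a", "b", "b"))
v <- c(1, 10, 4, 5)

# Default summary is the median: a = 5.5, b = 4.5, so b comes first
by_median <- levels(fct_reorder(g, v))
by_median
## [1] "b" "a"

# With min as the summary: a = 1, b = 4, so a comes first
by_min <- levels(fct_reorder(g, v, .fun = min))
by_min
## [1] "a" "b"

# .desc = TRUE reverses the direction (largest median first)
by_median_desc <- levels(fct_reorder(g, v, .desc = TRUE))
by_median_desc
## [1] "a" "b"
```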
A small plotting example with fct_reorder():
boxplot(Sepal.Width ~ Species, data = iris)
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width), data = iris)
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width, .desc = TRUE), data = iris)
fct_reorder2(df$color, df$a, df$b)
## [1] blue green purple red yellow
## Levels: purple red blue green yellow
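To see where that order comes from: by default fct_reorder2() uses last2() as its summary function and sorts in descending order, so each level is ranked by its y value at the largest x. A small sketch with two points per group (the data frame `d` is invented for illustration):

```r
library(forcats)

d <- data.frame(
  g = factor(c("u", "u", "v", "v")),
  x = c(1, 2, 1, 2),
  y = c(5, 1, 2, 9)
)

# At the largest x (x = 2): u has y = 1, v has y = 9.
# Descending order of those values puts v before u.
lv <- levels(fct_reorder2(d$g, d$x, d$y))
lv
## [1] "v" "u"

# last2() itself returns the y value at the largest x
last2(c(1, 2), c(5, 1))
## [1] 1
```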
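fct_reorder2() looks complicated, but all you need to remember is that it comes in handy when plotting: it makes the order of the legend match the order of the lines.
Here is a small example: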
chks <- subset(ChickWeight, as.integer(Chick) < 10)
chks <- transform(chks, Chick = fct_shuffle(Chick))
chks
## weight Time Chick Diet
## 85 42 0 8 1
## 86 50 2 8 1
## 87 61 4 8 1
## 88 71 6 8 1
## 89 84 8 8 1
## 90 93 10 8 1
## 91 110 12 8 1
## 92 116 14 8 1
## 93 126 16 8 1
## 94 134 18 8 1
## 95 125 20 8 1
## 96 42 0 9 1
## 97 51 2 9 1
## 98 59 4 9 1
## 99 68 6 9 1
## 100 85 8 9 1
## 101 96 10 9 1
## 102 90 12 9 1
## 103 92 14 9 1
## 104 93 16 9 1
## 105 100 18 9 1
## 106 100 20 9 1
## 107 98 21 9 1
## 108 41 0 10 1
## 109 44 2 10 1
## 110 52 4 10 1
## 111 63 6 10 1
## 112 74 8 10 1
## 113 81 10 10 1
## 114 89 12 10 1
## 115 96 14 10 1
## 116 101 16 10 1
## 117 112 18 10 1
## 118 120 20 10 1
## 119 124 21 10 1
## 144 41 0 13 1
## 145 48 2 13 1
## 146 53 4 13 1
## 147 60 6 13 1
## 148 65 8 13 1
## 149 67 10 13 1
## 150 71 12 13 1
## 151 70 14 13 1
## 152 71 16 13 1
## 153 81 18 13 1
## 154 91 20 13 1
## 155 96 21 13 1
## 168 41 0 15 1
## 169 49 2 15 1
## 170 56 4 15 1
## 171 64 6 15 1
## 172 68 8 15 1
## 173 68 10 15 1
## 174 67 12 15 1
## 175 68 14 15 1
## 176 41 0 16 1
## 177 45 2 16 1
## 178 49 4 16 1
## 179 51 6 16 1
## 180 57 8 16 1
## 181 51 10 16 1
## 182 54 12 16 1
## 183 42 0 17 1
## 184 51 2 17 1
## 185 61 4 17 1
## 186 72 6 17 1
## 187 83 8 17 1
## 188 89 10 17 1
## 189 98 12 17 1
## 190 103 14 17 1
## 191 113 16 17 1
## 192 123 18 17 1
## 193 133 20 17 1
## 194 142 21 17 1
## 195 39 0 18 1
## 196 35 2 18 1
## 209 41 0 20 1
## 210 47 2 20 1
## 211 54 4 20 1
## 212 58 6 20 1
## 213 65 8 20 1
## 214 73 10 20 1
## 215 77 12 20 1
## 216 89 14 20 1
## 217 98 16 20 1
## 218 107 18 20 1
## 219 115 20 20 1
## 220 117 21 20 1
ggplot(chks, aes(Time, weight, colour = Chick)) +
geom_point() +
geom_line()
# The legend order now matches the order of the lines
ggplot(chks, aes(Time, weight, colour = fct_reorder2(Chick, Time, weight))) +
geom_point() +
geom_line() +
labs(colour = "Chick")
fct_shuffle()
Randomly permute the levels, scrambling their order completely
f <- factor(c("a", "b", "c"))
f
## [1] a b c
## Levels: a b c
set.seed(111)
fct_shuffle(f) # a different order on every run unless a seed is set
## [1] a b c
## Levels: b a c
fct_rev()
Reverse the order of the levels
f <- factor(c("a", "b", "c"))
f
## [1] a b c
## Levels: a b c
fct_rev(f)
## [1] a b c
## Levels: c b a
fct_shift()
Shift the levels to the left or right; by default they shift left by one position
x <- factor(
c("Mon", "Tue", "Wed"),
levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
ordered = TRUE
)
x
## [1] Mon Tue Wed
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
fct_shift(x)
## [1] Mon Tue Wed
## Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun
fct_shift(x, 2)
## [1] Mon Tue Wed
## Levels: Tue < Wed < Thu < Fri < Sat < Sun < Mon
fct_shift(x, -1)
## [1] Mon Tue Wed
## Levels: Sat < Sun < Mon < Tue < Wed < Thu < Fri
fct_anon()
Replace the factor levels with arbitrary numeric identifiers. Neither the values nor the order of the levels is preserved
gss_cat$relig %>% fct_count()
## # A tibble: 16 x 2
## f n
## <fct> <int>
## 1 No answer 93
## 2 Don't know 15
## 3 Inter-nondenominational 109
## 4 Native american 23
## 5 Christian 689
## 6 Orthodox-christian 95
## 7 Moslem/islam 104
## 8 Other eastern 32
## 9 Hinduism 71
## 10 Buddhism 147
## 11 Other 224
## 12 None 3523
## 13 Jewish 388
## 14 Catholic 5124
## 15 Protestant 10846
## 16 Not applicable 0
gss_cat$relig %>% fct_anon() %>% fct_count()
## # A tibble: 16 x 2
## f n
## <fct> <int>
## 1 01 32
## 2 02 224
## 3 03 93
## 4 04 3523
## 5 05 689
## 6 06 5124
## 7 07 10846
## 8 08 104
## 9 09 109
## 10 10 147
## 11 11 23
## 12 12 71
## 13 13 388
## 14 14 0
## 15 15 15
## 16 16 95
gss_cat$relig %>% fct_anon("X") %>% fct_count()
## # A tibble: 16 x 2
## f n
## <fct> <int>
## 1 X01 109
## 2 X02 5124
## 3 X03 224
## 4 X04 3523
## 5 X05 95
## 6 X06 0
## 7 X07 689
## 8 X08 93
## 9 X09 32
## 10 X10 147
## 11 X11 15
## 12 X12 71
## 13 X13 388
## 14 X14 104
## 15 X15 23
## 16 X16 10846
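The two runs above assign different identifiers to the same level because the mapping is random. What does survive anonymisation is the set of counts. A small sketch on a toy factor (the names `f` and `anon` are illustrative):

```r
library(forcats)

set.seed(1)  # only needed to make the random mapping reproducible
f <- factor(c("x", "y", "y", "z", "z", "z"))
anon <- fct_anon(f)

# The original labels are gone...
any(levels(anon) %in% levels(f))
## [1] FALSE

# ...but the multiset of counts is unchanged
sort(as.integer(table(anon)))
## [1] 1 2 3
```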
fct_collapse()
Put simply, it lets you group factor levels.
fct_count(gss_cat$partyid)
## # A tibble: 10 x 2
## f n
## <fct> <int>
## 1 No answer 154
## 2 Don't know 1
## 3 Other party 393
## 4 Strong republican 2314
## 5 Not str republican 3032
## 6 Ind,near rep 1791
## 7 Independent 4119
## 8 Ind,near dem 2499
## 9 Not str democrat 3690
## 10 Strong democrat 3490
There are 10 rows, that is, 10 levels. We can now merge these 10 levels into manually defined groups:
partyid2 <- fct_collapse(gss_cat$partyid,
missing = c("No answer", "Don't know"),
rep = c("Strong republican", "Not str republican"),
other = "Other party",
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)
fct_count(partyid2)
## # A tibble: 5 x 2
## f n
## <fct> <int>
## 1 missing 155
## 2 other 393
## 3 rep 5346
## 4 ind 8409
## 5 dem 7180
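You do not have to list every level by hand. As a sketch, assuming forcats 0.5.0 or later (where, to my knowledge, fct_collapse() gained the `other_level` argument), any level not mentioned can be swept into a catch-all group:

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"))

# Group a and b explicitly; everything else falls into "rest"
g <- fct_collapse(f, ab = c("a", "b"), other_level = "rest")
as.character(g)
## [1] "ab"   "ab"   "rest" "rest"
```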
fct_lump()
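This is a family of functions that lump levels meeting some criterion into a single group. In machine learning and statistical modeling you often need to turn low-frequency groups into an "other" group. pandas in Python makes this easy, and so does R.
fct_lump_min(): lump levels that appear fewer than a given number of times.
fct_lump_prop(): lump levels that appear in less than a given proportion of the data.
fct_lump_n(): keep the n most frequent levels and lump the rest (if n < 0, keep the -n least frequent levels instead).
fct_lump_lowfreq(): lump together the least frequent levels, ensuring that "other" is still the smallest level.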
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
x %>% table()
## .
## A B C D E F G H I
## 40 10 5 27 1 1 1 1 1
Keep the 3 most frequent levels and lump the rest
x %>% fct_lump_n(3) %>% table() # ties.method = c("min", "average", "first", "last", "random", "max")
## .
## A B D Other
## 40 10 27 10
Keep the 3 least frequent levels and lump the rest (the five levels tied at a count of 1 are all kept)
x %>% fct_lump_n(-3) %>% table()
## .
## E F G H I Other
## 1 1 1 1 1 82
Lump levels whose proportion is below 0.1
x %>% fct_lump_prop(0.1) %>% table()
## .
## A B D Other
## 40 10 27 10
Lump levels that appear fewer than 2 times
x %>% fct_lump_min(2, other_level = "其他") %>% table()
## .
## A B C D 其他
## 40 10 5 27 5
Lump the low-frequency levels while ensuring that the "other" group remains the smallest
x %>% fct_lump_lowfreq() %>% table()
## .
## A D Other
## 40 27 20
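To make the lumping rule concrete, here is a hand-rolled version of fct_lump_min() built from table() and fct_other(). It is only a sketch of the idea, not how forcats implements it; `counts`, `rare`, and `manual` are names invented here.

```r
library(forcats)

x <- factor(rep(c("A", "B", "C"), times = c(5, 2, 1)))

# Find the levels that occur fewer than 2 times
counts <- table(x)
rare <- names(counts)[counts < 2]
rare
## [1] "C"

# Replace them with "Other" by hand and compare with fct_lump_min()
manual <- fct_other(x, drop = rare)
lumped <- fct_lump_min(x, 2)

identical(as.character(manual), as.character(lumped))
## [1] TRUE
```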
fct_other()
Replace certain levels with "other", similar to fct_lump
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
# Keep A and B, lump everything else into one group
fct_other(x, keep = c("A", "B"), other_level = "other")
## [1] A A A A A A A A A A A A
## [13] A A A A A A A A A A A A
## [25] A A A A A A A A A A A A
## [37] A A A A B B B B B B B B
## [49] B B other other other other other other other other other other
## [61] other other other other other other other other other other other other
## [73] other other other other other other other other other other other other
## [85] other other other
## Levels: A B other
# Lump A and B into one group, keep everything else
fct_other(x, drop = c("A", "B"), other_level = "hhahah")
## [1] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [11] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [21] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [31] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [41] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [51] C C C C C D D D D D
## [61] D D D D D D D D D D
## [71] D D D D D D D D D D
## [81] D D E F G H I
## Levels: C D E F G H I hhahah
fct_recode()
Change the factor levels by hand
x <- factor(c("apple", "bear", "banana", "dear"))
x
## [1] apple bear banana dear
## Levels: apple banana bear dear
fct_recode(x, fruit = "apple", fruit = "banana")
## [1] fruit bear fruit dear
## Levels: fruit bear dear
fct_recode(x, NULL = "apple", fruit = "banana")
## [1] <NA> bear fruit dear
## Levels: fruit bear dear
fct_recode(x, "an apple" = "apple", "a bear" = "bear")
## [1] an apple a bear banana dear
## Levels: an apple banana a bear dear
x <- factor(c("apple", "bear", "banana", "dear"))
levels <- c(fruit = "apple", fruit = "banana")
fct_recode(x, !!!levels)
## [1] fruit bear fruit dear
## Levels: fruit bear dear
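The `!!!` splice in the last example is what lets you build the recoding programmatically: fct_recode() expects new = "old" pairs, so a named character vector whose names are the new labels and whose values are the old ones can be spliced in. A sketch, with `old`, `new`, and `lookup` as made-up names (for instance, they might come from two columns of a lookup table):

```r
library(forcats)

x <- factor(c("apple", "bear", "banana", "dear"))

old <- c("apple", "banana")
new <- c("fruit", "fruit")

# Names are the new labels, values are the existing levels
lookup <- setNames(old, new)

y <- fct_recode(x, !!!lookup)
as.character(y)
## [1] "fruit" "bear"  "fruit" "dear"
```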
fct_relabel()
Relabel factor levels automatically with a function, collapsing where necessary
gss_cat$partyid %>% fct_count()
## # A tibble: 10 x 2
## f n
## <fct> <int>
## 1 No answer 154
## 2 Don't know 1
## 3 Other party 393
## 4 Strong republican 2314
## 5 Not str republican 3032
## 6 Ind,near rep 1791
## 7 Independent 4119
## 8 Ind,near dem 2499
## 9 Not str democrat 3690
## 10 Strong democrat 3490
gss_cat$partyid %>% fct_relabel(~ gsub(",", ", ", .x)) %>% fct_count()
## # A tibble: 10 x 2
## f n
## <fct> <int>
## 1 No answer 154
## 2 Don't know 1
## 3 Other party 393
## 4 Strong republican 2314
## 5 Not str republican 3032
## 6 Ind, near rep 1791
## 7 Independent 4119
## 8 Ind, near dem 2499
## 9 Not str democrat 3690
## 10 Strong democrat 3490
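The "collapsing where necessary" part shows up when the relabelling function maps several old levels to the same new label. A small sketch on a toy factor (the labels are made up):

```r
library(forcats)

f <- factor(c("a1", "a2", "b1"))

# Keep only the first character; "a1" and "a2" both become "a" and are merged
g <- fct_relabel(f, ~ substr(.x, 1, 1))
as.character(g)
## [1] "a" "a" "b"
levels(g)
## [1] "a" "b"
```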
fct_expand()
Add levels to a factor
f <- factor(sample(letters[1:3], 20, replace = TRUE))
f
## [1] c b b a a a b a b b b a a c a c a a a b
## Levels: a b c
fct_expand(f, "d", "e", "f")
## [1] c b b a a a b a b b b a a c a c a a a b
## Levels: a b c d e f
fct_drop()
Drop unused levels
f <- factor(c("a", "b"), levels = c("a", "b", "c"))
f
## [1] a b
## Levels: a b c
fct_drop(f, "c")
## [1] a b
## Levels: a b
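The second argument of fct_drop() is `only`, which restricts which unused levels may be dropped. With a single unused level it makes no visible difference, so here is a sketch with two unused levels:

```r
library(forcats)

f <- factor(c("a", "b"), levels = c("a", "b", "c", "d"))

# Drop every unused level
all_dropped <- levels(fct_drop(f))
all_dropped
## [1] "a" "b"

# Drop only "c"; the unused level "d" is kept
only_c <- levels(fct_drop(f, only = "c"))
only_c
## [1] "a" "b" "d"
```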
fct_explicit_na()
Give NA its own explicit level so that it shows up in plots and summaries
f1 <- factor(c("a", "a", NA, NA, "a", "b", NA, "c", "a", "c", "b"))
fct_count(f1)
## # A tibble: 4 x 2
## <fct> <int>
##
## 1 a 4
## 2 b 2
## 3 c 2
## 4 <NA> 3
f2 <- fct_explicit_na(f1, na_level = "missing")
fct_count(f2)
## # A tibble: 4 x 2
## <fct> <int>
##
## 1 a 4
## 2 b 2
## 3 c 2
## 4 missing 3
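This article was written against forcats 0.5.1. As far as I know, forcats 1.0.0 and later deprecate fct_explicit_na() in favour of fct_na_value_to_level(). A version-tolerant sketch (the version check is my own addition, not something from the package documentation):

```r
library(forcats)

f1 <- factor(c("a", "a", NA, NA, "a", "b", NA, "c", "a", "c", "b"))

# Pick whichever function the installed forcats version provides
if (utils::packageVersion("forcats") >= "1.0.0") {
  f2 <- fct_na_value_to_level(f1, level = "missing")
} else {
  f2 <- fct_explicit_na(f1, na_level = "missing")
}

# NA is now an ordinary level, so table() counts it
table(f2)
```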
fct_unify()
This function operates on a list of factors and unifies the levels across the list
fs <- list(factor("a"),
factor("b"),
factor(c("a", "b")))
fct_unify(fs, levels = c("a", "b", "c"))
## [[1]]
## [1] a
## Levels: a b c
##
## [[2]]
## [1] b
## Levels: a b c
##
## [[3]]
## [1] a b
## Levels: a b c
fct_c()
Concatenate factors
fa <- factor("a")
fb <- factor("b")
fab <- factor(c("a", "b"))
c(fa, fb, fab)
## [1] a b a b
## Levels: a b
fct_c(fa, fb, fab)
## [1] a b a b
## Levels: a b
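The two outputs above agree because, from R 4.1.0 on, base c() also combines factors and their levels; in earlier R versions c() on factors returns the underlying integer codes, which is the situation fct_c() was designed for. A small sketch of how fct_c() builds the combined levels (toy factors, names invented here):

```r
library(forcats)

# The combined levels are the union of the inputs' levels, in order of appearance
combined <- fct_c(factor("b"), factor("a"), factor(c("a", "c")))
as.character(combined)
## [1] "b" "a" "a" "c"
levels(combined)
## [1] "b" "a" "c"
```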
fct_cross()
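Combine factors into a new factor whose levels are the crossed combinations, rather than simply joining the vectors end to end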
fruit <- factor(c("apple", "kiwi", "apple", "apple"))
colour <- factor(c("green", "green", "red", "green"))
eaten <- c("yes", "no", "yes", "no")
fct_cross(fruit, colour)
## [1] apple:green kiwi:green apple:red apple:green
## Levels: apple:green kiwi:green apple:red
fct_cross(fruit, colour, eaten)
## [1] apple:green:yes kiwi:green:no apple:red:yes apple:green:no
## Levels: apple:green:no kiwi:green:no apple:green:yes apple:red:yes
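fct_cross() has two further arguments worth knowing: `sep` changes the separator, and `keep_empty = TRUE` keeps combinations that never occur in the data. A sketch reusing the vectors above:

```r
library(forcats)

fruit <- factor(c("apple", "kiwi", "apple", "apple"))
colour <- factor(c("green", "green", "red", "green"))

# Custom separator
underscored <- fct_cross(fruit, colour, sep = "_")
as.character(underscored)[1]
## [1] "apple_green"

# By default only observed combinations become levels (3 here);
# keep_empty = TRUE keeps all 2 x 2 = 4, including the unobserved kiwi:red
nlevels(fct_cross(fruit, colour))
## [1] 3
with_empty <- fct_cross(fruit, colour, keep_empty = TRUE)
nlevels(with_empty)
## [1] 4
```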
as_factor()
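Convert to a factor. It does the same job as as.factor(), with one difference: for character input the levels keep their order of first appearance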
x <- c("a", "z", "g")
as.factor(x) # levels are sorted alphabetically
## [1] a z g
## Levels: a g z
as_factor(x) # levels keep the original order
## [1] a z g
## Levels: a z g
fct_count()
Count the number of entries for each level
f <- factor(sample(letters)[rpois(1000, 10)])
table(f)
## f
## a b d e g h i j k l m n o q r t u v x y
## 13 2 17 1 13 47 10 1 106 28 132 21 97 51 99 43 3 128 1 63
## z
## 124
fct_count(f, sort = TRUE, prop = TRUE) # count, sort by count, and compute proportions
## # A tibble: 21 x 3
## f n p
## <fct> <int> <dbl>
## 1 m 132 0.132
## 2 v 128 0.128
## 3 z 124 0.124
## 4 k 106 0.106
## 5 r 99 0.099
## 6 o 97 0.097
## 7 y 63 0.063
## 8 q 51 0.051
## 9 h 47 0.047
## 10 t 43 0.043
## # ... with 11 more rows
fct_match()
Test whether the specified levels are present; returns a logical vector marking the matching elements
table(fct_match(gss_cat$marital, c("Married", "Divorced")))
##
## FALSE TRUE
## 7983 13500
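The result looks like what `%in%` would give, but there is one safety feature: fct_match() raises an error when a requested level does not exist in the factor, whereas `%in%` silently returns all FALSE. A sketch with a deliberately misspelled level (toy data, not gss_cat):

```r
library(forcats)

f <- factor(c("Married", "Divorced", "Married"))

# %in% does not notice the typo
f %in% "Maried"
## [1] FALSE FALSE FALSE

# fct_match() refuses; here the error is caught and turned into a marker string
res <- tryCatch(fct_match(f, "Maried"), error = function(e) "error")
res
## [1] "error"
```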
fct_unique()
Keep one value per level
f <- factor(letters[rpois(100, 10)])
unique(f)
## [1] i o j n p l k h q e f a m d g b
## Levels: a b d e f g h i j k l m n o p q
fct_unique(f)
## [1] a b d e f g h i j k l m n o p q
## Levels: a b d e f g h i j k l m n o p q
lvls_reorder()/lvls_revalue()/lvls_expand()
Low-level functions for reordering, renaming, and expanding levels
f <- factor(c("a", "b", "c"))
lvls_reorder(f, 3:1)
## [1] a b c
## Levels: c b a
lvls_revalue(f, c("apple", "banana", "carrot"))
## [1] apple banana carrot
## Levels: apple banana carrot
lvls_expand(f, c("a", "b", "c", "d"))
## [1] a b c
## Levels: a b c d
lvls_union()
Operates on a list of factors and returns the union of their levels
fs <- list(factor("a"), factor("b"), factor(c("a", "b")))
lvls_union(fs)
## [1] "a" "b"
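lvls_union() is also what fct_unify() falls back on when no `levels` argument is given, so the two fit together naturally. A short sketch (the name `unified` is illustrative):

```r
library(forcats)

fs <- list(factor("a"), factor("b"), factor(c("a", "b")))

# Without an explicit `levels`, every factor gets lvls_union(fs)
unified <- fct_unify(fs)
lapply(unified, levels)

all_same <- all(sapply(unified, function(f) identical(levels(f), lvls_union(fs))))
all_same
## [1] TRUE
```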
That covers the whole forcats package. I hope you find it useful; questions and discussion are welcome.