数据处理是一个不断积累反复的训练过程,本文所涉及的命令既有 Stata 系统自带的,也有外部用户编写的,不求面面俱到,但求有用有效。热烈欢迎对此文有兴趣的学友批评指正文内错误、推荐好用命令,版主会第一时间修改补充!
最新版本:
20240804
命令数量:
28
初始版本:
20240731
1. browse if
:精准浏览数据集
代码语言:js复制sysuse auto, clear
browse if substr(make, 1, 1) == "A" //只显示 make 中首字母为 A 的观测值
browse //浏览窗口恢复至整个数据
2. fre
:同时显示类别变量的标签及其取值
代码语言:js复制ssc install fre
sysuse auto, clear
tab foreign
/*
Car origin | Freq. Percent Cum.
------------ -----------------------------------
Domestic | 52 70.27 70.27
Foreign | 22 29.73 100.00
------------ -----------------------------------
Total | 74 100.00 */
fre foreign
/*foreign -- Car origin
----------------------------------------------------------------
| Freq. Percent Valid Cum.
------------------- --------------------------------------------
Valid 0 Domestic | 52 70.27 70.27 70.27
1 Foreign | 22 29.73 29.73 100.00
Total | 74 100.00 100.00
---------------------------------------------------------------- */
3. mdesc
:查看变量缺失值及其占比
代码语言:js复制ssc install mdesc
sysuse lifeexp, clear
help mdesc
mdesc //所有变量
/*
Variable | Missing Total Percent Missing
---------------- -----------------------------------------------
region | 0 68 0.00
country | 0 68 0.00
popgrowth | 0 68 0.00
lexp | 0 68 0.00
gnppc | 5 68 7.35
safewater | 28 68 41.18
---------------- ----------------------------------------------- */
mdssc safewater //指定变量
4. ereplace
:一步实现变量数值替换
代码语言:js复制ssc install ereplace
sysuse auto, clear
ereplace mpg = max(mpg) //相比于 egen 而言便捷许多
5. distinct
:查看变量非重复取值数量
代码语言:js复制sysuse lifeexp, clear
distinct
distinct, missing
distinct safewater
/*
| Observations
| total distinct
----------- ----------------------
safewater | 40 29 */
distinct safewater, missing //缺失值被算作一类特殊“数值”
/*
| Observations
| total distinct
----------- ----------------------
safewater | 68 30 */
6. bysort var1(var2): keep if _n= 1
:保留组内最大值
代码语言:js复制sysuse census, clear
*保留各区域内人口数最小的观测值
bysort region(pop): gen num = _n //region(pop) 表示 pop 在 region 内排序
keep if num ==1
drop num
*一步实现上述命令
bysort region (pop): keep if _n==1
7. isid
:变量观测值是否唯一
代码语言:js复制sysuse auto, clear
isid foreign //variable foreign does not uniquely identify the observations
isid make
isid foreign make
8. levelsof
:变量不同取值的数量
代码语言:js复制sysuse auto, clear
levelsof mpg
return list
dis r(r) //mpg有21个不同的取值
levelsof rep78
return list
dis r(r) //rep78有5个不同的取值
9. duplicates
:处理重复的观测值
代码语言:txt复制sysuse auto, clear
help duplicates
duplicates report price
duplicates report rep78
duplicates example rep78
duplicates drop rep78, force
10. trim()
:消除字符形变量中的空格
代码语言:js复制clear all
input str12 str
"String A "
" String B "
" String C"
end
replace str = trim(str)
11. 生成新变量的同时控制其位置
代码语言:js复制sysuse auto, clear
gen price_k = price/1000, before(price) //price_k 位于 price 之前
*字符变量小写
gen make_lower = lower(make), after(make) //make_lower 位于 make 之后
12. 提取字符型变量的元素
代码语言:js复制sysuse auto, clear
gen first = substr(make, 1, 2), after(make) //提取 make 变量中的前 2 个字母
13. _n 和 _N
代码语言:js复制sysuse auto, clear
keep rep78
sort rep78
gen n = _n //观测值序号
bys rep78: gen group_n = _n //按组分割后的观测值序号
gen N = _N //观测值总数
bys rep78: gen group_N = _N //按组分割后的观测值总数
list in 1/15, sepby(rep78)
/*
-------------------------------------
| rep78 n group_n N group_N |
|-------------------------------------|
1. | 1 1 1 74 2 |
2. | 1 2 2 74 2 |
|-------------------------------------|
3. | 2 3 1 74 8 |
4. | 2 4 2 74 8 |
5. | 2 5 3 74 8 |
6. | 2 6 4 74 8 |
7. | 2 7 5 74 8 |
8. | 2 8 6 74 8 |
9. | 2 9 7 74 8 |
10. | 2 10 8 74 8 |
|-------------------------------------|
11. | 3 11 1 74 30 |
12. | 3 12 2 74 30 |
13. | 3 13 3 74 30 |
14. | 3 14 4 74 30 |
15. | 3 15 5 74 30 |
------------------------------------- */
14. 使用循环批量修改变量名称
代码语言:js复制sysuse census, clear
*全部变量
foreach var of varlist _all {
rename `var' `var'_new
}
*指定变量
foreach var of varlist state state2 region pop {
rename `var' `var'_new
}
15. total()
:观测值总数
代码语言:js复制sysuse auto, clear
gen one = 1, before(make) //before() 选项在第10条中有介绍
egen total_all = total(one) //样本观测值总数
bysort rep78: egen total_rep78 = total(one) //按照rep78分类的子样本观测值数量
order total_all total_rep78 rep78
list total_all total_rep78 rep78 in 1/15, sepby(rep78)
/*
-----------------------------
| total_~l total~78 rep78 |
|-----------------------------|
1. | 74 2 1 |
2. | 74 2 1 |
|-----------------------------|
3. | 74 8 2 |
4. | 74 8 2 |
5. | 74 8 2 |
6. | 74 8 2 |
7. | 74 8 2 |
8. | 74 8 2 |
9. | 74 8 2 |
10. | 74 8 2 |
|-----------------------------|
11. | 74 30 3 |
12. | 74 30 3 |
13. | 74 30 3 |
14. | 74 30 3 |
15. | 74 30 3 |
----------------------------- */
16. xtile
:生成变量的分位数类别
代码语言:js复制sysuse auto, clear
keep price
*四分位数类别
xtile price_quartile = price, nq(4)
*十分位数类别
xtile price_decile = price, nq(10)
*第 99th 分位数取值
egen high_price = pctile(price), p(99)
*第 50th 分位数取值
egen median_price = pctile(price), p(50)
list in 1/10, sep(0)
/*
----------------------------------------------------
| price pri~tile pri~cile high_p~e median~e |
|----------------------------------------------------|
1. | 4,099 1 2 15906 5006.5 |
2. | 4,749 2 5 15906 5006.5 |
3. | 3,799 1 1 15906 5006.5 |
4. | 4,816 2 5 15906 5006.5 |
5. | 7,827 4 8 15906 5006.5 |
6. | 5,788 3 7 15906 5006.5 |
7. | 4,453 2 4 15906 5006.5 |
8. | 5,189 3 6 15906 5006.5 |
9. | 10,372 4 9 15906 5006.5 |
10. | 4,082 1 2 15906 5006.5 |
---------------------------------------------------- */
17. asgen
:计算加权平均
代码语言:js复制capture ssc install asgen
sysuse census, clear
keeporder state region medage pop
bys region: asgen medage_m1 = medage //均值
bys region: egen medage_m2 = mean(medage) //与上方命令等价
bys region: asgen medage_w = medage, weight(pop) //加权均值
duplicates drop region, force
list region medage_m1 medage_m2 medage_w
/*
--------------------------------------------
| region medage_m1 medage~2 medage_w |
|--------------------------------------------|
1. | NE 31.233333 31.23333 31.82184 |
2. | N Cntrl 29.525 29.525 29.634891 |
3. | South 29.61875 29.61875 29.799904 |
4. | West 28.284616 28.28462 29.348818 |
-------------------------------------------- */
18. collapse
:进行组内计算
代码语言:js复制sysuse census, clear
gen N = 1, before(region)
sort region
collapse (mean) medage (sum) N, by(region)
list
/*
-----------------------
| region medage N |
|-----------------------|
1. | NE 31.23 9 |
2. | N Cntrl 29.52 12 |
3. | South 29.62 16 |
4. | West 28.28 13 |
----------------------- */
sysuse census, clear
collapse (mean) medage [aweight=pop], by(region) //按 pop 变量加权
/*
------------------
| region medage |
|------------------|
1. | NE 31.82 |
2. | N Cntrl 29.63 |
3. | South 29.80 |
4. | West 29.35 |
------------------ */
19. seq
:生成重复的整数序列
代码语言:js复制capture ssc install seq
sysuse auto, clear
keep rep78
sort rep78
*repeat 1 2 3 4 5
seq rep1, from(1) to(5)
*repeat 1 1 2 2 3 3 4 4 5 5
seq rep2, from(1) to(5) block(2)
*repeat 1 2 3 ...
seq rep3, by(rep78)
list in 1/15, clean sepby(rep78)
/*
rep78 rep1 rep2 rep3
1. 1 1 1 1
2. 1 2 1 2
3. 2 3 2 1
4. 2 4 2 2
5. 2 5 3 3
6. 2 1 3 4
7. 2 2 4 5
8. 2 3 4 6
9. 2 4 5 7
10. 2 5 5 8
11. 3 1 1 1
12. 3 2 1 2
13. 3 3 2 3
14. 3 4 2 4
15. 3 5 3 5 */
20. rowtotal
:变量间求和
代码语言:js复制sysuse auto, clear
keep price mpg rep78
*只要有缺失值相加后仍为缺失值
gen sum1 = price mpg rep78
*rowtatal将缺失值视为 0
egen sum2 = rowtotal(price mpg rep78)
egen sum3 = rowtotal(price-rep78)
list in 1/10, sep(0)
/*
----------------------------------------------
| price mpg rep78 sum1 sum2 sum3 |
|----------------------------------------------|
1. | 4,099 22 3 4124 4124 4124 |
2. | 4,749 17 3 4769 4769 4769 |
3. | 3,799 22 . . 3821 3821 |
4. | 4,816 20 3 4839 4839 4839 |
5. | 7,827 15 4 7846 7846 7846 |
6. | 5,788 18 3 5809 5809 5809 |
7. | 4,453 26 . . 4479 4479 |
8. | 5,189 20 3 5212 5212 5212 |
9. | 10,372 16 3 10391 10391 10391 |
10. | 4,082 19 3 4104 4104 4104 |
---------------------------------------------- */
21. 生成随机数或随机变量
代码语言:js复制sysuse auto, clear
set seed 123
*服从均匀分布
gen r = runiform(), before(make) //0-1之间随机数
gen r1 = runiform(1,10), before(make) //1-10之间随机数
gen r2 = runiformint(1,10), before(make) //1-10之间的整数随机数
*服从正态分布
gen rn = rnormal(), before(make) //标准正态分布(0, 1)
gen rn1 = rnormal(1), before(make) //正态分布(1, 1)
gen rn2 = rnormal(5, 2), before(make) //正态分布(5, 2)
*更多函数:Random-number functions
22. clonevar
:生成完全一致的变量
代码语言:js复制sysuse auto2, clear
keep rep78
gen rep78_2 = rep78
clonevar rep78_3 = rep78
list in 1/10
/*
-----------------------------
| rep78 rep78_2 rep78_3 |
|-----------------------------|
1. | Average 3 Average |
2. | Average 3 Average |
3. | . . . |
4. | Average 3 Average |
5. | Good 4 Good |
|-----------------------------|
6. | Average 3 Average |
7. | . . . |
8. | Average 3 Average |
9. | Average 3 Average |
10. | Average 3 Average |
----------------------------- */
tab rep78,m
tab rep78_2, m
tab rep78_3, m
23. keeporder
:一步实现 keep 和 order
代码语言:js复制 ssc install keeporder
*keep order
sysuse auto, clear
keep foreign rep78 make
order foreign rep78 make
*keeporder
sysuse auto, clear
keeporder foreign rep78 make
24. gsort
:由大到小排序
代码语言:js复制*sort 将变量由小到大排序
sysuse auto, clear
*由小达大两者均可
sort mpg
gsort mpg
**由大到小只能gsort
gsort -mpg
*应用:将变量按类别生成排名
bys foreign (-mpg): gen n=_n //- invalid name (报错无法执行)
gsort foreign -mpg //先排序
by foreign: gen n = _n //再生成
*下面的方式也可等价实现
gen mpg_rev = -mpg
bys foreign (mpg_rev): gen n=_n
drop mpg_rev
25. expand
:按倍数复制观测行
代码语言:js复制sysuse auto, clear
keeporder make mpg
expand 2 //按现有观测行的2倍复制观测行
sort make
26. 生成滞后(lag)与超前(lead)的变量
代码语言:js复制sysuse uslifeexp, clear
keeporder year le
gen lag_le = le[_n-1], after(le)
gen lead_le = le[_n 1], after(lag_le)
/*
--------------------------------
| year le lag_le lead_le |
|--------------------------------|
1. | 1900 47.3 . 49.1 |
2. | 1901 49.1 47.3 51.5 |
3. | 1902 51.5 49.1 50.5 |
4. | 1903 50.5 51.5 47.6 |
5. | 1904 47.6 50.5 48.7 |
6. | 1905 48.7 47.6 48.7 |
7. | 1906 48.7 48.7 47.6 |
8. | 1907 47.6 48.7 51.1 |
9. | 1908 51.1 47.6 52.1 |
10. | 1909 52.1 51.1 50 |
11. | 1910 50 52.1 52.6 |
12. | 1911 52.6 50 53.5 |
13. | 1912 53.5 52.6 52.5 |
14. | 1913 52.5 53.5 54.2 |
15. | 1914 54.2 52.5 54.5 |
-------------------------------- */
27. recode
:将连续变量生成新分类变量
代码语言:js复制webuse fullauto, clear
fre rep77 rep78
recode rep77 rep78 (1 2 = 1 "Below average") ///
(3 = 2 "Average") ///
(4 5 = 3 "Above average"), ///
pre(new) label(newrep)
fre newrep77 newrep78
label list repair newrep
/*
repair:
1 Poor
2 Fair
3 Average
4 Good
5 Excellent
newrep:
1 Below average
2 Average
3 Above average */
28. carryforward
:向下填充
代码语言:js复制clear all
input id time y
1 1 1.2
1 3 2.4
1 4 3.4
1 7 3.2
1 9 2.4
2 3 1.8
2 4 5.6
2 6 4.3
3 2 2.3
3 4 4.5
3 7 6.7
end
list, sepby(id)
tsset id time //设定时间序列数据
tsfill, full
list, sepby(id)
bysort id: carryforward y, gen(y1) //直接向下填充
gsort id -time //以此实现 carry backforward
bysort id: carryforward y1, gen(y2) //反向调整后向下填充
gsort id time
/*
-----------------------------
| id time y y1 y2 |
|-----------------------------|
1. | 1 1 1.2 1.2 1.2 |
2. | 1 2 . 1.2 1.2 |
3. | 1 3 2.4 2.4 2.4 |
4. | 1 4 3.4 3.4 3.4 |
5. | 1 5 . 3.4 3.4 |
6. | 1 6 . 3.4 3.4 |
7. | 1 7 3.2 3.2 3.2 |
8. | 1 8 . 3.2 3.2 |
9. | 1 9 2.4 2.4 2.4 |
|-----------------------------|
10. | 2 1 . . 1.8 |
11. | 2 2 . . 1.8 |
12. | 2 3 1.8 1.8 1.8 |
13. | 2 4 5.6 5.6 5.6 |
14. | 2 5 . 5.6 5.6 |
15. | 2 6 4.3 4.3 4.3 |
16. | 2 7 . 4.3 4.3 |
17. | 2 8 . 4.3 4.3 |
18. | 2 9 . 4.3 4.3 |
|-----------------------------|
19. | 3 1 . . 2.3 |
20. | 3 2 2.3 2.3 2.3 |
21. | 3 3 . 2.3 2.3 |
22. | 3 4 4.5 4.5 4.5 |
23. | 3 5 . 4.5 4.5 |
24. | 3 6 . 4.5 4.5 |
25. | 3 7 6.7 6.7 6.7 |
26. | 3 8 . 6.7 6.7 |
27. | 3 9 . 6.7 6.7 |
----------------------------- */
...... To be revised continuously ......