Stata 数据处理 | 一份常用并实用的命令清单

2024-08-04 13:39:02 浏览数 (1)

数据处理是一个不断积累反复的训练过程,本文所涉及的命令既有 Stata 系统自带的,也有外部用户编写的,不求面面俱到,但求有用有效。热烈欢迎对此文有兴趣的学友批评指正文内错误、推荐好用命令,版主会第一时间修改补充!

最新版本:20240804

命令数量:28

初始版本:20240731


1. browse if:精准浏览数据集

代码语言:js复制
sysuse auto, clear
browse if substr(make, 1, 1) == "A" //只显示 make 中首字母为 A 的观测值
browse //浏览窗口恢复至整个数据

2. fre :同时显示类别变量的标签及其取值

代码语言:js复制
ssc install fre    
sysuse auto, clear
tab foreign 
/*
 Car origin |      Freq.     Percent        Cum.
------------ -----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------ -----------------------------------
      Total |         74      100.00 */

fre foreign

/*foreign -- Car origin
----------------------------------------------------------------
                   |      Freq.    Percent      Valid       Cum.
------------------- --------------------------------------------
Valid   0 Domestic |         52      70.27      70.27      70.27
        1 Foreign  |         22      29.73      29.73     100.00
        Total      |         74     100.00     100.00            
---------------------------------------------------------------- */

3. mdesc:查看变量缺失值及其占比

代码语言:js复制
ssc install mdesc   
sysuse lifeexp, clear
help mdesc
mdesc //所有变量
/*
    Variable    |     Missing          Total     Percent Missing
---------------- -----------------------------------------------
         region |           0             68           0.00
        country |           0             68           0.00
      popgrowth |           0             68           0.00
           lexp |           0             68           0.00
          gnppc |           5             68           7.35
      safewater |          28             68          41.18
---------------- ----------------------------------------------- */

mdssc safewater //指定变量

4. ereplace:一步实现变量数值替换

代码语言:js复制
ssc install ereplace
sysuse auto, clear
ereplace mpg = max(mpg) //相比于 egen 而言便捷许多

5. distinct:查看变量非重复取值数量

代码语言:js复制
sysuse lifeexp, clear
distinct
distinct, missing
distinct safewater 
/*
           |        Observations
           |      total   distinct
----------- ----------------------
 safewater |         40         29 */

distinct safewater, missing  //缺失值被算作一类特殊“数值”
/*
           |        Observations
           |      total   distinct
----------- ----------------------
 safewater |         68         30 */

6. bysort var1(var2): keep if _n= 1:保留组内最大值

代码语言:js复制
sysuse census, clear

*保留各区域内人口数最小的观测值
bysort region(pop): gen num = _n //region(pop) 表示 pop 在 region 内排序
keep if num ==1
drop num

*一步实现上述命令
bysort region (pop): keep if _n==1 

7. isid:变量观测值是否唯一

代码语言:js复制
sysuse auto, clear
isid foreign //variable foreign does not uniquely identify the observations
isid make
isid foreign make

8. levelsof:变量不同取值的数量

代码语言:js复制
sysuse auto, clear

levelsof mpg
return list
dis r(r) //mpg有21个不同的取值

levelsof rep78
return list
dis r(r) //rep78有5个不同的取值

9. duplicates:处理重复的观测值

代码语言:txt复制
sysuse auto, clear 
help duplicates
duplicates report price  
duplicates report rep78
duplicates example rep78
duplicates drop rep78, force

10. trim():消除字符形变量中的空格

代码语言:js复制
clear all 

input str12 str 
"String A " 
" String B " 
" String C" 
end 

replace str = trim(str) 

11. 生成新变量的同时控制其位置

代码语言:js复制
sysuse auto, clear
gen price_k = price/1000, before(price)     //price_k 位于 price 之前
*字符变量小写
gen make_lower = lower(make), after(make)   //make_lower 位于 make 之后

12. 提取字符型变量的元素

代码语言:js复制
sysuse auto, clear 
gen first = substr(make, 1, 2), after(make) //提取 make 变量中的前 2 个字母

13. _n 和 _N

代码语言:js复制
sysuse auto, clear
keep rep78
sort rep78

gen n = _n                   //观测值序号 

bys rep78: gen group_n = _n  //按组分割后的观测值序号

gen N = _N                   //观测值总数        
      
bys rep78: gen group_N = _N  //按组分割后的观测值总数

list in 1/15, sepby(rep78)
/*
      ------------------------------------- 
     | rep78    n   group_n    N   group_N |
     |-------------------------------------|
  1. |     1    1         1   74         2 |
  2. |     1    2         2   74         2 |
     |-------------------------------------|
  3. |     2    3         1   74         8 |
  4. |     2    4         2   74         8 |
  5. |     2    5         3   74         8 |
  6. |     2    6         4   74         8 |
  7. |     2    7         5   74         8 |
  8. |     2    8         6   74         8 |
  9. |     2    9         7   74         8 |
 10. |     2   10         8   74         8 |
     |-------------------------------------|
 11. |     3   11         1   74        30 |
 12. |     3   12         2   74        30 |
 13. |     3   13         3   74        30 |
 14. |     3   14         4   74        30 |
 15. |     3   15         5   74        30 |
      -------------------------------------  */

14. 使用循环批量修改变量名称

代码语言:js复制
sysuse census, clear 

*全部变量
foreach var of varlist _all {     
        rename `var' `var'_new 
        }
        
*指定变量        
foreach var of varlist state state2 region pop { 
        rename `var' `var'_new 
        }

15. total():观测值总数

代码语言:js复制
sysuse auto, clear  
gen one = 1, before(make)     //before() 选项在第10条中有介绍

egen total_all = total(one)   //样本观测值总数

bysort rep78: egen total_rep78 = total(one)  //按照rep78分类的子样本观测值数量

order total_all total_rep78 rep78
list total_all total_rep78 rep78 in 1/15, sepby(rep78)
/*
      ----------------------------- 
     | total_~l   total~78   rep78 |
     |-----------------------------|
  1. |       74          2       1 |
  2. |       74          2       1 |
     |-----------------------------|
  3. |       74          8       2 |
  4. |       74          8       2 |
  5. |       74          8       2 |
  6. |       74          8       2 |
  7. |       74          8       2 |
  8. |       74          8       2 |
  9. |       74          8       2 |
 10. |       74          8       2 |
     |-----------------------------|
 11. |       74         30       3 |
 12. |       74         30       3 |
 13. |       74         30       3 |
 14. |       74         30       3 |
 15. |       74         30       3 |
      -----------------------------  */

16. xtile:生成变量的分位数类别

代码语言:js复制
sysuse auto, clear 
keep price

*四分位数类别
xtile price_quartile = price, nq(4) 

*十分位数类别
xtile price_decile   = price, nq(10)

*第 99th 分位数取值
egen high_price   = pctile(price), p(99)

*第 50th 分位数取值
egen median_price = pctile(price), p(50) 

list in 1/10, sep(0)
/*
      ---------------------------------------------------- 
     |  price   pri~tile   pri~cile   high_p~e   median~e |
     |----------------------------------------------------|
  1. |  4,099          1          2      15906     5006.5 |
  2. |  4,749          2          5      15906     5006.5 |
  3. |  3,799          1          1      15906     5006.5 |
  4. |  4,816          2          5      15906     5006.5 |
  5. |  7,827          4          8      15906     5006.5 |
  6. |  5,788          3          7      15906     5006.5 |
  7. |  4,453          2          4      15906     5006.5 |
  8. |  5,189          3          6      15906     5006.5 |
  9. | 10,372          4          9      15906     5006.5 |
 10. |  4,082          1          2      15906     5006.5 |
      ----------------------------------------------------   */

17. asgen:计算加权平均

代码语言:js复制
capture ssc install asgen
sysuse census, clear
keeporder state region medage pop

bys region: asgen medage_m1 = medage       //均值
bys region: egen  medage_m2 = mean(medage) //与上方命令等价

bys region: asgen medage_w = medage, weight(pop) //加权均值
duplicates drop region, force

list region medage_m1 medage_m2 medage_w 
/*
      -------------------------------------------- 
     | region    medage_m1   medage~2    medage_w |
     |--------------------------------------------|
  1. | NE        31.233333   31.23333    31.82184 |
  2. | N Cntrl      29.525     29.525   29.634891 |
  3. | South      29.61875   29.61875   29.799904 |
  4. | West      28.284616   28.28462   29.348818 |
      --------------------------------------------  */

18. collapse:进行组内计算

代码语言:js复制
sysuse census, clear
gen N = 1, before(region)
sort region
collapse (mean) medage (sum) N, by(region)
list
/*
      ----------------------- 
     | region    medage    N |
     |-----------------------|
  1. | NE         31.23    9 |
  2. | N Cntrl    29.52   12 |
  3. | South      29.62   16 |
  4. | West       28.28   13 |
      -----------------------  */

sysuse census, clear
collapse (mean) medage [aweight=pop], by(region)  //按 pop 变量加权
/*
      ------------------ 
     | region    medage |
     |------------------|
  1. | NE         31.82 |
  2. | N Cntrl    29.63 |
  3. | South      29.80 |
  4. | West       29.35 |
      ------------------  */

19. seq:生成重复的整数序列

代码语言:js复制
capture ssc install seq
sysuse auto, clear
keep rep78
sort rep78

*repeat 1 2 3 4 5
seq rep1, from(1) to(5)

*repeat 1 1 2 2 3 3 4 4 5 5
seq rep2, from(1) to(5) block(2)

*repeat 1 2 3 ...
seq rep3, by(rep78)

list in 1/15, clean sepby(rep78)
/*
       rep78   rep1   rep2   rep3  
  1.       1      1      1      1  
  2.       1      2      1      2  
  3.       2      3      2      1  
  4.       2      4      2      2  
  5.       2      5      3      3  
  6.       2      1      3      4  
  7.       2      2      4      5  
  8.       2      3      4      6  
  9.       2      4      5      7  
 10.       2      5      5      8  
 11.       3      1      1      1  
 12.       3      2      1      2  
 13.       3      3      2      3  
 14.       3      4      2      4  
 15.       3      5      3      5   */

20. rowtotal:变量间求和

代码语言:js复制
sysuse auto, clear
keep price mpg rep78

*只要有缺失值相加后仍为缺失值
gen sum1 = price   mpg   rep78

*rowtatal将缺失值视为 0
egen sum2 = rowtotal(price mpg rep78) 
egen sum3 = rowtotal(price-rep78)

list in 1/10, sep(0)
/*
      ---------------------------------------------- 
     |  price   mpg   rep78    sum1    sum2    sum3 |
     |----------------------------------------------|
  1. |  4,099    22       3    4124    4124    4124 |
  2. |  4,749    17       3    4769    4769    4769 |
  3. |  3,799    22       .       .    3821    3821 |
  4. |  4,816    20       3    4839    4839    4839 |
  5. |  7,827    15       4    7846    7846    7846 |
  6. |  5,788    18       3    5809    5809    5809 |
  7. |  4,453    26       .       .    4479    4479 |
  8. |  5,189    20       3    5212    5212    5212 |
  9. | 10,372    16       3   10391   10391   10391 |
 10. |  4,082    19       3    4104    4104    4104 |
      ----------------------------------------------  */

21. 生成随机数或随机变量

代码语言:js复制
sysuse auto, clear
set seed 123

*服从均匀分布
gen r  = runiform(), before(make)        //0-1之间随机数
gen r1 = runiform(1,10), before(make)    //1-10之间随机数
gen r2 = runiformint(1,10), before(make) //1-10之间的整数随机数

*服从正态分布
gen rn  = rnormal(), before(make)         //标准正态分布(0, 1)
gen rn1 = rnormal(1), before(make)       //正态分布(1, 1)
gen rn2 = rnormal(5, 2), before(make)    //正态分布(5, 2)

*更多函数:Random-number functions

22. clonevar:生成完全一致的变量

代码语言:js复制
sysuse auto2, clear
keep rep78

gen rep78_2 = rep78

clonevar rep78_3 = rep78

list in 1/10
/*
      ----------------------------- 
     |   rep78   rep78_2   rep78_3 |
     |-----------------------------|
  1. | Average         3   Average |
  2. | Average         3   Average |
  3. |       .         .         . |
  4. | Average         3   Average |
  5. |    Good         4      Good |
     |-----------------------------|
  6. | Average         3   Average |
  7. |       .         .         . |
  8. | Average         3   Average |
  9. | Average         3   Average |
 10. | Average         3   Average |
      -----------------------------  */
tab rep78,m
tab rep78_2, m
tab rep78_3, m

23. keeporder:一步实现 keep 和 order

代码语言:js复制
 ssc install keeporder
 
 *keep   order
 sysuse auto, clear
 keep foreign rep78 make
 order foreign rep78 make
 
 *keeporder
 sysuse auto, clear
 keeporder foreign rep78 make

24. gsort:由大到小排序

代码语言:js复制
*sort 将变量由小到大排序
sysuse auto, clear

*由小达大两者均可
sort  mpg
gsort mpg

**由大到小只能gsort
gsort -mpg

*应用:将变量按类别生成排名
 bys foreign (-mpg): gen n=_n //- invalid name (报错无法执行)

 gsort foreign -mpg      //先排序
 by foreign: gen n = _n  //再生成

*下面的方式也可等价实现
 gen mpg_rev = -mpg
 bys foreign (mpg_rev): gen n=_n
 drop mpg_rev

25. expand:按倍数复制观测行

代码语言:js复制
sysuse auto, clear
keeporder make mpg 
expand 2  //按现有观测行的2倍复制观测行
sort make

26. 生成滞后(lag)与超前(lead)的变量

代码语言:js复制
sysuse uslifeexp, clear
keeporder year le
gen lag_le =  le[_n-1], after(le)
gen lead_le = le[_n 1], after(lag_le)
/*
      -------------------------------- 
     | year     le   lag_le   lead_le |
     |--------------------------------|
  1. | 1900   47.3        .      49.1 |
  2. | 1901   49.1     47.3      51.5 |
  3. | 1902   51.5     49.1      50.5 |
  4. | 1903   50.5     51.5      47.6 |
  5. | 1904   47.6     50.5      48.7 |
  6. | 1905   48.7     47.6      48.7 |
  7. | 1906   48.7     48.7      47.6 |
  8. | 1907   47.6     48.7      51.1 |
  9. | 1908   51.1     47.6      52.1 |
 10. | 1909   52.1     51.1        50 |
 11. | 1910     50     52.1      52.6 |
 12. | 1911   52.6       50      53.5 |
 13. | 1912   53.5     52.6      52.5 |
 14. | 1913   52.5     53.5      54.2 |
 15. | 1914   54.2     52.5      54.5 |
      --------------------------------  */

27. recode:将连续变量生成新分类变量

代码语言:js复制
webuse fullauto, clear
fre rep77 rep78 
recode rep77 rep78 (1 2 = 1 "Below average")   ///
                   (3   = 2 "Average")         ///
                   (4 5 = 3 "Above average"),  ///
                   pre(new) label(newrep)
fre newrep77 newrep78

label list repair newrep
/*
repair:
           1 Poor
           2 Fair
           3 Average
           4 Good
           5 Excellent
newrep:
           1 Below average
           2 Average
           3 Above average */

28. carryforward:向下填充

代码语言:js复制
clear all

input id  time  y  
       1   1   1.2  
       1   3   2.4  
       1   4   3.4  
       1   7   3.2  
       1   9   2.4  
       2   3   1.8  
       2   4   5.6  
       2   6   4.3  
       3   2   2.3  
       3   4   4.5  
       3   7   6.7 
end 

list, sepby(id)
tsset id time  //设定时间序列数据
tsfill, full
list, sepby(id)

bysort id: carryforward y, gen(y1)  //直接向下填充

gsort id -time //以此实现 carry backforward
bysort id: carryforward y1, gen(y2) //反向调整后向下填充

gsort id time
/*
      ----------------------------- 
     | id   time     y    y1    y2 |
     |-----------------------------|
  1. |  1      1   1.2   1.2   1.2 |
  2. |  1      2     .   1.2   1.2 |
  3. |  1      3   2.4   2.4   2.4 |
  4. |  1      4   3.4   3.4   3.4 |
  5. |  1      5     .   3.4   3.4 |
  6. |  1      6     .   3.4   3.4 |
  7. |  1      7   3.2   3.2   3.2 |
  8. |  1      8     .   3.2   3.2 |
  9. |  1      9   2.4   2.4   2.4 |
     |-----------------------------|
 10. |  2      1     .     .   1.8 |
 11. |  2      2     .     .   1.8 |
 12. |  2      3   1.8   1.8   1.8 |
 13. |  2      4   5.6   5.6   5.6 |
 14. |  2      5     .   5.6   5.6 |
 15. |  2      6   4.3   4.3   4.3 |
 16. |  2      7     .   4.3   4.3 |
 17. |  2      8     .   4.3   4.3 |
 18. |  2      9     .   4.3   4.3 |
     |-----------------------------|
 19. |  3      1     .     .   2.3 |
 20. |  3      2   2.3   2.3   2.3 |
 21. |  3      3     .   2.3   2.3 |
 22. |  3      4   4.5   4.5   4.5 |
 23. |  3      5     .   4.5   4.5 |
 24. |  3      6     .   4.5   4.5 |
 25. |  3      7   6.7   6.7   6.7 |
 26. |  3      8     .   6.7   6.7 |
 27. |  3      9     .   6.7   6.7 |
      -----------------------------  */

...... To be revised continuously ......

0 人点赞