R: PPS Sampling

Today a friend asked me how to write code for PPS sampling. I searched around and found an R package that implements it.

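Baidu Baike: PPS sampling means sampling with probability proportional to size, and it is one type of probability sampling. In multistage sampling, especially two-stage sampling, the probability that a primary sampling unit is selected depends on its size: the larger the primary sampling unit, the greater its chance of being selected, and the smaller the unit, the lower its chance. In other words, the population is divided by some precise criterion into units of unequal size that share the same characteristic, and the sample is allocated to those units according to their share of the population.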
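
To make the definition concrete, here is a minimal sketch of how selection probabilities follow from unit sizes. The four units and their sizes are made up for illustration and do not come from any real data set.

```r
# Hypothetical sizes of four primary sampling units
# (for example, the number of households in each village)
size <- c(A = 100, B = 200, C = 300, D = 400)

# Probability that each unit is selected on a single draw:
# its size divided by the total size of all units
p <- size / sum(size)
print(p)
```

Unit D is four times as large as unit A, so it is four times as likely to be selected on any single draw.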

R source code:

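function (m, x) 
{
    N <- length(x)                          # number of units in the population
    pk <- x/sum(x)                          # selection probability of each unit
    cumpk <- cumsum(pk)                     # cumulative selection probabilities
    U <- runif(m)                           # one uniform random number per draw
    ints <- cbind(c(0, cumpk[-N]), cumpk)   # lower and upper bound of each unit's interval
    sam <- rep(0, m)
    for (i in 1:m) {
        # select the unit whose interval contains U[i]
        sam[i] <- which(U[i] > ints[, 1] & U[i] < ints[, 2])
    }
    return(cbind(sam, pk[sam]))             # selected indices and their probabilities
}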

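This code comes from the R package TeachingSampling (it is the S.PPS function used below), and its principle is clear from the code. Each unit's selection probability pk is its size divided by the total size. The cumulative probabilities split the interval [0, 1] into N sub-intervals, one per unit, and each of the m uniform random numbers selects the unit whose sub-interval contains it. Because the m draws are independent, this is sampling with replacement, so the same unit can be selected more than once.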
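
The same idea can be sketched in base R without the loop. This is only an illustration under my own assumptions, not the package's implementation: `pps_draw` is a hypothetical helper name, and it works on running totals of the size measure with `findInterval()` instead of normalized cumulative probabilities.

```r
# Hypothetical helper (not part of TeachingSampling): draw m units with
# replacement, with probability proportional to the size measure x.
pps_draw <- function(m, x) {
  x <- as.numeric(x)
  stopifnot(m >= 0, length(x) > 0, all(x >= 0), sum(x) > 0)
  cum <- cumsum(x)             # running totals: right end of each unit's interval
  total <- cum[length(cum)]
  u <- runif(m) * total        # uniform draws on (0, total)
  # findInterval() counts how many running totals are <= u;
  # adding 1 gives the index of the unit whose interval contains u.
  sam <- findInterval(u, cum) + 1L
  cbind(sam = sam, pk = x[sam] / total)
}

set.seed(1)
sizes <- c(10, 20, 30, 40)
res <- pps_draw(10000, sizes)

# Empirical selection frequencies; they should be close to sizes / sum(sizes)
freq <- tabulate(res[, "sam"], nbins = length(sizes)) / nrow(res)
print(round(freq, 2))
```

With many draws, the observed frequency of each unit should approach its share of the total size, which is a quick way to check that a PPS routine behaves as intended.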

Example:

> library(TeachingSampling)
> data(Lucy)
> attach(Lucy)
The following objects are masked from Lucy (pos = 3):

    Employees, ID, Income, Level, SPAM, Taxes, Ubication, Zone

> res <- S.PPS(400, Income)  # PPS sample of 400 draws, with Income as the size measure
> head(res)
      sam             
[1,]  894 0.0002994541
[2,] 1717 0.0006278877
[3,]   49 0.0003226377
[4,] 2336 0.0015590934
[5,]  194 0.0003187737
[6,] 1700 0.0007921045
> sam <- res[,1]
> head(sam)
[1]  894 1717   49 2336  194 1700
> data <- Lucy[sam, ]  # the resulting sample
> head(data)
         ID Ubication  Level Zone Income Employees Taxes SPAM
894  AB2054     c10k3  Small    C    310        94     4  yes
1717 AB1145    c18k34 Medium    A    650       117    21  yes
49    AB050     c1k49  Small    A    334        16     5   no
2336 AB1126    c25k59    Big    A   1614       159   138  yes
194  AB1398     c2k95  Small    B    330        39     4  yes
1700 AB1122    c18k17 Medium    A    820        82    34  yes
> dim(data)
[1] 400   8
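
The second column of the result (the selection probabilities) is what you would typically use when estimating from the sample. The sketch below goes one step beyond the example above and is only an illustration: it uses a synthetic frame instead of the Lucy data, draws the sample with base R's `sample()` rather than S.PPS, and applies the Hansen-Hurwitz estimator, which averages y / p over the draws. All variable names here are my own.

```r
set.seed(42)

# Synthetic stand-in for a sampling frame (made-up values, not the Lucy data)
n_units <- 1000
income <- round(runif(n_units, 50, 2000))
taxes <- 0.1 * income + rnorm(n_units, sd = 5)
frame <- data.frame(id = seq_len(n_units), income = income, taxes = taxes)

# Selection probability of each unit, proportional to income
pk <- frame$income / sum(frame$income)

# With-replacement PPS draw using base R
sam <- sample(nrow(frame), size = 100, replace = TRUE, prob = pk)

# Because draws are with replacement, some units may appear more than once
n_repeats <- sum(duplicated(sam))

# Hansen-Hurwitz estimate of the population total of taxes
est_total <- mean(frame$taxes[sam] / pk[sam])
true_total <- sum(frame$taxes)
print(c(estimate = est_total, truth = true_total, repeats = n_repeats))
```

The estimate works well here because taxes is roughly proportional to income, the size measure; PPS sampling pays off most when the variable of interest is strongly related to size.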

For an explanation of the theory, see: http://blog.csdn.net/zrjdds/article/details/50231551
