Link to the English original: http://www.temida.si/~bojan/probability_estimation.php

Original:

Probability estimation

1 Introduction

Let us assume that in an experiment we have conducted n independent trials, of which r are successes. The remaining trials (n − r) are failures. Examples of such experiments are tossing a coin (e.g. success is a head), throwing a die (e.g. success is a 6 on top), shooting a free throw in basketball (success is a scored point), etc.
In this report we will discuss the following question: how to estimate the probability of success in the next, (n+1)-th, trial. In addition, we would like to devote special attention to the cases when the sample size is effectively small. In this context we will discuss and compare three methods: relative frequency, Laplace's law of succession and the m-estimate.
What exactly do we mean by an effectively small sample size? Good (1965) states that when there are more than three successes and more than three failures in our example set, there is little difference between any of the described methods for many practical purposes. So we will mainly be dealing with the cases when either the number of successes or the number of failures is small (e.g. 0, 1, 2). Note that such situations can occur also when the actual sample size is large, but we divide the trial set into subsets that fulfill certain conditions for the purpose of estimating conditional probabilities in those subsets.

2 Estimates for success in the next trial

2.1 Relative frequency

Relative frequency is sometimes also called the maximum likelihood estimate. The probability of success in the next trial is computed according to the following formula:

P = r/n

The estimation of probabilities can be regarded as a relatively simple task when the sample size is large enough. In such a case we would hardly require any theory. Bernoulli's theorem states that when n is large enough, the probability of success in the next trial can be reliably estimated by the relative frequency P = r/n. More formally, for arbitrarily small ε and δ, an n0 can be found such that for every n ≥ n0 the following inequality holds:

P(|r/n − p| < ε) > 1 − δ

However, after completing just one trial, which was a failure, the relative-frequency estimate of the probability of success in the next trial would be 0.

2.2 Laplace's law of succession

In order to alleviate such zero probability estimates, a modified formula was proposed:

P = (r + 1)/(n + 2)

In this formula a uniform prior probability is assumed (Good, 1965). In fact, the rationale behind adding 1 to the numerator and 2 to the denominator is the following: before performing the actual experiment, we assume that there were two trials, one a success and one a failure.

2.3 Bayesian m-estimate

A more general Bayesian probability estimate is described in (Cestnik, 1990, 1991).
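Before turning to the Bayesian estimate in detail, the first two estimates can be sketched in Python (a minimal illustration of the formulas above, not part of the original report; the function names are ours):

```python
def relative_frequency(r, n):
    """Maximum likelihood estimate: P = r / n."""
    return r / n

def laplace(r, n):
    """Laplace's law of succession: P = (r + 1) / (n + 2)."""
    return (r + 1) / (n + 2)

# After a single failed trial the relative frequency drops to 0,
# while Laplace's rule still leaves room for a future success.
print(relative_frequency(0, 1))  # 0.0
print(laplace(0, 1))             # 0.333...
```

Note how both estimates converge as r and n grow, which is consistent with Good's remark that the methods hardly differ once there are several successes and several failures.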
To calculate a general Bayesian probability estimate on the basis of evidence, a prior probability distribution has to be assumed first. Then, given the evidence, the prior distribution is updated to a posterior one, from which the expectation can be taken as a point estimate of p. The task of determining a prior probability distribution has been identified as an intrinsic difficulty of the Bayesian approach.

P = (r + Pa·m)/(n + m)

3 Estimation of conditional probabilities

The basic idea behind the proposed m-estimate (Cestnik, 1990, 1991) for estimating conditional probabilities is that the prior probabilities can be estimated from an unconditional sample. The remaining parameter m, which is related to the variance, also has to be determined. The prior variance is computed by the following formula:

Var(p) = Pa(1 − Pa)/(m + 1)

The parameter m is inversely proportional to the variance of the prior distribution. It also governs the tradeoff between relative frequency and prior probability, as can be observed from the following form of the m-estimate:

P = n/(n + m) · r/n + m/(n + m) · Pa

4 Literature

B. Cestnik: Estimating probabilities in machine learning, Ph.D. thesis, University of Ljubljana, Faculty of Computer and Information Science, 1991.
B. Cestnik: Estimating probabilities: A crucial task in machine learning. In: Carlucci Aiello, Luigia (ed.). ECAI 90. London: Pitman, 1990, pp. 147-149.
B. Cestnik, I. Bratko: On estimating probabilities in tree pruning. In: Kodratoff, Yves (ed.). Machine learning - EWSL-91: European working session on learning, Porto, Portugal, March 6-8, 1991: proceedings (Lecture Notes in Computer Science, Lecture Notes in Artificial Intelligence, 482). Berlin [etc.]: Springer-Verlag, 1991, pp. 138-150.
S. Džeroski, B. Cestnik, I. Petrovski: Using the m-estimate in rule induction. CIT. Journal of Computing and Information Technology, 1993, vol. 1, pp. 37-46.

I. J. Good: The Estimation of Probabilities: An Essay on Modern Bayesian Methods, Cambridge, Mass., M.I.T. Press, 1965.
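As a closing illustration (ours, not part of the original report), the m-estimate from sections 2.3 and 3 can be sketched in Python, together with its tradeoff form; the function names are assumptions:

```python
def m_estimate(r, n, pa, m):
    """Bayesian m-estimate: P = (r + pa*m) / (n + m)."""
    return (r + pa * m) / (n + m)

def m_estimate_tradeoff(r, n, pa, m):
    """Equivalent tradeoff form: a weighted mix of the relative
    frequency r/n and the prior pa. Requires n > 0, since it
    divides by n."""
    return n / (n + m) * (r / n) + m / (n + m) * pa

# The two forms agree, and with pa = 0.5, m = 2 the m-estimate
# reduces to Laplace's law of succession (r + 1) / (n + 2).
print(m_estimate(3, 10, 0.2, 4))           # 0.2714...
print(m_estimate_tradeoff(3, 10, 0.2, 4))  # same value
print(m_estimate(0, 1, 0.5, 2))            # 0.333..., Laplace's rule
```

Larger m pulls the estimate toward the prior Pa (small prior variance, strong prior belief); smaller m lets the observed relative frequency dominate.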
Also attached, the paper on the m-estimate: Estimating Probabilities: A Crucial Task in Machine Learning, Cestnik, 1990.