侵权投诉# Strong Consistency of the Good-Turing Estimator

## TOP相关主题

*We consider the problem of estimating the total probability of all symbols that appear with a given frequency in a string of i.i.d. random variables with unknown distribution. We focus on the regime in which the block length is large yet no symbol appears*

a r X i v :c s /0607014v 1 [c s .I T ] 5 J u l 2006

Strong Consistency of the Good-Turing Estimator

Aaron B.Wagner

Coordinated Science Laboratory Univ.of Illinois at Urbana-Champaign and School of ECE,Cornell University

wagner@ece.cornell.edu

Pramod Viswanath

Coordinated Science Laboratory

and ECE Dept.

Univ.of Illinois at Urbana-Champaign

pramodv@uiuc.edu

Sanjeev R.Kulkarni

EE Dept.

Princeton University kulkarni@princeton.edu

Abstract —We consider the problem of estimating the total probability of all symbols that appear with a given frequency in a string of i.i.d.random variables with unknown distribution.We focus on the regime in which the block length is large yet no symbol appears frequently in the string.This is accomplished by allowing the distribution to change with the block length.Under a natural convergence assumption on the sequence of underlying distributions,we show that the total probabilities converge to a deterministic limit,which we characterize.We then show that the Good-Turing total probability estimator is strongly consistent.

I.I NTRODUCTION

The problem of estimating the underlying probability dis-tribution from an observed data sequence arises in a variety of ﬁelds such as compression,adaptive control,and linguistics.The most familiar technique is to use the empirical distribution of the data,also known as the type.This approach has a number of virtues.It is the maximum likelihood (ML)distribution,and if each symbol appears frequently in the string,then the law of large numbers guarantees that the estimate will be close to the true distribution.

In some situations,however,not all symbols will appear frequently in the observed data.One example is a digital image with the pixels themselves,rather than bits,viewed as the symbols [1].Here the size of the alphabet can meet or exceed the total number of observed symbols,i.e.,the number of pixels in the image.Another example is English text.Even in large corpora,many words will appear once or twice or not at all [2].This makes estimating the distribution of English words using the type ineffective.This problem is particularly pronounced when one attempts to estimate the distribution of bigrams,or pairs of words,since the number of bigrams is evidently the square of the number of words.

To see that the empirical distribution is lacking as an estimator for the probabilities of uncommon symbols,con-sider the extreme situation in which the alphabet is inﬁnite and we observe a length-n sequence containing n distinct symbols [3].The ML estimator will assign probability 1/n to the n symbols that appear in the string and zero probability to the rest.But common sense suggests that the (n +1)st symbol in the sequence is very likely to be one that has not yet appeared.It seems that the ML estimator is overﬁtting the data.Modiﬁcations to the ML estimator such as the Laplace “add one”and the Krichevsky-Troﬁmov “add half”[4]have been proposed as remedies,but these only alleviate the problem [3].

In collaboration with Turing,Good [5]proposed an esti-mator for the probabilities of rare symbols that differs con-siderably from the ML estimator.The Good-Turing estimator has been shown to work well in practice [6],and it is now used in several application areas [3].Early theoretical work on the estimator focused on its bias [5],[7],[8].Recent work has been directed toward developing conﬁdence intervals for the estimates using central limit theorems [9],[10]or concentration inequalities [11],[12].Orlitsky,Santhanam,and Zhang [3]showed that the estimator has a pattern redundancy that is small but not optimal.None of these works,however,have shown that the estimator is strongly consistent.

We show that the Good-Turing estimator is strongly consis-tent under a natural formulation of the problem.We consider the problem of estimating the total probability of all symbols that appear k times in the observed string for each nonnegative integer k .For k =0,this is the total probability of the unseen symbols,a quantity that has received particular attention [7],[13].Estimating the total probability of all symbols with the same empirical frequency is a natural approach when the symbols appear infrequently so that there is insufﬁcient data to accurately estimate the probabilities of the individual symbols.Although the total probabilities are themselves random,we show that under our model they converge to a deterministic limit,which we characterize.Note that if the alphabet is small and the block length is large,then the problem effectively reduces to the usual probability estimation problem since it is unlikely that multiple symbols will have the same empirical frequency.

It is known that the Good-Turing estimator performs poorly for high-probability symbols [3],but this is not a problem since the ML estimator can be employed to estimate the probabilities of symbols that appear frequently in the observed string.We therefore focus on the situation in which the symbols are unlikely,meaning that they have probability O (1/n ).We allow the underlying distributions to vary with the block length n in order to maintain this condition,and we assume that,properly scaled,these distributions converge.This model is discussed in detail in the next section,where we also describe the Good-Turing estimator.In Section III,we establish the convergence of the total probabilities.Section IV uses this convergence result to show strong consistency of the Good-Turing estimator.Some comments regarding how to estimate other quantities of interest are made in the ﬁnal section.

- 小学六年级健康教育教案
- 电子厂员工培训考试题目样本
- 人防工程技术标
- 几种改进的邻苯三酚自氧化法测定超氧化物歧化酶活性的比较
- 贵阳市20112012学年度八年级数学上册期末试题及答案
- 山东省事业单位考试教育类教学基础知识部分试卷结构
- 辽宁省凌海市石山初级中学九年级语文上册 8《致女儿的信》“一案三单”问题导读单 (新版)新人教版
- 中国十大元帅和十大大将简介
- 2014年造价工程师考试真题解析
- 银行计算机系统概述
- 调经种玉汤治疗妇科不孕症84例
- 学生教材领取登记表
- 安全专工岗位职责
- 浅谈建筑项目管理的创新
- 高校辅导员笔试资料(B)
- 古建筑监理细则
- 初中化学万能说课稿
- 横格纸打印版
- 物联网在港口物流中应用模式分析
- 两位数乘一位数笔算(连续进位)
- 《婴幼儿营养与保育》三次作业参考答案
- 5S现场可视化规范
- 基础砖胎膜技术交底
- 影视鉴赏论文《阿凡达》
- IQ测试题整理版

本站所有资源均来自互联网，本站只负责收集和整理，均不承担任何法律责任，如有侵权等其它行为请联系我们.

文档下载 Copyright 2013 doc.guandang.net All Rights Reserved.

文档下载 Copyright 2013 doc.guandang.net All Rights Reserved.

## 热门文档