The package MASS provides a function, fitdistr to fit an observation over discrete distribution using Maximum likelihood.

Obtaining data

We first need to generate some data to fit. The rnegbin(n,mu,theta) function can be used to generate n samples of negative binomial with mean mu and variance mu + mu^2 / theta. To describe the distribution in terms of \(r\) (number of successes) and \(p\) (probability of success), we need to derive these parameters from the properties of \(NB\) distribution, i.e.

\[ \begin{align} r &= \frac{\mu p}{1 - p} = \theta \\ p &= \frac{\theta}{\mu + \theta} \end{align} \]

In the example below, we sample \(X\) from a negative binomial distribution with * mean \(\mu=\) 5 * variance \(\sigma^2 = \mu + \mu^2/\theta,\, \theta=\) 1

or equivalently

\(X \sim NB(r=\) 1\(,p=\) 0.1666667\()\)

Fitting with pre-determined distribution

If we know \(X \sim NB(r,p)\), we can fit the data using this distribution.

x_plot = seq(0, 100,1)
nbfit = fitdistr(x,'negative binomial')
nbfit
##      size         mu    
##   1.1955179   4.4299973 
##  (0.2341407) (0.4565670)

The fitted model contains 2 parameters, which can be converted to our conventional \((r,p)\) as following

r.fit = nbfit$estimate['size']
p.fit = r.fit / (nbfit$estimate['mu'] + r.fit)

The effects of sample size

Conventional wisdom tells us that the more samples we have, the better we should be able to infer, i.e. the less error we make. When the number samples increase, we should expect the inferred parameters to approach the true value. The graphs below clearly illustrate that when the number of samples increases * Log-likelihood approaches zero, i.e. the more confident we are * \((r,p)\) approach the true values

Goodness-of-fit

What if we don’t know the distribution of \(X\) before hand? There are multiple statistical hypothesis tests to evaluate whether a distribution is good for fitting a data set, for example Shapiro-Wilk for normality, Anderson-Darling for a given distribution…

Assuming Poisson distribution

We’ll explore how the Pearson chi-squared test can tell if a distribution, e.g. Poisson can properly describe our data set.

\[ \chi^2 = \sum_{i=1}^{n}\frac{(O_i - E_i)^2}{E_i} \] where * \(O_i\) is the observed frequency in bin \(i\) * \(E_i\) is the expected frequency for bin \(i\)

We want to test whether

\[ \begin{align} H_0 \equiv X \sim Poisson(\lambda) \\ \lambda = \mu \end{align} \]

Under \(H_0\), the outcome of \(X\) would be

nsamples = length(x)
f.obs = table(x)
bins = as.numeric(names(f.obs))
f.hyp = dpois(bins,lambda = mu)*nsamples
chiSquare.pois = sum((f.obs-f.hyp)^2/f.hyp)

Assuming NB distribution

nsamples = length(x)
f.obs = table(x)
bins = as.numeric(names(f.obs))
f.hyp = dnbinom(bins,size=r.fit,prob=p.fit)*nsamples
chiSquare.nb = sum((f.obs-f.hyp)^2/f.hyp)
dof = length(bins) - 1

The goodness-of-fit for each distribution is determined by consulting the \(\chi^2\) distribution with 32 degree-of-freedom.

Distribution chi-squared P(chi.sq)
Poisson 5.400142e+14 0.000000e+00
Negative binomial 9.787575e+01 4.717345e-09

In other words, it’s extremely unlikely that \(X \sim Poisson(\lambda=\) 5 \()\) (Prob = 0 ). The decision with \(H_0 \equiv X \sim NB(r =\) 1 \(, p=\) 0.1666667 \()\) is a bit more difficult, depending on our choice of threshold. In fact it seems this goodness-of-fit test does not do a good job in telling good fit from a bad one.