DiscreteEntropy

Summary

DiscreteEntropy is a Julia package to estimate the Shannon entropy of discrete data.

DiscreteEntropy implements a large collection of entropy estimators.

At present, we have implementations for:

Function                  Type
------------------------  ------------------------
maximum_likelihood        MaximumLikelihood
jackknife_mle             JackknifeMLE
miller_madow              MillerMadow
grassberger               Grassberger
schurmann                 Schurmann
schurmann_generalised     SchurmannGeneralised
bub                       BUB
chao_shen                 ChaoShen
zhang                     Zhang
bonachela                 Bonachela
shrink                    Shrink
chao_wang_jost            ChaoWangJost
unseen                    Unseen
bayes                     Bayes
jeffrey                   Jeffrey
laplace                   Laplace
schurmann_grassberger     SchurmannGrassberger
minimax                   Minimax
nsb                       NSB
ansb                      ANSB
pym                       PYM

The Type is mainly used with the function estimate_h; see below.
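
The simplest of these estimators, maximum likelihood (the "plug-in" estimator), can be sketched in a few lines of plain Julia. This is an illustrative sketch of the underlying formula, not the package's implementation:

```julia
# Plug-in (maximum likelihood) Shannon entropy in nats, computed
# directly from a histogram of counts.
# Illustrative sketch only; use estimate_h(..., MaximumLikelihood) in practice.
function mle_entropy(counts::Vector{<:Integer})
    n = sum(counts)
    h = 0.0
    for c in counts
        c == 0 && continue      # empty bins contribute nothing
        p = c / n
        h -= p * log(p)         # natural log, so the result is in nats
    end
    return h
end

mle_entropy([1, 2, 3, 4, 3, 2, 1])   # entropy of the example histogram, in nats
```

The plug-in estimator is known to be biased low for small samples, which is what many of the estimators in the table above try to correct.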

We also have some non-traditional mixed estimators, such as jackknife, which applies jackknife resampling to any estimator; bayesian_bootstrap, which applies bootstrap resampling to an estimator; and pert, a three-point estimation technique combining pessimistic and optimistic estimations.
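
The jackknife idea can be illustrated in plain Julia: drop one observation at a time, re-estimate, and apply the standard bias correction. This is a sketch of the resampling technique itself, not the package's jackknife API, and the plug-in estimator used here is defined inline for illustration:

```julia
# A minimal plug-in entropy estimator over a vector of counts (nats).
plugin(c) = (n = sum(c); -sum(p * log(p) for p in c ./ n if p > 0))

# Leave-one-out jackknife around any entropy function `estimator` that
# takes a histogram (vector of counts). Sketch only; the package's
# `jackknife` operates on its own CountData type.
function jackknife_sketch(estimator, counts::Vector{Int})
    n = sum(counts)
    estimates = Float64[]
    for (i, c) in enumerate(counts)
        c == 0 && continue
        reduced = copy(counts)
        reduced[i] -= 1         # remove one sample from bin i
        # each of the c samples in bin i yields the same reduced histogram
        for _ in 1:c
            push!(estimates, estimator(reduced))
        end
    end
    full = estimator(counts)
    # standard jackknife bias correction: n*H - (n-1)*mean(leave-one-out)
    return n * full - (n - 1) * sum(estimates) / n
end
```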

In addition, we provide a number of other information-theoretic measures which use these estimators under the hood.

Installing DiscreteEntropy

  1. If you have not done so already, install Julia. Julia versions 1.8 through 1.10 are currently supported. Nightly and Julia 1.11 are not (yet) supported.

  2. Install DiscreteEntropy using

using Pkg; Pkg.add("DiscreteEntropy")

or

] add DiscreteEntropy

Basic Usage

using DiscreteEntropy

data = [1, 2, 3, 4, 3, 2, 1]
7-element Vector{Int64}:
 1
 2
 3
 4
 3
 2
 1

Most of the estimators take a CountData object. This is a compact representation of the histogram of the random variable. It can be pretty easy to forget whether a vector represents a histogram or a set of samples, so DiscreteEntropy forces you to say which it is when creating a CountData object. The easiest way to create a CountData object is using from_data.

# if `data` is a histogram already
cd = from_data(data, Histogram)
CountData([4.0 2.0 3.0 1.0; 1.0 2.0 2.0 2.0], 16.0, 7)
# or if `data` is actually a vector of samples

cds = from_data(data, Samples)
CountData([2.0 1.0; 3.0 1.0], 7.0, 4)
# now we can estimate
h = estimate_h(from_data(data, Histogram), ChaoShen)
# treating data as a vector of samples
h = estimate_h(from_data(data, Samples), ChaoShen)
1.6310218225019266
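
The multiplicity representation inside CountData can be reproduced by hand: histogram the samples, then histogram the counts themselves. The helper name below is hypothetical, for illustration only; use from_data in practice:

```julia
# Build the multiplicity representation of a sample vector by hand:
# map each count to the number of bins that have that count.
# Hypothetical helper for illustration; `from_data` does this for you.
function multiplicities(samples)
    counts = Dict{eltype(samples),Int}()
    for s in samples
        counts[s] = get(counts, s, 0) + 1
    end
    mult = Dict{Int,Int}()
    for c in values(counts)
        mult[c] = get(mult, c, 0) + 1
    end
    return mult   # count => number of bins with that count
end

multiplicities([1, 2, 3, 4, 3, 2, 1])
# count 2 occurs in 3 bins, count 1 in 1 bin:
# this matches CountData([2.0 1.0; 3.0 1.0], 7.0, 4) above
# (N = 2*3 + 1*1 = 7 samples, K = 3 + 1 = 4 bins)
```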

DiscreteEntropy.jl outputs Shannon measures in nats. The helper functions to_bits and to_bans convert the result to bits or bans.

h = to_bits(estimate_h(cd, ChaoShen))
2.997302182277761
h = to_bans(estimate_h(cd, ChaoShen))
0.9022778629347157
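
These conversions are just changes of logarithm base. Assuming to_bits and to_bans divide a nats value by log(2) and log(10) respectively (which is consistent with the outputs above), they can be sketched as:

```julia
# Change-of-base conversions from nats: a sketch consistent with the
# values shown above, not the package's own definitions.
nats_to_bits(h) = h / log(2)    # 1 bit  = log(2)  nats
nats_to_bans(h) = h / log(10)   # 1 ban  = log(10) nats

nats_to_bits(log(4))   # ≈ 2.0: a uniform 4-outcome distribution carries 2 bits
```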

Contributing

All contributions are welcome! Please see CONTRIBUTING.md for details. Anyone wishing to add an estimator is particularly welcome. Ideally, the estimator will take a CountData struct, though this might not always be suitable (eg schurmann_generalised), and it should also be added to estimate_h. Any estimator must also come with tests.