## Actual Entropy

$E = \left(\frac{1}{n} \sum_i \Lambda_i \right)^{-1} \ln n$

where $\Lambda_i$ is the length of the shortest substring starting at position $i$ that does not appear anywhere in the preceding sequence, from position $1$ to $i-1$.

The estimated entropy converges to the real entropy of the time series as $n$ approaches infinity [1].
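To make the definition concrete, here is a minimal sketch that enumerates $\Lambda_i$ directly for the string `'1212121212'` used later in this post (the helper `lambda_i` and its skip-undefined convention are my own, chosen to match the implementation below: near the end of the string every candidate substring may already have appeared, in which case $\Lambda_i$ is undefined and that position is skipped):

```python
def lambda_i(s, i):
    """Length of the shortest substring starting at position i
    that does not appear anywhere in s[0:i] (0-indexed).
    Returns None if every substring starting at i already appeared."""
    for length in range(1, len(s) - i + 1):
        if s[i:i + length] not in s[:i]:
            return length
    return None

s = '1212121212'
lambdas = [lambda_i(s, i) for i in range(1, len(s))]
# → [1, 3, 3, 5, 5, None, None, None, None]
```

The defined values sum to $1+3+3+5+5 = 17$, which matches the `sum_gamma` value reported for `D1` below.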

## Lempel-Ziv Compression Algorithm

In Haifa, Israel in 1977, Abraham Lempel and Jacob Ziv developed a highly flexible algorithm that compresses data with no data loss. First published in the IEEE Transactions on Information Theory in May 1977 and improved in 1978, it enabled efficient data transmission via the internet.

These algorithms, originally known as LZ77 and LZ78 and now referred to as LZ1 and LZ2, respectively, were foundational to the development of subsequent compression algorithms and underlie formats such as GIF and the DEFLATE algorithm used in PNG files.

Lempel and Ziv's 1977 algorithm compressed data by replacing repeated instances of data with a single reference to an earlier copy of that data in the uncompressed, or input, data stream. These matches were encoded as a pair of numbers known as the length-distance pair. To code the matches, LZ77 used so-called sliding-window compression: the most recent data in the stream is kept in a window, which the encoder searches for matches. The longer the sliding window, the more references the encoder can build.
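The length-distance idea can be sketched in a few lines. This is a toy illustration only, not the exact LZ77 format: real implementations emit compact binary tokens and carefully bound the window.

```python
def lz77_tokens(data, window=16):
    """Toy LZ77: emit (distance, length, next_char) triples."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # search the sliding window for the longest match
        for j in range(max(0, i - window), i):
            k = 0
            # matches may overlap the current position, as in real LZ77
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        next_char = data[i + best_len] if i + best_len < len(data) else ''
        tokens.append((best_dist, best_len, next_char))
        i += best_len + 1
    return tokens

def lz77_decode(tokens):
    """Invert lz77_tokens by copying from `distance` symbols back."""
    out = []
    for dist, length, ch in tokens:
        for _ in range(length):
            out.append(out[-dist])
        if ch:
            out.append(ch)
    return ''.join(out)
```

For example, `lz77_tokens('abababab')` emits two literals and then a single `(2, 6, '')` token that copies the repeating pattern, and `lz77_decode` recovers the original string.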

LZ78 changed the encoding scheme by replacing repeated instances of data with references to a dictionary built incrementally from the input stream. Subsequent modifications of LZ78, such as LZW, pre-initialize the dictionary with all possible single characters.
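The dictionary-building step can be sketched as follows (again a toy illustration; real LZ78 coders emit the (index, symbol) pairs in binary rather than as Python tuples):

```python
def lz78_tokens(data):
    """Toy LZ78: emit (dict_index, symbol) pairs; index 0 means 'no prefix'."""
    dictionary = {}          # phrase -> 1-based index
    tokens, phrase = [], ''
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch     # keep extending the current phrase
        else:
            tokens.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ''
    if phrase:               # flush a trailing phrase already in the dictionary
        tokens.append((dictionary[phrase], ''))
    return tokens

def lz78_decode(tokens):
    """Rebuild the dictionary while decoding."""
    phrases = ['']           # index 0 is the empty phrase
    out = []
    for idx, ch in tokens:
        phrase = phrases[idx] + ch
        phrases.append(phrase)
        out.append(phrase)
    return ''.join(out)
```

For example, `lz78_tokens('abracadabra')` yields `[(0,'a'), (0,'b'), (0,'r'), (1,'c'), (1,'d'), (1,'b'), (3,'a')]`: each repeated prefix is replaced by its dictionary index.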

An implementation of the estimator (this is the version whose outputs appear below):

```python
import math

def contains(small, big):
    """Return True if list `small` occurs as a contiguous sublist of `big`."""
    for i in range(len(big) - len(small) + 1):
        if big[i:i + len(small)] == small:
            return True
    return False

def actual_entropy(l):
    n = len(l)
    sequence = [l[0]]
    sum_gamma = 0
    shortest_substrings = [[0, l[0], None]]
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            s = l[i:j]
            if not contains(list(s), sequence):  # s is not contained in the previous sequence
                shortest_substrings.append([i, l[i], list(s)])
                sum_gamma += len(s)              # accumulate Lambda_i
                sequence.append(l[i])
                break
        # if every substring starting at i already appeared, position i is skipped
    ae = 1 / (sum_gamma / n) * math.log(n)
    return ae, sum_gamma, n, shortest_substrings
```

> To construct a time series for each user we segment the three month observation period into hour-long intervals. Each interval is assigned a tower ID if one is known (i.e. the phone was used in that time interval). If multiple calls were made in a given interval, we choose one of them at random. Finally if no call is made in a given interval, we assign it an ID "?", implying an unknown location. Thus for each user $i$ we obtain a string of length $L = 24 \times 7 \times 14 = 2352$ with $N_i + 1$ distinct symbols, each denoting one of the $N_i$ towers visited by the user and one for the missing location "?". (Supporting Online Material for Limits of Predictability in Human Mobility, page 4)
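A sketch of that construction. The input format here, a list of `(hour_index, tower_id)` call records, and the function name are my assumptions for illustration, not the paper's exact pipeline:

```python
import random

def build_location_string(calls, n_intervals, seed=0):
    """calls: iterable of (hour_index, tower_id) records.
    Returns one symbol per hour-long interval: a tower ID if the
    phone was used in that interval, '?' otherwise."""
    rng = random.Random(seed)
    by_interval = {}
    for hour, tower in calls:
        by_interval.setdefault(hour, []).append(tower)
    # one tower per interval, chosen at random if several calls occurred
    return [rng.choice(by_interval[h]) if h in by_interval else '?'
            for h in range(n_intervals)]

series = build_location_string([(0, 'A'), (0, 'B'), (2, 'C')], 4)
# e.g. hour 0 gets 'A' or 'B' at random, hours 1 and 3 get '?'
```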

```python
D1 = '1212121212'
D2 = '2112221121'

actual_entropy(D1)
```

```
(1.3544618194082623,
 17,
 10,
 [[0, '1', None],
  [1, '2', ['2']],
  [2, '1', ['1', '2', '1']],
  [3, '2', ['2', '1', '2']],
  [4, '1', ['1', '2', '1', '2', '1']],
  [5, '2', ['2', '1', '2', '1', '2']]])
```

```python
actual_entropy(D2)
```

```
(1.2118868910494978,
 19,
 10,
 [[0, '2', None],
  [1, '1', ['1']],
  [2, '1', ['1', '2']],
  [3, '2', ['2', '2']],
  [4, '2', ['2', '2']],
  [5, '2', ['2', '1', '1', '2', '1']],
  [6, '1', ['1', '1', '2', '1']],
  [7, '1', ['1', '2', '1']]])
```

### Question: the second sequence is more information-rich, yet its estimated entropy is smaller? Wrong ❌

For strings this short the estimator has not converged. On longer strings, the random sequence's estimated entropy clearly exceeds that of the regular sequence:

```python
import random

random.seed(2020)

# random binary string of length 200
D2 = []
for i in range(200):
    if random.random() >= 0.5:
        D2.append('1')
    else:
        D2.append('2')
D2 = ''.join(D2)

# perfectly regular string of the same length
D1 = '12' * 100

# keep only the entropy estimate (first element of the returned tuple)
actual_entropy(D1)[0], actual_entropy(D2)[0]
```

```
(0.21197508967985743, 0.7645479605408423)
```

```python
import random
import numpy as np

random.seed(2020)
ent_list = []

for n in range(5, 200):
    # random binary string of length 2n
    D2 = []
    for i in range(n * 2):
        if random.random() >= 0.5:
            D2.append('1')
        else:
            D2.append('2')
    D2 = ''.join(D2)
    # regular string of the same length
    D1 = '12' * n
    ent_list.append([n, actual_entropy(D1)[0], actual_entropy(D2)[0]])

n, ent1, ent2 = np.array(ent_list).T
```

```python
import matplotlib.pyplot as plt

plt.plot(2 * n, ent1, label='regular string')
plt.plot(2 * n, ent2, label='random string')
# horizontal reference lines
plt.axhline(y=0.8, ls=":", c="red", alpha=0.3)
plt.axhline(y=0.7, ls=":", c="red", alpha=0.3)
plt.axhline(y=0.2, ls=":", c="red", alpha=0.3)
plt.axhline(y=0.1, ls=":", c="red", alpha=0.3)

plt.ylim(0, 2)
plt.legend(fontsize=15)
# plt.xscale('log', base=2)
plt.xlabel('length of string', fontsize=15)
plt.ylabel('actual entropy', fontsize=15)
plt.show()
```
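As a sanity check on the plot (my own note, assuming entropy is measured in nats, since the code uses the natural logarithm): the true entropy rate of a fair binary coin is $\ln 2 \approx 0.693$ nats per symbol, while a perfectly periodic string has entropy rate $0$. The two curves should therefore approach roughly $0.69$ and $0$ as the strings grow, consistent with the bands marked by the red dashed lines:

```python
import math

# true entropy rate of an i.i.d. fair binary source, in nats per symbol
h_random = math.log(2)   # about 0.693
# a deterministic periodic string carries no new information per symbol
h_periodic = 0.0
```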

1. Kontoyiannis, I., Algoet, P. H., Suhov, Yu. M., & Wyner, A. J. Nonparametric entropy estimation for stationary processes and random fields, with applications to English text. *IEEE Transactions on Information Theory* 44, 1319-1327 (1998).
