6. Frequency and Distribution
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Recall: this code simulates rolling two dice and summing the faces, repeated 50 times
die = pd.DataFrame([1, 2, 3, 4, 5, 6])
trial = 50
results = [die.sample(2, replace=True).sum().loc[0] for i in range(trial)]
# Summarize the sums of the faces by how often each one occurred
freq = pd.DataFrame(results)[0].value_counts()
sort_freq = freq.sort_index()
print(sort_freq)
3 3
4 5
5 4
6 11
7 14
8 2
9 4
10 1
11 3
12 3
Name: 0, dtype: int64
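Note that the sum 2 never occurred in these 50 trials, so it is missing from the table and would be missing from the bar chart as well. A minimal sketch of one way to show every possible sum, assuming the counts printed above are copied by hand into a Series (the name `full_freq` is introduced here only for illustration):

```python
import pandas as pd

# Frequencies copied from the printed output above (50 trials; the sum 2 never occurred)
sort_freq = pd.Series(
    [3, 5, 4, 11, 14, 2, 4, 1, 3, 3],
    index=[3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
)

# Reindex over every possible sum of two dice, filling unobserved sums with 0
full_freq = sort_freq.reindex(range(2, 13), fill_value=0)
print(full_freq)
```

Plotting `full_freq` instead of `sort_freq` keeps the x-axis identical across runs, which makes charts from different experiments easier to compare.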
# Plot a bar chart based on the result
sort_freq.plot(kind='bar', color='blue', figsize=(15, 8))
<Axes: xlabel='0'>

Relative Frequency
# Relative frequency rescales the counts by the number of trials, so results from different numbers of trials can be compared
relative_freq = sort_freq/trial
relative_freq.plot(kind='bar', color='blue', figsize=(15, 8))
<matplotlib.axes._subplots.AxesSubplot at 0x7efd2dbdabe0>
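Dividing the counts by `trial` is equivalent to asking pandas for proportions directly with `value_counts(normalize=True)`. The sketch below checks this on a small, hypothetical list of dice sums (the values in `results` are made up for illustration and are not taken from the run above):

```python
import pandas as pd

# Hypothetical small sample of dice sums, used only for illustration
results = [7, 6, 7, 8, 5, 7, 6, 9]

# Manual approach: counts divided by the number of trials
counts = pd.Series(results).value_counts().sort_index()
relative_manual = counts / len(results)

# Built-in approach: normalize=True returns proportions instead of counts
relative_builtin = pd.Series(results).value_counts(normalize=True).sort_index()

print(relative_builtin)
```

Both approaches give the same proportions, and the proportions sum to 1.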

# Let us try to increase the number of trials to 10000, and see what will happen...
trial = 10000
results = [die.sample(2, replace=True).sum().loc[0] for i in range(trial)]
freq = pd.DataFrame(results)[0].value_counts()
sort_freq = freq.sort_index()
relative_freq = sort_freq/trial
relative_freq.plot(kind='bar', color='blue', figsize=(15, 8))
<matplotlib.axes._subplots.AxesSubplot at 0x7efd2dc84828>

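We can see that as the number of trials increases, the relative frequencies become more and more stable and come very close to the probability distribution. Try increasing `trial` further (Jupyter Notebook may take some time to produce the output).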
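The list comprehension calls `die.sample` once per trial, which is why large values of `trial` are slow. A possible alternative, sketched here under the assumption that NumPy is available, draws all rolls in one vectorized call; the seed, the names `rolls`, `sums`, `theory`, and `max_gap`, and the trial count are choices made for this illustration only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility
trial = 1_000_000

# Draw both dice for all trials at once: shape (trial, 2), faces 1..6
rolls = rng.integers(1, 7, size=(trial, 2))
sums = rolls.sum(axis=1)

# Relative frequency of each sum, sorted by the sum
relative_freq = pd.Series(sums).value_counts(normalize=True).sort_index()

# Theoretical probabilities of the sums 2..12 for two fair dice
theory = pd.Series([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1], index=range(2, 13)) / 36

# Largest absolute gap between relative frequency and probability
max_gap = (relative_freq - theory).abs().max()
print(max_gap)
```

The gap shrinks as `trial` grows, which is the stabilizing behavior described above. The bar chart can be drawn as before with `relative_freq.plot(kind='bar')`.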
Expectation and Variance of a distribution
# Assume that we have fair dice, which means every face shows with equal probability
# Then we know the 'distribution' of the random variable sum_of_dice
X_distri = pd.DataFrame(index=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
X_distri['Prob'] = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
X_distri['Prob'] = X_distri['Prob']/36
X_distri

mean = pd.Series(X_distri.index * X_distri['Prob']).sum()
var = pd.Series(((X_distri.index - mean)**2)*X_distri['Prob']).sum()
# Output the mean and variance of the distribution; these two numbers are commonly used to summarize a distribution
print(mean, var)
6.999999999999999 5.833333333333334
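The printed mean of 6.999999999999999 is a floating-point rounding artifact; the exact expectation is 7 and the exact variance is 35/6. As a cross-check, the same two formulas, mean = Σ x·P(x) and variance = Σ (x − mean)²·P(x), can be evaluated with exact rational arithmetic. This is a sketch using Python's standard `fractions` module rather than the pandas code above:

```python
from fractions import Fraction

# Number of ways to obtain each sum 2..12 with two fair dice (36 outcomes in total)
ways = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
prob = {s: Fraction(w, 36) for s, w in zip(range(2, 13), ways)}

# Expectation: sum of value times probability
mean = sum(s * p for s, p in prob.items())

# Variance: probability-weighted squared distance from the mean
var = sum((s - mean) ** 2 * p for s, p in prob.items())

print(mean, var)  # 7 35/6
```

Since 35/6 ≈ 5.8333, this agrees with the variance printed above.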
Empirical mean and variance
# If we calculate the mean and variance of the outcomes (with a large enough number of trials, e.g. 20000)...
trial = 20000
results = [die.sample(2, replace=True).sum().loc[0] for i in range(trial)]
# Print the mean and variance of the 20000 trials
results = pd.Series(results)
print(results.mean(), results.var())
6.99505 5.864618728436524
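The empirical values are close to the theoretical mean of 7 and variance of about 5.8333. One detail worth knowing: `Series.var()` in pandas computes the sample variance by default (`ddof=1`, dividing by n − 1), whereas the distribution formula above divides by nothing but the probabilities, which corresponds to `ddof=0`. With 20000 trials the difference is negligible, but it is visible on a tiny, hypothetical sample such as the one below:

```python
import pandas as pd

# Hypothetical tiny sample, used only to make the difference visible
sample = pd.Series([2, 7, 12])

sample_var = sample.var()            # default ddof=1: divides by n - 1
population_var = sample.var(ddof=0)  # ddof=0: divides by n

print(sample_var, population_var)
```

For the 20000-trial run the two versions differ only by a factor of 19999/20000.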