Descriptive Statistics Final Project – Python

Overview

Welcome to the Descriptive Statistics Final Project! In this project, you will demonstrate what you have learned in this course by conducting an experiment dealing with drawing from a deck of playing cards and creating a writeup containing your findings.

Be sure to check through the project rubric to self-assess and share with others who will give you feedback.

Questions for Investigation

This experiment will require the use of a standard deck of playing cards. This is a deck of fifty-two cards divided into four suits (spades (♠), hearts (♥), diamonds (♦), and clubs (♣)), each suit containing thirteen cards (Ace, numbers 2-10, and face cards Jack, Queen, and King). You can use either a physical deck of cards for this experiment or a virtual deck of cards such as that found on random.org (http://www.random.org/playing-cards/).

For the purposes of this task, assign each card a value: The Ace takes a value of 1, numbered cards take the value printed on the card, and the Jack, Queen, and King each take a value of 10.

1. First, create a histogram depicting the relative frequencies of the card values.

2. Now, we will get samples for a new distribution. To obtain a single sample, shuffle your deck of cards and draw three cards from it. (You will be sampling from the deck without replacement.) Record the cards that you have drawn and the sum of the three cards’ values. Replace the drawn cards back into the deck and repeat this sampling procedure a total of at least thirty times.

3. Let’s take a look at the distribution of the card sums. Report descriptive statistics for the samples you have drawn. Include at least two measures of central tendency and two measures of variability.

4. Create a histogram of the sampled card sums you have recorded. Compare its shape to that of the original distribution. How are they different, and can you explain why this is the case?

5. Make some estimates about values you will get on future draws. Within what range will you expect approximately 90% of your draw values to fall? What is the approximate probability that you will get a draw value of at least 20? Make sure you justify how you obtained your values.

Investigation

1. Histogram depicting the relative frequencies of the card values.

Deck_Cards.csv contains the deck of fifty-two cards divided into four suits, with a value assigned to each card: the Ace takes a value of 1, numbered cards take the value printed on the card, and the Jack, Queen, and King each take a value of 10.

Below is the format:

cards,suits,value
A,S,1
2,S,2
3,S,3
4,S,4
5,S,5
6,S,6
7,S,7
8,S,8
9,S,9
10,S,10
J,S,10
Q,S,10
K,S,10
A,H,1
2,H,2
3,H,3
4,H,4
5,H,5
6,H,6
7,H,7
8,H,8
9,H,9
10,H,10
J,H,10
Q,H,10
K,H,10
A,D,1
2,D,2
3,D,3
4,D,4
5,D,5
6,D,6
7,D,7
8,D,8
9,D,9
10,D,10
J,D,10
Q,D,10
K,D,10
A,C,1
2,C,2
3,C,3
4,C,4
5,C,5
6,C,6
7,C,7
8,C,8
9,C,9
10,C,10
J,C,10
Q,C,10
K,C,10
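
If you would rather not type the file by hand, the same table can be generated in a few lines. This is only a sketch: the helper name card_value and the variable names are my own, and the column names simply mirror the format shown above.

```python
import pandas as pd

# Ranks and suits in the same order as the listing above
ranks = ['A'] + [str(n) for n in range(2, 11)] + ['J', 'Q', 'K']
suits = ['S', 'H', 'D', 'C']


def card_value(rank):
    """Hypothetical helper: Ace = 1, face cards = 10, otherwise the printed number."""
    if rank == 'A':
        return 1
    if rank in ('J', 'Q', 'K'):
        return 10
    return int(rank)


rows = [(rank, suit, card_value(rank)) for suit in suits for rank in ranks]
deck = pd.DataFrame(rows, columns=['cards', 'suits', 'value'])

# With no path argument, to_csv returns the CSV text; pass a file name to write it to disk
csv_text = deck.to_csv(index=False)
print(csv_text.splitlines()[:3])
```

Passing a file name such as 'Deck_Cards.csv' to to_csv would write the file that the loading step below expects.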

Load Data

import pandas as pd
import numpy as np

# Load the deck; the third column ("value") holds each card's point value
df = pd.read_csv("/Users/....../Documents/Deck_Cards.csv",header='infer')
print(df.head(3))
print(df.iloc[:,2].describe())

 

Output:
  cards suits  value
0     A     S      1
1     2     S      2
2     3     S      3
count   52.000000
mean     6.538462
std      3.183669
min      1.000000
25%      4.000000
50%      7.000000
75%     10.000000
max     10.000000
Name: value, dtype: float64

Plot Histogram

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(211)
ax.hist(df['value'],bins = 10,range=[0.5, 10.5],facecolor='g', align='mid')
ax.xaxis.set_ticks(np.arange(0, 12, 1))


plt.title('Relative Frequencies')
plt.xlabel('Values')
plt.ylabel('Count')
plt.grid(True)

ay = fig.add_subplot(212)
ay.boxplot(df['value'])

plt.show()

 

Output:
histogram_1

Show Frequency Table

# Count how many cards carry each value
freq_table = df.groupby(['value'])
print(freq_table.size())
Output:

value
1 4
2 4
3 4
4 4
5 4
6 4
7 4
8 4
9 4
10 16
dtype: int64
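
The table above shows raw counts, while the question asks for relative frequencies. A minimal sketch of the conversion, assuming the 52 card values are rebuilt in memory rather than read from the CSV (the names values and rel_freq are mine):

```python
import pandas as pd

# Rebuild the 52 card values: ranks 1..13 capped at 10, repeated for four suits
values = pd.Series([min(rank, 10) for rank in range(1, 14)] * 4, name='value')

# normalize=True divides each count by the total number of cards
rel_freq = values.value_counts(normalize=True).sort_index()
print(rel_freq)
```

Values 1 through 9 each take 4/52 of the deck and the value 10 takes 16/52. To draw the histogram on the same scale, one option is to pass weights equal to 1/len(values) for every card to hist.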

2. Sampling and sampling distribution for the sum and average of the three cards’ values

from random import sample

c1=[]
c2=[]
c3=[]
sum_s=[]
average_s=[]

# Repeat the experiment 1000 times; each draw takes three distinct cards (no replacement)
for i in range(1000):
    rindex = sample(range(len(df)), 3)   # three distinct row positions
    values = df['value'].iloc[rindex].tolist()
    c1.append(values[0])
    c2.append(values[1])
    c3.append(values[2])
    sum_s.append(sum(values))
    average_s.append(sum(values) / 3.0)

sampling_df = pd.DataFrame({'card1':c1,'card2':c2,'card3':c3,'sum_col':sum_s,'average_col':average_s})

print(sampling_df.head(3))
Output:

   average_col  card1  card2  card3  sum_col
0     7.000000      7     10      4       21
1     5.333333      4     10      2       16
2     7.000000      5      7      9       21
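
Because the draws are random, every run of the loop above gives different numbers. If you want a run you can reproduce, one option is a seeded generator. This is a sketch under my own assumptions: the seed 42 is arbitrary, and the card values are rebuilt in memory instead of being read from Deck_Cards.csv.

```python
import random

import pandas as pd

# The 52 card values: ranks 1..13 capped at 10, repeated for four suits
values = [min(rank, 10) for rank in range(1, 14)] * 4

rng = random.Random(42)  # arbitrary seed, chosen only for reproducibility

# Each sample picks three distinct positions in the deck (without replacement)
draws = [rng.sample(values, 3) for _ in range(1000)]

samples = pd.DataFrame(draws, columns=['card1', 'card2', 'card3'])
samples['sum_col'] = samples[['card1', 'card2', 'card3']].sum(axis=1)
samples['average_col'] = samples['sum_col'] / 3.0
print(samples.head(3))
```

The whole deck is available again for every sample, which matches the instruction to return the drawn cards before the next draw.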

3. Distribution of the card sums. Descriptive statistics for the samples drawn (at least two measures of central tendency and two measures of variability).

print(sampling_df['sum_col'].describe())
Output:

count    1000.000000
mean       19.646000
std         5.328703
min         5.000000
25%        16.000000
50%        20.000000
75%        23.000000
max        30.000000
Name: sum_col, dtype: float64
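
In the describe() output, the mean and the median (the 50% row) are the measures of central tendency, and the standard deviation and the quartiles cover variability. If you want each measure by name, including the mode and the interquartile range, something like the following works. The ten sums here are made-up illustrative values, not the samples drawn above.

```python
import pandas as pd

# Illustrative card sums only (hypothetical values, not the 1000 sampled sums)
sums = pd.Series([21, 16, 21, 25, 12, 19, 30, 18, 20, 23])

# Measures of central tendency
mean_sum = sums.mean()
median_sum = sums.median()
mode_sum = sums.mode().iloc[0]  # smallest value if several modes tie

# Measures of variability
std_sum = sums.std()  # sample standard deviation (ddof=1), as in describe()
iqr_sum = sums.quantile(0.75) - sums.quantile(0.25)
range_sum = sums.max() - sums.min()

print('mean={} median={} mode={}'.format(mean_sum, median_sum, mode_sum))
print('std={} IQR={} range={}'.format(std_sum, iqr_sum, range_sum))
```

Swapping sums for sampling_df['sum_col'] would give the same measures for the real samples.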

4. Histogram of the sampled card sums and comparison of its shape to the original distribution

fig = plt.figure()
ay = fig.add_subplot(211)
# Sums are integers from 3 to 30, so use one unit-wide bin centered on each integer
ay.hist(sampling_df['sum_col'],bins = 28,range=[2.5, 30.5],facecolor='g', align='mid')
#ay.xaxis.set_ticks(np.arange(0, 40, 5))

plt.title('Sampling Distribution - SUM')
plt.xlabel('Values(SUM)')
plt.ylabel('Count')

ay = fig.add_subplot(212)
ay.boxplot(sampling_df['sum_col'])

plt.show()

 

Output:

histogram_2

 

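The distribution of the sampled card sums is roughly bell-shaped, close to a normal distribution, whereas the original distribution of card values is flat for the values 1 to 9 with a spike at 10. This is consistent with the Central Limit Theorem: sums (and averages) of several draws tend toward a normal shape even when the underlying distribution is not normal. With only three cards per draw the match is approximate rather than exact.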

5. Descriptive statistics and histogram of the sampling distribution for the average of the three cards’ values

print(sampling_df['average_col'].describe())
Output:

count    1000.000000
mean        6.548667
std         1.776234
min         1.666667
25%         5.333333
50%         6.666667
75%         7.666667
max        10.000000
Name: average_col, dtype: float64

 

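fig = plt.figure()
ay = fig.add_subplot(211)
# Averages are multiples of 1/3 between 1 and 10, so use 28 bins centered on those values
# (the earlier range of [4, 34] was copied from the sum plot and cut off averages below 4)
ay.hist(sampling_df['average_col'],bins = 28,range=[1 - 1/6.0, 10 + 1/6.0],facecolor='g', align='mid')
#ay.xaxis.set_ticks(np.arange(0, 40, 5))

plt.title('Sampling Distribution - AVG')
plt.xlabel('Values(AVG)')
plt.ylabel('Count')

ay = fig.add_subplot(212)
ay.boxplot(sampling_df['average_col'])

plt.show()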

 

Output:

histogram_3

 

The sampling distribution for the average of the three cards’ values is also approximately normal, which is consistent with the Central Limit Theorem.

The Central Limit Theorem also says that, for the sampling distribution of the average, the ratio of the population SD to the SD of the sampling distribution is approximately the square root of the sample size (here, 3):

3.183669/1.776234 = 1.7924 ≈ √3 (1.7321)

The ratio comes out slightly above √3 partly because the three cards are drawn without replacement, which shrinks the SD of the sampling distribution a little.
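
To see how close the observed SD of the averages is to theory, here is a rough check. It rests on two assumptions of mine that the post does not spell out: the 3.183669 reported by describe() is the sample SD (dividing by N - 1) rather than the population SD, and drawing three cards without replacement calls for the finite population correction sqrt((N - n)/(N - 1)).

```python
import math

import numpy as np

# The 52 card values: ranks 1..13 capped at 10, repeated for four suits
values = np.array([min(rank, 10) for rank in range(1, 14)] * 4, dtype=float)
N, n = len(values), 3

sample_sd = values.std(ddof=1)      # what describe() reports for the deck
population_sd = values.std(ddof=0)  # SD treating the deck as the whole population

# Standard error of the mean if cards were drawn with replacement
se_with_replacement = population_sd / math.sqrt(n)

# Finite population correction for drawing n of N cards without replacement
fpc = math.sqrt((N - n) / float(N - 1))
se_without_replacement = se_with_replacement * fpc

print('sample SD: {:.4f}, population SD: {:.4f}'.format(sample_sd, population_sd))
print('theoretical SD of the average: {:.4f}'.format(se_without_replacement))
```

Under these assumptions the theoretical SD of the average is about 1.784, close to the 1.776 observed in the simulation.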

 

6. Within what range will you expect approximately 90% of your draw values to fall? What is the approximate probability that you will get a draw value of at least 20? (For the sampled card sums distribution)

Treating the card sums as approximately normal, about 90% of values fall between the 5th and 95th percentiles.

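import scipy.stats as sp

# Mean and standard deviation of the sampled card sums. These differ slightly from the
# describe() output in section 3, presumably because the samples were redrawn.
mean_sum_distribution = 19.481000
std_sum_distribution = 5.435364

# Convert the Z values at the 5th and 95th percentiles to card sums: X = Z*sigma + mu
p1=sp.norm.ppf(0.05)*std_sum_distribution+mean_sum_distribution

p2=sp.norm.ppf(0.95)*std_sum_distribution+mean_sum_distribution

print('It is expected that approximately 90% of the draw values will fall between {} and {}'.format(p1, p2))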

 

Output:

It is expected approximately 90% of the draw values will fall between  10.5406218108  and  28.4213781892

 

Manual calculation using the Z table (for the sampled card sums distribution):

Z for the 95th percentile ≈ 1.64
Z for the 5th percentile ≈ -1.65
The middle 90% lies between these two points.

Mean = 19.481

Standard deviation = 5.435364

Converting Z values to X:

1.64*5.435364+19.481 = 28.395 ≈ 28.4
-1.65*5.435364+19.481 = 10.513 ≈ 10.51

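The approximate probability that you will get a draw value of at least 20

# Z-score of a draw value of 20, using the mean (19.481) and standard deviation (5.435364) above
z_20 = (20 - 19.481) / 5.435364

# Area to the left of 20 under the normal curve
p_20_atmost = sp.norm.cdf(z_20)

print(z_20)

print('The approximate probability that we get a draw value of at least 20 is {}'.format(1 - p_20_atmost))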

 

Output:

0.0954857853126
The approximate probability that we get a draw value of at least 20 is 0.461964490173

 

Manual calculation using the Z table (for the sampled card sums distribution):

Z-score = (X-Mu)/Sigma

Z-score for 20 = (20-19.481)/5.435364 ≈ 0.0955. The Z table gives the area to the left of this Z-score (reading the row for Z = 0.09) as 0.5359.

The probability of at least 20 is the area to the right:

1 – 0.5359 = 0.4641 ≈ 0.464
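
The numbers in this section rely on the normal approximation. Since there are only 22,100 possible three-card hands, they can also be cross-checked by listing every hand. This is a sketch of my own rather than part of the original analysis; the helper exact_percentile is a hypothetical name, and it returns the smallest sum whose cumulative probability reaches the requested level.

```python
from collections import Counter
from itertools import combinations

# The 52 card values: ranks 1..13 capped at 10, repeated for four suits
values = [min(rank, 10) for rank in range(1, 14)] * 4

# Tally the sum of every possible three-card hand (cards treated as distinct)
sum_counts = Counter(sum(hand) for hand in combinations(values, 3))
total = sum(sum_counts.values())


def exact_percentile(q):
    """Hypothetical helper: smallest sum s with P(sum <= s) >= q."""
    running = 0
    for s in sorted(sum_counts):
        running += sum_counts[s]
        if running / float(total) >= q:
            return s


low = exact_percentile(0.05)
high = exact_percentile(0.95)
p_at_least_20 = sum(c for s, c in sum_counts.items() if s >= 20) / float(total)

print('Exact 5th and 95th percentiles: {} and {}'.format(low, high))
print('Exact probability of a sum of at least 20: {:.4f}'.format(p_at_least_20))
```

Because the sums are whole numbers, "at least 20" includes the value 20 itself, so the exact probability can differ noticeably from the figure obtained by cutting the normal curve at 20.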

 

References:

Descriptive Statistics Final Project – https://docs.google.com/document/d/1059JMJ9C5dn7vKUrmfWYle57Ai3Uk9PzxPQBGj5drjE/pub?embedded=true
Python For Data Science Cheat Sheet – https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf
R-bloggers – https://www.r-bloggers.com/descriptive-statistics-final-project-with-python-r/
Cheat Sheet for Exploratory Data Analysis in Python – https://www.analyticsvidhya.com/blog/2015/06/infographic-cheat-sheet-data-exploration-python/
Z Table – https://s3.amazonaws.com/udacity-hosted-downloads/ZTable.jpg
Cheat sheet: Data Visualisation in Python – https://www.analyticsvidhya.com/blog/2015/06/data-visualization-in-python-cheat-sheet/
