Descriptive Statistics Final Project – Python
December 4, 2016
Overview
Welcome to the Descriptive Statistics Final Project! In this project, you will demonstrate what you have learned in this course by conducting an experiment dealing with drawing from a deck of playing cards and creating a writeup containing your findings.
Be sure to check through the project rubric to self-assess and share with others who will give you feedback.
Questions for Investigation
This experiment will require the use of a standard deck of playing cards. This is a deck of fifty-two cards divided into four suits (spades (♠), hearts (♥), diamonds (♦), and clubs (♣)), each suit containing thirteen cards (Ace, numbers 2-10, and face cards Jack, Queen, and King). You can use either a physical deck of cards for this experiment or you may use a virtual deck of cards such as that found on random.org (http://www.random.org/playing-cards/).
For the purposes of this task, assign each card a value: The Ace takes a value of 1, numbered cards take the value printed on the card, and the Jack, Queen, and King each take a value of 10.
1. First, create a histogram depicting the relative frequencies of the card values.
2. Now, we will get samples for a new distribution. To obtain a single sample, shuffle your deck of cards and draw three cards from it. (You will be sampling from the deck without replacement.) Record the cards that you have drawn and the sum of the three cards’ values. Replace the drawn cards back into the deck and repeat this sampling procedure a total of at least thirty times.
3. Let’s take a look at the distribution of the card sums. Report descriptive statistics for the samples you have drawn. Include at least two measures of central tendency and two measures of variability.
4. Create a histogram of the sampled card sums you have recorded. Compare its shape to that of the original distribution. How are they different, and can you explain why this is the case?
5. Make some estimates about values you will get on future draws. Within what range will you expect approximately 90% of your draw values to fall? What is the approximate probability that you will get a draw value of at least 20? Make sure you justify how you obtained your values.
Investigation
1. Histogram depicting the relative frequencies of the card values.
Deck_Cards.csv contains a deck of fifty-two cards divided into four suits, with each card assigned a value: the Ace takes a value of 1, numbered cards take the value printed on the card, and the Jack, Queen, and King each take a value of 10.
Below is the format:
cards,suits,value
A,S,1
2,S,2
3,S,3
4,S,4
5,S,5
6,S,6
7,S,7
8,S,8
9,S,9
10,S,10
J,S,10
Q,S,10
K,S,10
A,H,1
2,H,2
3,H,3
4,H,4
5,H,5
6,H,6
7,H,7
8,H,8
9,H,9
10,H,10
J,H,10
Q,H,10
K,H,10
A,D,1
2,D,2
3,D,3
4,D,4
5,D,5
6,D,6
7,D,7
8,D,8
9,D,9
10,D,10
J,D,10
Q,D,10
K,D,10
A,C,1
2,C,2
3,C,3
4,C,4
5,C,5
6,C,6
7,C,7
8,C,8
9,C,9
10,C,10
J,C,10
Q,C,10
K,C,10
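The same table can also be generated in code rather than typed by hand. The following is a minimal sketch using only the standard library; the names RANKS, SUITS and card_value are chosen here for illustration and are not part of the original project.

```python
# Build the 52-card deck in memory, following the value rule described above.
# RANKS, SUITS and card_value are names chosen for this sketch.
RANKS = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
SUITS = ["S", "H", "D", "C"]


def card_value(rank):
    """Ace counts 1, face cards count 10, number cards count their number."""
    if rank == "A":
        return 1
    if rank in ("J", "Q", "K"):
        return 10
    return int(rank)


# Same row order as the listing: all spades, then hearts, diamonds, clubs.
deck = [(rank, suit, card_value(rank)) for suit in SUITS for rank in RANKS]

print(len(deck))
print(deck[:3])
```

Each tuple mirrors one CSV row (cards, suits, value), so the list could be written out with the csv module if a file is still wanted.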
Load Data
import pandas as pd
import numpy as np

# Load the deck; header='infer' takes the column names from the first row.
df = pd.read_csv("/Users/....../Documents/Deck_Cards.csv", header='infer')
print(df.head(3))
# Summary statistics for the third column (value).
print(df.iloc[:, 2].describe())
Output:
  cards suits  value
0     A     S      1
1     2     S      2
2     3     S      3
count    52.000000
mean      6.538462
std       3.183669
min       1.000000
25%       4.000000
50%       7.000000
75%      10.000000
max      10.000000
Name: value, dtype: float64
Plot Histogram
import matplotlib.pyplot as plt

fig = plt.figure()

# Top panel: histogram with one bin per card value (1 to 10).
ax = fig.add_subplot(211)
ax.hist(df['value'], bins=10, range=[0.5, 10.5], facecolor='g', align='mid')
ax.xaxis.set_ticks(np.arange(0, 12, 1))
plt.title('Relative Frequencies')
plt.xlabel('Values')
plt.ylabel('Count')
plt.grid(True)

# Bottom panel: box plot of the same values.
ay = fig.add_subplot(212)
ay.boxplot(df['value'])
plt.show()
Output:
Show Frequency Table
# Group the cards by value and count the rows in each group.
freq_table = df.groupby(['value'])
print(freq_table.size())
Output:
value
1      4
2      4
3      4
4      4
5      4
6      4
7      4
8      4
9      4
10    16
dtype: int64
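The table above gives counts, while the question asks for relative frequencies. As a small sketch (standard library only, with the card values rebuilt inline rather than read from the CSV), each count can be divided by the 52 cards in the deck:

```python
from collections import Counter

# Card values for one suit (A=1, 2-10, J/Q/K=10), repeated for four suits.
values = ([1] + list(range(2, 11)) + [10, 10, 10]) * 4

counts = Counter(values)
total = len(values)

# Relative frequency = count / total number of cards.
rel_freq = {v: counts[v] / total for v in sorted(counts)}

for v, p in rel_freq.items():
    print(v, counts[v], round(p, 4))
```

Values 1 to 9 each have relative frequency 4/52, and the value 10 has 16/52 because the 10, Jack, Queen and King all count as 10.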
2. Sampling and sampling distribution for the sum and average of the three cards’ values
from random import sample

c1 = []
c2 = []
c3 = []
sum_s = []
average_s = []
for i in range(1000):
    # Draw 3 distinct row positions, i.e. 3 cards without replacement.
    rindex = np.array(sample(range(len(df)), 3))
    dfr = df.iloc[rindex]
    # Column 2 holds the card value.
    c1.append(dfr.iloc[0, 2])
    c2.append(dfr.iloc[1, 2])
    c3.append(dfr.iloc[2, 2])
    sum_s.append(dfr.iloc[0, 2] + dfr.iloc[1, 2] + dfr.iloc[2, 2])
    average_s.append((dfr.iloc[0, 2] + dfr.iloc[1, 2] + dfr.iloc[2, 2]) / 3.0)
sampling_df = pd.DataFrame({'card1': c1, 'card2': c2, 'card3': c3, 'sum_col': sum_s, 'average_col': average_s})
print(sampling_df.head(3))
Output:
   average_col  card1  card2  card3  sum_col
0     7.000000      7     10      4       21
1     5.333333      4     10      2       16
2     7.000000      5      7      9       21
3. Distribution of the card sums. Descriptive statistics for the samples drawn (at least two measures of central tendency and two measures of variability).
print(sampling_df['sum_col'].describe())
Output:
count    1000.000000
mean       19.646000
std         5.328703
min         5.000000
25%        16.000000
50%        20.000000
75%        23.000000
max        30.000000
Name: sum_col, dtype: float64
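The describe() summary reports the mean, median (50%) and standard deviation, but not the mode, range or interquartile range directly. The sketch below is one way to get all of them with the standard library. It runs its own simulation with an arbitrarily chosen seed, so its numbers will not match the output above exactly, and its quartile method may differ slightly from the one pandas uses.

```python
import random
import statistics

values = ([1] + list(range(2, 11)) + [10, 10, 10]) * 4  # 52 card values

rng = random.Random(0)  # fixed seed, chosen arbitrarily, for repeatability
# Each draw takes 3 distinct cards (without replacement); the deck is
# whole again for the next draw.
sums = [sum(rng.sample(values, 3)) for _ in range(1000)]

q1, q2, q3 = statistics.quantiles(sums, n=4)  # quartile cut points

# Measures of central tendency.
print("mean  ", statistics.mean(sums))
print("median", statistics.median(sums))
print("mode  ", statistics.mode(sums))

# Measures of variability.
print("range ", max(sums) - min(sums))
print("IQR   ", q3 - q1)
print("stdev ", statistics.stdev(sums))  # sample SD (n - 1 denominator)
```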
4. Histogram of the sampled card sums and comparison of its shape to the original distribution
fig = plt.figure()

# Top panel: histogram with one bin per possible integer sum (3 to 30).
ay = fig.add_subplot(211)
ay.hist(sampling_df['sum_col'], bins=28, range=[2.5, 30.5], facecolor='g', align='mid')
#ay.xaxis.set_ticks(np.arange(0, 40, 5))
plt.title('Sampling Distribution - SUM')
plt.xlabel('Values(SUM)')
plt.ylabel('Count')

# Bottom panel: box plot of the sampled sums.
ay = fig.add_subplot(212)
ay.boxplot(sampling_df['sum_col'])
plt.show()
Output:
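The distribution of the sampled card sums is roughly bell-shaped, close to a normal distribution, whereas the original distribution of card values is flat across 1 to 9 with a spike at 10. This is in accordance with the Central Limit Theorem: sums of several draws tend toward a normal shape even when the individual values do not follow one.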
5. Descriptive statistics and histogram of the sampling distribution for the average of the three cards’ values
print(sampling_df['average_col'].describe())
Output:
count    1000.000000
mean        6.548667
std         1.776234
min         1.666667
25%         5.333333
50%         6.666667
75%         7.666667
max        10.000000
Name: average_col, dtype: float64
fig = plt.figure()

# Top panel: averages lie between 1 and 10 in steps of 1/3, so use
# 30 bins of width 1/3 over [0.5, 10.5] (the earlier range of 4 to 34
# applied to the sums and would cut off averages below 4).
ay = fig.add_subplot(211)
ay.hist(sampling_df['average_col'], bins=30, range=[0.5, 10.5], facecolor='g', align='mid')
#ay.xaxis.set_ticks(np.arange(0, 40, 5))
plt.title('Sampling Distribution - AVG')
plt.xlabel('Values(AVG)')
plt.ylabel('Count')

# Bottom panel: box plot of the sampled averages.
ay = fig.add_subplot(212)
ay.boxplot(sampling_df['average_col'])
plt.show()
Output:
The shape of the sampling distribution for the average of the three cards’ values is also approximately normal, in accordance with the Central Limit Theorem.
Also, per the Central Limit Theorem, the standard deviation of the sampling distribution of the average is approximately the population SD divided by the square root of the sample size.
So the ratio of the population SD to the SD of the sampling distribution should be close to the square root of the sample size, here √3:
3.183669 / 1.776234 = 1.7924 ≈ √3 ≈ 1.7321
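The ratio is close to √3 but not equal to it. One reason, besides random variation in 1000 samples, is that the three cards are drawn without replacement, so the draws are not independent. For that case the standard result multiplies the variance by a finite population correction, (N - n) / (N - 1). The sketch below computes the theoretical values for this deck; the variable names are this sketch's own.

```python
import math

values = ([1] + list(range(2, 11)) + [10, 10, 10]) * 4
N, n = len(values), 3  # population size and sample size

mu = sum(values) / N
# Population variance (divide by N, not N - 1).
sigma2 = sum((v - mu) ** 2 for v in values) / N

# Finite population correction for sampling without replacement.
fpc = (N - n) / (N - 1)

sd_mean = math.sqrt(sigma2 / n * fpc)  # SD of the average of 3 cards
sd_sum = n * sd_mean                   # SD of the sum of 3 cards

print("population SD      ", round(math.sqrt(sigma2), 4))
print("SD of the average  ", round(sd_mean, 4))
print("SD of the sum      ", round(sd_sum, 4))
print("ratio vs sqrt(n)   ", round(math.sqrt(sigma2) / sd_mean, 4), round(math.sqrt(n), 4))
```

The theoretical SD of the average comes out at about 1.78 and the SD of the sum at about 5.35, both in line with the simulated values reported above.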
6. Within what range will you expect approximately 90% of your draw values to fall? What is the approximate probability that you will get a draw value of at least 20? (For the sampled card sums distribution)
Approximately 90% of the values fall between the 5th and 95th percentiles.
import scipy.stats as sp

# Mean and standard deviation of the sampled card sums. These hard-coded
# values differ slightly from the describe() output above. Note that
# 5.435364 is used as a standard deviation, not a variance.
mean_sum_distribution = 19.481000
std_sum_distribution = 5.435364
# x = z * SD + mean, with z taken at the 5th and 95th percentiles.
p1 = sp.norm.ppf(0.05) * std_sum_distribution + mean_sum_distribution
p2 = sp.norm.ppf(0.95) * std_sum_distribution + mean_sum_distribution
print('It is expected approximately 90% of the draw values will fall between', p1, 'and', p2)
Output: It is expected approximately 90% of the draw values will fall between 10.5406218108 and 28.4213781892
Manual calculation using the Z table (for the sampled card sums distribution):
Z for the 95th percentile ≈ 1.64
Z for the 5th percentile ≈ -1.65
The middle 90% lies between the 5th and 95th percentiles.
Mean = 19.481
Standard deviation = 5.435364
Converting Z values to X (X = Z * SD + Mean):
1.64*5.435364+19.481 = 28.395 ≈ 28.4
-1.65*5.435364+19.481 = 10.513 ≈ 10.51
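The same interval can be cross-checked without SciPy or a printed Z table. This is a minimal sketch using statistics.NormalDist from the Python standard library (available from Python 3.8), with the mean and standard deviation quoted above entered by hand.

```python
from statistics import NormalDist

# Mean and standard deviation quoted in the text for the sampled sums.
mean_sum = 19.481
sd_sum = 5.435364

dist = NormalDist(mu=mean_sum, sigma=sd_sum)

# inv_cdf returns the x value below which the given share of the
# distribution lies, so these are the 5th and 95th percentiles.
low = dist.inv_cdf(0.05)
high = dist.inv_cdf(0.95)

print(round(low, 2), round(high, 2))
```

The small differences from the manual figures come from rounding the Z values to two decimals in the table lookup.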
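The approximate probability that you will get a draw value of at least 20
# Z-score of 20 under the normal approximation (mean 19.481, SD 5.435364).
z_20 = (20 - 19.481) / 5.435364
# P(X <= 20) from the normal CDF; its complement is P(X >= 20).
p_20_atmost = sp.norm.cdf(z_20)
print(z_20)
print('The approximate probability that we get a draw value of at least 20 is', (1 - p_20_atmost))
Output:
0.0954857853126
The approximate probability that we get a draw value of at least 20 is 0.461964490173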
Manual calculation using the Z table (for the sampled card sums distribution):
Z-score = (X - Mu) / Sigma
Z-score for 20 = (20 - 19.481) / 5.435364 = 0.0955
Reading the Z table at Z = 0.09 gives an area of 0.5359 to the left of 20, which is the probability of a draw value below 20.
The probability of a draw value of at least 20 is the remaining area to the right:
1 - 0.5359 = 0.4641 ≈ 0.464
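The figures above rely on the normal approximation, which treats the sum as continuous and symmetric. Since there are only C(52, 3) = 22100 possible three-card hands, the probability can also be computed exactly by listing them all. The sketch below does that with the standard library; it is offered as a cross-check under the same card values, not as the method the project asks for.

```python
from itertools import combinations

values = ([1] + list(range(2, 11)) + [10, 10, 10]) * 4

# Every unordered 3-card hand is equally likely when drawing
# without replacement, so count hands instead of simulating.
hand_sums = [a + b + c for a, b, c in combinations(values, 3)]
total = len(hand_sums)  # C(52, 3)

p_at_least_20 = sum(s >= 20 for s in hand_sums) / total

print(total)
print(round(p_at_least_20, 4))
```

The exact value need not match the normal estimate closely, because the sums are whole numbers (so a sum of exactly 20 carries real probability) and the distribution is not perfectly symmetric.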
References:
Descriptive Statistics Final Project – https://docs.google.com/document/d/1059JMJ9C5dn7vKUrmfWYle57Ai3Uk9PzxPQBGj5drjE/pub?embedded=true
Python For Data Science Cheat Sheet – https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf
R-bloggers – https://www.r-bloggers.com/descriptive-statistics-final-project-with-python-r/
Cheat Sheet for Exploratory Data Analysis in Python – https://www.analyticsvidhya.com/blog/2015/06/infographic-cheat-sheet-data-exploration-python/
Z Table – https://s3.amazonaws.com/udacity-hosted-downloads/ZTable.jpg
Cheat sheet: Data Visualisation in Python – https://www.analyticsvidhya.com/blog/2015/06/data-visualization-in-python-cheat-sheet/