pewanalytics.stats: Statistical Tools
In the pewanalytics.stats module, you’ll find a variety of statistical utilities for weighting, clustering, dimensionality reduction, and inter-rater reliability.
Clustering
The pewanalytics.stats.clustering submodule contains several functions for extracting clusters from your data.
Functions:
- compute_kmeans_clusters: Uses K-Means to cluster an arbitrary set of features.
- compute_hdbscan_clusters: Uses HDBSCAN* to identify the best number of clusters and map each unit to one.
- compute_kmeans_clusters(features, k=10, return_score=False)[source]
Uses K-Means to cluster an arbitrary set of features. This function expects input data where the rows are units and columns are features.
- Parameters
features – A TF-IDF sparse matrix or pandas.DataFrame
k (int) – The number of clusters to extract
return_score (bool) – If True, the function returns a tuple with the cluster assignments and the silhouette score of the clustering; otherwise the function just returns a list of cluster labels for each row. (Default=False)
- Returns
A list with the cluster label for each row, or a tuple containing the labels followed by the silhouette score of the K-Means model.
- Return type
list
Usage:
from pewanalytics.stats.clustering import compute_kmeans_clusters
from sklearn import datasets
import pandas as pd

# The iris dataset is a common example dataset included in scikit-learn with 3 main clusters
# Let's see if we can find them
df = pd.DataFrame(datasets.load_iris().data)

>>> df['cluster'] = compute_kmeans_clusters(df, k=3)
KMeans: n_clusters 3, score is 0.5576853964035263

>>> df['cluster'].value_counts()
1    62
0    50
2    38
Name: cluster, dtype: int64
- compute_hdbscan_clusters(features, min_cluster_size=100, min_samples=1, **kwargs)[source]
Uses HDBSCAN* to identify the best number of clusters and map each unit to one. This function expects input data where the rows are units and columns are features. Additional keyword arguments are passed to HDBSCAN. Check out the official documentation for more: https://hdbscan.readthedocs.io/en/latest
- Parameters
features – A TF-IDF sparse matrix or pandas.DataFrame
min_cluster_size (int) – The minimum number of documents/units that can exist in a cluster.
min_samples (int) – Minimum number of samples to draw (see HDBSCAN documentation for more)
kwargs – Additional HDBSCAN parameters: https://hdbscan.readthedocs.io/en/latest/parameter_selection.html
- Returns
A list with the cluster label for each row
Usage:
from pewanalytics.stats.clustering import compute_hdbscan_clusters
from sklearn import datasets
import pandas as pd

df = pd.DataFrame(datasets.load_iris().data)

>>> df['cluster'] = compute_hdbscan_clusters(df, min_cluster_size=10)
HDBSCAN: n_clusters 2

>>> df['cluster'].value_counts()
1    100
0     50
Name: cluster, dtype: int64
Dimensionality Reduction
The pewanalytics.stats.dimensionality_reduction submodule contains functions for collapsing your data into underlying dimensions using methods like PCA and correspondence analysis.
Functions:
- get_pca: Performs PCA on a set of features.
- get_lsa: Performs LSA on a set of features.
- correspondence_analysis: Performs correspondence analysis on a set of features.
- get_pca(features, feature_names=None, k=20)[source]
Performs PCA on a set of features. This function expects input data where the rows are units and columns are features.
For more information about how PCA is implemented, visit the Scikit-Learn Documentation.
- Parameters
features – A pandas.DataFrame or sparse matrix where rows are units/observations and columns are features
feature_names (list) – An optional list of feature names (for sparse matrices)
k (int) – Number of dimensions to extract
- Returns
A tuple of two pandas.DataFrames, (features x components, units x components)
- Return type
tuple
Usage:
from pewanalytics.stats.dimensionality_reduction import get_pca
from sklearn import datasets
import pandas as pd

df = pd.DataFrame(datasets.load_iris().data)

>>> feature_weights, df_reduced = get_pca(df, k=2)
Decomposition explained variance ratio: 0.977685206318795
Top features:
Component 0: [2 0 3 1]
Component 1: [1 0 3 2]

>>> feature_weights
      pca_0     pca_1
0  0.361387  0.656589
1 -0.084523  0.730161
2  0.856671 -0.173373
3  0.358289 -0.075481

>>> df_reduced.head()
      pca_0     pca_1    pca
0 -2.684126  0.319397  pca_1
1 -2.714142 -0.177001  pca_1
2 -2.888991 -0.144949  pca_1
3 -2.745343 -0.318299  pca_1
4 -2.728717  0.326755  pca_1
- get_lsa(features, feature_names=None, k=20)[source]
Performs LSA on a set of features. This function expects input data where the rows are units and columns are features.
For more information about how LSA is implemented, visit the Scikit-Learn Documentation.
- Parameters
features – A pandas.DataFrame or sparse matrix where rows are units/observations and columns are features
feature_names (list) – An optional list of feature names (for sparse matrices)
k (int) – Number of dimensions to extract
- Returns
A tuple of two pandas.DataFrames, (features x components, documents x components)
- Return type
tuple
Usage:
from pewanalytics.stats.dimensionality_reduction import get_lsa
from sklearn import datasets
import pandas as pd

df = pd.DataFrame(datasets.load_iris().data)

>>> feature_weights, df_reduced = get_lsa(df, k=2)
Decomposition explained variance ratio: 0.9772093692426493
Top features:
Component 0: [0 2 1 3]
Component 1: [1 0 3 2]

>>> feature_weights
      lsa_0     lsa_1
0  0.751108  0.284175
1  0.380086  0.546745
2  0.513009 -0.708665
3  0.167908 -0.343671

>>> df_reduced.head()
      lsa_0     lsa_1    lsa
0  5.912747  2.302033  lsa_0
1  5.572482  1.971826  lsa_0
2  5.446977  2.095206  lsa_0
3  5.436459  1.870382  lsa_0
4  5.875645  2.328290  lsa_0
- correspondence_analysis(edges, n=1)[source]
Performs correspondence analysis on a set of features.
This is most useful in the context of network analysis, where you might wish, for example, to identify the underlying dimension in a network of Twitter users using a matrix representing whether or not they follow one another. When news and political accounts are included, the underlying dimension often appears to approximate the left-right political spectrum.
- Parameters
edges – An NxN pandas.DataFrame where both the rows and columns are “nodes” and the values are some sort of closeness or similarity measure (like a cosine similarity matrix)
n (int) – The number of dimensions to extract
- Returns
A pandas.DataFrame where rows are the units and the columns correspond to the extracted dimensions.
Usage:
from pewanalytics.stats.dimensionality_reduction import correspondence_analysis
import nltk
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("inaugural")
df = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)}
    for fileid in nltk.corpus.inaugural.fileids()
])
vec = TfidfVectorizer(min_df=10, max_df=.9).fit(df['text'])
tfidf = vec.transform(df['text'])
cosine_similarities = linear_kernel(tfidf)
matrix = pd.DataFrame(cosine_similarities, columns=df['speech'])

# Looks like the main source of variation in the language of inaugural speeches is time!
>>> mca = correspondence_analysis(matrix)

>>> mca.sort_values("mca_1").head()
                node     mca_1
57  1993-Clinton.txt -0.075508
56    2017-Trump.txt -0.068168
55  1997-Clinton.txt -0.061567
54    1973-Nixon.txt -0.060698
53     1989-Bush.txt -0.056305

>>> mca.sort_values("mca_1").tail()
               node     mca_1
4    1877-Hayes.txt  0.040037
3   1817-Monroe.txt  0.040540
2     1845-Polk.txt  0.042847
1   1849-Taylor.txt  0.050937
0  1829-Jackson.txt  0.056201
Inter-Rater Reliability
The pewanalytics.stats.irr submodule contains functions for computing measures of inter-rater reliability and model performance, including Cohen’s Kappa, Krippendorff’s Alpha, precision, recall, and much more.
Functions:
- kappa_sample_size_power: Python translation of the N.cohen.kappa function from the irr R package.
- kappa_sample_size_CI: Helps determine the required document sample size to confirm that Cohen's Kappa between coders is at or above a minimum threshold.
- compute_scores: Computes a variety of inter-rater reliability scores, including Cohen's kappa, Krippendorff's alpha, precision, and recall.
- compute_overall_scores: Computes overall inter-rater reliability scores (Krippendorff's Alpha and Fleiss' Kappa).
- compute_overall_scores_multivariate: Computes overall inter-rater reliability scores (Krippendorff's Alpha and Fleiss' Kappa), allowing for multiple coders, code values, and variables.
- kappa_sample_size_power(rate1, rate2, k1, k0, alpha=0.05, power=0.8, twosided=False)[source]
Python translation of the N.cohen.kappa function from the irr R package.
Source: https://cran.r-project.org/web/packages/irr/irr.pdf
- Parameters
rate1 (float) – The probability that the first rater will record a positive diagnosis
rate2 (float) – The probability that the second rater will record a positive diagnosis
k1 (float) – The true Cohen’s Kappa statistic
k0 (float) – The value of kappa under the null hypothesis
alpha (float) – Type I error of test
power (float) – The desired power to detect the difference between true kappa and hypothetical kappa
twosided (bool) – Set this to True if the test is two-sided
- Returns
Returns the required sample size
- Return type
int
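Usage (a minimal sketch; the rates and kappa values below are hypothetical, and the output is omitted because it depends on the inputs):
from pewanalytics.stats.irr import kappa_sample_size_power

# Suppose both coders flag roughly 30% of documents as positive, we believe
# the true kappa is around 0.8, and we want enough documents to reject the
# null hypothesis that kappa is only 0.6
required_n = kappa_sample_size_power(0.3, 0.3, 0.8, 0.6, alpha=0.05, power=0.8)
print(required_n)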
- kappa_sample_size_CI(kappa0, kappaL, props, kappaU=None, alpha=0.05)[source]
Helps determine the required document sample size to confirm that Cohen’s Kappa between coders is at or above a minimum threshold. Useful in situations where multiple coders code a set of documents for a binary outcome.
This function takes the observed kappa and proportion of positive cases from the sample, along with a lower-bound for the minimum acceptable kappa, and returns the sample size required to confirm that the coders’ agreement is truly above that minimum level of kappa with 95% certainty. If the current sample size is below the required sample size returned by this function, it can provide a rough estimate of how many additional documents need to be coded - assuming that the coders continue agreeing and observing positive cases at the same rate.
Translated from the kappaSize R package’s CIBinary function: https://github.com/cran/kappaSize/blob/master/R/CIBinary.R
- Parameters
kappa0 (float) – The preliminary value of kappa
kappaL (float) – The desired expected lower bound for a two-sided 100(1 - alpha) % confidence interval for kappa. Alternatively, if kappaU is set to NA, the procedure produces the number of required subjects for a one-sided confidence interval
props (float) – The anticipated prevalence of the desired trait
kappaU (float) – The desired expected upper confidence limit for kappa
alpha (float) – The desired type I error rate
- Returns
Returns the required sample size
Usage:
from pewanalytics.stats.irr import kappa_sample_size_CI

observed_kappa = 0.8
desired_kappa = 0.7
observed_proportion = 0.5

>>> kappa_sample_size_CI(observed_kappa, desired_kappa, observed_proportion)
140
- compute_scores(coder_df, coder1, coder2, outcome_column, document_column, coder_column, weight_column=None, pos_label=None)[source]
Computes a variety of inter-rater reliability scores, including Cohen’s kappa, Krippendorff’s alpha, precision, and recall. The input data must consist of a pandas.DataFrame with the following columns:
A column with values that indicate the coder (like a name)
A column with values that indicate the document (like an ID)
A column with values that indicate the code value
(Optional) A column with document weights
This function will return a pandas.DataFrame with agreement scores between the two specified coders.
- Parameters
coder_df (pandas.DataFrame) – A pandas.DataFrame of codes
coder1 (str or int) – The value in coder_column for rows corresponding to the first coder
coder2 (str or int) – The value in coder_column for rows corresponding to the second coder
outcome_column (str) – The column that contains the codes
document_column (str) – The column that contains IDs for the documents
coder_column (str) – The column containing values that indicate which coder assigned the code
weight_column (str) – The column that contains sampling weights
pos_label (str or int) – The value indicating a positive label (optional)
- Returns
A dictionary of scores
- Return type
dict
Note
If using a multi-class (non-binary) code, some scores may come back null or not compute as expected. We recommend running the function separately for each specific code value as a binary flag by providing each unique value to the pos_label argument. If pos_label is not provided for multi-class codes, this function will attempt to compute scores based on support-weighted averages.
Usage:
from pewanalytics.stats.irr import compute_scores
import pandas as pd

df = pd.DataFrame([
    {"coder": "coder1", "document": 1, "code": "2"},
    {"coder": "coder2", "document": 1, "code": "2"},
    {"coder": "coder1", "document": 2, "code": "1"},
    {"coder": "coder2", "document": 2, "code": "2"},
    {"coder": "coder1", "document": 3, "code": "0"},
    {"coder": "coder2", "document": 3, "code": "0"},
])

>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder")
{'coder1': 'coder1', 'coder2': 'coder2', 'n': 3, 'outcome_column': 'code', 'pos_label': None, 'coder1_mean_unweighted': 1.0, 'coder1_std_unweighted': 0.5773502691896257, 'coder2_mean_unweighted': 1.3333333333333333, 'coder2_std_unweighted': 0.6666666666666666, 'alpha_unweighted': 0.5454545454545454, 'accuracy': 0.6666666666666666, 'f1': 0.5555555555555555, 'precision': 0.5, 'recall': 0.6666666666666666, 'precision_recall_min': 0.5, 'matthews_corrcoef': 0.6123724356957946, 'roc_auc': None, 'pct_agree_unweighted': 0.6666666666666666}

>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder", pos_label="0")
{'coder1': 'coder1', 'coder2': 'coder2', 'n': 3, 'outcome_column': 'code', 'pos_label': '0', 'coder1_mean_unweighted': 0.3333333333333333, 'coder1_std_unweighted': 0.3333333333333333, 'coder2_mean_unweighted': 0.3333333333333333, 'coder2_std_unweighted': 0.3333333333333333, 'alpha_unweighted': 1.0, 'cohens_kappa': 1.0, 'accuracy': 1.0, 'f1': 1.0, 'precision': 1.0, 'recall': 1.0, 'precision_recall_min': 1.0, 'matthews_corrcoef': 1.0, 'roc_auc': 1.0, 'pct_agree_unweighted': 1.0}

>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder", pos_label="1")
{'coder1': 'coder1', 'coder2': 'coder2', 'n': 3, 'outcome_column': 'code', 'pos_label': '1', 'coder1_mean_unweighted': 0.3333333333333333, 'coder1_std_unweighted': 0.3333333333333333, 'coder2_mean_unweighted': 0.0, 'coder2_std_unweighted': 0.0, 'alpha_unweighted': 0.0, 'cohens_kappa': 0.0, 'accuracy': 0.6666666666666666, 'f1': 0.0, 'precision': 0.0, 'recall': 0.0, 'precision_recall_min': 0.0, 'matthews_corrcoef': 1.0, 'roc_auc': None, 'pct_agree_unweighted': 0.6666666666666666}

>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder", pos_label="2")
{'coder1': 'coder1', 'coder2': 'coder2', 'n': 3, 'outcome_column': 'code', 'pos_label': '2', 'coder1_mean_unweighted': 0.3333333333333333, 'coder1_std_unweighted': 0.3333333333333333, 'coder2_mean_unweighted': 0.6666666666666666, 'coder2_std_unweighted': 0.3333333333333333, 'alpha_unweighted': 0.4444444444444444, 'cohens_kappa': 0.3999999999999999, 'accuracy': 0.6666666666666666, 'f1': 0.6666666666666666, 'precision': 0.5, 'recall': 1.0, 'precision_recall_min': 0.5, 'matthews_corrcoef': 0.5, 'roc_auc': 0.75, 'pct_agree_unweighted': 0.6666666666666666}
- compute_overall_scores(coder_df, outcome_column, document_column, coder_column)[source]
Computes overall inter-rater reliability scores (Krippendorff’s Alpha and Fleiss’ Kappa). Allows for more than two coders and code values. The input data must consist of a pandas.DataFrame with the following columns:
A column with values that indicate the coder (like a name)
A column with values that indicate the document (like an ID)
A column with values that indicate the code value
- Parameters
coder_df (pandas.DataFrame) – A pandas.DataFrame of codes
outcome_column (str) – The column that contains the codes
document_column (str) – The column that contains IDs for the documents
coder_column (str) – The column containing values that indicate which coder assigned the code
- Returns
A dictionary containing the scores
- Return type
dict
Usage:
from pewanalytics.stats.irr import compute_overall_scores
import pandas as pd

df = pd.DataFrame([
    {"coder": "coder1", "document": 1, "code": "2"},
    {"coder": "coder2", "document": 1, "code": "2"},
    {"coder": "coder1", "document": 2, "code": "1"},
    {"coder": "coder2", "document": 2, "code": "2"},
    {"coder": "coder1", "document": 3, "code": "0"},
    {"coder": "coder2", "document": 3, "code": "0"},
])

>>> compute_overall_scores(df, "code", "document", "coder")
{'alpha': 0.5454545454545454, 'fleiss_kappa': 0.4545454545454544}
- compute_overall_scores_multivariate(coder_df, document_column, coder_column, outcome_columns)[source]
Computes overall inter-rater reliability scores (Krippendorff’s Alpha and Fleiss’ Kappa). Allows for more than two coders, code values, AND variables. All variables and values will be converted into a matrix of dummy variables, and Alpha and Kappa will be computed using four different distance metrics:
Discrete agreement (exact agreement across all outcome columns)
Jaccard coefficient
MASI distance
Cosine similarity
The input data must consist of a pandas.DataFrame with the following columns:
A column with values that indicate the coder (like a name)
A column with values that indicate the document (like an ID)
One or more columns with values that indicate code values
This code was adapted from a very helpful StackExchange post: https://stats.stackexchange.com/questions/511927/interrater-reliability-with-multi-rater-multi-label-dataset
- Parameters
coder_df (pandas.DataFrame) – A pandas.DataFrame of codes
document_column (str) – The column that contains IDs for the documents
coder_column (str) – The column containing values that indicate which coder assigned the code
outcome_columns (list) – The columns that contain the codes
- Returns
A dictionary containing the scores
- Return type
dict
Usage:
from pewanalytics.stats.irr import compute_overall_scores_multivariate
import pandas as pd

coder_df = pd.DataFrame([
    {"coder": "coder1", "document": 1, "code": "2"},
    {"coder": "coder2", "document": 1, "code": "2"},
    {"coder": "coder1", "document": 2, "code": "1"},
    {"coder": "coder2", "document": 2, "code": "2"},
    {"coder": "coder1", "document": 3, "code": "0"},
    {"coder": "coder2", "document": 3, "code": "0"},
])

>>> compute_overall_scores_multivariate(coder_df, 'document', 'coder', ["code"])
{'fleiss_kappa_discrete': 0.4545454545454544, 'fleiss_kappa_jaccard': 0.49999999999999994, 'fleiss_kappa_masi': 0.49999999999999994, 'fleiss_kappa_cosine': 0.49999999999999994, 'alpha_discrete': 0.5454545454545454, 'alpha_jaccard': 0.5454545454545454, 'alpha_masi': 0.5454545454545454, 'alpha_cosine': 0.5454545454545454}

coder_df = pd.DataFrame([
    {"coder": "coder1", "document": 1, "code1": "2", "code2": "1"},
    {"coder": "coder2", "document": 1, "code1": "2", "code2": "1"},
    {"coder": "coder1", "document": 2, "code1": "1", "code2": "0"},
    {"coder": "coder2", "document": 2, "code1": "2", "code2": "1"},
    {"coder": "coder1", "document": 3, "code1": "0", "code2": "0"},
    {"coder": "coder2", "document": 3, "code1": "0", "code2": "0"},
])

>>> compute_overall_scores_multivariate(coder_df, 'document', 'coder', ["code1", "code2"])
{'fleiss_kappa_discrete': 0.4545454545454544, 'fleiss_kappa_jaccard': 0.49999999999999994, 'fleiss_kappa_masi': 0.49999999999999994, 'fleiss_kappa_cosine': 0.49999999999999994, 'alpha_discrete': 0.5454545454545454, 'alpha_jaccard': 0.5161290322580645, 'alpha_masi': 0.5361781076066792, 'alpha_cosine': 0.5}
Mutual Information
The pewanalytics.stats.mutual_info submodule provides a function for extracting pointwise mutual information for features in your data based on a binary split into two classes. This can be a great method for identifying features that are most distinctive of one group versus another.
Functions:
- compute_mutual_info: Computes pointwise mutual information for a set of observations partitioned into two groups.
- mutual_info_bar_plot: Takes a mutual information table generated by compute_mutual_info and generates a bar plot of top features.
- mutual_info_scatter_plot: Takes a mutual information table generated by compute_mutual_info and generates a scatter plot of top features.
- compute_mutual_info(y, x, weights=None, col_names=None, l=0, normalize=True)[source]
Computes pointwise mutual information for a set of observations partitioned into two groups.
- Parameters
y – An array or, preferably, a pandas.Series
x – A matrix, pandas.DataFrame, or preferably a scipy.sparse.csr_matrix
weights – (Optional) An array of weights corresponding to each observation
col_names (list) – The feature names associated with the columns in matrix ‘x’
l (int or float) – An optional Laplace smoothing parameter
normalize (bool) – Toggle normalization on or off (to control for feature prevalence); on by default
- Returns
A pandas.DataFrame of features with a variety of computed metrics including mutual information.
The function expects y to correspond to a list or series of values indicating which partition an observation belongs to. y must be a binary flag. x is a set of features (either a pandas.DataFrame or sparse matrix) where the rows correspond to observations and the columns represent the presence of features (you can technically run this using non-binary features but the results will not be as readily interpretable). The function returns a pandas.DataFrame of metrics computed for each feature, including the following columns:
MI1: The feature's mutual information for the positive class
MI0: The feature's mutual information for the negative class
total: The total number of times a feature appeared
total_pos_with_term: The total number of times a feature appeared in positive cases
total_neg_with_term: The total number of times a feature appeared in negative cases
total_pos_neg_with_term_diff: The raw difference in the number of times a feature appeared in positive cases relative to negative cases
pct_pos_with_term: The proportion of positive cases that had the feature
pct_neg_with_term: The proportion of negative cases that had the feature
pct_pos_neg_with_term_ratio: A likelihood ratio indicating the degree to which a positive case was more likely to have the feature than a negative case
pct_term_pos: Of the cases that had a feature, the proportion that were in the positive class
pct_term_neg: Of the cases that had a feature, the proportion that were in the negative class
pct_term_pos_neg_diff: The percentage point difference between the proportion of cases with the feature that were positive vs. negative
pct_term_pos_neg_ratio: A likelihood ratio indicating the degree to which a feature was more likely to appear in a positive case relative to a negative one (may not be meaningful when classes are imbalanced)
Note
Note that pct_term_pos and pct_term_neg may not be directly comparable if classes are imbalanced, and in such cases a pct_term_pos_neg_diff above zero or a pct_term_pos_neg_ratio above 1 may not indicate a true association with the positive class if positive cases outnumber negative ones.
Note
Mutual information can be a difficult metric to explain to others. We've found that pct_pos_neg_with_term_ratio can serve as a more interpretable alternative method for identifying meaningful differences between groups.
Usage:
from pewanalytics.stats.mutual_info import compute_mutual_info
import nltk
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("inaugural")
df = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)}
    for fileid in nltk.corpus.inaugural.fileids()
])
df['year'] = df['speech'].map(lambda x: int(x.split("-")[0]))
df['21st_century'] = df['year'].map(lambda x: 1 if x >= 2000 else 0)
vec = TfidfVectorizer(min_df=10, max_df=.9).fit(df['text'])
tfidf = vec.transform(df['text'])

# Here are the terms most distinctive of inaugural addresses in the 21st century vs. years prior
>>> results = compute_mutual_info(df['21st_century'], tfidf, col_names=vec.get_feature_names())

>>> results.sort_values("MI1", ascending=False).index[:25]
Index(['america', 'thank', 'bless', 'schools', 'ideals', 'americans',
       'meaning', 'you', 'move', 'across', 'courage', 'child', 'birth',
       'generation', 'families', 'build', 'hard', 'promise', 'choice',
       'women', 'guided', 'words', 'blood', 'dignity', 'because'],
      dtype='object')
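If mutual information itself is hard to communicate, the note above suggests pct_pos_neg_with_term_ratio as a more interpretable alternative; a minimal sketch continuing from the example above (output omitted):
# Terms that were disproportionately likely to appear in 21st-century addresses
results.sort_values("pct_pos_neg_with_term_ratio", ascending=False).index[:25]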
- mutual_info_bar_plot(mutual_info, filter_col='MI1', top_n=50, x_col='pct_term_pos_neg_ratio', color='grey', title=None, width=10)[source]
Takes a mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info() and generates a bar plot of top features. Allows for an easy visualization of feature differences. Can subsequently call plt.show() or plt.savefig() to display or save the plot.
- Parameters
mutual_info – A mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info()
filter_col (str) – The column to use when selecting top features; sorts in descending order and picks the top top_n
top_n (int) – The number of features to display
x_col (str) – The column by which to sort the final set of top features (after they have been selected by filter_col)
color (str) – The color of the bars
title (str) – The title of the plot
width (int) – The width of the plot
- Returns
A Matplotlib figure, which you can display via plt.show() or alternatively save to a file via plt.savefig(FILEPATH)
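Usage (a minimal sketch that reuses the results table from the compute_mutual_info example above; the resulting figure is not shown here):
from pewanalytics.stats.mutual_info import mutual_info_bar_plot
import matplotlib.pyplot as plt

# Plot the 25 terms with the highest mutual information for the positive class,
# sorted by their likelihood ratio
fig = mutual_info_bar_plot(results, filter_col="MI1", top_n=25, x_col="pct_term_pos_neg_ratio")
plt.show()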
- mutual_info_scatter_plot(mutual_info, filter_col='MI1', top_n=50, x_col='pct_term_pos_neg_ratio', xlabel=None, scale_x_even=True, y_col='MI1', ylabel=None, scale_y_even=True, color='grey', color_col='MI1', size_col='pct_pos_with_term', title=None, figsize=(10, 10), adjust_text=False)[source]
Takes a mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info() and generates a scatter plot of top features. The names of the features will be displayed with varying colors and sizes depending on the variables specified in color_col and size_col. Allows for an easy visualization of feature differences. Can subsequently call plt.show() or plt.savefig() to display or save the plot.
- Parameters
mutual_info – A mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info()
filter_col (str) – The column to use when selecting top features; sorts in descending order and picks the top top_n
top_n (int) – The number of features to display
x_col (str) – The column to use as the x-axis
xlabel (str) – Label for the x-axis
scale_x_even (bool) – If True, set values to their ordered rank (allows for even spacing)
y_col (str) – The column to use as the y-axis
ylabel (str) – Label for the y-axis
scale_y_even (bool) – If True, set values to their ordered rank (allows for even spacing)
color (str) – The color for the features
color_col (str) – The column to use when shading the features
size_col (str) – The column to use to size the features
title (str) – The title of the plot
figsize (tuple) – The size of the plot (tuple)
adjust_text (bool) – If True, attempts to adjust the text so it doesn’t overlap
- Returns
A Matplotlib figure, which you can display via plt.show() or alternatively save to a file via plt.savefig(FILEPATH)
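Usage (a minimal sketch that reuses the results table from the compute_mutual_info example above; the resulting figure is not shown here):
from pewanalytics.stats.mutual_info import mutual_info_scatter_plot
import matplotlib.pyplot as plt

# Scatter the top 25 terms, shading them by mutual information and sizing them
# by how common they are among positive cases
fig = mutual_info_scatter_plot(
    results,
    filter_col="MI1",
    top_n=25,
    x_col="pct_term_pos_neg_ratio",
    color_col="MI1",
    size_col="pct_pos_with_term",
)
plt.show()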
Sampling
The pewanalytics.stats.sampling submodule contains utilities for extracting and weighting samples based on a known sampling frame.
Functions:
- compute_sample_weights_from_frame: Takes two pandas.DataFrames and computes sampling weights for the second one, based on the first.
- compute_balanced_sample_weights: Takes a pandas.DataFrame and one or more column names and computes weights such that every unique combination of values in the weighting columns is balanced.
Classes:
- SampleExtractor: A helper class for extracting samples using various sampling methods.
- compute_sample_weights_from_frame(frame, sample, weight_vars)[source]
Takes two pandas.DataFrames and computes sampling weights for the second one, based on the first. The first pandas.DataFrame should be equivalent to the population that the second pandas.DataFrame, a sample, was drawn from. Weights will be calculated based on the differences in the distribution of one or more variables specified in weight_vars (these should be the names of columns). Returns a pandas.Series equal in length to the sample with the computed weights.
- Parameters
frame – A pandas.DataFrame (must contain all of the columns specified in weight_vars)
sample – A pandas.DataFrame (must contain all of the columns specified in weight_vars)
weight_vars (list) – The names of the columns to use when computing weights.
- Returns
A pandas.Series containing the weights for each row in the sample
Usage:
from pewanalytics.stats.sampling import compute_sample_weights_from_frame
import nltk
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("inaugural")
frame = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)}
    for fileid in nltk.corpus.inaugural.fileids()
])

# Let's grab a sample of speeches - some that mention specific terms, and an additional random sample
frame['economy'] = frame['text'].str.contains("economy").astype(int)
frame['health'] = frame['text'].str.contains("health").astype(int)
frame['immigration'] = frame['text'].str.contains("immigration").astype(int)
frame['education'] = frame['text'].str.contains("education").astype(int)
sample = pd.concat([
    frame[frame['economy']==1].sample(5),
    frame[frame['health']==1].sample(5),
    frame[frame['immigration']==1].sample(5),
    frame[frame['education']==1].sample(5),
    frame.sample(5)
])

# Now we can get the sampling weights to adjust it back to the population based on those variables
>>> sample['weight'] = compute_sample_weights_from_frame(frame, sample, ["economy", "health", "immigration", "education"])

>>> sample
               speech                                               text  economy  health  immigration  education  count    weight
7     1817-Monroe.txt  I should be destitute of feeling if I was not ...        1       1            0          0      1  1.005747
11   1833-Jackson.txt  Fellow citizens, the will of the American peop...        1       0            0          0      1  2.370690
34  1925-Coolidge.txt     My countrymen, no one can contemplate curre...        1       0            1          1      1  0.344828
35    1929-Hoover.txt  My Countrymen: This occasion is not alone the ...        1       1            0          1      1  0.538793
28  1901-McKinley.txt  My fellow-citizens, when we assembled here on ...        1       0            0          0      1  2.370690
- compute_balanced_sample_weights(sample, weight_vars, weight_column=None)[source]
Takes a pandas.DataFrame and one or more column names (weight_vars) and computes weights such that every unique combination of values in the weighting columns is balanced (when weighted, the sum of the observations with each combination will be equal to one another). Useful for balancing important groups in training datasets, etc.
sample – A pandas.DataFrame (must contain all of the columns specified in weight_vars)
weight_vars (list) – The names of the columns to use when computing weights.
weight_column (str) – An optional column containing existing weights, which can be factored into the new weights.
- Returns
A pandas.Series containing the weights for each row in the sample
Note
All weight variables must be binary flags (1 or 0); if you want to weight using a non-binary variable, you should convert it into a set of dummy variables and then pass those in as multiple columns.
Usage:
from pewanalytics.stats.sampling import compute_balanced_sample_weights
import pandas as pd

# Let's say we have a set of tweets from members of Congress
df = pd.DataFrame([
    {"politician_id": 1, "party": "R", "tweet": "Example document"},
    {"politician_id": 1, "party": "R", "tweet": "Example document"},
    {"politician_id": 2, "party": "D", "tweet": "Example document"},
    {"politician_id": 2, "party": "D", "tweet": "Example document"},
    {"politician_id": 3, "party": "D", "tweet": "Example document"},
])
df['is_republican'] = (df['party']=="R").astype(int)

# We can balance the parties like so:
>>> df['weight'] = compute_balanced_sample_weights(df, ["is_republican"])

>>> df
   politician_id party             tweet  is_rep    weight  is_republican
0              1     R  Example document       1  1.250000              1
1              1     R  Example document       1  1.250000              1
2              2     D  Example document       0  0.833333              0
3              2     D  Example document       0  0.833333              0
4              3     D  Example document       0  0.833333              0
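If you want to balance on a non-binary (or string-valued) column, the note above recommends converting it to dummy variables first; a minimal sketch building on the example above:
# Convert the string-valued 'party' column into binary dummy flags
dummies = pd.get_dummies(df['party'], prefix='party').astype(int)  # creates party_D and party_R
df = pd.concat([df, dummies], axis=1)

# Pass the dummy columns in as the weighting variables
df['weight'] = compute_balanced_sample_weights(df, list(dummies.columns))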
- class SampleExtractor(df, id_col, verbose=False, seed=None)[source]
A helper class for extracting samples using various sampling methods.
- Parameters
df (pandas.DataFrame) – The sampling frame
id_col (str) – Column in the pandas.DataFrame to be used as the unique ID of observations
verbose (bool) – Whether or not to print information during the sampling process (default=False)
seed (int) – Random seed (optional)
Methods:
- extract(sample_size[, sampling_strategy, ...]): Extract a sample from a pandas.DataFrame using one of several sampling methods.
- extract(sample_size, sampling_strategy='random', stratify_by=None)[source]
Extract a sample from a pandas.DataFrame using one of the following methods:
all: Returns all of the IDs
random: Returns a random sample
stratify: Proportional stratification, method from Kish, Leslie. “Survey sampling.” (1965). Chapter 4.
stratify_even: Sample evenly from each strata (will obviously not be representative)
stratify_guaranteed: Proportional stratification, but the sample is guaranteed to contain at least one observation from each strata (if sample size is small and/or there are many small strata, the resulting sample may be far from representative)
- Parameters
sample_size (int) – The desired size of the sample
sampling_strategy (str) – The method to be used to extract samples. Options are: all, random, stratify, stratify_even, stratify_guaranteed
stratify_by (str, list) – Optional name of a column or list of columns in the pandas.DataFrame to stratify on
- Returns
A list of IDs reflecting the observations selected from the pandas.DataFrame during sampling
- Return type
list
Usage:
from pewanalytics.stats.sampling import SampleExtractor
import nltk
import pandas as pd

nltk.download("inaugural")
frame = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)}
    for fileid in nltk.corpus.inaugural.fileids()
])
frame["century"] = frame['speech'].map(lambda x: "{}00".format(x.split("-")[0][:2]))

>>> sampler = SampleExtractor(frame, "speech", seed=42)

>>> sample_index = sampler.extract(12, sampling_strategy="random")
>>> frame[frame["speech"].isin(sample_index)]['century'].value_counts()
1900    6
1800    5
1700    1
Name: century, dtype: int64

>>> sample_index = sampler.extract(12, sampling_strategy="stratify", stratify_by=['century'])
>>> frame[frame["speech"].isin(sample_index)]['century'].value_counts()
1800    5
1900    5
2000    1
1700    1
Name: century, dtype: int64

>>> sample_index = sampler.extract(12, sampling_strategy="stratify_even", stratify_by=['century'])
>>> frame[frame["speech"].isin(sample_index)]['century'].value_counts()
1800    3
2000    3
1700    3
1900    3
Name: century, dtype: int64

>>> sample_index = sampler.extract(12, sampling_strategy="stratify_guaranteed", stratify_by=['century'])
>>> frame[frame["speech"].isin(sample_index)]['century'].value_counts()
1900    5
1800    4
1700    2
2000    1
Name: century, dtype: int64