pewanalytics.stats: Statistical Tools

The pewanalytics.stats module provides statistical utilities for clustering, dimensionality reduction, inter-rater reliability, mutual information, and sample weighting.

Clustering

The pewanalytics.stats.clustering submodule contains several functions for extracting clusters from your data.

Functions:

`compute_kmeans_clusters`(features[, k, ...])	Uses K-Means to cluster an arbitrary set of features.
`compute_hdbscan_clusters`(features[, ...])	Uses HDBSCAN* to identify the best number of clusters and map each unit to one.

compute_kmeans_clusters(features, k=10, return_score=False)

Uses K-Means to cluster an arbitrary set of features. This function expects input data where the rows are units and columns are features.

Parameters

features – TF-IDF sparse matrix or pandas.DataFrame
k (int) – The number of clusters to extract
return_score (bool) – If True, the function returns a tuple with the cluster assignments and the silhouette score of the clustering; otherwise the function just returns a list of cluster labels for each row. (Default=False)

Returns

A list with the cluster label for each row, or a tuple containing the labels followed by the silhouette score of the K-Means model.

Return type

list

Usage:

from pewanalytics.stats.clustering import compute_kmeans_clusters
from sklearn import datasets
import pandas as pd

# The iris dataset is a common example dataset included in scikit-learn with 3 main clusters
# Let's see if we can find them
df = pd.DataFrame(datasets.load_iris().data)

>>> df['cluster'] = compute_kmeans_clusters(df, k=3)
KMeans: n_clusters 3, score is 0.5576853964035263

>>> df['cluster'].value_counts()
1    62
0    50
2    38
Name: cluster, dtype: int64
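
The silhouette score that return_score=True adds to the result summarizes how well separated the clusters are: values near 1 indicate tight, well-separated clusters, and values near 0 indicate overlapping ones. As a rough illustration only, the pure-Python sketch below computes a mean silhouette coefficient using Euclidean distance. The helper name `silhouette` and the toy data are hypothetical and are not part of pewanalytics; the library's own computation may differ in its details.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (illustrative helper)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from this point to every other point, grouped by cluster label
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(math.dist(p, q))
        own = by_cluster.pop(label, [])
        if not own or not by_cluster:
            # Simplifying assumption: score 0 for singleton clusters or a single cluster
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Two well-separated groups on a line
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
score = silhouette(points, labels)
print(round(score, 4))  # 0.8997
```

Comparing this score across several values of k is one common way to choose the number of clusters.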

compute_hdbscan_clusters(features, min_cluster_size=100, min_samples=1, **kwargs)

Uses HDBSCAN* to identify the best number of clusters and map each unit to one. This function expects input data where the rows are units and columns are features. Additional keyword arguments are passed to HDBSCAN. Check out the official documentation for more: https://hdbscan.readthedocs.io/en/latest

Parameters

features – TF-IDF sparse matrix or pandas.DataFrame
min_cluster_size (int) – The minimum number of documents/units that can exist in a cluster
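min_samples (int) – The number of samples in a neighborhood required for a point to be considered a core point; larger values yield more conservative clustering (see HDBSCAN documentation for more)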
kwargs – Additional HDBSCAN parameters: https://hdbscan.readthedocs.io/en/latest/parameter_selection.html

Returns

A list with the cluster label for each row

Usage:

from pewanalytics.stats.clustering import compute_hdbscan_clusters
from sklearn import datasets
import pandas as pd

df = pd.DataFrame(datasets.load_iris().data)

>>> df['cluster'] = compute_hdbscan_clusters(df, min_cluster_size=10)
HDBSCAN: n_clusters 2

>>> df['cluster'].value_counts()
1    100
0     50
Name: cluster, dtype: int64

Dimensionality Reduction

The pewanalytics.stats.dimensionality_reduction submodule contains functions for collapsing your data into underlying dimensions using methods like PCA and correspondence analysis.

Functions:

`get_pca`(features[, feature_names, k])	Performs PCA on a set of features.
`get_lsa`(features[, feature_names, k])	Performs LSA on a set of features.
`correspondence_analysis`(edges[, n])	Performs correspondence analysis on a set of features.

get_pca(features, feature_names=None, k=20)

Performs PCA on a set of features. This function expects input data where the rows are units and columns are features.

For more information about how PCA is implemented, visit the Scikit-Learn Documentation.

Parameters

features – A pandas.DataFrame or sparse matrix where rows are units/observations and columns are features
feature_names (list) – An optional list of feature names (for sparse matrices)
k (int) – Number of dimensions to extract

Returns

A tuple of two pandas.DataFrame objects: (features x components, units x components)

Return type

tuple

Usage:

from pewanalytics.stats.dimensionality_reduction import get_pca
from sklearn import datasets
import pandas as pd

df = pd.DataFrame(datasets.load_iris().data)

>>> feature_weights, df_reduced  = get_pca(df, k=2)
Decomposition explained variance ratio: 0.977685206318795
Top features:
Component 0: [2 0 3 1]
Component 1: [1 0 3 2]

>>> feature_weights
      pca_0     pca_1
0  0.361387  0.656589
1 -0.084523  0.730161
2  0.856671 -0.173373
3  0.358289 -0.075481

>>> df_reduced.head()
      pca_0     pca_1    pca
0 -2.684126  0.319397  pca_1
1 -2.714142 -0.177001  pca_1
2 -2.888991 -0.144949  pca_1
3 -2.745343 -0.318299  pca_1
4 -2.728717  0.326755  pca_1

get_lsa(features, feature_names=None, k=20)

Performs LSA on a set of features. This function expects input data where the rows are units and columns are features.

For more information about how LSA is implemented, visit the Scikit-Learn Documentation.

Parameters

features – A pandas.DataFrame or sparse matrix where rows are units/observations and columns are features
feature_names (list) – An optional list of feature names (for sparse matrices)
k (int) – Number of dimensions to extract

Returns

A tuple of two pandas.DataFrame objects: (features x components, units x components)

Return type

tuple

Usage:

from pewanalytics.stats.dimensionality_reduction import get_lsa
from sklearn import datasets
import pandas as pd

df = pd.DataFrame(datasets.load_iris().data)

>>> feature_weights, df_reduced  = get_lsa(df, k=2)
Decomposition explained variance ratio: 0.9772093692426493
Top features:
Component 0: [0 2 1 3]
Component 1: [1 0 3 2]

>>> feature_weights
      lsa_0     lsa_1
0  0.751108  0.284175
1  0.380086  0.546745
2  0.513009 -0.708665
3  0.167908 -0.343671

>>> df_reduced.head()
      lsa_0     lsa_1    lsa
0  5.912747  2.302033  lsa_0
1  5.572482  1.971826  lsa_0
2  5.446977  2.095206  lsa_0
3  5.436459  1.870382  lsa_0
4  5.875645  2.328290  lsa_0

correspondence_analysis(edges, n=1)

Performs correspondence analysis on a set of features.

This is most useful in the context of network analysis. For example, you might wish to identify the underlying dimension in a network of Twitter users using a matrix that represents whether or not they follow one another. When news and political accounts are included, the underlying dimension often appears to approximate the left-right political spectrum.

Parameters

edges – A pandas.DataFrame of NxN where both the rows and columns are “nodes” and the values are some sort of closeness or similarity measure (like a cosine similarity matrix)
n (int) – The number of dimensions to extract

Returns

A pandas.DataFrame where rows are the units and the columns correspond to the extracted dimensions.

Usage:

from pewanalytics.stats.dimensionality_reduction import correspondence_analysis
import nltk
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("inaugural")
df = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
])

vec = TfidfVectorizer(min_df=10, max_df=.9).fit(df['text'])
tfidf = vec.transform(df['text'])

cosine_similarities = linear_kernel(tfidf)
matrix = pd.DataFrame(cosine_similarities, columns=df['speech'])

# Looks like the main source of variation in the language of inaugural speeches is time!

>>> mca = correspondence_analysis(matrix)

>>> mca.sort_values("mca_1").head()
                node     mca_1
57  1993-Clinton.txt -0.075508
56    2017-Trump.txt -0.068168
55  1997-Clinton.txt -0.061567
54    1973-Nixon.txt -0.060698
53     1989-Bush.txt -0.056305

>>> mca.sort_values("mca_1").tail()
               node     mca_1
4    1877-Hayes.txt  0.040037
3   1817-Monroe.txt  0.040540
2     1845-Polk.txt  0.042847
1   1849-Taylor.txt  0.050937
0  1829-Jackson.txt  0.056201

Inter-Rater Reliability

The pewanalytics.stats.irr submodule contains functions for computing measures of inter-rater reliability and model performance, including Cohen’s Kappa, Krippendorff’s Alpha, precision, recall, and more.

Functions:

`kappa_sample_size_power`(rate1, rate2, k1, k0)	Python translation of the `N.cohen.kappa` function from the `irr` R package.
`kappa_sample_size_CI`(kappa0, kappaL, props)	Helps determine the required document sample size to confirm that Cohen's Kappa between coders is at or above a minimum threshold.
`compute_scores`(coder_df, coder1, coder2, ...)	Computes a variety of inter-rater reliability scores, including Cohen's kappa, Krippendorff's alpha, precision, and recall.
`compute_overall_scores`(coder_df, ...)	Computes overall inter-rater reliability scores (Krippendorff's Alpha and Fleiss' Kappa).
`compute_overall_scores_multivariate`(...)	Computes overall inter-rater reliability scores (Krippendorff's Alpha and Fleiss' Kappa).

kappa_sample_size_power(rate1, rate2, k1, k0, alpha=0.05, power=0.8, twosided=False)

Python translation of the N.cohen.kappa function from the irr R package.

Source: https://cran.r-project.org/web/packages/irr/irr.pdf

Parameters

rate1 (float) – The probability that the first rater will record a positive diagnosis
rate2 (float) – The probability that the second rater will record a positive diagnosis
k1 (float) – The true Cohen’s Kappa statistic
k0 (float) – The value of kappa under the null hypothesis
alpha (float) – Type I error of test
power (float) – The desired power to detect the difference between true kappa and hypothetical kappa
twosided (bool) – Set this to True if the test is two-sided

Returns

Returns the required sample size

Return type

int

kappa_sample_size_CI(kappa0, kappaL, props, kappaU=None, alpha=0.05)

Helps determine the required document sample size to confirm that Cohen’s Kappa between coders is at or above a minimum threshold. Useful in situations where multiple coders code a set of documents for a binary outcome.

This function takes the observed kappa and proportion of positive cases from the sample, along with a lower-bound for the minimum acceptable kappa, and returns the sample size required to confirm that the coders’ agreement is truly above that minimum level of kappa with 95% certainty. If the current sample size is below the required sample size returned by this function, it can provide a rough estimate of how many additional documents need to be coded - assuming that the coders continue agreeing and observing positive cases at the same rate.

Translated from the kappaSize R package, CIBinary: https://github.com/cran/kappaSize/blob/master/R/CIBinary.R

Parameters

kappa0 (float) – The preliminary value of kappa
kappaL (float) – The desired expected lower bound for a two-sided 100(1 - alpha) % confidence interval for kappa. Alternatively, if kappaU is left as None, the procedure produces the number of required subjects for a one-sided confidence interval
props (float) – The anticipated prevalence of the desired trait
kappaU (float) – The desired expected upper confidence limit for kappa
alpha (float) – The desired type I error rate

Returns

Returns the required sample size

Usage:

from pewanalytics.stats.irr import kappa_sample_size_CI

observed_kappa = 0.8
desired_kappa = 0.7
observed_proportion = 0.5

>>> kappa_sample_size_CI(observed_kappa, desired_kappa, observed_proportion)
140

compute_scores(coder_df, coder1, coder2, outcome_column, document_column, coder_column, weight_column=None, pos_label=None)

Computes a variety of inter-rater reliability scores, including Cohen’s kappa, Krippendorff’s alpha, precision, and recall. The input data must consist of a pandas.DataFrame with the following columns:

A column with values that indicate the coder (like a name)

A column with values that indicate the document (like an ID)

A column with values that indicate the code value

(Optional) A column with document weights

This function will return a dictionary of agreement scores between the two specified coders.

Parameters

coder_df (pandas.DataFrame) – A pandas.DataFrame of codes
coder1 (str or int) – The value in coder_column for rows corresponding to the first coder
coder2 (str or int) – The value in coder_column for rows corresponding to the second coder
outcome_column (str) – The column that contains the codes
document_column (str) – The column that contains IDs for the documents
coder_column (str) – The column containing values that indicate which coder assigned the code
weight_column (str) – The column that contains sampling weights
pos_label (str or int) – The value indicating a positive label (optional)

Returns

A dictionary of scores

Return type

dict

Note

If using a multi-class (non-binary) code, some scores may come back null or not compute as expected. We recommend running the function separately for each specific code value as a binary flag by providing each unique value to the pos_label argument. If pos_label is not provided for multi-class codes, this function will attempt to compute scores based on support-weighted averages.

Usage:

from pewanalytics.stats.irr import compute_scores
import pandas as pd

df = pd.DataFrame([
    {"coder": "coder1", "document": 1, "code": "2"},
    {"coder": "coder2", "document": 1, "code": "2"},
    {"coder": "coder1", "document": 2, "code": "1"},
    {"coder": "coder2", "document": 2, "code": "2"},
    {"coder": "coder1", "document": 3, "code": "0"},
    {"coder": "coder2", "document": 3, "code": "0"},
])

>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder")
{'coder1': 'coder1',
 'coder2': 'coder2',
 'n': 3,
 'outcome_column': 'code',
 'pos_label': None,
 'coder1_mean_unweighted': 1.0,
 'coder1_std_unweighted': 0.5773502691896257,
 'coder2_mean_unweighted': 1.3333333333333333,
 'coder2_std_unweighted': 0.6666666666666666,
 'alpha_unweighted': 0.5454545454545454,
 'accuracy': 0.6666666666666666,
 'f1': 0.5555555555555555,
 'precision': 0.5,
 'recall': 0.6666666666666666,
 'precision_recall_min': 0.5,
 'matthews_corrcoef': 0.6123724356957946,
 'roc_auc': None,
 'pct_agree_unweighted': 0.6666666666666666}

>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder", pos_label="0")
 {'coder1': 'coder1',
 'coder2': 'coder2',
 'n': 3,
 'outcome_column': 'code',
 'pos_label': '0',
 'coder1_mean_unweighted': 0.3333333333333333,
 'coder1_std_unweighted': 0.3333333333333333,
 'coder2_mean_unweighted': 0.3333333333333333,
 'coder2_std_unweighted': 0.3333333333333333,
 'alpha_unweighted': 1.0,
 'cohens_kappa': 1.0,
 'accuracy': 1.0,
 'f1': 1.0,
 'precision': 1.0,
 'recall': 1.0,
 'precision_recall_min': 1.0,
 'matthews_corrcoef': 1.0,
 'roc_auc': 1.0,
 'pct_agree_unweighted': 1.0}

>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder", pos_label="1")
{'coder1': 'coder1',
 'coder2': 'coder2',
 'n': 3,
 'outcome_column': 'code',
 'pos_label': '1',
 'coder1_mean_unweighted': 0.3333333333333333,
 'coder1_std_unweighted': 0.3333333333333333,
 'coder2_mean_unweighted': 0.0,
 'coder2_std_unweighted': 0.0,
 'alpha_unweighted': 0.0,
 'cohens_kappa': 0.0,
 'accuracy': 0.6666666666666666,
 'f1': 0.0,
 'precision': 0.0,
 'recall': 0.0,
 'precision_recall_min': 0.0,
 'matthews_corrcoef': 1.0,
 'roc_auc': None,
 'pct_agree_unweighted': 0.6666666666666666}

>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder", pos_label="2")
{'coder1': 'coder1',
 'coder2': 'coder2',
 'n': 3,
 'outcome_column': 'code',
 'pos_label': '2',
 'coder1_mean_unweighted': 0.3333333333333333,
 'coder1_std_unweighted': 0.3333333333333333,
 'coder2_mean_unweighted': 0.6666666666666666,
 'coder2_std_unweighted': 0.3333333333333333,
 'alpha_unweighted': 0.4444444444444444,
 'cohens_kappa': 0.3999999999999999,
 'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 0.5,
 'recall': 1.0,
 'precision_recall_min': 0.5,
 'matthews_corrcoef': 0.5,
 'roc_auc': 0.75,
 'pct_agree_unweighted': 0.6666666666666666}
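
To make the pos_label="2" output above easier to interpret, here is a minimal hand computation of Cohen's kappa, precision, and recall on the same three documents after recoding each code to 1 if it equals "2" and 0 otherwise. The helper `binary_agreement` is hypothetical, and treating coder1 as the reference coder is an assumption inferred from the output rather than something the documentation states.

```python
def binary_agreement(reference, comparison):
    """Cohen's kappa, precision, and recall for two equal-length lists of 0/1 codes.

    Illustrative only: assumes both coders used each value at least once,
    so none of the denominators below are zero.
    """
    n = len(reference)
    observed = sum(r == c for r, c in zip(reference, comparison)) / n
    p_ref = sum(reference) / n
    p_cmp = sum(comparison) / n
    # Agreement expected by chance: both coders say 1, or both say 0
    expected = p_ref * p_cmp + (1 - p_ref) * (1 - p_cmp)
    true_pos = sum(r == 1 and c == 1 for r, c in zip(reference, comparison))
    return {
        "cohens_kappa": (observed - expected) / (1 - expected),
        "precision": true_pos / sum(comparison),
        "recall": true_pos / sum(reference),
    }

# Documents 1-3 from the example above, recoded as 1 where the code equals "2"
coder1 = [1, 0, 0]
coder2 = [1, 1, 0]
scores = binary_agreement(coder1, coder2)
print({name: round(value, 4) for name, value in scores.items()})
# {'cohens_kappa': 0.4, 'precision': 0.5, 'recall': 1.0}
```

Up to floating-point rounding, these values agree with the cohens_kappa, precision, and recall entries reported for pos_label="2".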

compute_overall_scores(coder_df, outcome_column, document_column, coder_column)

Computes overall inter-rater reliability scores (Krippendorff’s Alpha and Fleiss’ Kappa). Allows for more than two coders and code values. The input data must consist of a pandas.DataFrame with the following columns:

A column with values that indicate the coder (like a name)

A column with values that indicate the document (like an ID)

A column with values that indicate the code value

Parameters

coder_df (pandas.DataFrame) – A pandas.DataFrame of codes
outcome_column (str) – The column that contains the codes
document_column (str) – The column that contains IDs for the documents
coder_column (str) – The column containing values that indicate which coder assigned the code

Returns

A dictionary containing the scores

Return type

dict

Usage:

from pewanalytics.stats.irr import compute_overall_scores
import pandas as pd

df = pd.DataFrame([
    {"coder": "coder1", "document": 1, "code": "2"},
    {"coder": "coder2", "document": 1, "code": "2"},
    {"coder": "coder1", "document": 2, "code": "1"},
    {"coder": "coder2", "document": 2, "code": "2"},
    {"coder": "coder1", "document": 3, "code": "0"},
    {"coder": "coder2", "document": 3, "code": "0"},
])

>>> compute_overall_scores(df, "code", "document", "coder")
{'alpha': 0.5454545454545454, 'fleiss_kappa': 0.4545454545454544}
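
For readers who want to see where these two numbers come from, the sketch below recomputes them from the textbook definitions of Fleiss' kappa and nominal Krippendorff's alpha. The function names are hypothetical and this is not the library's code; it also assumes every document was coded by the same number of coders and that no codes are missing.

```python
from collections import Counter, defaultdict

def codes_by_document(rows):
    """Group (coder, document, code) rows into {document: [codes]}."""
    grouped = defaultdict(list)
    for _coder, document, code in rows:
        grouped[document].append(code)
    return grouped

def fleiss_kappa(rows):
    """Fleiss' kappa; assumes the same number of coders per document."""
    docs = list(codes_by_document(rows).values())
    m = len(docs[0])  # coders per document
    totals = Counter(code for codes in docs for code in codes)
    # Per-document share of coder pairs that agree
    agree = [
        (sum(c * c for c in Counter(codes).values()) - m) / (m * (m - 1))
        for codes in docs
    ]
    p_observed = sum(agree) / len(docs)
    p_expected = sum((c / (m * len(docs))) ** 2 for c in totals.values())
    return (p_observed - p_expected) / (1 - p_expected)

def krippendorff_alpha_nominal(rows):
    """Krippendorff's alpha with the nominal (match / no match) difference."""
    docs = [codes for codes in codes_by_document(rows).values() if len(codes) > 1]
    n = sum(len(codes) for codes in docs)
    totals = Counter(code for codes in docs for code in codes)
    # Observed disagreement: ordered mismatching pairs within each document
    d_observed = sum(
        (len(codes) ** 2 - sum(c * c for c in Counter(codes).values())) / (len(codes) - 1)
        for codes in docs
    ) / n
    # Expected disagreement if codes were paired at random across all documents
    d_expected = (n * n - sum(c * c for c in totals.values())) / (n * (n - 1))
    return 1 - d_observed / d_expected

rows = [
    ("coder1", 1, "2"), ("coder2", 1, "2"),
    ("coder1", 2, "1"), ("coder2", 2, "2"),
    ("coder1", 3, "0"), ("coder2", 3, "0"),
]
kappa = fleiss_kappa(rows)
alpha = krippendorff_alpha_nominal(rows)
print(round(kappa, 4), round(alpha, 4))  # 0.4545 0.5455
```

Both values equal the fractions 5/11 and 6/11, which match the fleiss_kappa and alpha entries in the output above.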

compute_overall_scores_multivariate(coder_df, document_column, coder_column, outcome_columns)

Computes overall inter-rater reliability scores (Krippendorff’s Alpha and Fleiss’ Kappa). Allows for more than two coders, code values, AND variables. All variables and values will be converted into a matrix of dummy variables, and Alpha and Kappa will be computed using four different distance metrics:

Discrete agreement (exact agreement across all outcome columns)

Jaccard coefficient

MASI distance

Cosine similarity
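
To give a feel for how the four metrics listed above differ, the sketch below applies their standard definitions to two coders' label sets for a single document, where each label stands for one active dummy variable. The function names and the "column=value" label format are hypothetical, the MASI weights (1, 2/3, 1/3, 0) follow the usual published definition, and the library's own implementation may differ in detail.

```python
import math

def discrete_distance(a, b):
    """0 if the two label sets are identical, otherwise 1."""
    return 0.0 if a == b else 1.0

def jaccard_distance(a, b):
    """1 minus the share of all labels that both coders used."""
    return 1 - len(a & b) / len(a | b)

def masi_distance(a, b):
    """Jaccard similarity scaled by a monotonicity weight, expressed as a distance."""
    if a == b:
        weight = 1.0      # identical sets
    elif a <= b or b <= a:
        weight = 2 / 3    # one set contains the other
    elif a & b:
        weight = 1 / 3    # partial overlap
    else:
        weight = 0.0      # disjoint sets
    return 1 - (len(a & b) / len(a | b)) * weight

def cosine_distance(a, b):
    """Cosine distance between the 0/1 dummy vectors implied by the sets."""
    return 1 - len(a & b) / math.sqrt(len(a) * len(b))

# Assumes non-empty label sets; the coders agree on code1 but not on code2
coder1_labels = {"code1=2", "code2=1"}
coder2_labels = {"code1=2", "code2=0"}

distances = {
    "discrete": discrete_distance(coder1_labels, coder2_labels),
    "jaccard": jaccard_distance(coder1_labels, coder2_labels),
    "masi": masi_distance(coder1_labels, coder2_labels),
    "cosine": cosine_distance(coder1_labels, coder2_labels),
}
print({name: round(d, 4) for name, d in distances.items()})
# {'discrete': 1.0, 'jaccard': 0.6667, 'masi': 0.8889, 'cosine': 0.5}
```

The discrete metric treats any difference as total disagreement, while the other three give partial credit for overlapping codes.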

The input data must consist of a pandas.DataFrame with the following columns:

A column with values that indicate the coder (like a name)

A column with values that indicate the document (like an ID)

One or more columns with values that indicate code values

This code was adapted from a very helpful StackExchange post: https://stats.stackexchange.com/questions/511927/interrater-reliability-with-multi-rater-multi-label-dataset

Parameters

coder_df (pandas.DataFrame) – A pandas.DataFrame of codes
document_column (str) – The column that contains IDs for the documents
coder_column (str) – The column containing values that indicate which coder assigned the code
outcome_columns (list) – The columns that contain the codes

Returns

A dictionary containing the scores

Return type

dict

Usage:

from pewanalytics.stats.irr import compute_overall_scores_multivariate
import pandas as pd

coder_df = pd.DataFrame([
    {"coder": "coder1", "document": 1, "code": "2"},
    {"coder": "coder2", "document": 1, "code": "2"},
    {"coder": "coder1", "document": 2, "code": "1"},
    {"coder": "coder2", "document": 2, "code": "2"},
    {"coder": "coder1", "document": 3, "code": "0"},
    {"coder": "coder2", "document": 3, "code": "0"},
])

>>> compute_overall_scores_multivariate(coder_df, 'document', 'coder', ["code"])
{'fleiss_kappa_discrete': 0.4545454545454544,
 'fleiss_kappa_jaccard': 0.49999999999999994,
 'fleiss_kappa_masi': 0.49999999999999994,
 'fleiss_kappa_cosine': 0.49999999999999994,
 'alpha_discrete': 0.5454545454545454,
 'alpha_jaccard': 0.5454545454545454,
 'alpha_masi': 0.5454545454545454,
 'alpha_cosine': 0.5454545454545454}

coder_df = pd.DataFrame([
    {"coder": "coder1", "document": 1, "code1": "2", "code2": "1"},
    {"coder": "coder2", "document": 1, "code1": "2", "code2": "1"},
    {"coder": "coder1", "document": 2, "code1": "1", "code2": "0"},
    {"coder": "coder2", "document": 2, "code1": "2", "code2": "1"},
    {"coder": "coder1", "document": 3, "code1": "0", "code2": "0"},
    {"coder": "coder2", "document": 3, "code1": "0", "code2": "0"},
])

>>> compute_overall_scores_multivariate(coder_df, 'document', 'coder', ["code1", "code2"])
{'fleiss_kappa_discrete': 0.4545454545454544,
 'fleiss_kappa_jaccard': 0.49999999999999994,
 'fleiss_kappa_masi': 0.49999999999999994,
 'fleiss_kappa_cosine': 0.49999999999999994,
 'alpha_discrete': 0.5454545454545454,
 'alpha_jaccard': 0.5161290322580645,
 'alpha_masi': 0.5361781076066792,
 'alpha_cosine': 0.5}

Mutual Information

The pewanalytics.stats.mutual_info submodule provides a function for extracting pointwise mutual information for features in your data based on a binary split into two classes. This can be a great method for identifying features that are most distinctive of one group versus another.

Functions:

`compute_mutual_info`(y, x[, weights, ...])	Computes pointwise mutual information for a set of observations partitioned into two groups.
`mutual_info_bar_plot`(mutual_info[, ...])	Takes a mutual information table generated by `pewanalytics.stats.mutual_info.compute_mutual_info()`, and generates a bar plot of top features.
`mutual_info_scatter_plot`(mutual_info[, ...])	Takes a mutual information table generated by `pewanalytics.stats.mutual_info.compute_mutual_info()`, and generates a scatter plot of top features.

compute_mutual_info(y, x, weights=None, col_names=None, l=0, normalize=True)

Computes pointwise mutual information for a set of observations partitioned into two groups.

Parameters

y – An array or, preferably, a pandas.Series
x – A matrix, pandas.DataFrame, or preferably a scipy.sparse.csr_matrix
weights – (Optional) An array of weights corresponding to each observation
col_names (list) – The feature names associated with the columns in matrix ‘x’
l (int or float) – An optional Laplace smoothing parameter
normalize (bool) – Toggle normalization on or off (to control for feature prevalence), on by default

Returns

A pandas.DataFrame of features with a variety of computed metrics including mutual information.

The function expects y to correspond to a list or series of values indicating which partition an observation belongs to. y must be a binary flag. x is a set of features (either a pandas.DataFrame or sparse matrix) where the rows correspond to observations and the columns represent the presence of features (you can technically run this using non-binary features but the results will not be as readily interpretable.) The function returns a pandas.DataFrame of metrics computed for each feature, including the following columns:

MI1: The feature’s mutual information for the positive class
MI0: The feature’s mutual information for the negative class
total: The total number of times a feature appeared
total_pos_with_term: The total number of times a feature appeared in positive cases
total_neg_with_term: The total number of times a feature appeared in negative cases
total_pos_neg_with_term_diff: The raw difference in the number of times a feature appeared in positive cases relative to negative cases
pct_pos_with_term: The proportion of positive cases that had the feature
pct_neg_with_term: The proportion of negative cases that had the feature
pct_pos_neg_with_term_ratio: A likelihood ratio indicating the degree to which a positive case was more likely to have the feature than a negative case
pct_term_pos: Of the cases that had a feature, the proportion that were in the positive class
pct_term_neg: Of the cases that had a feature, the proportion that were in the negative class
pct_term_pos_neg_diff: The percentage point difference between the proportion of cases with the feature that were positive vs. negative
pct_term_pos_neg_ratio: A likelihood ratio indicating the degree to which a feature was more likely to appear in a positive case relative to a negative one (may not be meaningful when classes are imbalanced)

Note

Note that pct_term_pos and pct_term_neg may not be directly comparable if classes are imbalanced, and in such cases a pct_term_pos_neg_diff above zero or pct_term_pos_neg_ratio above 1 may not indicate a true association with the positive class if positive cases outnumber negative ones.

Note

Mutual information can be a difficult metric to explain to others. We’ve found that the pct_pos_neg_with_term_ratio can serve as a more interpretable alternative method for identifying meaningful differences between groups.
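
As a small worked example of the two ideas above, the sketch below computes the textbook pointwise mutual information between a single binary feature and the positive class, alongside the ratio described for pct_pos_neg_with_term_ratio. The helper `pmi_and_ratio` is hypothetical; it does not reproduce the library's weighting, smoothing (l), or normalization, so its PMI value should not be expected to match MI1 exactly.

```python
import math

def pmi_and_ratio(y, x):
    """y: 0/1 class flags; x: 0/1 flags for one feature (same length).

    Illustrative only: assumes the feature appears in both classes, so no
    count is zero (zero counts are what smoothing is meant to handle).
    """
    n = len(y)
    n_pos = sum(y)
    n_neg = n - n_pos
    n_term = sum(x)
    pos_with_term = sum(1 for yi, xi in zip(y, x) if yi == 1 and xi == 1)
    neg_with_term = n_term - pos_with_term
    # Textbook PMI: log2 of P(feature, positive) / (P(feature) * P(positive))
    pmi_pos = math.log2((pos_with_term / n) / ((n_term / n) * (n_pos / n)))
    # Share of each class that has the feature, and the ratio between them
    pct_pos_with_term = pos_with_term / n_pos
    pct_neg_with_term = neg_with_term / n_neg
    return pmi_pos, pct_pos_with_term / pct_neg_with_term

# Four positive and four negative cases; the feature appears in 3 and 1 of them
y = [1, 1, 1, 1, 0, 0, 0, 0]
x = [1, 1, 1, 0, 1, 0, 0, 0]
pmi, ratio = pmi_and_ratio(y, x)
print(round(pmi, 3), ratio)  # 0.585 3.0
```

The ratio reads directly as "positive cases were three times as likely to contain the feature", which is often easier to communicate than a PMI of about 0.585 bits.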

Usage:

from pewanalytics.stats.mutual_info import compute_mutual_info
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("inaugural")
df = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
])
df['year'] = df['speech'].map(lambda x: int(x.split("-")[0]))
df['21st_century'] = df['year'].map(lambda x: 1 if x >= 2000 else 0)

vec = TfidfVectorizer(min_df=10, max_df=.9).fit(df['text'])
tfidf = vec.transform(df['text'])

# Here are the terms most distinctive of inaugural addresses in the 21st century vs. years prior

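# Note: on scikit-learn 1.0 and later, use vec.get_feature_names_out(); get_feature_names() was removed in 1.2
>>> results = compute_mutual_info(df['21st_century'], tfidf, col_names=vec.get_feature_names())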

>>> results.sort_values("MI1", ascending=False).index[:25]
Index(['america', 'thank', 'bless', 'schools', 'ideals', 'americans',
       'meaning', 'you', 'move', 'across', 'courage', 'child', 'birth',
       'generation', 'families', 'build', 'hard', 'promise', 'choice', 'women',
       'guided', 'words', 'blood', 'dignity', 'because'],
      dtype='object')

mutual_info_bar_plot(mutual_info, filter_col='MI1', top_n=50, x_col='pct_term_pos_neg_ratio', color='grey', title=None, width=10)

Takes a mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info(), and generates a bar plot of top features. Allows for an easy visualization of feature differences. Can subsequently call plt.show() or plt.savefig() to display or save the plot.

Parameters

mutual_info – A mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info()
filter_col (str) – The column to use when selecting top features; sorts in descending order and picks the top top_n
top_n (int) – The number of features to display
x_col (str) – The column by which to sort the final set of top features (after they have been selected by filter_col)
color (str) – The color of the bars
title (str) – The title of the plot
width (int) – The width of the plot

Returns

A Matplotlib figure, which you can display via plt.show() or alternatively save to a file via plt.savefig(FILEPATH)

mutual_info_scatter_plot(mutual_info, filter_col='MI1', top_n=50, x_col='pct_term_pos_neg_ratio', xlabel=None, scale_x_even=True, y_col='MI1', ylabel=None, scale_y_even=True, color='grey', color_col='MI1', size_col='pct_pos_with_term', title=None, figsize=(10, 10), adjust_text=False)

Takes a mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info(), and generates a scatter plot of top features. The names of the features will be displayed with varying colors and sizes depending on the variables specified in color_col and size_col. Allows for an easy visualization of feature differences. Can subsequently call plt.show() or plt.savefig() to display or save the plot.

Parameters

mutual_info – A mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info()
filter_col (str) – The column to use when selecting top features; sorts in descending order and picks the top top_n
top_n (int) – The number of features to display
x_col (str) – The column to use as the x-axis
xlabel (str) – Label for the x-axis
scale_x_even (bool) – If True, set values to their ordered rank (allows for even spacing)
y_col (str) – The column to use as the y-axis
ylabel (str) – Label for the y-axis
scale_y_even (bool) – If True, set values to their ordered rank (allows for even spacing)
color (str) – The color for the features
color_col (str) – The column to use when shading the features
size_col (str) – The column to use to size the features
title (str) – The title of the plot
figsize (tuple) – The size of the plot
adjust_text (bool) – If True, attempts to adjust the text so it doesn’t overlap

Returns

A Matplotlib figure, which you can display via plt.show() or alternatively save to a file via plt.savefig(FILEPATH)

Sampling

The pewanalytics.stats.sampling submodule contains utilities for extracting and weighting samples based on a known sampling frame.

Functions:

`compute_sample_weights_from_frame`(frame, ...)	Takes two `pandas.DataFrame` objects and computes sampling weights for the second one, based on the first.
`compute_balanced_sample_weights`(sample, ...)	Takes a `pandas.DataFrame` and one or more column names (`weight_vars`) and computes weights such that every unique combination of values in the weighting columns is balanced (when weighted, the sums of the observations for each combination will be equal to one another).

Classes:

SampleExtractor(df, id_col[, verbose, seed])

A helper class for extracting samples using various sampling methods.

compute_sample_weights_from_frame(frame, sample, weight_vars)

Takes two pandas.DataFrame objects and computes sampling weights for the second one, based on the first. The first pandas.DataFrame should be equivalent to the population that the second pandas.DataFrame, a sample, was drawn from. Weights will be calculated based on the differences in the distribution of one or more variables specified in weight_vars (these should be the names of columns). Returns a pandas.Series equal in length to the sample with the computed weights.

Parameters

frame – pandas.DataFrame (must contain all of the columns specified in weight_vars)
sample – pandas.DataFrame (must contain all of the columns specified in weight_vars)
weight_vars (list) – The names of the columns to use when computing weights.

Returns

A pandas.Series containing the weights for each row in the sample

Usage:

from pewanalytics.stats.sampling import compute_sample_weights_from_frame
import nltk
import pandas as pd

nltk.download("inaugural")
frame = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
])
# Let's grab a sample of speeches - some that mention specific terms, and an additional random sample
frame['economy'] = frame['text'].str.contains("economy").astype(int)
frame['health'] = frame['text'].str.contains("health").astype(int)
frame['immigration'] = frame['text'].str.contains("immigration").astype(int)
frame['education'] = frame['text'].str.contains("education").astype(int)
sample = pd.concat([
    frame[frame['economy']==1].sample(5),
    frame[frame['health']==1].sample(5),
    frame[frame['immigration']==1].sample(5),
    frame[frame['education']==1].sample(5),
    frame.sample(5)
])
# Now we can get the sampling weights to adjust it back to the population based on those variables

>>> sample['weight'] = compute_sample_weights_from_frame(frame, sample, ["economy", "health", "immigration", "education"])
>>> sample
               speech                                               text  economy  health  immigration  education  count    weight
7     1817-Monroe.txt  I should be destitute of feeling if I was not ...        1       1            0          0      1  1.005747
11   1833-Jackson.txt  Fellow citizens, the will of the American peop...        1       0            0          0      1  2.370690
34  1925-Coolidge.txt  My countrymen, no one can contemplate curre...           1       0            1          1      1  0.344828
35    1929-Hoover.txt  My Countrymen: This occasion is not alone the ...        1       1            0          1      1  0.538793
28  1901-McKinley.txt  My fellow-citizens, when we assembled here on ...        1       0            0          0      1  2.370690
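
One common way to derive weights of this kind is cell-based post-stratification: each sample row is weighted by the share of the frame that falls in its cell (its combination of weighting values) divided by the share of the sample in that cell. The sketch below illustrates that idea on plain dictionaries. Whether compute_sample_weights_from_frame uses exactly this formula is an assumption, and the helper `cell_weights` is hypothetical.

```python
from collections import Counter

def cell_weights(frame_rows, sample_rows, weight_vars):
    """Weight = (frame share of the row's cell) / (sample share of the row's cell)."""
    def cell(row):
        # A cell is the row's combination of values on the weighting variables
        return tuple(row[var] for var in weight_vars)

    frame_counts = Counter(cell(row) for row in frame_rows)
    sample_counts = Counter(cell(row) for row in sample_rows)
    weights = []
    for row in sample_rows:
        frame_share = frame_counts[cell(row)] / len(frame_rows)
        sample_share = sample_counts[cell(row)] / len(sample_rows)
        weights.append(frame_share / sample_share)
    return weights

# 20% of the frame mentions the economy, but the sample oversamples it to 50%
frame = [{"economy": 1}] * 2 + [{"economy": 0}] * 8
sample = [{"economy": 1}] * 2 + [{"economy": 0}] * 2
weights = cell_weights(frame, sample, ["economy"])
print(weights)  # [0.4, 0.4, 1.6, 1.6]
```

Oversampled rows are weighted down and undersampled rows are weighted up, and under this scheme the weights sum to the sample size.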

compute_balanced_sample_weights(sample, weight_vars, weight_column=None)

Takes a pandas.DataFrame and one or more column names (weight_vars) and computes weights that balance every unique combination of values in the weighting columns: when weighted, the observations in each combination sum to the same total. This is useful for balancing important groups in training datasets, for example.

Parameters

sample – pandas.DataFrame (must contain all of the columns specified in weight_vars)
weight_vars (list) – The names of the columns to use when computing weights.
weight_column (str) – An optional column containing existing weights, which can be factored into the new weights.

Returns

A pandas.Series containing the weights for each row in the sample

Note

All weight variables must be binary flags (1 or 0); if you want to weight using a non-binary variable, you should convert it into a set of dummy variables and then pass those in as multiple columns.
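
As a rough illustration of the conversion described in the note, a categorical column can be expanded into binary flags with pandas.get_dummies. The data and the column name "region" below are hypothetical and are not part of pewanalytics; this is a minimal sketch of one way to prepare such columns, not a required workflow.

```python
import pandas as pd

# Hypothetical data: "region" is a non-binary variable with three categories
df = pd.DataFrame({
    "doc_id": [1, 2, 3, 4, 5],
    "region": ["north", "south", "south", "west", "north"],
})

# Expand the categorical column into one 0/1 flag column per category
dummies = pd.get_dummies(df["region"], prefix="region").astype(int)
df = pd.concat([df, dummies], axis=1)

# These column names could then be passed in as weight_vars
weight_vars = list(dummies.columns)
print(weight_vars)
```

Each row has exactly one flag set to 1, so the resulting columns together carry the same information as the original categorical variable.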

Usage:

from pewanalytics.stats.sampling import compute_balanced_sample_weights
import pandas as pd

# Let's say we have a set of tweets from members of Congress
df = pd.DataFrame([
    {"politician_id": 1, "party": "R", "tweet": "Example document"},
    {"politician_id": 1, "party": "R", "tweet": "Example document"},
    {"politician_id": 2, "party": "D", "tweet": "Example document"},
    {"politician_id": 2, "party": "D", "tweet": "Example document"},
    {"politician_id": 3, "party": "D", "tweet": "Example document"},
])
df['is_republican'] = (df['party']=="R").astype(int)

# We can balance the parties like so:

>>> df['weight'] = compute_balanced_sample_weights(df, ["is_republican"])

>>> df
   politician_id party             tweet  is_republican    weight
0              1     R  Example document              1  1.250000
1              1     R  Example document              1  1.250000
2              2     D  Example document              0  0.833333
3              2     D  Example document              0  0.833333
4              3     D  Example document              0  0.833333
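
The arithmetic behind the weights in this example can be sketched in plain pandas. The helper below (balanced_weights_sketch, a hypothetical name) is not the library's implementation; it assumes each row is weighted by N / (G * n_g), where N is the number of rows, G is the number of unique combinations, and n_g is the number of rows in that row's combination. Under that assumption it reproduces the values shown above.

```python
import pandas as pd

def balanced_weights_sketch(sample, weight_vars):
    """Illustrative only: weight each row by N / (G * n_g)."""
    # Count the rows in each unique combination of the weighting columns
    counts = sample.groupby(weight_vars).size().rename("n").reset_index()
    # Attach each row's group size (merge preserves the left row order)
    merged = sample[weight_vars].merge(counts, on=weight_vars, how="left")
    n_groups = len(counts)
    weights = len(sample) / (n_groups * merged["n"].to_numpy())
    return pd.Series(weights, index=sample.index)

df = pd.DataFrame({"is_republican": [1, 1, 0, 0, 0]})
df["weight"] = balanced_weights_sketch(df, ["is_republican"])

# Each group's weights sum to the same total (2.5 here),
# and the weights overall sum to the number of rows (5)
print(df.groupby("is_republican")["weight"].sum())
```

With two Republican rows and three Democratic rows, the weights are 5 / (2 * 2) = 1.25 and 5 / (2 * 3) ≈ 0.8333, so both groups contribute equally once weighted.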

class SampleExtractor(df, id_col, verbose=False, seed=None)[source]

A helper class for extracting samples using various sampling methods.

Parameters

df (pandas.DataFrame) – The sampling frame
id_col (str) – Column in the pandas.DataFrame to be used as the unique ID of observations
verbose (bool) – Whether or not to print information during the sampling process (default=False)
seed (int) – Random seed (optional)

Methods:

extract(sample_size[, sampling_strategy, ...])

Extract a sample from a pandas.DataFrame using one of several sampling strategies.

extract(sample_size, sampling_strategy='random', stratify_by=None)[source]

Extract a sample from a pandas.DataFrame using one of the following methods:

all: Returns all of the IDs
random: Returns a random sample
stratify: Proportional stratification, method from Kish, Leslie. “Survey sampling.” (1965). Chapter 4.
stratify_even: Samples evenly from each stratum (the resulting sample will not be representative)
stratify_guaranteed: Proportional stratification, but the sample is guaranteed to contain at least one observation from each stratum (if the sample size is small and/or there are many small strata, the resulting sample may be far from representative)

Parameters

sample_size (int) – The desired size of the sample
sampling_strategy (str) – The method to be used to extract samples. Options are: all, random, stratify, stratify_even, stratify_guaranteed
stratify_by (str, list) – Optional name of a column or list of columns in the pandas.DataFrame to stratify on

Returns

A list of IDs reflecting the observations selected from the pandas.DataFrame during sampling

Return type

list

Usage:

from pewanalytics.stats.sampling import SampleExtractor
import nltk
import pandas as pd

nltk.download("inaugural")
frame = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
])
frame["century"] = frame['speech'].map(lambda x: "{}00".format(x.split("-")[0][:2]))

>>> sampler = SampleExtractor(frame, "speech", seed=42)

>>> sample_index = sampler.extract(12, sampling_strategy="random")
>>> frame[frame["speech"].isin(sample_index)]['century'].value_counts()
1900    6
1800    5
1700    1
Name: century, dtype: int64

>>> sample_index = sampler.extract(12, sampling_strategy="stratify", stratify_by=['century'])
>>> frame[frame["speech"].isin(sample_index)]['century'].value_counts()
1800    5
1900    5
2000    1
1700    1
Name: century, dtype: int64

>>> sample_index = sampler.extract(12, sampling_strategy="stratify_even", stratify_by=['century'])
>>> frame[frame["speech"].isin(sample_index)]['century'].value_counts()
1800    3
2000    3
1700    3
1900    3
Name: century, dtype: int64

>>> sample_index = sampler.extract(12, sampling_strategy="stratify_guaranteed", stratify_by=['century'])
>>> frame[frame["speech"].isin(sample_index)]['century'].value_counts()
1900    5
1800    4
1700    2
2000    1
Name: century, dtype: int64