pewanalytics.stats: Statistical Tools

In the pewanalytics.stats module, you’ll find a variety of statistical utilities for weighting, clustering, dimensionality reduction, and inter-rater reliability.

Clustering

The pewanalytics.stats.clustering submodule contains several functions for extracting clusters from your data.

Functions

compute_hdbscan_clusters(features[, …])

Uses HDBSCAN* to identify the best number of clusters and map each unit to one.

compute_kmeans_clusters(features[, k, …])

Uses K-Means to cluster an arbitrary set of features.

compute_kmeans_clusters(features, k=10, return_score=False)[source]

Uses K-Means to cluster an arbitrary set of features. This function expects input data where the rows are units and columns are features.

Parameters
  • features – TF-IDF sparse matrix or pandas.DataFrame

  • k (int) – The number of clusters to extract

  • return_score (bool) – If True, the function returns a tuple with the cluster assignments and the silhouette score of the clustering; otherwise the function just returns a list of cluster labels for each row. (Default=False)

Returns

A list with the cluster label for each row, or a tuple containing the labels followed by the silhouette score of the K-Means model.

Return type

list
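
As a rough illustration of what the silhouette score returned with return_score=True measures, here is a minimal pure-Python sketch. It assumes Euclidean distance and at least two clusters; the silhouette_score helper below is written for this example only and is not the library's implementation. Scores near 1 suggest compact, well-separated clusters, while scores below 0 suggest that units sit closer to a neighboring cluster than to their own.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (illustrative sketch)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two tight pairs of points that are far apart from each other
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the pairs
bad = silhouette_score(points, [0, 1, 0, 1])   # labels split the pairs
print(round(good, 3), round(bad, 3))
```

Comparing this score across several values of k is one common way to choose the number of clusters.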

Usage:

from pewanalytics.stats.clustering import compute_kmeans_clusters
from sklearn import datasets
import pandas as pd

# The iris dataset is a common example dataset included in scikit-learn with 3 main clusters
# Let's see if we can find them
df = pd.DataFrame(datasets.load_iris().data)

>>> df['cluster'] = compute_kmeans_clusters(df, k=3)
KMeans: n_clusters 3, score is 0.5576853964035263

>>> df['cluster'].value_counts()
1    62
0    50
2    38
Name: cluster, dtype: int64
compute_hdbscan_clusters(features, min_cluster_size=100, min_samples=1, **kwargs)[source]

Uses HDBSCAN* to identify the best number of clusters and map each unit to one. This function expects input data where the rows are units and columns are features. Additional keyword arguments are passed to HDBSCAN. Check out the official documentation for more: https://hdbscan.readthedocs.io/en/latest

Parameters
  • features – TF-IDF sparse matrix or pandas.DataFrame

  • min_cluster_size (int) – The minimum number of documents/units that can exist in a cluster

  • min_samples (int) – Minimum number of samples to draw (see HDBSCAN documentation for more)

  • kwargs – Additional HDBSCAN parameters: https://hdbscan.readthedocs.io/en/latest/parameter_selection.html

Returns

A list with the cluster label for each row

Usage:

from pewanalytics.stats.clustering import compute_hdbscan_clusters
from sklearn import datasets
import pandas as pd

df = pd.DataFrame(datasets.load_iris().data)

>>> df['cluster'] = compute_hdbscan_clusters(df, min_cluster_size=10)
HDBSCAN: n_clusters 2

>>> df['cluster'].value_counts()
1    100
0     50
Name: cluster, dtype: int64

Dimensionality Reduction

The pewanalytics.stats.dimensionality_reduction submodule contains functions for collapsing your data into underlying dimensions using methods like PCA and correspondence analysis.

Functions

correspondence_analysis(edges[, n])

Performs correspondence analysis on a set of features.

get_lsa(features[, feature_names, k])

Performs LSA on a set of features.

get_pca(features[, feature_names, k])

Performs PCA on a set of features.

get_pca(features, feature_names=None, k=20)[source]

Performs PCA on a set of features. This function expects input data where the rows are units and columns are features.

For more information about how PCA is implemented, visit the Scikit-Learn Documentation.

Parameters
  • features – A pandas.DataFrame or sparse matrix where rows are units/observations and columns are features

  • feature_names (list) – An optional list of feature names (for sparse matrices)

  • k (int) – Number of dimensions to extract

Returns

A tuple of two pandas.DataFrames: (features x components, units x components)

Return type

tuple
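
When it runs, get_pca prints a "decomposition explained variance ratio". The sketch below shows, under simplifying assumptions, what such a ratio represents: the share of total variance captured by each principal component. It handles only two features so that the eigenvalues of the covariance matrix have a closed form, and explained_variance_ratio_2d is a hypothetical helper written for this example, not part of pewanalytics or scikit-learn.

```python
from math import sqrt


def explained_variance_ratio_2d(rows):
    """Share of total variance captured by each principal component of 2-D data."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    # Sample covariance matrix [[a, b], [b, c]]
    a = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix
    center = (a + c) / 2
    radius = sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [center + radius, center - radius]
    total = a + c  # the trace equals the sum of the eigenvalues
    return [value / total for value in eigenvalues]


# Toy data in which the second feature is roughly twice the first
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
ratios = explained_variance_ratio_2d(rows)
print(ratios)
```

Because the two toy features are almost perfectly correlated, nearly all of the variance falls on the first component, which is the situation in which reducing to fewer dimensions loses little information.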

Usage:

from pewanalytics.stats.dimensionality_reduction import get_pca
from sklearn import datasets
import pandas as pd

df = pd.DataFrame(datasets.load_iris().data)

>>> feature_weights, df_reduced  = get_pca(df, k=2)
Decomposition explained variance ratio: 0.977685206318795
Top features:
Component 0: [2 0 3 1]
Component 1: [1 0 3 2]

>>> feature_weights
      pca_0     pca_1
0  0.361387  0.656589
1 -0.084523  0.730161
2  0.856671 -0.173373
3  0.358289 -0.075481

>>> df_reduced.head()
      pca_0     pca_1    pca
0 -2.684126  0.319397  pca_1
1 -2.714142 -0.177001  pca_1
2 -2.888991 -0.144949  pca_1
3 -2.745343 -0.318299  pca_1
4 -2.728717  0.326755  pca_1
get_lsa(features, feature_names=None, k=20)[source]

Performs LSA on a set of features. This function expects input data where the rows are units and columns are features.

For more information about how LSA is implemented, visit the Scikit-Learn Documentation.

Parameters
  • features – A pandas.DataFrame or sparse matrix where rows are units/observations and columns are features

  • feature_names (list) – An optional list of feature names (for sparse matrices)

  • k (int) – Number of dimensions to extract

Returns

A tuple of two pandas.DataFrames: (features x components, documents x components)

Return type

tuple

Usage:

from pewanalytics.stats.dimensionality_reduction import get_lsa
from sklearn import datasets
import pandas as pd

df = pd.DataFrame(datasets.load_iris().data)

>>> feature_weights, df_reduced  = get_lsa(df, k=2)
Decomposition explained variance ratio: 0.9772093692426493
Top features:
Component 0: [0 2 1 3]
Component 1: [1 0 3 2]

>>> feature_weights
      lsa_0     lsa_1
0  0.751108  0.284175
1  0.380086  0.546745
2  0.513009 -0.708665
3  0.167908 -0.343671

>>> df_reduced.head()
      lsa_0     lsa_1    lsa
0  5.912747  2.302033  lsa_0
1  5.572482  1.971826  lsa_0
2  5.446977  2.095206  lsa_0
3  5.436459  1.870382  lsa_0
4  5.875645  2.328290  lsa_0
correspondence_analysis(edges, n=1)[source]

Performs correspondence analysis on a set of features.

Most useful in the context of network analysis. For example, you might wish to identify the underlying dimension in a network of Twitter users by using a matrix representing whether or not they follow one another (when news and political accounts are included, the underlying dimension often appears to approximate the left-right political spectrum).

Parameters
  • edges – A pandas.DataFrame of NxN where both the rows and columns are “nodes” and the values are some sort of closeness or similarity measure (like a cosine similarity matrix)

  • n (int) – The number of dimensions to extract

Returns

A pandas.DataFrame where rows are the units and the columns correspond to the extracted dimensions.

Usage:

from pewanalytics.stats.dimensionality_reduction import correspondence_analysis
import nltk
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("inaugural")
df = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
])

vec = TfidfVectorizer(min_df=10, max_df=.9).fit(df['text'])
tfidf = vec.transform(df['text'])

cosine_similarities = linear_kernel(tfidf)
matrix = pd.DataFrame(cosine_similarities, columns=df['speech'])

# Looks like the main source of variation in the language of inaugural speeches is time!

>>> mca = correspondence_analysis(matrix)

>>> mca.sort_values("mca_1").head()
                node     mca_1
57  1993-Clinton.txt -0.075508
56    2017-Trump.txt -0.068168
55  1997-Clinton.txt -0.061567
54    1973-Nixon.txt -0.060698
53     1989-Bush.txt -0.056305

>>> mca.sort_values("mca_1").tail()
               node     mca_1
4    1877-Hayes.txt  0.040037
3   1817-Monroe.txt  0.040540
2     1845-Polk.txt  0.042847
1   1849-Taylor.txt  0.050937
0  1829-Jackson.txt  0.056201

Inter-Rater Reliability

The pewanalytics.stats.irr submodule contains functions for computing measures of inter-rater reliability and model performance, including Cohen’s Kappa, Krippendorff’s Alpha, Fleiss’ Kappa, precision, and recall.

Functions

compute_overall_scores(coder_df, …)

Computes overall inter-rater reliability scores (Krippendorff’s Alpha and Fleiss’ Kappa).

compute_scores(coder_df, coder1, coder2, …)

Computes a variety of inter-rater reliability scores, including Cohen’s kappa, Krippendorff’s alpha, precision, and recall.

kappa_sample_size_CI(kappa0, kappaL, props)

Helps determine the required document sample size to confirm that Cohen’s Kappa between coders is at or above a minimum threshold.

kappa_sample_size_power(rate1, rate2, k1, k0)

Python translation of the N.cohen.kappa function from the irr R package.

kappa_sample_size_power(rate1, rate2, k1, k0, alpha=0.05, power=0.8, twosided=False)[source]

Python translation of the N.cohen.kappa function from the irr R package.

Source: https://cran.r-project.org/web/packages/irr/irr.pdf

Parameters
  • rate1 (float) – The probability that the first rater will record a positive diagnosis

  • rate2 (float) – The probability that the second rater will record a positive diagnosis

  • k1 (float) – The true Cohen’s Kappa statistic

  • k0 (float) – The value of kappa under the null hypothesis

  • alpha (float) – Type I error of test

  • power (float) – The desired power to detect the difference between true kappa and hypothetical kappa

  • twosided (bool) – Set this to True if the test is two-sided

Returns

Returns the required sample size

Return type

int

kappa_sample_size_CI(kappa0, kappaL, props, kappaU=None, alpha=0.05)[source]

Helps determine the required document sample size to confirm that Cohen’s Kappa between coders is at or above a minimum threshold. Useful in situations where multiple coders code a set of documents for a binary outcome.

This function takes the observed kappa and proportion of positive cases from the sample, along with a lower-bound for the minimum acceptable kappa, and returns the sample size required to confirm that the coders’ agreement is truly above that minimum level of kappa with 95% certainty. If the current sample size is below the required sample size returned by this function, it can provide a rough estimate of how many additional documents need to be coded - assuming that the coders continue agreeing and observing positive cases at the same rate.

Translated from the kappaSize R package, CIBinary: https://github.com/cran/kappaSize/blob/master/R/CIBinary.R

Parameters
  • kappa0 (float) – The preliminary value of kappa

  • kappaL (float) – The desired expected lower bound for a two-sided 100(1 - alpha) % confidence interval for kappa. Alternatively, if kappaU is left as None, the procedure produces the number of required subjects for a one-sided confidence interval

  • props (float) – The anticipated prevalence of the desired trait

  • kappaU (float) – The desired expected upper confidence limit for kappa

  • alpha (float) – The desired type I error rate

Returns

Returns the required sample size

Usage:

from pewanalytics.stats.irr import kappa_sample_size_CI

observed_kappa = 0.8
desired_kappa = 0.7
observed_proportion = 0.5

>>> kappa_sample_size_CI(observed_kappa, desired_kappa, observed_proportion)
140
compute_scores(coder_df, coder1, coder2, outcome_column, document_column, coder_column, weight_column=None, pos_label=None)[source]

Computes a variety of inter-rater reliability scores, including Cohen’s kappa, Krippendorff’s alpha, precision, and recall. The input data must consist of a pandas.DataFrame with the following columns:

  • A column with values that indicate the coder (like a name)

  • A column with values that indicate the document (like an ID)

  • A column with values that indicate the code value

  • (Optional) A column with document weights

This function will return a pandas.DataFrame with agreement scores between the two specified coders.

Parameters
  • coder_df (pandas.DataFrame) – A pandas.DataFrame of codes

  • coder1 (str or int) – The value in coder_column for rows corresponding to the first coder

  • coder2 (str or int) – The value in coder_column for rows corresponding to the second coder

  • outcome_column (str) – The column that contains the codes

  • document_column (str) – The column that contains IDs for the documents

  • coder_column (str) – The column containing values that indicate which coder assigned the code

  • weight_column (str) – The column that contains sampling weights

  • pos_label (str or int) – The value indicating a positive label (optional)

Returns

A dictionary of scores

Return type

dict

Note

If using a multi-class (non-binary) code, some scores may come back null or not compute as expected. We recommend running the function separately for each specific code value as a binary flag by providing each unique value to the pos_label argument. If pos_label is not provided for multi-class codes, this function will attempt to compute scores based on support-weighted averages.
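
To make the binary-flag recommendation concrete, the sketch below collapses a multi-class code into a flag for one pos_label and computes Cohen's kappa by hand. The cohens_kappa_binary helper is hypothetical and written only for illustration; compute_scores itself may compute the statistic differently (for example, with weights). The toy codes match the data in this function's usage example, and the pos_label="2" case reproduces the cohens_kappa of roughly 0.4 reported there.

```python
def cohens_kappa_binary(codes1, codes2, pos_label):
    """Cohen's kappa for two coders after collapsing codes to a binary flag."""
    flags1 = [int(code == pos_label) for code in codes1]
    flags2 = [int(code == pos_label) for code in codes2]
    n = len(flags1)
    # Observed agreement: share of documents where the two flags match
    observed = sum(f1 == f2 for f1, f2 in zip(flags1, flags2)) / n
    # Expected agreement by chance, from each coder's positive rate
    rate1 = sum(flags1) / n
    rate2 = sum(flags2) / n
    expected = rate1 * rate2 + (1 - rate1) * (1 - rate2)
    if expected == 1:
        return None  # kappa is undefined when chance agreement is certain
    return (observed - expected) / (1 - expected)


# One code per document for each coder, in the same document order
coder1 = ["2", "1", "0"]
coder2 = ["2", "2", "0"]
kappas = {label: cohens_kappa_binary(coder1, coder2, label) for label in ["0", "1", "2"]}
print(kappas)
```

Running the flag one value at a time shows where disagreement is concentrated: the coders agree perfectly on "0" but only partially on "2".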

Usage:

from pewanalytics.stats.irr import compute_scores
import pandas as pd

df = pd.DataFrame([
    {"coder": "coder1", "document": 1, "code": "2"},
    {"coder": "coder2", "document": 1, "code": "2"},
    {"coder": "coder1", "document": 2, "code": "1"},
    {"coder": "coder2", "document": 2, "code": "2"},
    {"coder": "coder1", "document": 3, "code": "0"},
    {"coder": "coder2", "document": 3, "code": "0"},
])

>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder")
{'coder1': 'coder1',
 'coder2': 'coder2',
 'n': 3,
 'outcome_column': 'code',
 'pos_label': None,
 'coder1_mean_unweighted': 1.0,
 'coder1_std_unweighted': 0.5773502691896257,
 'coder2_mean_unweighted': 1.3333333333333333,
 'coder2_std_unweighted': 0.6666666666666666,
 'alpha_unweighted': 0.5454545454545454,
 'accuracy': 0.6666666666666666,
 'f1': 0.5555555555555555,
 'precision': 0.5,
 'recall': 0.6666666666666666,
 'precision_recall_min': 0.5,
 'matthews_corrcoef': 0.6123724356957946,
 'roc_auc': None,
 'pct_agree_unweighted': 0.6666666666666666}

>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder", pos_label="0")
{'coder1': 'coder1',
 'coder2': 'coder2',
 'n': 3,
 'outcome_column': 'code',
 'pos_label': '0',
 'coder1_mean_unweighted': 0.3333333333333333,
 'coder1_std_unweighted': 0.3333333333333333,
 'coder2_mean_unweighted': 0.3333333333333333,
 'coder2_std_unweighted': 0.3333333333333333,
 'alpha_unweighted': 1.0,
 'cohens_kappa': 1.0,
 'accuracy': 1.0,
 'f1': 1.0,
 'precision': 1.0,
 'recall': 1.0,
 'precision_recall_min': 1.0,
 'matthews_corrcoef': 1.0,
 'roc_auc': 1.0,
 'pct_agree_unweighted': 1.0}

>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder", pos_label="1")
{'coder1': 'coder1',
 'coder2': 'coder2',
 'n': 3,
 'outcome_column': 'code',
 'pos_label': '1',
 'coder1_mean_unweighted': 0.3333333333333333,
 'coder1_std_unweighted': 0.3333333333333333,
 'coder2_mean_unweighted': 0.0,
 'coder2_std_unweighted': 0.0,
 'alpha_unweighted': 0.0,
 'cohens_kappa': 0.0,
 'accuracy': 0.6666666666666666,
 'f1': 0.0,
 'precision': 0.0,
 'recall': 0.0,
 'precision_recall_min': 0.0,
 'matthews_corrcoef': 1.0,
 'roc_auc': None,
 'pct_agree_unweighted': 0.6666666666666666}

>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder", pos_label="2")
{'coder1': 'coder1',
 'coder2': 'coder2',
 'n': 3,
 'outcome_column': 'code',
 'pos_label': '2',
 'coder1_mean_unweighted': 0.3333333333333333,
 'coder1_std_unweighted': 0.3333333333333333,
 'coder2_mean_unweighted': 0.6666666666666666,
 'coder2_std_unweighted': 0.3333333333333333,
 'alpha_unweighted': 0.4444444444444444,
 'cohens_kappa': 0.3999999999999999,
 'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 0.5,
 'recall': 1.0,
 'precision_recall_min': 0.5,
 'matthews_corrcoef': 0.5,
 'roc_auc': 0.75,
 'pct_agree_unweighted': 0.6666666666666666}
compute_overall_scores(coder_df, document_column, outcome_column, coder_column)[source]

Computes overall inter-rater reliability scores (Krippendorff’s Alpha and Fleiss’ Kappa). Allows for more than two coders and code values. The input data must consist of a pandas.DataFrame with the following columns:

  • A column with values that indicate the coder (like a name)

  • A column with values that indicate the document (like an ID)

  • A column with values that indicate the code value

Parameters
  • coder_df (pandas.DataFrame) – A pandas.DataFrame of codes

  • document_column (str) – The column that contains IDs for the documents

  • outcome_column (str) – The column that contains the codes

  • coder_column (str) – The column containing values that indicate which coder assigned the code

Returns

A dictionary containing the scores

Return type

dict

Usage:

from pewanalytics.stats.irr import compute_overall_scores
import pandas as pd

df = pd.DataFrame([
    {"coder": "coder1", "document": 1, "code": "2"},
    {"coder": "coder2", "document": 1, "code": "2"},
    {"coder": "coder1", "document": 2, "code": "1"},
    {"coder": "coder2", "document": 2, "code": "2"},
    {"coder": "coder1", "document": 3, "code": "0"},
    {"coder": "coder2", "document": 3, "code": "0"},
])

>>> compute_overall_scores(df, "document", "code", "coder")
{'alpha': 0.5454545454545454, 'fleiss_kappa': 0.4545454545454544}
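
The two scores in the usage example above can be reproduced by hand. The following is a minimal sketch of the textbook formulas for nominal codes, assuming every document was coded by the same number of coders (at least two); overall_agreement is a hypothetical helper for illustration and not the library's implementation. Both statistics start from the same observed pairwise agreement and differ only in how chance agreement is estimated.

```python
from collections import Counter, defaultdict


def overall_agreement(rows):
    """Nominal Krippendorff's alpha and Fleiss' kappa from (document, code) pairs."""
    by_document = defaultdict(list)
    for document, code in rows:
        by_document[document].append(code)
    units = list(by_document.values())
    raters = len(units[0])  # assumed constant across documents
    totals = Counter(code for unit in units for code in unit)
    n_values = sum(totals.values())

    def pair_agreement(unit):
        # Share of coder pairs within one document that chose the same code
        counts = Counter(unit)
        return sum(c * (c - 1) for c in counts.values()) / (raters * (raters - 1))

    observed = sum(pair_agreement(unit) for unit in units) / len(units)

    # Fleiss' kappa: chance agreement from squared category proportions
    expected_fleiss = sum((c / n_values) ** 2 for c in totals.values())
    fleiss_kappa = (observed - expected_fleiss) / (1 - expected_fleiss)

    # Krippendorff's alpha (nominal): chance pairs drawn without replacement
    expected_alpha = sum(c * (c - 1) for c in totals.values()) / (n_values * (n_values - 1))
    alpha = (observed - expected_alpha) / (1 - expected_alpha)
    return {"alpha": alpha, "fleiss_kappa": fleiss_kappa}


rows = [(1, "2"), (1, "2"), (2, "1"), (2, "2"), (3, "0"), (3, "0")]
scores = overall_agreement(rows)
print(scores)
```

For this data the sketch yields 6/11 for alpha and 5/11 for Fleiss' kappa, in line with the values shown in the usage example.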

Mutual Information

The pewanalytics.stats.mutual_info submodule provides a function for extracting pointwise mutual information for features in your data based on a binary split into two classes. This can be a great method for identifying features that are most distinctive of one group versus another.

Functions

compute_mutual_info(y, x[, weights, …])

Computes pointwise mutual information for a set of observations partitioned into two groups.

mutual_info_bar_plot(mutual_info[, …])

Takes a mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info(), and generates a bar plot of top features.

mutual_info_scatter_plot(mutual_info[, …])

Takes a mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info(), and generates a scatter plot of top features.

compute_mutual_info(y, x, weights=None, col_names=None, l=0, normalize=True)[source]

Computes pointwise mutual information for a set of observations partitioned into two groups.

Parameters
  • y – An array or, preferably, a pandas.Series

  • x – A matrix, pandas.DataFrame, or preferably a scipy.sparse.csr_matrix

  • weights – (Optional) An array of weights corresponding to each observation

  • col_names (list) – The feature names associated with the columns in matrix ‘x’

  • l (int or float) – An optional Laplace smoothing parameter

  • normalize (bool) – Toggle normalization on or off (to control for feature prevalence), on by default

Returns

A pandas.DataFrame of features with a variety of computed metrics including mutual information.

The function expects y to correspond to a list or series of values indicating which partition an observation belongs to. y must be a binary flag. x is a set of features (either a pandas.DataFrame or sparse matrix) where the rows correspond to observations and the columns represent the presence of features (you can technically run this using non-binary features, but the results will not be as readily interpretable). The function returns a pandas.DataFrame of metrics computed for each feature, including the following columns:

  • MI1: The feature’s mutual information for the positive class

  • MI0: The feature’s mutual information for the negative class

  • total: The total number of times a feature appeared

  • total_pos_with_term: The total number of times a feature appeared in positive cases

  • total_neg_with_term: The total number of times a feature appeared in negative cases

  • total_pos_neg_with_term_diff: The raw difference in the number of times a feature appeared in positive cases relative to negative cases

  • pct_pos_with_term: The proportion of positive cases that had the feature

  • pct_neg_with_term: The proportion of negative cases that had the feature

  • pct_pos_neg_with_term_ratio: A likelihood ratio indicating the degree to which a positive case was more likely to have the feature than a negative case

  • pct_term_pos: Of the cases that had a feature, the proportion that were in the positive class

  • pct_term_neg: Of the cases that had a feature, the proportion that were in the negative class

  • pct_term_pos_neg_diff: The percentage point difference between the proportion of cases with the feature that were positive vs. negative

  • pct_term_pos_neg_ratio: A likelihood ratio indicating the degree to which a feature was more likely to appear in a positive case relative to a negative one (may not be meaningful when classes are imbalanced)

Note

Note that pct_term_pos and pct_term_neg may not be directly comparable if classes are imbalanced, and in such cases a pct_term_pos_neg_diff above zero or pct_term_pos_neg_ratio above 1 may not indicate a true association with the positive class if positive cases outnumber negative ones.

Note

Mutual information can be a difficult metric to explain to others. We’ve found that the pct_pos_neg_with_term_ratio can serve as a more interpretable alternative method for identifying meaningful differences between groups.
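
As a rough sketch of why pct_pos_neg_with_term_ratio is easy to explain, the example below computes it for a single binary feature from plain Python lists. The term_prevalence helper is hypothetical; its keys mirror the column names documented above, but it ignores weights and the smoothing parameter l, so the library's own values may differ.

```python
def term_prevalence(y, x):
    """Compare how often a binary feature appears in positive vs. negative cases."""
    pos_total = sum(1 for label in y if label == 1)
    neg_total = len(y) - pos_total
    pos_with_term = sum(1 for label, flag in zip(y, x) if label == 1 and flag)
    neg_with_term = sum(1 for label, flag in zip(y, x) if label == 0 and flag)
    pct_pos_with_term = pos_with_term / pos_total
    pct_neg_with_term = neg_with_term / neg_total
    return {
        "total_pos_with_term": pos_with_term,
        "total_neg_with_term": neg_with_term,
        "pct_pos_with_term": pct_pos_with_term,
        "pct_neg_with_term": pct_neg_with_term,
        # No smoothing here: this fails if no negative case has the feature
        "pct_pos_neg_with_term_ratio": pct_pos_with_term / pct_neg_with_term,
    }


# 4 positive cases (3 with the feature) and 8 negative cases (2 with the feature)
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
x = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
stats = term_prevalence(y, x)
print(stats["pct_pos_neg_with_term_ratio"])
```

Here the feature appears in 75% of positive cases and 25% of negative cases, so a positive case is three times as likely to have it. Because the ratio compares within-class proportions, it is not distorted by the classes being different sizes.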

Usage:

from pewanalytics.stats.mutual_info import compute_mutual_info
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("inaugural")
df = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
])
df['year'] = df['speech'].map(lambda x: int(x.split("-")[0]))
df['21st_century'] = df['year'].map(lambda x: 1 if x >= 2000 else 0)

vec = TfidfVectorizer(min_df=10, max_df=.9).fit(df['text'])
tfidf = vec.transform(df['text'])

# Here are the terms most distinctive of inaugural addresses in the 21st century vs. years prior

>>> results = compute_mutual_info(df['21st_century'], tfidf, col_names=vec.get_feature_names())

>>> results.sort_values("MI1", ascending=False).index[:25]
Index(['america', 'thank', 'bless', 'schools', 'ideals', 'americans',
       'meaning', 'you', 'move', 'across', 'courage', 'child', 'birth',
       'generation', 'families', 'build', 'hard', 'promise', 'choice', 'women',
       'guided', 'words', 'blood', 'dignity', 'because'],
      dtype='object')
mutual_info_bar_plot(mutual_info, filter_col='MI1', top_n=50, x_col='pct_term_pos_neg_ratio', color='grey', title=None, width=10)[source]

Takes a mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info(), and generates a bar plot of top features. Allows for an easy visualization of feature differences. Can subsequently call plt.show() or plt.savefig() to display or save the plot.

Parameters
  • mutual_info – A mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info()

  • filter_col (str) – The column to use when selecting top features; sorts in descending order and picks the top top_n

  • top_n (int) – The number of features to display

  • x_col (str) – The column by which to sort the final set of top features (after they have been selected by filter_col)

  • color (str) – The color of the bars

  • title (str) – The title of the plot

  • width (int) – The width of the plot

Returns

A Matplotlib figure, which you can display via plt.show() or alternatively save to a file via plt.savefig(FILEPATH)

mutual_info_scatter_plot(mutual_info, filter_col='MI1', top_n=50, x_col='pct_term_pos_neg_ratio', xlabel=None, scale_x_even=True, y_col='MI1', ylabel=None, scale_y_even=True, color='grey', color_col='MI1', size_col='pct_pos_with_term', title=None, figsize=(10, 10))[source]

Takes a mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info(), and generates a scatter plot of top features. The names of the features will be displayed with varying colors and sizes depending on the variables specified in color_col and size_col. Allows for an easy visualization of feature differences. Can subsequently call plt.show() or plt.savefig() to display or save the plot.

Parameters
  • mutual_info – A mutual information table generated by pewanalytics.stats.mutual_info.compute_mutual_info()

  • filter_col (str) – The column to use when selecting top features; sorts in descending order and picks the top top_n

  • top_n (int) – The number of features to display

  • x_col (str) – The column to use as the x-axis

  • xlabel (str) – Label for the x-axis

  • scale_x_even (bool) – If True, set values to their ordered rank (allows for even spacing)

  • y_col (str) – The column to use as the y-axis

  • ylabel (str) – Label for the y-axis

  • scale_y_even (bool) – If True, set values to their ordered rank (allows for even spacing)

  • color (str) – The color for the features

  • color_col (str) – The column to use when shading the features

  • size_col (str) – The column to use to size the features

  • title (str) – The title of the plot

  • figsize (tuple) – The size of the plot, as a (width, height) tuple

Returns

A Matplotlib figure, which you can display via plt.show() or alternatively save to a file via plt.savefig(FILEPATH)

Sampling

The pewanalytics.stats.sampling submodule contains utilities for extracting and weighting samples based on a known sampling frame.

Classes

SampleExtractor(df, id_col[, verbose, seed])

A helper class for extracting samples using various sampling methods.

Functions

compute_balanced_sample_weights(sample, …)

Takes a pandas.DataFrame and one or more column names (weight_vars) and computes weights such that every unique combination of values in the weighting columns is balanced (when weighted, the sums of the observations in each combination will be equal).

compute_sample_weights_from_frame(frame, …)

Takes two pandas.DataFrames and computes sampling weights for the second one, based on the first.

compute_sample_weights_from_frame(frame, sample, weight_vars)[source]

Takes two pandas.DataFrames and computes sampling weights for the second one, based on the first. The first pandas.DataFrame should be equivalent to the population that the second pandas.DataFrame, a sample, was drawn from. Weights will be calculated based on the differences in the distribution of one or more variables specified in weight_vars (these should be the names of columns). Returns a pandas.Series equal in length to the sample with the computed weights.

Parameters
  • frame (pandas.DataFrame) – Must contain all of the columns specified in weight_vars

  • sample (pandas.DataFrame) – Must contain all of the columns specified in weight_vars

  • weight_vars (list) – The names of the columns to use when computing weights.

Returns

A pandas.Series containing the weights for each row in the sample
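
One common way to weight a sample back to its frame is cell weighting: each row's weight is the share of the frame that falls in its combination of weight_vars, divided by the share of the sample in that same combination. The sketch below illustrates that idea with plain dictionaries. It is an assumption about the general technique rather than a description of the library's exact computation, and frame_weights is a hypothetical helper.

```python
from collections import Counter


def frame_weights(frame_rows, sample_rows, weight_vars):
    """Weight each sample row by (frame share) / (sample share) of its cell."""

    def cell(row):
        # A cell is one unique combination of values across the weighting variables
        return tuple(row[var] for var in weight_vars)

    frame_counts = Counter(cell(row) for row in frame_rows)
    sample_counts = Counter(cell(row) for row in sample_rows)
    weights = []
    for row in sample_rows:
        frame_share = frame_counts[cell(row)] / len(frame_rows)
        sample_share = sample_counts[cell(row)] / len(sample_rows)
        weights.append(frame_share / sample_share)
    return weights


# The frame is 20% health == 1, but the sample over-represents it at 50%
frame = [{"health": 1}] * 2 + [{"health": 0}] * 8
sample = [{"health": 1}] * 2 + [{"health": 0}] * 2
weights = frame_weights(frame, sample, ["health"])
print(weights)
```

Over-sampled rows receive weights below 1 and under-sampled rows receive weights above 1, and in this sketch the weights sum to the sample size.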

Usage:

from pewanalytics.stats.sampling import compute_sample_weights_from_frame
import nltk
import pandas as pd

nltk.download("inaugural")
frame = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
])
# Let's grab a sample of speeches - some that mention specific terms, and an additional random sample
frame['economy'] = frame['text'].str.contains("economy").astype(int)
frame['health'] = frame['text'].str.contains("health").astype(int)
frame['immigration'] = frame['text'].str.contains("immigration").astype(int)
frame['education'] = frame['text'].str.contains("education").astype(int)
sample = pd.concat([
    frame[frame['economy']==1].sample(5),
    frame[frame['health']==1].sample(5),
    frame[frame['immigration']==1].sample(5),
    frame[frame['education']==1].sample(5),
    frame.sample(5)
])
# Now we can get the sampling weights to adjust it back to the population based on those variables

>>> sample['weight'] = compute_sample_weights_from_frame(frame, sample, ["economy", "health", "immigration", "education"])
>>> sample
               speech                                               text  economy  health  immigration  education  count    weight
7     1817-Monroe.txt  I should be destitute of feeling if I was not ...        1       1            0          0      1  1.005747
11   1833-Jackson.txt  Fellow citizens, the will of the American peop...        1       0            0          0      1  2.370690
34  1925-Coolidge.txt  My countrymen, no one can contemplate curre...           1       0            1          1      1  0.344828
35    1929-Hoover.txt  My Countrymen: This occasion is not alone the ...        1       1            0          1      1  0.538793
28  1901-McKinley.txt  My fellow-citizens, when we assembled here on ...        1       0            0          0      1  2.370690
compute_balanced_sample_weights(sample, weight_vars, weight_column=None)[source]

Takes a pandas.DataFrame and one or more column names (weight_vars) and computes weights such that every unique combination of values in the weighting columns is balanced (when weighted, the sums of the observations in each combination will be equal). Useful for balancing important groups in training datasets, etc.

Parameters
  • sample (pandas.DataFrame) – Must contain all of the columns specified in weight_vars

  • weight_vars (list) – The names of the columns to use when computing weights.

  • weight_column (str) – An optional column containing existing weights, which can be factored into the new weights.

Returns

A pandas.Series containing the weights for each row in the sample

Note

All weight variables must be binary flags (1 or 0); if you want to weight using a non-binary variable, you should convert it into a set of dummy variables and then pass those in as multiple columns.
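
The balancing idea can be sketched in a few lines: give every combination of the weighting variables the same weighted total, then split that total evenly among the rows in the combination. The balanced_weights helper below is hypothetical and ignores the optional weight_column; for two Republican rows and three Democratic rows it gives 1.25 and roughly 0.833, the same values shown in this function's usage example.

```python
from collections import Counter


def balanced_weights(rows, weight_vars):
    """Weights that give every combination of weight_vars the same weighted total."""

    def cell(row):
        return tuple(row[var] for var in weight_vars)

    counts = Counter(cell(row) for row in rows)
    # Each combination should sum to an equal share of the total row count
    target = len(rows) / len(counts)
    return [target / counts[cell(row)] for row in rows]


rows = [
    {"is_republican": 1},
    {"is_republican": 1},
    {"is_republican": 0},
    {"is_republican": 0},
    {"is_republican": 0},
]
weights = balanced_weights(rows, ["is_republican"])
print(weights)
```

After weighting, each party sums to 2.5, so a model trained with these weights would treat the two groups as equally important despite their different sizes.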

Usage:

from pewanalytics.stats.sampling import compute_balanced_sample_weights
import pandas as pd

# Let's say we have a set of tweets from members of Congress
df = pd.DataFrame([
    {"politician_id": 1, "party": "R", "tweet": "Example document"},
    {"politician_id": 1, "party": "R", "tweet": "Example document"},
    {"politician_id": 2, "party": "D", "tweet": "Example document"},
    {"politician_id": 2, "party": "D", "tweet": "Example document"},
    {"politician_id": 3, "party": "D", "tweet": "Example document"},
])
df['is_republican'] = (df['party']=="R").astype(int)

# We can balance the parties like so:

>>> df['weight'] = compute_balanced_sample_weights(df, ["is_republican"])

>>> df
   politician_id party             tweet  is_rep    weight  is_republican
0              1     R  Example document       1  1.250000              1
1              1     R  Example document       1  1.250000              1
2              2     D  Example document       0  0.833333              0
3              2     D  Example document       0  0.833333              0
4              3     D  Example document       0  0.833333              0
class SampleExtractor(df, id_col, verbose=False, seed=None)[source]

A helper class for extracting samples using various sampling methods.

Parameters
  • df (pandas.DataFrame) – The sampling frame

  • id_col (str) – Column in the pandas.DataFrame to be used as the unique ID of observations

  • verbose (bool) – Whether or not to print information during the sampling process (default=False)

  • seed (int) – Random seed (optional)

Methods

extract(sample_size[, sampling_strategy, …])

Extracts a sample from a pandas.DataFrame using one of several sampling strategies.

extract(sample_size, sampling_strategy='random', stratify_by=None)[source]

Extract a sample from a pandas.DataFrame using one of the following methods:

  • all: Returns all of the IDs

  • random: Returns a random sample

  • stratify: Proportional stratification, method from Kish, Leslie. “Survey sampling.” (1965). Chapter 4.

  • stratify_even: Samples evenly from each stratum (the result will not be representative of the frame)

  • stratify_guaranteed: Proportional stratification, but the sample is guaranteed to contain at least one observation from each stratum (if the sample size is small and/or there are many small strata, the resulting sample may be far from representative)

Parameters
  • sample_size (int) – The desired size of the sample

  • sampling_strategy (str) – The method to be used to extract samples. Options are: all, random, stratify, stratify_even, stratify_guaranteed

  • stratify_by (str, list) – Optional name of a column or list of columns in the pandas.DataFrame to stratify on

Returns

A list of IDs reflecting the observations selected from the pandas.DataFrame during sampling

Return type

list
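
To illustrate what proportional stratification means in practice, the sketch below allocates the sample across strata in proportion to their size and then draws randomly within each stratum. It uses largest-remainder rounding as a simplifying assumption and is not the Kish procedure that the stratify strategy implements; proportional_stratified_sample is a hypothetical helper, and it assumes the sample size does not exceed the number of rows.

```python
import random
from collections import Counter, defaultdict


def proportional_stratified_sample(rows, id_key, strata_key, sample_size, seed=None):
    """Draw IDs so that each stratum's share of the sample matches its share of rows."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[strata_key]].append(row[id_key])

    # Ideal (fractional) allocation for each stratum, rounded down to start
    quotas = {key: sample_size * len(ids) / len(rows) for key, ids in strata.items()}
    allocation = {key: int(quota) for key, quota in quotas.items()}
    # Hand leftover slots to the strata with the largest fractional remainders
    leftover = sample_size - sum(allocation.values())
    by_remainder = sorted(quotas, key=lambda key: quotas[key] - allocation[key], reverse=True)
    for key in by_remainder[:leftover]:
        allocation[key] += 1

    selected = []
    for key, ids in strata.items():
        selected.extend(rng.sample(ids, allocation[key]))
    return selected


# Ten documents: five in stratum A, three in B, two in C
rows = [{"id": i, "group": group} for i, group in enumerate("AAAAABBBCC")]
sample_ids = proportional_stratified_sample(rows, "id", "group", sample_size=4, seed=42)
group_by_id = {row["id"]: row["group"] for row in rows}
print(Counter(group_by_id[i] for i in sample_ids))
```

With four slots, the ideal allocation is 2.0, 1.2, and 0.8, which rounds to two, one, and one. This also shows why small strata can end up with no observations under purely proportional allocation when the sample is small, which is the case the stratify_guaranteed strategy addresses.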

Usage:

from pewanalytics.stats.sampling import SampleExtractor
import nltk
import pandas as pd

nltk.download("inaugural")
frame = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
])
frame["century"] = frame['speech'].map(lambda x: "{}00".format(x.split("-")[0][:2]))

>>> sampler = SampleExtractor(frame, "speech", seed=42)

>>> sample_index = sampler.extract(12, sampling_strategy="random")
>>> frame[frame["speech"].isin(sample_index)]['century'].value_counts()
1900    6
1800    5
1700    1
Name: century, dtype: int64

>>> sample_index = sampler.extract(12, sampling_strategy="stratify", stratify_by=['century'])
>>> frame[frame["speech"].isin(sample_index)]['century'].value_counts()
1800    5
1900    5
2000    1
1700    1
Name: century, dtype: int64

>>> sample_index = sampler.extract(12, sampling_strategy="stratify_even", stratify_by=['century'])
>>> frame[frame["speech"].isin(sample_index)]['century'].value_counts()
1800    3
2000    3
1700    3
1900    3
Name: century, dtype: int64

>>> sample_index = sampler.extract(12, sampling_strategy="stratify_guaranteed", stratify_by=['century'])
>>> frame[frame["speech"].isin(sample_index)]['century'].value_counts()
1900    5
1800    4
1700    2
2000    1
Name: century, dtype: int64