Examples
Sampling
The pewanalytics.stats.sampling
module has several useful tools for extracting samples and computing sampling weights. Given a sampling frame stored in a pandas.DataFrame
, you can draw a sample using a variety of different sampling methods, and then compute sampling weights for any combination of one or more binary variables.
from pewanalytics.stats.sampling import SampleExtractor, compute_sample_weights_from_frame
import nltk
import pandas as pd
nltk.download("inaugural")
frame = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
])
# Let's set some flags that we'll use for sampling
frame['economy'] = frame['text'].str.contains("economy").astype(int)
frame['health'] = frame['text'].str.contains("health").astype(int)
frame['immigration'] = frame['text'].str.contains("immigration").astype(int)
frame['education'] = frame['text'].str.contains("education").astype(int)
# Now we can grab a sample of speeches, stratifying by these variables
# This will ensure that our sample contains speeches that mention each term
stratification_variables = ["economy", "health", "immigration", "education"]
extractor = SampleExtractor(frame, "speech")
sample_index = extractor.extract(
    10,
    sampling_strategy="stratify",
    stratify_by=stratification_variables
)
sample = frame[frame['speech'].isin(sample_index)].copy()
>>> sample[stratification_variables].sum()
economy 5
health 3
immigration 1
education 4
dtype: int64
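For context, it can also be useful to check how common each flag is across the full frame; stratifying is what ensures that even the rarer categories, like immigration, still make it into the sample. This is just a standard pandas check, not a pewanalytics function:
# Share of all speeches in the frame that mention each term
print(frame[stratification_variables].mean())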
# Now we can get sampling weights to adjust our sample back to the population based on our stratification variables
sample['weight'] = compute_sample_weights_from_frame(
    frame,
    sample,
    stratification_variables
)
>>> sample
speech text \
2 1797-Adams.txt When it was first perceived, in early times, t...
10 1829-Jackson.txt Fellow citizens, about to undertake the arduou...
17 1857-Buchanan.txt Fellow citizens, I appear before you this day ...
18 1861-Lincoln.txt Fellow-Citizens of the United States: In compl...
27 1897-McKinley.txt Fellow citizens, In obedience to the will of t...
37 1937-Roosevelt.txt When four years ago we met to inaugurate a Pre...
46 1973-Nixon.txt Mr. Vice President, Mr. Speaker, Mr. Chief Jus...
49 1985-Reagan.txt Senator Mathias, Chief Justice Burger, Vice Pr...
51 1993-Clinton.txt My fellow citizens, today we celebrate the mys...
52 1997-Clinton.txt My fellow citizens: At this last presidential ...
economy health immigration education count weight
2 0 0 0 0 1 0.919540
10 1 0 0 0 1 0.948276
17 0 0 0 0 1 0.919540
18 0 0 0 0 1 0.919540
27 1 0 1 1 1 0.689655
37 0 0 0 1 1 0.862069
46 0 1 0 1 1 0.172414
49 1 0 0 0 1 0.948276
51 1 1 0 0 1 1.206897
52 1 1 0 1 1 0.862069
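To see what the weights are doing, we can compare each flag's share in the full frame to its weighted share in the sample; once the weights are applied, the two should be roughly aligned. This is an illustrative check using ordinary pandas rather than a pewanalytics function:
# Compare each flag's population share to its weighted share in the sample
for var in stratification_variables:
    population_share = frame[var].mean()
    weighted_share = (sample[var] * sample['weight']).sum() / sample['weight'].sum()
    print(var, round(population_share, 3), round(weighted_share, 3))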
Inter-rater reliability
The pewanalytics.stats.irr
module has several useful functions for computing a wide variety of inter-rater reliability and model performance metrics. The pewanalytics.stats.irr.compute_scores()
function provides a one-stop shop for assessing agreement between two classifiers, whether you're comparing human coders or machine learning models.
from pewanalytics.stats.irr import compute_scores
import pandas as pd
# Let's create a DataFrame with some fake classification decisions. We'll make one with two coders,
# three documents, and three possible codes
df = pd.DataFrame([
    {"coder": "coder1", "document": 1, "code": "2"},
    {"coder": "coder2", "document": 1, "code": "2"},
    {"coder": "coder1", "document": 2, "code": "1"},
    {"coder": "coder2", "document": 2, "code": "2"},
    {"coder": "coder1", "document": 3, "code": "0"},
    {"coder": "coder2", "document": 3, "code": "0"},
])
# To get overall average performance metrics, we can pass the DataFrame in like so:
>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder")
{'coder1': 'coder1',
'coder2': 'coder2',
'n': 3,
'outcome_column': 'code',
'pos_label': None,
'coder1_mean_unweighted': 1.0,
'coder1_std_unweighted': 0.5773502691896257,
'coder2_mean_unweighted': 1.3333333333333333,
'coder2_std_unweighted': 0.6666666666666666,
'alpha_unweighted': 0.5454545454545454,
'accuracy': 0.6666666666666666,
'f1': 0.5555555555555555,
'precision': 0.5,
'recall': 0.6666666666666666,
'precision_recall_min': 0.5,
'matthews_corrcoef': 0.6123724356957946,
'roc_auc': None,
'pct_agree_unweighted': 0.6666666666666666}
# And if we want to get the scores for a specific code/label (comparing it against all other possible codes)
# Then we can specify it using the `pos_label` keyword argument:
>>> compute_scores(df, "coder1", "coder2", "code", "document", "coder", pos_label="1")
{'coder1': 'coder1',
'coder2': 'coder2',
'n': 3,
'outcome_column': 'code',
'pos_label': '1',
'coder1_mean_unweighted': 0.3333333333333333,
'coder1_std_unweighted': 0.3333333333333333,
'coder2_mean_unweighted': 0.0,
'coder2_std_unweighted': 0.0,
'alpha_unweighted': 0.0,
'cohens_kappa': 0.0,
'accuracy': 0.6666666666666666,
'f1': 0.0,
'precision': 0.0,
'recall': 0.0,
'precision_recall_min': 0.0,
'matthews_corrcoef': 1.0,
'roc_auc': None,
'pct_agree_unweighted': 0.6666666666666666}
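If you want a full per-label breakdown rather than a single label, one convenient pattern is to loop over the observed codes and collect the results into a DataFrame; this is just a wrapper around the same compute_scores call shown above:
# Collect one row of scores per code by re-running compute_scores with each pos_label
per_label_scores = pd.DataFrame([
    compute_scores(df, "coder1", "coder2", "code", "document", "coder", pos_label=label)
    for label in sorted(df["code"].unique())
])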
Cleaning text
When working with text data, pre-processing is an essential first step. The pewanalytics.text module contains a wide range of tools for working with text, among them the pewanalytics.text.TextCleaner class, which provides a variety of pre-processing options for cleaning your text.
from pewanalytics.text import TextCleaner
text = """
<body>
Here's some example text.</br>It isn't a great example, but it'll do.
Of course, there are plenty of other examples we could use though.
http://example.com
</body>
"""
>>> cleaner = TextCleaner(process_method="stem")
>>> cleaner.clean(text)
'exampl is_not great exampl cours plenti exampl could use though'
>>> cleaner = TextCleaner(process_method="stem", stopwords=["my_custom_stopword"], strip_html=True)
>>> cleaner.clean(text)
'here some exampl is_not great exampl but will cours there are plenti other exampl could use though'
>>> cleaner = TextCleaner(process_method="lemmatize", strip_html=True)
>>> cleaner.clean(text)
'example is_not great example course plenty example could use though'
>>> cleaner = TextCleaner(process_method="lemmatize", remove_urls=False, strip_html=True)
>>> cleaner.clean(text)
'example text is_not great example course plenty example could use though http example com'
>>> cleaner = TextCleaner(process_method="stem", strip_html=False)
>>> cleaner.clean(text)
'example text is_not great example course plenty example could use though http example com'
>>> cleaner = TextCleaner(process_method="stem", filter_pos=["JJ"], strip_html=True)
>>> cleaner.clean(text)
'great though'
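The cleaner can also be applied across an entire corpus. For example, assuming a DataFrame with a "text" column like the ones used elsewhere on this page, you can simply map the clean method over the column:
# Clean every document in a DataFrame column (assumes a DataFrame `df` with a "text" column)
cleaner = TextCleaner(process_method="lemmatize", strip_html=True)
df["clean_text"] = df["text"].map(cleaner.clean)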
The TextDataFrame class
In some of the following examples, we’ll be making use of the pewanalytics.text.TextDataFrame
class, which provides a variety of useful functions for working with a Pandas DataFrame that contains a column of text that you want to analyze. To set up a TextDataFrame
, you just need to pass a DataFrame and specify the name of the column that contains the text. The TextDataFrame
will automatically convert your corpus into a TF-IDF representation; you can pass additional keyword arguments to control this vectorization process, which get forwarded to a Scikit-Learn TfidfVectorizer
class. In the following examples, we’ll be using a TextDataFrame
containing inaugural speeches:
from pewanalytics.text import TextDataFrame
import pandas as pd
import nltk
nltk.download("inaugural")
df = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
])
# Let's remove newline characters so the output prints cleanly
df['text'] = df['text'].str.replace("\n", " ")
# And now let's create some additional variables to group our data
df['year'] = df['speech'].map(lambda x: int(x.split("-")[0]))
df['21st_century'] = df['year'].map(lambda x: 1 if x >= 2000 else 0)
# And we'll also create some artificial duplicates in the dataset
df = pd.concat([df, df.tail(2)]).reset_index()
# We'll use this TextDataFrame in a variety of examples below
tdf = TextDataFrame(df, "text", stop_words="english", ngram_range=(1, 2))
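Under the hood, this vectorization step is roughly equivalent to fitting a scikit-learn TfidfVectorizer on the text column with the same keyword arguments. If you want the document-by-document similarities yourself, a minimal sketch that bypasses the TextDataFrame and uses scikit-learn directly looks like this:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
# Vectorize the speeches with the same options we passed to the TextDataFrame
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(df["text"])
# Pairwise cosine similarities between all documents
similarities = linear_kernel(tfidf_matrix, tfidf_matrix)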
Finding repeating fragments
When working with text, it can sometimes be useful to identify repeating segments of text that occur in multiple documents. For example, you might be interested in identifying common phrases, or you might want to find common boilerplate text to clear out in order to facilitate more accurate document comparison. In these cases, pewanalytics provides several functions that can help.
The TextOverlapExtractor
can identify overlaps between two pieces of text:
from pewanalytics.text import TextOverlapExtractor
text1 = "This is a sentence. This is another sentence. And a third sentence. And yet a fourth sentence."
text2 = "This is a different sentence. This is another sentence. And a third sentence. But the fourth \
sentence is different too."
>>> extractor = TextOverlapExtractor()
>>> extractor.get_text_overlaps(text1, text2, min_length=10, tokenize=False)
[' sentence. This is another sentence. And a third sentence. ', ' fourth sentence']
>>> extractor.get_text_overlaps(text1, text2, min_length=10, tokenize=True)
['This is another sentence.', 'And a third sentence.']
If you want to apply this function at scale, you can make use of the TextDataFrame
to search for repeating fragments of text that occur across a large corpus. This function uses the TextOverlapExtractor
, which tokenizes your text into complete sentences by default. In our example, there aren’t any unique sentences that recur, but we can disable tokenization to get raw overlapping segments of text like so:
>>> tdf.extract_corpus_fragments(scan_top_n_matches_per_doc=20, min_fragment_length=25, tokenize=False)
['s. Equal and exact justice ',
'd by the General Government',
' of the American people, ',
'ent of the United States ',
' the office of President of the United States ',
' preserve, protect, and defend the Constitution of the United States." ',
' to "preserve, protect, and defend',
' of the United States are ',
'e of my countrymen I am about to ',
'Vice President, Mr. Chief Justice, ',
' 200th anniversary as a nation',
', and my fellow citizens: ',
'e United States of America']
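If the goal is to clear out boilerplate before comparing documents, the fragments that come back can simply be stripped out of the text again. This follow-up uses plain pandas rather than a library feature:
# Remove each recurring fragment from a copy of the text column
fragments = tdf.extract_corpus_fragments(scan_top_n_matches_per_doc=20, min_fragment_length=25, tokenize=False)
cleaned_text = df["text"].copy()
for fragment in fragments:
    cleaned_text = cleaned_text.str.replace(fragment, " ", regex=False)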
Finding duplicates
Text corpora also often contain duplicates that we want to remove prior to analysis. To efficiently identify these duplicates, the TextDataFrame
provides a two-step function that uses TF-IDF to identify potential duplicate pairs, which are then filtered down using more precise Levenshtein ratios:
>>> tdf.find_duplicates()
[ speech text year
56 2013-Obama.txt Thank you. Thank you so much. Vice Presiden... 2013
56 2013-Obama.txt Thank you. Thank you so much. Vice Presiden... 2013
21st_century
56 1
56 1 ,
speech text year
57 2017-Trump.txt Chief Justice Roberts, President Carter, Presi... 2017
57 2017-Trump.txt Chief Justice Roberts, President Carter, Presi... 2017
21st_century
57 1
57 1 ]
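Once the duplicates have been identified, a straightforward way to handle them before further analysis is to drop repeated texts from the frame; again, this is ordinary pandas rather than a pewanalytics function:
# Keep only the first copy of each repeated text
deduped = df.drop_duplicates(subset="text", keep="first")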
Mutual information
Pointwise mutual information can be an enormously useful tool for identifying words and phrases that distinguish one group of documents from another. The pewanalytics.stats.mutual_info
module contains a mutual_info
function for computing mutual information along with a variety of other ratios that identify features that distinguish between two different sets of observations. While you can run this function on any set of features, it’s particularly informative when working with text data. Accordingly, the TextDataFrame
has a shortcut function that allows you to easily run mutual information on your corpus. In this example, we can find the phrases that most distinguish 21st century inaugural speeches from those given in prior years:
results = tdf.mutual_info("21st_century")
# Pointwise mutual information for our positive class is stored in the "MI1" column
>>> results.sort_values("MI1", ascending=False).index[:25]
Index(['journey complete', 'jobs', 'make america', 've', 'obama', 'workers',
'xand', 'states america', 'america best', 'debates', 'clinton',
'president clinton', 'trillions', 'stops right', 'transferring',
'president obama', 'stops', 'protected protected', 'transferring power',
'nation capital', 'american workers', 'politicians', 'people believe',
'borders', 'victories'],
dtype='object')
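The same table can be read from the other direction. Assuming the column naming mirrors the positive class (we take "MI0" to hold pointwise mutual information for the negative class, by analogy with "MI1"), the terms most associated with the earlier speeches can be pulled out the same way:
# Terms most distinctive of pre-21st-century speeches; the "MI0" column name is
# an assumption based on the "MI1" naming convention above
top_negative_terms = results.sort_values("MI0", ascending=False).index[:25]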
Topic modeling
Just like the TextDataFrame
, pewanalytics
also provides a wrapper class for training a variety of different topic models. The pewanalytics.text.topics.TopicModel
class accepts a Pandas DataFrame and the name of a text column, and allows you to train and apply Gensim, Scikit-Learn, and Corex topic models using a standardized interface:
from pewanalytics.text.topics import TopicModel
import pandas as pd
import nltk
nltk.download("inaugural")
df = pd.DataFrame([
{"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
])
>>> model = TopicModel(df, "text", "sklearn_nmf", num_topics=5, min_df=25, max_df=.5, use_tfidf=False)
Initialized sklearn_nmf topic model with 3285 features
1600 training documents, 400 testing documents
>>> model.fit()
>>> model.print_topics()
0: bad, really, know, don, plot, people, scene, movies, action, scenes
1: star, trek, star trek, effects, wars, star wars, special, special effects, movies, series
2: jackie, films, chan, jackie chan, hong, master, drunken, action, tarantino, brown
3: life, man, best, characters, new, love, world, little, does, great
4: alien, series, aliens, characters, films, television, files, quite, mars, action
>>> doc_topics = model.get_document_topics(df)
>>> doc_topics
topic_0 topic_1 topic_2 topic_3 topic_4
0 0.723439 0.000000 0.000000 0.000000 0.000000
1 0.289801 0.050055 0.000000 0.000000 0.000000
2 0.375149 0.000000 0.030691 0.059088 0.143679
3 0.152961 0.010386 0.000000 0.121412 0.015865
4 0.294005 0.100426 0.000000 0.137630 0.051241
... ... ... ... ... ...
1995 0.480983 0.070431 0.135178 0.256951 0.000000
1996 0.139986 0.000000 0.000000 0.107430 0.000000
1997 0.141545 0.005990 0.081986 0.387859 0.057025
1998 0.029228 0.023342 0.043713 0.280877 0.107551
1999 0.044863 0.000000 0.000000 0.718677 0.000000
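Since the document-topic matrix is an ordinary DataFrame, follow-up steps are plain pandas. For example, to tag each document with its highest-scoring topic:
# Label each document with the topic on which it loads most heavily
df["top_topic"] = doc_topics.idxmax(axis=1)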