pewanalytics.text: Text Tools

In the pewanalytics.text module, you’ll find a variety of utilities for working with text data.

General Text Processing Tools

The main pewanalytics.text module contains a variety of general tools for processing text.

Functions:

has_fragment(text, fragment)

Checks whether a substring ("fragment") is contained within a larger string ("text").

remove_fragments(text, fragments[, ...])

Iteratively remove fragments from a string.

filter_parts_of_speech(text[, filter_pos, ...])

Retains words associated with the specified parts of speech if exclude=False; if exclude=True, removes them instead.

get_fuzzy_ratio(text1, text2[, throw_loud_fail])

Uses Levenshtein Distance to calculate similarity of two strings.

get_fuzzy_partial_ratio(text1, text2[, ...])

Useful for calculating the similarity of two strings that are of noticeably different lengths.

is_probable_stopword(word)

Determine whether a word is likely to be a stopword (like the name of a person or location) based on a set of heuristic rules.

Classes:

SentenceTokenizer([base_tokenizer, ...])

Initializes a tokenizer that can be used to break text into tokens using the tokenize function

TextOverlapExtractor([tokenizer])

A helper class designed to identify overlapping sections between two strings.

TextCleaner([process_method, processor, ...])

A class for cleaning up text in preparation for NLP and other analysis.

TextDataFrame(df, text_column, ...)

This is a class full of functions for working with dataframes of documents.

has_fragment(text, fragment)[source]

Checks whether a substring (“fragment”) is contained within a larger string (“text”). Uses the pewtils.decode_text() function to process both the text and the fragment when running this check.

Parameters
  • text (str) – The text to search

  • fragment (str) – The fragment to search for

Returns

Whether or not the text contains the fragment

Return type

bool

Usage:

from pewanalytics.text import has_fragment

text = "testing one two three"

>>> has_fragment(text, "one two")
True

>>> has_fragment(text, "four")
False
remove_fragments(text, fragments, throw_loud_fail=False)[source]

Iteratively remove fragments from a string.

Parameters
  • text (str) – The text to remove the fragments from

  • fragments (list) – A list of string fragments to search for and remove

  • throw_loud_fail (bool) – Whether or not to raise an error if text decoding fails (default=False)

Returns

The original string, minus any parts that matched the fragments provided

Return type

str

Usage:

from pewanalytics.text import remove_fragments

text = "testing one two three"

>>> remove_fragments(text, ["one two"])
"testing  three"

>>> remove_fragments(text, ["testing", "three"])
" one two "
filter_parts_of_speech(text, filter_pos=None, exclude=False)[source]

Retains words associated with the specified parts of speech if exclude=False; if exclude=True, removes words associated with those parts of speech instead. The default parts of speech are nouns (NN), proper nouns (NNP), and adjectives (JJ).

Parameters
  • text (str) – The string to process

  • filter_pos (list) – Array of part of speech tags (default is ‘NN’, ‘NNP’, and ‘JJ’)

  • exclude (bool) – If True, the function will remove words that match the specified parts of speech; by default (False), the function retains only the matching words.

Returns

A string comprised solely of words that matched (or did not match) to the specified parts of speech, depending on the value of exclude

Return type

str

Usage:

from pewanalytics.text import filter_parts_of_speech

text = "This is a very exciting sentence that can serve as a functional example"

>>> filter_parts_of_speech(text, filter_pos=["NN"])
'sentence example'

>>> filter_parts_of_speech(text, filter_pos=["JJ"], exclude=True)
'This is a very sentence that can serve as a example'
get_fuzzy_ratio(text1, text2, throw_loud_fail=False)[source]

Uses Levenshtein Distance to calculate similarity of two strings. Measures how the edit distance compares to the overall length of the texts. Uses the fuzzywuzzy library in Python 2, and the rapidfuzz library in Python 3.

Parameters
  • text1 (str) – First string

  • text2 (str) – Second string

  • throw_loud_fail (bool) – Whether or not to raise an error if text decoding fails (default=False)

Returns

The Levenshtein ratio between the two strings

Return type

float

Usage:

from pewanalytics.text import get_fuzzy_ratio

text1 = "This is a sentence."
text2 = "This is a slightly difference sentence."

>>> get_fuzzy_ratio(text1, text2)
64.28571428571428
get_fuzzy_partial_ratio(text1, text2, throw_loud_fail=False, timeout=5)[source]

Useful for calculating the similarity of two strings that are of noticeably different lengths. Allows for the possibility that one text is a subset of the other; finds the largest overlap and computes the Levenshtein ratio on that.

Parameters
  • text1 (str) – First string

  • text2 (str) – Second string

  • timeout (int) – The number of seconds to wait before giving up

  • throw_loud_fail (bool) – Whether or not to raise an error if text decoding fails (default=False)

Returns

The partial Levenshtein ratio between the two texts

Return type

float

Usage:

from pewanalytics.text import get_fuzzy_partial_ratio

text1 = "This is a sentence."
text2 = "This is a sentence, but with more text."

>>> get_fuzzy_partial_ratio(text1, text2)
100.0
class SentenceTokenizer(base_tokenizer=None, regex_split_trailing=None, regex_split_leading=None)[source]

Initializes a tokenizer that can be used to break text into tokens using the tokenize function

Parameters
  • base_tokenizer – The tokenizer to use (default = NLTK’s English Punkt tokenizer)

  • regex_split_trailing – A compiled regex object used to define the end of sentences

  • regex_split_leading – A compiled regex object used to define the beginning of sentences

Usage:

from pewanalytics.text import SentenceTokenizer
import re

text = "This is a sentence. This is another sentence - and maybe a third sentence. And yet a fourth sentence."

>>> tokenizer = SentenceTokenizer()
>>> tokenizer.tokenize(text)
['This is a sentence.',
 'This is another sentence - and maybe a third sentence.',
 'And yet a fourth sentence.']

>>> tokenizer = SentenceTokenizer(regex_split_leading=re.compile(r"\-"))
>>> tokenizer.tokenize(text)
['This is a sentence.',
 'This is another sentence',
 'and maybe a third sentence.',
 'And yet a fourth sentence.']

Methods:

tokenize(text[, throw_loud_fail, min_length])

Tokenizes the text.

tokenize(text, throw_loud_fail=False, min_length=None)[source]

Tokenizes the text.

Parameters
  • text (str) – The text to tokenize

  • throw_loud_fail (bool) – Whether or not to raise an error if text decoding fails (default=False)

  • min_length (int) – The minimum acceptable length of a sentence (if a token is shorter than this, it will be considered part of the preceding sentence) (default=None)

Returns

A list of sentences

Return type

list
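
A hedged sketch of the min_length option (not part of the original usage example above; the exact output is not shown because it depends on the tokenizer):

from pewanalytics.text import SentenceTokenizer

text = "This is a sentence. Short. And yet a fourth sentence."

tokenizer = SentenceTokenizer()
# With min_length=10, a very short token like "Short." should be folded into the
# preceding sentence rather than returned as a separate item
tokens = tokenizer.tokenize(text, min_length=10)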

class TextOverlapExtractor(tokenizer=None)[source]

A helper class designed to identify overlapping sections between two strings.

Parameters

tokenizer – The tokenizer to use (default = SentenceTokenizer())

Methods:

get_text_overlaps(text1, text2[, ...])

Extracts all overlapping segments of at least min_length characters between the two texts.

get_largest_overlap(text1, text2)

Returns the largest overlapping segment of text between the two texts (this doesn't use the tokenizer).

get_text_overlaps(text1, text2, min_length=20, tokenize=True)[source]

Extracts all overlapping segments of at least min_length characters between the two texts. If tokenize=True then only tokens that appear fully in both texts will be extracted; see the usage example below.

Parameters
  • text1 (str) – A piece of text

  • text2 (str) – Another piece of text to compare against the first

  • min_length (int) – The minimum size of the overlap to be considered (number of characters)

  • tokenize (bool) – If True, overlapping segments will only be included if they consist of atomic tokens; overlaps that consist of only part of a token will be excluded. By default, the text is tokenized into sentences based on punctuation. (default=True)

Returns

A list of all of the identified overlapping segments

Return type

list

Usage:

from pewanalytics.text import TextOverlapExtractor

text1 = "This is a sentence. This is another sentence. And a third sentence. And yet a fourth sentence."
text2 = "This is a different sentence. This is another sentence. And a third sentence. But the fourth             sentence is different too."

>>> extractor = TextOverlapExtractor()

>>> extractor.get_text_overlaps(text1, text2, min_length=10, tokenize=False)
[' sentence. This is another sentence. And a third sentence. ', ' fourth sentence']

>>> extractor.get_text_overlaps(text1, text2, min_length=10, tokenize=True)
['This is another sentence.', 'And a third sentence.']
get_largest_overlap(text1, text2)[source]

Returns the largest overlapping segment of text between the two texts (this doesn’t use the tokenizer).

Parameters
  • text1 (str) – A piece of text

  • text2 (str) – Another piece of text to compare against the first

Returns

The largest substring that occurs in both texts

Return type

str

Usage:

from pewanalytics.text import TextOverlapExtractor

text1 = "Overlaping section, unique text another overlapping section"
text2 = "Overlapping section, another overlapping section"


>>> extractor = TextOverlapExtractor()

>>> extractor.get_largest_overlap(text1, text2)
' another overlapping section'
class TextCleaner(process_method='lemmatize', processor=None, filter_pos=None, lowercase=True, remove_urls=True, replacers=None, stopwords=None, strip_html=False, tokenizer=None, throw_loud_fail=False)[source]

A class for cleaning up text in preparation for NLP and other analysis. Attempts to decode the text.

This class performs the following cleaning tasks, in sequence:

  • Removes HTML tags (optional)

  • Decodes the text

  • Filters out specified parts of speech (optional)

  • Converts text to lowercase (optional)

  • Removes URLs (optional)

  • Expands contractions

  • Removes stopwords

  • Lemmatizes or stems (optional)

  • Removes words shorter than three characters

  • Removes punctuation

  • Consolidates whitespace

Parameters
  • process_method (str) – Options are “lemmatize”, “stem”, or None (default = “lemmatize”)

  • processor – A lemmatizer or stemmer with a “lemmatize” or “stem” function (default for process_method=”lemmatize” is nltk.WordNetLemmatizer(); default for process_method=”stem” is nltk.SnowballStemmer())

  • filter_pos (list) – A list of WordNet parts-of-speech tags to keep; if provided, all other words will be removed (default = None)

  • lowercase (bool) – Whether or not to lowercase the string (default = True)

  • remove_urls (bool) – Whether or not to remove URLs and links from the text (default = True)

  • replacers (list) – A list of tuples, each with a regex pattern followed by the string/pattern to replace them with. Anything passed here will be used in addition to a set of built-in replacement patterns for common contractions.

  • stopwords (set) – The set of stopwords to remove (default = nltk.corpus.stopwords.words(‘english’) combined with sklearn.feature_extraction.stop_words.ENGLISH_STOP_WORDS). If an empty list is passed, no stopwords will be used.

  • strip_html (bool) – Whether or not to remove contents wrapped in HTML tags (default = False)

  • tokenizer – Tokenizer to use (default = nltk.WhitespaceTokenizer())

  • throw_loud_fail (bool) – Whether or not to raise an error if text decoding fails (default=False)

Usage:

from pewanalytics.text import TextCleaner

text = "<body>             Here's some example text.</br>It isn't a great example, but it'll do.             Of course, there are plenty of other examples we could use though.             http://example.com             </body>"

>>> cleaner = TextCleaner(process_method="stem")
>>> cleaner.clean(text)
'exampl is_not great exampl cours plenti exampl could use though'

>>> cleaner = TextCleaner(process_method="stem", stopwords=["my_custom_stopword"], strip_html=True)
>>> cleaner.clean(text)
'here some exampl is_not great exampl but will cours there are plenti other exampl could use though'

>>> cleaner = TextCleaner(process_method="lemmatize", strip_html=True)
>>> cleaner.clean(text)
'example is_not great example course plenty example could use though'

>>> cleaner = TextCleaner(process_method="lemmatize", remove_urls=False, strip_html=True)
>>> cleaner.clean(text)
'example text is_not great example course plenty example could use though http example com'

>>> cleaner = TextCleaner(process_method="stem", strip_html=False)
>>> cleaner.clean(text)
'example text is_not great example course plenty example could use though http example com'

>>> cleaner = TextCleaner(process_method="stem", filter_pos=["JJ"], strip_html=True)
>>> cleaner.clean(text)
'great though'
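
The replacers option is not demonstrated above. A minimal, hedged sketch (assuming each tuple is a regex pattern string followed by its replacement, as described in the parameter list):

from pewanalytics.text import TextCleaner

# Hypothetical custom replacement: expand the shorthand "w/" to "with" before the
# built-in contraction patterns and the other cleaning steps run
cleaner = TextCleaner(process_method=None, replacers=[(r"w/", "with")], strip_html=True)
cleaned = cleaner.clean("An example cleaned w/ a custom replacement pattern.")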

Methods:

clean(text)

Cleans the text.

clean(text)[source]

Cleans the text.

Parameters

text (str) – The string to clean

Returns

The cleaned string

Return type

str

class TextDataFrame(df, text_column, **vectorizer_kwargs)[source]

This is a class full of functions for working with dataframes of documents. It contains utilities for identifying potential duplicates, identifying recurring segments of text, computing metrics like mutual information, extracting clusters of documents, and more.

Given a pandas.DataFrame and the name of the column that contains the text to be analyzed, the TextDataFrame will automatically produce a TF-IDF sparse matrix representation of the text upon initialization. All other parameters are passed along to the scikit-learn TfidfVectorizer.

Tip

For more info on the parameters it accepts, refer to the official scikit-learn TfidfVectorizer documentation.

Parameters
  • df – A pandas.DataFrame of documents. Must contain a column with text.

  • text_column (str) – The name of the column in the pandas.DataFrame that contains the text

  • vectorizer_kwargs – All remaining keyword arguments are passed to TfidfVectorizer

Usage:

from pewanalytics.text import TextDataFrame
import pandas as pd
import nltk

nltk.download("inaugural")
df = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
])
# Let's remove new line characters so we can print the output in the docstrings
df['text'] = df['text'].str.replace("\n", " ")

# And now let's create some additional variables to group our data
df['year'] = df['speech'].map(lambda x: int(x.split("-")[0]))
df['21st_century'] = df['year'].map(lambda x: 1 if x >= 2000 else 0)

# And we'll also create some artificial duplicates in the dataset
df = pd.concat([df, df.tail(2)]).reset_index()

>>> tdf = TextDataFrame(df, "text", stop_words="english", ngram_range=(1, 2))
>>> tdf_dense = pd.DataFrame(tdf.tfidf.todense(), columns=tdf.vectorizer.get_feature_names_out()).head(5)
>>> tdf_dense.loc[:, (tdf_dense != 0).any(axis=0)]
            14th        14th day         abandon  abandon government... zeal inspires   zeal purity     zeal rely       zeal wisdom
0       0.034014        0.034014        0.000000               0.000000 ...      0.000000          0.000000      0.000000          0.000000
1       0.000000        0.000000        0.000000               0.000000 ...      0.000000          0.000000      0.000000          0.000000
2       0.000000        0.000000        0.000000               0.000000 ...      0.000000          0.000000      0.000000          0.000000
3       0.000000        0.000000        0.020984               0.030686 ...      0.000000          0.000000      0.030686          0.000000
4       0.000000        0.000000        0.000000               0.000000 ...      0.026539          0.026539      0.000000          0.026539

Methods:

search_corpus(text)

Compares the provided text against the documents in the corpus and returns the most similar documents.

match_text_to_corpus(match_list[, ...])

Takes a list of text values and attempts to match them to the documents in the pandas.DataFrame.

extract_corpus_fragments([...])

Iterate over the corpus pandas.DataFrame and, for each document, scan the most similar other documents in the corpus using TF-IDF cosine similarity.

find_duplicates([tfidf_threshold, ...])

Search for duplicates by using cosine similarity and Levenshtein ratios.

find_related_keywords(keyword[, n])

Given a particular keyword, looks for related terms in the corpus using mutual information.

mutual_info(outcome_col[, weight_col, ...])

A wrapper around pewanalytics.stats.mutual_info.compute_mutual_info()

kmeans_clusters([k])

A wrapper around pewanalytics.stats.clustering.compute_kmeans_clusters().

hdbscan_clusters([min_cluster_size, min_samples])

A wrapper around pewanalytics.stats.clustering.compute_hdbscan_clusters().

top_cluster_terms(cluster_col[, min_size, top_n])

Extracts the top terms for each cluster, based on a column of cluster IDs saved to self.corpus, using mutual information.

pca_components([k])

A wrapper around pewanalytics.stats.dimensionality_reduction.get_pca().

lsa_components([k])

A wrapper around pewanalytics.stats.dimensionality_reduction.get_lsa().

get_top_documents([component_prefix, top_n])

Use after running pewanalytics.text.TextDataFrame.pca_components() or pewanalytics.text.TextDataFrame.lsa_components().

make_word_cooccurrence_matrix([normalize, ...])

Use to produce word co-occurrence matrices.

make_document_cooccurrence_matrix([normalize])

Use to produce document co-occurrence matrices.

search_corpus(text)[source]

Compares the provided text against the documents in the corpus and returns the most similar documents. A new column called ‘search_cosine_similarity’ is generated, which is used to sort and return the pandas.DataFrame.

Parameters

text (str) – The text to compare documents against

Returns

The corpus pandas.DataFrame sorted by cosine similarity

Usage:

>>> tdf.search_corpus('upright zeal')[:5]
                                                text        search_cosine_similarity
4   Proceeding, fellow citizens, to that qualifica...       0.030856
8   Fellow citizens, I shall not attempt to descri...       0.025041
9   In compliance with an usage coeval with the ex...       0.024922
27  Fellow citizens, In obedience to the will of t...       0.021272
10  Fellow citizens, about to undertake the arduou...       0.014791
match_text_to_corpus(match_list, allow_multiple=False, min_similarity=0.9)[source]

Takes a list of text values and attempts to match them to the documents in the pandas.DataFrame. Each document will be matched to the value in the list to which it is most similar, based on cosine similarity.

Parameters
  • match_list (str) – A list of strings (other documents) to be matched to documents in the pandas.DataFrame

  • allow_multiple (bool) – If set to True, each document in your corpus will be matched with its closest valid match in the list. If set to False (default), documents in the list will only be matched to their best match in the corpus.

  • min_similarity (float) – Minimum cosine similarity required for any match to be made.

Returns

Your corpus pandas.DataFrame, with new columns match_text, match_index, and cosine_similarity

Usage:
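
# test_excerpt is assumed here to be a predefined list of strings (for example,
# well-known excerpts from inaugural addresses) to match against the corpus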

>>> match_df = tdf.match_text_to_corpus(test_excerpt, min_similarity=0.05)
>>> match_df.sort_values('cosine_similarity')[:2]
                                                 text                                              match_text       match_index     cosine_similarity
48  Senator Hatfield, Mr. Chief Justice, Mr. Presi...       In this present crisis, government is not the ...       1               0.0699283
43  Vice President Johnson, Mr. Speaker, Mr. Chief...       And so, my fellow Americans: ask not what your...       0               0.166681
extract_corpus_fragments(scan_top_n_matches_per_doc=20, min_fragment_length=15, tokenize=True, tokenizer=None)[source]

Iterate over the corpus pandas.DataFrame and, for each document, scan the most similar other documents in the corpus using TF-IDF cosine similarity. During each comparison, overlapping fragments are identified. This can be useful for identifying common boilerplate sentences, repeated paragraphs, etc. By default, the text is tokenized into complete sentences (so only complete sentences that recur will be returned), but you can set tokenize=False to get raw segments of text that occur multiple times.

Parameters
  • scan_top_n_matches_per_doc (int) – The number of other documents to compare each document against.

  • min_fragment_length (int) – The minimum character length a fragment must have to be extracted.

  • tokenize (bool) – If True, overlapping segments will only be included if they consist of atomic tokens; overlaps that consist of only part of a token will be excluded. Uses sentence tokenization by default. (default=True)

  • tokenizer (object) – The tokenizer to use, if tokenizing isn’t disabled (default = SentenceTokenizer())

Returns

A list of fragments that were found.

Note

This function will skip over duplicates if they exist in your data; it only compares documents that have less than .997 cosine similarity.

Usage:

>>> tdf.extract_corpus_fragments(scan_top_n_matches_per_doc=20, min_fragment_length=25, tokenize=False)
['s. Equal and exact justice ',
 'd by the General Government',
 ' of the American people, ',
 'ent of the United States ',
 ' the office of President of the United States ',
 ' preserve, protect, and defend the Constitution of the United States."  ',
 ' to "preserve, protect, and defend',
 ' of the United States are ',
 'e of my countrymen I am about to ',
 'Vice President, Mr. Chief Justice, ',
 ' 200th anniversary as a nation',
 ', and my fellow citizens: ',
 'e United States of America']
find_duplicates(tfidf_threshold=0.9, fuzzy_ratio_threshold=90, allow_partial=False, max_partial_difference=40, filter_function=None, partial_ratio_timeout=5, decode_text=False)[source]

Search for duplicates by using cosine similarity and Levenshtein ratios. This will struggle with large corpora, so we recommend trying to filter down to potential duplicates first. The corpus will first be scanned for document pairs with a cosine similarity greater or equal to the tfidf_threshold. Then, each of these pairs will be compared using the more stringent fuzzy_ratio_threshold.

Parameters
  • tfidf_threshold (float) – Minimum cosine similarity for two documents to be considered potential dupes.

  • fuzzy_ratio_threshold (int) – The required Levenshtein ratio to consider two documents duplicates.

  • allow_partial (bool) – Whether or not to allow a partial ratio (if False, absolute ratios will be used)

  • max_partial_difference (int) – The maximum partial ratio difference allowed for a potential duplicate pair

  • filter_function – An optional function that allows for more complex filtering. The function must accept the following parameters: text1, text2, cosine_similarity, fuzzy_ratio. Must return True or False, indicating whether the two documents should be considered duplicates.

  • partial_ratio_timeout (int) – How long, in seconds, that the partial ratio is allowed to compute

  • decode_text (bool) – Whether to decode the text prior to making comparisons

Returns

A list of lists, containing groups of duplicate documents (represented as rows from the corpus pandas.DataFrame)

Usage:

>>> tdf.find_duplicates()
[           speech                                               text  year
56  2013-Obama.txt  Thank you. Thank you so much.    Vice Presiden...  2013
56  2013-Obama.txt  Thank you. Thank you so much.    Vice Presiden...  2013

    21st_century
56             1
56             1  ,
            speech                                               text  year
57  2017-Trump.txt  Chief Justice Roberts, President Carter, Presi...  2017
57  2017-Trump.txt  Chief Justice Roberts, President Carter, Presi...  2017

    21st_century
57             1
57             1  ]

find_related_keywords(keyword[, n])[source]

Given a particular keyword, looks for related terms in the corpus using mutual information.

Parameters
  • keyword (str) – The keyword to use

  • n (int) – Number of related terms to return

Returns

Terms associated with the keyword

Return type

list

Usage:

>>> tdf.find_related_keywords("war")[:2]
['war', 'peace']

>>> tdf.find_related_keywords("economy")[:2]
['economy', 'expenditures']
mutual_info(outcome_col, weight_col=None, sample_size=None, l=0, normalize=True)[source]

A wrapper around pewanalytics.stats.mutual_info.compute_mutual_info()

Parameters
  • outcome_col (str) – The name of the column with the binary outcome variable

  • weight_col (str) – (Optional) Name of the column to use in weighting

  • sample_size (int) – (Optional) If provided, a random sample of this size will be used instead of the full pandas.DataFrame

  • l (float) – An optional Laplace smoothing parameter

  • normalize (bool) – Toggle normalization on or off (to control for feature prevalence), on by default

Returns

A pandas.DataFrame of ngrams and various metrics about them, including mutual information

Usage:

>>> results = tdf.mutual_info('21st_century')
>>> results.sort_values("MI1", ascending=False).index[:25]
Index(['journey complete', 'jobs', 'make america', 've', 'obama', 'workers',
       'xand', 'states america', 'america best', 'debates', 'clinton',
       'president clinton', 'trillions', 'stops right', 'transferring',
       'president obama', 'stops', 'protected protected', 'transferring power',
       'nation capital', 'american workers', 'politicians', 'people believe',
       'borders', 'victories'],
       dtype='object')
kmeans_clusters(k=10)[source]

A wrapper around pewanalytics.stats.clustering.compute_kmeans_clusters(). Will compute clusters of documents. The resulting cluster IDs for each document are saved in the TextDataFrame’s corpus in a new column called “kmeans”.

Parameters

k (int) – The number of clusters to extract

Usage:

>>> tdf.kmeans_clusters(5)
KMeans: n_clusters 5, score is 0.019735248210503934
KMeans clusters saved to self.corpus['kmeans']

>>> df['kmeans'].value_counts()
2    26
3    15
4    11
0     5
1     3
Name: kmeans, dtype: int64
hdbscan_clusters(min_cluster_size=100, min_samples=1)[source]

A wrapper around pewanalytics.stats.clustering.compute_hdbscan_clusters(). Will compute clusters of documents. The resulting cluster IDs for each document are saved in the TextDataFrame’s corpus in a new column called “hdbscan”.

Parameters
  • min_cluster_size (int) – The minimum number of documents that a cluster must contain.

  • min_samples (int) – An HDBSCAN parameter; refer to the documentation for more information

Usage:

>>> tdf.hdbscan_clusters(min_cluster_size=10)
HDBSCAN: n_clusters 2
HDBSCAN clusters saved to self.corpus['hdbscan']
top_cluster_terms(cluster_col, min_size=50, top_n=10)[source]

Extracts the top terms for each cluster, based on a column of cluster IDs saved to self.corpus, using mutual information. Returns the top_n terms for each cluster.

Parameters
  • cluster_col (str) – The name of the column that contains the document cluster IDs

  • min_size (int) – Ignore clusters that have fewer than this number of documents

  • top_n (int) – The number of top terms to identify for each cluster

Returns

A dictionary; keys are the cluster IDs and values are the top terms for the cluster

Return type

dict

Usage:

>>> df_top_cluster = tdf.top_cluster_terms('kmeans', min_size=10)
Cluster #2, 26 documents: ['constitution' 'union' 'states' 'friendly' 'liberal' 'revenue'
 'general government' 'confederacy' 'whilst' 'authorities']
Cluster #4, 10 documents: ['shall strive' 'let sides' 'woe' 'offenses' 'breeze' 'war let'
 'nuclear weapons' 'learned live' 'mistakes' 'mr speaker']
Cluster #0, 12 documents: ['activities' 'realization' 'interstate' 'wished' 'industrial' 'major'
 'counsel action' 'conditions' 'natural resources' 'eighteenth amendment']
pca_components(k=20)[source]

A wrapper around pewanalytics.stats.dimensionality_reduction.get_pca(). Saves the PCA components to self.corpus as new columns (‘pca_1’, ‘pca_2’, etc.), saves the top component for each document as self.corpus[‘pca’], and returns the features-component matrix.

Parameters

k (int) – Number of dimensions to extract

Returns

A pandas.DataFrame of (features x components)

Usage:

>>> df_pca = tdf.pca_components(2)
Decomposition explained variance ratio: 0.07488529151231405
Component 0: ['america' 'today' 'americans' 'world' 'new' 'freedom' 'thank' 'nation'
 'god' 'journey']
Component 1: ['america' 'make america' 'dreams' 'protected' 'obama' 'borders'
 'factories' 'american' 'transferring' 'stops']
Top PCA dimensions saved as clusters to self.corpus['pca']

>>> df.sample(5)
                 speech                                                  text       year    21st_century        pca_0      pca_1      pca
0   1789-Washington.txt     Fellow-Citizens of the Senate and of the House...       1789    0                   -0.129094   0.016984        pca_1
21  1873-Grant.txt      Fellow-Citizens: Under Providence I have been ...   1873    0                   -0.097430   0.009559        pca_1
49  1985-Reagan.txt         Senator Mathias, Chief Justice Burger, Vice Pr...       1985    0                   0.163833    -0.020259       pca_0
2   1797-Adams.txt          When it was first perceived, in early times, t...       1797    0                   -0.140250   0.024844        pca_1
20  1869-Grant.txt          Citizens of the United States:    Your suffrag...       1869    0                   -0.114444   0.014419        pca_1
lsa_components(k=20)[source]

A wrapper around pewanalytics.stats.dimensionality_reduction.get_lsa(). Saves the LSA components to self.corpus as new columns (‘lsa_1’, ‘lsa_2’, etc.), saves the top component for each document as self.corpus[‘lsa’], and returns the features-component matrix

Parameters

k (int) – Number of dimensions to extract

Returns

A pandas.DataFrame of (features x components)

Usage:

>>> df_lsa = tdf.lsa_components(2)
Decomposition explained variance ratio: 0.04722850124656694
Top features:
Component 0: ['government' 'people' 'america' 'states' 'world' 'nation' 'shall'
 'country' 'great' 'peace']
Component 1: ['america' 'today' 'americans' 'world' 'new' 'freedom' 'thank' 'nation'
 'god' 'journey']
Top LSA dimensions saved as clusters to self.corpus['lsa_'] columns

>>> df.sample(5)
                speech                                                 text    year 21st_century    lsa_0      lsa_1          lsa
37  1937-Roosevelt.txt    When four years ago we met to inaugurate a Pre...    1937            0 0.293068   0.040802        lsa_0
8   1821-Monroe.txt       Fellow citizens, I shall not attempt to descri...    1821            0 0.348465   -0.212382       lsa_0
7   1817-Monroe.txt       I should be destitute of feeling if I was not ...    1817            0 0.369249   -0.237231       lsa_0
26  1893-Cleveland.txt    My Fellow citizens, in obedience of the mandat...    1893            0 0.275778   -0.128497       lsa_0
59  2017-Trump.txt        Chief Justice Roberts, President Carter, Presi...    2017            1 0.342111   0.511687        lsa_1
get_top_documents(component_prefix='cluster', top_n=5)[source]

Use after running pewanalytics.text.TextDataFrame.pca_components() or pewanalytics.text.TextDataFrame.lsa_components(). Returns the top_n documents with the highest scores for each component.

Parameters
  • component_prefix (str) – ‘lsa’ or ‘pca’ (you must first run pca_components or lsa_components)

  • top_n (int) – Number of documents to return for each component

Returns

A dictionary where keys are the component, and values are the text values for the component’s top_n documents

Return type

dict

Usage:

>>> lsa_topdoc = tdf.get_top_documents("lsa")
>>> {key: len(value) for key, value in lsa_topdoc.items()}
{'lsa_0': 5, 'lsa_1': 4}

>>> lsa_topdoc['lsa_1'][0]
'Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: Thank you.  We, the citizens of America...'
make_word_cooccurrence_matrix(normalize=False, min_frequency=10, max_frequency=0.5)[source]

Use to produce word co-occurrence matrices. Based on a helpful StackOverflow post: https://stackoverflow.com/questions/35562789/how-do-i-calculate-a-word-word-co-occurrence-matrix-with-sklearn

Parameters
  • normalize (bool) – If True, will be normalized

  • min_frequency (int) – The minimum document frequency required for a term to be included

  • max_frequency (float) – The maximum proportion of documents that may contain a term for the term to be included

Returns

A matrix of (terms x terms) whose values indicate the number of documents in which two terms co-occurred

Usage:

import numpy as np

>>> wcm = tdf.make_word_cooccurrence_matrix(min_frequency=25, normalize=True)
# Find the top cooccurring pair of words
>>> wcm.stack().index[np.argmax(wcm.values)]
('protection', 'policy')
make_document_cooccurrence_matrix(normalize=False)[source]

Use to produce document co-occurrence matrices. Based on a helpful StackOverflow post: https://stackoverflow.com/questions/35562789/how-do-i-calculate-a-word-word-co-occurrence-matrix-with-sklearn

Parameters

normalize (bool) – If True, will be normalized

Returns

A matrix of (documents x documents) whose values indicate the number of terms they had in common

Usage:

>>> dcm = tdf.make_document_cooccurrence_matrix(normalize=True)

# Remove artificial duplicates and insert document names
>>> dcm = dcm.iloc[:-2, :-2]
>>> dcm.rename(columns=df['speech'][:-2],
               index=df['speech'][:-2],
               inplace=True)

# Find documents with the highest co-occurrence score
>>> dcm.stack().index[np.argmax(dcm.values)]
('1793-Washington.txt', '1841-Harrison.txt')
is_probable_stopword(word)[source]

Determine if a word is likely to be a stop word (like a name of a person or location) by the following rules:

  1. The number of synsets (groups of words with similar meaning) is less than 3

  2. The min_depth (number of edges between a word and the top of the hierarchy) is > 5

  3. The number of lemmas (similar to dictionary definitions of the term) is less than 2

If more than one of these conditions is true, the word likely has little or no common English meaning and is probably just a proper name, so the function will return True; otherwise it returns False.

This function was developed through trial and error, and your mileage may vary. It’s intended to help you identify potential stopwords when extracting features from a database. For example, on one of our projects we wanted to remove names from our text data, and pulled a list of names from our database of politicians. However, some politicians have last names that are also common English words, like “White” and “Black” - and in those cases, we didn’t want to add those to our list of stopwords. This function was useful in scanning through our list of names to identify names that we wanted to “whitelist”.

Parameters

word (string) – A word, usually a name of a person or location or something that you might want to add as a stopword

Returns

Whether or not the word is (probably) a stopword aka a proper noun with no common English meaning

Return type

bool

Usage:

>>> is_probable_stopword("Chicago")
True

>>> is_probable_stopword("Orange")
False

>>> is_probable_stopword("Johnny")
True
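
A hedged sketch of the name-screening workflow described above (the list of names here is hypothetical):

from pewanalytics.text import is_probable_stopword

# Hypothetical list of last names pulled from a database of politicians
names = ["White", "Black", "Johnson", "Obama"]

# Keep only the names that are safe to treat as stopwords; names that double as
# common English words should return False and be left out for manual review
safe_stopwords = [name for name in names if is_probable_stopword(name)]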

Date Extraction

The pewanalytics.text.dates submodule contains a helper class for extracting dates from text.

Classes:

DateFinder([preprocessing_patterns])

A helper class to search for dates in text using a series of regular expressions and a parser from dateutil.

class DateFinder(preprocessing_patterns=None)[source]

A helper class to search for dates in text using a series of regular expressions and a parser from dateutil. Verifies that dateutil did not auto-fill missing values in the date. Time information will be automatically cleared out, but you can also pass a list of additional regular expression patterns (as strings) to define other patterns that should be cleared out before scanning for dates.

Parameters

preprocessing_patterns (list) – Optional list of additional patterns to clear out prior to searching for dates.

Usage:

from pewanalytics.text.dates import DateFinder
import datetime

text = "January 1, 2018 and 02/01/2019 and Mar. 1st 2020"
low_bound = datetime.datetime(2017, 1, 1)
high_bound = datetime.datetime(2021, 1, 1)

>>> finder = DateFinder()
>>> dates = finder.find_dates(text, low_bound, high_bound)
>>> dates
[
    (datetime.datetime(2018, 1, 1, 0, 0), 'January 1, 2018 '),
    (datetime.datetime(2020, 3, 1, 0, 0), 'Mar. 1st 2020'),
    (datetime.datetime(2019, 2, 1, 0, 0), '02/01/2019 ')
]

Methods:

find_dates(text, cutoff_date_start, ...)

Return all of the dates (in text form and as datetime) in the text variable that fall within the specified window of dates (inclusive).

find_dates(text, cutoff_date_start, cutoff_date_end)[source]

Return all of the dates (in text form and as datetime) in the text variable that fall within the specified window of dates (inclusive).

Parameters
  • text (str) – The text to scan for dates

  • cutoff_date_start (datetime.date) – No dates will be returned if they fall before this date

  • cutoff_date_end (datetime.date) – No dates will be returned if they fall after this date

Returns

A list of tuples containing (datetime object, raw date text)

Return type

list
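
The preprocessing_patterns option is not shown in the example above. A hedged sketch (the pattern and the example string are hypothetical):

import datetime

from pewanalytics.text.dates import DateFinder

# Clear out score-like strings such as "34-12" before scanning, so they are not
# misread as dates
finder = DateFinder(preprocessing_patterns=[r"\d+\-\d+"])
dates = finder.find_dates(
    "The measure passed 34-12 on January 1, 2018.",
    datetime.datetime(2017, 1, 1),
    datetime.datetime(2021, 1, 1),
)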

Named Entity Recognition

The pewanalytics.text.ner submodule contains a helper class for extracting named entities from text.

Classes:

NamedEntityExtractor([method])

A wrapper around NLTK and SpaCy for named entity extraction.

class NamedEntityExtractor(method='spacy')[source]

A wrapper around NLTK and SpaCy for named entity extraction. May be expanded to include more libraries in the future.

Parameters

method (str) – Specify the library to use when extracting entities. Options are ‘nltk’, ‘spacy’, ‘all’. If ‘all’ is selected, both libraries will be used and the union will be returned. (Default=’spacy’)

Usage:

from pewanalytics.text.ner import NamedEntityExtractor
import nltk

nltk.download("inaugural")
fileid = nltk.corpus.inaugural.fileids()[0]
text = nltk.corpus.inaugural.raw(fileid)

>>> ner = NamedEntityExtractor(method="nltk")
>>> ner.extract(text)
{
    'ORGANIZATION': [
        'Parent', 'Invisible Hand', 'Great Author', 'House', 'Constitution', 'Senate',
        'Human Race', 'Representatives'
    ],
    'PERSON': ['Almighty Being'],
    'GPE': ['Heaven', 'United States', 'American']
}

>>> ner = NamedEntityExtractor(method="spacy")
>>> ner.extract(text)
{
    'ORGANIZATION': ['House of Representatives', 'Senate', 'Parent of the Human Race'],
    'DATE': ['present month', 'every day', '14th day', 'years'],
    'ORDINAL': ['first', 'fifth'],
    'GPE': ['United States'],
    'NORP': ['republican', 'American'],
    'LAW': ['Constitution']
}

>>> ner = NamedEntityExtractor(method="all")
>>> ner.extract(text)
{
    'ORGANIZATION': [
        'Representatives', 'Great Author', 'House', 'Parent', 'House of Representatives',
        'Parent of the Human Race', 'Invisible Hand', 'Human Race', 'Senate', 'Constitution'
    ],
    'PERSON': ['Almighty Being'],
    'GPE': ['Heaven', 'United States', 'American'],
    'DATE': ['every day', 'present month', '14th day', 'years'],
    'ORDINAL': ['first', 'fifth'],
    'NORP': ['republican', 'American'],
    'LAW': ['Constitution']
}

Methods:

extract(text)

Extracts named entities from a string, returning a dictionary of entities organized by category.

extract(text)[source]
Parameters

text (str) – a string from which to extract named entities

Returns

dictionary of entities organized by their category

Return type

dict

Topic Modeling

The pewanalytics.text.topics submodule contains a standardized class for training and applying topic models using several different libraries.

Classes:

TopicModel(df, text_col, method[, ...])

A wrapper around various topic modeling algorithms and libraries, intended to provide a standardized way to train and apply models.

class TopicModel(df, text_col, method, num_topics=None, max_ngram_size=2, holdout_pct=0.25, use_tfidf=False, **vec_kwargs)[source]

A wrapper around various topic modeling algorithms and libraries, intended to provide a standardized way to train and apply models. When you initialize a TopicModel, it will fit a vectorizer, and split the data into a train and test set if holdout_pct is provided. For more information about the available implementations, refer to the documentation for the fit() method below.

Parameters
  • df – A pandas.DataFrame

  • text_col (str) – Name of the column containing text

  • method (str) – The topic model implementation to use. Options are: sklearn_lda, sklearn_nmf, gensim_lda, gensim_hdp, corex

  • num_topics (int) – The number of topics to extract. Required for every method except gensim_hdp.

  • max_ngram_size (int) – Maximum ngram size (2=bigrams, 3=trigrams, etc)

  • holdout_pct (float) – Proportion of the documents to hold out for goodness-of-fit scoring

  • use_tfidf (bool) – Whether to use binary counts or a TF-IDF representation

  • vec_kwargs – All remaining arguments get passed to TfidfVectorizer or CountVectorizer

Usage:

from pewanalytics.text.topics import TopicModel

import nltk
import pandas as pd
nltk.download("movie_reviews")
reviews = [{"fileid": fileid, "text": nltk.corpus.movie_reviews.raw(fileid)} for fileid in nltk.corpus.movie_reviews.fileids()]
df = pd.DataFrame(reviews)

>>> model = TopicModel(df, "text", "sklearn_nmf", num_topics=5, min_df=25, max_df=.5, use_tfidf=False)
Initialized sklearn_nmf topic model with 3285 features
1600 training documents, 400 testing documents

>>> model.fit()

>>> model.print_topics()
0: bad, really, know, don, plot, people, scene, movies, action, scenes
1: star, trek, star trek, effects, wars, star wars, special, special effects, movies, series
2: jackie, films, chan, jackie chan, hong, master, drunken, action, tarantino, brown
3: life, man, best, characters, new, love, world, little, does, great
4: alien, series, aliens, characters, films, television, files, quite, mars, action

>>> doc_topics = model.get_document_topics(df)

>>> doc_topics
       topic_0   topic_1   topic_2   topic_3   topic_4
0     0.723439  0.000000  0.000000  0.000000  0.000000
1     0.289801  0.050055  0.000000  0.000000  0.000000
2     0.375149  0.000000  0.030691  0.059088  0.143679
3     0.152961  0.010386  0.000000  0.121412  0.015865
4     0.294005  0.100426  0.000000  0.137630  0.051241
...        ...       ...       ...       ...       ...
1995  0.480983  0.070431  0.135178  0.256951  0.000000
1996  0.139986  0.000000  0.000000  0.107430  0.000000
1997  0.141545  0.005990  0.081986  0.387859  0.057025
1998  0.029228  0.023342  0.043713  0.280877  0.107551
1999  0.044863  0.000000  0.000000  0.718677  0.000000

Methods:

get_features(df[, keep_sparse])

Uses the trained vectorizer to process a pandas.DataFrame and return a feature matrix.

get_fit_params(**kwargs)

Internal helper function to set defaults depending on the specified model.

fit([df])

Fits a model using the method specified when initializing the TopicModel.

get_score()

Returns goodness-of-fit scores for certain models, based on the holdout documents.

get_document_topics(df, **kwargs)

Takes a pandas.DataFrame and returns a document-topic pandas.DataFrame (rows=documents, columns=topics)

get_topics([include_weights, top_n])

Returns a list, equal in length to the number of topics, where each item is a list of words or word-weight tuples.

print_topics([include_weights, top_n])

Prints the top words for each topic from a trained model.

get_features(df, keep_sparse=False)[source]

Uses the trained vectorizer to process a pandas.DataFrame and return a feature matrix.

Parameters
  • df – The pandas.DataFrame to vectorize (must have self.text_col in it)

  • keep_sparse (bool) – Whether or not to keep the feature matrix in sparse format (default=False)

Returns

A pandas.DataFrame of features or a sparse matrix, depending on the value of keep_sparse
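
A minimal sketch, reusing the sklearn_nmf model and DataFrame from the class-level example above:

# Dense pandas.DataFrame of features (rows = documents, columns = ngrams)
features = model.get_features(df)

# Keep the scipy sparse matrix instead, which can save memory on larger corpora
sparse_features = model.get_features(df, keep_sparse=True)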

get_fit_params(**kwargs)[source]

Internal helper function to set defaults depending on the specified model.

Parameters

kwargs – Arguments passed to self.fit()

Returns

Arguments to pass to the model

fit(df=None, **kwargs)[source]

Fits a model using the method specified when initializing the TopicModel. Details on model-specific parameters are below:

sklearn_lda

Fits a model using sklearn.decomposition.LatentDirichletAllocation. For more information on available parameters, please refer to the official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

Parameters
  • df – The pandas.DataFrame to train the model on (must contain self.text_col)

  • alpha – Represents document-topic density. When values are higher, documents will be comprised of more topics; when values are lower, documents will be primarily comprised of only a few topics. This parameter is used instead of the doc_topic_prior sklearn parameter, and will be passed along to sklearn using the formula: doc_topic_prior = alpha / num_topics

  • beta – Represents topic-word density. When values are higher, topics will be comprised of more words; when values are lower, only a few words will be loaded onto each topic. This parameter is used instead of the topic_word_prior sklearn parameter, and will be passed along to sklearn using the formula: topic_word_prior = beta / num_topics.

  • learning_decay – See sklearn documentation.

  • learning_offset – See sklearn documentation.

  • learning_method – See sklearn documentation.

  • max_iter – See sklearn documentation.

  • batch_size – See sklearn documentation.

  • verbose – See sklearn documentation.
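
A hedged sketch of fitting an sklearn_lda model, reusing the movie review DataFrame from the class-level example; the hyperparameter values are illustrative only:

lda_model = TopicModel(df, "text", "sklearn_lda", num_topics=5, min_df=25, max_df=.5)
# alpha and beta are translated internally into sklearn's doc_topic_prior and
# topic_word_prior by dividing by num_topics, as described above
lda_model.fit(alpha=1.0, beta=1.0, max_iter=10)
lda_model.print_topics()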

sklearn_nmf

Fits a model using sklearn.decomposition.NMF. For more information on available parameters, please refer to the official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

Parameters
  • df – The pandas.DataFrame to train the model on (must contain self.text_col)

  • alpha – See sklearn documentation.

  • l1_ratio – See sklearn documentation.

  • tol – See sklearn documentation.

  • max_iter – See sklearn documentation.

  • shuffle – See sklearn documentation.

gensim_lda

Fits an LDA model using gensim.models.LdaModel or gensim.models.ldamulticore.LdaMulticore. When use_multicore is set to True, the multicore implementation will be used, otherwise the standard LDA implementation will be used. For more information on available parameters, please refer to the official documentation below:

Parameters
  • df – The pandas.DataFrame to train the model on (must contain self.text_col)

  • alpha – Represents document-topic density. When values are higher, documents will be comprised of more topics; when values are lower, documents will be primarily comprised of only a few topics. Gensim options are a bit different than sklearn though; refer to the documentation for the accepted values here.

  • beta – Represents topic-word density. When values are higher, topics will be comprised of more words; when values are lower, only a few words will be loaded onto each topic. Gensim options are a bit different than sklearn though; refer to the documentation for the accepted values here. Gensim calls this parameter eta. We renamed it to be consistent with the sklearn implementations.

  • chunksize – See gensim documentation.

  • passes – See gensim documentation.

  • decay – See gensim documentation.

  • offset – See gensim documentation.

  • workers – Number of cores to use (if using multicore)

  • use_multicore – Whether or not to use multicore
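
A hedged sketch, again reusing the movie review DataFrame; the parameter values are illustrative and gensim must be installed:

gensim_model = TopicModel(df, "text", "gensim_lda", num_topics=5, min_df=25, max_df=.5)
# use_multicore=False selects the standard LdaModel implementation
gensim_model.fit(passes=5, use_multicore=False)
gensim_model.print_topics()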

gensim_hdp

Fits an HDP model using the gensim implementation. Unlike LDA and NMF, HDP attempts to auto-detect the correct number of topics. In practice, it fits T topics (default is 150), but many are extremely rare or occur in only a handful of documents. To identify the topics that are actually useful, this function passes the original pandas.DataFrame through the trained model after fitting and keeps only the topics that compose at least 1% of a document in at least 1% of all documents in the corpus. In other words, a topic is thrown out if the number of documents in which it makes up at least 1% of the text is fewer than 1% of the total number of documents. Subsequent use of the model will only make use of topics that meet this threshold. For more information on available parameters, please refer to the official documentation: https://radimrehurek.com/gensim/models/hdpmodel.html

Parameters
  • df – The pandas.DataFrame to train the model on (must contain self.text_col)

  • max_chunks – See gensim documentation.

  • max_time – See gensim documentation.

  • chunksize – See gensim documentation.

  • kappa – See gensim documentation.

  • tau – See gensim documentation.

  • T – See gensim documentation.

  • K – See gensim documentation.

  • alpha – See gensim documentation.

  • beta – See gensim documentation.

  • gamma – See gensim documentation.

  • scale – See gensim documentation.

  • var_converge – See gensim documentation.
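
A hedged sketch; note that num_topics is not passed, because gensim_hdp tries to detect the number of topics on its own:

hdp_model = TopicModel(df, "text", "gensim_hdp", min_df=25, max_df=.5)
hdp_model.fit()
# Only topics that meet the 1% prevalence threshold described above are retained
hdp_model.print_topics()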

corex

Fits a CorEx topic model. Anchors can be provided in the form of a list of lists, with each item corresponding to a set of words to be used to seed a topic. For example:

anchors=[
    ['cat', 'kitten'],
    ['dog', 'puppy']
]

The list of anchors cannot be longer than the specified number of topics, and all of the words must exist in the vocabulary. The anchor_strength parameter determines the degree to which the model is able to override the suggested words based on the data; higher values are a way of “insisting” more strongly that the model keep the provided words together in a single topic. For more information on available parameters, please refer to the official documentation: https://github.com/gregversteeg/corex_topic

Parameters
  • df – The pandas.DataFrame to train the model on (must contain self.text_col)

  • anchors – A list of lists that contain words that the model should try to group together into topics

  • anchor_strength – The degree to which the provided anchors should be preserved regardless of the data
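
A hedged sketch of an anchored CorEx model on the movie review example above; the anchor words appear in the earlier NMF topics, so they should be present in the vocabulary, and anchor_strength=4 is an illustrative value:

corex_model = TopicModel(df, "text", "corex", num_topics=5, min_df=25, max_df=.5)
# Seed one topic with "star"/"trek" and another with "jackie"/"chan"
corex_model.fit(anchors=[["star", "trek"], ["jackie", "chan"]], anchor_strength=4)
corex_model.print_topics()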

get_score()[source]

Returns goodness-of-fit scores for certain models, based on the holdout documents.

Note

The following scores are available for the following methods:

  • perplexity: (sklearn_lda only) The model’s perplexity

  • score: (sklearn_lda only) The model’s log-likelihood score

  • total_correlation: (corex only) The model’s total correlation score

Returns

A dictionary with goodness-of-fit scores

Return type

dict
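
A hedged sketch; the sklearn_nmf model from the class-level example does not produce these scores, so this assumes a fitted sklearn_lda (or corex) model like the one sketched earlier:

lda_model = TopicModel(df, "text", "sklearn_lda", num_topics=5, min_df=25, max_df=.5)
lda_model.fit()
# For sklearn_lda, expect keys like "perplexity" and "score"; for corex, "total_correlation"
scores = lda_model.get_score()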

get_document_topics(df, **kwargs)[source]

Takes a pandas.DataFrame and returns a document-topic pandas.DataFrame (rows=documents, columns=topics)

Parameters
  • df – The pandas.DataFrame to process (must have self.text_col in it)

  • min_probability (float) – (gensim_lda use_multicore=False only) Topics with a probability lower than this threshold will be filtered out (Default=0.0)

Returns

A document-topic matrix

get_topics(include_weights=False, top_n=10, **kwargs)[source]

Returns a list, equal in length to the number of topics, where each item is a list of words or word-weight tuples.

Parameters
  • include_weights (bool) – Whether or not to include weights along with the ngrams

  • top_n (int) – The number of words to include for each topic

Returns

A list of lists, where each item is a list of ngrams or ngram-weight tuples
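
A minimal sketch, reusing the sklearn_nmf model from the class-level example:

# With include_weights=True, each topic is a list of (ngram, weight) tuples
topics = model.get_topics(include_weights=True, top_n=3)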

print_topics(include_weights=False, top_n=10)[source]

Prints the top words for each topic from a trained model.

Parameters
  • include_weights (bool) – Whether or not to include weights along with the ngrams

  • top_n (int) – The number of words to include for each topic