pewanalytics.text: Text Tools
In the pewanalytics.text module, you’ll find a variety of utilities for working with text data.
General Text Processing Tools
The main pewanalytics.text module contains a variety of general tools for processing text.
Functions:
has_fragment – Checks whether a substring ("fragment") is contained within a larger string ("text").
remove_fragments – Iteratively removes fragments from a string.
filter_parts_of_speech – Retains (or excludes) words associated with the specified parts of speech in the text.
get_fuzzy_ratio – Uses Levenshtein distance to calculate the similarity of two strings.
get_fuzzy_partial_ratio – Calculates the similarity of two strings that are of noticeably different lengths.
is_probable_stopword – Determines whether a word is likely to be a stopword (like the name of a person or location) using a set of WordNet-based rules.
Classes:
SentenceTokenizer – A tokenizer that can be used to break text into sentence tokens.
TextOverlapExtractor – A helper class designed to identify overlapping sections between two strings.
TextCleaner – A class for cleaning text up, in preparation for NLP, etc.
TextDataFrame – A class full of functions for working with dataframes of documents.
- has_fragment(text, fragment)[source]
Checks whether a substring (“fragment”) is contained within a larger string (“text”). Uses the pewtils.decode_text() function to process both the text and the fragment when running this check.
- Parameters
text (str) – The text to search
fragment (str) – The fragment to search for
- Returns
Whether or not the text contains the fragment
- Return type
bool
Usage:
from pewanalytics.text import has_fragment

text = "testing one two three"

>>> has_fragment(text, "one two")
True

>>> has_fragment(text, "four")
False
- remove_fragments(text, fragments, throw_loud_fail=False)[source]
Iteratively remove fragments from a string.
- Parameters
text (str) – The text to remove the fragments from
fragments (list) – A list of string fragments to search for and remove
throw_loud_fail (bool) – Whether or not to raise an error if text decoding fails (default=False)
- Returns
The original string, minus any parts that matched the fragments provided
- Return type
str
Usage:
from pewanalytics.text import remove_fragments

text = "testing one two three"

>>> remove_fragments(text, ["one two"])
"testing three"

>>> remove_fragments(text, ["testing", "three"])
" one two "
- filter_parts_of_speech(text, filter_pos=None, exclude=False)[source]
Retain words associated with the specified parts of speech in the text if exclude=False. If exclude=True, exclude words associated with the specified parts of speech. The default parts of speech are Noun (NN), Proper Noun (NNP), and Adjective (JJ). The full list of POS tags is available here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- Parameters
text (str) – The string to process
filter_pos (list) – A list of part-of-speech tags (default is ‘NN’, ‘NNP’, and ‘JJ’)
exclude (bool) – If True, the function will remove words that match the specified parts of speech; by default, the function instead filters down to POS matches.
- Returns
A string comprised solely of words that matched (or did not match) the specified parts of speech, depending on the value of exclude
- Return type
str
Usage:
from pewanalytics.text import filter_parts_of_speech

text = "This is a very exciting sentence that can serve as a functional example"

>>> filter_parts_of_speech(text, filter_pos=["NN"])
'sentence example'

>>> filter_parts_of_speech(text, filter_pos=["JJ"], exclude=True)
'This is a very sentence that can serve as a example'
- get_fuzzy_ratio(text1, text2, throw_loud_fail=False)[source]
Uses Levenshtein distance to calculate the similarity of two strings. Measures how the edit distance compares to the overall length of the texts. Uses the fuzzywuzzy library in Python 2 and the rapidfuzz library in Python 3.
- Parameters
text1 (str) – First string
text2 (str) – Second string
throw_loud_fail (bool) – Whether or not to raise an error if text decoding fails (default=False)
- Returns
The Levenshtein ratio between the two strings
- Return type
float
Usage:
from pewanalytics.text import get_fuzzy_ratio

text1 = "This is a sentence."
text2 = "This is a slightly difference sentence."

>>> get_fuzzy_ratio(text1, text2)
64.28571428571428
- get_fuzzy_partial_ratio(text1, text2, throw_loud_fail=False, timeout=5)[source]
Useful for calculating the similarity of two strings that are of noticeably different lengths. Allows for the possibility that one text is a subset of the other; finds the largest overlap and computes the Levenshtein ratio on that.
- Parameters
text1 (str) – First string
text2 (str) – Second string
timeout (int) – The number of seconds to wait before giving up
throw_loud_fail (bool) – Whether or not to raise an error if text decoding fails (default=False)
- Returns
The partial Levenshtein ratio between the two texts
- Return type
float
- Accepts kwarg timeout
Usage:
from pewanalytics.text import get_fuzzy_partial_ratio

text1 = "This is a sentence."
text2 = "This is a sentence, but with more text."

>>> get_fuzzy_partial_ratio(text1, text2)
100.0
- class SentenceTokenizer(base_tokenizer=None, regex_split_trailing=None, regex_split_leading=None)[source]
Initializes a tokenizer that can be used to break text into tokens using the tokenize function.
- Parameters
base_tokenizer – The tokenizer to use (default = NLTK’s English Punkt tokenizer)
regex_split_trailing – A compiled regex object used to define the end of sentences
regex_split_leading – A compiled regex object used to define the beginning of sentences
Usage:
from pewanalytics.text import SentenceTokenizer
import re

text = "This is a sentence. This is another sentence - and maybe a third sentence. And yet a fourth sentence."

>>> tokenizer = SentenceTokenizer()
>>> tokenizer.tokenize(text)
['This is a sentence.', 'This is another sentence - and maybe a third sentence.', 'And yet a fourth sentence.']

>>> tokenizer = SentenceTokenizer(regex_split_leading=re.compile(r"\-"))
>>> tokenizer.tokenize(text)
['This is a sentence.', 'This is another sentence', 'and maybe a third sentence.', 'And yet a fourth sentence.']
Methods:
tokenize(text[, throw_loud_fail, min_length]) – Tokenizes the text.
- tokenize(text, throw_loud_fail=False, min_length=None)[source]
Tokenizes the text.
- Parameters
text (str) – The text to tokenize
throw_loud_fail (bool) – Whether or not to raise an error if text decoding fails (default=False)
min_length (int) – The minimum acceptable length of a sentence (if a token is shorter than this, it will be considered part of the preceding sentence) (default=None)
- Returns
A list of sentences
- Return type
list
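The min_length parameter can be used to fold very short tokens into the preceding sentence. A minimal sketch follows (not part of the original documentation; the sample text and threshold are arbitrary):
from pewanalytics.text import SentenceTokenizer

text = "This is a sentence. Short. This is another sentence."

tokenizer = SentenceTokenizer()

# Without min_length, "Short." should come back as its own token
print(tokenizer.tokenize(text))

# With min_length=10, tokens shorter than 10 characters are expected to be
# merged into the preceding sentence, per the parameter description above
print(tokenizer.tokenize(text, min_length=10))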
- class TextOverlapExtractor(tokenizer=None)[source]
A helper class designed to identify overlapping sections between two strings.
- Parameters
tokenizer – The tokenizer to use (default = SentenceTokenizer())
Methods:
get_text_overlaps(text1, text2[, ...]) – Extracts all overlapping segments of at least min_length characters between the two texts.
get_largest_overlap(text1, text2) – Returns the largest overlapping segment of text between the two texts (this doesn't use the tokenizer).
- get_text_overlaps(text1, text2, min_length=20, tokenize=True)[source]
Extracts all overlapping segments of at least min_length characters between the two texts. If tokenize=True, then only tokens that appear fully in both texts will be extracted. For example, see the usage below.
- Parameters
text1 (str) – A piece of text
text2 (str) – Another piece of text to compare against the first
min_length (int) – The minimum size of the overlap to be considered (number of characters)
tokenize (bool) – If True, overlapping segments will only be included if they consist of atomic tokens; overlaps that consist of only part of a token will be excluded. By default, the text is tokenized into sentences based on punctuation. (default=True)
- Returns
A list of all of the identified overlapping segments
- Return type
list
Usage:
from pewanalytics.text import TextOverlapExtractor

text1 = "This is a sentence. This is another sentence. And a third sentence. And yet a fourth sentence."
text2 = "This is a different sentence. This is another sentence. And a third sentence. But the fourth sentence is different too."

>>> extractor = TextOverlapExtractor()

>>> extractor.get_text_overlaps(text1, text2, min_length=10, tokenize=False)
[' sentence. This is another sentence. And a third sentence. ', ' fourth sentence']

>>> extractor.get_text_overlaps(text1, text2, min_length=10, tokenize=True)
['This is another sentence.', 'And a third sentence.']
- get_largest_overlap(text1, text2)[source]
Returns the largest overlapping segment of text between the two texts (this doesn’t use the tokenizer).
- Parameters
text1 (str) – A piece of text
text2 (str) – Another piece of text to compare against the first
- Returns
The largest substring that occurs in both texts
- Return type
str
Usage:
from pewanalytics.text import TextOverlapExtractor

text1 = "Overlaping section, unique text another overlapping section"
text2 = "Overlapping section, another overlapping section"

>>> extractor = TextOverlapExtractor()

>>> extractor.get_largest_overlap(text1, text2)
' another overlapping section'
- class TextCleaner(process_method='lemmatize', processor=None, filter_pos=None, lowercase=True, remove_urls=True, replacers=None, stopwords=None, strip_html=False, tokenizer=None, throw_loud_fail=False)[source]
A class for cleaning text up, in preparation for NLP, etc. Attempts to decode the text.
This class performs the following cleaning tasks, in sequence:
Removes HTML tags (optional)
Decodes the text
Filters out specified parts of speech (optional)
Converts text to lowercase (optional)
Removes URLs (optional)
Expands contractions
Removes stopwords
Lemmatizes or stems (optional)
Removes words less than three characters
Removes punctuation
Consolidates whitespace
- Parameters
process_method (str) – Options are “lemmatize”, “stem”, or None (default = “lemmatize”)
processor – A lemmatizer or stemmer with a “lemmatize” or “stem” function (default for process_method=”lemmatize” is nltk.WordNetLemmatizer(); default for process_method=”stem” is nltk.SnowballStemmer())
filter_pos (list) – A list of WordNet parts-of-speech tags to keep; if provided, all other words will be removed (default = None)
lowercase (bool) – Whether or not to lowercase the string (default = True)
remove_urls (bool) – Whether or not to remove URLs and links from the text (default = True)
replacers (list) – A list of tuples, each with a regex pattern followed by the string/pattern to replace them with. Anything passed here will be used in addition to a set of built-in replacement patterns for common contractions. (See the sketch after the usage examples below.)
stopwords (set) – The set of stopwords to remove (default = nltk.corpus.stopwords.words(‘english’) combined with sklearn.feature_extraction.stop_words.ENGLISH_STOP_WORDS). If an empty list is passed, no stopwords will be used.
strip_html (bool) – Whether or not to remove contents wrapped in HTML tags (default = False)
tokenizer – Tokenizer to use (default = nltk.WhitespaceTokenizer())
throw_loud_fail (bool) – Whether or not to raise an error if text decoding fails (default=False)
Usage:
from pewanalytics.text import TextCleaner

text = "<body> Here's some example text.</br>It isn't a great example, but it'll do. Of course, there are plenty of other examples we could use though. http://example.com </body>"

>>> cleaner = TextCleaner(process_method="stem")
>>> cleaner.clean(text)
'exampl is_not great exampl cours plenti exampl could use though'

>>> cleaner = TextCleaner(process_method="stem", stopwords=["my_custom_stopword"], strip_html=True)
>>> cleaner.clean(text)
'here some exampl is_not great exampl but will cours there are plenti other exampl could use though'

>>> cleaner = TextCleaner(process_method="lemmatize", strip_html=True)
>>> cleaner.clean(text)
'example is_not great example course plenty example could use though'

>>> cleaner = TextCleaner(process_method="lemmatize", remove_urls=False, strip_html=True)
>>> cleaner.clean(text)
'example text is_not great example course plenty example could use though http example com'

>>> cleaner = TextCleaner(process_method="stem", strip_html=False)
>>> cleaner.clean(text)
'example text is_not great example course plenty example could use though http example com'

>>> cleaner = TextCleaner(process_method="stem", filter_pos=["JJ"], strip_html=True)
>>> cleaner.clean(text)
'great though'
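The replacers parameter accepts custom (regex pattern, replacement) tuples that are applied alongside the built-in contraction patterns. A minimal sketch (not part of the original documentation; the patterns and sample text are arbitrary assumptions):
from pewanalytics.text import TextCleaner

text = "Research & data w/ context"

# Expand an ampersand and a common shorthand before the rest of the cleaning runs
cleaner = TextCleaner(
    process_method=None,
    replacers=[
        (r"&", " and "),
        (r"w/", "with "),
    ],
)
print(cleaner.clean(text))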
Methods:
clean(text) – Cleans the text.
- class TextDataFrame(df, text_column, **vectorizer_kwargs)[source]
This is a class full of functions for working with dataframes of documents. It contains utilities for identifying potential duplicates, identifying recurring segments of text, computing metrics like mutual information, extracting clusters of documents, and more.
Given a pandas.DataFrame and the name of the column that contains the text to be analyzed, the TextDataFrame will automatically produce a TF-IDF sparse matrix representation of the text upon initialization. All other parameters are passed along to the scikit-learn TfidfVectorizer.
Tip
For more info on the parameters it accepts, refer to the official scikit-learn TfidfVectorizer documentation.
- Parameters
df – A pandas.DataFrame of documents. Must contain a column with text.
text_column (str) – The name of the column in the pandas.DataFrame that contains the text
vectorizer_kwargs – All remaining keyword arguments are passed to TfidfVectorizer
Usage:
from pewanalytics.text import TextDataFrame
import pandas as pd
import nltk

nltk.download("inaugural")
df = pd.DataFrame([
    {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)}
    for fileid in nltk.corpus.inaugural.fileids()
])

# Let's remove new line characters so we can print the output in the docstrings
df['text'] = df['text'].str.replace("\n", " ")

# And now let's create some additional variables to group our data
df['year'] = df['speech'].map(lambda x: int(x.split("-")[0]))
df['21st_century'] = df['year'].map(lambda x: 1 if x >= 2000 else 0)

# And we'll also create some artificial duplicates in the dataset
df = df.append(df.tail(2)).reset_index()

>>> tdf = TextDataFrame(df, "text", stop_words="english", ngram_range=(1, 2))

>>> tdf_dense = pd.DataFrame(tdf.tfidf.todense(), columns=tdf.vectorizer.get_feature_names()).head(5)

>>> tdf_dense.loc[:, (tdf_dense != 0).any(axis=0)]
       14th  14th day   abandon  abandon government  ...  zeal inspires  zeal purity  zeal rely  zeal wisdom
0  0.034014  0.034014  0.000000            0.000000  ...       0.000000     0.000000   0.000000     0.000000
1  0.000000  0.000000  0.000000            0.000000  ...       0.000000     0.000000   0.000000     0.000000
2  0.000000  0.000000  0.000000            0.000000  ...       0.000000     0.000000   0.000000     0.000000
3  0.000000  0.000000  0.020984            0.030686  ...       0.000000     0.000000   0.030686     0.000000
4  0.000000  0.000000  0.000000            0.000000  ...       0.026539     0.026539   0.000000     0.026539
Methods:
search_corpus(text) – Compares the provided text against the documents in the corpus and returns the most similar documents.
match_text_to_corpus(match_list[, ...]) – Takes a list of text values and attempts to match them to the documents in the pandas.DataFrame.
extract_corpus_fragments([...]) – Iterate over the corpus pandas.DataFrame and, for each document, scan the most similar other documents in the corpus using TF-IDF cosine similarity.
find_duplicates([tfidf_threshold, ...]) – Search for duplicates by using cosine similarity and Levenshtein ratios.
find_related_keywords(keyword[, n]) – Given a particular keyword, looks for related terms in the corpus using mutual information.
mutual_info(outcome_col[, weight_col, ...]) – A wrapper around pewanalytics.stats.mutual_info.compute_mutual_info().
kmeans_clusters([k]) – A wrapper around pewanalytics.stats.clustering.compute_kmeans_clusters().
hdbscan_clusters([min_cluster_size, min_samples]) – A wrapper around pewanalytics.stats.clustering.compute_hdbscan_clusters().
top_cluster_terms(cluster_col[, min_size, top_n]) – Extracts the top terms for each cluster, based on a column of cluster IDs saved to self.corpus, using mutual information.
pca_components([k]) – A wrapper around pewanalytics.stats.dimensionality_reduction.get_pca().
lsa_components([k]) – A wrapper around pewanalytics.stats.dimensionality_reduction.get_lsa().
get_top_documents([component_prefix, top_n]) – Use after running pca_components() or lsa_components().
make_word_cooccurrence_matrix([normalize, ...]) – Use to produce word co-occurrence matrices.
make_document_cooccurrence_matrix([normalize]) – Use to produce document co-occurrence matrices.
- search_corpus(text)[source]
Compares the provided text against the documents in the corpus and returns the most similar documents. A new column called ‘cosine_similarity’ is generated, which is used to sort and return the pandas.DataFrame.
- Parameters
text (str) – The text to compare documents against
- Returns
The corpus pandas.DataFrame sorted by cosine similarity
Usage:
>>> tdf.search_corpus('upright zeal')[:5]
                                                 text  search_cosine_similarity
4   Proceeding, fellow citizens, to that qualifica...                  0.030856
8   Fellow citizens, I shall not attempt to descri...                  0.025041
9   In compliance with an usage coeval with the ex...                  0.024922
27  Fellow citizens, In obedience to the will of t...                  0.021272
10  Fellow citizens, about to undertake the arduou...                  0.014791
- match_text_to_corpus(match_list, allow_multiple=False, min_similarity=0.9)[source]
Takes a list of text values and attempts to match them to the documents in the pandas.DataFrame. Each document will be matched to the value in the list to which it is most similar, based on cosine similarity.
- Parameters
match_list (list) – A list of strings (other documents) to be matched to documents in the pandas.DataFrame
allow_multiple (bool) – If set to True, each document in your corpus will be matched with its closest valid match in the list. If set to False (default), documents in the list will only be matched to their best match in the corpus.
min_similarity (float) – Minimum cosine similarity required for any match to be made.
- Returns
Your corpus pandas.DataFrame, with new columns match_text, match_index, and cosine_similarity
Usage:
>>> match_df = tdf.match_text_to_corpus(test_excerpt, min_similarity=0.05)

>>> match_df.sort_values('cosine_similarity')[:2]
                                                 text                                         match_text  match_index  cosine_similarity
48  Senator Hatfield, Mr. Chief Justice, Mr. Presi...  In this present crisis, government is not the ...            1          0.0699283
43  Vice President Johnson, Mr. Speaker, Mr. Chief...  And so, my fellow Americans: ask not what your...            0           0.166681
- extract_corpus_fragments(scan_top_n_matches_per_doc=20, min_fragment_length=15, tokenize=True, tokenizer=None)[source]
Iterate over the corpus pandas.DataFrame and, for each document, scan the most similar other documents in the corpus using TF-IDF cosine similarity. During each comparison, overlapping fragments are identified. This can be useful for identifying common boilerplate sentences, repeated paragraphs, etc. By default, the text is tokenized into complete sentences (so only complete sentences that recur will be returned), but you can set tokenize=False to get raw segments of text that occur multiple times.
- Parameters
scan_top_n_matches_per_doc (int) – The number of other documents to compare each document against.
min_fragment_length (int) – The minimum character length a fragment must have to be extracted.
tokenize (bool) – If True, overlapping segments will only be included if they consist of atomic tokens; overlaps that consist of only part of a token will be excluded. Uses sentence tokenization by default. (default=True)
tokenizer (object) – The tokenizer to use, if tokenizing isn't disabled (default = SentenceTokenizer())
- Returns
A list of fragments that were found.
Note
This function will skip over duplicates if they exist in your data; it only compares documents that have less than .997 cosine similarity.
Usage:
>>> tdf.extract_corpus_fragments(scan_top_n_matches_per_doc=20, min_fragment_length=25, tokenize=False)
['s. Equal and exact justice ',
 'd by the General Government',
 ' of the American people, ',
 'ent of the United States ',
 ' the office of President of the United States ',
 ' preserve, protect, and defend the Constitution of the United States." ',
 ' to "preserve, protect, and defend',
 ' of the United States are ',
 'e of my countrymen I am about to ',
 'Vice President, Mr. Chief Justice, ',
 ' 200th anniversary as a nation',
 ', and my fellow citizens: ',
 'e United States of America']
- find_duplicates(tfidf_threshold=0.9, fuzzy_ratio_threshold=90, allow_partial=False, max_partial_difference=40, filter_function=None, partial_ratio_timeout=5, decode_text=False)[source]
Search for duplicates by using cosine similarity and Levenshtein ratios. This will struggle with large corpora, so we recommend trying to filter down to potential duplicates first. The corpus will first be scanned for document pairs with a cosine similarity greater or equal to the tfidf_threshold. Then, each of these pairs will be compared using the more stringent fuzzy_ratio_threshold.
- Parameters
tfidf_threshold (float) – Minimum cosine similarity for two documents to be considered potential dupes.
fuzzy_ratio_threshold (int) – The required Levenshtein ratio to consider two documents duplicates.
allow_partial (bool) – Whether or not to allow a partial ratio (if False, absolute ratios will be used)
max_partial_difference (int) – The maximum partial ratio difference allowed for a potential duplicate pair
filter_function – An optional function that allows for more complex filtering. The function must accept the following parameters: text1, text2, cosine_similarity, fuzzy_ratio. Must return True or False, indicating whether the two documents should be considered duplicates. (See the sketch after the usage example below.)
partial_ratio_timeout (int) – How long, in seconds, the partial ratio is allowed to compute
decode_text (bool) – Whether to decode the text prior to making comparisons
- Returns
A list of lists, containing groups of duplicate documents (represented as rows from the corpus pandas.DataFrame)
Usage:
>>> tdf.find_duplicates()
[            speech                                               text  year  21st_century
 56  2013-Obama.txt    Thank you. Thank you so much. Vice Presiden...  2013             1
 56  2013-Obama.txt    Thank you. Thank you so much. Vice Presiden...  2013             1,
             speech                                               text  year  21st_century
 57  2017-Trump.txt  Chief Justice Roberts, President Carter, Presi...  2017             1
 57  2017-Trump.txt  Chief Justice Roberts, President Carter, Presi...  2017             1]
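The filter_function parameter can encode more complex duplicate rules than the two thresholds alone. A minimal sketch (not part of the original documentation; the length cutoff and threshold values are arbitrary), reusing the tdf object from the class usage example:
def strict_duplicate_filter(text1, text2, cosine_similarity, fuzzy_ratio):
    # Only flag long documents, and require both metrics to be very high
    return min(len(text1), len(text2)) > 1000 and cosine_similarity >= 0.95 and fuzzy_ratio >= 95

duplicates = tdf.find_duplicates(
    tfidf_threshold=0.9,
    fuzzy_ratio_threshold=90,
    filter_function=strict_duplicate_filter,
)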
- find_related_keywords(keyword[, n])[source]
Given a particular keyword, looks for related terms in the corpus using mutual information.
- Parameters
keyword (str) – The keyword to use
n (int) – Number of related terms to return
- Returns
Terms associated with the keyword
- Return type
list
Usage:
>>> tdf.find_related_keywords("war")[:2]
['war', 'peace']

>>> tdf.find_related_keywords("economy")[:2]
['economy', 'expenditures']
- mutual_info(outcome_col, weight_col=None, sample_size=None, l=0, normalize=True)[source]
A wrapper around
pewanalytics.stats.mutual_info.compute_mutual_info()
- Parameters
outcome_col (str) – The name of the column with the binary outcome variable
weight_col (str) – (Optional) Name of the column to use in weighting
sample_size (int) – (Optional) If provided, a random sample of this size will be used instead of the full
pandas.DataFrame
l (float) – An optional Laplace smoothing parameter
normalize (bool) – Toggle normalization on or off (to control for feature prevalence), on by default
- Returns
A
pandas.DataFrame
of ngrams and various metrics about them, including mutual information
Usage:
>>> results = tdf.mutual_info('21st_century')

>>> results.sort_values("MI1", ascending=False).index[:25]
Index(['journey complete', 'jobs', 'make america', 've', 'obama', 'workers',
       'xand', 'states america', 'america best', 'debates', 'clinton',
       'president clinton', 'trillions', 'stops right', 'transferring',
       'president obama', 'stops', 'protected protected', 'transferring power',
       'nation capital', 'american workers', 'politicians', 'people believe',
       'borders', 'victories'],
      dtype='object')
- kmeans_clusters(k=10)[source]
A wrapper around pewanalytics.stats.clustering.compute_kmeans_clusters(). Will compute clusters of documents. The resulting cluster IDs for each document are saved in the TextDataFrame’s corpus in a new column called “kmeans”.
- Parameters
k (int) – The number of clusters to extract
Usage:
>>> tdf.kmeans_clusters(5)
KMeans: n_clusters 5, score is 0.019735248210503934
KMeans clusters saved to self.corpus['kmeans']

>>> df['kmeans'].value_counts()
2    26
3    15
4    11
0     5
1     3
Name: kmeans, dtype: int64
- hdbscan_clusters(min_cluster_size=100, min_samples=1)[source]
A wrapper around pewanalytics.stats.clustering.compute_hdbscan_clusters(). Will compute clusters of documents. The resulting cluster IDs for each document are saved in the TextDataFrame’s corpus in a new column called “hdbscan”.
- Parameters
min_cluster_size (int) – The minimum number of documents that a cluster must contain.
min_samples (int) – An HDBSCAN parameter; refer to the documentation for more information
Usage:
>>> tdf.hdbscan_clusters(min_cluster_size=10)
HDBSCAN: n_clusters 2
HDBSCAN clusters saved to self.corpus['hdbscan']
- top_cluster_terms(cluster_col, min_size=50, top_n=10)[source]
Extracts the top terms for each cluster, based on a column of cluster IDs saved to self.corpus, using mutual information. Returns the top_n terms for each cluster.
- Parameters
cluster_col (str) – The name of the column that contains the document cluster IDs
min_size (int) – Ignore clusters that have fewer than this number of documents
top_n (int) – The number of top terms to identify for each cluster
- Returns
A dictionary; keys are the cluster IDs and values are the top terms for the cluster
- Return type
dict
Usage:
>>> df_top_cluster = tdf.top_cluster_terms('kmeans', min_size=10)
Cluster #2, 26 documents: ['constitution' 'union' 'states' 'friendly' 'liberal' 'revenue' 'general government' 'confederacy' 'whilst' 'authorities']
Cluster #4, 10 documents: ['shall strive' 'let sides' 'woe' 'offenses' 'breeze' 'war let' 'nuclear weapons' 'learned live' 'mistakes' 'mr speaker']
Cluster #0, 12 documents: ['activities' 'realization' 'interstate' 'wished' 'industrial' 'major' 'counsel action' 'conditions' 'natural resources' 'eighteenth amendment']
- pca_components(k=20)[source]
A wrapper around pewanalytics.stats.dimensionality_reduction.get_pca(). Saves the PCA components to self.corpus as new columns (‘pca_1’, ‘pca_2’, etc.), saves the top component for each document as self.corpus[‘pca’], and returns the features-component matrix.
- Parameters
k (int) – Number of dimensions to extract
- Returns
A pandas.DataFrame of (features x components)
Usage:
>>> df_pca = tdf.pca_components(2)
Decomposition explained variance ratio: 0.07488529151231405
Component 0: ['america' 'today' 'americans' 'world' 'new' 'freedom' 'thank' 'nation' 'god' 'journey']
Component 1: ['america' 'make america' 'dreams' 'protected' 'obama' 'borders' 'factories' 'american' 'transferring' 'stops']
Top PCA dimensions saved as clusters to self.corpus['pca']

>>> df.sample(5)
                 speech                                               text  year  21st_century     pca_0     pca_1    pca
0   1789-Washington.txt  Fellow-Citizens of the Senate and of the House...  1789             0 -0.129094  0.016984  pca_1
21       1873-Grant.txt  Fellow-Citizens: Under Providence I have been ...  1873             0 -0.097430  0.009559  pca_1
49      1985-Reagan.txt  Senator Mathias, Chief Justice Burger, Vice Pr...  1985             0  0.163833 -0.020259  pca_0
2        1797-Adams.txt  When it was first perceived, in early times, t...  1797             0 -0.140250  0.024844  pca_1
20       1869-Grant.txt  Citizens of the United States: Your suffrag...     1869             0 -0.114444  0.014419  pca_1
- lsa_components(k=20)[source]
A wrapper around pewanalytics.stats.dimensionality_reduction.get_lsa(). Saves the LSA components to self.corpus as new columns (‘lsa_1’, ‘lsa_2’, etc.), saves the top component for each document as self.corpus[‘lsa’], and returns the features-component matrix.
- Parameters
k (int) – Number of dimensions to extract
- Returns
A pandas.DataFrame of (features x components)
Usage:
>>> df_lsa = tdf.lsa_components(2)
Decomposition explained variance ratio: 0.04722850124656694
Top features:
Component 0: ['government' 'people' 'america' 'states' 'world' 'nation' 'shall' 'country' 'great' 'peace']
Component 1: ['america' 'today' 'americans' 'world' 'new' 'freedom' 'thank' 'nation' 'god' 'journey']
Top LSA dimensions saved as clusters to self.corpus['lsa_'] columns

>>> df.sample(5)
                speech                                               text  year  21st_century     lsa_0     lsa_1    lsa
37  1937-Roosevelt.txt  When four years ago we met to inaugurate a Pre...  1937             0  0.293068  0.040802  lsa_0
8      1821-Monroe.txt  Fellow citizens, I shall not attempt to descri...  1821             0  0.348465 -0.212382  lsa_0
7      1817-Monroe.txt  I should be destitute of feeling if I was not ...  1817             0  0.369249 -0.237231  lsa_0
26  1893-Cleveland.txt  My Fellow citizens, in obedience of the mandat...  1893             0  0.275778 -0.128497  lsa_0
59      2017-Trump.txt  Chief Justice Roberts, President Carter, Presi...  2017             1  0.342111  0.511687  lsa_1
- get_top_documents(component_prefix='cluster', top_n=5)[source]
Use after running pewanalytics.text.TextDataFrame.pca_components() or pewanalytics.text.TextDataFrame.lsa_components(). Returns the top_n documents with the highest scores for each component.
- Parameters
component_prefix (str) – ‘lsa’ or ‘pca’ (you must first run pca_components or lsa_components)
top_n (int) – Number of documents to return for each component
- Returns
A dictionary where keys are the components, and values are the text values for the component’s top_n documents
- Return type
dict
Usage:
>>> lsa_topdoc = tdf.get_top_documents("lsa")

>>> {key: len(value) for key, value in lsa_topdoc.items()}
{'lsa_0': 5, 'lsa_1': 4}

>>> lsa_topdoc['lsa_1'][0]
'Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: Thank you. We, the citizens of America...'
- make_word_cooccurrence_matrix(normalize=False, min_frequency=10, max_frequency=0.5)[source]
Use to produce word co-occurrence matrices. Based on a helpful StackOverflow post: https://stackoverflow.com/questions/35562789/how-do-i-calculate-a-word-word-co-occurrence-matrix-with-sklearn
- Parameters
normalize (bool) – If True, will be normalized
min_frequency (int) – The minimum document frequency required for a term to be included
max_frequency (float) – The maximum proportion of documents containing a term allowed to include the term
- Returns
A matrix of (terms x terms) whose values indicate the number of documents in which two terms co-occurred
Usage:
>>> import numpy as np

>>> wcm = tdf.make_word_cooccurrence_matrix(min_frequency=25, normalize=True)

# Find the top cooccurring pair of words
>>> wcm.stack().index[np.argmax(wcm.values)]
('protection', 'policy')
- make_document_cooccurrence_matrix(normalize=False)[source]
Use to produce document co-occurrence matrices. Based on a helpful StackOverflow post: https://stackoverflow.com/questions/35562789/how-do-i-calculate-a-word-word-co-occurrence-matrix-with-sklearn
- Parameters
normalize (bool) – If True, will be normalized
- Returns
A matrix of (documents x documents) whose values indicate the number of terms they had in common
Usage:
>>> dcm = tdf.make_document_cooccurrence_matrix(normalize=True)

# Remove artificial duplicates and insert document names
>>> dcm = dcm.iloc[:-2, :-2]
>>> dcm.rename(columns=df['speech'][:-2], index=df['speech'][:-2], inplace=True)

# Find documents with the highest cooccurrence score
>>> dcm.stack().index[np.argmax(dcm.values)]
('1793-Washington.txt', '1841-Harrison.txt')
- is_probable_stopword(word)[source]
Determine if a word is likely to be a stop word (like a name of a person or location) by the following rules:
Number of synset (words with similar meaning) is less than 3
The min_depth (number of edges between a word and the top of the hierarchy) is > 5
The number of lemma (similar to term definition in dictionary) is less than 2
If a word meets fewer than two of these conditions, this function will return False, because the word likely has one or more common meanings in English and is likely to be more than just a proper name.
This function was developed through trial and error, and your mileage may vary. It’s intended to help you identify potential stopwords when extracting features from a database. For example, on one of our projects we wanted to remove names from our text data, and pulled a list of names from our database of politicians. However, some politicians have last names that are also common English words, like “White” and “Black” - and in those cases, we didn’t want to add those to our list of stopwords. This function was useful in scanning through our list of names to identify names that we wanted to “whitelist”.
- Parameters
word (string) – A word, usually a name of a person or location or something that you might want to add as a stopword
- Returns
Whether or not the word is (probably) a stopword aka a proper noun with no common English meaning
- Return type
bool
Usage:
>>> is_probable_stopword("Chicago")
True

>>> is_probable_stopword("Orange")
False

>>> is_probable_stopword("Johnny")
True
Date Extraction
The pewanalytics.text.dates submodule contains a helper class for extracting dates from text.
Classes:
DateFinder – A helper class to search for dates in text using a series of regular expressions and a parser from dateutil.
- class DateFinder(preprocessing_patterns=None)[source]
A helper class to search for dates in text using a series of regular expressions and a parser from dateutil. Verifies that dateutil did not auto-fill missing values in the date. Time information will be automatically cleared out, but you can also pass a list of additional regular expression patterns (as strings) to define other patterns that should be cleared out before scanning for dates.
- Parameters
preprocessing_patterns (list) – Optional list of additional patterns to clear out prior to searching for dates.
Usage:
import datetime

from pewanalytics.text.dates import DateFinder

text = "January 1, 2018 and 02/01/2019 and Mar. 1st 2020"
low_bound = datetime.datetime(2017, 1, 1)
high_bound = datetime.datetime(2021, 1, 1)

>>> finder = DateFinder()

>>> dates = finder.find_dates(text, low_bound, high_bound)
>>> dates
[
    (datetime.datetime(2018, 1, 1, 0, 0), 'January 1, 2018 '),
    (datetime.datetime(2020, 3, 1, 0, 0), 'Mar. 1st 2020'),
    (datetime.datetime(2019, 2, 1, 0, 0), '02/01/2019 ')
]
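The preprocessing_patterns parameter clears out additional patterns before the text is scanned for dates. A minimal sketch (not part of the original documentation; the pattern and sample text are arbitrary assumptions):
import datetime

from pewanalytics.text.dates import DateFinder

text = "Page 12 of 2019: the meeting was held on March 3, 2019"
low_bound = datetime.datetime(2018, 1, 1)
high_bound = datetime.datetime(2020, 1, 1)

# Strip the page marker so "Page 12 of 2019" isn't picked up as a date
finder = DateFinder(preprocessing_patterns=[r"Page \d+ of"])
print(finder.find_dates(text, low_bound, high_bound))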
Methods:
find_dates(text, cutoff_date_start, cutoff_date_end) – Return all of the dates (in text form and as datetime) in the text variable that fall within the specified window of dates (inclusive).
- find_dates(text, cutoff_date_start, cutoff_date_end)[source]
Return all of the dates (in text form and as datetime) in the text variable that fall within the specified window of dates (inclusive).
- Parameters
text (str) – The text to scan for dates
cutoff_date_start (datetime.date) – No dates will be returned if they fall before this date
cutoff_date_end (datetime.date) – No dates will be returned if they fall after this date
- Returns
A list of tuples containing (datetime object, raw date text)
- Return type
list
Named Entity Recognition
The pewanalytics.text.ner submodule contains a helper class for extracting named entities from text.
Classes:
NamedEntityExtractor – A wrapper around NLTK and SpaCy for named entity extraction.
- class NamedEntityExtractor(method='spacy')[source]
A wrapper around NLTK and SpaCy for named entity extraction. May be expanded to include more libraries in the future.
- Parameters
method (str) – Specify the library to use when extracting named entities. Options are ‘nltk’, ‘spacy’, ‘all’. If ‘all’ is selected, both libraries will be used and the union will be returned. (Default=’spacy’)
Usage:
from pewanalytics.text.ner import NamedEntityExtractor
import nltk

nltk.download("inaugural")
fileid = nltk.corpus.inaugural.fileids()[0]
text = nltk.corpus.inaugural.raw(fileid)

>>> ner = NamedEntityExtractor(method="nltk")
>>> ner.extract(text)
{
    'ORGANIZATION': [
        'Parent', 'Invisible Hand', 'Great Author', 'House', 'Constitution',
        'Senate', 'Human Race', 'Representatives'
    ],
    'PERSON': ['Almighty Being'],
    'GPE': ['Heaven', 'United States', 'American']
}

>>> ner = NamedEntityExtractor(method="spacy")
>>> ner.extract(text)
{
    'ORGANIZATION': ['House of Representatives', 'Senate', 'Parent of the Human Race'],
    'DATE': ['present month', 'every day', '14th day', 'years'],
    'ORDINAL': ['first', 'fifth'],
    'GPE': ['United States'],
    'NORP': ['republican', 'American'],
    'LAW': ['Constitution']
}

>>> ner = NamedEntityExtractor(method="all")
>>> ner.extract(text)
{
    'ORGANIZATION': [
        'Representatives', 'Great Author', 'House', 'Parent',
        'House of Representatives', 'Parent of the Human Race',
        'Invisible Hand', 'Human Race', 'Senate', 'Constitution'
    ],
    'PERSON': ['Almighty Being'],
    'GPE': ['Heaven', 'United States', 'American'],
    'DATE': ['every day', 'present month', '14th day', 'years'],
    'ORDINAL': ['first', 'fifth'],
    'NORP': ['republican', 'American'],
    'LAW': ['Constitution']
}
Methods:
extract(text) – Extracts named entities from the given text.
- Parameters
text – a string from which to extract named entities
Topic Modeling
The pewanalytics.text.topics submodule contains a standardized class for training and applying topic models using several different libraries.
Classes:
TopicModel – A wrapper around various topic modeling algorithms and libraries, intended to provide a standardized way to train and apply models.
- class TopicModel(df, text_col, method, num_topics=None, max_ngram_size=2, holdout_pct=0.25, use_tfidf=False, **vec_kwargs)[source]
A wrapper around various topic modeling algorithms and libraries, intended to provide a standardized way to train and apply models. When you initialize a TopicModel, it will fit a vectorizer and split the data into a train and test set if holdout_pct is provided. For more information about the available implementations, refer to the documentation for the fit() method below.
- Parameters
df – A pandas.DataFrame
text_col (str) – Name of the column containing text
method (str) – The topic model implementation to use. Options are: sklearn_lda, sklearn_nmf, gensim_lda, gensim_hdp, corex
num_topics (int) – The number of topics to extract. Required for every method except gensim_hdp.
max_ngram_size (int) – Maximum ngram size (2=bigrams, 3=trigrams, etc.)
holdout_pct (float) – Proportion of the documents to hold out for goodness-of-fit scoring
use_tfidf (bool) – Whether to use binary counts or a TF-IDF representation
vec_kwargs – All remaining arguments get passed to TfidfVectorizer or CountVectorizer
Usage:
from pewanalytics.text.topics import TopicModel
import nltk
import pandas as pd

nltk.download("movie_reviews")
reviews = [
    {"fileid": fileid, "text": nltk.corpus.movie_reviews.raw(fileid)}
    for fileid in nltk.corpus.movie_reviews.fileids()
]
df = pd.DataFrame(reviews)

>>> model = TopicModel(df, "text", "sklearn_nmf", num_topics=5, min_df=25, max_df=.5, use_tfidf=False)
Initialized sklearn_nmf topic model with 3285 features
1600 training documents, 400 testing documents

>>> model.fit()

>>> model.print_topics()
0: bad, really, know, don, plot, people, scene, movies, action, scenes
1: star, trek, star trek, effects, wars, star wars, special, special effects, movies, series
2: jackie, films, chan, jackie chan, hong, master, drunken, action, tarantino, brown
3: life, man, best, characters, new, love, world, little, does, great
4: alien, series, aliens, characters, films, television, files, quite, mars, action

>>> doc_topics = model.get_document_topics(df)

>>> doc_topics
       topic_0   topic_1   topic_2   topic_3   topic_4
0     0.723439  0.000000  0.000000  0.000000  0.000000
1     0.289801  0.050055  0.000000  0.000000  0.000000
2     0.375149  0.000000  0.030691  0.059088  0.143679
3     0.152961  0.010386  0.000000  0.121412  0.015865
4     0.294005  0.100426  0.000000  0.137630  0.051241
...        ...       ...       ...       ...       ...
1995  0.480983  0.070431  0.135178  0.256951  0.000000
1996  0.139986  0.000000  0.000000  0.107430  0.000000
1997  0.141545  0.005990  0.081986  0.387859  0.057025
1998  0.029228  0.023342  0.043713  0.280877  0.107551
1999  0.044863  0.000000  0.000000  0.718677  0.000000
Methods:
get_features(df[, keep_sparse]) – Uses the trained vectorizer to process a pandas.DataFrame and return a feature matrix.
get_fit_params(**kwargs) – Internal helper function to set defaults depending on the specified model.
fit([df]) – Fits a model using the method specified when initializing the TopicModel.
get_score() – Returns goodness-of-fit scores for certain models, based on the holdout documents.
get_document_topics(df, **kwargs) – Takes a pandas.DataFrame and returns a document-topic pandas.DataFrame (rows=documents, columns=topics).
get_topics([include_weights, top_n]) – Returns a list, equal in length to the number of topics, where each item is a list of words or word-weight tuples.
print_topics([include_weights, top_n]) – Prints the top words for each topic from a trained model.
- get_features(df, keep_sparse=False)[source]
Uses the trained vectorizer to process a pandas.DataFrame and return a feature matrix.
- Parameters
df – The pandas.DataFrame to vectorize (must have self.text_col in it)
keep_sparse (bool) – Whether or not to keep the feature matrix in sparse format (default=False)
- Returns
A pandas.DataFrame of features or a sparse matrix, depending on the value of keep_sparse
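A minimal sketch (not part of the original documentation) of re-using the trained vectorizer on new documents, assuming the model object and movie-review setup from the class usage example above:
import pandas as pd

new_docs = pd.DataFrame({"text": ["an entertaining movie with a great plot"]})

features = model.get_features(new_docs)  # dense pandas.DataFrame of features
sparse_features = model.get_features(new_docs, keep_sparse=True)  # sparse matrix
print(features.shape)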
- get_fit_params(**kwargs)[source]
Internal helper function to set defaults depending on the specified model.
- Parameters
kwargs – Arguments passed to
self.fit()
- Returns
Arguments to pass to the model
- fit(df=None, **kwargs)[source]
Fits a model using the method specified when initializing the TopicModel. Details on model-specific parameters are below:
sklearn_lda
Fits a model using sklearn.decomposition.LatentDirichletAllocation. For more information on available parameters, please refer to the official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
- Parameters
df – The pandas.DataFrame to train the model on (must contain self.text_col)
alpha – Represents document-topic density. When values are higher, documents will be comprised of more topics; when values are lower, documents will be primarily comprised of only a few topics. This parameter is used instead of the doc_topic_prior sklearn parameter, and will be passed along to sklearn using the formula: doc_topic_prior = alpha / num_topics
beta – Represents topic-word density. When values are higher, topics will be comprised of more words; when values are lower, only a few words will be loaded onto each topic. This parameter is used instead of the topic_word_prior sklearn parameter, and will be passed along to sklearn using the formula: topic_word_prior = beta / num_topics.
.learning_decay – See sklearn documentation.
learning_offset – See sklearn documentation.
learning_method – See sklearn documentation.
max_iter – See sklearn documentation.
batch_size – See sklearn documentation.
verbose – See sklearn documentation.
sklearn_nmf
Fits a model using sklearn.decomposition.NMF. For more information on available parameters, please refer to the official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
- Parameters
df – The pandas.DataFrame to train the model on (must contain self.text_col)
alpha – See sklearn documentation.
l1_ratio – See sklearn documentation.
tol – See sklearn documentation.
max_iter – See sklearn documentation.
shuffle – See sklearn documentation.
gensim_lda
Fits an LDA model using gensim.models.LdaModel or gensim.models.ldamulticore.LdaMulticore. When use_multicore is set to True, the multicore implementation will be used; otherwise, the standard LDA implementation will be used (see the sketch after the parameter list below). For more information on available parameters, please refer to the official documentation below:
use_multicore=True: https://radimrehurek.com/gensim/models/ldamulticore.html
use_multicore=False: https://radimrehurek.com/gensim/models/ldamodel.html
- Parameters
df – The pandas.DataFrame to train the model on (must contain self.text_col)
alpha – Represents document-topic density. When values are higher, documents will be comprised of more topics; when values are lower, documents will be primarily comprised of only a few topics. Gensim options are a bit different than sklearn though; refer to the documentation for the accepted values here.
beta – Represents topic-word density. When values are higher, topics will be comprised of more words; when values are lower, only a few words will be loaded onto each topic. Gensim options are a bit different than sklearn though; refer to the documentation for the accepted values here. Gensim calls this parameter eta. We renamed it to be consistent with the sklearn implementations.
chunksize – See gensim documentation.
passes – See gensim documentation.
decay – See gensim documentation.
offset – See gensim documentation.
workers – Number of cores to use (if using multicore)
use_multicore – Whether or not to use multicore
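A minimal sketch (not part of the original documentation) of fitting a gensim LDA model through the wrapper, reusing the movie-review df from the class usage example; the parameter values are arbitrary:
model = TopicModel(df, "text", "gensim_lda", num_topics=5, min_df=25, max_df=0.5)
model.fit(passes=10, chunksize=200, use_multicore=False)
model.print_topics()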
gensim_hdp
Fits an HDP model using the gensim implementation. Contrary to LDA and NMF, HDP attempts to auto-detect the correct number of topics. In practice, it actually fits T topics (default is 150), but many are extremely rare or occur in only a very small number of documents. To identify the topics that are actually useful, this function passes the original pandas.DataFrame through the trained model after fitting, and identifies topics that compose at least 1% of a document in at least 1% of all documents in the corpus. In other words, a topic is thrown out if the number of documents in which it accounts for at least 1% is fewer than 1% of the total number of documents. Subsequent use of the model will only make use of topics that meet this threshold. For more information on available parameters, please refer to the official documentation: https://radimrehurek.com/gensim/models/hdpmodel.html
- Parameters
df – The pandas.DataFrame to train the model on (must contain self.text_col)
max_chunks – See gensim documentation.
max_time – See gensim documentation.
chunksize – See gensim documentation.
kappa – See gensim documentation.
tau – See gensim documentation.
T – See gensim documentation.
K – See gensim documentation.
alpha – See gensim documentation.
beta – See gensim documentation.
gamma – See gensim documentation.
scale – See gensim documentation.
var_converge – See gensim documentation.
corex
Fits a CorEx topic model. Anchors can be provided in the form of a list of lists, with each item corresponding to a set of words to be used to seed a topic. For example:
anchors=[
    ['cat', 'kitten'],
    ['dog', 'puppy']
]
The list of anchors cannot be longer than the specified number of topics, and all of the words must exist in the vocabulary. The anchor_strength parameter determines the degree to which the model is able to override the suggested words based on the data; providing higher values is a way of “insisting” more strongly that the model keep the provided words together in a single topic. For more information on available parameters, please refer to the official documentation: https://github.com/gregversteeg/corex_topic (A brief usage sketch follows the parameter list below.)
- Parameters
df – The pandas.DataFrame to train the model on (must contain self.text_col)
anchors – A list of lists that contain words that the model should try to group together into topics
anchor_strength – The degree to which the provided anchors should be preserved regardless of the data
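A minimal sketch (not part of the original documentation) of an anchored CorEx model, reusing the movie-review df from the class usage example; the anchor words are arbitrary choices and must exist in the fitted vocabulary:
model = TopicModel(df, "text", "corex", num_topics=5, min_df=25, max_df=0.5)
model.fit(
    anchors=[["action", "fight"], ["comedy", "funny"]],  # seed two of the five topics
    anchor_strength=4,
)
model.print_topics()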
- get_score()[source]
Returns goodness-of-fit scores for certain models, based on the holdout documents.
Note
The following scores are available for the following methods:
perplexity: (sklearn_lda only) The model’s perplexity
score: (sklearn_lda only) The model’s log-likelihood score
total_correlation: (corex only) The model’s total correlation score
- Returns
A dictionary with goodness-of-fit scores
- Return type
dict
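A minimal sketch (not part of the original documentation): goodness-of-fit scores are only defined for certain methods, so the contents of the returned dictionary depend on how the model was initialized. Reusing the movie-review df from the class usage example:
lda = TopicModel(df, "text", "sklearn_lda", num_topics=5, min_df=25, max_df=0.5)
lda.fit()
print(lda.get_score())  # for sklearn_lda, expect 'perplexity' and 'score' keys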
- get_document_topics(df, **kwargs)[source]
Takes a pandas.DataFrame and returns a document-topic pandas.DataFrame (rows=documents, columns=topics).
- Parameters
df – The pandas.DataFrame to process (must have self.text_col in it)
min_probability (float) – (gensim_lda use_multicore=False only) Topics with a probability lower than this threshold will be filtered out (Default=0.0)
- Returns
A document-topic matrix
- get_topics(include_weights=False, top_n=10, **kwargs)[source]
Returns a list, equal in length to the number of topics, where each item is a list of words or word-weight tuples.
- Parameters
include_weights (bool) – Whether or not to include weights along with the ngrams
top_n (int) – The number of words to include for each topic
- Returns
A list of lists, where each item is a list of ngrams or ngram-weight tuples
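A minimal sketch (not part of the original documentation) of pulling topics back as Python lists from the sklearn_nmf model trained in the class usage example above:
topics = model.get_topics(top_n=5)
print(topics[0])  # the five top words for the first topic

weighted = model.get_topics(include_weights=True, top_n=3)
print(weighted[0])  # word-weight tuples for the first topic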