One a Day One Liners with Python — Week 2
Text Processing
Welcome Back!
Originally, I didn’t think I’d have so much to say about so little code, but that hasn’t been the case thus far, and since I’m planning to keep this going for the rest of the year, it’s probably wise to split up the One Liners into weekly installments. Also, I’m gonna experiment with setting a theme for each week. This week’s theme is “Text Processing”.
Last week’s One Liners can be found here.
Jan 14, 2023
Find occurrences of terms or phrases 🔎
terms = ['species of a genus', 'differs', 'mutation']
result = [(term, i) for term in terms for i in range(len(text)) if text.startswith(term, i)]
Behold, our first nested list comprehension. It iterates through the text and creates a list of tuples containing a term and the start index of that term in the text. The real magic here is the startswith method on the built-in str type. Its parameters are the term or phrase to match and the index to look at. It’s possible to write this nested list comprehension the other way around, where first we iterate over the text and then over the terms. Interestingly, I found that way to be consistently faster. Check out the code in the GitHub repo to view some simple benchmarking.
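For reference, here’s a sketch of that reordered variant, using the same text and terms as above:

# iterate over positions in the text first, then the terms
result = [(term, i) for i in range(len(text)) for term in terms if text.startswith(term, i)]

One subtle difference: the results now come back ordered by position in the text rather than grouped by term.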
Jan 13, 2023
Count most common phrases of length n 💯
n = 6
top_n = 50
phrases = Counter([' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]).most_common(top_n)
Given a body of text, find the top_n most common phrases of length n. This combines the n-gram and bag-of-words One Liners from a few days ago into a mega One Liner that is surprisingly fun to tune and explore. In the repo on GitHub, we’re using the text of Darwin’s “On the Origin of Species” in many of the demos. Here are the top 10 phrases of length 8 from that book:
1: however much they may differ from each other - (4 times)
2: succession of the same types within the same - (4 times)
3: the nature of the organism and of the - (3 times)
4: where many species of a genus have been - (3 times)
5: on the view that species are only strongly - (3 times)
6: the view that species are only strongly marked - (3 times)
7: view that species are only strongly marked and - (3 times)
8: i have not space here to enter on - (3 times)
9: all the species of the same genus are - (3 times)
10: the ordinary view of each species having been - (3 times)
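Here’s a self-contained sketch of the whole pipeline, with a toy text standing in for the book; everything apart from the import and the comprehension itself is just illustrative scaffolding:

from collections import Counter

text = 'the quick brown fox jumps over the lazy dog the quick brown fox returns'
tokens = text.lower().split()

n = 4      # phrase length
top_n = 3  # how many phrases to report

# len(tokens) - n + 1 avoids counting trailing windows shorter than n
phrases = Counter([' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]).most_common(top_n)

for rank, (phrase, count) in enumerate(phrases, start=1):
    print(f'{rank}: {phrase} - ({count} times)')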
Jan 12, 2023
Remove stop words 🫣
import re
text = ' '.join([t for t in re.sub(r'[^\w\s]', '', text.lower()).split(' ') if t not in stop])
Discussion
“Stop words” are generally the most common words in a given language. In English, words such as “the”, “a”, “he”, “she”, “it” can introduce noise into some modeling tasks. In these cases it is preferable to remove them prior to passing the text to downstream processing.
In this One Liner, we (see the full sketch below):
- lowercase the text
- strip out punctuation
- split the string on spaces
- filter out the stop words
- join the parts back into one whole body of text
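Here’s a self-contained version of the above, assuming a tiny hand-rolled stop set for illustration (in practice you’d probably pull a fuller list from a library like NLTK or spaCy):

import re

# assumed inputs: a toy stop set and some sample text
stop = {'the', 'a', 'he', 'she', 'it', 'is', 'of'}
text = 'The origin of species is a book; it changed biology.'

# lowercase, strip punctuation, split, filter, rejoin
text = ' '.join([t for t in re.sub(r'[^\w\s]', '', text.lower()).split(' ') if t not in stop])
print(text)  # origin species book changed biology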
Jan 11, 2023
Remove all that pesky punctuation with str.translate 🔤
import string
punc_free_text = text.translate(str.maketrans('', '', string.punctuation))
Discussion
str.maketrans is a static method that creates a translation mapping. When the first two arguments are empty strings, the third argument, which is expected to be a string, is converted to a mapping of characters to None. Conveniently, the string module provides a string of common punctuation marks. Supplying string.punctuation as the third parameter effectively translates all punctuation marks to None, removing them. I think we’ll have some more fun with translate and maketrans in a future One Liner.
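Here’s a quick look at what that mapping actually contains; the sample text is just for illustration:

import string

# the table maps each punctuation character's code point to None
table = str.maketrans('', '', string.punctuation)
print(table[ord('!')])  # None

print('Well, punctuation -- be gone!'.translate(table))  # Well punctuation  be gone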
Jan 10, 2023
Redact a top secret document 🕵🏻‍♂️
redacted = ' '.join([t if t not in secrets else '*' * len(t) for t in text])
Discussion
There are some cases where you need to obfuscate entities in text data before passing it along to some other *******. Perhaps you need to anonymize someone’s **** or keep sensitive information like ***** or credit card numbers hidden, but also want to maintain some kind of reference in place for the words that have been redacted.
This One Liner isn’t a complete solution to the problem, but it could be a good starting place. text is expected to be a list of words or tokens, and secrets is a list of strings to be obscured.
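For example, with some toy inputs:

# hypothetical token list and secrets, just for demonstration
text = ['agent', 'smith', 'met', 'the', 'courier', 'in', 'berlin']
secrets = ['smith', 'berlin']

redacted = ' '.join([t if t not in secrets else '*' * len(t) for t in text])
print(redacted)  # agent ***** met the courier in ******

If the secrets list grows large, turning it into a set would make the membership test constant time.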
Jan 9, 2023
Create a Bag of Words model with the Counter 🔢
from collections import Counter
bag_of_words = Counter(text.split(' '))
Discussion
Too easy, right? The Counter class is pretty cool. Pass in an iterable and get back a dict-like object mapping each item to its count. You can also get the n most common elements in the collection using the most_common(n) method. Counter also supports addition (+), subtraction (-), intersection (&) and union (|) of instances, via overloaded operators.
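A quick demonstration of those operators:

from collections import Counter

a = Counter('abracadabra')  # Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
b = Counter('alakazam')     # Counter({'a': 4, 'l': 1, 'k': 1, 'z': 1, 'm': 1})

print(a + b)  # counts added element-wise
print(a - b)  # counts subtracted; zero and negative counts are dropped
print(a & b)  # intersection: min of each count
print(a | b)  # union: max of each count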
Jan 8, 2023
Generate n-grams ✂️
n = 5
n_grams = [(i, i+n, ' '.join(items[i:i+n])) for i in range(len(items) - n + 1)]
Discussion
n-grams are contiguous slices of a text, consisting of “n” consecutive characters, syllables or words. In the example above, items is expected to be a list of strings. n-grams are often useful in Natural Language Processing tasks like predicting the next word in a sequence, based on the words around it.
In this One Liner, we generate a list of 3-tuples, each containing the start and end indices of the n-gram in the original text and the n-gram itself.
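For example, with a short sentence and n set to 3:

n = 3
items = 'the origin of species by means'.split()

n_grams = [(i, i+n, ' '.join(items[i:i+n])) for i in range(len(items) - n + 1)]
# [(0, 3, 'the origin of'), (1, 4, 'origin of species'),
#  (2, 5, 'of species by'), (3, 6, 'species by means')]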
Join in…
Feel free to leave comments here or clone the repo on GitHub and make a pull request if you think you’ve got a better solution. Benchmarks are welcome too!