One a Day One Liners with Python — Week 2
Text Processing
Welcome Back!
Originally, I didn’t think I’d have so much to say about so little code, but that hasn’t been the case thus far, and since I’m planning to keep this going for the rest of the year, it’s probably wise to split up the One Liners into weekly installments. Also, I’m gonna experiment with setting a theme for each week. This week’s theme is “Text Processing”.
Last week’s One Liners can be found here.
Jan 14, 2023
Find occurrences of terms or phrases 🔎
terms = ['species of a genus', 'differs', 'mutation']
result = [(term, i) for term in terms for i in range(len(text)) if text.startswith(term, i)]
Behold, our first nested list comprehension. It iterates through the text and creates a list of tuples containing a term and the start index of that term in the text. The real magic here is the startswith method on the built-in str type. Its parameters are the term or phrase to match and the index to look at. It’s possible to write this nested list comprehension the other way around, where first we iterate over the text and then over the terms. Interestingly, I found that way to be consistently faster. Check out the code in the GitHub repo to view some simple benchmarking.
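For reference, here’s a sketch of that reordered variant, using the same text and terms as above:

# iterate over positions in the text first, then the terms
result = [(term, i) for i in range(len(text)) for term in terms if text.startswith(term, i)]

One subtle difference: the results now come back ordered by position in the text rather than grouped by term.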
Jan 13, 2023
Count most common phrases of length n 💯
n = 6
top_n = 50
phrases = Counter([' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]).most_common(top_n)
Given a body of text, find the top_n most common phrases of length n. This combines the n-gram and bag-of-words One Liners from a few days ago into a mega One Liner that is surprisingly fun to tune and explore. In the repo on GitHub, we’re using the text of Darwin’s “On the Origin of Species” in many of the demos. Here are the top 10 phrases of length 8 from that book:
1: however much they may differ from each other - (4 times)
2: succession of the same types within the same - (4 times)
3: the nature of the organism and of the - (3 times)
4: where many species of a genus have been - (3 times)
5: on the view that species are only strongly - (3 times)
6: the view that species are only strongly marked - (3 times)
7: view that species are only strongly marked and - (3 times)
8: i have not space here to enter on - (3 times)
9: all the species of the same genus are - (3 times)
10: the ordinary view of each species having been - (3 times)
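Here’s a self-contained sketch of the whole pipeline, with a toy text standing in for the book; everything apart from the import and the comprehension itself is just illustrative scaffolding:

from collections import Counter

text = 'the quick brown fox jumps over the lazy dog the quick brown fox returns'
tokens = text.lower().split()

n = 4      # phrase length
top_n = 3  # how many phrases to report

# len(tokens) - n + 1 avoids counting trailing windows shorter than n
phrases = Counter([' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]).most_common(top_n)

for rank, (phrase, count) in enumerate(phrases, start=1):
    print(f'{rank}: {phrase} - ({count} times)')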
Jan 12, 2023
Remove stop words 🫣
import re
text = ' '.join([t for t in re.sub(r'[^\w\s]', '', text.lower()).split(' ') if t not in stop])
Discussion
“Stop words” are generally the most common words in a given language. In English, words such as “the”, “a”, “he”, “she”, “it” can introduce noise into some modeling tasks. In these cases it is preferable to remove them prior to passing the text to downstream processing.
In this One Liner, we (see the full sketch below):
- lowercase the text
- strip out punctuation
- split the string on spaces
- filter out the stop words
- join the parts back into one whole body of text
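Here’s a self-contained version of the above, assuming a tiny hand-rolled stop set for illustration (in practice you’d probably pull a fuller list from a library like NLTK or spaCy):

import re

# assumed inputs: a toy stop set and some sample text
stop = {'the', 'a', 'he', 'she', 'it', 'is', 'of'}
text = 'The origin of species is a book; it changed biology.'

# lowercase, strip punctuation, split, filter, rejoin
text = ' '.join([t for t in re.sub(r'[^\w\s]', '', text.lower()).split(' ') if t not in stop])
print(text)  # origin species book changed biology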
Jan 11, 2023
Remove all that pesky punctuation with str.translate 🔤
import string
punc_free_text = text.translate(str.maketrans('', '', string.punctuation))
Discussion
str.maketrans is a static method that creates a translation mapping. When the first two arguments are empty strings, the third argument, which is expected to be a string, is converted to a mapping of characters to None. Conveniently, the string module provides a string of common punctuation marks. Supplying string.punctuation as the third parameter effectively translates all punctuation marks to None, removing them. I think we’ll have some more fun with translate and maketrans in a future One Liner.
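Here’s a quick look at what that mapping actually contains; the sample text is just for illustration:

import string

# the table maps each punctuation character's code point to None
table = str.maketrans('', '', string.punctuation)
print(table[ord('!')])  # None

print('Well, punctuation -- be gone!'.translate(table))  # Well punctuation  be gone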
Jan 10, 2023
Redact a top secret document 🕵🏻‍♂️
redacted = ' '.join([t if t not in secrets else '*' * len(t) for t in text])
Discussion
There are some cases where you need to obfuscate entities in text data before passing it along to some other *******. Perhaps you need to anonymize someone’s **** or keep sensitive information like ***** or credit card numbers hidden, but also want to maintain some kind of reference in place for the words that have been redacted.
This One Liner isn’t a complete solution to the problem, but it could be a good starting place. text is expected to be a list of words or tokens, and secrets is a list of strings to be obscured.
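For example, with some toy inputs:

# hypothetical token list and secrets, just for demonstration
text = ['agent', 'smith', 'met', 'the', 'courier', 'in', 'berlin']
secrets = ['smith', 'berlin']

redacted = ' '.join([t if t not in secrets else '*' * len(t) for t in text])
print(redacted)  # agent ***** met the courier in ******

If the secrets list grows large, turning it into a set would make the membership test constant time.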
Jan 9, 2023
Create a Bag of Words model with the Counter 🔢
from collections import Counter
bag_of_words = Counter(text.split(' '))
Discussion
Too easy, right? The Counter class is pretty cool. Pass in an iterable and get back a dict-like object mapping each item to its count. You can also get the n most common elements in the collection using the most_common(n) method. Counter also supports addition (+), subtraction (-), intersection (&) and union (|) of instances, via overloaded operators.
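A quick demonstration of those operators:

from collections import Counter

a = Counter('abracadabra')  # Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
b = Counter('alakazam')     # Counter({'a': 4, 'l': 1, 'k': 1, 'z': 1, 'm': 1})

print(a + b)  # counts added element-wise
print(a - b)  # counts subtracted; zero and negative counts are dropped
print(a & b)  # intersection: min of each count
print(a | b)  # union: max of each count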
Jan 8, 2023
Generate n-grams ✂️
n = 5
n_grams = [(i, i+n, ' '.join(items[i:i+n])) for i in range(len(items) - n + 1)]
Discussion
n-grams are contiguous slices of a text, consisting of “n” consecutive characters, syllables or words. In the example above, items is expected to be a list of strings. n-grams are often useful in Natural Language Processing tasks like predicting the next word in a sequence, based on the words around it.
In this One Liner, we generate a list of 3-tuples, each containing the start and end indices of the n-gram in the original text and the n-gram itself.
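For example, with a short sentence and n set to 3:

n = 3
items = 'the origin of species by means'.split()

n_grams = [(i, i+n, ' '.join(items[i:i+n])) for i in range(len(items) - n + 1)]
# [(0, 3, 'the origin of'), (1, 4, 'origin of species'),
#  (2, 5, 'of species by'), (3, 6, 'species by means')]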
Join in…
Feel free to leave comments here or clone the repo on GitHub and make a pull request if you think you’ve got a better solution. Benchmarks are welcome too!