Automating Work with Dictionaries in Python

After some time off to soak in the post-MBA life, take a vacation, and spend more time just trying to wrap my head around the basics of Python programming, I’m back to writing and sharing my experiences. I’ve used several sources to supplement my learning of the language: first wrapping up Python Crash Course (which I wrote a bit about in a previous post), then working through Learn Python 3 the Hard Way (not my favorite book, but it might work for other people), and finally settling on a course from the University of Michigan on Coursera, which I’ve really been liking. I think the most impactful strategy for learning a bit more about Python is just consistent practice, even if it’s only 20 minutes a day.

Dictionaries in Python

Beyond just trying to learn a new language and developing some general programming frameworks and philosophies, I think it’s important to focus on how you can apply what you’re learning to your workflow, your career, or even your whole organization if it’s that impactful. With that in mind, one component of Python that I’ve really enjoyed and that can be used to automate processes is the dictionary. This topic has come up before in a previous post, but the concept is so powerful that it bears repeating. What makes dictionaries really powerful is that they allow you to store key-value pairs, and by gathering those pairs (or items, if you want to get technical), you can perform a ton of different operations on them. Dictionaries, when combined with other constructs like while and for loops, also make life easier if you want to gather information from a flat-file source, like a CSV file or a spreadsheet. I know that I’ve been spoiled by all of the incredible search and analytics applications out there that make my life way easier when I’m trying to find or analyze some data, but when I have a file that’s in a different format, I tend to get stuck.

With that in mind, I wanted to share an incredibly simple program that I created for an assignment. In a nutshell, the program parses through all the lines of the file, looks for any line that starts with ‘From:’, and then collects the email address that comes after it. After it does this, it checks to see who sent the most emails, and how many they sent. It’s simple, fast, and effective. I think that anyone with a little more programming experience could write functionality that’d do even more with the file, but you can clearly see how just this one script could be super helpful in automating your workflow and speeding things up a bit.

name = input("Enter file:")
if len(name) < 1:
    name = "mbox-short.txt"
handle = open(name)

# Tally how many messages each address has sent
counts = dict()
for line in handle:
    if not line.startswith("From:"):
        continue
    address = line.split()[1]
    counts[address] = counts.get(address, 0) + 1

# Find the address with the highest count
bigcount = None
bigsender = None
for sender, count in counts.items():
    if bigcount is None or count > bigcount:
        bigsender = sender
        bigcount = count

print(bigsender, bigcount)
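
As an aside, once the counts are in a dictionary, Python’s standard library can handle the “find the biggest” step for you. Here’s a quick sketch using collections.Counter (the sample lines are made up stand-ins for the mbox file):

```python
from collections import Counter

# Hypothetical sample lines standing in for the mbox file
lines = [
    "From: alice@example.com",
    "From: bob@example.com",
    "From: alice@example.com",
]

# Counter builds the same key-value counts as the loop version
counts = Counter(line.split()[1] for line in lines if line.startswith("From:"))

# most_common(1) returns the (sender, count) pair with the highest count
bigsender, bigcount = counts.most_common(1)[0]
print(bigsender, bigcount)  # alice@example.com 2
```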

If you want to dig a little deeper into the code and how it works, feel free to check it out on GitHub.


My latest experience in learning Python has centered around classes, a concept that’s really at the heart of what makes Python an object-oriented programming language. It’s been challenging, and I’m just barely scratching the surface, but it has been a lot of fun so far. I can tell that this concept is core to defining whole categories of objects and the actions and behaviors your code will take towards those categories. Classes allow you to apply this logic to objects more consistently, at least in theory. Below is a super simple example I’ve taken from Eric Matthes’ Python Crash Course:
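
The original screenshot of the example didn’t make it into this post, but it looks roughly like this (reconstructed from memory, so the names and ages are just the book’s sample values):

```python
class Dog:
    """A simple attempt to model a dog."""

    def __init__(self, name, age):
        """Store the dog's name and age."""
        self.name = name
        self.age = age

    def sit(self):
        """Simulate a dog sitting in response to a command."""
        print(f"{self.name} is now sitting.")

    def roll_over(self):
        """Simulate rolling over in response to a command."""
        print(f"{self.name} rolled over!")


# Define the class once, then create as many dogs as we want
my_dog = Dog("Willie", 6)
your_dog = Dog("Lucy", 3)
my_dog.sit()
your_dog.roll_over()
```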

What this example shows is that we can make an attempt at modeling a dog, defining the behavior and actions of any dog (there could be millions of dogs, but we only need to do this once), and then proving out this logic by calling various methods on different dogs.

Dictionaries in Python

Lately, I’ve been learning how to take advantage of dictionaries in Python to store and organize information. Dictionaries serve a lot of different purposes in Python, but as applied to data visualization, they can be extremely helpful. Simply put, dictionaries allow us to store almost limitless amounts of information about whatever we want, and then call on it when we’re writing our programs or creating visualizations. Doing this allows us to model real-world information more accurately, whether that’s details about a person, statistics about sports, econometric data, and so on. Another beautiful thing about dictionaries is that they’re easy to manipulate. Rather than using variables for everything, we can sort, loop, or write functions using dictionary values. We can even create lists of dictionaries. The example below shows how we can create a dictionary where the values are lists, and then cycle through the lists to create messages about each person’s favorite languages. Yes, I realize it’s a bit weird just to post a screenshot of my code. However, it seems like a pretty intense process to show an iPython notebook in WordPress. (Edit: Updated the post to show iPython notebook)
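
For anyone without the notebook handy, here’s a minimal sketch of the pattern (the names and languages are made up):

```python
# A dictionary where each value is a list of favorite languages
favorite_languages = {
    "jen": ["python", "ruby"],
    "sarah": ["c"],
    "edward": ["go", "rust"],
}

# Cycle through each person, then through their list of languages
for name, languages in favorite_languages.items():
    print(f"{name.title()}'s favorite languages are:")
    for language in languages:
        print(f"\t{language.title()}")
```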

Another example of using dictionaries is the basketball data I analyzed for an introductory course I took. The course was called “Python A-Z: Python for Data Science” and I found it through Udemy. We stored each player’s statistics for the past five years into dictionaries and then were able to call upon them way further down the script to run some analysis. Below you’ll see just one of the visualizations I was able to create using dictionaries for organizing the data, NumPy for the statistical functions we needed to run, and Matplotlib for the visualization part. Watch out for another post on Matplotlib and Seaborn shortly.
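
I can’t share the course’s dataset here, but the general approach can be sketched like this (the players and numbers are entirely made up):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical points-per-game figures for five seasons
points_per_game = {
    "Player A": [22.1, 24.3, 25.0, 23.8, 26.4],
    "Player B": [18.5, 19.2, 21.7, 20.9, 22.3],
}

seasons = np.arange(2013, 2018)
for player, ppg in points_per_game.items():
    # NumPy handles the statistics; Matplotlib draws the lines
    print(player, "average:", np.mean(ppg))
    plt.plot(seasons, ppg, label=player)

plt.xlabel("Season")
plt.ylabel("Points per game")
plt.legend()
plt.savefig("ppg.png")
```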

Back to Basics: Learning Python from Scratch

I’ve been working a lot with the R programming language recently and decided to take some time this weekend to refocus on my original goal of mastering Python. My plan all along has been to work in both languages, but my classes have put the emphasis on R. I decided to keep this weekend incredibly simple and just focus on the foundational components of the language that might differ from R: things like lists, if/else statements, and for loops. It’s been a lot of fun. Working in Jupyter Notebooks makes this process so incredibly easy because all I need to do is work in that IDE and execute the code right on the screen. RStudio is similar, but it’s more of a process, and it’s missing the overall “flow” that attracted me to learning languages in the first place.

My Extremely Basic Code
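
The screenshot captures the actual exercises, but the flavor of that basic code is roughly this (example values made up):

```python
# A few of the foundational pieces: lists, for loops, and if/else
bikes = ["trek", "cannondale", "specialized"]

for bike in bikes:
    if bike == "trek":
        print(f"I own a {bike}.")
    else:
        print(f"I've heard good things about {bike}.")
```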

The primary resource I’ve been using this weekend has been a book called ‘Python Crash Course: A Hands-On, Project-Based Introduction to Programming‘ by Eric Matthes. Eric has done a great job of writing this book, keeping it simple, and building in instances where the reader can test the concepts and create some code themselves. As a beginner, I appreciate the challenge and opportunity to be creative and make something unique. I find that this opportunity to test your knowledge is missing in a lot of online learning environments, like Udemy. No disrespect to Udemy at all, it’s just a constraint of the online learning environment that is challenging to work around. Anyways, I chose this book because it promised projects working with data science, had high ratings on Amazon (for good reason), and I needed something with a project-based approach where I could make mistakes, fail, and overcome the challenges when I actually experiment and apply the concepts.

Python Crash Course

Classification Trees Using R

Recently in class, we’ve been building models called classification trees (also known as decision trees). There are a few different types of models used in decision tree learning, but classification trees predict outcomes that are categories rather than continuous values. The classic example is the iris dataset published in 1936 by Ronald Fisher. In this example, we can see how the outcome (the species name) can be accurately predicted using a classification tree’s decision branches. See below:
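
The tree in the figure was built in R, but the branching logic it learns can be written out by hand. The thresholds below are illustrative approximations of the splits a tree typically finds on the iris data, not the exact fitted values from class:

```python
def classify_iris(petal_length, petal_width):
    """Mimic the decision branches of a fitted classification tree.

    Thresholds (in cm) are illustrative approximations of what a tree
    typically learns on Fisher's iris data.
    """
    if petal_length <= 2.45:
        return "setosa"
    elif petal_width <= 1.75:
        return "versicolor"
    else:
        return "virginica"


print(classify_iris(1.4, 0.2))  # setosa
print(classify_iris(4.5, 1.4))  # versicolor
print(classify_iris(5.8, 2.2))  # virginica
```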

Classification Tree of Iris Data

Classification trees have widespread applications across a variety of fields, but one of the more powerful examples is in health and medicine, where predictive modeling can be applied to large sets of data to help spot diseases and illnesses early. We experimented with how these models could be built in our class, using anonymized data where each patient had a growth that was classified as either malignant or benign. Using these classification trees, a researcher (or software) can apply the trained algorithm or script to new data where the outcome is unknown and develop a plan to intervene with those patients (provide free screenings, contact patients that the algorithm predicts may be at risk, and so on). While these techniques shouldn’t replace regular visits and interactions with medical professionals, they can help an industry under enormous constraints be proactive in detecting and treating diseases before they cost patients time, money, or even their lives.

Classification Tree for Breast Cancer Research

Cluster Analysis with R

Recently in one of my classes, we’ve been experimenting with using R to run cluster analysis. Clustering is a way of assigning rows or records (or respondents, in the end) to groups. The person running the analysis controls how many groups to create, but depending on the data and the number of rows, you’ll usually set up no more than 10 groups. At any rate, this is an extremely powerful tool because it lets the people in your organization determine what your customers have in common, which allows for segmentation and a tailored approach to working with your customers. This isn’t just a greedy way of squeezing more money out of people; it can be used to create different products, user experiences, subscription packages, or any other offering that ultimately makes your relationship with your customers just a little bit better.
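
To make the idea concrete, here’s a bare-bones sketch of k-means (one common clustering algorithm) on one-dimensional data. The income figures are made up, and the deterministic centroid seeding is a simplification of the usual random initialization:

```python
def kmeans_1d(points, k, iterations=10):
    """Bare-bones k-means on 1-D data: assign each point to its nearest
    centroid, then move each centroid to the mean of its group."""
    # Seed the centroids with spread-out points (a real run would randomize)
    step = max(len(points) // k, 1)
    centroids = sorted(points)[::step][:k]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the nearest centroid
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its assigned points
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids

# Hypothetical annual incomes (in thousands) from fake survey data
incomes = [28, 31, 29, 55, 58, 61, 95, 99, 102]
print(sorted(round(c) for c in kmeans_1d(incomes, k=3)))  # [29, 58, 99]
```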

I’ve been learning a bit of R lately, as my class requires us to use it, and I can tell that it’s not only extremely powerful but has a lot of depth. I haven’t even scratched the surface. I’ve been using an IDE called RStudio. RStudio lets you view the command line (if you use R out of the box, all it will do is open a command prompt window), write your script (the script I’m using is modified from one we used in class; I’m not that good yet :)), see all of the variables you’ve assigned and all of the data you’re using, and, finally, view your plots or any other output you’ve asked R to create in a dedicated window. It’s quite fancy.


The data we used was fake demographic data that a fake automobile company had collected. Our assignment was to divide the respondents into as many clusters as we thought was appropriate (I chose 3, but you could certainly make a case for 4), and then provide recommendations to the fake marketing manager on how this information could be used to better market to our customers.

Cluster analysis is a lot of fun and should definitely be a tool in every marketer’s toolbox. That being said, it is completely subjective, and you could make a case for as many or as few clusters of customers as you’d like. Also, it’s important to remember that even though it’s convenient for you to assign individuals to groups, you’ll still need to deal with people on an individual level and train your staff not to segment people or make assumptions just because someone fits your model. This all reminds me of an article we read for our class about a mathematics Ph.D. who applied clustering to his dating life. Sometimes, you can overthink things.

As always, if you want to view my script or the raw data (and make changes or recommendations), you can do that on GitHub.

Visualizations with Python

I’ve recently started experimenting with the programming language Python. My main goal is to apply Python to data science projects when needed. I’m still in the super early stages of learning the basic concepts, working with packages, and just seeing the broader picture of what it has to offer, but so far it’s been relatively straightforward and clean. Working in Jupyter Notebooks is a huge factor in what makes it so convenient and user-friendly. Jupyter is basically a browser-based IDE that has a very similar feel to Google Docs, Sheets, etc., and allows users to create in a more comfortable, convenient environment. I really love it so far. Anyways, below is a small dashboard I created. The data and assignment were both from a course on Udemy which served as my introduction to Python, although at this point I’m looking for more project-based tutorials. The visualizations are running off of a package called Seaborn, which is incredibly flexible and intuitive. Seaborn works on top of Matplotlib, one of the core plotting packages in the Python data science ecosystem. Finally, if you want to review my code and let me know what you think, check it out at GitHub.
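
As a taste of what Seaborn code looks like, here’s a minimal scatterplot sketch (the ratings here are made up, not the course’s data):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical ratings standing in for the course's movie dataset
critic_ratings = [45, 60, 72, 85, 90, 55, 63]
audience_ratings = [50, 58, 80, 88, 85, 40, 70]

# Seaborn draws the plot; Matplotlib handles the figure underneath
ax = sns.scatterplot(x=critic_ratings, y=audience_ratings)
ax.set(xlabel="Critic rating", ylabel="Audience rating")
plt.savefig("ratings.png")
```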

A visualization of critic ratings, audience ratings, and budgets, all using the Seaborn visualization package for Python.

Working with Tableau

Recently I’ve been working a lot with the data visualization tool Tableau. It’s exceptional in its ability to provide detailed filtering, granularity, and flexibility. In fact, I’m coming to find that if the data I’m working with is clean enough and set up correctly, there’s not much I can’t do. With that in mind, I’ve decided to share a workbook I created for a class I’m taking. The data is basically just made-up sales data, but the exercises show the level of flexibility you can achieve. Anyways, check it out.