A new way to analyze your personal chat history

In the last decade, I’ve sent and received over 460,000 messages. That’s an average of 126 messages per day! For photos, Google and Apple have both introduced a “Memories” feature in their respective photo apps, which lets you look back on photos that the almighty machine learning algorithm has deemed meaningful to your life. Why not do the same for messages?

Inspired by Chandler’s similar project, for the last couple of months I’ve been working on an open source tool called Converscope (github, interactive preview). It pulls data from your Facebook chats and iMessages and shows you who you message the most—over the last year, particular time periods of your life, or over all time. You can drill down into a particular thread and see interesting metrics like

  • a histogram of message counts per day
  • the total number of characters sent per person (a proxy for how talkative each party is)
  • your longest streak and when it ended, a la Snapchat
  • the most used emoji per person
  • Blast From The Past™: a random, potentially cringy message from your past
  • the TF-IDF top tokens for the person/group chat. TF-IDF stands for term frequency-inverse document frequency. It automatically surfaces the topics that you talk about a lot in this chat that you don’t talk about with other people/groups.

image This is what a Converscope thread report looks like. It appears the hot topic of conversation with this friend is Apple and Apple products.

I’ve anonymized names and released an interactive preview at converscope.daylen.com, so you can browse around and see what it looks like. (Certain features such as Blast From The Past™ and TF-IDF Top Tokens are not available in the preview, for privacy reasons.) If you’re one of the top 20 people I message the most, you can try to figure out what SHA-1 hash you are 🙃. And as I mentioned, Converscope is open source, so you can run this analysis on your own data! Instructions are in the README, and please do let me know (e.g. via Twitter) if anything doesn’t work.

The rest of this post is split into two sections: Interesting Findings, where I talk about some of the interesting things I learned by poking around this data, and Building Converscope, which is about the visual and technical design decisions I made.

Interesting Findings

Group chats

If you pop open converscope.daylen.com and flip over to the Group Chats tab, the first two chats you see are these mega-chats with over 20K messages each:

image the birth and death of group chats

An artifact of the pre-Slack era, THE H@BMIND was the officer chat for Hackers at Berkeley, a club I joined as a wide-eyed freshman. We were primarily an educational club: we hosted events where we taught eager students things like how to build a web app, or how to use git. The chat was where we did event planning, and the ebb and flow of message counts corresponds to the school seasons. For example, there was lots of activity in fall 2013 and spring 2014 versus a drop in summer 2014. You can also see that the club never made a recovery from the summer 2015 slump: some of the more senior members graduated or moved on to TAing.

As for Laurel Grove crew: in the summer of 2015, I interned at Facebook, and this was the group chat for a set of us who lived at corporate housing at Laurel Grove (such creative naming). We were a chatty bunch, and attempts were made at keeping the group alive after the summer ended. But that petered out by the end of 2016. Every group chat eventually dies.


The predominant flavor of group chat varies based on time period. In my college days, there were the classics like the aptly named presentation due sunday 10pm! (which did in fact see the most messages on Sunday, November 27, 2016), CS162 project group, and 189 shittrs. These group chats shuttered after the respective classes finished.

image gee, I sure do wonder when the presentation was due

Also in that same time period were several group chats dedicated to trips: Thailand 12/31–1/13 __ (video recap!), NYC 8/15–8/21, and Roadest Trip 2017 (a spring break road trip across the PNW, after roader trip 2k16: A Profoundly Religious Experience and the original road trip in 2015, which lacks a group chat). Here, the Most Used Emoji metric comes in handy to distinguish the trips. (In Facebook Messenger, you can set a default emoji for a group, and it turns out people tend to smash that button a lot.)

image the elephants in Thailand were very cute

image I was the most prolific user of the car emoji in the Roadest Trip chat, sending it 30 times

After graduating, a larger percentage of group chats shifted to be trip-oriented. It turns out that group chats centered around trips follow a fairly standard formula:

  1. There is an intense planning phase several months or weeks prior to the trip
  2. Next, the actual trip generates the majority of the messages
  3. Finally, there may be one last hurrah where the Splitwise is closed out or photos are shared

Examples of this phenomenon: new orleans 2019, Ski + Sundance 2019 subgroup, 🏠🏠🏠 Sundance2020.house.

image the three phases of a trip-oriented group chat: intense planning, the actual trip, and one last hurrah

The Longest Streak metric consistently identifies the actual trip duration. Modeled after Snapchat streaks, Longest Streak requires that at least one message be sent every day for consecutive days. So if I’m looking at the confusingly titled You and Taco chat, I can see that perhaps this chat was about the post-graduation Europe tour that my college friends took in 2017, along with our diminishing attempts to stay in touch afterwards.
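The streak rule above is simple enough to sketch in a few lines of Python. This is an illustration only, not Converscope’s actual code: the function name `longest_streak` and the sample dates are made up, and it assumes each message timestamp has already been converted to a calendar date in your local time zone.

```python
from datetime import date, timedelta


def longest_streak(message_dates):
    """Return (length_in_days, end_date) for the longest run of
    consecutive calendar days with at least one message.

    Returns (0, None) when there are no messages. If several runs tie
    for longest, the earliest one is reported.
    """
    days = sorted(set(message_dates))  # one entry per active day
    if not days:
        return 0, None

    best_len, best_end = 1, days[0]
    run = 1
    for prev, cur in zip(days, days[1:]):
        # Extend the run only if this day directly follows the previous one.
        run = run + 1 if cur - prev == timedelta(days=1) else 1
        if run > best_len:
            best_len, best_end = run, cur
    return best_len, best_end


# Hypothetical data: three active days in a row, then a gap.
example_dates = [
    date(2017, 6, 1),
    date(2017, 6, 2),
    date(2017, 6, 2),  # several messages on one day count once
    date(2017, 6, 3),
    date(2017, 6, 10),
]
streak_length, streak_end = longest_streak(example_dates)
```

Because a trip tends to be the only period when a group messages every single day, the longest run of consecutive active days usually lines up with the trip itself.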

Direct messages

Obviously, it’s more exciting when you can actually see the names as opposed to just XXX everywhere. I’ve got a private instance of Converscope with STRIP_PII (personally identifiable information) set to false, but in this post I’ll refer to everyone by their (salted!) SHA-1 hash.

Using the “Sort by…” dropdown and selecting the College option, it’s interesting to see who from college I’ve kept in touch with and who has fallen by the wayside. Just contrast the histogram for my longtime friend 414129b (with whom I have a 59-day streak!) with the one for 368db00.

image a friendship that blossomed and faded

Equally interesting is seeing the new friends I’ve made post-college. Turns out, there are a lot! b01f615, bd417c1, and 8b20864, just to name a few. Here, the TF-IDF Top Tokens feature shows how most of these friendships are oriented around specific activities like cycling and photography:

image “fcc” stands for fatcake club

image we both used to climb

image film, leica, SF. what a mood

And finally, to close out the Interesting Findings section, here’s a fun Blast From The Past™ from when I faceplanted while riding my Boosted Board:

image

Notes on building Converscope

Frontend

The Converscope frontend is a React app and is pretty much just a fancy JSON viewer. I’ll just highlight two of my favorite features. First up is the hover animation for the card design, inspired by the card design in Apple TV. On hover, the card gains a drop shadow and also subtly grows in size.

mmm, drop shadows

Second, I’m pretty proud of the dark mode theme, which automatically kicks in if you switch to dark mode on your iOS or macOS device. (This is thanks to the prefers-color-scheme CSS Media Query!)

mmm, dark mode

You can try these for yourself at converscope.daylen.com.

Backend

There are two parts to the Converscope backend. First, a script parses and merges the Facebook and iMessage data formats into easy-to-understand Inbox, Conversation, and Message protos. From there, the second part computes the metrics to display, which is a matter of filling a dictionary.
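To make “filling a dictionary” concrete, here is a rough Python sketch. The dataclasses are stand-ins for the Conversation and Message protos, and every field name, the function name `conversation_metrics`, and the dictionary keys are assumptions for illustration rather than Converscope’s real schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Message:
    # Hypothetical fields; the real Message proto may differ.
    sender: str
    timestamp: datetime
    content: str


@dataclass
class Conversation:
    # Hypothetical fields; the real Conversation proto may differ.
    participants: list
    messages: list = field(default_factory=list)


def conversation_metrics(conversation):
    """Compute a few per-thread metrics and return them as a plain dict."""
    per_day = Counter()  # ISO date string -> number of messages
    chars = Counter()    # sender -> total characters sent
    for message in conversation.messages:
        per_day[message.timestamp.date().isoformat()] += 1
        chars[message.sender] += len(message.content)
    return {
        "message_count": len(conversation.messages),
        "messages_per_day": dict(per_day),
        "characters_per_sender": dict(chars),
    }


# Made-up sample thread.
demo = Conversation(
    participants=["me", "friend"],
    messages=[
        Message("me", datetime(2019, 1, 1, 9, 0), "hello"),
        Message("friend", datetime(2019, 1, 1, 9, 5), "hi!"),
        Message("me", datetime(2019, 1, 2, 8, 0), "ok"),
    ],
)
metrics = conversation_metrics(demo)
```

A dictionary shaped like this serializes directly to JSON, which fits a frontend that is mostly a JSON viewer.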

The code to parse through iMessage data is probably the ugliest. Just look at this SQL query, which has both an inner join and a left join, plus some fun inline date parsing!

A special thanks to Kevin Chen who pointed out that I should add a salt before hashing the sender name when generating conversation IDs—otherwise it would be trivial to deanonymize the preview website by running a dictionary attack on an easily obtainable friend list (e.g. Facebook).
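A minimal Python sketch of the salting idea follows. The function name `anonymized_id`, the 16-byte random salt, and the truncation to 7 hex characters (chosen to match the short IDs quoted earlier in this post) are all assumptions for illustration, not Converscope’s actual implementation.

```python
import hashlib
import secrets


def anonymized_id(name, salt, length=7):
    """Return a short hex ID derived from SHA-1(salt + name).

    Without the secret salt, anyone holding a list of candidate names
    could hash each one and match it against the published IDs.
    """
    digest = hashlib.sha1(salt + name.encode("utf-8")).hexdigest()
    return digest[:length]


# The salt must stay private and must never ship with the public preview.
secret_salt = secrets.token_bytes(16)

# "Alice Example" is a made-up name.
alice_id = anonymized_id("Alice Example", secret_salt)
```

The same name and salt always produce the same ID, so conversation links stay stable between runs as long as the salt is saved somewhere private.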

For TF-IDF, setting maximum document frequency to 0.2 really boosted the quality of the displayed tokens. That means that if a token appears in more than 20% of chats, it is discarded. This helps eliminate common words like “the” and “and.”
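Here is a small standard-library Python sketch of that cutoff. It is a simplified stand-in rather than Converscope’s code: the function name `top_tokens`, the tokenizer regex, the plain `tf * log(N / df)` weighting, and the sample chats are all assumptions. (The phrase “maximum document frequency” matches the `max_df` parameter of scikit-learn’s `TfidfVectorizer`, whose exact weighting differs from this sketch.)

```python
import math
import re
from collections import Counter


def top_tokens(chats, max_df=0.2, k=5):
    """Map each chat name to its k highest-scoring TF-IDF tokens.

    chats: dict of chat name -> list of message strings.
    Tokens that appear in more than max_df of all chats are discarded.
    """
    tokenized = {
        name: re.findall(r"[a-z0-9']+", " ".join(messages).lower())
        for name, messages in chats.items()
    }
    n_chats = len(tokenized)

    # Document frequency: in how many chats does each token appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    result = {}
    for name, tokens in tokenized.items():
        scores = {}
        for token, tf in Counter(tokens).items():
            if doc_freq[token] / n_chats > max_df:
                continue  # too common across chats, e.g. "the"
            scores[token] = tf * math.log(n_chats / doc_freq[token])
        # Highest score first; break ties alphabetically for stable output.
        result[name] = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return result


# Made-up chats: "the" appears in 5 of 5 chats and "and" in 3 of 5,
# so both exceed the 20% cutoff and are dropped.
sample_chats = {
    "a": ["the new iphone and the ipad", "iphone iphone"],
    "b": ["the climb was hard", "climb again"],
    "c": ["the film and the leica"],
    "d": ["the ride and the bike"],
    "e": ["the trip"],
}
tokens_by_chat = top_tokens(sample_chats)
```

With only five sample chats, any token shared by two or more chats is dropped. On a real inbox with hundreds of threads the cutoff is far less aggressive.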

In a similar vein, to ensure good quality for Blast From The Past™, I require that a message be at least 30 characters (otherwise you just get a lot of “lol,” “haha,” and “yeah”).

Wow, you made it this far! Converscope has been a long time in the making and I’m happy to finally put it out there. I encourage you to check out the interactive preview (flip over to “Group Chats” so you don’t see XXX everywhere) and if you’re handy with the command line, to run it on your own data.