Update on the POUND project
I’ve been working on a POUND clone and wanted to share how my plan has evolved.
First off, thanks to Andrew Kelleher, who last week provided a tip about the limitations of browser fingerprinting. Because fingerprints are subject to change, a user may have more than one fingerprint for a given browser. This suggests that they should not be used as globally unique identifiers, although in general browser fingerprints are unique and slight changes can be handled:
Unfortunately, we found that a simple algorithm was able to guess and follow many of these fingerprint changes. If asked about all newly appearing fingerprints in the dataset, the algorithm was able to correctly pick a “progenitor” fingerprint in 99.1% of cases, with a false positive rate of only 0.87%.
I am modifying my original idea in a couple of ways to account for this new information.
The first is to acknowledge that because browser fingerprints will change over time, it may be beneficial to focus on streams/slices of data rather than analyzing the whole corpus statically. Maybe the last day or 12 hours for this site (larger sites will find more meaningful results on shorter timespans). Although I could see the full data being useful for forensics and data science later on, the more immediate benefit seems to be directional insight about how content is spreading in real time.
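To make the slicing idea concrete, here is a minimal sketch of a trailing-window filter. The visit shape (a dict with "id" and "timestamp" keys) and the function name recent_visits are assumptions for illustration, not the project's actual schema.

```python
from datetime import datetime, timedelta, timezone


def recent_visits(visits, now, window=timedelta(hours=12)):
    """Keep only the visits whose timestamp falls inside the trailing window."""
    cutoff = now - window
    return [v for v in visits if cutoff <= v["timestamp"] <= now]


# Hypothetical sample data: one visit inside the window, one outside it.
now = datetime(2016, 1, 2, 12, 0, tzinfo=timezone.utc)
visits = [
    {"id": "a", "timestamp": now - timedelta(hours=1)},
    {"id": "b", "timestamp": now - timedelta(hours=13)},
]

print([v["id"] for v in recent_visits(visits, now)])  # ['a']
```

A busier site could pass a shorter window, such as timedelta(hours=1), without changing anything else.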
The second is that rather than worrying about specific users, I am going to abstract all users into persona subsets based on some TBD metadata like location, device, source, etc. The actual generational share data will still be used, but aggregated into the personas so that newsrooms have an easier time generating narratives about the share behavior. This should also smooth out the browser fingerprint churn.
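Since the persona metadata is still TBD, the following is only a rough sketch of what bucketing could look like. It assumes each visit carries hypothetical "device" and "source" fields and simply joins them into a label; real personas would probably be coarser and hand-picked.

```python
def persona_for(visit):
    """Map a visit to a persona label built from assumed metadata fields.

    The "device" and "source" keys are placeholders for whatever metadata
    ends up defining a persona. Missing values fall back to generic buckets.
    """
    device = visit.get("device", "unknown")
    source = visit.get("source", "direct")
    return f"{device}-{source}"


# Two different browser fingerprints can land in the same persona,
# which is how aggregation smooths over fingerprint churn.
first = {"fingerprint": "abc123", "device": "mobile", "source": "facebook"}
second = {"fingerprint": "def456", "device": "mobile", "source": "facebook"}

print(persona_for(first) == persona_for(second))  # True
```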
The idea is kinda similar to Chartbeat/Parse.ly but I think the geographical and generational aspects would be new to real-time editorial analytics. Google Analytics has a real-time geographical view that drills down to the city level, but I think we could go more precise by enriching our own data with a GeoIP service. Neighborhood-level share data seems within reach.
My original visualization is not scaling too well. You can see what it looks like with a bunch of posts and visits…
I am still contemplating how the visualization should look, but I’m leaning towards something like this d3 block.
Here’s what the data structure might look like for one post, two personas & two neighborhoods.
postTitle.firstPersona, // Bucket for first gen under first persona
postTitle.firstPersona.firstPersona, // Bucket for second gen – first persona to first persona
postTitle.firstPersona.firstPersona.neighborhoodOne, 10 // Number of second gen visits first persona to first persona
postTitle.firstPersona.secondPersona, // Bucket for second gen – first persona to second persona
postTitle.firstPersona.secondPersona.neighborhoodOne, 10 // Number of second gen visits first persona to second persona
postTitle.secondPersona, // Bucket for first gen under second persona
postTitle.secondPersona.firstPersona, // Bucket for second gen – second persona to first persona
postTitle.secondPersona.firstPersona.neighborhoodOne, 10 // Number of second gen visits second persona to first persona
postTitle.secondPersona.secondPersona, // Bucket for second gen – second persona to second persona
postTitle.secondPersona.secondPersona.neighborhoodOne, 10 // Number of second gen visits second persona to second persona
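A hierarchical visualization would likely want these flat dotted paths as a nested tree. Here is a small sketch of that conversion, assuming the rows arrive as (path, count) pairs and that no post title or persona name contains a dot. The function name build_tree is made up for illustration.

```python
def build_tree(rows):
    """Fold (dotted_path, count) rows into nested dicts with counts at the leaves."""
    root = {}
    for path, count in rows:
        parts = path.split(".")
        node = root
        # Walk (or create) every intermediate bucket along the path.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        # Accumulate the count on the final segment (the neighborhood).
        leaf = parts[-1]
        node[leaf] = node.get(leaf, 0) + count
    return root


rows = [
    ("postTitle.firstPersona.firstPersona.neighborhoodOne", 10),
    ("postTitle.firstPersona.secondPersona.neighborhoodOne", 10),
    ("postTitle.secondPersona.firstPersona.neighborhoodOne", 10),
    ("postTitle.secondPersona.secondPersona.neighborhoodOne", 10),
]

tree = build_tree(rows)
print(tree["postTitle"]["firstPersona"]["secondPersona"])  # {'neighborhoodOne': 10}
```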
Next I’d need to figure out how the existing data structure maps into this one. Probably ask DynamoDB for the last x records and then reduce them based on a couple of basic personas. More to come…
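As a rough sketch of that reduce step, assume the records have already been fetched and annotated, so each one is a plain dict with hypothetical "post", "parent_persona", "persona" and "neighborhood" fields. The DynamoDB query itself is left out, and none of these field names come from the existing schema.

```python
from collections import Counter


def reduce_shares(records):
    """Count second-generation visits per post/parent persona/persona/neighborhood path."""
    counts = Counter()
    for record in records:
        path = ".".join([
            record["post"],
            record["parent_persona"],  # persona of the visitor who shared the link
            record["persona"],         # persona of the visitor who followed it
            record["neighborhood"],
        ])
        counts[path] += 1
    return counts


# Hypothetical records standing in for the last x rows from the database.
records = [
    {"post": "postTitle", "parent_persona": "firstPersona",
     "persona": "secondPersona", "neighborhood": "neighborhoodOne"},
    {"post": "postTitle", "parent_persona": "firstPersona",
     "persona": "secondPersona", "neighborhood": "neighborhoodOne"},
    {"post": "postTitle", "parent_persona": "secondPersona",
     "persona": "firstPersona", "neighborhood": "neighborhoodOne"},
]

for path, count in sorted(reduce_shares(records).items()):
    print(path, count)
```

The output rows have the same dotted-path shape as the listing above, so they could feed whatever ends up drawing the visualization.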