Wednesday, September 16, 2015

Digging Dangerous Data | Ashley Madison & the Archaeology of the Now

In July 2015 a large amount of data was stolen from Ashley Madison, an online business dedicated to facilitating extramarital affairs. The hackers who stole the data, calling themselves ‘The Impact Team’, attempted to use it for blackmail, insisting that Ashley Madison and its fellow Avid Life Media site, Established Men, be permanently shut down. On July 22nd, when the company failed to comply, a sample of the data was released publicly. Obviously, negotiations didn’t go as well as might have been hoped and in August the full dataset was made available on the internet. Late in the same month a further data dump was made available that included a number of corporate emails, including a substantial number from CEO Noel Biderman. Since that time, it is believed that a number of enterprising petty criminals have attempted to use the Personally Identifiable Information (PII) to blackmail alleged account holders.

Let me be clear. There are tangles of legal, ethical, and moral arguments in every sentence and clause in the above paragraph. I don’t hope to offer a solution to any of these … not one! If you think that you can deliver an all-encompassing answer or firmly-established response to any aspect of this in a few sentences, you’ve not thought about it enough - you really haven't. In some respects, all these conversations are academic – the data now exists in the public sphere. The initial data dump was made to the so-called ‘Dark Web’ on the Tor network, which can only be accessed using a specialised browser over an encrypted connection routed through a number of proxy services. However, the files quickly migrated to bittorrent, a peer-to-peer transfer protocol. As Alex Hern explains in The Guardian: 'The file is broken up into multiple blocks, which are then shared directly from one downloader’s computer to the next. With no central repository, it is all but impossible to prevent the transfer, although a “magnet” link – a short string of text telling a new downloader how to connect to the “swarm” of files – is still required'. To cut through the technobabble: whether you like it, hate it, are embarrassed by it, or whatever – this is data that is not going to go away. As Wikipedia reports: ‘The parent company Avid Life Media, which owns the site, has offered a reward of C$500,000 (£240,000) for information about the Ashley Madison hackers.’ So what? … Once the cops have chased after and caught ’em … they’re punished, they go to jail … and the data is still there, available to everyone who wants it. Private Bradley Manning (now Chelsea Manning) stole something in the region of 750,000 classified and sensitive documents relating to the US military and diplomatic organisations. In 2013 she was sentenced to 35 years for violations of the Espionage Act and will be eligible for parole in 2021. But the documents, passed on to WikiLeaks, are still freely available.
With regard to both instances, there are questions about the legality and morality of collecting and disseminating the data, but they matter little in terms of the fact that, once released, it can’t be erased or eradicated. There is no getting this genie back in the bottle.

Now that the Ashley Madison dump is available, what are we going to do about it? And by ‘we’ I mean archaeologists and historians. It may seem counterintuitive, but this is pure archaeology and history territory. If you think that historians are only properly employed in dusty archives, searching through government documents, you don’t understand the range of what their skills can be applied to. By the same token, if you think that someone can only be a ‘real’ archaeologist if they’re covered in mud, carefully uncovering a decorated pottery vessel before the bulldozers track in and destroy the lot, you’re equally mistaken. These are both aspects of what we do, but neither comes anywhere near encompassing the totality of what either profession actually does or can achieve. In essence, both are really about using available data to tell stories at varying scales – from the individual, to countries, to continents. More importantly, both are adept at assessing the quality of and bias in data … basically, we were made for this data and this data was made for us! When the archaeologies and histories of the 21st century are written they will have to incorporate WikiLeaks data if they are to fully understand certain aspects of military and diplomatic events and relationships. In the same way that Cabinet papers released under the thirty-year rule can give a profoundly different insight on public actions and statements, these documents will change how we see historical events. While we’re, generally, a bit more prudish when it comes to stuff involving genitals, the Ashley Madison data is going to be just the same – no self-respecting historian of the future is going to be able to talk about early 21st century society without dealing with this data dump. On top of all that, it’s likely that stolen data dumps will only increase in size and frequency over the next decades and, thus, in importance for future researchers.
This feeling is reflected in one of the statements by Avid Life Media, saying that this form of hacking and dumping ‘may now be a new societal reality’. Still not convinced? Let me put it another way … what if we excavated a data source that gave information on some aspects of the sex lives of individual builders of Stonehenge or Newgrange? We’d be all over that like a shot! 

Archaeologists in particular have a love of chronology and location and the Ashley Madison data has both in spades … the only difference is that their relative qualities are inverted. It’s usual on a regular archaeological excavation for the chronology to be a bit imprecise, but the location is well tied down and understood. For example, the pottery sherd came from this particular layer in that portion of the ditch on the site that is at that location in this country ... but the radiocarbon date on the charcoal from the same layer gave a range of 60 years. Seem familiar? Here the situation is that we can’t always be sure if the name attached to the account really relates to that specific person in that specific town (more on this later), but we can tie transaction data down to the second. In every real respect, this is data eminently suited to analysis by archaeologists and historians. The only real difference is the passage of time … and even this isn’t as much of a consideration as it might appear. Not too long ago the general working practice in parts of the UK was to machine off the Medieval remains to get to the more important Roman stuff … then that changed and we only machined off the post-Medieval to get to the Medieval, and now that too has changed. A little over 10 years ago I was working on the excavation of a brickworks that went out of business in 1922. Between the famously divisive ‘Transit Van’ excavation and work like Andrew Reinhard’s Atari video game burial dig, the age of what is considered to be ‘acceptable’ for the interests of archaeology and archaeologists has been brought closer and closer to the present time. All I’m proposing is that we change the scale from a couple of decades to a couple of months.

The data dump
With that in mind, it’s probably appropriate that we examine what the data dump (the ‘site’) actually holds and how it is structured. The 9.7Gb compressed archive (I've not accessed the 20Gb email dump) contains a large amount of the site’s internal corporate data along with the user database. The latter contains users’ profile information, including names, street addresses, and birth dates. There are lists of personal details including data on whether the user smokes or drinks and the type of individual they’re hoping to encounter, along with their preferences in terms of sexual desires and the acts they’re willing, interested, or able to perform (via Alex Hern, The Guardian). The majority of interest has centred on the portion of the database that holds users’ email addresses as it is significantly easier to search and interrogate than other portions of the dump. Much has been made of the fact that addresses in the US and UK military along with a variety of companies and seats of learning have all appeared in the database. As Hern and other commentators point out, as much fun as this is, this is a relatively unreliable means of assessing who had an account. In the first instance, Ashley Madison did not require that an address be validated before it could be associated with an account and thus there are multiple cases of patently false email addresses in the data. The Wikipedia article on the data breach notes that ‘people often create profiles with fake email addresses, and sometimes people who have similar names accidentally confuse their email address, setting up accounts for the wrong email address.’, before going on to note that accounts could have been set up as part of office pranks. As an aside, I’d just note that if you work in a place that considers creating a fake account with an infidelity-facilitating website an appropriately fun endeavour, you work with asshats and need to find another job!
Hern’s piece in The Guardian briefly mentions that a further portion of the database contains details of credit card transactions, but that it does not include sufficient information to steal cash.

The next aspect I’d like to very briefly touch on is the business model used by Ashley Madison. It seems that, unlike many other dating and hook-up sites, Ashley Madison does not require a monthly subscription to keep an account active. Instead, it sets a credit cost for men to open chat sessions and send messages to women. It also charges men to read messages sent by women. There’s a premium rate for men to have a guaranteed affair, and even more money can be spent on sending gifts of animated GIFs, etc. From what I can gather, the only people paying for anything are men.

Colin Gleeson, writing for the Irish Times on July 21st, notes that ‘Figures released by the website in 2011 said there were more than 40,000 Irish members. However, a graphic called “the global infidelity map”, published on the website’s Twitter account a fortnight ago, outlined its per capita membership in countries around the world. It indicated that 2.5 per cent of the Irish population were members, which equates to approximately 115,000 individuals.’ He also notes that this data indicates that Ireland ranked 10th of the 45 countries with access to the Ashley Madison site. Again, more on this later ...

The Credit Card data
So, who’s using this data and to what ends? As you’d expect, other than the attempted blackmailers, the first to pick over the data have been journalists looking for salacious titbits and they’ve certainly found them. Among the most high profile of those outed is the reality TV ‘star’ Josh Duggar (if you’ve never heard of him, be grateful – your life is a better place for it). While I care little for him or his particular views on religion and politics, I’m of the opinion that nothing good can come from shaming people – both the private individuals and the vaguely famous – though that’s not stopped the practice! Annalee Newitz at Gizmodo has been doing some excellent technical work analysing the source code, specifically seeking to understand how ‘bots’ (fake profiles) were used to chat to users and (allegedly) relieve them of cash for the privilege, when they thought they were talking to real women. Data Scientist Zack Gorman published an Ashley Madison Users By Zip Code map on Tableau Public, but only using data for the US (excluding Alaska & Hawaii), though he received some criticism from Reddit users for his actions.

Right now we’re in a situation where the analyses being carried out are exclusively concerned with the US, B-List celebrities, and the prevalence of fake ‘bot’ women. Other than shaming individuals, I’ve got no problem with this … it’s just that while the Irish Times have made some general statements on the topic, there appears to have been little if any attempt to see what the data means for the UK and Ireland generally. It also appears that no one has attempted to use any of the financial data available in the dump to see what light it sheds on our society. As noted above, Alex Hern’s piece in The Guardian mentions ‘a database of credit card transaction information’. I’m not sure if he’s looking at the same data I’m looking at, but what I see is not a database. It’s a collection of 2642 .CSV (comma separated values) files covering the period from March 21st 2008 to June 28th 2015 - by my quick calculation, this indicates that records for only 13 or 14 days are missing (depending on whether both end dates are counted) – pretty dashed comprehensive! Each one appears to represent the totality of the day’s credit/debit card transactions and, I surmise, is a daily download for use in Ashley Madison’s Business Intelligence environment of choice. The reason that they appear to have been ignored until now is that while they are individually easy to work with, their sheer volume means that they take significant effort and energy to sort, search, and collate. As one commentator on Reddit says: ‘to search for a single name in the CC files would be something that would take a lot of time and effort. They are individual files per date, over years, not easily searchable’. But that is exactly what I’ve done.
Even if opening each file, filtering the data for British and Irish material, copying it to a master file, and closing it down again only took 20 seconds, the whole procedure would have taken me a little under two working days … which is another thing to note about those of us with an archaeological background – we’re tenacious! As numerous commentators have noted, using the dump of account details and email addresses is fraught with problems. These include the free accounts set up by the curious who never progressed with their involvement, those who had accounts deliberately or accidentally set up in their names, or even spouses wanting to verify if their partners were already users of the site. On the other hand, I would argue that if you’ve actually gone so far as to spend money on the services offered, chances are that you’re pretty committed to the notion of infidelity. It is my opinion that – so long as these files are genuine – this is the most accurate and honest way of assessing involvement and participation in the site … and we’ve every reason to believe that these are genuine, including statements by Ashley Madison themselves.
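
The two back-of-envelope figures above (how many days are missing, and how long the manual collation would take) are easy to check. The following is only a sketch using the numbers quoted in the text; the variable names are mine, and the 20 seconds per file is the author's own hypothetical.

```python
from datetime import date

FILE_COUNT = 2642               # number of .CSV files quoted in the text
FIRST_DAY = date(2008, 3, 21)   # earliest file date quoted in the text
LAST_DAY = date(2015, 6, 28)    # latest file date quoted in the text
SECONDS_PER_FILE = 20           # hypothetical handling time per file, from the text

# Calendar days in the period, counting both the first and the last day.
days_in_period = (LAST_DAY - FIRST_DAY).days + 1
missing_days = days_in_period - FILE_COUNT

# Manual effort if every file takes 20 seconds to open, filter, copy, and close.
total_hours = FILE_COUNT * SECONDS_PER_FILE / 3600

print(days_in_period, missing_days, round(total_hours, 1))  # 2656 14 14.7
```

Counting both end dates gives 14 missing days; taking the plain difference between the two dates gives 13. Either way the coverage is better than 99%, and roughly 14.7 hours does indeed sit a little under two working days.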

Credit Card Data Structure
The first thing to explain is how these files are structured. The 2642 .CSV files range in size from 15Kb to 3.4Mb. While there’s variation, and it’s far from being a continuous and unbroken upwards progression, it really does show how the company expanded and grew during the period under review. Early files have only a couple of thousand lines of data and are confined to the company’s core US and Canadian markets. At the other end of the range, the data frequently exceeded 10,000 to 12,000 data rows, representing transactions from most (if not all) of the 53 countries where the company was active. Even if we only graphed the file sizes of each .CSV document we’d have an eloquent testimony of how the company grew and expanded over this period. Whatever the size of the documents, the internal format and layout was always the same. Going by column they were:

Column A: Account Number
This is the account number into which the money is being paid. There are 41 different account numbers, each from eight to 10 digits in length.

Column B: Account Name
Again – pretty simple. This is the name associated with the account number (above). Ashley Madison and parent company Avid Life Media (ALM) don’t just have one website and one service. Apart from the main AM site, the same company also runs a number of other sites. In the British and Irish data there are 41 Account Numbers associated with 26 different Account Names. It appears that each Account Name can have multiple Accounts. The Account Names include variations on ADL Media; AMDA; Ashley Madison; Avid Dating Life; CL Media; EM Media; and Swappernet. AMDA is probably not the American Musical and Dramatic Academy, but CL Media would appear to refer to CougarLife, a site for younger men to meet older women and EM Media is almost certainly EstablishedMen, the site for helping younger women meet older men. Swappernet is just what you think it is … if you have no idea what it is I can only reiterate Bartleby’s address to the female board member in Dogma ‘You, on the other hand, are an innocent. You lead a good life. Good for you.’

Column C: AMOUNT
This is the amount of money paid from a credit/debit card to Ashley Madison. There is no indication as to the currency that this is calculated in, but as ALM is a Canadian company, I’m presuming that it’s in Canadian Dollars (C$).

Column D: Authorisation Code
The Authorization Code is issued by the credit card holder’s bank to indicate that the charge has been approved. These codes are usually six or seven digits in length and can be either alphanumeric or plain numeric. Crucially, these codes are only issued when the charge has been approved. Thus, we can be certain that the amounts listed as charged were approved and were paid to Ashley Madison. The only exception to this is when the merchant takes payment without the authorisation code. If an authorization code is not issued, the merchant can receive a "no authorization" chargeback. In a chargeback, any payment the merchant receives is reversed by the card issuer (via eHow).

Column E: AVS
This is the Address Verification System. This is a system used to verify the address of a person claiming to own a credit card. The system will check the billing address of the credit card provided by the user with the address on file at the credit card company (via Wikipedia). From what I can see, this field will be populated if the Authorisation Code (Column D) is populated, but will be blank if the Error Code (Column P) is populated.

Column F: BRAND
The Brand refers to the type of card that was used to make the payment. Each credit/debit card company is represented by an abbreviation. For example, VI represents Visa, while MC (unsurprisingly) stands for MasterCard.

Column G: Card Number (last four digits)
These are the last four digits of the member’s credit card number. I have not conducted any analysis on these and I do not intend to discuss them further.

Column H: CVD
The CVD is the Card Verification Data – the three digits on the back of your card. It is used in ‘card not present’ transactions and was instituted to assist in the fight against card crime. No actual CVD numbers are present here, merely the list of returned transaction codes. Visa and MasterCard, for example, use M for ‘Match’, N for ‘No Match’ and Y for ‘Not Applicable’. American Express, apparently wanting to be a bit different, went with M for ‘Not Applicable’ and Y for ‘Match’ (via Chase). I have not conducted any analysis on these and I do not intend to discuss them further.

Column I: First Name
This one should be easy. It should contain the first name of the payer. SHOULD … but doesn’t always. True, there are a number of cases where this field is populated by a recognisable first name, but these are the minority. Instead, this column is mostly filled with five to seven digit numbers. In the absence of a better explanation, I’m presuming that this is actually the user’s membership number. I’ve used the ‘first’ and ‘last’ names in the following analyses to ensure that, when I’m examining data at the individual level, it is consistent and actually relates to the same individual/account.

Column J: Last Name
This column sometimes contains a surname when the first name is in the previous column. Sometimes it’s blank (especially in the earlier data). More often than not it contains the full name or pseudonym of the user. Like the data in the First Name field (Column I) I’ve used it as a visual reference to ensure that individual-level data is correct, and I've created a concatenated field of First and Last Names to create unique account identifiers. These have been used to clearly separate out user data, but at no point are names or pseudonyms referred to in the text.
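
As an illustration of the concatenated identifier, here is a minimal sketch of one way it could be built and used to count distinct accounts. The separator, the trimming, and the case folding are my assumptions rather than anything described above, and the sample values are invented placeholders, not data from the dump.

```python
def account_key(first_name, last_name):
    """Join the First Name and Last Name cells into a single comparison key."""
    # Assumption: trim whitespace and ignore case so trivial variations still match.
    return f"{first_name.strip().lower()}|{last_name.strip().lower()}"


def count_unique_accounts(rows):
    """Count distinct keys across (first_name, last_name) pairs."""
    return len({account_key(first, last) for first, last in rows})


# Invented placeholder rows: a membership-style number plus a pseudonym.
sample_rows = [
    ("1234567", "Example Pseudonym"),
    ("1234567", "example pseudonym "),   # same account, different spacing and case
    ("7654321", ""),                     # blank last name, as in the earlier data
]
print(count_unique_accounts(sample_rows))  # 2
```

Using a separator character avoids two different name pairs collapsing into the same string when they are simply glued together.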

Column K: Merchant Transaction ID
The Merchant Transaction ID is a seven or eight digit code (I think) produced by the merchant (Ashley Madison in this instance) to identify each purchase. In a minority of cases this is populated with an alphanumeric, frequently including the words ‘Premium’, ‘Priority’, ‘Refund’ … sometimes in combination. I have not conducted any analysis on these and I do not intend to discuss them further.

Column L
This column is blank for all of the British and Irish data. I have not conducted any analysis on these and I do not intend to discuss them further … obviously …

Column M: DATE
The date is given in the format: Day/Month/Year Hour:Minute:Second (e.g. 28/03/2008 00:51:22). No archaeologist has ever worked with better chronological data than this! The only issue I have not been able to work out here is which time zone this is tied to. As Ashley Madison have their corporate headquarters in Toronto, it seems plausible that their systems would use local time, i.e. Eastern Time (UTC-5:00, or UTC-4:00 during daylight saving time), but I can’t be certain.
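
For anyone working with these timestamps programmatically, the day-first format matters: a parser that assumes US month-first order will misread or reject many values. A minimal sketch using Python's standard library follows. The function name is mine, and the result is deliberately left without a time zone attached, since the zone cannot be established from the data.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y %H:%M:%S"   # Day/Month/Year Hour:Minute:Second


def parse_transaction_date(text):
    """Parse a DATE cell into a naive datetime (no time zone is assumed)."""
    return datetime.strptime(text.strip(), DATE_FORMAT)


stamp = parse_transaction_date("28/03/2008 00:51:22")
print(stamp.isoformat())  # 2008-03-28T00:51:22
```

Once parsed, the values sort chronologically and can be grouped by year, quarter, or month without any further string handling.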

Column N: TXN ID
The Transaction ID is, essentially the ‘proof of purchase’ for the user. In the current data set they are usually nine to 10 digit numbers. In a minority of cases the ID is a 36 character alphanumeric and appears to correlate with an alphanumeric Merchant Transaction ID. I have not conducted any analysis on these and I do not intend to discuss them further.

Column O: CONF. NO.
I presume that this stands for Confirmation Number and is a seven to 10 digit number. From a quick check of the data, it appears that this is usually identical to the data in the TXN ID Column (Column N). Within the current data set, only 150 entries do not show a match between TXN ID and Conf No and these appear to correlate to instances where the Merchant Transaction ID is an alphanumeric. I have not conducted any analysis on these and I do not intend to discuss them further.

Column P: Error Code
In the vast majority of cases this field is blank as the transaction processed without any error. However, at the bottom of each file there is a portion where uncompleted, failed, or stalled transactions are collated. While a transaction in this section will frequently have most of the rest of the data in place, it is unlikely to have an Authorisation Code (Column D) and an AVS code (Column E). Some may not have a pass flag for the CVD check (Column H). A collaborator on this project informs me that these codes are unique to individual merchants and may relate to various reasons why the transaction failed or was declined. For instance, the AVS check examines the address associated with the card and the user. Ostensibly, this is to prevent fraud, but based on some of my visual examinations of the data, failures may occasionally be due to incorrect home addresses being entered into the system. This may be due to genuine errors or may stem from attempts by users to conceal their real world location and failing. Other reasons for transactions to error out may be fraud blocks put in place by the card holder’s bank or lack of funds/credit.

Column Q: Authorisation Type
The options for what I presume is the Authorisation Type are either ‘Final’ or ‘Undefined’. I have not carried out any further analysis of this data.

Column R: TYPE
The transaction type may be one of five: Authorisation, Settlement, Chargeback, Credits, or Purchases. The most important thing to note here is that most of the data is composed of paired lines. By this I mean that there is one line representing the Authorisation to take the money and a separate line indicating the Settlement. A visual examination of the data indicates that all the other fields (e.g. Account, Amount, Names, Merchant Transaction ID etc.) will be identical, except that one line is flagged as Authorisation and the other is Settlement. Within the current data set there are 27,200 Authorisations and 25,590 Settlements. My presumption is that not all of the Authorisations progressed to Settlements and, thus, when talking about the amount of money spent, I’ll be omitting the Authorisation data. Purchases are a bit odd … there are 10,593 lines flagged as purchases, but all are associated with error codes of one kind or another. I freely admit that I have no idea why this is, and I’ve excluded them from any further discussion of the amounts of money paid. In almost all cases Credits are associated with narrative entries in the Merchant Transaction ID that indicate that they are refunds; some of these are specifically indicated to be the result of fraud. As noted previously, Chargebacks are where Ashley Madison – for whatever reason – incorrectly took money from an account and later had to pay it back.
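
The practical consequence of the paired lines is that summing every row would double count the money. A minimal sketch of the rule described above (count Settlement rows only) is shown here. The dictionary keys reuse the column titles TYPE and AMOUNT, but the row layout and the sample values are invented for illustration.

```python
from decimal import Decimal


def settled_total(rows):
    """Sum AMOUNT over rows flagged as Settlement, ignoring every other TYPE."""
    total = Decimal("0")
    for row in rows:
        if row["TYPE"].strip().lower() == "settlement":
            total += Decimal(row["AMOUNT"])
    return total


# Invented rows: one Authorisation/Settlement pair, one unsettled Authorisation,
# and one Purchases line of the kind that always carries an error code.
sample_rows = [
    {"TYPE": "Authorisation", "AMOUNT": "49.00"},
    {"TYPE": "Settlement", "AMOUNT": "49.00"},
    {"TYPE": "Authorisation", "AMOUNT": "79.00"},
    {"TYPE": "Purchases", "AMOUNT": "19.00"},
]
print(settled_total(sample_rows))  # 49.00
```

Decimal is used rather than floating point so that currency amounts add up to the cent.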

Column S: TXT_CITY
In every .CSV file Column S is called ‘TXT_CITY%2CTXT_COUNTRY%2CTXT_EMAIL%2CTXT_PHONE%2CTXT_STATE%2CTXT_ADDR1%2CTXT_ADDR2%2CZIP%2CCONSUMER_IP’, but it is clear that the ‘%2C’s (the URL-encoded form of a comma) are intended to break up the titles for multiple columns. As the title indicates, this is the member’s city of residence.
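
Since ‘%2C’ is simply a percent-encoded comma, the fused title can be unpicked mechanically. The sketch below decodes it and pairs each recovered title with its spreadsheet column letter, assuming (as the layout described here suggests) that the first title belongs to Column S. The helper names are mine.

```python
from urllib.parse import unquote

RAW_HEADER = (
    "TXT_CITY%2CTXT_COUNTRY%2CTXT_EMAIL%2CTXT_PHONE%2CTXT_STATE"
    "%2CTXT_ADDR1%2CTXT_ADDR2%2CZIP%2CCONSUMER_IP"
)
FIRST_INDEX = 18  # assumption: Column S is the 19th column, i.e. index 18 counting from 0


def column_letter(index):
    """Convert a 0-based column index to a spreadsheet letter (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


titles = unquote(RAW_HEADER).split(",")
columns = {column_letter(FIRST_INDEX + i): title for i, title in enumerate(titles)}

for letter, title in columns.items():
    print(f"Column {letter}: {title}")
```

This yields nine titles running from Column S (TXT_CITY) to Column AA (CONSUMER_IP), which matches the column-by-column walk through the data here.
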

Lines of data for UK & Ireland

Column T: TXT_COUNTRY
For all of the Ashley Madison data, this is given as a two letter identifier code for the country. Ireland is indicated by IE and the United Kingdom (England, Wales, Scotland, & Northern Ireland) is given as GB. This has been my primary field for data selection. I’ve filtered the original .CSV files to only show IE and GB data, then selected and copied the data to a separate file. I’ve removed any data lines that are obviously errors. For example, one recurring purchaser always appeared to identify his country as IE, but the city was in the US. As his email address referenced the American Football team associated with said city, I felt justified in removing him from the data set. However, there may be more included that should not be there. The other side of this is that British and Irish patrons that deliberately or accidentally identified as coming from another country have been overlooked and are not included in this analysis. While this is regrettable, it is my opinion that they make up a relatively insignificant portion of the overall group.
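
The selection step described above was done by hand in a spreadsheet, but the same filter can be sketched in a few lines. This is an illustration under stated assumptions, not the method actually used: the function name is mine, the country code is assumed to sit in the 20th column (Column T), and the sample is an invented two-column stand-in.

```python
import csv
import io

WANTED_COUNTRIES = {"IE", "GB"}
COUNTRY_INDEX = 19  # assumption: Column T is the 20th column, index 19 counting from 0


def select_rows(lines, country_index=COUNTRY_INDEX):
    """Return the data rows whose country cell is IE or GB; the header row is skipped."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    selected = []
    for row in reader:
        if len(row) > country_index and row[country_index].strip().upper() in WANTED_COUNTRIES:
            selected.append(row)
    return selected


# Invented two-column sample (city, country) purely to exercise the function.
sample = io.StringIO("TXT_CITY,TXT_COUNTRY\nDublin,IE\nToronto,CA\nLeeds,GB\n")
print(select_rows(sample, country_index=1))  # [['Dublin', 'IE'], ['Leeds', 'GB']]
```

Because `select_rows` accepts any iterable of lines, an open file handle (opened with `newline=""`) can be passed in directly, and the results from each daily file appended to one master list. The manual weeding of mis-assigned entries, such as the US city example, would still need a human eye.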

Column U: TXT_EMAIL
This field contains the email address of the member. While I’ve not used this data in these analyses, it is clear that this column contains a wide variety of obviously fake email addresses along with many that, at least, appear to be genuine. Much has been made of the various email addresses of government and university sector workers that appear to have signed up, but as Ashley Madison did not appear to require email verification it is best to treat all addresses here with some degree of suspicion.

Column V: TXT_PHONE
The phone number of the member. As there are only two numbers used for all of the British and Irish data: 12121212 and 111222333, I think it’s safe enough to regard them as fake. They have not been used in these analyses.

Column W: TXT_STATE
Presumably intended mostly for US customers, this field in the British and Irish data contains a bewildering array of two and three letter acronyms, full and abbreviated county names, country names, and even a few entries composed just of numbers. I’ve not examined this data in any depth.

Column X: TXT_ADDR1 and Column Y: TXT_ADDR2
These contain the addresses of the site’s clients. Again, there’s a variety of data captured here, and not all of it can be accepted at face value. The data includes numeric entries (1-6 digits), gibberish alphabetic and alphanumeric strings, and obviously or probably fake addresses, though the vast majority are ostensibly real, plausible addresses. Indeed, the combined geographical and personal data is frequently coherent and sufficient to identify individuals with a reasonable degree of certainty. I’ve only used this data to attempt to ‘weed out’ occasional entries that have been assigned the wrong country code.

UK & Ireland valid transactions

Column Z: ZIP
The data here includes a variety of obviously faked alphabetic, numeric, and alphanumeric entries. The Irish data includes various versions of Dublin postcodes, but also town and county names and country designations. Interestingly, there are a small number of postcodes beginning with BT, indicating a Northern Irish origin. I would note here that it’s bizarrely saddening to see that sectarian divides are still prevalent, even on a site devoted to infidelities. I’ve seen a number of entries for Londonderry that are in GB, and a few for Derry in IE, and I've considered each to be of the country they claim and have not altered the data. With regard to the UK, the data does appear to be dominated by valid (or valid looking) postcodes. I’ve entered a few into Google Maps and most return real-world locations, though that is no guarantee that the transactions refer to these exact places. Beyond these few manual checks, I have not used this data in these analyses.

Column AA: CONSUMER_IP
My guess is that this data was collected automatically by the Ashley Madison system, where available. IP addresses feature relatively frequently in the data, but I’ve made no use of them.

Preliminary Data Analyses for the Republic of Ireland and the UK
Having copied all the relevant data I can find into a separate spreadsheet (and attempting to remove incorrect entries where I can spot them), I’m left with 63,918 rows of data. Of these, 6,075 relate to Irish transactions, while 57,843 are associated with UK accounts. But these numbers only refer to account activity and payment transactions. What we really need is to get an idea as to how many active accounts there actually are. Colin Gleeson’s piece in the Irish Times gives figures of 40,000 and 115,000. I’m not saying that there are not 40,000 Irish people who’ve signed up to a free account out of curiosity or genuine intent. What I can tell you is that there are 1,251 accounts that identify as Irish (including a number from Northern Ireland) that paid actual money to Ashley Madison. There are also 2,501 accounts of UK origin (also including a number from Northern Ireland). Obviously, there is a question of where infidelity begins … is it in the thought or in the deed? If it is when you hand over money to Ashley Madison, that’s an awful lot fewer than I might have expected. For anyone reading this thinking ‘there are twice as many British accounts as Irish ones’ the sobering thought is that there are huge differences in the relative populations of the two countries! Just based on the latest figures for population for the Republic and the UK (I’m using the entire population as I couldn’t find any figures for the adult segment alone) 0.004% of the UK have paid money to Ashley Madison, as opposed to 0.0273% of the Irish population.
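
The per-capita comparison can be reproduced with a couple of lines of arithmetic. The population figures are not given in the text, so the ones below are my assumptions (roughly 4.59 million for the Republic of Ireland and 64.1 million for the UK, in line with official figures available in 2015); the account counts are those quoted above.

```python
IRISH_ACCOUNTS = 1251
UK_ACCOUNTS = 2501
IRISH_POPULATION = 4_590_000   # assumption: approximate population of the Republic
UK_POPULATION = 64_100_000     # assumption: approximate UK population

irish_rate = IRISH_ACCOUNTS / IRISH_POPULATION * 100   # paying accounts, % of population
uk_rate = UK_ACCOUNTS / UK_POPULATION * 100
ratio = irish_rate / uk_rate

print(f"Ireland: {irish_rate:.4f}%  UK: {uk_rate:.4f}%  ratio: {ratio:.1f}")
# Ireland: 0.0273%  UK: 0.0039%  ratio: 7.0
```

On these assumed populations the Irish rate of paying accounts is about seven times the UK rate, even though the UK has twice as many accounts in absolute terms.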

UK & Ireland Revenue (C$)
And just what has been paid? In the timescale under review it appears that there were 25,590 transactions where money was paid to Ashley Madison and this amounted to C$2,400,245.80. No matter how you look at it, that’s a lot of money! This can be broken down as follows: C$2,151,348.91 from 23,237 transactions in the UK and C$248,896.89 from 2,353 Irish transactions. The average spend per UK transaction was C$92.58, while it was C$105.78 for their Irish counterparts.
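
As a consistency check, the totals and averages can be recomputed from the quoted figures. This is only a sketch with names of my own choosing; it uses Decimal so that the currency amounts round to the cent in the conventional way.

```python
from decimal import Decimal, ROUND_HALF_UP

UK_TOTAL, UK_COUNT = Decimal("2151348.91"), 23237
IRISH_TOTAL, IRISH_COUNT = Decimal("248896.89"), 2353


def average_spend(total, count):
    """Average amount per transaction, rounded to the nearest cent."""
    return (total / count).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


combined_total = UK_TOTAL + IRISH_TOTAL
combined_count = UK_COUNT + IRISH_COUNT

print(combined_total, combined_count)           # 2400245.80 25590
print(average_spend(UK_TOTAL, UK_COUNT))        # 92.58
print(average_spend(IRISH_TOTAL, IRISH_COUNT))  # 105.78
```

The two national figures add up exactly to the overall total and transaction count, and the Irish average comes out at C$105.78 to the nearest cent.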

Next question: Who’s getting paid? Well, Ashley Madison, right? While everything, one presumes, eventually goes back to the parent company, Avid Life Media, it goes via a variety of sub-companies as recorded in the Account Name (Column B). There are four account names that are variations on the name ‘ADL Media’ that brought in C$220,763.42 (C$198,343.73 from the UK and C$22,419.69 from Ireland). Avid Dating Life Inc brought in C$33,705.01 (C$2,327.69 from Ireland and C$31,377.31 from the UK). Two different accounts associated with the name ‘AMDA’ (still probably not the American Musical and Dramatic Academy) received C$39,877.00, all of it from the UK. There are twelve separate account names based on ‘Ashley Madison’ that received C$1,864,889.24 from UK accounts and C$224,149.51 from Irish accounts (Total: C$2,089,038.75). While I’m not at all clear on the specific services provided by each of these corporate divisions, it’s pretty safe to assume that the ‘Ashley Madison’ account names relate to its core ‘have an affair’ business. We would appear to be on firmer ground in assuming that CL Media refers to CougarLife. This account name received two payments (both from the UK) totalling C$238. However, both have associated error codes, indicating that the transactions were not completed. EM Media (EstablishedMen?) did rather nicely, thank you very much, taking in C$16,841.67 from 214 transactions from 150 unique accounts, all from the UK. Poor old Swappernet brought in only C$19.95 from a single UK transaction in 2013, which may go some way to explaining why it appears to have been shut down.
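
The per-account figures can be reconciled against the national totals quoted earlier. The sketch below simply re-adds the amounts given in this paragraph; CL Media is left out because both of its payments carry error codes. The structure and names are mine.

```python
from decimal import Decimal

# (UK amount, Irish amount) in C$, as quoted in the paragraph above.
ACCOUNT_REVENUE = {
    "ADL Media": (Decimal("198343.73"), Decimal("22419.69")),
    "Avid Dating Life": (Decimal("31377.31"), Decimal("2327.69")),
    "AMDA": (Decimal("39877.00"), Decimal("0")),
    "Ashley Madison": (Decimal("1864889.24"), Decimal("224149.51")),
    "EM Media": (Decimal("16841.67"), Decimal("0")),
    "Swappernet": (Decimal("19.95"), Decimal("0")),
}

uk_sum = sum(uk for uk, _ in ACCOUNT_REVENUE.values())
irish_sum = sum(irish for _, irish in ACCOUNT_REVENUE.values())

print(uk_sum, irish_sum, uk_sum + irish_sum)
# 2151348.90 248896.89 2400245.79
```

The Irish subtotals reproduce the Irish total of C$248,896.89 exactly. The UK subtotals come to one cent less than the quoted C$2,151,348.91, and the same cent separates the two Avid Dating Life components from their stated sum of C$33,705.01, which looks like a rounding artefact in one of the quoted figures rather than a missing transaction.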

Account Names paid by UK & Irish members (Values in C$)
Just to round out the preliminary examination of the financial data, it is interesting to look at the cards used to pay for Ashley Madison’s services. For both Ireland and the UK, the leading brands are, in order, Visa (VI), MasterCard (MC), and (probably) American Express (AM).

Credit Card brands used in valid transactions. Top: overview. Bottom Left: UK. Bottom Right: RoI
Adding in the time dimension allows us to gain a number of different insights. For example, looking only at the yearly revenues shows a story of a company doing relatively well from 2008 to 2012, making a marked improvement in 2013, but simply accelerating beyond expectations in 2014 and 2015 – even more so when one considers that the 2015 data only goes up to June! Even when broken out into UK and Ireland data, a pretty similar story emerges for both. However, breaking it down by financial quarter unveils a different narrative. Now it’s clear that the major uptick in overall revenues only begins in Q4 of 2013. While there is still a vast hike in revenue, it is clear that there was a major hiccup in Q4 2014. There must have been a significant rethink of policy and direction at that point, as the Q1 2015 figures were the best ever achieved. As not all of the Q2 2015 data is available, there is a marked downturn in the latest figures. In this instance, breaking it down again by country tells quite different stories. The UK figures largely mirror the overall picture – natural as they make up the lion’s share of the data. Like the parent data, there’s the first major surge in Q4 2013 and the Q4 2014 plunge. However, the UK data shows a continued increase in the latest figures that is out of step with the overall picture. This can be explained with reference to the Irish data where Q2 2015 has shown an unprecedented slump. Other differences in the Irish data include a much more marked increase in Q4 2013, followed by an immediate collapse in the following quarter. Increasing the level of resolution to monthly shows a much more fraught series of peaks and troughs of surging and collapsing revenue streams. While the UK data shows a peak in revenues in April 2015, it seems that the Irish data climaxed in January 2015 and, despite repeated attempts at revival (most notably in March and May 2015), was becoming increasingly flaccid.
This level of resolution can be increased to weekly, where the data is reduced to staccato lunges, thrusts, and general throbbing. At this resolution, it is clear that the Irish market had been in serious decline for some time before the data breach. This increase in resolution can be charted down to the minute and second, but the data becomes remarkably difficult to visualise clearly, and my already-unravelling ability to refrain from genital-based imagery goes into even steeper decline.
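
The charts here were made in Tableau, but the idea of re-bucketing the same transactions at coarser or finer resolutions is easy to sketch in plain Python. The rows below are invented placeholders (no real transactions), and the names `RESOLUTIONS` and `revenue_by` are illustrative only.

```python
from collections import defaultdict
from datetime import datetime
from decimal import Decimal

# Hypothetical (timestamp, amount) pairs standing in for settled transactions.
transactions = [
    (datetime(2013, 11, 2, 21, 15), Decimal("49.00")),
    (datetime(2013, 12, 29, 8, 5), Decimal("19.00")),
    (datetime(2014, 1, 11, 13, 40), Decimal("79.00")),
    (datetime(2014, 10, 20, 23, 1), Decimal("249.00")),
    (datetime(2015, 1, 22, 7, 30), Decimal("149.00")),
]

# One key function per resolution discussed above.
RESOLUTIONS = {
    "year": lambda ts: (ts.year,),
    "quarter": lambda ts: (ts.year, (ts.month - 1) // 3 + 1),
    "month": lambda ts: (ts.year, ts.month),
    "week": lambda ts: tuple(ts.isocalendar())[:2],  # ISO year and week number
}

def revenue_by(resolution, rows):
    """Sum amounts into buckets at the requested time resolution."""
    key = RESOLUTIONS[resolution]
    buckets = defaultdict(Decimal)  # Decimal() starts each bucket at zero
    for ts, amount in rows:
        buckets[key(ts)] += amount
    return dict(sorted(buckets.items()))

for name in RESOLUTIONS:
    print(name, revenue_by(name, transactions))
```

The same rows tell a smoother or spikier story depending only on which key function is used, which is the effect described above.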

Annual Revenues. Top: overview. Bottom Left: UK. Bottom Right: RoI
Annual Revenues by Quarter. Top: overview. Bottom Left: UK. Bottom Right: RoI
Annual Revenues by Month. Top: overview. Bottom Left: UK. Bottom Right: RoI
Annual Revenues by Week. Top: overview. Bottom Left: UK. Bottom Right: RoI
If we remove the 2015 data (as it does not include second half results) it is clear that across the years, business got better by the quarter from slow starts in Q1 (Jan, Feb, & Mar) to the best results in Q4 (Oct, Nov, & Dec). I’m not going to speculate on why that should be, but both the UK and Irish data show the same results, if with slightly different emphases. Again excluding the 2015 data, it appears that for the UK there was a visible rise in expenditure (presumably correlated with a rise in actual infidelity) in May, August, and September. Basically, as the year went on there was an increase in payments made to Ashley Madison. The Irish, being different, also show a general increase towards year end, but the peak months are July and November … I’m not even going to try and explain why …
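
A seasonal view like this collapses every year onto a single calendar, so an incomplete year has to be dropped first or it will under-weight the later quarters. A minimal sketch of that step, using invented rows and the hypothetical names `PARTIAL_YEARS` and `seasonal_profile`:

```python
from collections import Counter
from datetime import date

# Hypothetical (date, amount in cents) rows; real values are not reproduced here.
rows = [
    (date(2013, 2, 14), 4900),
    (date(2013, 11, 29), 7900),
    (date(2014, 5, 1), 1900),
    (date(2014, 12, 27), 24900),
    (date(2015, 1, 11), 14900),  # falls in the partial year
]

# Years without a full twelve months of data.
PARTIAL_YEARS = frozenset({2015})

def seasonal_profile(rows, key, skip_years=PARTIAL_YEARS):
    """Sum amounts by a recurring calendar position, ignoring partial years."""
    profile = Counter()
    for day, cents in rows:
        if day.year in skip_years:
            continue
        profile[key(day)] += cents
    return dict(sorted(profile.items()))

# Quarter of the year (1-4) and month of the year (1-12), pooled across years.
by_quarter = seasonal_profile(rows, lambda d: (d.month - 1) // 3 + 1)
by_month = seasonal_profile(rows, lambda d: d.month)

print(by_quarter)
print(by_month)
```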

Annual Revenues by Quarter. Top: overview. Bottom Left: UK. Bottom Right: RoI
Annual Revenues by Month. Top: overview. Bottom Left: UK. Bottom Right: RoI
We’re on a roll! Looking at when in the month is the most lucrative for Ashley Madison, we can clearly see that the UK picture (including the 2015 data) shows peaks on the 2nd, 11th, 20th, 22nd, and 27th, though the trend is towards falling revenue across the whole month. The Irish data, again apparently loath to follow their neighbours, shows a series of small peaks of similar size on the 2nd, 4th, 13th, 18th, and 21st before taking a break in preparation for what can only be described as an earth-shattering climax on the 29th. Overall, the trend is towards increased financial activity across the month. There are even differences between the two populations in terms of when they commit their greatest expenditures. The UK data indicates a preference for Thursdays and Fridays, while the Irish data shows distinct preferences for Wednesdays and Saturdays. Putting it all together, if you're from the UK, you're more likely to be spending money at Ashley Madison on Thursday August 11th, while Irish users are more likely to start laying out the cash on Saturday November 29th. The chronology of this data is so fine that discussions can be formed at the second-level … this is chronology like no archaeologist has ever dealt with before … and it’s wonderful …

Annual Revenues by Day of Month. Top: overview. Bottom Left: UK. Bottom Right: RoI
(owing to a stupid labeling error, the chart says 'ex 2015' but does include 2015 data)
Annual Revenues by Day of Week. Top: overview. Bottom Left: UK. Bottom Right: RoI
Perhaps less wonderful is the locational data. As I said above, archaeologists are familiar with (usually) tightly controlled locational data, be it at the site or context level. We’re not so good with the idea that a site might have been in Scotland, but was actually in Killarney. And that’s kinda’ what we’ve got here. I’ve used the Country and City fields to attempt to map locations. The biggest issue here is that Ashley Madison, in an attempt to shield their members’ privacy, did not require email verification. They also did not appear to enforce any form of robust location control or checking. Thus, anyone wishing to hide their real world locations could do so without problem. That is why there are a number of data entries that claim to be from Ireland, but have, say, the city given as New York and include a New York address. These I’ve attempted to weed out. Less easy to spot and eradicate are instances where someone claimed to be in Ireland in the Country field and then gave a valid Irish address that may not have been their own. There are certainly plenty of real addresses in this data, but whether any of them are associated with the people who live at those addresses is quite another matter. The other issue is one of my own making and/or laziness … Tableau is good at spotting real world places and generating mappable Latitudes and Longitudes for them … but not perfect. Thus, there are over 3000 City codes that it has been unable to place. With more time and patience than I possess, it would have been possible to interrogate each one and manually add co-ordinates. Thus, the map data is heavily skewed towards larger, more recognisable urban areas. There are 173 unique values in the City field for Ireland and another 819 for the UK. I’m not going to discuss the UK data here, as I’m much less familiar with British place names, however, in the Irish data a number of things are visible.
Firstly, this field can be as specific as an individual Townland or it can be as general as the County name. This is compounded by a number of Irish Counties that have the same names as their major urban centres (e.g. Galway, Limerick, and Dublin). While I could have attempted to augment the data with postcode and address-level indicators, I felt that it was way too much trouble and potentially too revealing of where active Ashley Madison account holders really were. While it’s deeply flawed and heavily biased towards the larger urban centres, the data is still worth examining at this level. The map (using only Settlement data) shows a clear preference for the south-east of England, the Midlands, southern Wales, and the north-east, along with the central belt of Scotland. In terms of the Island of Ireland, there is a clear preference for Dublin, Meath, the east coast, and eastern Ulster. If the marks are resized by the amount of expenditure (i.e. again Settlement data only), it is clear that in Ireland, Dublin leads the way. In the UK there are notable ‘hot spots’ in Glasgow, Middlesbrough, Salisbury, and Poole, but all are dwarfed by the activities of London. As I say, this locational aspect of the data deserves much more work and adjustment, but I’m content to leave this for further students and researchers.
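
One way the Country/City mismatches described above could be flagged automatically is with a small gazetteer lookup. To be clear, this is a hedged sketch and not the process actually used for the maps here: the `GAZETTEER` contents, the two-letter country codes, and the function `classify_location` are all assumptions made for illustration.

```python
# Hypothetical gazetteer: lower-cased place names accepted for each country code.
GAZETTEER = {
    "IE": {"dublin", "cork", "galway", "limerick", "killarney"},
    "GB": {"london", "glasgow", "middlesbrough", "salisbury", "poole"},
    "US": {"new york"},
}

def classify_location(country: str, city: str) -> str:
    """Return 'match', 'mismatch', or 'unknown' for a Country/City pair."""
    name = city.strip().lower()
    if name in GAZETTEER.get(country, set()):
        return "match"
    # A city known only under a different country suggests a disguised location.
    if any(name in places for code, places in GAZETTEER.items() if code != country):
        return "mismatch"
    # Not found anywhere: too small, misspelt, or simply not in the gazetteer.
    return "unknown"

print(classify_location("IE", "New York"))
print(classify_location("IE", " Dublin "))
print(classify_location("IE", "Some Townland"))
```

A lookup like this only catches the obvious cases. It cannot tell a County from its namesake city, nor a real Irish address borrowed by someone else.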

All Tableau-mappable locations for UK & Ireland
All Tableau-mappable locations for UK & Ireland with marks re-sized to reflect relative expenditure
The final aspect of the data I wanted to examine here was at the individual-level. The first thing to say is that I’ll not be revealing any names or exact real world locations. As many commentators have noted, they may not refer to the actual people named. Even if they were guaranteed accurately to identify everyone who spent money on this site, I find little interest in naming and shaming any of them. Although there are multiple issues with linking the cities, addresses, postcodes, and even the countries to the names in the data set to real world people, I’ve found that they are remarkably consistent across transactions. By this I mean that the details given to make a payment in, say, July 2008 will be the same as when the member makes a further payment the following month. We may not be able (or want to) identify individuals, but the activities of individuals can be coherently tracked through the data. When you first graph this data (again, Settlements only) it appears like there’s just a small smudge in the bottom left hand corner of the page. It’s an issue with attempting to graph so many users against such a variety of amounts of money. There are a few people who’ve paid an awful lot and so many that have paid very little, they basically cancel each other out. To be fair, things are not much better when you separate out the data by country either. Instead it’s probably better to concentrate on a small number of case studies within the data to examine generalities. The first thing to note about looking at the respective top 20 accounts by amount spent is that Ireland and the UK are vastly different. The top spending UK Account managed to relieve himself of C$182,000. Admittedly, the top Irish account wasn’t terribly far behind, spending C$119,300. 
But, while the UK accounts pretty much step down along a gentle curve, the Irish ones simply drop between first and second place – the next most generous Irish account spent C$7,502, as opposed to the second place UK account that shelled out C$112,900. By the time you get to the 20th placed Irish account you’re talking about a spend of ‘only’ C$690. It’s really not much when compared to the 20th placed UK account that forked over C$10,870. For this top 20 group the averages are quite telling too. The UK average here is C$40,933, while the Irish average is a ‘mere’ C$7,728. For the entire dataset the average spend is C$255 – C$310 for Irish accounts and C$250 for UK ones. Not that I’m condoning it, but that sounds pretty reasonable for an affair! I became intrigued by what some of the lower spending customers were spending and I note that there are 1,132 examples of people paying exactly C$19. This is an important and much vaunted figure in that this was the price charged by Ashley Madison for their ‘Full Delete’ service … that, it seems, they didn’t actually carry out. With 275 examples from Ireland and the remaining 857 from the UK, that’s C$21,508 that Ashley Madison got for (allegedly) doing very little. When examined on an account, or individual level, it is clear that many of these are the only payments ever made to Ashley Madison. These may range from people desperately trying to erase a ‘prank’ account created by co-workers, to people deciding that, now that they’ve had a bit of a look about and a think about it, Ashley Madison and the services they offer are not for them. Whatever about the latter group, I do sincerely hope that the former group immediately followed this up by finding a job where they’re not surrounded by asshats. Below this C$19 marker there are 9,009 transactions, across 91 price points, where amounts were settled with Ashley Madison, ranging from C$18.75 to C$8.08. 
All together, these totalled C$135,022.25 … an awful lot of money to make in small transactions.
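
The ‘Full Delete’ sums above are simple enough to check, and the ‘only payment ever made’ observation can be sketched as a small grouping step. The counts and the C$19 price come from the text, while the `payments` rows and the function `delete_only_accounts` are invented for illustration.

```python
from collections import defaultdict
from decimal import Decimal

FULL_DELETE_PRICE = Decimal("19.00")

# Counts of exact C$19 payments, as quoted above.
counts = {"Ireland": 275, "UK": 857}
total_count = sum(counts.values())
total_paid = FULL_DELETE_PRICE * total_count

print(f"{total_count} payments of C${FULL_DELETE_PRICE} = C${total_paid}")

# Hypothetical (account_id, amount) rows, used to find accounts whose only
# payment was the exact C$19 fee.
payments = [
    ("A1", Decimal("19.00")),
    ("A2", Decimal("49.00")),
    ("A2", Decimal("19.00")),
    ("A3", Decimal("19.00")),
]

def delete_only_accounts(rows):
    """Account IDs with exactly one payment, and that payment was the fee."""
    by_account = defaultdict(list)
    for account_id, amount in rows:
        by_account[account_id].append(amount)
    return sorted(
        account_id
        for account_id, amounts in by_account.items()
        if amounts == [FULL_DELETE_PRICE]
    )

print(delete_only_accounts(payments))
```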

All of the UK & Ireland account-level data by expenditure. So much data there's only a small smudge visible!
I next wanted to look at how individual-level accounts spent their money … basically, I wanted to see if they spent it all in a short financial thrust or gently smouldered over a much longer timescale … (Sorry! I had to take a break from typing as I started to hallucinate Barry White). Anyway, Mr White successfully exorcised, I took a look at this for the UK and Ireland and was rather confused by what I found. According to the data, our biggest spender in the UK – lashing out a cool C$182K – did it in only two payments of C$91K in August and September of 2014. But he’s not alone. The second placed account paid out C$72,900 in March 2015, followed by two payments of C$20K each in the two months following. Even in the eighth and ninth spots, they made their entire spends over two transactions in a single month. I will admit that my first reaction was one of disbelief and shock - it seems implausible to me that so much money could be spent on a caprice over such a small span of time. I'm afraid that this aspect of life is beyond my experience (both the affairs and that amount of disposable cash), so I'm not able to comment on whether this is reasonable or plausible. For these apparent highest of high rollers, I did wonder if they were not the victims of fraud. This may yet prove to be the case, but there is no direct evidence of it that I can see in the data available. The Irish data is only slightly less strange. The top payer managed to get through C$119,300 in 12 payments, varying between C$2,000 and C$12,800, in the period between November 2014 and March 2015. The third, fourth, and sixth ranked accounts all paid out within a single calendar month (C$6,396, C$4,900, and C$1,120 respectively). Only the second, eighth, and ninth ranked accounts paid their cash out in smaller sums and over a relatively lengthy time frame. The second ranked account shows Settlement activity in the period from April 4th 2011 to August 20th 2012. 
The account holder spent C$7,866.99 over 105 transactions, with amounts ranging from C$20 to C$249. As an aside, there is a person of the same name, same address, but different account number, and different email (with a different credit card) that spent C$720 over 11 transactions in the period from July 3rd 2014 to April 4th 2014. Thus, it would seem that individual accounts may not hold all the events of a single individual. One way or another, this aspect of the data requires further detailed study and scrutiny.
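
Spotting one individual spread across several accounts, as in the aside above, could be approached by grouping on a normalised name and address rather than on account number. What follows is a sketch under stated assumptions: the records are made up, and `identity_key` and `linked_accounts` are hypothetical helpers. Hashing the key means the raw names and addresses never need to appear in any output.

```python
import hashlib
from collections import defaultdict

# Hypothetical records: (account_id, name, address). No real data is used.
records = [
    ("1001", "J. Bloggs", "1 Main St, Anytown"),
    ("2002", "j. bloggs ", "1 Main St,  Anytown"),
    ("3003", "A. N. Other", "2 High St, Anytown"),
]

def identity_key(name, address):
    """Hash a normalised name/address pair so raw values need not be stored."""
    # Lower-case and collapse runs of whitespace before hashing.
    normalised = " ".join(f"{name} | {address}".lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def linked_accounts(rows):
    """Groups of account IDs that share the same name and address."""
    groups = defaultdict(set)
    for account_id, name, address in rows:
        groups[identity_key(name, address)].add(account_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

print(linked_accounts(records))
```

Exact matching after light normalisation will miss misspellings and abbreviations, so this would under-count rather than over-count linked accounts.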

Top 20 accounts by expenditure for UK (Left) and Ireland (Right)
Breakdown of expenditure by Top 10 UK accounts by month
Breakdown of expenditure by Top 10 Irish accounts by month
This is as far as I’ve taken these analyses, but I think that there’s much more there to be found and investigated. But beyond these direct analyses, is there much that this form of data analysis can bring to our understanding of modern British and Irish society? I think that the answer has to be a resounding yes. This is exactly the type of data that doesn’t make it into standard historical narratives precisely because it’s usually inaccessible and impossible to quantify. For this reason alone, I believe that it is worthy of study and consideration. Certainly in terms of the Irish data (north and south), it lends an important nuance to traditional, conservative narratives about how ideas of ‘family’ and sexuality are understood and presented. For the British data as much as the Irish, at its simplest level, it gives the lie to so many standard stereotypes about the denizens of these islands being sexually repressed and conservative. It was probably always thus, the difference now is that the internet has provided the means to connect and now we have the ability to analyse and interrogate the data. One way or another – whatever the morality or legality of taking or using this data – it is here and cannot be suppressed. Not only that, it will become more frequent and increasingly normalised in the years to come. Here’s the challenge for archaeologists, historians, and anyone interested in the current state of our culture: how will we react to this data, how will we use it responsibly, and what insights will we achieve?

Where do we go from here? In the first instance, I think it’s important to note that I’ve only undertaken a small amount of ‘excavation’ on one small portion of this digital site – just the credit/debit card data for the British and Irish members ... But there is so much more … not just in terms of card transactions, but in the context of the data dump as a whole. I do think that we need to move the analysis of this archive away from the outing of individual people to an atmosphere of genuine research at the societal level. Think about it like undertaking a research dig on an ancient city – there’s room for one team to investigate the houses of the wealthy, while other projects look at the dwellings of the middle classes and the workshops of the artisans. There’s still enough room for other groups again to look at the aqueducts and the bathhouses … you just want to keep the treasure hunters away from digging out the shiny stuff, shorn of all context. I think that the archaeology metaphor holds when we consider that as large and comprehensive as this data dump is, it is still just one ‘site’ (in both the archaeological and internet senses). I hesitate to be seen to advocate for more hacking of personal data, but should other related types of site become available – for example, eHarmony, Tinder, Grindr, Christian Mingle, and any of the plethora of niche sites that are out there – we may just be able to begin to create a landscape archaeology of the digital realm. Should these be combined with data from other (non dating/hookup) digital sources, like Facebook, Twitter, YouTube, MumsNet, and Instagram, then we could begin producing some genuinely deep and interesting insights at all levels from the individual to the planet as a whole ... and that, dear readers, is the true essence of archaeology and history!

For a project like this I’d normally direct the reader to a Tableau presentation where they can play about with the data and make their own discoveries and bring out the particular things that interest them. I have not done that in this case, and for a number of reasons. In the first place, I feel that despite my interest in it and argument that it is a valid field for research, it is still too sensitive to release en masse – even if I did anonymise the names and remove all even vaguely personal information. The second reason is Tableau themselves. They do give off a very Geek-friendly vibe, but their heavy handed response to users analysing portions of the WikiLeaks data and uploading it to Tableau Public makes me think it’s better to be overly cautious here (see also here | here | here | here | here). Although Tableau have changed their policy in the time since (here | here), I’m not inclined to risk it. Instead, I just used Tableau Public to make the visualisations and screen-grabbed them … it's not perfect, but it works!

I thought about writing to Visa and suggesting that they attempt to capitalise on this data by taking out advertising to say: ‘Visa: Your card of choice for infidelities!’ ... but I have a feeling that they’d not go for it. I even had a rough draft of a script for a TV advert … it could have been great …

In doing research for this post, I stumbled upon the photography of Georgios Makkas and his 'Archaeology of Now' project looking at abandoned shopfronts in Greece. His work has a haunting quality that eloquently shows the effects of the recent economic downturn on the post-war Greek dream of family shop-ownership. See his work here and here.

I just wanted to add that while the prevailing narrative regarding the Ashley Madison data is one of heterosexual infidelities, there are actually six categories of membership:
1: Attached Female Seeking Males
2: Attached Male Seeking Females
3: Single Male Seeking Attached Females
4: Single Female Seeking Attached Males
5: Attached Male Seeking Males
6: Attached Female Seeking Females

While the majority of coverage has centred on the first two categories, this ignores a (probably) significant proportion of the membership. In particular, the final two categories, covering individuals in heterosexual relationships seeking homosexual encounters and people in long-term homosexual relationships looking for affairs, have been absent from the discussion. I’ve not been able to find evidence in the credit card data to make differentiations at this level of detail, but I do think it’s worthy of study and further research.
