The numeric data we refer to on this website was obtained from archived versions of Google's Transparency Report, available via the Wayback Machine. You can find snapshots of the website here (depending on your country settings, you may see different snapshot dates).
The data was not visualised on the front-end of Google's Transparency Report, but it was available in the source code in JSON format under the variable name AGGREGATE_DATA. All snapshots we found up to March 23rd 2015 contain the breakdown of the data into issue types, while later snapshots no longer contain the issue types. Have a look at the current source code of Google's Transparency Report for comparison.
We analysed all datasets obtained from these snapshots from October 2014 to March 23rd 2015. The overall ratios, both regarding issues and outcomes, don't vary much over time, so the data we chose to present here reflects the most recent data from March 23rd 2015.
The data object itself is contained in the bare HTML file of the Transparency Report in JSON format, named AGGREGATE_DATA. That's not really a nice way to structure a website (I am sure any Google front-end developer would be happy to lecture on 'why') - but it works.
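As a rough illustration of how an object embedded like this can be pulled out of a page's source, here is a minimal Python sketch. It is not the R script used for this analysis. The function name extract_json_variable and the sample page are hypothetical, and the sketch assumes the variable appears as `AGGREGATE_DATA = {...}` in the source; the exact embedding in the archived pages may differ.

```python
import json
import re


def extract_json_variable(html, name="AGGREGATE_DATA"):
    """Return the JSON object assigned to `name` inside an HTML page.

    Assumption: the page contains something like `AGGREGATE_DATA = {...}`.
    """
    match = re.search(re.escape(name) + r"\s*[=:]\s*", html)
    if match is None:
        raise ValueError(f"{name} not found in page source")
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (a semicolon, the rest of the script).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


# Tiny stand-in for a snapshot page (hypothetical, for illustration only).
sample_html = """
<html><script>
var AGGREGATE_DATA = {"RO": {"name": "Romania",
  "requests": {"all": {"total": 4073}}}};
</script></html>
"""

data = extract_json_variable(sample_html)
print(data["RO"]["requests"]["all"]["total"])  # prints 4073
```

Using the JSON decoder rather than a regular expression for the object itself avoids having to match nested braces by hand.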
AGGREGATE_DATA breaks down into country names, issue types and respective outcomes within these groups. Counts are available as either number of requests or number of URLs, since one request to Google can contain multiple URLs to be de-linked from searches. Below is an excerpt from the JSON file, showing the data for Romania.
"RO": {
  "name": "Romania",
  "requests": {
    "all": {
      "pending": 51, "need_more_info": 568, "rejected": 2023, "complied": 1147, "total": 4073
    },
    "issues": {
      "political": {
        "pending": 1, "need_more_info": 30, "rejected": 159, "complied": 91, "total": 282},
      "cp": {
        "pending": 0, "need_more_info": 1, "rejected": 15, "complied": 1, "total": 17},
      "private_personal_info": {
        "pending": 49, "need_more_info": 520, "rejected": 1780, "complied": 1052, "total": 3439},
      "public_figure": {
        "pending": 0, "need_more_info": 5, "rejected": 66, "complied": 12, "total": 83},
      "serious_crime": {
        "pending": 2, "need_more_info": 35, "rejected": 60, "complied": 14, "total": 111}
    }
  },
  "urls": {
    "all": {
      "pending": 236, "need_more_info": 2167, "rejected": 10419, "complied": 3835, "total": 16870
    },
    "issues": {
      "political": {
        "pending": 1, "need_more_info": 131, "rejected": 945, "complied": 237, "total": 1317},
      "cp": {
        "pending": 0, "need_more_info": 4, "rejected": 78, "complied": 1, "total": 83},
      "private_personal_info": {
        "pending": 230, "need_more_info": 1949, "rejected": 8776, "complied": 3571, "total": 14719},
      "public_figure": {
        "pending": 0, "need_more_info": 6, "rejected": 633, "complied": 43, "total": 693},
      "serious_crime": {
        "pending": 6, "need_more_info": 183, "rejected": 382, "complied": 20, "total": 594}
    }
  }
},
"BE": { ... },
"FR": { ... },
"UK": { ... },
"DE": { ... },
...
The data for each country is broken down in terms of numbers of requests and numbers of URLs. On average, one request contains 3-5 URLs to be de-linked, but initial analysis suggests that the category Google assigns to URLs does not necessarily match that of the respective request. To avoid overcomplicating things, we therefore chose to visualise the number of requests for this initial insight.
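To give a feel for the URLs-per-request figure, the short Python sketch below divides URL totals by request totals for Romania, using the numbers from the excerpt. It is only an illustration and not part of the original R script; the variable names are made up for this example.

```python
# "total" values for Romania, copied from the JSON excerpt.
requests_total = {
    "all": 4073, "political": 282, "cp": 17,
    "private_personal_info": 3439, "public_figure": 83, "serious_crime": 111,
}
urls_total = {
    "all": 16870, "political": 1317, "cp": 83,
    "private_personal_info": 14719, "public_figure": 693, "serious_crime": 594,
}

# Average number of URLs per request, overall and per issue type.
urls_per_request = {
    key: urls_total[key] / requests_total[key] for key in requests_total
}

for key, ratio in urls_per_request.items():
    print(f"{key}: {ratio:.2f} URLs per request")
```

For Romania overall this gives roughly 4.1 URLs per request (16870 / 4073).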
To make use of the most detailed level the data offers, we used the numeric values at the deepest leaves of the tree structure, i.e. the deepest breakdown level, to calculate sums and percentages per country. Importantly, one of these leaves is named 'total', suggesting that it is the sum of the remaining four leaves (i.e. outcomes) in each branch. However, the outcomes 'complied', 'rejected', 'need_more_info' and 'pending' don't always add up to 'total': in many countries and issue types there is a difference. For example, requests from Romania falling in issue type 'political' resolve to "pending": 1, "need_more_info": 30, "rejected": 159, "complied": 91, adding up to 281, while the "total" is 282. We accounted for these requests without an assigned outcome by labelling them 'undefined'. All percentages refer to the leaf labelled "total" as 100%.
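The Romanian example can be checked with a few lines of Python. This is a minimal sketch of the arithmetic described above, not the original R code; the helper names with_undefined and percentages are hypothetical.

```python
OUTCOMES = ("complied", "rejected", "need_more_info", "pending")


def with_undefined(leaf):
    """Return the four outcome counts plus the 'undefined' remainder."""
    counts = {outcome: leaf[outcome] for outcome in OUTCOMES}
    counts["undefined"] = leaf["total"] - sum(counts.values())
    return counts


def percentages(leaf):
    """Express each outcome as a percentage of the leaf labelled 'total'."""
    counts = with_undefined(leaf)
    if leaf["total"] == 0:
        # Assumption: an empty branch is reported as all zeros.
        return {outcome: 0.0 for outcome in counts}
    return {o: 100 * n / leaf["total"] for o, n in counts.items()}


# Requests from Romania, issue type 'political' (from the JSON excerpt).
ro_political = {"pending": 1, "need_more_info": 30, "rejected": 159,
                "complied": 91, "total": 282}

print(with_undefined(ro_political)["undefined"])  # prints 1
for outcome, share in percentages(ro_political).items():
    print(f"{outcome}: {share:.1f}%")
```

Because 'undefined' is defined as the remainder, the five percentages always sum to 100% of 'total'.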
Scraping and processing were done in the programming language R, version 3.2.0 (2015-04-16). A commented script and a list of the Wayback Machine snapshots it ingests are available via GitHub here.
Conceptually it does the following:
for each LOGDATE
    for each COUNTRY
        for each UNIT TYPE (i.e. 'requests', 'urls')
            for each ISSUE TYPE
                set 'undefined' = 'total' - SUM('complied','rejected','pending','need_more_info')
The resulting data table contains seven columns: CountryName, CountryCode, CountUnit (requests or urls), Issue, Outcome, VALUE (the count) and logdate. I.e.:
CountryName  CountryCode  CountUnit  Issue          Outcome         VALUE  logdate
Austria      AT           requests   public_figure  rejected        20     2014-10-13
Austria      AT           requests   public_figure  complied        10     2014-10-13
Austria      AT           requests   public_figure  pending         3      2014-10-13
Austria      AT           requests   public_figure  need_more_info  2      2014-10-13
Austria      AT           requests   political      rejected        16     2014-10-13
Austria      AT           requests   political      complied        3      2014-10-13
.
.
.
All further analysis on this page is based on this extract.
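For readers who do not use R, the loop and the table layout described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of the published script: the function name flatten is hypothetical, the sample holds only two Romanian issue types from the JSON excerpt, and the logdate used for it is an assumed label.

```python
OUTCOMES = ("complied", "rejected", "pending", "need_more_info")
COLUMNS = ("CountryName", "CountryCode", "CountUnit", "Issue",
           "Outcome", "VALUE", "logdate")


def flatten(aggregate_data, logdate):
    """Turn one snapshot of AGGREGATE_DATA into long-format rows."""
    rows = []
    for code, country in aggregate_data.items():
        for unit in ("requests", "urls"):
            for issue, leaf in country[unit]["issues"].items():
                counts = {o: leaf[o] for o in OUTCOMES}
                # Requests/URLs counted in 'total' but without an outcome.
                counts["undefined"] = leaf["total"] - sum(counts.values())
                for outcome, value in counts.items():
                    rows.append((country["name"], code, unit, issue,
                                 outcome, value, logdate))
    return rows


# Reduced sample taken from the Romanian excerpt (two issue types only).
sample = {"RO": {"name": "Romania",
    "requests": {"issues": {
        "political": {"pending": 1, "need_more_info": 30, "rejected": 159,
                      "complied": 91, "total": 282},
        "cp": {"pending": 0, "need_more_info": 1, "rejected": 15,
               "complied": 1, "total": 17}}},
    "urls": {"issues": {
        "political": {"pending": 1, "need_more_info": 131, "rejected": 945,
                      "complied": 237, "total": 1317},
        "cp": {"pending": 0, "need_more_info": 4, "rejected": 78,
               "complied": 1, "total": 83}}}}}

# The logdate is an assumed label for this sample.
rows = flatten(sample, "2015-03-23")
print("\t".join(COLUMNS))
for row in rows:
    print("\t".join(str(field) for field in row))
```

To cover several snapshots, the same function would be called once per logdate and the resulting row lists concatenated.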