Data Source


The numeric data we refer to on this website was obtained from archived versions of Google's Transparency Report, available via the Wayback Machine. You can find snapshots of the website here (depending on your country settings you may see different snapshot dates).
The data was not visualised on the front-end of Google's Transparency Report, but it was available in the source code in JSON format under the variable name AGGREGATE_DATA. All snapshots we found up to March 23rd 2015 contain the breakdown of the data into issue types, while snapshots taken after March 2015 no longer contain the issue types. Have a look at the current source code of Google's Transparency Report for comparison.
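
As a rough illustration of how such an embedded object can be read out of a saved page, here is a minimal Python sketch. It is not the script used for this site. It assumes that the page source contains the name AGGREGATE_DATA followed by a JSON object literal; the function name `extract_aggregate_data` and the miniature sample page are made up for the example.

```python
import json


def extract_aggregate_data(html: str, var_name: str = "AGGREGATE_DATA") -> dict:
    """Return the JSON object that follows `var_name` in the page source."""
    start = html.find(var_name)
    if start == -1:
        raise ValueError(f"{var_name} not found in page source")
    # Assumption: the first '{' after the variable name opens the object literal.
    brace = html.find("{", start)
    if brace == -1:
        raise ValueError(f"no object literal found after {var_name}")
    # raw_decode parses exactly one JSON value starting at `brace` and ignores
    # whatever follows it (a trailing ';', the rest of the script, etc.).
    data, _end = json.JSONDecoder().raw_decode(html, brace)
    return data


# Made-up miniature page, only to show the call; this is not real report data.
sample_html = '<script>var AGGREGATE_DATA = {"XX": {"name": "Example"}};</script>'
print(extract_aggregate_data(sample_html))
```

Using `raw_decode` avoids having to guess where the object ends, which a regular expression would struggle with for deeply nested data.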

We analysed all datasets obtained from these snapshots from October 2014 to March 23rd 2015. The overall ratios, both regarding issues and outcomes, don't vary much over time, so the data we chose to present here reflects the most recent data from March 23rd 2015.

Data Structure


The data object itself is embedded in JSON format in the bare HTML file of the Transparency Report, under the name AGGREGATE_DATA. That's not really a nice way to structure a website (I am sure any Google front-end developer would be happy to lecture on 'why'), but it works.
AGGREGATE_DATA breaks down into countries, issue types and the respective outcomes within these groups. Counts are available either as number of requests or as number of URLs, since one request to Google can contain multiple URLs to be de-linked from searches. Below is an excerpt from the JSON file, showing the data for Romania.

"RO": { 
  "name": "Romania", 
  "requests": {
    "all": {
      "pending": 51, "need_more_info": 568, "rejected": 2023, "complied": 1147, "total": 4073
    }, 
    "issues": {
      "political": {
          "pending": 1, "need_more_info": 30, "rejected": 159, "complied": 91, "total": 282}, 
      "cp": {
          "pending": 0, "need_more_info": 1, "rejected": 15, "complied": 1, "total": 17}, 
      "private_personal_info": {
          "pending": 49, "need_more_info": 520, "rejected": 1780, "complied": 1052, "total": 3439}, 
      "public_figure": {
          "pending": 0, "need_more_info": 5, "rejected": 66, "complied": 12, "total": 83}, 
      "serious_crime": {
          "pending": 2, "need_more_info": 35, "rejected": 60, "complied": 14, "total": 111}
    }
  }, 
  "urls": {
    "all": {
      "pending": 236, "need_more_info": 2167, "rejected": 10419, "complied": 3835, "total": 16870
    }, 
    "issues": {
      "political": {
          "pending": 1, "need_more_info": 131, "rejected": 945, "complied": 237, "total": 1317}, 
      "cp": {
          "pending": 0, "need_more_info": 4, "rejected": 78, "complied": 1, "total": 83}, 
      "private_personal_info": {
          "pending": 230, "need_more_info": 1949, "rejected": 8776, "complied": 3571, "total": 14719}, 
      "public_figure": {
          "pending": 0, "need_more_info": 6, "rejected": 633, "complied": 43, "total": 693}, 
      "serious_crime": {
          "pending": 6, "need_more_info": 183, "rejected": 382, "complied": 20, "total": 594}
    }
  } 
},

"BE": { ... },
"FR": { ... },
"UK": { ... },
"DE": { ... },
 ...

              

Data Processing


The data for each country is broken down by number of requests and number of URLs. On average, one request contains 3-5 URLs to be de-linked, but initial analysis suggests that the category Google assigns to a URL does not necessarily match that of the respective request. To avoid overcomplicating things, we therefore chose to visualise the number of requests for this initial insight.
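
To make the requests-versus-URLs relationship concrete, the following Python sketch computes the average number of URLs per request for Romania, using the two "all" totals from the excerpt above. The helper name `urls_per_request` is hypothetical, and the dictionary keeps only the two fields the calculation needs.

```python
# Totals copied from the "all" entries of the Romania excerpt above.
romania = {
    "requests": {"all": {"total": 4073}},
    "urls": {"all": {"total": 16870}},
}


def urls_per_request(country: dict) -> float:
    """Average number of URLs named in one de-linking request."""
    return country["urls"]["all"]["total"] / country["requests"]["all"]["total"]


ratio = urls_per_request(romania)
print(round(ratio, 2))  # 16870 / 4073 -> 4.14
```

For Romania this gives roughly 4.1 URLs per request, which sits within the 3-5 range quoted above.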

To make use of the most detailed level the data offers, we used the numeric values at the deepest leaves of the tree structure, i.e. the deepest breakdown level, to calculate sums and percentages per country. Importantly, one of these leaves is named 'total', suggesting that it is the sum of the remaining four leaves (i.e. outcomes) in each branch. However, the outcomes 'complied', 'rejected', 'need_more_info' and 'pending' don't always add up to 'total': in many countries and issue types there is a difference. For example, requests from Romania falling under issue type 'political' resolve to "pending": 1, "need_more_info": 30, "rejected": 159, "complied": 91, which add up to 281, whereas the "total" is 282. We accounted for these requests without any assigned outcome by labelling them 'undefined'. All percentages refer to the leaf labelled "total" as 100%.
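
The arithmetic for a single branch can be sketched in Python as follows, using the Romanian 'political' requests from the excerpt. The function names `with_undefined` and `percentages` are hypothetical and chosen only for this illustration; the actual processing was done in R.

```python
OUTCOMES = ("complied", "rejected", "need_more_info", "pending")


def with_undefined(branch: dict) -> dict:
    """Copy the four outcome counts and add the unexplained remainder."""
    counts = {outcome: branch[outcome] for outcome in OUTCOMES}
    counts["undefined"] = branch["total"] - sum(counts.values())
    return counts


def percentages(branch: dict) -> dict:
    """Express every outcome, including 'undefined', as a share of 'total'."""
    counts = with_undefined(branch)
    total = branch["total"]
    if total == 0:  # empty branch: avoid dividing by zero
        return {outcome: 0.0 for outcome in counts}
    return {outcome: 100 * n / total for outcome, n in counts.items()}


# Requests from Romania with issue type 'political', copied from the excerpt.
ro_political = {"pending": 1, "need_more_info": 30, "rejected": 159,
                "complied": 91, "total": 282}

counts = with_undefined(ro_political)
shares = percentages(ro_political)
print(counts["undefined"])           # 282 - 281 -> 1
print(round(shares["rejected"], 1))  # 159 / 282 -> 56.4
```

Because 'undefined' absorbs the remainder, the five shares of a branch always sum to 100%.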

Scraping and processing were done in the programming language R, version 3.2.0 (2015-04-16). A commented script and the list of Wayback Machine snapshots it ingests are available via GitHub here.
Conceptually it does the following:


for each LOGDATE
    for each COUNTRY
        for each UNIT TYPE (i.e. 'requests', 'urls')
            for each ISSUE TYPE
                set 'undefined' = 'total' - SUM('complied','rejected','pending','need_more_info')
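
The same loop can be sketched in Python. This is not the R script itself, only an illustration under stated assumptions: `snapshots` is assumed to map a log date to an already parsed AGGREGATE_DATA object, the function name `flatten` is hypothetical, and the date label in the sample is a placeholder rather than a real snapshot date. The sample counts are the Romanian 'political' and 'cp' branches from the excerpt on this page.

```python
OUTCOMES = ("complied", "rejected", "pending", "need_more_info")
UNITS = ("requests", "urls")


def flatten(snapshots: dict) -> list:
    """Turn {logdate: AGGREGATE_DATA} into one row per
    country, count unit, issue and outcome."""
    rows = []
    for logdate, data in snapshots.items():
        for code, country in data.items():
            for unit in UNITS:
                for issue, branch in country[unit]["issues"].items():
                    counts = {o: branch[o] for o in OUTCOMES}
                    # Whatever 'total' does not account for becomes 'undefined'.
                    counts["undefined"] = branch["total"] - sum(counts.values())
                    for outcome, value in counts.items():
                        rows.append({
                            "CountryName": country["name"],
                            "CountryCode": code,
                            "CountUnit": unit,
                            "Issue": issue,
                            "Outcome": outcome,
                            "VALUE": value,
                            "logdate": logdate,
                        })
    return rows


sample = {"example-date": {"RO": {  # placeholder date label, not a real snapshot
    "name": "Romania",
    "requests": {"issues": {
        "political": {"pending": 1, "need_more_info": 30, "rejected": 159,
                      "complied": 91, "total": 282},
        "cp": {"pending": 0, "need_more_info": 1, "rejected": 15,
               "complied": 1, "total": 17}}},
    "urls": {"issues": {
        "political": {"pending": 1, "need_more_info": 131, "rejected": 945,
                      "complied": 237, "total": 1317},
        "cp": {"pending": 0, "need_more_info": 4, "rejected": 78,
               "complied": 1, "total": 83}}},
}}}

rows = flatten(sample)
undefined_rows = [r for r in rows if r["Outcome"] == "undefined"]
for r in undefined_rows:
    print(r["CountUnit"], r["Issue"], r["VALUE"])
```

A long format with one count per row keeps later grouping and percentage calculations simple, at the cost of repeating the country and date on every row.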
              

The resulting data table contains seven columns: CountryName, CountryCode, CountUnit (requests or urls), Issue, Outcome, VALUE (the count) and logdate. For example:


CountryName CountryCode CountUnit     Issue           Outcome         VALUE   logdate

Austria     AT          requests      public_figure   rejected        20      2014-10-13
Austria     AT          requests      public_figure   complied        10      2014-10-13
Austria     AT          requests      public_figure   pending          3      2014-10-13
Austria     AT          requests      public_figure   need_more_info   2      2014-10-13
Austria     AT          requests      political       rejected        16      2014-10-13
Austria     AT          requests      political       complied         3      2014-10-13
.
.
.
              

All further analysis on this page is based on this extract.