Pattern Drive Private Limited


The Complex World Of Metadata: A Look At Data Protection And Privacy

Updated on: 29/12/2023


Hardly any other conflict between surveillance and fundamental rights has occupied the European Court of Justice (ECJ) as much as data retention. The court issued judgments on it in 2014, 2016, 2020 and 2022. Data retention refers to legislative proposals that provide for the mass storage of connection data from public communications services without any specific cause. The data is held in reserve for a certain period for the purpose of later criminal prosecution. The ECJ has repeatedly overturned data retention regulations. In these debates, the question of what is actually to be stored in reserve almost fades into the background. Usually a half-sentence states that it concerns “traffic data” or “metadata”. The term is often accompanied by the trivializing prefix “only”, as if the nebulous statement “we only process metadata” explained everything. It does not. This article attempts to clear some of the fog surrounding the term “metadata”.

Although the term is often used in the context of mass government surveillance, metainformation is also a basis for tracking business models. The Meta (Facebook) Social Graph would not work without metadata about social relationships.

We will therefore also look at how the knowledge contained in metadata can be made usable.

 

1. Metadata: A Big Ball Of Wibbly Wobbly Meta Stuff?

Metadata, also known as meta-information, is a type of structured data that includes information about the properties of other data.

In political discussions, the term “metadata” usually only appears indirectly and often in connection with, for example, telecommunications surveillance. The proponents of surveillance primarily argue with the differentiation from the (conversation) content of a communication in order to prove the supposed harmlessness of the surveillance measure.

Nobody listens to your phone calls. (…) What the secret services do is check telephone numbers and the duration of calls. They don't look at people's names and they don't look at the content. By looking through this so-called metadata (…)

Source: Associated Press. (2013, June 7). Obama: “Nobody Is Listening to Your Phone Calls” [Video]. YouTube. Translation by the author.

Crucial to understanding communication metadata is that even if the content of a communication remains confidential, the circumstances of the communication are inevitably disclosed. State surveillance measures generally target commercial communications services, so the same applies to those services. Services such as WhatsApp publicly emphasize the end-to-end encryption of messages and thus equate encryption with the protection of privacy.

We cannot read the messages that users send to each other.

Source: Krempl, S. (2021, September 28). WhatsApp boss: Don't collect metadata on a large scale. heise online.

Drawing a line between the content of a message and the circumstances of its transmission is an arbitrary distinction. It leads to metadata being defined primarily by what it is not, namely the content of a message. The promotion of end-to-end encryption simultaneously creates the impression that the confidential content is protected and that “everything else” is not worth protecting.

There is no clear distinction between metadata and content. It's more of a coherent whole.

Source: Conle, C. (2014). Metadata: Piecing together a privacy solution. ACLU of Northern CA. Translation and emphasis by the author.

1.1 Technical Definitions Of Metadata

There are numerous definitions that explain metadata based on its specific technical application. An important feature emerges that enables data analysis on a large scale: structuredness. The prerequisite for this is standardization of the technical processes.

Internet communication transmits packets and “nests” these packets according to a structured layer model. Higher layers use functionalities of lower layers, interfaces connect the different layers. Layer models are the basis for ensuring that digital communication between different technical systems functions in a standardized manner.

The use of layer models is important for the technical definition of the metadata term, because which information represents communication content or communication circumstances (...) depends on the layer under consideration.

Source: Leibniz Institute for Information Infrastructure, University of Innsbruck, Karlsruhe Institute of Technology, Boehm, F., Böhme, R. & Andrees, M. (2017). Expert report for the hearing of the 1st investigative committee of the German Bundestag of the 18th electoral term on the topic: How or in what different ways is the term traffic and usage data used scientifically in a technical and legal context? How is this to be distinguished from the term metadata? German Bundestag.

For example, when an email is sent, the Simple Mail Transfer Protocol (SMTP) is used. The reference for technical Internet standards is the RFC series of the Internet Engineering Task Force, such as RFC 5321 and RFC 5322. The technical structures are derived from these documents. E-mails are structured into content-describing data (header fields) and the message content (body). However, this structuring exists to make technical processing possible; it is not a categorization into sensitive and harmless data.

It should be noted that, from a technical perspective, a clear distinction between metadata and content data that could be generalized to all types of electronic communication is not possible.

Source: ibid.

Structure tempts one to assume that a definition is universally valid. This is not the case with metadata, because its definitions are application-specific.

1.1.1 Email

E-mails consist of three components: the envelope with the technical information for delivery (e.g. via SMTP), the header and the message body. The body corresponds to the content of the message. Envelope and header fields are “content-describing” data and are therefore metadata.

The header fields specified by RFC 5322 include:

  • Sender
  • Recipient
  • Other recipients (Cc, Bcc)
  • Subject
  • Creation time

Note that metadata can relate to an identifiable person (e.g. a personalized email address or an IP address).

The subject is a short description of the message body. It is not unusual for details such as travel times or destinations to appear in the subject line. Metadata can therefore also reveal context.
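To make the split between header fields and body tangible, here is a minimal sketch using Python's standard `email` package. The message, the addresses and the function name `extract_header_metadata` are invented for illustration and are not taken from any real system. The body stays untouched, yet the header alone already reveals who wrote to whom, when, and about what.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical message; all names and addresses are made up.
RAW_MESSAGE = """\
From: Alice Example <alice@example.org>
To: Bob Example <bob@example.org>
Cc: carol@example.org
Subject: Flight to Lisbon, Friday 07:40
Date: Fri, 29 Dec 2023 06:15:00 +0100

(message body, possibly encrypted)
"""


def extract_header_metadata(raw: str) -> dict:
    """Collect the content-describing header fields without reading the body."""
    msg = message_from_string(raw)
    return {
        "sender": msg["From"],
        "recipients": msg.get_all("To", []) + msg.get_all("Cc", []),
        "subject": msg["Subject"],
        "created": parsedate_to_datetime(msg["Date"]).isoformat(),
    }


metadata = extract_header_metadata(RAW_MESSAGE)
for field, value in metadata.items():
    print(f"{field}: {value}")
```

Even in this toy case the subject line discloses a destination and a departure time, although the body was never parsed.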

1.1.2 Telecommunications Data, Such As Call Detail Record (CDR) Or Individual Connection Records

In telecommunications, metadata such as the Call Detail Record (CDR), known in Germany as the itemized connection record (Einzelverbindungsnachweis), automatically documents the circumstances of a connection and serves, among other things, as the basis for billing.

Our current use of telecommunications is a mix of mobile and landline communication. It is not limited to telephone calls, nor are conversations always carried out using the same technical procedures (e.g. Voice over IP). Accordingly, metadata arises from the overlap of these usage occasions, be it email, mobile surfing or telephony. Typical fields include:

  • Telephone numbers (A and B numbers),
  • Call times and call duration (start, end),
  • Call type (SMS, voice, etc.),
  • For mobile phone connections, additionally the card identifier International Mobile Subscriber Identity (IMSI), the device identifier International Mobile Equipment Identity (IMEI), the mobile cell identifier (cell ID) and IP addresses,
  • Other fields; see, for example, a document from the English Federation of Communication Services.

Device or card identifiers such as an IMEI or an IMSI can uniquely identify people.
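The fields listed above can be pictured as one structured record per connection. The following is only a sketch under assumptions: the class `CallDetailRecord`, its field names and all values are hypothetical and much simpler than any real CDR format. It shows how a card identifier such as the IMSI ties otherwise separate records together into a movement trail.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class CallDetailRecord:
    """Hypothetical, simplified CDR; real records contain many more fields."""
    a_number: str
    b_number: str
    start: datetime
    end: datetime
    call_type: str
    imsi: Optional[str] = None
    imei: Optional[str] = None
    cell_id: Optional[str] = None

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


# Invented sample data: one voice call and one SMS from the same SIM card.
records = [
    CallDetailRecord("+49-30-0000001", "+49-30-0000002",
                     datetime(2023, 12, 29, 9, 0), datetime(2023, 12, 29, 9, 5),
                     "voice", imsi="262010000000001", cell_id="cell-A"),
    CallDetailRecord("+49-30-0000001", "+49-30-0000003",
                     datetime(2023, 12, 29, 18, 30), datetime(2023, 12, 29, 18, 30),
                     "sms", imsi="262010000000001", cell_id="cell-B"),
]

# Group radio cells by card identifier, ordered by time.
cells_by_imsi = defaultdict(list)
for record in sorted(records, key=lambda r: r.start):
    cells_by_imsi[record.imsi].append((record.start.isoformat(), record.cell_id))

print(dict(cells_by_imsi))
```

No call content is involved at any point; the chronological list of radio cells per IMSI is derived from the connection circumstances alone.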

1.1.3 Internet

Tim Berners-Lee, the inventor of the World Wide Web, sees the criterion “machine-readable” as the core of meta-information (in the context of the Semantic Web).

Metadata is machine-readable information about web resources or other things (…) The term “machine-readable” is crucial. (…) Metadata is data.

Source: Berners-Lee, T. (1997, January). Axioms of Web Architecture: Metadata. Web architecture: Metadata. Translation and emphasis by the author.

In order to meet this requirement, criteria, structure and standardization are required (see RFC 3986). According to Berners-Lee's definition, machine-readable information on the Internet falls under the term metadata. This includes resources of different types:

  • Website,
  • Position within a website (path),
  • Web service,
  • Files such as documents, images, etc.,
  • IP addresses,
  • Users (RFC 3986, section 3.2.1),
  • (…).
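How machine-readable such resource identifiers are can be sketched with `urllib.parse` from the Python standard library, which splits a URI into its generic components. The URL below is a made-up example; the host, user name and path are assumptions chosen only to illustrate the point.

```python
from urllib.parse import urlsplit

# Hypothetical URL; host, user and path are invented.
parts = urlsplit("https://alice@forum.example.org:8443/health/diabetes/thread-17?page=2")

components = {
    "scheme": parts.scheme,
    "user": parts.username,   # user information component
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,       # where you are on the website
    "query": parts.query,
}

for name, value in components.items():
    print(f"{name}: {value}")
```

The path alone, without any page content, already hints at a health topic. This is the same closeness to content that the article later describes for usage data.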

1.1.4 Metadata In File Formats

Metadata can also be contained in file formats. The Exchangeable Image File Format (Exif) defines metadata for image formats such as JPEG:

  • Date, time and geo-coordinates of the recording,
  • Device and settings information,
  • (…).

People have an intention or purpose in mind when using technology. For example, a photo may be followed by an upload to a cloud system. This means that photo capture and upload generate metadata that is associated with the achievement of a purpose. These are not separate usage processes, rather they merge into one another.

1.2 Metadata And Social Identity

Meredith Whittaker, President of the Signal Foundation, offers a classification of metadata (in the context of instant messaging) in an interview. Whittaker understands instant messaging metadata as information about the identity of users (name, profile information), that is, “who you are”. However, she places this identity in a social context (contact list, members of a group chat). In this sense, “who you are” is also “who your friends are.” On the one hand, this means that metadata is not just isolated individual information but can also reflect social relationships. On the other hand, a person's metadata is always also information about other people.

We encrypt not only the content of the messages, i.e. what you say, but also the information about who you are, i.e. the metadata: your name, your profile information, your contact list and the members of your group chats. We are unable to provide this information because we do not own it.

Source: Grob, R. (2023, September 4). »Artificial intelligence is mostly used for surveillance«. Schweizer Monat. Translation and emphasis by the author.

The Signal president describes the content of the message as “what you say” and thereby distinguishes between content and metadata. However, the encryption of both components is a strong indication that Signal views metadata and content data as a coherent whole. Nevertheless, it must be said that although Signal encrypts metadata with Sealed Sender, individual metadata points remain in the technical infrastructure.

1.2.1 Concatenation Of Bulk Data

WhatsApp uses the Signal protocol to encrypt message content, but not to encrypt metadata. And as we know, metadata is extraordinarily [sic] insightful. And let's face it, it's part of Meta. So it's not inconceivable that it could combine the metadata and other information it has with the extraordinarily [sic] invasive surveillance data collected by other Meta-owned services like Facebook or Instagram.

Source: Grob, R. (2023, September 4). »Artificial intelligence is mostly used for surveillance«. Schweizer Monat. Translation and emphasis by the author.

Whittaker also points out the possibility of concatenating different information. An effective combination (and a subsequent evaluation that yields new knowledge) requires the prerequisite already mentioned: the mass of information must be structured.

1.3 Legal Definitions

Legal terms that can be read as metadata must always be understood within the regulatory scope of the respective law. Like technical definitions, they are application-specific.

1.3.1 European Court Of Justice (ECJ)

Metadata is not a recognized term in German telecommunications and telemedia law. The ECJ (C-311/18) uses the term – in connection with the monitoring of communications – in the well-known form of non-content.

As part of the UPSTREAM program (…), the NSA has access to both the metadata and the content of the communication in question.

Source: Judgment of the Court (Grand Chamber), Data Protection Commissioner v Facebook Ireland Ltd, Maximillian Schrems (C‑311/18). (2020).

However, in his opinion in case C-817/19, Advocate General Giovanni Pitruzzella objects to the supposed harmlessness of metadata. Pitruzzella expressly reads the ECJ's previous case law as treating metadata and content data as a protected interest that belongs together.

The Court has repeatedly emphasized that not only the content of electronic communications but also the metadata may contain information "on a wide range of aspects of the private life of those concerned", "including sensitive information such as sexual orientation, political opinions, religious, philosophical, social or other beliefs as well as the state of health", that from the totality of this data "very precise conclusions can be drawn about the private life of the people whose data has been stored, for example about habits of daily life, permanent or temporary locations, changes in location that occur daily or at a different pace, activities carried out, social relationships of these people and the social environment in which they move", and that the data enables the creation of "a profile of the persons concerned which, with regard to the right to respect for private life, is information just as sensitive as the content of the communications themselves".

Source: Opinion of Advocate General Giovanni Pitruzzella, Ligue des droits humains v Conseil des ministres (C‑817/19). (2022). Emphasis by the author.

1.3.2 German Law

What the ECJ understands as metadata overlaps, for example, with the “traffic data” of the Telecommunications Act.

1.3.2.1 Traffic Data

According to Section 3 No. 70 of the German Telecommunications Act (TKG; last changed in 2021), traffic data is “data whose collection, processing or use is necessary for the provision of a telecommunications service”. According to Section 176 TKG, this legal definition includes:

  • E-mail address,
  • Phone number or identifier (e.g. IMSI, IMEI) of the telephone connections involved,
  • Date and time (start and end),
  • Routing information and information about IP addresses and MAC addresses,
  • Location data (radio cells) for mobile phone connections,
  • (…).

Traffic data for emails does not include content-related data such as the name of a file attachment or the subject line. This shows the contradiction between legal and technical definition (header fields).

Section 176 TKG is a suspended regulation on data retention. The Federal Administrative Court has once again established the illegality in the decisions BVerwG 6 C 6.22 and BVerwG 6 C 7.22 (2023). The Digital Society Association has created a list of all data categories that would be subject to data retention.

1.3.2.2 Usage Data

According to Section 2 No. 3 TTDSG, usage data is the personal data of a user of telemedia (e.g. website, app), the processing of which is necessary to enable and bill the use of telemedia. The explanatory memorandum to the law provides little information about exactly what data is included. The non-exhaustive list in letters a) to c) is also not very meaningful. This lack of clarity makes it difficult to delineate usage data from traffic data.

What is clear, however, is that usage and traffic data must have an overlap in terms of the data categories recorded.

The example of an IP address shows that one and the same datum can be both a usage datum and a traffic datum. Conversely, for example, a login datum consisting of a user ID and password for a telemedia service is purely a usage datum.

Source: Leibniz Institute for Information Infrastructure, University of Innsbruck, Karlsruhe Institute of Technology, Boehm, F., Böhme, R. & Andrees, M. (2017). Expert report for the hearing of the 1st investigative committee of the German Bundestag of the 18th electoral term on the topic: How or in what different ways is the term traffic and usage data used scientifically in a technical and legal context? How is this to be distinguished from the term metadata? German Bundestag.

However, the fact that a telemedia service (menstrual app, forum about a specific illness, online gambling, etc.) is used can already allow conclusions to be drawn about sensitive issues.

Usage data can be very close to the communication content and can therefore be even more sensitive than traffic data.

Source: ibid.

The use of a specific telecommunications service (e.g. Vodafone) is at first far removed from the content. The use of a specific telemedia service (e.g. www.anonyme-lokaliker.de), on the other hand, can be much closer to the content.

1.4 Metadata: A Big Ball Of Wibbly Wobbly Semantic Stuff!

The overall view allows only one conclusion: metadata is not a specific data type. Rather, it is a generic term.

This already makes it clear that metadata represents a superset of traffic and usage data and only coincides with these categories if it is restricted by a more precise definition. (…)

Conversely, the experts are not aware of any traffic or usage datum that could not be described as metadata.

Source: ibid.

Therefore, when people talk about metadata, it is unlikely that neutral, technical “non-content” is actually meant. The mass storage of communication circumstances without cause must therefore inevitably collide with fundamental rights. Artificially splitting a message into “what you say” and “who you are” is a political tactic.

This mental dividing line is intended to obscure the fact that a coherent whole cannot simply be broken up. But if a construct contradicts its own purpose, it must be regarded as defective. There is no other explanation for the fact that the content of an SMS could not technically be separated from the data describing it, and yet SMS were subject to data retention.

If contradictions occur in a construct, this is another indication that it is flawed. How else should a rule for sensitive non-content in Section 176 (1) TKG (with reference to Section 11 (5) TTDSG) be understood? Under Section 176 (1) TKG, telephone numbers such as those of telephone counseling services are to be excluded from data retention. However, this is a fig leaf for fundamental rights, because the exception is designed as a bureaucratic obligation and is passed on to civil society.

2. Attack On Privacy

There is one question we have not yet asked. Why does supposedly harmless, neutral metadata arouse the desires of prosecutors, secret services and companies?

1. Ubiquitous surveillance using technical means is a comprehensive attack on privacy

Ubiquitous observation with technical tools (…) is comprehensive (and often covert) monitoring through a comprehensive collection of protocol components, including application content or protocol metadata such as headers. (…) is characterized by the fact that it takes place without cause and on a large scale (…).

Source: Farrell, S. & Tschofenig, H. (2014). Pervasive Monitoring Is an Attack. RFC 7258. Translation and emphasis (text) by the author.

First of all, meta-information allows conclusions to be drawn about the content. If, for example, the geo-coordinates and the chronological sequence of photo files show that the photos were taken at a Catholic church and a registry office, then conclusions can be drawn about the event and even about religious affiliation. Knowledge of the content is therefore often not necessary at all. A wealth of meta-information can be evaluated, derived and linked, which enables much more comprehensive knowledge generation.

In the following sections, graph theory is used to illustrate the metadata privacy attack. It may be helpful to read the article Topography of Data: A Black Box for the User (2020).

Graphs can be understood as a structured description of knowledge and information. Graphs consist of nodes and edges. Subway maps are an illustrative example: every subway station is a node, and the subway line connecting two stations corresponds to an edge.

2.1 Mass

The attribute of mass has already been mentioned. Reports differ on how far the storage of connection data extends. In 2014, the German Intelligence Committee of Inquiry spoke of storage up to the fifth level (contacts of the original number, contacts of these contacts, etc.). From the court decision United States v. Moalin (2020) we learn that in the USA connections are stored up to three “hops”. If a person has 100 contacts, and each of those contacts in turn has 100 contacts, and so on, the third level alone comprises a million connection records.
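The arithmetic behind the “hops” can be written down in a few lines. This is a simplified upper-bound sketch: it assumes that every person has exactly the same number of contacts and that no contact lists overlap, which real networks do not satisfy. The function name `contacts_per_hop` is made up for this example.

```python
def contacts_per_hop(contacts_per_person: int, hops: int) -> list[int]:
    """Number of newly reached contacts at each hop, assuming no overlaps."""
    return [contacts_per_person ** level for level in range(1, hops + 1)]


per_hop = contacts_per_hop(100, 3)
total = sum(per_hop)

print(per_hop)  # [100, 10000, 1000000]
print(total)    # 1010100
```

Under the same assumptions, five levels would reach 100 to the power of 5, i.e. ten billion entries at the fifth level alone. That is more than the world population, which shows how quickly overlapping contact lists must come into play in practice.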

2.2 Social Connections

“Let me see your social connections and I will show you who you really are.” That’s what social network analysis is all about.

To illustrate, an undirected graph with edge weights is used. The values of the edge weights correspond to the number of social connections between the actors.

Each actor is represented by a node. As we have shown, metadata can be chained and generate knowledge. For example, doctors use meaningful email addresses - information about the specialty or location is not uncommon. Some actors already have meaningful names. In our example, a social link can be any type of digital communication between the actors.

 

 

In social networks, in addition to individual nodes, there are also groups of nodes. A group can represent any type of social group or interest (circle of friends, family, colleagues, sports club, etc.). Based on interactions and behavior patterns, nodes may or may not be assigned to specific groups.

The calling behavior of family members (…) indicates strong social bonds between them, which is reflected in the total number of calls and the frequency of calls.

Source: Motahari, S. (2012). The impact of social affinity on phone calling patterns: Categorizing social ties from call data records . Translation, emphasis by the author.

Even a superficial analysis of the actors and groups provides insights. Options for calculating additional characteristics can be found here.

  • Cathy and Dennis are important. Without them, two separate subgraphs would be created and the GP node would lose its connection,
  • The group Cathy, Edwin, Beate, Michael is almost fully connected and its edge weights are high,
  • The group Dennis, Gudrun, Ingo, Julian is fully connected and its edge weights are mostly high,
  • Both groups have strong social ties. They are probably families,
  • Cathy and Dennis connect the two groups. They are probably a married couple. This conclusion is supported by their shared connection to the GP node,
  • Dennis has a connection to the Family Lawyer node,
  • Ingo shows the least social interaction in his group,
  • Julian and Dennis have a connection to the Hospital actor,
  • (…).

The number of social connections can be represented in a machine-readable adjacency matrix. Algorithms can use it to calculate, for example, the shortest connection in a network, the lowest price or the intensity of social interaction. To do this, the edge weights are transferred to a matrix. The letters correspond to the actors (C=Cathy).
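As a rough sketch of what such a matrix and the calculations on it could look like: the actor names below follow the example in the text, but all edge weights are invented for illustration and are not the values from the original figure. The helper names `interaction_intensity` and `shortest_path` are likewise assumptions.

```python
from collections import deque

# Hypothetical edge weights (number of contacts between two actors).
EDGES = {
    ("Cathy", "Edwin"): 9, ("Cathy", "Beate"): 8, ("Cathy", "Michael"): 7,
    ("Edwin", "Beate"): 6, ("Beate", "Michael"): 5,
    ("Cathy", "Dennis"): 10,
    ("Dennis", "Gudrun"): 8, ("Dennis", "Ingo"): 2, ("Dennis", "Julian"): 7,
    ("Gudrun", "Ingo"): 1, ("Gudrun", "Julian"): 6, ("Ingo", "Julian"): 2,
    ("Cathy", "GP"): 2, ("Dennis", "GP"): 2,
    ("Dennis", "Lawyer"): 1,
    ("Julian", "Hospital"): 4, ("Dennis", "Hospital"): 4,
}

actors = sorted({name for pair in EDGES for name in pair})
index = {name: i for i, name in enumerate(actors)}

# Symmetric adjacency matrix for an undirected, weighted graph.
matrix = [[0] * len(actors) for _ in actors]
for (a, b), weight in EDGES.items():
    matrix[index[a]][index[b]] = weight
    matrix[index[b]][index[a]] = weight


def interaction_intensity(name: str) -> int:
    """Sum of all edge weights of one actor (row sum of the matrix)."""
    return sum(matrix[index[name]])


def shortest_path(start: str, goal: str):
    """Breadth-first search for the path with the fewest hops."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for j, weight in enumerate(matrix[index[current]]):
            neighbour = actors[j]
            if weight > 0 and neighbour not in previous:
                previous[neighbour] = current
                queue.append(neighbour)
    return None


group = ["Dennis", "Gudrun", "Ingo", "Julian"]
least_active = min(group, key=interaction_intensity)
path = shortest_path("Edwin", "Hospital")
print(least_active, path)
```

With these made-up weights, the row sums single out Ingo as the least active member of his group, and the shortest connection from Edwin to the Hospital node runs through Cathy and Dennis. This matches the bridging role the list above attributes to the two.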

 

 

2.2.1 Noise Floor, Patterns And Deviations

Our daily routine creates a kind of “data background noise”. Therefore, deviations from existing patterns can be informative. A family is characterized by strong social ties. In the present graph this can be used to identify group members. But it is also a pattern in which deviations stand out.

In our example, the Hospital node is one such deviation. Its edge weights to other actors are only moderate, but they are higher than those of the other peripheral actors.

 

 

Source: Created by the author.

 

If the node Ingo, which has an unusually low connection strength compared to the rest of the group, and the Hospital actor both disappeared from the graph, these would be further anomalies. Putting these anomalies in context, a likely interpretation would be that Ingo has died.
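The idea of a noise floor with deviations can be approximated with a very simple statistic. The sketch below is an assumption-laden illustration, not a method taken from the cited literature: it compares the contact counts of the current period with the mean and standard deviation of earlier periods and flags anything that deviates strongly, including contacts that appear for the first time. All numbers, the threshold and the function name `deviations` are invented.

```python
from statistics import mean, pstdev


def deviations(history: dict, current: dict, threshold: float = 2.0) -> dict:
    """Flag contacts whose current count deviates strongly from their history."""
    flagged = {}
    for contact, counts in history.items():
        mu, sigma = mean(counts), pstdev(counts)
        observed = current.get(contact, 0)
        spread = sigma if sigma > 0 else 1.0  # avoid division by zero
        score = (observed - mu) / spread
        if abs(score) >= threshold:
            flagged[contact] = round(score, 2)
    # Contacts without any history are deviations by definition.
    for contact in current.keys() - history.keys():
        flagged[contact] = float("inf")
    return flagged


# Hypothetical weekly contact counts from Dennis's point of view.
history = {"Gudrun": [8, 9, 7, 8], "Ingo": [2, 1, 2, 3], "GP": [1, 0, 1, 0]}
current = {"Gudrun": 8, "Ingo": 0, "GP": 1, "Hospital": 4}

flagged = deviations(history, current)
print(flagged)
```

In this toy data, the steady contact with Gudrun and the occasional GP visit stay within the noise floor, while the sudden silence of Ingo and the new Hospital contact stand out. The interpretation of such flags is left entirely to whoever holds the data.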

Our everyday lives are characterized by recurring processes and are therefore not nearly as unique as we like to believe. Our “data background noise” is comparable to that of other people, and the same applies to groups. Our social relationships have become measurable, segmentable and decomposable. Contact with a family lawyer is bound to stand out because it falls outside the usual pattern. Sometimes it is not the habits of everyday life that betray us, but the deviations from them. What is crucial is that whoever controls social metadata also holds the power of knowledge over us. That party also holds the authority to interpret this knowledge and draw conclusions from it.

3. Trace And Identify

Metadata is characterized more clearly by its properties than by any categorization. Metadata has many faces: it can be personal, sensitive and content-related.

Its mass is rooted in the inevitability of its automatic generation. Our current communication habits generate masses of metadata without our noticing. This mass of data is made accessible by structures that make information machine-readable and enable our digital communication.

Due to its structure, meta-information can be linked and generates previously unknown knowledge. If we do not notice the creation of this data, we have no idea about the creation of new knowledge about ourselves.

Arbitrary rhetorical dividing lines between content and non-content are only intended to give the impression that data retention without reason can somehow be proportionate and therefore in accordance with fundamental rights.

This data is technical data - not the content of the communication.

Source: Inquiries to the BKA. (n.d.). Data retention questions & answers. Federal Criminal Police Office.

A prime example of this rhetorical trivialization of surveillance is the claim, as found on the BKA homepage, that metadata is something technically neutral and therefore harmless. If metadata were truly neutral, it would be useless for surveillance purposes. That this cannot be the case is evident from the European directive on data retention (2006/24/EC, which amended 2002/58/EC). The telling terms “trace and identify” can be found there.

If surveillance is understood as an act of control and verification, then “tracing and identifying” is the practical activity that results from this act. Metadata is then the means of “tracing and identifying” and must therefore be the opposite of harmless. Ultimately, the purpose is not surveillance itself but repression in the sense of state law enforcement. In a constitutional state, however, repressive criminal prosecution is subject to certain conditions. The causeless and mass-scale therefore has to be reinterpreted as the specific and targeted: the entire “cyberspace” is a crime scene, and everyone who is there is a suspect. These narrative shell game tricks are needed so that new surveillance fever dreams like chat control can take on the appearance of proportionality.

In reality, they are an attack on fundamental rights and civil liberties. The fact that metadata is the appropriate means for this should speak for itself.


