“While Internet encryption increases the grasp of platforms on data, AI may help network operators pursue their sovereign operations.”
Internet traffic encryption yesterday, today and tomorrow
Traffic encryption is the practice that aims to make data exchanged over networks incomprehensible to unauthorized observers.
Figure 1: End-to-end traffic encryption on the Internet
On the Internet, traffic encryption was popularized by Netscape with the introduction in 1995 of SSL, the ancestor of TLS, and has grown with the rise of electronic commerce. Traffic encryption allows Internet users to verify that the sites they connect to match the names displayed in the browser's address bar (although it does not protect against look-alike names). In addition, traffic encryption makes the data exchanged over the network between the browser and these sites incomprehensible to third parties: an attacker intercepting communications on the network cannot access their content, such as bank card numbers.
The mass eavesdropping revealed by the Snowden affair in 2013 triggered a push for pervasive traffic encryption. The same year, the Internet Engineering Task Force (IETF), the body coordinating the development of Internet standards, declared mass eavesdropping a technical attack on the privacy of Internet users [1]. Since then, Internet protocols have adopted encryption on a massive scale.
In 2020, encrypted Internet traffic was estimated to account for around 85% of the total worldwide volume [2]. On Orange networks, this share rose from 50% at the end of 2015 to 85% in early 2021. For the applications of certain players such as Google, the share of encrypted traffic is approaching 100% [3].
In its current form, encryption applies to application content, such as the videos watched, their titles or the web pages browsed. However, current encryption is imperfect and privacy-sensitive information still passes unencrypted over networks: some is contained in the 15% of traffic volume that is unencrypted, and some is exchanged in the clear while encryption is being set up. More specifically, applications today still emit in clear text the name of the Internet service that users are accessing (for example https://www.youtube.com/). This happens twice: with DNS, when the service name is translated into a network address, and in the “Server Name Indication” of the TLS handshake, when encryption is being set up.
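To make the DNS case concrete, the following minimal sketch builds a classic (unencrypted) DNS query by hand and shows that the service name appears verbatim in the bytes placed on the wire. The helper name build_dns_query and the query identifier are illustrative assumptions; nothing is sent over the network.

```python
import struct


def build_dns_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal classic DNS query (type A, class IN) for `name`."""
    # Header: id, flags (recursion desired), 1 question, 0 other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.split(".")
    ) + b"\x00"
    # QTYPE = 1 (A record), QCLASS = 1 (Internet)
    question = qname + struct.pack("!HH", 1, 1)
    return header + question


query = build_dns_query("www.youtube.com")
# The labels "www", "youtube" and "com" are readable as-is in the payload
print(query)
```

Any observer on the path can read these labels without any key, which is what the encryption of the name translation service [4] is meant to prevent.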
Internet players are strongly mobilized to close this gap and encrypt the service name: experiments are being carried out, and interoperable standards exist [4] or are being designed [5].
Less avowed motivations for encrypting the service name
Encryption secures e-commerce, improves privacy, and makes mass eavesdropping from networks difficult.
Platforms promoting encryption also benefit in terms of image and gain increased control over the data exchanged.
With the encryption of the name-to-Internet-address translation service [4], and thanks to their market power, they can even capture additional personal data by pushing the use of their own translation service instead of the ones provided by network operators.
Consequences for network operators
Traffic encryption has numerous consequences for network operation [6].
As a result, certain practices of network operators have been largely transformed.
Caching of popular content is one example. Caching helps optimize the quality of experience for users and the costs of networks and platforms. Before encryption became widespread, caching was transparent, that is to say performed at the network level without the involvement of platforms. Today, caching is implemented under the control of the platform, which holds the encryption keys. This example illustrates that, despite messaging focused on the protection of personal data, encryption also serves other platform objectives: here, avoiding loss of control over content.
Once service name encryption is in effect, other practices will be vastly transformed. This is particularly true of operations that rely on traffic classification by application category (e.g. video streaming, chat, webmail) or by service provider (e.g. YouTube, Gmail, WhatsApp).
Why classify traffic?
Classification reveals the market shares of applications or categories of applications. Usage analysis by customer segment and by offer in turn makes it possible to bring more relevant offers to the market.
Traffic classification also makes it possible to measure quality of service by application or category of applications. This helps predict how traffic will evolve, in order to adapt network capacities and offer the best possible quality to users.
For example, traffic classification helped anticipate Netflix’s move to 4K. It also helps prepare for the impact of a football match on data rates, depending on the broadcaster.
In fraud detection, traffic classification helps identify misuse, for example when unbilled traffic is used for applications other than those intended.
How to classify encrypted traffic?
As long as traffic leaves service names in the clear, classification can be done directly by reading unencrypted protocol information, such as service names, in the IP packets that make up the application traffic. This is known as Deep Packet Inspection, or DPI.
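As an illustration of this kind of inspection, the sketch below reads the Server Name Indication from a TLS ClientHello. It is a simplified example, not a production DPI engine: it assumes the whole ClientHello fits in a single TLS record, and the ClientHello it parses is a synthetic one built by the hypothetical helper build_client_hello (zeroed random, one cipher suite, a single extension).

```python
import struct
from typing import Optional


def build_client_hello(server_name: str) -> bytes:
    """Build a synthetic, minimal ClientHello carrying only an SNI extension."""
    name = server_name.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name  # name_type 0 = host_name
    sni_data = struct.pack("!H", len(entry)) + entry       # server_name_list
    extensions = struct.pack("!HH", 0, len(sni_data)) + sni_data  # extension type 0
    body = (
        b"\x03\x03"                            # client_version
        + bytes(32)                            # random (zeroed for the example)
        + b"\x00"                              # empty session_id
        + struct.pack("!H", 2) + b"\x13\x01"   # one cipher suite
        + b"\x01\x00"                          # one compression method (null)
        + struct.pack("!H", len(extensions)) + extensions
    )
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body  # type 1 = ClientHello
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake


def extract_sni(record: bytes) -> Optional[str]:
    """Return the clear-text server name of a ClientHello record, if any."""
    try:
        # 0x16 = handshake record, 0x01 = ClientHello message
        if record[0] != 0x16 or record[5] != 0x01:
            return None
        pos = 9 + 2 + 32                       # headers, client_version, random
        pos += 1 + record[pos]                 # session_id
        pos += 2 + int.from_bytes(record[pos:pos + 2], "big")  # cipher_suites
        pos += 1 + record[pos]                 # compression_methods
        ext_end = pos + 2 + int.from_bytes(record[pos:pos + 2], "big")
        pos += 2
        while pos + 4 <= ext_end:
            ext_type, ext_len = struct.unpack("!HH", record[pos:pos + 4])
            pos += 4
            if ext_type == 0:                  # server_name extension
                # skip list length (2 bytes) and name_type (1 byte)
                name_len = int.from_bytes(record[pos + 3:pos + 5], "big")
                return record[pos + 5:pos + 5 + name_len].decode("ascii")
            pos += ext_len
    except (IndexError, struct.error):
        pass                                   # truncated or malformed input
    return None


hello = build_client_hello("www.youtube.com")
print(extract_sni(hello))
```

The parser needs no key material: it only walks length-prefixed fields that are sent before encryption starts.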
Figure 2: Encrypted traffic classification
With the upcoming encryption of service names, deep packet inspection will no longer directly reveal the application. The question that arises is therefore: do the intrinsic characteristics of the encrypted flows that make up network traffic make it possible to identify the application or the category of application?
The intrinsic characteristics of flows are the properties of encrypted traffic that remain observable. These are primarily three time series: packet sizes, inter-packet delays and traffic direction (user to server or server to user). They may also include some unencrypted information elements in the exchanged packets, as well as statistics summarizing the flow, such as its duration, throughput and data volume per direction. Finally, the information exchanged during the establishment of the encryption session is rich and can also help identify the application.
Orange, in collaboration with the University of Waterloo in Canada, has developed an artificial intelligence (AI) model, composed of convolutional neural networks and long short-term memory (LSTM) recurrent networks, that classifies categories of applications (web browsing, video streaming, chat, etc.) with a success rate of 96% and applications (Facebook, Gmail, Netflix, etc.) with a success rate of 97%. The tests covered classification into 8 categories of applications (see Figure 4) and 19 applications.
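The observable characteristics listed above can be sketched as a small feature extraction step. This is only an illustration of the idea, not the feature pipeline of the Orange-Waterloo model: the Packet type, the flow_features function and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Packet:
    timestamp: float  # capture time, in seconds
    size: int         # packet size, in bytes
    upstream: bool    # True = user to server, False = server to user


def flow_features(packets: List[Packet]) -> Dict[str, object]:
    """Derive the three time series and summary statistics of one flow."""
    packets = sorted(packets, key=lambda p: p.timestamp)
    # Time series that stay observable even when the payload is encrypted
    sizes = [p.size for p in packets]
    delays = [b.timestamp - a.timestamp for a, b in zip(packets, packets[1:])]
    directions = [1 if p.upstream else -1 for p in packets]
    # Summary statistics over the whole flow
    duration = packets[-1].timestamp - packets[0].timestamp if packets else 0.0
    up_bytes = sum(p.size for p in packets if p.upstream)
    down_bytes = sum(p.size for p in packets if not p.upstream)
    return {
        "sizes": sizes,
        "delays": delays,
        "directions": directions,
        "duration": duration,
        "up_bytes": up_bytes,
        "down_bytes": down_bytes,
        "up_throughput": up_bytes / duration if duration > 0 else 0.0,
        "down_throughput": down_bytes / duration if duration > 0 else 0.0,
    }


# Made-up flow: a small request followed by two large responses and an ack
flow = [
    Packet(0.0, 200, True),
    Packet(0.5, 1400, False),
    Packet(1.0, 1400, False),
    Packet(2.0, 100, True),
]
features = flow_features(flow)
print(features)
```

None of these values requires decrypting anything, which is why they remain usable as classifier inputs once service names are hidden.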
Figure 3: AI-based traffic classification
The Orange-Waterloo solution outperforms other state-of-the-art solutions, such as C4.5 [7], a decision-tree-based solution widely used for classification, which achieves a success rate of only 81%, and CNN UCDavis [8], a neural network solution, which has a detection rate of 91%. CNN UCDavis comes fairly close to Orange-Waterloo in detection rate, but it produces many false positives (misclassifications): Orange-Waterloo reduces the false-positive rate by 50% compared with CNN UCDavis. With the Orange-Waterloo solution, precision (the proportion of correct predictions among all predictions made for a category) and recall (the proportion of samples truly belonging to a category that are correctly predicted) are between 90% and 100% depending on the category. The f1-score, which combines precision and recall, averages around 94% overall, demonstrating the good quality of the model (see Figure 5).
Figure 4: Confusion matrix for the classification of application categories
Figure 5: Precision, recall and f1-score values for the classification of application categories
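For readers unfamiliar with these metrics, the sketch below shows how precision, recall and f1-score are derived per category from a confusion matrix. The two categories and the counts are made up for illustration and are not the results of the study; rows are assumed to hold the true category and columns the predicted one.

```python
from typing import Dict, List, Tuple


def per_class_metrics(
    confusion: List[List[int]], labels: List[str]
) -> Dict[str, Tuple[float, float, float]]:
    """Return (precision, recall, f1) per label; rows = true, columns = predicted."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column sum
        actual = sum(confusion[i])                    # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denominator = precision + recall
        # f1 is the harmonic mean of precision and recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Illustrative counts only: 10 true "video" flows and 10 true "chat" flows
labels = ["video", "chat"]
confusion = [
    [9, 1],  # true video: 9 predicted video, 1 predicted chat
    [3, 7],  # true chat:  3 predicted video, 7 predicted chat
]
metrics = per_class_metrics(confusion, labels)
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy example the "video" category has high recall but lower precision, because several chat flows are wrongly predicted as video; this is the kind of false positive discussed in the comparison above.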
These research results were accepted for publication at the flagship international conference in the field, SIGMETRICS 2021, and received the best student paper award. The conference paper can be viewed here [9].
Perspectives
Internet traffic encryption is meant to enhance the security of exchanges and to protect user privacy against attacks carried out from networks. It does not protect sensitive data from attacks once stored, or from misuse by platforms. The focus of mistrust on networks could even allow certain platforms to collect additional personal data.
Internet traffic encryption disrupts network operations based on traffic classification: designing relevant network offerings, improving customer experience and detecting fraud. Ultimately, there is a risk of a transfer of control to the Internet platforms that hold the necessary data. This raises questions about the evolution of the telecoms market and about sovereignty.
However, thanks to artificial intelligence, the intrinsic characteristics of encrypted flows may still make it possible to identify applications and categories of applications.
In the method designed, an artificial intelligence algorithm is trained on “ground truth”: it learns to recognize which encrypted flow corresponds to which application, with the application concerned known in advance.
With improved encryption, “ground truth” will no longer be available, so a future version of the algorithm will have to do without it. Another challenge will be to operationalize these artificial intelligence techniques so that they can be applied to the network operations concerned. In particular, this will probably require network cards capable of executing such algorithms at the high speeds of network backbones, at a cost compatible with the stakes of classification.
References
[1] Pervasive Monitoring Is an Attack
[2] Fortinet Blog
[3] Google transparency report
[5] TLS Encrypted Client Hello
[6] Effects of Pervasive Encryption on Operators
[7] C4.5 Algorithm
[8] CNN UCDavis
[9] A Look Behind the Curtain: Traffic Classification in an Increasingly Encrypted Web
A detailed version of this work is available in the journal “Proceedings of the ACM on Measurement and Analysis of Computing Systems”, volume 5, pages 1-26, published in 2021.