On the occasion of National Data Protection Day, the question of AI built on large language models arises. These models are trained on data entered by users, data that is sometimes sensitive and confidential and that the AI cannot sort out on its own. How can we ensure it will not be disclosed?
Imagine elements of your personal life appearing in the answer of a chatbot such as ChatGPT, on someone else's screen. This is unfortunately possible, because the language model is enriched by what its users type into it. It is a particular problem in companies, where employees enter confidential financial data or proprietary source code without realizing that the model will ingest it and may well surface it again later. Research shows that leaks of this kind are not anecdotal: for ChatGPT, there are reportedly 158 such incidents per 10,000 users per month.
It is for this reason that regulations on data protection in artificial intelligence are starting to emerge. In December 2023, the European Union agreed to develop harmonized rules on artificial intelligence, which provide for a transparency obligation and the publication of a summary of the training data used for the models. Beyond regulations, which in any case lag behind the pace of innovation, there are technical solutions to protect data, and, paradoxically, it is AI itself that can carry out this task.
One approach is to tell the AI which data must be protected, a tagging step that can itself be automated by AI. Likewise, beyond chatbots, AI can monitor whether sensitive information is about to leave a company's network, as data-loss-prevention tools do. The other technical solution is to train AI on synthetic data: data created artificially to stand in for real data and protect it. This synthetic data is itself generated by AI; it consists of anonymized doubles of the real data that have the same effect on the model being trained.
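The idea of marking sensitive data before it reaches a chatbot can be illustrated with a minimal sketch. Real data-loss-prevention tools rely on trained classifiers; the regular-expression patterns below are purely illustrative assumptions, not a production filter.

```python
import re

# Illustrative patterns for common sensitive data. A real DLP system would
# use ML-based classifiers; these regexes are only a toy approximation.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    """Replace each detected sensitive span with a placeholder tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

prompt = "Contact jane.doe@example.com, card 4111 1111 1111 1111."
print(redact(prompt))
# → Contact [EMAIL REDACTED], card [CARD REDACTED].
```

A filter like this would sit between the employee and the chatbot, so confidential values never enter the model's training data in the first place.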
Concretely, in medicine for example, instead of feeding the AI real X-rays showing tumors, an AI generates equivalent images to train a machine learning system. This synthetic data has the same effect and allows the AI to support the work of radiologists just as effectively. The problem is that generating synthetic data requires building dedicated AI models, which comes at a huge cost. In practice, data protection at the AI level is currently not a priority.
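The principle behind synthetic data can be shown with a toy sketch. Production systems use generative models (such as GANs) to synthesize complex data like medical images; the example below only illustrates the core idea on made-up numbers: learn the distribution of the real values, then sample fresh records from it so no real individual's data reaches the training set.

```python
import random
import statistics

# Hypothetical "real" values (e.g. patient ages) that must stay private.
real_ages = [34, 45, 29, 61, 52, 38, 47, 55]

# Fit a simple distribution to the real data...
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# ...then sample synthetic stand-ins from it. A generative model does the
# same thing at scale for images, text, or tabular records.
random.seed(0)  # fixed seed only so the example is reproducible
synthetic_ages = [round(random.gauss(mu, sigma)) for _ in range(8)]

print(synthetic_ages)
```

The synthetic values share the overall statistics of the originals, which is what the downstream model actually learns from, while the real records never leave the secure environment.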