{"id":16724,"date":"2019-11-18T12:44:28","date_gmt":"2019-11-18T17:44:28","guid":{"rendered":"https:\/\/vieux.ivado.ca\/projets_de_recherche\/lenjeu-danonymisation-des-documents\/"},"modified":"2020-04-29T09:34:27","modified_gmt":"2020-04-29T13:34:27","slug":"the-challenges-of-document-anonymization","status":"publish","type":"projets_de_recherche","link":"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/","title":{"rendered":"Anonymize documents"},"content":{"rendered":"<p>Source documents obtained as part of a process of data collection or an access-to-information request may be redacted or anonymized; that is, sensitive data will have been blacked out, or replaced by fictitious data, to prevent individuals from being identified. Scrutinizing page after page of text to find proper names, addresses, or medical, financial and other information to achieve the desired result can be a laborious task\u2014unless it is entrusted to algorithms. This is precisely the goal of a research project that we are conducting in partnership with Irosoft, a company specialized in data valorization.<\/p>\n<p>With certain deep-learning techniques, we can \u201cteach\u201d algorithms to locate sensitive information in a text. Training them to do so involves working with documents in which that kind of information has previously been tagged manually. Of course, no documents can contain all first names, names of cities, dates of birth, etc. in existence, but algorithms can learn to identify them from context. 
For example, they will recognize a person\u2019s name when it is preceded by \u201cMr.\u201d or \u201cMs.\u201d These types of algorithms already exist, but they are always trained using the same sets of texts, for strictly academic purposes, to improve performance.<\/p>\n<p>\u201cFor 15 years, researchers have been training algorithms using the same data: news articles, mostly sports stories,\u201d Philippe Langlais explains.<\/p>\n<blockquote><p>\u201cThis is a great collaborative effort between industry players and researchers; it\u2019s productive for both sides and something we want to continue.\u201d<\/p><\/blockquote>\n<p>&#8211; Alain Lavoie, President and co-founder, <a href=\"https:\/\/www.irosoft.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Irosoft<\/a><\/p>\n<p>Irosoft, however, must meet anonymization requirements for medical, legal, financial and other documents that contain specific types of sensitive information, such as names of drugs or financial institutions. In these contexts, a common noun can become a sensitive data element. For instance, \u201cin legal documents where it is ubiquitous, the word <em>judgment<\/em> may be insignificant, but it could become a clue to identifying a person in another field,\u201d Alain Lavoie points out. Fortunately, there are corpora of documents from a variety of fields for which sensitive information has already been tagged.<\/p>\n<p>Professor Langlais uses this type of corpus to train and test the algorithm. \u201cIn each of these corpora, there are sensitive data, and we used them to test algorithms that had been trained on other corpora,\u201d he notes. \u201cIt turns out that from one field to another, the algorithm learns differently, so adaptations are necessary for it to be applicable to a field other than the one it was originally trained on. 
The solution is to find concordances of tags that allow switching from one field to another.\u201d<\/p>\n<blockquote><p>\u201cWith Irosoft, we studied algorithms in situations that differ from those of the academic milieu. We asked questions that scientists never ask.\u201d<\/p><\/blockquote>\n<p>&#8211; Philippe Langlais, professor and project lead in the Universit\u00e9 de Montr\u00e9al <a href=\"https:\/\/diro.umontreal.ca\/english\/home\/\" target=\"_blank\" rel=\"noopener noreferrer\">Department of Computer Science and Operations Research<\/a><\/p>\n<p>However, he emphasizes, \u201crecognizing sensitive information is only a first step toward anonymization.\u201d Disguising the information to prevent individuals from being identified, while ensuring the text remains intelligible, is another matter entirely.<\/p>\n<div class=\"six columns\"><div class=\"gdlr-item gdlr-column-shortcode\"><p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-13237 aligncenter\" src=\"https:\/\/vieux.ivado.ca\/wp-content\/uploads\/2019\/11\/alain-lavoie.jpg\" alt=\"\" width=\"183\" height=\"122\" \/>Alain Lavoie<br \/>\nPresident and co-founder<br \/>\n<a href=\"https:\/\/www.irosoft.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Irosoft<\/a><\/p>\n<\/div><\/div>\n<div class=\"six columns\"><div class=\"gdlr-item gdlr-column-shortcode\"><p><a style=\"margin-top: 71px; display: block;\" href=\"https:\/\/www.irosoft.com\/\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-9555\" src=\"https:\/\/vieux.ivado.ca\/wp-content\/uploads\/2018\/08\/Irosoft_500x300.png\" alt=\"\" width=\"122\" height=\"122\" \/><\/a><\/p>\n<\/div><\/div>\n<div class=\"six columns\"><div class=\"gdlr-item gdlr-column-shortcode\"><p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-13237\" 
src=\"https:\/\/vieux.ivado.ca\/wp-content\/uploads\/2019\/11\/Philippe-Langlais.jpg\" alt=\"\" width=\"183\" height=\"122\" \/>Philippe Langlais<br \/>\nProfessor<br \/>\n<a href=\"https:\/\/diro.umontreal.ca\/accueil\/\" target=\"_blank\" rel=\"noopener noreferrer\">DIRO, Universit\u00e9 de Montr\u00e9al<\/a><\/p>\n<\/div><\/div>\n<div class=\"six columns\"><div class=\"gdlr-item gdlr-column-shortcode\"><p><a style=\"display: block; clear: both;\" href=\"https:\/\/diro.umontreal.ca\/accueil\/\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-14062 size-medium\" src=\"https:\/\/vieux.ivado.ca\/wp-content\/uploads\/2018\/08\/DIRO_500x300.png\" alt=\"\" width=\"300\" height=\"91\" \/><\/a><\/p>\n<\/div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Alain Lavoie<br \/>\nPresident and co-founder, Irosoft<\/p>\n<p>Philippe Langlais<br \/>\nProfessor and project lead in the Universit\u00e9 de Montr\u00e9al DIRO<\/p>\n","protected":false},"featured_media":16703,"template":"","categories":[217],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v22.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Anonymize documents - IVADO<\/title>\n<meta name=\"robots\" content=\"noindex, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Anonymize documents - IVADO\" \/>\n<meta property=\"og:description\" content=\"Alain Lavoie President and co-founder, Irosoft  Philippe Langlais Professor and project lead in the Universit\u00e9 de Montr\u00e9al DIRO\" \/>\n<meta property=\"og:url\" content=\"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/\" \/>\n<meta property=\"og:site_name\" content=\"IVADO\" \/>\n<meta property=\"article:modified_time\" 
content=\"2020-04-29T13:34:27+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/vieux.ivado.ca\/wp-content\/uploads\/2019\/11\/MONTAGE-anonymization.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1140\" \/>\n\t<meta property=\"og:image:height\" content=\"383\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/\",\"url\":\"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/\",\"name\":\"Anonymize documents - IVADO\",\"isPartOf\":{\"@id\":\"https:\/\/vieux.ivado.ca\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/vieux.ivado.ca\/wp-content\/uploads\/2019\/11\/MONTAGE-anonymization.jpg\",\"datePublished\":\"2019-11-18T17:44:28+00:00\",\"dateModified\":\"2020-04-29T13:34:27+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/#primaryimage\",\"url\":\"https:\/\/vieux.ivado.ca\/wp-content\/upload
s\/2019\/11\/MONTAGE-anonymization.jpg\",\"contentUrl\":\"https:\/\/vieux.ivado.ca\/wp-content\/uploads\/2019\/11\/MONTAGE-anonymization.jpg\",\"width\":1140,\"height\":383},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Accueil\",\"item\":\"https:\/\/vieux.ivado.ca\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Projets de recherche\",\"item\":\"https:\/\/vieux.ivado.ca\/projets_de_recherche\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Anonymize documents\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/vieux.ivado.ca\/#website\",\"url\":\"https:\/\/vieux.ivado.ca\/\",\"name\":\"IVADO\",\"description\":\"Institut de valorisation des donn\u00e9es\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/vieux.ivado.ca\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Anonymize documents - IVADO","robots":{"index":"noindex","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"og_locale":"en_US","og_type":"article","og_title":"Anonymize documents - IVADO","og_description":"Alain Lavoie President and co-founder, Irosoft  Philippe Langlais Professor and project lead in the Universit\u00e9 de Montr\u00e9al DIRO","og_url":"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/","og_site_name":"IVADO","article_modified_time":"2020-04-29T13:34:27+00:00","og_image":[{"width":1140,"height":383,"url":"https:\/\/vieux.ivado.ca\/wp-content\/uploads\/2019\/11\/MONTAGE-anonymization.jpg","type":"image\/jpeg"}],"twitter_card":"summary_large_image","twitter_misc":{"Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/","url":"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/","name":"Anonymize documents - IVADO","isPartOf":{"@id":"https:\/\/vieux.ivado.ca\/#website"},"primaryImageOfPage":{"@id":"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/#primaryimage"},"image":{"@id":"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/#primaryimage"},"thumbnailUrl":"https:\/\/vieux.ivado.ca\/wp-content\/uploads\/2019\/11\/MONTAGE-anonymization.jpg","datePublished":"2019-11-18T17:44:28+00:00","dateModified":"2020-04-29T13:34:27+00:00","breadcrumb":{"@id":"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/#primaryimage","url":"https:\/\/vieux.ivado.ca\/wp-content\/uploads\/2019\/11\/MONTAGE-anonymization.jpg","contentUrl":"https:\/\/vieux.ivado.ca\/wp-content\/uploads\/2019\/11\/MONTAGE-anonymization.jpg","width":1140,"height":383},{"@type":"BreadcrumbList","@id":"https:\/\/vieux.ivado.ca\/en\/research_projects\/the-challenges-of-document-anonymization\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Accueil","item":"https:\/\/vieux.ivado.ca\/en\/"},{"@type":"ListItem","position":2,"name":"Projets de recherche","item":"https:\/\/vieux.ivado.ca\/projets_de_recherche\/"},{"@type":"ListItem","position":3,"name":"Anonymize 
documents"}]},{"@type":"WebSite","@id":"https:\/\/vieux.ivado.ca\/#website","url":"https:\/\/vieux.ivado.ca\/","name":"IVADO","description":"Institut de valorisation des donn\u00e9es","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/vieux.ivado.ca\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"}]}},"_links":{"self":[{"href":"https:\/\/vieux.ivado.ca\/en\/wp-json\/wp\/v2\/projets_de_recherche\/16724\/"}],"collection":[{"href":"https:\/\/vieux.ivado.ca\/en\/wp-json\/wp\/v2\/projets_de_recherche\/"}],"about":[{"href":"https:\/\/vieux.ivado.ca\/en\/wp-json\/wp\/v2\/types\/projets_de_recherche\/"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/vieux.ivado.ca\/en\/wp-json\/wp\/v2\/media\/16703\/"}],"wp:attachment":[{"href":"https:\/\/vieux.ivado.ca\/en\/wp-json\/wp\/v2\/media\/?parent=16724"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/vieux.ivado.ca\/en\/wp-json\/wp\/v2\/categories\/?post=16724"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}