
Forums before death by AOL, social media and spammers... "We can't have nice things"

   alt.privacy      Discussing privacy, laws, tinfoil hats      112,125 messages   


   Message 110,174 of 112,125   
   Fritz Wuehler to All   
   How to protect data from web scraping?   
   30 Jun 24 15:54:28   
   
   XPost: alt.privacy.anon-server   
   From: fritz@spamexpire-202406.rodent.frell.theremailer.net   
      
   https://www.datenschutz-notizen.de/how-to-protect-data-from-web-scraping-guidelines-from-the-italian-dpa-4448630/   
      
   The Italian Data Protection Authority (Garante per la protezione dei dati   
   personali, “Garante” for short) released guidelines in May 2024 aimed at   
   protecting personal data published online by public and private entities   
   (acting as data controllers) from web scraping performed by third parties.   
   While data scraping and web scraping serve many purposes, the Garante   
   focused the guidelines on the scraping practices intended to train   
   Generative Artificial Intelligence (or “GAI”) models.   
      
   The main goal of the guidelines is to advise data controllers who make   
   personal data available on their websites on how to implement appropriate   
   security measures to prevent or minimise web scraping by third parties.   
   The guidelines are the result of a fact-finding investigation that the   
   authority completed at the end of last year.   
      
   As for the legal basis of personal data processing for data scraping   
   purposes, investigations are still ongoing into whether the practice can   
   be lawful under the legitimate interest of the data controller (here, the   
   data scraping companies). The Garante has already found the web scraping   
   carried out by the US company Clearview to be unlawful.   
      
   But what is data scraping?   
   Web scraping (data scraping performed on the internet) is, in simple   
   terms, the collection of publicly available personal data on the internet   
   by third parties. The purposes of data scraping vary, but mostly relate to   
   later use of the harvested data for marketing activities or for   
   approaching prospective business partners. The term “scraping” normally   
   refers to the set of automated mechanisms (such as bots) for extracting   
   information from databases that, by design, are not intended for this   
   function. Scraping tools (specifically, web crawlers) are usually more or   
   less “intelligent” scripts that navigate the internet by automatically and   
   very quickly “scanning” web pages and following the links they contain;   
   during the “scanning”, they extract targeted data and save them locally in   
   a structured, more usable form.   
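   
   To make the mechanism concrete, here is a minimal sketch of such a   
   crawler step, in Python using only the standard library. This example is   
   not from the article; the target URL is a placeholder.   
   
   # Minimal sketch of what a scraping bot does: fetch a page, pull out
   # the links, and (in a real crawler) queue them to visit next.
   from html.parser import HTMLParser
   from urllib.request import urlopen

   class LinkCollector(HTMLParser):
       def __init__(self):
           super().__init__()
           self.links = []

       def handle_starttag(self, tag, attrs):
           # Every <a href="..."> is a candidate for the crawl queue.
           if tag == "a":
               for name, value in attrs:
                   if name == "href" and value:
                       self.links.append(value)

   def scrape(url):
       with urlopen(url, timeout=10) as resp:
           html = resp.read().decode("utf-8", errors="replace")
       parser = LinkCollector()
       parser.feed(html)
       return parser.links

   print(scrape("https://example.com/"))   # placeholder target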
      
   A typical example of web scraping is price comparison websites, which use   
   web scraping to read price information from different online shops selling   
   a certain item and then provide an overview of the prices and the shops   
   selling that product.   
      
   Data scraping, and web scraping itself, may of course be a legitimate   
   business practice (provided the data are publicly available and used   
   lawfully), and its results can be very insightful for consumers and   
   businesses alike; indeed, website owners quite often make their data   
   publicly available precisely for data scraping and other forms of   
   automatic data collection. Great attention should be paid, however, when   
   personal data are involved.   
      
   On this point, the Garante says:   
      
   “To the extent that web scraping involves the collection of information that   
   can be referred to an identified or identifiable natural person, a data   
   protection issue arises”.   
      
   As noted above, the Garante’s guidelines are addressed to the controllers   
   of the personal data made available on websites; they therefore focus not   
   on the data scraping companies but on the measures that the targeted   
   websites’ owners may apply. They do, on the other hand, focus on one   
   specific purpose of data scraping or web scraping: training GAI models.   
      
   The big datasets used by developers of generative artificial intelligence   
   models come from different sources, but web scraping is a common   
   denominator. Those developers can use datasets they have scraped   
   themselves, or pull data from third-party data lakes, which are themselves   
   populated by data scraping operations.   
      
   What does the Garante suggest?   
   Creating restricted areas:   
   Creating areas of the website that are accessible only upon registration   
   limits the public availability of data. This practice removes data from   
   indiscriminate availability, thereby reducing the opportunities for web   
   scraping. Restricted areas should nevertheless be designed in compliance   
   with the data minimisation principle, so that no additional information is   
   collected from users during registration.   
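   
   A minimal sketch of such a restricted area, assuming the Flask web   
   framework; the route names, user store and credentials are illustrative   
   assumptions, not part of the guidelines.   
   
   # Content behind this route is only served to logged-in sessions,
   # so it is never publicly reachable for an anonymous scraper.
   from flask import Flask, abort, redirect, request, session

   app = Flask(__name__)
   app.secret_key = "change-me"          # session-signing key (placeholder)

   USERS = {"alice": "correct-horse"}    # stand-in for a real user store

   @app.route("/login", methods=["POST"])
   def login():
       # Data minimisation: the form asks only what login itself needs.
       user = request.form.get("user", "")
       if USERS.get(user) == request.form.get("password", ""):
           session["user"] = user
           return redirect("/members/data")
       abort(401)

   @app.route("/members/data")
   def members_data():
       if "user" not in session:
           abort(403)   # anonymous requests (including bots) stop here
       return "personal data visible to registered users only"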
      
   Incorporation of ad hoc clauses in the terms of service:   
   Prohibiting the use of web scraping techniques in the terms of service of   
   the website creates a contractual clause that, in case of breach, allows   
   website owners to take legal action against the scraping companies for   
   breach of their contractual obligations. Although such action is taken   
   “ex post” and thus does not necessarily prevent the scraping, it can still   
   be a good deterrent and an effort to protect personal data against   
   unauthorised practices.   
      
   Monitoring network traffic:   
   Monitoring the HTTP requests a website receives may seem a simple measure,   
   but it makes it possible to detect unusual flows of data within a website   
   and react accordingly. Rate limiting, i.e. capping the number of requests   
   accepted from a given IP address in a given time window, is an additional   
   measure that reduces the traffic in the first place and therefore limits   
   web scraping activities.   
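   
   A minimal sketch of per-IP rate limiting with a sliding window, in Python   
   using only the standard library; the limit of 60 requests per minute is an   
   illustrative assumption, not a value from the Garante.   
   
   # Per-IP sliding-window rate limiter: call allow_request() for each
   # incoming HTTP request before serving it.
   import time
   from collections import defaultdict, deque

   WINDOW_SECONDS = 60
   MAX_REQUESTS = 60
   _hits = defaultdict(deque)   # ip -> timestamps of recent requests

   def allow_request(ip: str) -> bool:
       """Return True if this IP is under the limit, False to throttle."""
       now = time.monotonic()
       q = _hits[ip]
       # Drop timestamps that have left the sliding window.
       while q and now - q[0] > WINDOW_SECONDS:
           q.popleft()
       if len(q) >= MAX_REQUESTS:
           return False          # unusually fast client: likely a bot
       q.append(now)
       return True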
      
   Managing bots’ access:   
   As mentioned earlier, most data scraping on websites is carried out with   
   automated tools such as bots. Technologies that restrict bots’ access to a   
   website therefore automatically reduce automated data collection,   
   including web scraping. The Garante mentions some examples of technologies   
   with this goal:   
      
   - CAPTCHA (Completely Automated Public Turing test to tell Computers and   
   Humans Apart) verification tools that make sure a human is sitting behind   
   the request.   
      
   - Periodic changes of the HTML markup, which make the identification of a   
   web page’s structure more difficult for a bot.   
      
   - Embedding content in media objects: Embedding data in images or other media   
   makes automated extraction more complex, requiring specific technologies such   
   as optical character recognition (OCR).   
      
   - Monitoring of log files in order to block undesired user-agents, where   
   identifiable (a sketch of this follows the list).   
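   
   As an illustration of that last measure, here is a minimal sketch that   
   scans a web-server access log for user-agent strings associated with   
   common scraping tools. The log path, the combined log format and the   
   matched substrings are illustrative assumptions.   
   
   # Scan an access log and collect IPs whose user-agent looks like a
   # scraping tool; the output can feed a firewall or deny list.
   import re

   LOG_PATH = "/var/log/nginx/access.log"   # placeholder path
   SCRAPER_HINTS = ("python-requests", "scrapy", "curl", "GPTBot")

   # Combined log format: the IP is the first field, the user agent the
   # last double-quoted field on the line.
   LINE = re.compile(r'^(\S+) .* "([^"]*)"$')

   def suspicious_ips(path=LOG_PATH):
       ips = set()
       with open(path, encoding="utf-8", errors="replace") as log:
           for line in log:
               m = LINE.match(line.strip())
               if not m:
                   continue
               ip, agent = m.group(1), m.group(2)
               if any(h.lower() in agent.lower() for h in SCRAPER_HINTS):
                   ips.add(ip)
       return ips

   if __name__ == "__main__":
       for ip in sorted(suspicious_ips()):
           print(ip)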
      
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   


