|    alt.privacy    |    Discussing privacy, laws, tinfoil hats    |    112,125 messages    |
|    Fritz Wuehler to All    |
|    How to protect data from web scraping? Guidelines from the Italian DPA    |
|    30 Jun 24 15:54:28    |
XPost: alt.privacy.anon-server
From: fritz@spamexpire-202406.rodent.frell.theremailer.net

https://www.datenschutz-notizen.de/how-to-protect-data-from-web-scraping-guidelines-from-the-italian-dpa-4448630/

In May 2024 the Italian Data Protection Authority (Garante per la protezione dei dati personali, Garante for short) released guidelines aimed at protecting personal data published online by public and private entities (acting as data controllers) from web scraping performed by third parties. While data scraping and web scraping serve many purposes, the Garante focused the guidelines on scraping practices intended to train Generative Artificial Intelligence ("GAI") models.

The main goal of the guidelines is to advise data controllers who make personal data available on their websites on how to implement appropriate security measures to prevent or minimise web scraping by third parties. The guidelines are the result of a fact-finding investigation the authority completed at the end of last year.

As for the legal basis of processing personal data for scraping purposes, investigations are still ongoing to determine whether the practice is lawful under the legitimate interest of the data controllers (the data scraping companies). The Garante has already deemed unlawful the web scraping carried out by the US company Clearview.

But what is data scraping?
Web scraping (data scraping performed on the internet) is, simply put, the collection by third parties of publicly available personal data on the internet. The purposes of data scraping vary widely, but mostly relate to later use of the harvested data for marketing activities or for approaching prospective business partners. The term "scraping" normally refers to the set of automated mechanisms (such as bots) for extracting information from databases that, by design, are not intended for this function. Scraping tools (specifically, web crawlers) are usually more or less "intelligent" scripts that navigate the internet by automatically and very quickly "scanning" web pages and following the links they contain; during the scan they extract the targeted data and save it locally in a structured, more usable form.

A typical example of web scraping is a price comparison website, which scrapes price information from the various online shops selling a certain item and provides an overview of the prices and of the shops selling that product.
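To make the "scan, follow links, extract, store" loop concrete, here is a minimal sketch of such a crawler in Python. It is illustrative only: the start URL, the CSS selectors (.product, .name, .price) and the CSV output are hypothetical placeholders, not anything taken from the Garante's guidelines.

import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def scrape_prices(start_url, max_pages=10):
    """Crawl a handful of pages, extracting (name, price) pairs.

    Implements the loop described above: fetch a page, extract the
    targeted data, follow the links it contains, and collect the
    results in a structured form."""
    to_visit, seen, rows = [start_url], set(), []
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        # Extract the targeted data (hypothetical .product/.name/.price markup).
        for item in soup.select(".product"):
            name, price = item.select_one(".name"), item.select_one(".price")
            if name and price:
                rows.append((name.get_text(strip=True), price.get_text(strip=True)))
        # Follow links found on the page, staying on the same site.
        for a in soup.select("a[href]"):
            nxt = urljoin(url, a["href"])
            if nxt.startswith(start_url):
                to_visit.append(nxt)
    return rows

with open("prices.csv", "w", newline="") as f:
    csv.writer(f).writerows(scrape_prices("https://shop.example/"))

A price comparison site would run such crawlers against many shops and merge the structured results; the same loop, pointed at pages containing personal data, is exactly what the guidelines aim to hinder.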
Data scraping, and web scraping in particular, may of course be a legitimate business practice (provided the data are publicly available and used lawfully), and its results can be insightful for consumers and businesses alike; in fact, website owners quite often make their data publicly available for scraping and other forms of automatic data collection. Great care should be taken, however, when personal data are involved.

On this point, the Garante says:

"To the extent that web scraping involves the collection of information that can be referred to an identified or identifiable natural person, a data protection issue arises."

As anticipated, the Garante's guidelines are addressed to the controllers of the personal data made available on websites; they therefore focus not on the data scraping companies but on the measures that the targeted websites' owners may apply. They do, on the other hand, focus on one specific purpose of data scraping or web scraping: training GAI models.

The big datasets used by developers of generative artificial intelligence models come from different sources, but web scraping is their common denominator. Those developers can use datasets they have scraped themselves, or pull data from third-party data lakes, which are in turn populated by data scraping operations.

What does the Garante suggest?

Create restricted areas:
Making areas of the website accessible only upon registration limits the public availability of data, withdrawing it from indiscriminate access and thereby reducing the opportunities for web scraping. Restricted areas should nevertheless be designed in compliance with the data minimisation principle, avoiding the collection of additional information from users during registration.

Incorporate ad hoc clauses in the terms of service:
Prohibiting web scraping techniques in the website's terms of service creates a contractual clause that, in case of a breach, allows the website owner to take legal action against the scraping companies for breach of contract. Although this remedy operates "ex post" and thus does not necessarily prevent the scraping, it can still be a good deterrent and part of an effort to protect personal data against unauthorised practices.

Monitor network traffic:
Monitoring the HTTP requests a website receives may seem a simple measure, but it makes it possible to detect unusual flows of data and to react accordingly. Rate limiting, i.e. capping the number of requests accepted from a given IP address, can serve as an additional measure that reduces the traffic in the first place and thereby limits scraping activity; a sketch of a simple per-IP limiter follows below.
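As a sketch of what such per-IP rate limiting can look like in application code (the threshold of 30 requests per minute is an arbitrary example value; the guidelines do not prescribe any numbers):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # sliding window length
MAX_REQUESTS = 30     # arbitrary example threshold, not from the guidelines

_hits = defaultdict(deque)  # client IP -> timestamps of recent requests

def allow_request(client_ip, now=None):
    """Return True if this IP is under the limit, False if the request
    should be refused (e.g. with HTTP 429 Too Many Requests)."""
    now = time.monotonic() if now is None else now
    window = _hits[client_ip]
    # Discard timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True

A web application would call allow_request() once per incoming request; in practice this is more often done at the reverse proxy (e.g. nginx's limit_req) than in application code.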
Manage access by bots:
As mentioned earlier, most data scraping on websites is carried out with automated tools such as bots. Technologies that restrict bots' access to a website automatically reduce automated data collection, including web scraping. The Garante mentions some examples of technologies with this goal (a sketch of the last item follows the list):

- CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) verification tools, which make sure a human is behind the request.

- Periodic changes to the HTML markup, which make it harder for a bot to identify a web page's structure.

- Embedding content in media objects: embedding data in images or other media makes automated extraction more complex, requiring additional technologies such as optical character recognition (OCR).

- Monitoring log files in order to block undesired user agents, where identifiable.
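As a rough illustration of that last measure, the sketch below scans an access log in the common Apache/nginx "combined" format for known scraper user agents. The log path and the blocklist entries are illustrative assumptions; in practice the resulting IPs would be fed into a firewall or reverse-proxy deny list.

BLOCKED_AGENTS = ("python-requests", "scrapy", "go-http-client")  # example entries
ACCESS_LOG = "/var/log/nginx/access.log"                          # hypothetical path

def suspicious_ips(log_path):
    """Collect client IPs whose user agent matches a blocked pattern.

    Assumes the "combined" log format, where splitting a line on
    double quotes yields the request as the first quoted field and
    the user agent as the third."""
    ips = set()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            parts = line.split('"')
            if len(parts) < 6:
                continue                  # not a combined-format line
            ip = parts[0].split()[0]      # first field is the client IP
            agent = parts[5].lower()      # third quoted field is the user agent
            if any(bad in agent for bad in BLOCKED_AGENTS):
                ips.add(ip)
    return ips

for ip in sorted(suspicious_ips(ACCESS_LOG)):
    print(ip)   # e.g. feed into a firewall deny rule

Note that user-agent strings are trivially spoofed, so this only catches unsophisticated scrapers; hence the Garante's caveat "where identifiable".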
[continued in next message]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)