Recently, the topic of data scraping has been in the news. But what is it? How do people do it? Why would anyone want to do it? Are there any dangers associated with it? And what can be done to deal with it?
What is a data scraper?
A data scraper is a person or tool that extracts data generated by another program. The most common use is web scraping, where the scraper captures various types of data from a website.
A web scraper imports the data and transfers it into a spreadsheet for various reasons, including conducting research for web content or business intelligence; gathering prices for travel booking and price comparison sites; finding sales leads or conducting market research by crawling public data sources; and sending product data from an e-commerce site to another online vendor.
In this sense, when the scraping of public data is done to gain insights and not to make a profit or cause harm to individuals, there can be beneficial uses.
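To make the "capture data from a page and transfer it into a spreadsheet" idea concrete, here is a minimal Python sketch using only the standard library. The page markup, the CSS class names, and the names `ProductParser` and `scrape_to_csv` are all hypothetical and chosen for illustration; real scrapers typically fetch live pages, and whether doing so is permitted depends on the site's terms of service and applicable privacy law.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page markup, inlined so the sketch runs without network access.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">39.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per product found
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})  # start a new product row
        elif tag == "span" and css_class in ("name", "price") and self.rows:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that appears inside a name or price span.
        if self._field is not None:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_to_csv(html: str) -> str:
    """Parse the HTML and return the extracted rows as CSV text."""
    parser = ProductParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(parser.rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(scrape_to_csv(SAMPLE_HTML))
```

The CSV text returned by `scrape_to_csv` can be saved to a file and opened in any spreadsheet program, which is the "transfer into a spreadsheet" step described above.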
The dark side of web scraping
But there is a dark side to data scraping, involving things such as email harvesting, where email addresses are collected and sold to spammers or scammers. It is important to note that email harvesting is considered to be a bad marketing practice and also contrary to the privacy laws of some jurisdictions. For example, Canada’s federal privacy law, the Personal Information Protection and Electronic Documents Act (PIPEDA), clearly prohibits email harvesting.
Another important example to keep in mind is the joint investigation regarding Clearview AI, which I have written about previously. The Office of the Privacy Commissioner of Canada, the Commission d’accès à l’information du Québec, the Office of the Information and Privacy Commissioner of British Columbia and the Office of the Information and Privacy Commissioner of Alberta (collectively referred to as the Offices) concluded that Clearview AI violated the privacy rights of Canadians.
The Offices concluded that biometric facial information was sensitive in almost all circumstances—it was intrinsically, and in most instances permanently, linked to the individual. It was distinctive, unlikely to change over time, difficult to modify, and largely unique to the individual. Simply put, facial biometric information was particularly sensitive.
And when Clearview AI scraped facial information from websites, it should first have obtained express opt-in consent before collecting the images of any individual in Canada. Further, the stated purposes of helping law enforcement were neither appropriate nor legitimate; this represented the mass identification and surveillance of individuals by a private entity in the course of commercial activity.
A recent example that may hit close to home for many
One very recent example of data scraping is the LinkedIn web scraping that took place in the spring and summer of 2021. It was reported that a hacker first posted 500 million LinkedIn records for sale on a hacker forum. Subsequently, the number of records that were scraped and placed for sale on the Dark Web rose to 700 million.
The saga continued shortly after this, when more data was added to the collection. The data, totaling one billion LinkedIn records, was scraped from public LinkedIn profiles and other websites and contained further pieces of personal data. The hacker provided screenshots to prove that several types of data were exposed, neatly organized into categories in a spreadsheet. Ultimately, the personal data that was scraped included full names; email addresses and passwords; locations; phone and fax numbers; websites; LinkedIn profiles; company names and job titles; and LinkedIn connections.
Needless to say, this incident was concerning, given the potential for spamming, scamming, and identity theft of individuals and business owners who used LinkedIn.
At this point, it is important to note that data scraping is not permitted by LinkedIn under the user agreement involving members or the terms of service agreement involving recruiters. In fact, statements made by LinkedIn in April 2021 and June 2021 emphasized that this recent scraping activity violated LinkedIn’s terms of service:
When anyone tries to take member data and use it for purposes LinkedIn and our members haven’t agreed to, we work to stop them and hold them accountable.
It was also confirmed that this did not constitute a data breach since no private LinkedIn member data was exposed—the data was scraped from LinkedIn and other websites. In fact, a rash of data scraping has been reported recently, hitting other social media platforms including Facebook and Clubhouse.
How does data scraping differ from a data breach?
These social media companies have strongly pointed out that there has been no data breach since only public information was scraped—no private member information was hacked.
Conversely, when there has been a data breach, private member information held by an organization is hacked and certain obligations are consequently triggered. For example, under PIPEDA, a breach of security safeguards refers to the loss of, unauthorized access to, or unauthorized disclosure of personal information resulting from a breach of an organization’s security safeguards or from a failure to establish those safeguards. PIPEDA has reporting and notification requirements (to the Privacy Commissioner and affected individuals respectively), record-keeping requirements, and very serious consequences for noncompliance. Further details can be found in the Breach of Security Safeguards Regulations.
What can organizations take from this?
Data scraping can have some beneficial uses. When these uses become questionable, however, organizations should review the consent provisions in PIPEDA, the Guidelines for Obtaining Meaningful Consent, and their own policies and procedures to ensure that they are in compliance with privacy laws. In addition, it is important for organizations to appreciate the sensitive nature of biometric information, and the particularly sensitive nature of facial biometric information, when examining consent and the purposes of collection, use, and disclosure of personal information.
And if individuals and business owners find that they may have been affected by a data scraping incident, the following is recommended:
- create new and different passwords for online accounts
- use a password manager or create complicated, unique, and lengthy passwords
- use antivirus software
- use two-factor authentication
- stay away from suspicious messages
- visit the actual social media account to determine if something is wrong with an account
- use a VPN
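For readers who do not use a password manager, the advice above about complicated, unique, and lengthy passwords can be illustrated with a short Python sketch built on the standard library's `secrets` module. The function name `generate_password`, the default length of 20, and the 12-character minimum are assumptions made for this example, not requirements drawn from PIPEDA or any other source.

```python
import secrets
import string

# Candidate characters: letters, digits, and punctuation.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_password(length: int = 20) -> str:
    """Return a random password containing all four character classes.

    The 12-character minimum is an assumption made for this sketch.
    """
    if length < 12:
        raise ValueError("length should be at least 12 characters")
    while True:
        # secrets.choice draws from a cryptographically strong random source.
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)
                and any(c in string.punctuation for c in candidate)):
            return candidate


if __name__ == "__main__":
    print(generate_password())
```

Generating a fresh password for each account keeps passwords unique, so that one exposed credential cannot be reused elsewhere.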
In terms of data breaches, organizations should enhance their cybersecurity and create an incident response plan. Of course, if there has been a data breach, it is necessary to immediately comply with PIPEDA and the Breach of Security Safeguards Regulations.