What You Need to Know about Data Harvesting and How to Prevent It
August 27, 2014
Data harvesting, also known as web scraping, has long been a concern for website operators and data publishers. It is a process in which a small script, often called a malicious bot, automatically extracts large amounts of data from websites so that it can be used for other purposes. Because it is a cheap and easy way to collect online data, the technique is often used without permission to steal website information such as text, photos, email addresses, and contact lists.
One method of data harvesting targets databases in particular. The script finds a way to cycle through the records of a database and then downloads every one of them.
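To make the record-cycling idea concrete, here is a minimal sketch in Python. It assumes a hypothetical in-memory table keyed by sequential auto-number IDs; the names `RECORDS`, `fetch_record`, and `harvest` are invented for illustration and do not correspond to any real site or product.

```python
# Hypothetical in-memory "database" keyed by sequential auto-number IDs.
RECORDS = {i: {"id": i, "email": f"user{i}@example.com"} for i in range(1, 6)}


def fetch_record(record_id):
    """Stand-in for a public details page that looks up one record by its ID."""
    return RECORDS.get(record_id)


def harvest(max_id):
    """Cycle through every candidate ID and keep whatever comes back."""
    harvested = []
    for record_id in range(1, max_id + 1):
        record = fetch_record(record_id)
        if record is not None:
            harvested.append(record)
    return harvested


# Guessing IDs 1..10 is enough to collect every record in the table.
stolen = harvest(max_id=10)
print(len(stolen))
```

The loop needs no knowledge of the data beyond the fact that the IDs are sequential, which is why predictable keys make this kind of harvesting so cheap.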
Aside from the obvious consequence of data loss, data harvesting can also be detrimental to businesses in other ways:
- Poor SEO Ranking: If your website content is scraped and reproduced on other sites, the duplicated content can significantly hurt your website's ranking and performance on search engines.
- Decreased Website Speed: When used repeatedly, scraping attacks can lower the performance of your websites and affect the user experience.
- Lost Market Advantages: Your competitors may use data harvesting to scrape valuable information such as customer lists to gather intelligence about your business.
Various methods are available to help website builders protect different types of online data from being scraped. For database protection, Caspio provides several tools to help prevent your data from being targeted by malicious bots.
- CAPTCHA: One of the most effective methods to fight data harvesting is CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). It protects an ad hoc search against bots by displaying a code or test that only a human can solve, which verifies that the user is not a bot. With Caspio, you can easily implement CAPTCHA on your search forms and prevent bots from collecting your data.
- Access Control: Caspio Reports provide a built-in feature to create search criteria for authorizing access to database records. Only records that match the search criteria can be accessed, so a bot cannot use the report to reach records that fall outside those criteria. The same is true of Record Level Security, which lets you limit what a user can access down to each individual record in a database. This prevents both human users and bots from gaining access to unauthorized records.
- Complex IDs: Many databases use an auto-number or another sequential ID as the database key. If you have an Update Form or a pre-defined criteria Report based on a sequential ID, it's very easy for a bot to cycle through all your records by counting through the IDs. Using a much more complex ID, such as a GUID, is one way to address this. In Caspio, you can easily generate random unique IDs in a hidden text field in a Submission Form as new records are created. Another method is to require two fields to bring up a record, such as an ID field combined with some other value. The bot then has to guess both fields correctly in order to retrieve the record, which makes the attack much harder to automate.
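The two ideas in the Complex IDs item, random IDs and a two-field lookup, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library's `uuid.uuid4()` as a stand-in for a GUID generator, and the `RECORDS` store and the `create_record` and `lookup` helpers are hypothetical names, not Caspio features.

```python
import hmac
import uuid

# Hypothetical record store keyed by random IDs instead of auto-numbers.
RECORDS = {}


def new_record_id():
    """Return a random UUID (version 4) as a 36-character string."""
    return str(uuid.uuid4())


def create_record(email, data):
    """Store a record under a random ID and return that ID."""
    record_id = new_record_id()
    RECORDS[record_id] = {"email": email, "data": data}
    return record_id


def lookup(record_id, email):
    """Return the record's data only when both the ID and the email match."""
    record = RECORDS.get(record_id)
    if record is None:
        return None
    # compare_digest takes time independent of where the values differ.
    if not hmac.compare_digest(record["email"].encode(), email.encode()):
        return None
    return record["data"]


record_id = create_record("pat@example.com", "order history")
print(lookup(record_id, "pat@example.com"))    # both fields match
print(lookup(record_id, "guess@example.com"))  # right ID, wrong second field
```

With random IDs there is no "next" record to count up to, and the second field means that even a leaked ID is not enough on its own to retrieve the record.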
Last but not least, password-protecting your database isn’t designed specifically to prevent data harvesting, but it does limit access to only authorized and trusted users and is another measure to protect your data from unauthorized access.
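The access restrictions described above, authenticated users combined with record-level limits, boil down to filtering records by who is asking. The following Python sketch assumes a hypothetical `owner` field on each record and a hypothetical `visible_records` helper; it shows the general idea rather than how Caspio implements Record Level Security.

```python
# Hypothetical records, each tagged with the user who is allowed to see it.
RECORDS = [
    {"id": 1, "owner": "alice", "note": "Alice's first record"},
    {"id": 2, "owner": "bob", "note": "Bob's record"},
    {"id": 3, "owner": "alice", "note": "Alice's second record"},
]


def visible_records(records, user):
    """Return only the records the given user is authorized to see."""
    if user is None:
        # Unauthenticated visitors, including bots, get nothing.
        return []
    return [record for record in records if record["owner"] == user]


print([r["id"] for r in visible_records(RECORDS, "alice")])
print(visible_records(RECORDS, None))
```

Because the filter is applied before any data is returned, a bot that has no login, or that logs in as one user, cannot reach the records that belong to anyone else.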
If you’re interested in knowing more about how to implement security safeguards such as CAPTCHA using the Caspio platform, simply sign up for a free trial and try it for yourself. You can also request a project consultation to go over specific features.