Sometimes we’d love to save every single web page as we browse. We could press Ctrl+S on each page, one by one, but that would take too much time. With this Chrome extension and PHP script, we can enjoy browsing and leave the saving to the computer.
This article was adapted from an article by Craige Thomas, Automatically Save HTML Of Every Page You Visit, written in 2011. Over the last 3 years the Google Chrome API has been updated, and the changes have rendered Craige’s code obsolete: it won’t work with recent versions of Google Chrome. The code used in this article has been modified so that it works with the most recent version of Google Chrome (version 36).
The code consists of 2 parts: a Google Chrome extension and a simple PHP script. The workflow is basically as follows:
1. After the web page has been loaded, the Google Chrome extension sends an AJAX request containing the HTML of the web page to the PHP script hosted on a local server.
2. The PHP script receives the request and saves the HTML as a file on the local server.
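To make step 2 concrete, here is a minimal Python sketch of the "save" half of the workflow. It is not the project's “recv.php”; it only illustrates the idea of turning a page URL into a file name and writing the HTML to disk. The function names and the naming scheme (timestamp plus a cleaned-up URL) are assumptions made for this illustration.

```python
import re
import tempfile
import time
from pathlib import Path


def safe_filename(url: str, timestamp: float) -> str:
    """Build a file name from a URL (hypothetical naming scheme)."""
    # Collapse every run of characters other than letters, digits, dots
    # and dashes into a single underscore, and cap the length.
    slug = re.sub(r"[^A-Za-z0-9.-]+", "_", url).strip("_")[:100]
    return f"{int(timestamp)}_{slug}.html"


def save_capture(html: str, url: str, folder: Path) -> Path:
    """Write one captured page into `folder` and return the new file's path."""
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / safe_filename(url, time.time())
    target.write_text(html, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Stand-in for the "Captured" folder, created in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_capture(
            "<html><body>Hello</body></html>",
            "http://example.com/page?id=1",
            Path(tmp) / "Captured",
        )
        print(saved.name)
```

Prefixing the name with a timestamp is one simple way to avoid overwriting an earlier capture of the same page; the real script may name files differently.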
Installation Instructions:
1. Download the source code for html-capturer-chrome-extension from GitHub and extract the zip file. Notice the 2 folders extracted from the zip: “src” and “localserver”. “src” contains the source code of the Google Chrome extension, while “localserver” contains the PHP script “recv.php”, which should be hosted on the local server. Any files outside those folders can be ignored.
2. Install XAMPP if you don’t have it yet and start the Apache server. Tutorials can be found on the internet; this is one example (in Bahasa Indonesia).
3. Install Google Chrome if you don’t have it yet.
4. Open the htdocs folder in XAMPP (usually “C:\xampp\htdocs\” on Windows or “/opt/lampp/htdocs” on Ubuntu). Create a folder named “capturer” in the htdocs folder.
5. Inside the folder “capturer”:
5.a. paste the file “recv.php”, and
5.b. create a folder named “Captured”.
Warning: Mind the letter case. Make sure every name is written exactly as in the instructions, since both PHP and JavaScript are case-sensitive.
6. Now, you should have the following files and folders in your htdocs:
– …\htdocs\capturer\recv.php
– …\htdocs\capturer\Captured
“recv.php” is the script that receives the AJAX request from the Google Chrome extension, while “Captured” is the folder where the HTML files are saved. (The names and locations of “recv.php” and “Captured” can be changed later by editing “eventPage.js” and “recv.php”, but for now, let’s stick with the instructions. :p )
7. Open “chrome://extensions/” in Google Chrome. Ensure that “Developer Mode” is checked. Click “Load Unpacked Extensions” and browse to the folder that contains the Google Chrome extension (the “src” folder).
8. “HTML Capturer” will show up in “chrome://extensions/”. Ensure that “Activated” is checked.
9. Testing time: try opening a web page. If things work, an HTML file will appear inside the folder “Captured” as soon as the web page finishes loading.
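If no file appears after step 9, a quick sanity check of the folder layout from steps 4 to 6 can help. The Python sketch below is only an illustration and is not part of the project: the function names are made up, and the htdocs path in the example run is the Ubuntu default mentioned above, so adjust it for your system. Note that on file systems that ignore letter case, this check will not catch a wrongly capitalized name.

```python
from pathlib import Path

# Expected layout from steps 4 to 6, relative to the htdocs folder.
EXPECTED_FILE = Path("capturer") / "recv.php"
EXPECTED_DIR = Path("capturer") / "Captured"


def missing_items(htdocs: Path) -> list:
    """Return the expected paths that are absent under `htdocs`."""
    missing = []
    if not (htdocs / EXPECTED_FILE).is_file():
        missing.append(str(EXPECTED_FILE))
    if not (htdocs / EXPECTED_DIR).is_dir():
        missing.append(str(EXPECTED_DIR))
    return missing


def newest_capture(htdocs: Path):
    """Return the most recently modified file in "Captured", or None."""
    folder = htdocs / EXPECTED_DIR
    if not folder.is_dir():
        return None
    files = [p for p in folder.iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    # Assumed default XAMPP location on Ubuntu; adjust for your system.
    htdocs = Path("/opt/lampp/htdocs")
    print("missing:", missing_items(htdocs) or "nothing")
    print("newest capture:", newest_capture(htdocs))
```

If the layout is complete but nothing is ever captured, the likely suspects are the Apache server not running or the extension being switched off.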
Usage Instructions:
1. To turn off the feature, uncheck “Activated” beside “HTML Capturer” on “chrome://extensions/”.
2. To turn on the feature, ensure that the Apache server is running and tick “Activated” beside “HTML Capturer” on “chrome://extensions/”.
You can click “Pack Extension” on “chrome://extensions/” in order to make the extension work in “Normal Mode” (not requiring “Developer Mode”).
The data collected in the HTML files inside the folder “Captured” can be processed to produce meaningful information. One of the easiest tools for this is Beautiful Soup, a Python library for pulling data out of HTML. I will cover simple data processing using Beautiful Soup someday, In Sya Allah. I hope that this article can be helpful.
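As a small taste of what such processing could look like, the sketch below lists the title of every captured page. To keep it runnable without installing anything, it uses Python's standard-library html.parser rather than Beautiful Soup, which would make the same task shorter. The class and function names are made up for this illustration.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def titles_in_folder(folder: Path) -> dict:
    """Map each file name in `folder` to the title of the page it holds."""
    result = {}
    for path in sorted(folder.iterdir()):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            result[path.name] = extract_title(text)
    return result
```

Calling titles_in_folder on the “Captured” folder gives a quick overview of what has been saved so far.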
References: