Website Cloning with httrack

Website cloning is useful when we need a copy of a website for offline use or as a backup. When we have access to the source code, this is fairly easy. Without access to the source code or the server, it can seem cumbersome, but it is not as difficult as it looks. In this blog post, we will copy this blog site and run it on our local computer.

Installation and Preparation

To start, we need to install the necessary applications: httrack for duplicating the website, and Python for its simple built-in web server command. (Note: Python comes preinstalled in many Linux distros, so Python installation is not included in the command below.)

sudo apt install httrack


To run the web server, navigate to your destination folder and execute the following command:

python3 -m http.server 8000
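The same server can also be started from a script, which may be handy when the mirror folder is not the current directory. The following is a minimal sketch using only the standard library; the function name `start_server` and its defaults are illustrative assumptions, not part of the original setup.

```python
import functools
import http.server
import threading


def start_server(directory, port=8000):
    """Serve `directory` on localhost in a background thread.

    Hypothetical helper: roughly what `python3 -m http.server` does,
    but with an explicit folder. Pass port=0 to let the OS pick a free port.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Daemon thread: the server stops when the main program exits.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_server("path/to/mirror")` returns the server object; the chosen port is available as `server.server_address[1]`, and `server.shutdown()` stops it. The calling program has to stay alive for as long as the site should be served.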

Now that we have installed the necessary software and everything is ready, we can proceed with cloning the website and compare the results of different arguments.

Execution and Comparison

We can start by running the httrack command with a specific URL. Please note that this process may take some time.

httrack https://extendedtutorials.blogspot.com/

We have successfully cloned our website. However, if we disconnect from the internet, some content may not show up, because it is still being loaded from other servers. Fortunately, the "-n" argument also retrieves non-HTML files linked from the downloaded HTML files. Try again with the following command:

httrack -n https://extendedtutorials.blogspot.com/

This time, we were able to fetch the images locally. This argument is particularly useful for basic content hosted on external sites. However, it only covers non-HTML files: if a downloaded page links to an external HTML file, that file may still be missing.

To address this, we can also download the websites linked by our website using the "--ext-depth" argument. Please note that increasing this argument beyond a depth of 1 might cause the program to download irrelevant sites, and recursive downloads of external websites can consume excessive storage. We can limit the recursion depth for all sites using the "--depth" argument.

httrack --ext-depth=1 --depth=4 https://extendedtutorials.blogspot.com/

As observed, we were able to render external HTML files successfully. However, we downloaded 26 websites, some of which are irrelevant to our site, and as a result we could not use the "-n" argument in this run. To address this, we should use the browser's web console to identify which sites our website actually requested during our initial attempt.
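To see which sites ended up in the mirror and how much space each one takes, the project folder can be inspected with a short script. This is a sketch under one assumption: that httrack, with its default settings, stores each mirrored host in its own sub-folder of the project directory, next to its own "hts-cache" folder. The function name `mirrored_hosts` is hypothetical.

```python
from pathlib import Path

# Folder that httrack writes for its own bookkeeping (assumed default name).
HTTRACK_INTERNAL = {"hts-cache"}


def mirrored_hosts(project_dir):
    """Return {folder_name: total_size_in_bytes} for each host folder."""
    sizes = {}
    for entry in sorted(Path(project_dir).iterdir()):
        # Skip plain files (logs, index page) and httrack's cache folder.
        if not entry.is_dir() or entry.name in HTTRACK_INTERNAL:
            continue
        sizes[entry.name] = sum(
            f.stat().st_size for f in entry.rglob("*") if f.is_file()
        )
    return sizes
```

Printing the result sorted by size makes it easier to spot irrelevant sites that account for most of the storage.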

We can add these sites as filters by prefixing each one with the "+" sign, and use the "*" wildcard so that httrack downloads any content our website requests from them.

httrack -n https://extendedtutorials.blogspot.com/ +https://blogger.googleusercontent.com/* +http://fonts.gstatic.com/* +https://resources.blogblog.com/* +https://www.blogger.com/*
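As a rough alternative to reading the web console, the external hosts can be collected from a page's HTML and turned into "+" filters in the same form as above. This is only a sketch: the names `LinkCollector` and `external_filters` are hypothetical, it looks only at "src" and "href" attributes, and it will miss anything loaded by JavaScript or referenced from CSS.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the values of src and href attributes from an HTML page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.append(value)


def external_filters(html, own_host):
    """Return sorted "+scheme://host/*" filters for hosts other than own_host."""
    collector = LinkCollector()
    collector.feed(html)
    hosts = set()
    for url in collector.urls:
        parsed = urlparse(url)
        # Relative links and mailto:-style links have no host part.
        if not parsed.netloc or parsed.netloc == own_host:
            continue
        if parsed.scheme not in ("", "http", "https"):
            continue
        # Protocol-relative URLs ("//host/path") are assumed to be https.
        scheme = parsed.scheme or "https"
        hosts.add(f"{scheme}://{parsed.netloc}")
    return [f"+{host}/*" for host in sorted(hosts)]
```

The returned strings can be reviewed by hand and appended to the httrack command; reviewing them first helps keep irrelevant sites out of the mirror.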

This time, we were able to render both external HTML files and other files successfully. However, the main menu failed to load visual content because httrack doesn't parse complex JavaScript files.

Conclusion

In summary, website cloning is easier than it appears. Duplicating websites that we own, or that we have permission to copy, is acceptable, but publishing these copies as our own may lead to legal and ethical issues. Because duplicating a website is so straightforward, it is also essential to double-check the authenticity of the websites we visit in order to avoid phishing sites.
