It has been a while since I published a new program, so I have decided to release a very simple but effective tool that I have called Sitemap Extractor. As the name implies, it lets us extract the URLs of an XML sitemap and export them to a text file.
The program has been tested on many websites. As long as a sitemap follows the XML standard for sitemaps, the program will be able to extract its URLs. There are two types of sitemaps that it cannot process:
- Sitemaps compressed with gzip (recognizable by their .gz extension). I do not rule out adding support for this type of compressed sitemap in the future.
- Sitemaps that do not follow the XML standard for sitemaps.
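To make "following the XML standard for sitemaps" more concrete, here is a minimal Python sketch of the kind of extraction involved. It is not the code of Sitemap Extractor (which is a Windows program); the function name `extract_urls` and the sample document are assumptions made only for illustration. It relies on the namespace defined by the sitemaps.org protocol and reads the `<loc>` element of every `<url>` entry.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap, used only for illustration.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.web.com/</loc></url>
  <url><loc>http://www.web.com/about/</loc></url>
</urlset>"""


def extract_urls(xml_text):
    """Return the text of every <url><loc> element of a standard sitemap."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Print one URL per line, the same shape as the exported text file.
print("\n".join(extract_urls(SAMPLE_SITEMAP)))
```

A sitemap that omits the namespace or uses different element names would return an empty list here, which is roughly what "not following the standard" means in practice.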
At the time of writing this article, Sitemap Extractor is in its first beta version, so it may still contain some bugs. I have tried it on several WordPress blogs that use different plugins to create the sitemap (Google XML Sitemaps, WordPress SEO by Yoast, etc.) and on Tumblr blogs, and it extracted the URLs correctly in all cases.
Many of you will be wondering why you would need to extract the URLs of a sitemap. There is no single answer, since this functionality is useful in many situations. To give some examples, I have used it as a starting point to extract content in bulk, to perform SEO audits and even to speed up the search for expired domains. The program does not carry out these tasks itself, but obtaining the list of URLs of a website is the starting point for all of them.
How to extract the URLs of a sitemap with Sitemap Extractor
The program is very simple to operate: you just enter the URL of the sitemap and the program processes its contents automatically.
Following the numbering in the previous image, the steps are:
- Enter the URL of the sitemap we want to extract. To locate the sitemap, we can try the most common location, http://www.web.com/sitemap.xml, or check the robots.txt file to see whether it indicates where the sitemap is stored. Once the sitemap URL has been entered, click the "Extract" button.
- Secondly, select the sitemaps whose URLs we want to extract. Many sites have an index of sub-sitemaps; Sitemap Extractor recognizes these indexes, and we only have to select the ones we want to process. Often we may not want to extract the URLs of categories or tags, or we may only want the URLs of a specific month, and so on.
- Thirdly, process the selected sitemaps. To do this, right-click in the area where the sitemap URLs are listed and press the "Process" button. A new window will appear asking for the name and location of the text file where the URLs will be saved.
- Finally, wait for the process to finish. In the log section we can check whether everything went well or whether there were any error messages.
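For readers who want to see what happens behind those clicks, the steps above can be sketched in Python. This is only an illustration under stated assumptions, not the actual implementation of Sitemap Extractor: the function names (`sitemaps_from_robots`, `read_sitemap`, `save_urls`) and the sample data are hypothetical. It assumes that robots.txt declares sitemaps on `Sitemap:` lines and that a sitemap index uses the `<sitemapindex>` root element of the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org protocol, in the {uri}tag form ElementTree uses.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sample data, used only for illustration.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /wp-admin/
Sitemap: http://www.web.com/sitemap.xml
"""

SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.web.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>http://www.web.com/category-sitemap.xml</loc></sitemap>
</sitemapindex>"""


def sitemaps_from_robots(robots_text):
    """Step 1: return the URLs declared on 'Sitemap:' lines of robots.txt."""
    found = []
    for line in robots_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def read_sitemap(xml_text):
    """Step 2: return ("index", sub-sitemap URLs) or ("urlset", page URLs)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag == NS + "sitemapindex":
        kind, path = "index", f"{NS}sitemap/{NS}loc"
    elif root.tag == NS + "urlset":
        kind, path = "urlset", f"{NS}url/{NS}loc"
    else:
        raise ValueError(f"not a standard sitemap: {root.tag}")
    return kind, [loc.text.strip() for loc in root.findall(path) if loc.text]


def save_urls(urls, path):
    """Step 3: write one URL per line to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")


print(sitemaps_from_robots(SAMPLE_ROBOTS))
kind, children = read_sitemap(SAMPLE_INDEX)
# Keep only the sub-sitemaps we care about (here: skip the categories).
selected = [url for url in children if "category" not in url]
print(kind, selected)
```

In a complete script, each selected sub-sitemap would be downloaded, passed through `read_sitemap` again, and the resulting page URLs handed to `save_urls`. The download step is left out here so that the sketch runs without network access.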
With these simple steps (which take longer to explain in words than to carry out with a few mouse clicks), we get a list of URLs that we can use for other tasks.
I hope you find it useful. If you run into a problem or have any suggestions, do not hesitate to let me know in the comments.
Sitemap Extractor:
- Requirements: Windows operating system, .NET Framework 4 and a functional brain.
- Download Sitemap Extractor v0.1beta: Download