Find-Sitemap is a Python library that ships with a database of more than 10,000 known sitemap URL structures. You only need to supply the domain, and it helps you locate sitemaps on any website, even when they are hidden deep within the site's directory structure. It can also detect multiple sitemaps, so you can view and analyze all the pages listed in them.
If you don't want to manually check common sitemap locations such as /sitemap-index.xml, /sitemap.txt, and /sitemap.php, the Find-Sitemap Python library is a convenient alternative.
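To make the manual approach concrete, here is a minimal sketch of the list you would otherwise build and visit by hand. It only uses the three paths named above; the function name `candidate_sitemap_urls` and the `example.com` domain are illustrative and are not part of Find-Sitemap.

```python
# Paths named in the text above; real sites may use many others.
COMMON_SITEMAP_PATHS = ["/sitemap-index.xml", "/sitemap.txt", "/sitemap.php"]


def candidate_sitemap_urls(domain, paths=COMMON_SITEMAP_PATHS):
    """Return the full URLs you would have to check by hand for `domain`."""
    base = "https://" + domain.strip("/")
    return [base + path for path in paths]


# Print each URL that would need a manual visit.
for url in candidate_sitemap_urls("example.com"):
    print(url)
```

With three paths this is manageable, but the list grows quickly once subdomains, nested folders, and other file types are considered, which is the work the library automates.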
How to use the Find-Sitemap package?
This tutorial explains step by step how to find a sitemap. Everything runs in the cloud on Google Colab, so you don't need to install anything on your Mac or Windows computer.
First, open this Google Colab document, click “File”, then “Save a copy in Drive”.
Second, replace 'google.com' with the domain you want to check, then run the code.
If everything goes well, you will see results like those below. In this run, two sitemaps were found on the Google domain.
    >>> from Find_Sitemap import FindSitemap
    >>> main = FindSitemap('google.com')
    >>> main.crawl()
    ...
    ...
    check 13801/13804: https://google.com/sitemap.xml
    check 13802/13804: https://google.com/feed.xml
    check 13803/13804: https://google.com/sitemap_index.xml
    check 13804/13804: https://google.com/sitemapindex.xml
    --------------------
    Find sitemap urls len: 2
    Find sitemap urls list: ['https://www.google.com/sitemap.xml', 'https://www.google.com/about/sitemap.xml']
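Once you have the sitemap URLs, a natural next step is to list the pages they contain. The sketch below is not part of Find-Sitemap: it uses Python's standard `xml.etree.ElementTree` to pull every `<loc>` entry out of a sitemap document, and it works on an inline sample string (with made-up `example.com` URLs) instead of downloading anything.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text):
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample; in practice this text would come from a found sitemap URL.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(extract_locs(SAMPLE))
```

The same function handles a sitemap index, because index files also store their child sitemap URLs in `<loc>` elements.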
Advanced Features
1. Show the subdomains, slugs_L1, slugs_L2, and filetypes parameters loaded from the database.
    from Find_Sitemap import FindSitemap
    main = FindSitemap('google.com')
    # The value after each attribute is the output it shows
    main.subdomains  # {'www.'}
    main.slugs_L1    # {'/default', '/sitemap', '/feeds', '/api', '/contents' ...}
    main.slugs_L2    # {'/sitemap', '/stock', '/sitemap1', '/sitemap0', ...}
    main.filetypes   # {'txt', 'xml', 'xml.gz', 'jsp', 'html', ...}
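To see why these four parameters lead to thousands of checks, consider the sketch below. It is an assumption about how such sets could be combined, not the library's actual logic, which is not shown here. The function `build_candidates`, the URL layout, and the empty-string entries (standing for "no subdomain" and "no first-level folder") are all hypothetical.

```python
from itertools import product


def build_candidates(domain, subdomains, slugs_l1, slugs_l2, filetypes):
    """Combine every subdomain, first-level slug, second-level slug and file type."""
    urls = set()
    for sub, s1, s2, ext in product(subdomains, slugs_l1, slugs_l2, filetypes):
        # Assumed layout: https://<subdomain><domain><slug_L1><slug_L2>.<filetype>
        urls.add(f"https://{sub}{domain}{s1}{s2}.{ext}")
    return sorted(urls)


candidates = build_candidates(
    "example.com",
    subdomains={"", "www."},   # "" stands for the bare domain
    slugs_l1={"", "/feeds"},   # "" stands for no first-level folder
    slugs_l2={"/sitemap"},
    filetypes={"xml", "txt"},
)
print(len(candidates))
for url in candidates:
    print(url)
```

Because the number of candidates is the product of the four set sizes, even small additions to one set multiply the number of URLs to check.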
2. Add values to the subdomains, slugs_L1, slugs_L2, and filetypes parameters.
    from Find_Sitemap import FindSitemap
    main = FindSitemap('google.com')
    main.subdomains.add("shop.")
    main.slugs_L1.add("/node")
    main.slugs_L2.add("/site")
    main.filetypes.add("xml")
3. Remove values from the subdomains, slugs_L1, slugs_L2, and filetypes parameters.
    from Find_Sitemap import FindSitemap
    main = FindSitemap('google.com')
    main.subdomains.remove("shop.")
    main.slugs_L1.remove("/node")
    main.slugs_L2.remove("/site")
    main.filetypes.remove("xml")
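One detail worth knowing when removing values: the `{...}` output shown earlier suggests these parameters are ordinary Python sets, although that is an assumption. If so, `remove()` raises a `KeyError` when the value is not present, while `discard()` silently does nothing. The sketch below demonstrates this with a plain set standing in for `main.filetypes`.

```python
# Plain set used as a stand-in for main.filetypes (assumed to be a Python set).
filetypes = {"txt", "xml", "xml.gz"}

filetypes.remove("xml")        # succeeds: "xml" is present

try:
    filetypes.remove("xml")    # fails: "xml" was already removed
except KeyError:
    print("remove() raised KeyError for a missing value")

filetypes.discard("xml")       # safe: no error even though "xml" is absent
print(sorted(filetypes))
```

Using `discard()` is the safer choice when you are not sure whether a value is in the set, for example when "shop." was never added to the subdomains in the first place.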
Thanks for reading. I hope this article helps you better understand how to use the Find-Sitemap library to find any website's sitemap.