sitemap.xml and robots.txt
"sitemap.xml" and "robots.txt" files are placed in the html root directory of a website by a website owner.
The content of the sitemap.xml and robots.txt files can be read by all search engines. The sitemap.xml file tells them which pages you want them to index and show in their search results. The robots.txt file, in contrast, tells them which parts of your website you do not want them to look at or use. Search engines do not document everything their "bots" do on a website; what is known is that the bots request files from it directly. These automated visits are what allow them to place the names of such sites in their search results.
Web page makers usually include a function that you can use to generate a sitemap.xml file.
In your sitemap.xml file, you keep the URLs of the web pages that you definitely want search engines to see and index.
You leave out the web pages that you do not want search engines to use in search results.
That also means that you do not have to place <h1> headings and other such extras in pages that you do not want search engines to use.
It seems that search engine "bots" will go ahead and read every file they can reach on a website and then, if a sitemap.xml file is also available, check it to see which web pages they should use in search results.
Note: As is well known, the same web page can often be fetched under more than one address, for example with or without "www", or with "http" or "https" at the front.
No matter what any search engine "bot" may say, leave only one version of each web page's internet address in your sitemap.xml file.
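As a minimal sketch of that advice, the Python below reduces several address variants of one page to a single preferred form before it goes into sitemap.xml. The domain www.example.com, the choice of "https" with "www", and the function name preferred_url are all assumptions made for illustration; use whichever form your own site really serves.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for this sketch; pick the form your site really uses.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.example.com"  # placeholder domain


def preferred_url(url: str) -> str:
    """Reduce a variant address of a page on this one site to the preferred form."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Treat ".../index.html" as the same page as ".../".
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    # The scheme and host of the input are replaced by the preferred ones.
    return urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST, path, "", ""))


variants = [
    "http://example.com",
    "http://www.example.com/",
    "https://example.com/index.html",
    "https://www.example.com/index.html",
]

# All four variants collapse to a single sitemap entry.
unique = sorted({preferred_url(u) for u in variants})
print(unique)
```

The sketch only handles a single site and ignores query strings, so it is a starting point rather than a complete address cleaner.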
Example of sitemap.xml content
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" >
Your web page maker will usually have a software function that allows you to generate sitemaps and modify their content.
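If your web page maker has no such function, a sitemap can also be produced with a few lines of script. The following is only a sketch using Python's standard library; the function name build_sitemap and the example.com addresses are placeholders, not part of any particular tool.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap.xml text listing each URL once, in the given order."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    seen = set()
    for url in urls:
        if url in seen:
            continue  # keep only one entry per address
        seen.add(url)
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder pages: list only the pages you want search engines to index.
pages = [
    "https://www.example.com/",
    "https://www.example.com/contact.html",
]
sitemap_text = build_sitemap(pages)
print(sitemap_text)
```

The printed text can be saved as sitemap.xml in the root directory of the website. Pages that you leave out of the list simply do not appear in the file.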
Example 1 of robots.txt content
You use a plain text editor to make this file. If you are just beginning your website, it can contain this and nothing else:
User-agent: *
Disallow: /
On a new website, that text will stop search engines from collecting and using the content of web pages before you are ready for them to do so.
* means all search engines.
Disallow: / means do not use the content of any directory or file on the server.
Note: If you do not stop search engine "bots" from collecting and using your data before you are ready for them to do so, the results they show for your site can become hard to explain.
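To see what such a "block everything" file does, you can test it with the robots.txt parser in Python's standard library. This is a sketch only: the bot names and the example.com address are placeholders, and real search engines use their own parsers, which may behave differently in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The two-line "keep everyone out" file for a website that is not ready yet.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Placeholder bot names and page address, used only to query the rules.
for bot in ("Googlebot", "Bingbot", "SomeOtherBot"):
    allowed = parser.can_fetch(bot, "https://www.example.com/index.html")
    print(bot, "allowed" if allowed else "blocked")
```

Every bot in the loop is reported as blocked, because the * line applies to all of them and Disallow: / covers every path on the site.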
Example 2 of robots.txt content
This website is just about ready to make itself known to the world.
The content of robots.txt now allows access to all search engines. The lines of instructions that follow tell them where the sitemap is to be found and limit their access to pictures.
The section addressed to Googlebot-Image is supposed to prevent that special "bot" from looking at or using any of the images.
As is often the case in real life, one may later have reason to delete that statement from the robots.txt file.
Not all search engines have a special program like "Googlebot-Image" to deal with images, so the website in question here has simply used a Disallow instruction on its image directory to prevent all the other search engines from going to the website and getting images as they please.
Note: Many images are sold on the open market. Search engines have been known to compare the pixel content of images and come to erroneous conclusions, because more than one person on the web has used the same image.
You could block individual files in a specific directory, but it may be easier to place the HTML files that you want to hide from search engines in one directory and then block the whole directory as shown above.
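The kind of robots.txt file described in this example can be tried out before it is uploaded. The sketch below uses a hypothetical file, not the original one from this website: the domain example.com and the directory names /images/ and /private/ are assumptions chosen for illustration. It relies on the robots.txt parser in Python's standard library, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the domain and directory names are placeholders.
robots_txt = """\
User-agent: *
Disallow: /images/
Disallow: /private/

User-agent: Googlebot-Image
Disallow: /

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

site = "https://www.example.com"
checks = [
    ("Googlebot", "/index.html"),          # ordinary page
    ("Googlebot", "/images/photo.jpg"),    # inside the blocked image directory
    ("Googlebot", "/private/draft.html"),  # inside the blocked HTML directory
    ("Googlebot-Image", "/index.html"),    # the image bot is blocked everywhere
]
for bot, path in checks:
    verdict = "allowed" if parser.can_fetch(bot, site + path) else "blocked"
    print(f"{bot:16} {path:22} {verdict}")

# The Sitemap line is independent of the User-agent sections.
print("Sitemap:", parser.site_maps())
```

Only the first check comes back as allowed. Deleting the Googlebot-Image section later would leave that bot subject to the general * rules, so it would still be kept out of the image directory.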
The PHP World: Certain website builders keep all image and text data in a table on the server. When a page is requested, they fetch the data and construct the web page a split second before it is sent to the web browser. No general statement has been made about whether search engines can read pictures while they are still inside the tables, in the way they can read pictures that are stored in directories. Logic says that "bot" software could easily request the pages in the normal way using the sitemap.xml file and get the images as well.
Warning: You may come across an SEO tool that tells you that one or more of the main URLs that point to your index.html page is "canonical". Unless you have found out what to do about "canonical", do not bother to place any HTML code relating to it in your already existing web pages.
As a normal website owner with limited software knowledge, you will probably not be able to keep up with the changes that are being made to data structures and programs used by servers and search engines.
SEO tools will ask for your sitemap.xml file more often than they will ask for the robots.txt file.
Before you get steered in unsuitable directions, remember that you, the owner of a website with your own content, are the one with the ownership rights on the internet. Nothing else has any meaning there.
SEO Tips and Tricks - Using the sitemap.xml and robots.txt files to control access to websites