
How to create a custom robots.txt in Blogger






Custom robots.txt and custom robots header tags are two of the most prominent SEO features recently added by Google, boosting a search engine optimization factor that was often seen as a disadvantage when comparing Blogger to WordPress.



If you're blogging on Google's Blogger platform and you're yet to set up custom robots header tags for your blog's search engine preferences, you should consider doing that as soon as possible.


Besides custom robots.txt and custom robots header tags, Google has added other prominent SEO features for bloggers that undoubtedly deserve applause, including the description meta tag, custom 404 error page, custom redirects, and nofollow and dofollow tags.

Custom robots.txt may sound obscure, but it is a common blogging term and an important Blogger SEO setting that can be customized as desired under the Search Preferences tab in the Blogger dashboard. That said, robots.txt is not limited to Blogspot or Blogger alone.

I expect you to read this article carefully for a proper understanding of the custom robots.txt file before setting it up, because a wrong robots.txt setup can lead to de-indexing of your blog, which could mean not being seen on the major search engines.

What is Custom Robots.txt?


Custom robots.txt is a set of instructions that tells search engine crawlers or spiders, such as Googlebot, whether or not to crawl and index your old and new webpages or blog content.

While the custom robots.txt file is used to tell search engine spiders and bots to crawl and index your new webpages or content, it is also used to stop them from crawling and indexing the "not so important" parts of your blog, for example search pages, archive pages, and the custom 404 error page.

Search engine crawlers and spiders are constantly on the lookout for new blog content to crawl and index, and are often notified via your web feed. However, your custom robots.txt plays the important role of instructing them which pages or new content to index, and which pages or content to skip.

Before executing a crawling and indexing task, a search engine spider like Googlebot will first look at your robots.txt file, then respond to the rules you have set, either crawling and indexing the content or not.

Note: By default, every blog hosted on Blogger has its own robots.txt, whose sitemap feed lists only your 25 most recent posts or pages.

User-agent: Mediapartners-Google
Disallow:          
User-agent: *
Disallow: /search
Allow: /
Sitemap: http://www.okongeorge.com/feeds/posts/default?orderby=UPDATED

On your own default Blogger blog, the sitemap line above will of course carry your own website URL rather than mine. Don't worry; below we will elaborate on what each line of the code means.

Read: How to submit a Blogger sitemap to Google Webmaster Tools

Where is the custom robots.txt file situated?


The robots.txt file is usually situated in the root directory of a website or blog. That works for WordPress, but Blogger is different: there is no option to access the Blogger or Blogspot root directory, so the file is managed through Search Preferences (crawlers and indexing) in the Blogger dashboard.

Spiders and bots have no problem locating the robots.txt file on your blog. The main thing is having the file set up correctly, whether in your site's root directory, dashboard, or control panel. If it's absent, spiders and crawlers will simply crawl and index your web content by default.

What does the custom robots.txt look like?

  
In the Blogger dashboard, the custom robots.txt editor is an empty box by default. Once enabled, a typical configuration looks like this:

User-agent: Mediapartners-Google
Disallow:           
User-agent: *
Disallow: /search
Allow: /
Sitemap: http://www.okongeorge.com/feeds/posts/default?orderby=UPDATED


Here is the interpretation of the code:

User-agent: Mediapartners-Google

Mediapartners-Google is the user agent of the Google AdSense robot, which crawls your website or blog so that AdSense can serve relevant ads to your readers.

Disallowing this user agent means the AdSense bots will not be able to see your pages and display targeted ads to your readers. If you have no ads on your blog, you can simply remove this command.
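For instance, a blog with no AdSense could simply drop that group; a minimal sketch, using yourwebsite.com as a placeholder:

# No Mediapartners-Google group needed when the blog serves no ads
User-agent: *
Disallow: /search
Allow: /
Sitemap: http://www.yourwebsite.com/feeds/posts/default?orderby=UPDATED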

Disallow: 

This directive tells search engine robots which paths not to crawl. An empty Disallow: (as under Mediapartners-Google above) blocks nothing at all. If, on the other hand, you have disallowed a specific page or piece of content in your robots.txt file, spiders that honor the Disallow: rule will not crawl or index that page, so it will not be seen on the search engines.
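To illustrate the difference between the two extremes (this is an illustration, not part of the Blogger default):

# An empty value disallows nothing, so everything may be crawled
Disallow:

# A single slash matches every path, so the whole site is blocked
Disallow: /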

User-agent: *

The * (asterisk) is a wildcard meaning that the rules which follow apply to all search engine spiders, robots, and crawlers.
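Note that a crawler obeys only the most specific group that names it, so rules under a named user agent override the wildcard group for that bot. A small sketch (the /private/ path is hypothetical):

# Googlebot matches its own group below, so it ignores the * group
User-agent: Googlebot
Disallow: /private/

# Every other crawler falls back to this wildcard group
User-agent: *
Disallow: /search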

Disallow: /search

'Disallow' tells search engine spiders, robots, and crawlers to stay away from certain webpages on your blog. The '/search' next to it means that any URL whose path begins with /search, such as Blogger's search result and label pages, will not be crawled or indexed.

If we removed Disallow: /search (or added Allow: /search) to the code above, crawlers would access the entire blog, crawling and indexing all of its content and webpages, including the search keyword pages after the domain name.

Ex: www.okongeorge.com/search/label/traffic 
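As a quick sketch of how this rule matches URLs (the first search URL is hypothetical):

# Blocked by Disallow: /search
#   www.okongeorge.com/search?q=traffic
#   www.okongeorge.com/search/label/traffic
# Still crawled
#   www.okongeorge.com/2016/08/how-to-add-clickable-link-in-blogger.html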

Allow: /

The 'Allow' keyword instructs search engine spiders and bots to follow that rule by permitting the indexing of your webpages for search engine results.

The '/' mark indicates that your home page and your subpages (ex: www.okongeorge.com/p/contact-us.html) should also be crawled and indexed.

Sitemap:

The sitemap line simply tells search engine spiders and robots where to find the feed listing every new blog post to crawl and index. Including the sitemap link is a smart SEO approach for effective indexing of new blog posts, including guest posts, by the bots.
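A robots.txt file may also contain more than one Sitemap line, and each must be an absolute URL. A sketch using a placeholder domain (both feed paths appear elsewhere in this article):

Sitemap: http://www.yourwebsite.com/feeds/posts/default?orderby=UPDATED
Sitemap: http://www.yourwebsite.com/atom.xml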

Note: If you have no custom robots.txt file in your blog's root directory (for WordPress) or Search Preferences settings (for Blogger), then search engine spiders and crawlers will by default crawl and index every page of your website or blog.

Examples of different rules to allow or disallow in a custom robots.txt


Note: as I said earlier, you can customize your robots.txt to your own choice of what to crawl and index, and what not to crawl and index.


Now, the most important thing to consider when customizing your blog's robots.txt with directives for specific search engines is to know the name of each search engine spider (i.e. its user agent) and its specific role.


Let's look at a few examples of allowing and disallowing certain crawl functions:

1.     Disallow a particular post

Let's assume you want to exclude a particular post from indexing; then we can add the line below to the code.


Disallow: /yyyy/mm/post-url.html

The yyyy and mm refer to the year and month the post was published. For example, if I published the blog post in February 2015, the line should look like this:

Disallow: /2015/02/permalink.html

To make this easy, simply copy the permalink URL of the post you want to disallow, starting from the post year and month, removing the blog domain from the beginning. Example:

Disallow: /2016/08/how-to-add-clickable-link-in-blogger.html


2.     Disallow all static pages except “About-us”.


With the rules below, search engine crawlers will not index any of your blog's static pages except the one called 'about us'. Blogger serves static pages under the /p/ path, so Disallow: /p blocks them all, while the Allow line re-permits just that one page. The custom robots.txt code should include these commands:

User-agent: *
Disallow: /p
Allow: /p/about-us.html


3.     Allow and disallow certain blog images
 
 

As I said, knowing the name of a search engine spider helps in defining specific rules for individual webpages and search engine functions.


The user agent Googlebot-Image, for example, defines rules for the Google Images spider. If you want to disallow all your blog images from being indexed while still allowing a specific file, you can set the code below:


User-agent: Googlebot-Image
Disallow: /images/
Allow: /images/header.gif
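Conversely, to keep the rest of your images indexable and hide just one file, the rules can be flipped (private-photo.jpg is a hypothetical filename):

# Everything under /images/ stays crawlable except this one file
User-agent: Googlebot-Image
Disallow: /images/private-photo.jpg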

How to create the custom robots.txt file


In WordPress, it is easy to set up a robots.txt file and place it in your site's root directory through your hosting control panel's file manager or an FTP client like FileZilla.


In Blogger, however, you add or set up the custom robots.txt file through Search Preferences (crawlers and indexing) after clicking the Settings tab in the Blogger or Blogspot dashboard.


To set up your blog's custom robots.txt, simply:


1.     Log in to your Blogspot/Blogger dashboard.

2.     Click on Settings.

3.     Click on Search Preferences so the 'Crawlers and indexing' panel opens.



4.     Click on the 'Edit' link next to Custom robots.txt; a warning appears with 'Yes' and 'No' options asking whether you want to enable custom robots.txt content.




5.     Tick the 'Yes' option and an empty box opens, where you can define the contents of your custom robots.txt.




6.     Now, copy and paste the code below.

User-agent: Mediapartners-Google
Disallow:           
User-agent: *
Disallow: /search
Allow: /
Sitemap: http://www.okongeorge.com/feeds/posts/default?orderby=UPDATED


7.     Now, click on Save.

My suggestion is to use robots.txt to point to your sitemap URL (ex: Sitemap: www.okongeorge.com/atom.xml) and then control indexing with custom robots header tags. However, I've come to realize that the plain sitemap doesn't always work well, and in my research most webmasters rely on the updated-feed sitemap in the sample above.

In case you're yet to set your blog sitemap, the default robots.txt will have only 25 webpages of an existing website or blog indexed, and that's not a favorable condition for a serious blogger. That is to say, if you're a committed blogger with a large and growing archive of content, you should get all of it crawled and indexed by replacing the updated-feed sitemap link in the default robots.txt with this one:

Sitemap: http://www.yourwebsite.com/atom.xml?redirect=false&start-index=1&max-results=500

If, for instance, you have more than 500 published posts on your blog and you still haven't set up your blog sitemap, you should also add this sitemap line for the next batch of posts:

Sitemap: http://www.yourwebsite.com/atom.xml?redirect=false&start-index=500&max-results=1000
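Putting the pieces together, a blog with, say, 800 posts would list both batches; a sketch using yourwebsite.com as a placeholder:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

# Both Sitemap lines together cover all posts
Sitemap: http://www.yourwebsite.com/atom.xml?redirect=false&start-index=1&max-results=500
Sitemap: http://www.yourwebsite.com/atom.xml?redirect=false&start-index=500&max-results=1000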

How to confirm your robots.txt file

You can access your blog's robots.txt file by simply adding /robots.txt to the end of your website URL. For example: www.okongeorge.com/robots.txt

Always remember to use your own website URL in place of mine. When your site's robots.txt URL is visited, what displays in your browser should look like this:
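Using this blog's own file from step 6 as the example:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://www.okongeorge.com/feeds/posts/default?orderby=UPDATED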
 


Search engine spiders

Below are a few of the many search engine spiders that crawl and index webpages on the internet:
  • Googlebot – Google
  • Googlebot-Image – Google Images
  • Googlebot-News – Google News
  • Bingbot – Bing
  • Teoma – Ask
In conclusion, I have taken my time to give a detailed explanation of creating a custom robots.txt file for your blog, beyond the default set of rules. Should you have any difficulty customizing your own blog's robots.txt, don't hesitate to ask me in the comments.

If this tutorial has been of help to you, you can simply support my blogging ministry by helping me spread the word, sharing it on social media using the share buttons below.

Thanks. 




