Tuesday, February 12, 2008

What Are Search Engine Spiders?

About Spiders

A spider, also known as a robot or a crawler, is actually just a program that follows, or "crawls", links throughout the Internet, grabbing content from sites and adding it to search engine indexes.

Spiders can only follow links from one page to another and from one site to another. That is the primary reason why links to your site (inbound links) are so important. Links to your website from other websites give the search engine spiders more "food" to chew on. The more times they find links to your site, the more times they will stop by and visit. Google especially relies on its spiders to create its vast index of listings.
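
To make the link-following idea concrete, here is a minimal shell sketch of the first thing a spider has to do with a page: pull out the addresses that its links point to. The sample HTML and the extract_links helper are made up for illustration, and a real spider would use a proper HTML parser rather than a pattern match, so treat this as a rough picture and not as how any search engine actually works.

```shell
# Hypothetical helper: print the target of every href="..." attribute
# found in the HTML read from standard input, one target per line.
extract_links() {
  grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

# Made-up sample page containing two links.
page='<p><a href="http://www.example.com/a.html">A</a> and <a href="http://www.example.com/blog/">B</a></p>'

# A spider would add each of these addresses to its list of pages to visit.
links=$(printf '%s\n' "$page" | extract_links)
printf '%s\n' "$links"
```

A page that nobody links to never shows up in such a list, which is why inbound links matter so much.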

Spiders find Web pages by following links from other Web pages, but you can also submit your Web pages directly to a search engine or directory and request a visit by their spider.

In fact, it's a good idea to manually submit your site to a human-edited directory such as Yahoo; spiders from other search engines (such as Google) will usually find your listing there and add your site to their databases. It can be useful to submit your URL straight to the various search engines as well, but spider-based engines will usually pick up your site whether or not you've submitted it.

Tuesday, January 8, 2008

How to Password Protect a Directory on Your Website

Password protecting a directory on your site is actually fairly easy. Webmasters typically want to protect a directory if they have information that they want to make available only to a selected number of people. This guide teaches how you can make a folder on your website accessible only to people with the appropriate password.

If Your Web Host Has a Control Panel

Before you dive into the task of manually password-protecting a directory using Apache's built-in facilities, you might want to check out your web host's control panel to see if they already provide the facility for protecting directories. In my experience, many commercial web hosts already provide an easy way for you to password-protect your directories. If such a facility is already available, it's probably best to use it since it will save you time, particularly if you are not familiar with shell command lines and editing of .htaccess files. Otherwise, read on.

System Requirements

You will need the following before your attempt to password-protect anything is successful.

  1. Your website must be running on an Apache web server.
  2. Your web host must have enabled .htaccess processing - that is, they allow you to customize your web server environment using localized configuration files called .htaccess files.
  3. You must have shell access, either via telnet or Secure Shell (SSH). You should also know how to use telnet or SSH to connect to your web hosting account.

Steps to Protecting a Directory with a Password Using .htaccess on Apache

1. Create a .htaccess file

Use an ASCII text editor like Notepad to create a text file with the following contents:

AuthName "Secure Area"
AuthType Basic
AuthUserFile /path/to/your/directory/.htpasswd
require valid-user


Note that you will have to modify the above according to your situation. In particular, change:

1. AuthName

Change "Secure Area" to any name that you like. This name will be displayed when the browser prompts for a password. If, for example, that area is to be accessible only to members of your site, you can name it "Members Only" or the like.

2. AuthUserFile

You will later create a file containing passwords named .htpasswd. The "AuthUserFile" line tells the Apache web server where it can locate this password file.

Ideally, the password file should be placed outside any directory accessible by visitors to your website. For example, if the main page of your web site is physically located in "/home/your-account-name/public-html/", place your .htpasswd file in (say) /home/your-account-name/.htpasswd. That way, on the off-chance that your host misconfigures your server, your visitors cannot view the .htpasswd contents by simply typing http://www.example.com/.htpasswd.

Wherever you decide to place the file, put the full path of that file after "AuthUserFile". For example, if you placed the file at /home/your-account-name/.htpasswd, change that line to "AuthUserFile /home/your-account-name/.htpasswd". Note that your password file need not be named .htpasswd either. It can be any name you wish. For ease of reference, however, this tutorial will assume that you chose ".htpasswd".

3. AuthType and require

You do not have to modify these. Just copy the lines as they are given above.
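
If you are comfortable working directly in your shell account, the file can also be written there instead of in a text editor. The following is only a sketch: the temporary directory stands in for the folder you want to protect, "Members Only" is an example realm name, and the AuthUserFile path is the placeholder used in this article, not a real location.

```shell
# Stand-in for the directory you want to protect; on a real host this
# would be a folder under your web root.
protected_dir=$(mktemp -d)

# Write the four directives into .htaccess. The quoted 'EOF' keeps the
# shell from altering anything between the markers.
cat > "$protected_dir/.htaccess" <<'EOF'
AuthName "Members Only"
AuthType Basic
AuthUserFile /home/your-account-name/.htpasswd
require valid-user
EOF

# Show the result so it can be checked by eye.
cat "$protected_dir/.htaccess"
```

Creating the file this way also sidesteps the problem of an editor quietly adding a ".txt" extension to the name.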

2. Save and Upload the .htaccess file

Save the .htaccess. If you are using Notepad, be sure to save the file as ".htaccess", including the quotes, otherwise Notepad will change the name to ".htaccess.txt" behind your back. Then upload the .htaccess file to the directory that you want to protect.

3. Set Up the Password File, .htpasswd

Use your telnet or SSH software and log into your shell account.
Be sure that you are in your home directory, not somewhere else. Note that your web directory is probably not your home directory on most commercial web hosts. On servers that use a Unix-type system (like Linux, FreeBSD and OpenBSD), you can usually go to your home directory by simply typing "cd" (without the quotes) followed by the ENTER key (or RETURN key on a Mac). This, by default, will switch you to your home directory. (Note for Windows users - this is different from the Windows/DOS shell, where "cd" only displays the current working directory.)

Then, type the following command:

htpasswd -c .htpasswd your-user-name

where your-user-name is the login name of the user you want to give access. The user name should be a single word without any intervening spaces. You will then be prompted to enter the password for that user. When this is done, the htpasswd utility creates a file called .htpasswd in your current directory (home directory). You can move the file to its final location later, according to where you set the AuthUserFile location in .htaccess.

If you have more than one user, you should create passwords for the others as well, using the following command for each subsequent user:

htpasswd .htpasswd another-user-name

Notice that this time, we did not use the "-c" option. When the "-c" option is not present, htpasswd will look for an existing file by the name given (.htpasswd in our case), and append the new user's password to that file. If you use "-c" for your second user, you will wipe out the first user's entry since htpasswd takes "-c" to mean create a new file, overwriting the existing file if present.
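
If you would like to see the difference for yourself without touching your real password file, the sketch below does so in a throwaway directory. It assumes the htpasswd utility is installed, and it uses htpasswd's "-b" option, which takes the password from the command line instead of prompting. That is convenient for a demonstration but unwise for real passwords, since command lines can be visible to other users on the server. The user names and passwords are made up.

```shell
# Run only where Apache's htpasswd utility is installed.
if command -v htpasswd >/dev/null 2>&1; then
  demo_dir=$(mktemp -d)
  pwfile="$demo_dir/.htpasswd"

  # -c creates the file; -b takes the password from the command line.
  htpasswd -cb "$pwfile" sally 'demo-password-1' 2>/dev/null
  # No -c here, so mary is appended to the existing file.
  htpasswd -b "$pwfile" mary 'demo-password-2' 2>/dev/null
  users_after_append=$(wc -l < "$pwfile" | tr -d ' ')

  # Using -c again recreates the file, discarding sally and mary.
  htpasswd -cb "$pwfile" joe 'demo-password-3' 2>/dev/null
  users_after_recreate=$(wc -l < "$pwfile" | tr -d ' ')

  echo "users after append: $users_after_append"
  echo "users after second -c: $users_after_recreate"
  rm -rf "$demo_dir"
fi
```

The file holds one line per user, so counting lines is a quick way to see how many entries survived each command.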

If you are curious about the contents of the file, you can take a look using the following command:

cat .htpasswd

Since the .htpasswd file is a plain text file, with a series of user name and encrypted password pairs, you might see something like the following:

sally:abcdefgHijK12
mary:34567890LMNop


This file has two users, "sally" and "mary". The passwords you see will not be the same as the ones you typed, since they are stored in encrypted (hashed) form.
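
Because each line is simply a user name, a colon and the encrypted password, you can list just the user names with standard tools. Here is a small sketch that uses a temporary copy of the sample contents shown above; on your own account you would point cut at your real password file instead.

```shell
# Temporary file holding the sample entries from this article.
pwfile=$(mktemp)
cat > "$pwfile" <<'EOF'
sally:abcdefgHijK12
mary:34567890LMNop
EOF

# Split each line at the colon and keep only the first field (the user name).
users=$(cut -d: -f1 "$pwfile")
printf '%s\n' "$users"

rm -f "$pwfile"
```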

Before you quit, you should make sure that permissions on the file are acceptable. To check the permissions, simply type the following on the shell command line:

ls -al .htpasswd

If you see the file with a listing like:

-rw-rw-rw- (...etc...) .htpasswd


it means that the .htpasswd file can be read and written by everyone who has an account on the same server as you. The first "rw" means that the owner of the file (you) can read it and write to it. The next "rw" means everyone in the same group as you can read and write the file. The third "rw" means that everyone with an account on that machine can read and write the file.

You don't want anyone else to be able to write to the file except you, since they can then add themselves as a user with a password of their own choosing or other nefarious stuff. To remove the write permission from everyone except you, do this from the shell command line:

chmod 644 .htpasswd

This allows the file to be read and written by you, and only read by others. Depending on how your server is set up, it is probably too risky to change the permissions to prevent others from your group or the world from reading it, since if you do so, the Apache web server will probably not be able to read it either. In any case, the passwords are encrypted, so a cursory glance at the file will hopefully not give away the passwords.
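
If you want to watch what chmod does before running it on the real thing, the following sketch tries it on a temporary file. It deliberately starts the file off as world-writable so that the change is visible. The variable names are just for illustration.

```shell
# Throwaway file to experiment on.
demo_file=$(mktemp)

# Start with the overly permissive setting described above.
chmod 666 "$demo_file"
before=$(ls -l "$demo_file" | cut -c1-10)

# Remove write permission from everyone except the owner.
chmod 644 "$demo_file"
after=$(ls -l "$demo_file" | cut -c1-10)

echo "before: $before"
echo "after:  $after"

rm -f "$demo_file"
```

In the "after" listing, only the first group of letters (the owner's) still contains a "w".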

If you have set a different directory for your password file in your .htaccess earlier, you will need to move it there. You can do this from the shell command line as follows:

mv .htpasswd final/location/of/the/file

Remember that your file does not even have to be called .htpasswd. You can name it anything you like. However, if you do, make sure that your AuthUserFile has the same directory and filename or Apache will not be able to locate it.

Testing Your Setup

Once you have completed the above, you should test your setup using your browser to make sure that everything works as intended. Upload a simple index.html file into your protected directory and use your web browser to view it. You should be greeted with a prompt for your user name and password. If you have set everything up correctly, when you enter that information, you should be able to view the index.html file, and indeed any other file in that directory.

A Word of Caution

You should note a few things though, before you go berserk password protecting directories and harbouring the illusion that they can safeguard your data:

  1. The password protection only guards access through the web. You can still freely access your directories from your shell account. So can others on that server, depending on how the permissions are set up in the directories.
  2. It protects directories and not files. Once a user is authenticated for that folder, he/she can view any file in that directory and its descendants.
  3. Passwords and user names are transmitted by the browser without encryption (Basic authentication merely encodes them), so unless the connection itself is encrypted with HTTPS, they are vulnerable to being intercepted by others.
  4. You should not use this password protection facility for anything serious, like guarding your customers' data, credit card information or any other valuable information. It is basically only good for things like keeping out search engine bots and casual visitors. Remember, your data isn't even encrypted in the directory with this method.
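
To see why the third point above matters, consider what the browser actually sends. With the "AuthType Basic" scheme, the user name and password are joined with a colon and base64-encoded, which is an encoding and not encryption. The sketch below uses made-up credentials and assumes the common base64 command-line tool is available with its usual "-d" (decode) option.

```shell
# Made-up credentials, for illustration only.
credentials='sally:secret'

# What the browser places after "Authorization: Basic " in its request.
encoded=$(printf '%s' "$credentials" | base64)

# Anyone who intercepts the header can reverse it; no key is needed.
decoded=$(printf '%s' "$encoded" | base64 -d)

echo "Authorization: Basic $encoded"
echo "Decoded: $decoded"
```

Unless the connection itself is encrypted, for example with HTTPS, anyone who can observe the traffic can recover the password just as easily.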

Congratulations

Congratulations. You have now successfully password-protected a directory on your website.

Pros and Cons of Putting a Blog in a Subdirectory / Folder

When you first install a blog on your site, you are faced with the decision of whether to put your blog into a subdirectory (folder), like http://www.example.com/blog/, or just let your blog be accessed from your main URL, like http://www.example.com. No one answer fits all, since there are advantages and disadvantages to either of these options. This article gives you the good points and downsides of both alternatives so that you can evaluate what is best for your site.

Advantages of Putting Your Blog in a Folder / Sub-directory

1. Multiple Purpose Websites

If you intend your website to be more than a blog, for example, if you intend to sell goods or services from your site, putting your blog into a folder or sub-directory has certain advantages. In particular, your main page can then be freed to advertise your products or services and link to your shopping cart. From that page, you can still have a link to your blog.

When your site serves different purposes, putting your blog on the main page has the potential to decrease your sales, cause confusion among your customers and make your site look unprofessional. Think about it. What do all blogs look like? No matter how you change the theme or appearance, all blogs have certain visual features in common. They typically have a series of posts on their main pages, linking to the actual articles. Except for a slogan underneath the name of the blog, all the other content on the page usually moves off the page as new posts are made.

This works against you, since people wanting to buy things expect to see immediately either a list of products on your main page or information about your company and the kind of things it sells. From there, they can navigate to the price lists or product description pages, and so on. Having a blog front page may lose you some visitors who, unaccustomed to your unusual layout, may not be able to find what they are there to do, or think that they have arrived at the wrong site.

2. Your Non-Blog Pages are not Dynamic or Database-Server-Dependent

As mentioned in one of my other articles, most blog software creates pages dynamically (except possibly Movable Type, which has the option of creating static pages). Such software depends on a chain of facilities, from scripts (programs) running on the web server to database servers supplying data, to deliver a single page to your visitors. A failure at any point in that chain, such as the database server being too bogged down to reply to additional requests, means that your web page can no longer be delivered.

In view of this, separating your non-blog pages like your product pages and ordering pages from your database-dependent blog system is probably wise. While your blog may be the apple of your eye, since you invested so much time writing for it, its "down" time will probably not cost you as much as your product and ordering pages being unavailable. You will want the latter pages to be static pages, dependent only on the web server.

3. Conflicting Web Addresses (URLs)

On some web hosts, the root directory of your website contains links to a variety of built-in facilities provided by the host. For example, they may place links to your control panel or to your web statistics, accessible by browser using a URL like http://www.example.com/name-of-facility.

Installing a blog into the main directory of your site causes problems on these hosts. Many of the more sophisticated blog or CMS packages completely take over the directory they are installed in, as well as its subfolders. Try to access a file that is not recognised by the blog software (as your control panel or web statistics will definitely not be), and you get a File Not Found error issued by the blog. You will have effectively lost access to that facility provided by your web host. Don't blow this out of proportion, though. It is still possible to circumvent the blog software by using a bit of .htaccess magic, if your site is hosted on an Apache server.

Of course, if you install your blog into a subfolder, this problem disappears, since your root folder is not managed by the software.

Disadvantages of Putting Your Blog in a Sub-folder

Before you all rush out to put your blog into a subdirectory, there are also disadvantages to doing so. If you are not careful, you will encounter all of these downsides.

1. Set-up and Maintenance Time is Increased

When you set up your blog in the main web folder, so that your blog appears when you type your domain name alone, your setup work is mostly complete when you finish installing and configuring the blog.

If you install your blog in a subdirectory, you still have to create a main page for your site, since the blog software will no longer take care of that for you. You will have to figure out how to design that main page, and how you should link to your blog. Since people arriving at your site will now see that static page instead of your blog posts, you will no longer have the convenience of the blog software automatically promoting your latest article on that page. If you want your latest article highlighted somehow, you will have to either manually do it, or write a script to insert it for you. All these increase the time it takes for you to set up and maintain your site.

In view of this, if your site is intended to be purely a blog, putting the software in a sub-folder is probably unnecessary. There's too much extra labour for little gain.

2. Link Dilution: Links to Your Blog will be Divided

When your blog is in a sub-folder, some webmasters will link to the blog in that sub-folder, while others will link to your main page. This reduces the number of links going to any particular page on your site. As mentioned in my other article on How to Create a Search Engine Friendly Website, you don't want this to happen since it may reduce the importance of your page in the eyes of the search engines.

This is not a big issue if your site is truly intended to sell things (or some other purpose), with the blog serving as a sort of side-endeavour, intended to supplement your main purpose. People who want to link to the product will probably link to your main site or the product page (which is good), while others who are only interested in your philosophical ruminations will link to your blog. You win here, since your site will have gained links (to your blog) that it would never have got had you not maintained a blog. And since those links have different link texts, there's no real issue. The search engines will see that "Widget XYZ" can be found at http://www.example.com, but "ABC Blog" can be found at http://www.example.com/blog.

The problem comes only if your site is primarily a blog, but has the blog software installed in a folder. Your main page, in such a case, probably doesn't do much other than point to your blog. As a result, some sites will link to your main page since that's the easiest thing to do, while other sites, figuring out that your main page has nothing of use to their readers, will just link directly to your blog folder.

By way of example, http://www.example.com will, in such a case, have X number of links for the term "Mary's Blog" in the search engines while http://www.example.com/blog/ will have Y number of links for the same term, instead of one page having a total of X+Y links. If another site with "Mary's Blog" can be found on the Internet with more links than your X or Y links, that site may be counted as being more important than your site for that term.

Admittedly, there's more to the search engines' link algorithm than this, and my search term "Mary's Blog" is sort of contrived. However, the general principle of link dilution still applies.

Is It Possible to Get the Best of Both Worlds?

I'm not sure if there's a 100% satisfactory way to solve all the issues in an easy way.

1. If yours is purely a blog site, and you see nothing in the future for that site other than it being a blog, by all means, install the software directly in the root web directory. If ever you want to sell something in the future, you can always buy another domain for the product, preferably with the name of the product as the domain name as well.

Alternatively, if you want to build on the link reputation your blog has garnered over the years, it is still possible to re-purpose and redesign the existing site. As mentioned in my article on changing a site's design, if a website lasts a certain amount of time, it is likely to get redesigned anyway, so it's not as though you're saving yourself some labour by meticulously planning umpteen years into the distant (foggy) future. However, when you do so, you will want to find some way to preserve the URLs of your existing blog posts. Otherwise your link reputation will be lost when all the existing links to your blog articles are broken. The way to do this is to start preparing ahead now by forming future-proof URLs for your blog posts so that you have fewer problems in the future.

2. If you want your site to both sell things and be a blog, but don't really want to spend time creating web pages for your main directory with a web editor like Dreamweaver or Nvu, it may be better to get a full-blown content management system (CMS) like Drupal than to use software like WordPress, which is primarily a blogging program.

CMS software allows you to create non-blog pages of the sort you need for a typical website selling products and services. If you get one that supports blogs as well, as Drupal does, then you have all the facilities you need for your site in one package. With it installed in the root web directory of your site, you can have both normal pages and blog pages. You still won't solve the issue of database dependence, but you will have reduced your setup and maintenance time. As your income from your site increases, you can always move your site to a dedicated server, so that your site will have exclusive use of the database server (among other things).

Conclusion

When planning for your website or blog, it is important to consider all the issues involved in putting your blog or CMS software in the main web directory or a sub-folder of your site. In both cases, there are pros and cons, and a clear understanding of all that is involved will help you plan what is best for your particular situation.