It's Everywhere... It's Everywhere...
Don E. Descy
Minnesota State University

Today
Searching, Searching, Searching
The Searchable Web
The Invisible Web (Deep Web)
     What is it?
     Why is it?
     How do we get around it?
Resources/References

Why Is This Important?
You are going to want information...
*Reports, papers, presentations.
*Medical, family, jobs, personal.
Your students are going to want information...
*Reports, papers, presentations, personal.
Most of what you want can't be found using regular search techniques!

The Question:
How do you find information that is available... but isn't?
How do you find your exit on the "Information Superhighway" if mapping that exit can't be done?

The Invisible Web
Web sites that are hidden from regular search engines, or that those engines cannot find or catalog.

"Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web."
(BrightPlanet, 2004)

"A full ninety-five per cent of the deep Web is publicly accessible information -- not subject to fees or subscriptions."
(BrightPlanet, 2004)

The Invisible Web Facts
200,000+ Web sites.
550 billion individual documents compared to the three billion of the surface Web.
Contains 7,500 terabytes of information compared to nineteen terabytes in the surface Web.
Total quality content is 1,000 to 2,000 times greater than the surface Web.
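
As a rough arithmetic check, the size comparisons above can be turned into ratios. This is only a sketch: the numbers are the ones quoted on this slide, and the constant and function names are made up for illustration.

```python
# Figures as quoted on the slide (BrightPlanet, 2004); not independently verified.
DEEP_WEB_TERABYTES = 7_500
SURFACE_WEB_TERABYTES = 19
DEEP_WEB_DOCUMENTS = 550e9      # 550 billion
SURFACE_WEB_DOCUMENTS = 3e9     # three billion


def ratio(deep, surface):
    """Return how many times larger the deep figure is than the surface figure."""
    return deep / surface


storage_ratio = ratio(DEEP_WEB_TERABYTES, SURFACE_WEB_TERABYTES)
document_ratio = ratio(DEEP_WEB_DOCUMENTS, SURFACE_WEB_DOCUMENTS)

print(f"By storage:   about {storage_ratio:.0f} times larger")
print(f"By documents: about {document_ratio:.0f} times larger")
```

The two ratios measure different things (terabytes versus document counts), so they need not agree with each other.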

The Invisible Web Facts
Sixty of the largest sites collectively contain over 750 terabytes (84B pages) of information. They exceed the size of the surface Web forty times over.
Fastest growing category of new information on the Internet.
Fifty per cent greater monthly traffic than surface sites.

Invisible Web Facts
Narrower, with deeper content, than conventional surface sites.
More than half of the content resides in topic-specific databases.
Content is highly relevant to every information need, market, and domain.

Invisible Web Facts
Not well known to the Internet-searching public

Searching, Searching, Searching
Usually carried out using a "directory" or "search engine".
Fast and efficient.
Misses most of what is out there.
70% of searchers start from 3 sites (Nielsen, 7/2004):
Google, Yahoo, MSN.

Searching Tools
Directories.
Search engines.

Directories
Hand selected, evaluated, annotated.
Broad topics work best.
Quality over quantity.
Location on list: May be paid.

How Directories Work
Directory Problems
Done by humans.
Takes time.
No universal categories or cataloging system.
Misses most information/sites.

General Subject Directories
"Yahoo".
Biggest and most famous.
Often useful.
Information... jobs... travel... shopping...
to...
Yahoo.com

Search Engines
Computer generated.
Must be static and linked.
Narrower topics.  Quantity over quality.
Uses newer retrieval technologies.
Location on list: May be paid.
Google, Hotbot, Northern Light, AltaVista, etc.

How Search Engines Work
Search Engine Problems
Spiders/robots don't think.
More likely to index sites with more links to them (popularity).
More likely to index US sites.
More likely to index commercial sites.
Sites pay for indexing/position.

At one time showed actual bid!

Finding Good Search Engines
UC-Berkeley: Recommended Search Engines:
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html
UC-Berkeley: The Best Search Engines (9/2004):
#1 Google #2 Teoma #3 Yahoo! Search

Who are you really searching?
What Do We Miss?
Library of Congress: 30 million+ documents.
ERIC databases.
Most daily newspapers.
Health and medical databases.
Museum and library collections.
The information you need????

Why are pages invisible? (1)
1. Searchable databases:
      Typing required.
      Selection of option combination required.
**Pages not available until asked for (ex: Library of Congress).
**Pages are not static but dynamic (may not exist until requested).

Why are pages invisible? (1)
Search engines can't handle "dynamic pages".
Search engines can't handle "input boxes".
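
To illustrate why an input box stops a spider, here is a minimal sketch of a link-following spider, assuming it only collects `href` targets from `<a>` tags. The `LinkCollector` class and the sample page are hypothetical, not taken from any real search engine.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Toy spider step: collect <a href> targets and count <form> tags."""

    def __init__(self):
        super().__init__()
        self.links = []        # pages the spider could visit next
        self.forms_seen = 0    # forms it noticed but cannot fill in

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
        elif tag == "form":
            self.forms_seen += 1


# Hypothetical page: one static link plus a database search form.
SAMPLE_PAGE = """
<html><body>
  <a href="/about.html">About the collection</a>
  <form action="/search" method="get">
    <input type="text" name="title">
    <input type="submit" value="Search">
  </form>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_PAGE)
print("Links to follow:", collector.links)
print("Forms skipped:", collector.forms_seen)
```

The spider finds the static link but has nothing to type into the form, so the pages behind the search box never reach the index.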

Why are pages invisible? (2)
2. Password or Login required:
(Spiders do not know passwords or login IDs).
3. Non-HTML pages:
PDF, Word, Shockwave, Flash...
Some search engines may find them:
ex: Google, AltaVista

Why are pages invisible? (3)
4. Script-based (computer-generated) pages:
Create all or part of Web page.
Contain "?" in URL.
Spiders programmed to back off.
http://calver.org/search/file/ship (yes!)
http://calver.org/search?title=plane (no)
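
The "back off" rule can be sketched as a simple URL check. This assumes the spider skips any URL containing "?"; real spiders may use other rules, and `spider_would_follow` is a name invented for this example.

```python
from urllib.parse import urlsplit


def spider_would_follow(url):
    """Toy rule: follow a URL only if it contains no "?" (no query string)."""
    return "?" not in url


for url in (
    "http://calver.org/search/file/ship",
    "http://calver.org/search?title=plane",
):
    parts = urlsplit(url)
    verdict = "yes" if spider_would_follow(url) else "no"
    print(f"{verdict:>3}  path={parts.path!r}  query={parts.query!r}")
```

The first URL looks like a static file path and is followed; the second carries a query string, so this toy spider backs off.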

Sites to Check
Finding Invisible Information
"Librarians' Index".
Compiled by librarians in the "information supply business".
Highest quality sites only.
Reliable, annotated.
www.lii.org

Finding Invisible Information
"About".
2,400,000 + resources.
Wide variety of subjects: Teens, religion, spirituality, shopping, (expected)
About.com

Finding Invisible Information
"direct search".
"Data not easily or entirely searchable/accessible from general search tools."
www.freepint.com/gary/direct.htm

Finding Invisible Information
"The Invisible Web Catalog".
10,000+ searchable databases.
Quick search, "Hot List".
Sort alphabetically or by score (relevance).
www.profusion.com

Finding Invisible Information
www.invisible-web.net

Finding Invisible Information
"IncyWincy".
Over 100,000 databases.
Many links to other search engines.
www.incywincy.com

Finding Invisible Information
"CompletePlanet":
103,000+ databases and specialty search engines.
Some 'surface' searching.
www.completeplanet.com

Finding Invisible Information
Some are research oriented.
"Infomine".
infomine.ucr.edu
"Academic Info".
www.academicinfo.net

So... What To Do...
Search several sites.
Use the "Advanced Search" feature.
Search using the term "Invisible Web" for IW search sites.
Search several "Invisible Web" sites.

Questions ?
PowerPoint available at descy.net
