How the WWW.*.* list was created:
It turns out I made a bad assumption in the first program that cut out a lot
of the domain names, so I am currently redoing all of the pages. I have yet
to do the EDU and COM pages. The NET page took less than a day, and I expect
the EDU page to take about the same. I have started the COM page a couple of
times and expect it to run for a very long time (maybe a week)...

Why I started this project is not relevant. What I wanted was a list of
valid addresses that fit the form HTTP://WWW.*.*. I also needed to learn
Windows Sockets programming, and this turned out to be a good exercise.

The first cut at it started at the same place, ended a little differently,
and took two orders of magnitude longer. I started with the INTERNIC zone
files (master lists of all registered domains), stripped the last two words
off of each entry (*.com for example), and added WWW. to the front. The next
step was to do a DNS lookup on each address. If an address resolved
successfully, I opened a TCP socket to it (port 80). On the first cut I would
then request the default page at that address and parse the title string. On
this latest run I did not request the page or parse the title string. The
time-consuming parts are the DNS address lookup and the TCP connection (the
page request was also time-consuming before).
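
The steps just described (build WWW. names from the zone entries, then try a
TCP connection to port 80) can be sketched roughly as follows. This is a
Python illustration, not the original program. The function names
`candidate_hosts` and `accepts_tcp` are hypothetical, and the zone file
layout (`EXAMPLE.COM. NS NS1.EXAMPLE.NET.`, with `;` starting a comment) is
an assumption that may not match the real files.

```python
import socket


def candidate_hosts(zone_lines):
    """Return unique WWW.<domain> names built from zone-file-style lines.

    Assumption: each entry is "<domain> <word> <word>", so dropping the
    last two words leaves the domain name in the first field.
    """
    seen = set()
    hosts = []
    for line in zone_lines:
        words = line.split()
        # Skip blank lines, comments, and entries that do not fit the shape.
        if len(words) < 3 or words[0].startswith(";"):
            continue
        domain = words[0].rstrip(".").upper()
        if domain not in seen:
            seen.add(domain)
            hosts.append("WWW." + domain)
    return hosts


def accepts_tcp(host, port=80, timeout=5.0):
    """Return True if `host` resolves and accepts a TCP connection."""
    try:
        # create_connection does the name lookup and the connect in one call.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers lookup failures, refused connections, and timeouts.
        return False
```

Calling `accepts_tcp(name)` for every name from `candidate_hosts(...)`, one
at a time, is the slow approach: each name costs a blocking lookup plus a
connection attempt, and a dead address can hold things up for the full
timeout.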

Bottom line: the first run took literally four months, running nearly 24
hours a day over a 33.6k dial-up connection. This run took four days.

I started with RFC 1035 and found that DNS requests can, and usually do, use
UDP. That means there is no time-consuming connection setup, and if there is
no response, who cares? For this case I am only interested in the ones that
respond.

This run starts again with the INTERNIC zone files. The first program,
ZONE.EXE, extracts the unique domain names from the original zone file. Next,
DNS.EXE takes that list, sends four requests a second (which keeps traffic
manageable), and processes the responses. It is interesting to note that I
compared the results here with the results given by gethostbyname(). It turns
out that gethostbyname() often gives bad addresses (my TCP/IP stack was the
native Windows 95 PPP Dial-up Adapter), which would slow down the following
steps.

Next I sorted the list alphabetically with SORT.EXE.
Yes, it is a bubble sort, and yes, it was slow. I only really needed it for
WWW.*.COM; the rest I sorted with my editor. The next step is the slowest
one: I used SEARCH.EXE to check TCP connections to all of the addresses
found. Lastly, I used SPLIT.EXE to turn the list into nicely formatted HTML
files.

ZONE.EXE, SORT.EXE, and SPLIT.EXE are DOS programs. DNS.EXE and SEARCH.EXE
are 16-bit Windows programs. I leave it up to the reader to view the source
code and figure out what to pass each program on the command line.
Source for above programs
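
For readers who would rather see the idea than dig through the source, the
UDP lookup step can be sketched roughly like this. It is a Python
illustration, not the DNS.EXE source. The names `build_query`,
`parse_header`, and `lookup_all`, and the final wait for stragglers, are
assumptions made for the example. The sketch only checks that a reply
reports no error and at least one answer; it does not decode the returned
addresses.

```python
import socket
import struct
import time


def build_query(name, query_id):
    """Build an RFC 1035 query packet asking for the A record of `name`."""
    # Header: ID, flags (only "recursion desired" set), one question,
    # and zero answer, authority, and additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: every label is prefixed with its length; a zero byte ends it.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 (A record), QCLASS 1 (Internet).
    return header + qname + struct.pack("!HH", 1, 1)


def parse_header(packet):
    """Return (query_id, rcode, answer_count) from a DNS response."""
    query_id, flags, _qd, ancount, _ns, _ar = struct.unpack(
        "!HHHHHH", packet[:12])
    return query_id, flags & 0x000F, ancount


def lookup_all(names, server_addr, per_second=4, wait=5.0):
    """Send paced UDP queries; return the names that got a usable answer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)
    pending = {}  # query ID -> name still waiting for a reply
    found = []

    def drain():
        # Read every reply that has arrived so far without blocking.
        while True:
            try:
                packet, _addr = sock.recvfrom(512)
            except (BlockingIOError, ConnectionResetError):
                return
            if len(packet) < 12:
                continue  # too short to hold a DNS header
            query_id, rcode, ancount = parse_header(packet)
            name = pending.pop(query_id, None)
            if name is not None and rcode == 0 and ancount > 0:
                found.append(name)

    for index, name in enumerate(names):
        # IDs are 16 bits; when they wrap, an old unanswered name is dropped.
        query_id = index & 0xFFFF
        pending[query_id] = name
        sock.sendto(build_query(name, query_id), server_addr)
        time.sleep(1.0 / per_second)  # pace the traffic
        drain()

    time.sleep(wait)  # give late replies one last chance
    drain()
    sock.close()
    return found
```

A call such as `lookup_all(hosts, ("192.0.2.1", 53))` would send the queries
to a name server at that address (192.0.2.1 is a documentation placeholder,
not a real server). Because nothing waits on any single reply, a name that
never answers costs nothing beyond the one packet sent.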