Heuristics for finding WSDL-based web services, using the crawler Heritrix as an example
Date
2012
Publisher
Tartu Ülikool
Abstract
Summary
The goal of this bachelor's thesis is to configure and improve the open-source Heritrix web crawler. As a result of the changes made, Heritrix must be able to find the WSDL files that denote web services. A web crawler is a program that automatically roams the expanses of the Internet in search of the desired web documents. WSDL is an XML-based language that specifies the location and protocol of a web service and describes the methods and functions it offers.
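As an illustration of what the crawler is looking for, a minimal WSDL 1.1 document declares a service's operations and its endpoint address roughly as follows. All service, port, and operation names here are invented for the sketch:

```xml
<!-- Minimal illustrative WSDL 1.1 skeleton; names are hypothetical. -->
<definitions name="WeatherService"
             targetNamespace="http://example.com/weather"
             xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
             xmlns:tns="http://example.com/weather">
  <portType name="WeatherPort">
    <!-- The operations (methods) the service offers -->
    <operation name="GetForecast"/>
  </portType>
  <service name="WeatherService">
    <port name="WeatherSoapPort" binding="tns:WeatherBinding">
      <!-- The location (endpoint) of the service -->
      <soap:address location="http://example.com/weather/soap"/>
    </port>
  </service>
</definitions>
```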
To reach this goal, published articles describing various strategies for finding web services on the Internet with a crawler were studied. Based on these works, a Heritrix configuration was created that made it possible to search for WSDL service descriptions. In addition, a supplementary Heritrix class was written in the Java programming language that allows the results of a crawl to be stored in a simplified form.
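The core idea of that logging class can be sketched in standalone form as below. In the thesis the logic lives inside a custom Heritrix processor; the class and method names here are illustrative, not the thesis's actual API, and the ".wsdl"/"?wsdl" check is a common heuristic rather than the full detection logic:

```java
// Standalone sketch of simplified WSDL-hit logging; in the thesis
// this logic is part of a custom Heritrix module. Names are invented.
import java.util.Locale;

public class WsdlResultLog {

    // Heuristic: a URI likely points at a WSDL description if it
    // ends in ".wsdl" or uses the common "?wsdl" query convention.
    static boolean looksLikeWsdl(String uri) {
        String u = uri.toLowerCase(Locale.ROOT);
        return u.endsWith(".wsdl") || u.endsWith("?wsdl");
    }

    // Reduce a hit to one plain line, dropping the bulky metadata
    // that Heritrix normally writes to its crawl log.
    static String logLine(String uri, int httpStatus) {
        return httpStatus + " " + uri;
    }

    public static void main(String[] args) {
        String[] candidates = {
            "http://example.com/service?wsdl",
            "http://example.com/index.html",
            "http://example.com/api/weather.wsdl"
        };
        for (String c : candidates) {
            if (looksLikeWsdl(c)) {
                System.out.println(logLine(c, 200));
            }
        }
    }
}
```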
One of the articles found described adding focused crawling support to a Heritrix crawler that searches for web services. Focused crawling lets the crawler evaluate newly discovered web pages and concentrate on those that are more likely to contain the sought resources. Since the program under study lacks focused crawling functionality, it was added in the course of this thesis by creating an additional module. The algorithm was based on the solution described in the aforementioned article.
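A focused crawler ranks newly discovered links before fetching them. One simple heuristic of this family, sketched below with invented cue words and weights (not the exact algorithm from the cited article), scores a link by service-related terms in its URL and anchor text and serves the highest-scoring links first:

```java
// Illustrative link-priority heuristic for focused crawling toward
// WSDL descriptions. Cue words and weights are made up for this
// sketch; the thesis module follows the cited article's algorithm.
import java.util.Comparator;
import java.util.Locale;
import java.util.PriorityQueue;

public class FocusedFrontier {

    static final String[] CUES = {"wsdl", "service", "soap", "api"};

    // Higher score = more likely to lead to a WSDL description.
    static int score(String uri, String anchorText) {
        String text = (uri + " " + anchorText).toLowerCase(Locale.ROOT);
        int s = 0;
        for (String cue : CUES) {
            if (text.contains(cue)) {
                s++;
            }
        }
        if (text.contains("?wsdl") || text.endsWith(".wsdl")) {
            s += 10; // a direct WSDL hit dominates all other cues
        }
        return s;
    }

    public static void main(String[] args) {
        // Frontier ordered so the best-scoring URI is fetched first.
        PriorityQueue<String> frontier = new PriorityQueue<>(
                Comparator.comparingInt((String u) -> score(u, "")).reversed());
        frontier.add("http://example.com/about.html");
        frontier.add("http://example.com/services/weather?wsdl");
        frontier.add("http://example.com/api/soap/endpoint");
        System.out.println(frontier.poll()); // the "?wsdl" link wins
    }
}
```

The design point is that an unfocused crawler treats all three links above equally, while the scored frontier fetches the likely service pages first and visits pages like about.html last, if at all.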
To check whether the added enhancement made the crawling process more accurate or faster, an experiment consisting of three trials was carried out. Two Heritrix instances were launched, both configured to search for WSDL service descriptions, but only one of them was given focused crawling support. During the trials, the number of services found and the total number of web pages crawled were recorded.
From the analysis of the experimental results it could be concluded that the focused crawling functionality makes the crawling process more accurate and thereby allows WSDL service descriptions to be found faster.
The goal of this thesis is to configure and modify the Heritrix web crawler to add support for finding WSDL description URIs. Heritrix is an open-source spider written in the Java programming language and designed to help the Internet Archive store the contents of the Internet. It already includes most of the common heuristics used for spidering, and its modular architecture makes it easy to alter.
We gathered a collection of strategies and crawler job configuration options to be used with Heritrix, drawn from published work that other teams had done on the topic. In addition, we created a new module in the crawler's source code that allows logging of search results without any superfluous data.
With the configuration changes mentioned, it was possible to spider the web for WSDL description URIs, but since Heritrix does not support focused crawling, the spider would explore every web site it happened to stumble upon, and most of these sites contain no information relevant to finding web services. To guide the spider's job toward resources potentially containing "interesting" data, we implemented support for focused crawling of WSDL URIs. This change required creating a new module in Heritrix's source code; the algorithm we used as the basis for our solution was described in one of the articles.
To see whether our enhancement improved the crawling process, a series of experiments was conducted in which we compared the performance and accuracy of two crawlers. Both were configured for WSDL description crawling, but one was also fitted with the module providing focused crawling support. From the analysis of the experiments' results we concluded that although the crawler job serving as the experiments' baseline processed URIs slightly faster, the spider with the improvements found WSDL descriptions more accurately and was able to find more of them.
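For context on what "crawler job configuration options" look like in practice: a Heritrix 3 job is configured through a Spring-style crawler-beans.cxml file, and a scope rule that accepts WSDL-looking URIs might be sketched as below. The regex and the surrounding bean wiring are illustrative assumptions, not the exact configuration used in the thesis:

```xml
<!-- Illustrative fragment of a Heritrix 3 crawler-beans.cxml scope.
     The regex and wiring are a sketch, not the thesis's actual job. -->
<bean class="org.archive.modules.deciderules.MatchesRegexDecideRule">
  <property name="decision" value="ACCEPT"/>
  <!-- Accept URIs ending in ".wsdl" or the common "?wsdl" query -->
  <property name="regex" value=".*(\.wsdl|\?wsdl)$"/>
</bean>
```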