Web Data Extraction for Aggregating Product Information from E-Shops (Veebiandmete eraldamine tooteinfo agregeerimiseks e-poodidest)
Date
2016
Abstract
The World Wide Web has become an unlimited source of data. Search engines have made this information available to the everyday Internet user. Nevertheless, some information is still not easily accessible through existing search engines, so there remains a need to build new search engines that present information better than before. To present data in a way that adds value, the data must first be collected and then processed and analysed. This master's thesis focuses on the data collection phase of that process.
It presents ZedBot, a modern data extraction system that converts semi-structured data on web pages into structured form with high accuracy. The system meets most of the requirements set for a modern data extraction system: it is platform independent, it has a powerful rule description system and a semi-automatic wrapper generation system, and it has an easy-to-use user interface for annotating data. A specially designed web crawler allows extraction to be performed across a whole website without human intervention.
We show that the presented tool is suitable for extracting highly accurate data from a large number of websites, and that the resulting dataset can be used as a data source for a product aggregation system to create new added value.