생각하는 일상 | Architecture of a Search Engine

AI Researcher @NCSOFT

2016. 10. 21. 16:53, Computer/Information Retrieval

Architecture of a Search Engine

"While your first question may be the most pertinent, you may or may not realize it is also the most irrelevant."
The Architect, Matrix Reloaded

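Search Engine Architecture

- An architecture has 3 parts: software components, the interfaces they provide, and the relationships between them
- 2 primary requirements
  - effectiveness: the quality of the search results, i.e. retrieving the most relevant set of documents
  - efficiency: response time and throughput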

Indexing Process
- Text acquisition
- Text transformation
- Index creation
Query Process
- User interaction
- Ranking
- Evaluation

Indexing Process
- Text acquisition: collects and stores the documents to be indexed, typically by crawling
- Text transformation: transforms documents into index terms or features
- Index creation:
  - creates data structures to support fast searching
  - the structures must be efficient to update
  - inverted indexes are the most common form

(The query process below assumes the index structure has already been built.)

Query Process
- User interaction
  - creation and refinement of the query
  - display of results (snippets: an excerpt of the document, such as the passage containing the query keywords, a summary of the document, or its opening lines; plus ranking, bolding of matched terms, etc.)
- Ranking
  - uses the query and the indexes to generate a ranked list of documents
  - based on a retrieval model
- Evaluation
  - measures effectiveness and efficiency
  - effectiveness depends on the retrieval model
  - efficiency depends on the indexes
  - an offline activity

Text Acquisition 1. Crawler
- Many types: web (follows links), enterprise, desktop
- Web crawler
  - main concerns are coverage and freshness
  - can be restricted to a single site for site search
  - topical or focused crawlers for vertical search
- Document crawler
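
As a rough illustration of how a web crawler follows links, here is a minimal sketch of a breadth-first crawl. The `WEB` dictionary and `fetch_links` function are hypothetical stand-ins for real page fetching, so the example runs without network access; politeness rules, freshness checks, and URL normalization are left out.

```python
from collections import deque

# Hypothetical link graph standing in for the web: url -> outgoing links.
WEB = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c"],
    "c": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return WEB.get(url, [])


def crawl(seed, fetch, max_pages=100):
    """Visit pages breadth-first from seed, never visiting a URL twice."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


visited = crawl("seed", fetch_links)
```

The `max_pages` cap is one simple way to bound coverage; a focused crawler would instead filter `fetch(url)` results by topic.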

Text Acquisition 2. Feed
- A mechanism for accessing real-time streams of documents
- A web feed (or news feed) is a data format used for providing users with frequently updated content
- Search engines subscribe to feeds as well
- e.g. RSS
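
To make the feed idea concrete, the sketch below reads items out of an RSS 2.0 document with the standard library's `xml.etree.ElementTree`. The feed content is a made-up sample inlined as a string; a real search engine would fetch it periodically from the publisher.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document, inlined so the example needs no network access.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
      <pubDate>Fri, 21 Oct 2016 16:53:00 +0900</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
      <pubDate>Fri, 21 Oct 2016 17:10:00 +0900</pubDate>
    </item>
  </channel>
</rss>"""


def read_feed(rss_text):
    """Return one dict per <item>, ready to hand to the indexing pipeline."""
    root = ET.fromstring(rss_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return docs


feed_docs = read_feed(SAMPLE_RSS)
```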

Text Acquisition 3. Conversion
- Converts documents in various formats into a consistent text plus metadata format
- Converts text encodings to a single standard encoding (for example UTF-8)
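
A small sketch of the conversion step, assuming the source encoding is already known (detecting it is a separate problem). The function name `to_common_format` and the dictionary layout are illustrative choices, not a format defined in the notes.

```python
def to_common_format(raw_bytes, source_encoding, metadata):
    """Decode bytes from source_encoding and return text plus metadata.

    The text is held as a Python str, so it can be written out as UTF-8.
    """
    text = raw_bytes.decode(source_encoding, errors="replace")
    return {"text": text, "metadata": dict(metadata)}


# A Korean string as it might arrive from a legacy EUC-KR page.
raw = "검색엔진".encode("euc-kr")
doc = to_common_format(raw, "euc-kr", {"url": "http://example.com/a", "type": "html"})
utf8_bytes = doc["text"].encode("utf-8")
```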

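Text Acquisition 4. Document data store
- Stores document text together with metadata, links, and anchor text
- Provides fast access to document contents (e.g. for producing search results)
- A simpler and more efficient storage system than a relational database is typically used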

Text Transformation 1. Parser (splits a document into its parts)
- Recognizes structural elements
- The tokenizer recognizes "words" and must decide how to handle capitalization, hyphens, apostrophes, and non-alphabetic characters as separators
- For markup languages, uses the markup syntax to identify structure
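
The tokenizer decisions listed above can be sketched with a single regular expression. The policy here is an assumption chosen for illustration: lowercase everything, keep hyphens and apostrophes that sit inside a word, and treat any other non-alphanumeric character as a separator.

```python
import re

# Assumed policy: a token is a run of letters/digits, optionally joined to
# further runs by a single internal hyphen or apostrophe.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and return its tokens in order."""
    return TOKEN_RE.findall(text.lower())


tokens = tokenize("O'Neill's state-of-the-art Search Engine, v2!")
```

Other policies are equally defensible, for example splitting "state-of-the-art" into four tokens; what matters is that documents and queries are tokenized the same way.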

Text Transformation 2. Stopping & Stemming
- Stopping: removes common words (stopwords)
  - They carry little or no value as index terms.
- Stemming: groups words derived from a common stem
  - benefits vary for different languages
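
A minimal sketch of stopping and stemming on English tokens. The stopword list and the suffix-stripping rule are deliberately tiny and purely illustrative; a production system would use a full stopword list and an established stemmer such as Porter's.

```python
# Tiny illustrative stopword list, not a standard one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are"}
# Suffixes are tried in this order; this is not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(tokens):
    """Drop stopwords, then stem what remains."""
    return [stem(t) for t in tokens if t not in STOPWORDS]


terms = index_terms(
    ["the", "crawlers", "are", "indexing", "feeds", "and", "linked", "documents"]
)
```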

Text Transformation 3. Link analysis
- Uses links to estimate popularity and authority
  - e.g. PageRank: ranks pages with many incoming links more highly.
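
The intuition that incoming links raise a page's rank can be sketched with a simple power iteration. The damping factor 0.85, the iteration count, and the four-page graph are assumptions for illustration; this is a simplified PageRank variant, not the implementation used by any particular engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: every other page links to "hub".
graph = {"a": ["hub"], "b": ["hub"], "c": ["hub", "a"], "hub": ["a"]}
ranks = pagerank(graph)
```

In this graph "hub" ends up with the highest score because it has the most incoming links, while "b" and "c", which nothing links to, share the lowest.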

Text Transformation 4. Information Extraction & Classifier
- Information extraction
  - e.g. a named entity recognizer
- Classifier
  - assigns labels to documents
  - clustering groups documents without predefined categories

Index Creation 1. Document Statistics
- Gathers counts and positions of words
- Used in the ranking algorithm

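Index Creation 2. Weighting
- Computes relative weights for index terms
- A weight reflects the relative importance of a term within a document
- tf: term frequency
- idf: inverse document frequency (a term that occurs in fewer documents is more useful, so the weight is multiplied by the inverse of its document frequency)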
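
One common way to combine the two quantities is weight = tf × log(N / df), where N is the number of documents and df is the number of documents containing the term. The sketch below uses that variant with length-normalized tf; many other tf-idf formulations exist, and the corpus is made up. It assumes every document has at least one term.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps doc id -> list of index terms; returns doc id -> {term: weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # count each term once per document
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        weights[doc_id] = {
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
    return weights


corpus = {
    "d1": ["search", "engine", "index"],
    "d2": ["search", "query", "ranking"],
    "d3": ["search", "index", "index"],
}
w = tf_idf(corpus)
```

Here "search" appears in every document, so its idf is log(1) = 0 and it contributes nothing, while "engine", which appears in only one document, outweighs "index" within d1.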

Index Creation 3. Inversion & Index Distribution
- Inversion
  - Core of the indexing process
  - Converts document-term information into term-document information
  - Covers every word in the vocabulary except stopwords
  - Must also handle updates
- Index distribution
  - Essential for fast query processing
  - Both indexing and query processing can be done in parallel
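
The inversion step above can be sketched as follows: document-to-terms input becomes a term-to-postings mapping that also records word positions. The `add_document` and `shard_for` helpers are hypothetical names showing one simple way to handle updates and to route terms to index shards; real systems use more elaborate merge and partitioning schemes.

```python
from collections import defaultdict


def invert(docs):
    """Turn doc id -> terms into term -> {doc id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(position)
    return index


def add_document(index, doc_id, terms):
    """Handle an update by merging one new document into an existing index."""
    for position, term in enumerate(terms):
        index[term].setdefault(doc_id, []).append(position)


def shard_for(term, n_shards):
    """Assumed term-based distribution: pick a shard with a stable byte-sum hash."""
    return sum(term.encode("utf-8")) % n_shards


inverted = invert({"d1": ["search", "engine"], "d2": ["search", "index", "search"]})
add_document(inverted, "d3", ["index", "update"])
```

The length of each position list doubles as the term's count in that document, so the same structure carries the document statistics used for ranking.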

User Interaction 1. Query input
User Interaction 2. Query transformation
- Spell checking and query suggestion
- Query expansion and relevance feedback
User Interaction 3. Results output
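
As an illustration of the spell-checking part of query transformation, the sketch below corrects a query word to the closest vocabulary word by Levenshtein edit distance. The vocabulary and the distance threshold of 2 are assumptions; real spell checkers also weigh word frequency and query logs.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in vocabulary:
        return word
    best = min(vocabulary, key=lambda v: (edit_distance(word, v), v))
    return best if edit_distance(word, best) <= max_distance else word


# Hypothetical vocabulary drawn from indexed documents.
VOCABULARY = {"retrieval", "document", "index", "query"}
corrected = [suggest(w, VOCABULARY) for w in ["retrival", "documnet", "query"]]
```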

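Ranking
- Scoring: $\sum_i q_i d_i$
  - Know what $q_i$ and $d_i$ stand for.
  - $q_i$ is the weight of term $i$ in the query; $d_i$ is the weight of the same term in the document.
  - Many ranking algorithms exist, for example the vector space model.
- Performance optimization
  - Compare term-at-a-time with document-at-a-time processing.
  - Compare safe with unsafe optimizations (a safe optimization returns the same ranking as unoptimized scoring; an unsafe one may not).
- Distribution
  - Concerns running the search engine across multiple machines; a query can also be split up and processed in parts.
  - Caching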
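
The scoring sum can be evaluated term-at-a-time: each query term's postings are read in turn and partial scores are added into per-document accumulators. The sketch below assumes a small in-memory index mapping each term to document weights; the weights themselves are made up.

```python
from collections import defaultdict


def score_term_at_a_time(query_weights, postings):
    """Compute sum_i q_i * d_i for every document, one query term at a time.

    query_weights: term -> q_i
    postings: term -> {doc id: d_i}  (hypothetical in-memory inverted index)
    """
    accumulators = defaultdict(float)
    for term, q_i in query_weights.items():
        for doc_id, d_i in postings.get(term, {}).items():
            accumulators[doc_id] += q_i * d_i
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))


postings = {
    "search": {"d1": 0.5, "d2": 0.25},
    "engine": {"d1": 0.5},
    "index": {"d2": 0.75, "d3": 1.0},
}
ranked = score_term_at_a_time({"search": 1.0, "index": 2.0}, postings)
```

Document-at-a-time processing would compute the same scores by walking all query terms' postings in parallel, finishing one document before moving to the next.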

Evaluation
- Logging
  - which queries users issued
  - which result position was clicked for the results that were shown
  - dwell time and various other signals are recorded
- Ranking analysis (measures effectiveness)
- Performance analysis (measures efficiency)
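
To show how logged clicks and dwell times might feed ranking analysis, here is a minimal sketch over a made-up log. Mean reciprocal rank of the clicked result and a dwell-time threshold of 10 seconds are illustrative choices, not measures prescribed by the notes.

```python
# Hypothetical log records: (query, rank of the clicked result, dwell time in seconds).
click_log = [
    ("search engine", 1, 40.0),
    ("inverted index", 2, 12.0),
    ("pagerank", 4, 3.0),
    ("tf idf", 1, 25.0),
]


def mean_reciprocal_rank(log):
    """Average of 1/rank over all clicks; closer to 1.0 means clicks near the top."""
    return sum(1.0 / rank for _, rank, _ in log) / len(log)


def satisfied_click_rate(log, min_dwell=10.0):
    """Share of clicks whose dwell time reached min_dwell (an assumed threshold)."""
    return sum(1 for _, _, dwell in log if dwell >= min_dwell) / len(log)


mrr = mean_reciprocal_rank(click_log)
sat = satisfied_click_rate(click_log)
```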

Source: W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice
