2016. 10. 21. 16:53, Computer/Information Retrieval
Architecture of a Search Engine
"While your first question may be the most pertinent, you may or may not realize it is also the most irrelevant." - The Architect, The Matrix Reloaded
- Search engine architecture
- three parts: software components, interfaces, and relationships
- Two primary requirements
- effectiveness: quality of the search results, i.e. returning the most relevant set of documents
- efficiency: response time and throughput
- Indexing Process
- Text acquisition
- Text transformation
- Index creation
- Query Process
- User interaction
- Ranking
- Evaluation
- Indexing Process
- Text acquisition: builds the collection by crawling and storing documents
- Text transformation: transforms documents into index terms or features
- Index creation:
- create data structures to support fast searching
- must be efficiently updated
- Inverted indexes: the most common form
(assumes the index structure has already been built)
- Query Process
- User interaction
- creation and refinement of query
- display of results (snippets: an excerpt of the document, such as the passage containing the query keywords, a summary of the document, or its opening lines; also ranking, bold highlighting, etc.)
- Ranking
- uses the query and indexes to generate a ranked list of documents
- based on retrieval model
- Evaluation
- effectiveness and efficiency
- effectiveness depends on retrieval model
- efficiency depends on indexes
- offline activity
- Text Acquisition 1. Crawler
- Many types - web(link), enterprise, desktop
- Web crawler
- coverage & freshness
- site search
- Topical or focused crawlers for vertical search
- Document crawler
- Text Acquisition 2. Feed
- A mechanism for accessing real-time streams of documents
- A web feed (or news feed) is a data format used for providing users with frequently updated content
- Search engines subscribe to feeds as well
- e.g. RSS
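To make the feed idea concrete, here is a minimal sketch of reading items from an RSS 2.0 document with Python's standard library. The sample feed, its URLs, and the function name `read_feed_items` are made up for illustration; a real acquisition component would fetch the feed over HTTP and poll it periodically.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real system would download this from a feed URL.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example feed</title>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

def read_feed_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

for title, link in read_feed_items(SAMPLE_RSS):
    print(title, link)
```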
- Text Acquisition 3. Conversion
- consistent text plus metadata format
- convert text encoding
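Encoding conversion can be sketched as "try a few likely encodings and normalize to one internal text form". The candidate list below (UTF-8, then CP949 for Korean pages) is an assumption for illustration, not something the notes prescribe; production systems usually also consult HTTP headers, meta tags, or a detection library.

```python
def to_text(raw, encodings=("utf-8", "cp949")):
    """Decode raw bytes into a Python str using the first encoding that fits."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and mark the rest as U+FFFD.
    return raw.decode("utf-8", errors="replace")

legacy_bytes = "검색".encode("cp949")   # bytes as a legacy Korean page might send them
print(to_text(legacy_bytes))
```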
- Text Acquisition 4. Document data store
- stores metadata, links, and anchor text
- provides fast access (e.g. for serving search results)
- a simpler and more efficient storage system is used
- Text Transformation 1. Parser (splits a document into its parts)
- recognize structural elements
- Tokenizer recognizes "words" (must handle capitalization, hyphens, apostrophes, and non-alphabetic characters used as separators)
- for markup languages, use the markup syntax to identify structure
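As a rough illustration of the tokenizer's job, the sketch below applies one simple (and lossy) policy: lowercase everything and treat every character that is not a letter or digit as a separator. This policy is an assumption chosen for brevity; real tokenizers make more careful decisions about hyphens and apostrophes.

```python
import re

# Assumed policy: runs of ASCII letters/digits are tokens; everything else separates.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lowercase the text and return the list of word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("Web-Crawler's state-of-the-art index"))
```

Note how this policy splits `Crawler's` into `crawler` and `s`; whether that is acceptable depends on how queries are tokenized, since both sides must be processed the same way.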
- Text Transformation 2. Stopping & Stemming
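- Stopping (stopword removal): removes common words
- they have little value as index terms
- Stemming (morphological analysis): groups words derived from a common stem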
- benefits vary for different languages
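Stopping and stemming can be sketched as a filter followed by a normalization step. The stopword list and the suffix rules below are tiny, made-up examples; the stemmer is naive suffix stripping, not the Porter algorithm or any other published stemmer.

```python
# Illustrative only: a real stopword list is much longer.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}
# Naive suffix stripping, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def to_index_terms(tokens):
    """Drop stopwords, then stem the remaining tokens."""
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(to_index_terms(["the", "crawling", "of", "documents"]))
```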
- Text Transformation 3. Link analysis
- popularity and authority
- e.g. PageRank: ranks pages with many incoming links higher
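The intuition that pages with many incoming links score higher can be sketched with a small power-iteration version of PageRank. The damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are common choices assumed here, not details taken from the notes.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph, page `c` has two incoming links and ends up with the highest score, while `b`, which nothing links to, ends up lowest.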
- Text Transformation 4. Information Extraction & Classifier
- Information Extraction
- named entity recognizer
- Classifier
- assign labels
- Clustering without predefined categories
- Index Creation 1. Document Statistics
- counts and positions of words
- Used in ranking algorithm
- Index Creation 2. Weighting
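- computes (relative) weights for index terms
- the relative importance of a term within a document
- tf: term frequency
- idf: inverse document frequency (the fewer documents a term appears in, the more useful it is, so the inverse is used as a multiplier)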
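A minimal sketch of tf-idf weighting follows, using one common variant: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. Other variants (log-scaled tf, smoothing) exist, and the notes do not say which one is intended.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps doc_id -> list of index terms; returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights

docs = {"d1": ["search", "engine", "search"], "d2": ["engine", "index"]}
print(tf_idf(docs))
```

A term that occurs in every document (here `engine`) gets weight 0, which matches the intuition that it does not help distinguish documents.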
- Index Creation 3. Inversion & Index Distribution
- Inversion
- Core of indexing process
- converts document-term information into term-document information
- covers every word in the vocabulary except stopwords
- Must also handle updates
- Index Distribution
- Essential for fast query processing
- both indexing and query processing can be done in parallel
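Inversion can be sketched as turning "document -> terms" into "term -> postings". The in-memory structure below, with a postings list of `(doc_id, positions)` pairs per term, is a simplified assumption; real indexes are compressed, stored on disk, and support incremental updates.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> list of index terms.

    Returns term -> list of (doc_id, [positions]) sorted by doc_id.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id]):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return dict(index)

docs = {1: ["search", "engine"], 2: ["search", "index", "search"]}
for term, postings in sorted(build_inverted_index(docs).items()):
    print(term, postings)
```

Because each document is processed independently before its postings are appended, the same idea extends to index distribution: separate machines can invert separate document subsets.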
- User Interaction 1. Query input
- User Interaction 2. Query transformation
- Spell checking, query suggestion
- query expansion, relevance feedback
- User Interaction 3. Results output
- Ranking
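- Scoring: $\sum_i q_i d_i$
- be clear on what $q_i$ and $d_i$ mean
- $q_i$ is the weight of term $i$ in the query; $d_i$ is the weight of the same term in the document
- many ranking algorithms exist, e.g. the vector space model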
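The scoring sum (query term weight times document term weight, summed over terms) is a dot product over sparse vectors. The sketch below assumes weights are already computed and stored as dictionaries; the function names and the sample weights are illustrative.

```python
def score(query_weights, doc_weights):
    """Sum of q_i * d_i over the terms that appear in the query."""
    return sum(q * doc_weights.get(term, 0.0) for term, q in query_weights.items())

def rank(query_weights, collection):
    """Return (score, doc_id) pairs, best score first; ties broken by doc_id."""
    scored = [(score(query_weights, w), doc_id) for doc_id, w in collection.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

query = {"search": 1.0, "engine": 1.0}
collection = {
    "d1": {"search": 2.0, "engine": 0.5},
    "d2": {"engine": 1.0, "index": 3.0},
}
print(rank(query, collection))
```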
- Performance Optimization(성능 최적화)
- compare term-at-a-time and document-at-a-time processing
- compare safe vs. unsafe optimizations
- Distribution
- relates to the search engine itself; a query can also be split up and processed in parts
- Caching
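The difference between term-at-a-time and document-at-a-time processing can be sketched on a toy index. Both functions below assume an index of the form `term -> [(doc_id, weight), ...]` and distinct query terms. The document-at-a-time version is simplified with dictionary lookups; a real implementation walks the sorted postings lists in parallel.

```python
from collections import defaultdict

def term_at_a_time(query_terms, index):
    """Process one postings list fully, accumulating partial scores per document."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    return sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))

def document_at_a_time(query_terms, index):
    """Compute the complete score of one document before moving to the next."""
    postings = {t: dict(index.get(t, [])) for t in query_terms}
    doc_ids = sorted({d for p in postings.values() for d in p})
    scores = [(d, sum(p.get(d, 0.0) for p in postings.values())) for d in doc_ids]
    return sorted(scores, key=lambda item: (-item[1], item[0]))

index = {"search": [(1, 1.0), (2, 2.0)], "engine": [(2, 0.5), (3, 1.5)]}
print(term_at_a_time(["search", "engine"], index))
print(document_at_a_time(["search", "engine"], index))
```

Both strategies produce the same ranking; they differ in memory use (accumulators for every candidate document versus one document's score at a time) and in which optimizations they allow.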
- Evaluation
- Logging
- which queries users submitted
- which results were shown and which ranked document was clicked
- dwell time and various other signals are recorded
- Ranking analysis (effectiveness)
- Performance analysis (efficiency)
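As one example of what logged interactions make possible, the sketch below computes a click-through rate per rank position from a query log. The log record layout (`results_shown`, `clicked_ranks`) is a hypothetical format invented for this example, and click-through rate is only one of many signals a ranking analysis might use.

```python
from collections import Counter

def click_through_by_rank(log):
    """Return rank -> fraction of impressions at that rank that were clicked."""
    shown = Counter()
    clicked = Counter()
    for entry in log:
        for position in range(1, entry["results_shown"] + 1):
            shown[position] += 1
        for position in entry["clicked_ranks"]:
            clicked[position] += 1
    return {position: clicked[position] / shown[position] for position in sorted(shown)}

# Hypothetical log entries: one per submitted query.
log = [
    {"query": "search engine", "results_shown": 3, "clicked_ranks": [1]},
    {"query": "inverted index", "results_shown": 2, "clicked_ranks": [1, 2]},
]
print(click_through_by_rank(log))
```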
Source: W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice