AI Researcher @NCSOFT
Architecture of a Search Engine


"While your first question may be the most pertinent, you may or may not realize it is also the most irrelevant."
The Architect, Matrix Reloaded

  • Search engine architecture
    • 3 parts: software components, interfaces, and the relationships between them
    • 2 requirements
      • Effectiveness: the quality of the search results, i.e. retrieving the most relevant set of documents
      • Efficiency: response time and throughput

  • Indexing Process
    • Text acquisition
    • Text transformation
    • Index creation
  • Query Process
    • User interaction
    • Ranking
    • Evaluation


  • Indexing Process
    • Text acquisition: builds the document collection by crawling and storing documents
    • Text transformation: converts documents into index terms or features
    • Index creation:
      • creates data structures to support fast searching
      • must be efficiently updatable
      • Inverted indexes: the most common form

(Assumes the index structures have already been built)
  • Query Process
    • User interaction 
      • creation and refinement of query
      • display of results (snippets: a portion of the document, such as the part containing the query keywords, a summary of the document, or its opening lines; plus ranking, bold highlighting, etc.)
    • Ranking
      • uses the query and indexes to generate a ranked list of documents
      • based on retrieval model
    • Evaluation
      • effectiveness and efficiency
      • effectiveness depends on retrieval model
      • efficiency depends on indexes
      • offline activity
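
The snippet idea from the user-interaction bullets above can be sketched as follows. This is a minimal illustration, not how any particular engine does it: the function name `make_snippet`, the `width` parameter, and the `**bold**` marker are all assumptions made for the example.

```python
import re

def make_snippet(text: str, query: str, width: int = 30) -> str:
    """Return a window of text around the first query-term hit, with hits in **bold**."""
    terms = [re.escape(t) for t in query.lower().split()]
    if not terms:
        return text[: 2 * width]
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        # No keyword found: fall back to the opening of the document.
        return text[: 2 * width]
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    window = text[start:end]
    # Mark every query term inside the window.
    return pattern.sub(lambda m: f"**{m.group(0)}**", window)

TEXT = "Search engines build an inverted index so that queries run fast."
snippet = make_snippet(TEXT, "inverted")
```

When no query term occurs in the document, the sketch falls back to the opening of the text, which matches the "opening lines" option listed above.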



  • Text Acquisition 1. Crawler
    • Many types: web (follows links), enterprise, desktop
    • Web crawler
      • coverage & freshness
      • site search
      • Topical or focused crawlers for vertical search
    • Document crawler
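
A link-following web crawler can be sketched as a breadth-first traversal. To keep the example self-contained it crawls a hypothetical in-memory "web" (`FAKE_WEB`) instead of making network requests; `crawl`, `LinkExtractor`, and the `fetch` callback are names invented for this sketch.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None."""
    frontier = deque([seed])
    seen = {seed}
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # avoid re-queueing known pages
                seen.add(link)
                frontier.append(link)
    return pages

# Hypothetical three-page site standing in for the real web.
FAKE_WEB = {
    "/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "/b": '<a href="/a">a</a>',
    "/c": "no links here",
}
pages = crawl("/a", FAKE_WEB.get)
```

A real crawler would also need politeness delays, URL normalization, and a revisit policy to balance coverage against freshness; none of that is modeled here.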

  • Text Acquisition 2. Feed
    • A mechanism for accessing real-time streams of documents
    • A web feed (or news feed) is a data format used for providing users with frequently updated content
    • Search engines subscribe to feeds as well
    • e.g. RSS
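
Reading an RSS feed amounts to parsing an XML document and pulling out its items. The sketch below uses Python's standard `xml.etree.ElementTree`; the feed content in `RSS` and the helper name `read_feed` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 document (illustrative data only).
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

def read_feed(xml_text):
    """Return (title, link) pairs for every <item> in the feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

entries = read_feed(RSS)
```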

  • Text Acquisition 3. Conversion
    • consistent text plus metadata format
    • convert text encoding
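
Encoding conversion can be as simple as trying a list of candidate encodings and re-encoding to one consistent target. This is a rough sketch under that assumption; the function name `to_utf8` and the candidate list are choices made for the example, and real systems usually rely on declared or detected encodings instead.

```python
def to_utf8(raw: bytes, candidates=("utf-8", "euc-kr", "latin-1")) -> bytes:
    """Decode raw with the first candidate encoding that works, re-encode as UTF-8."""
    for encoding in candidates:
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the input")

legacy_bytes = "검색".encode("euc-kr")
converted = to_utf8(legacy_bytes)
```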

  • Text Acquisition 4. Document data store
    • stores metadata, links, and anchor text along with the documents
    • Provides fast access (e.g. for serving search results)
    • a simpler, more efficient storage system than a relational database is sometimes used



  • Text Transformation 1. Parser (splits the text into parts)
    • recognizes structural elements
    • The tokenizer recognizes "words" (must handle capitalization, hyphens, apostrophes, and non-alphabetic characters as separators)
    • For markup languages, uses the markup syntax to identify structure
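
A very simple tokenizer makes the listed decisions concrete. The policy below (lowercase everything, drop apostrophes, treat every other non-alphanumeric character as a separator) is only one possible choice, assumed here for illustration.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric tokens."""
    text = text.lower()            # ignore capitalization
    text = text.replace("'", "")   # o'connor -> oconnor
    # Hyphens and all other non-alphanumeric characters act as separators.
    return re.findall(r"[a-z0-9]+", text)

tokens = tokenize("Mr. O'Connor's e-mail: STATE-OF-THE-ART!")
```

Under this policy "e-mail" becomes two tokens, which shows why hyphen handling is a real design decision: a different tokenizer might keep it as one term.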

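  • Text Transformation 2. Stopping & Stemming
    • Stopping (stopword removal): removes common words
      • They have little value as index terms.
    • Stemming (morphological analysis): groups words derived from a common stem
      • benefits vary for different languages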
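
Stopping and stemming can be sketched in a few lines. The stopword list and the suffix-stripping rule below are deliberately naive and invented for this example; they are not the Porter stemmer or any standard list.

```python
# Illustrative stopword list and suffixes (assumptions, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for"}
SUFFIXES = ("ing", "ed", "es", "s")

def stop(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = ["the", "crawlers", "fetched", "links", "for", "indexing"]
terms = [stem(t) for t in stop(tokens)]
```

A suffix stripper like this over-stems and under-stems many English words and is useless for languages with richer morphology, which is one reason the benefit of stemming varies by language.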

  • Text Transformation 3. Link analysis
    • popularity and authority 
      • e.g. PageRank: ranks pages with many incoming links more highly
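
The intuition that pages with many incoming links score higher can be sketched with a basic power-iteration PageRank. The damping factor 0.85, the iteration count, and the four-page graph `LINKS` are assumptions for this example, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base score from random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            targets = outlinks or pages  # dangling pages spread rank evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: A, B and D all link to C; C links back to A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(LINKS)
```

Here C ends up with the highest score because it has three incoming links, and A comes second because it is linked from the highly ranked C.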

  • Text Transformation 4. Information Extraction & Classifier
    • Information Extraction
      • named entity recognizer
    • Classifier
      • assign labels
      • Clustering without predefined categories



  • Index Creation 1. Document Statistics
    • counts and positions of words
    • Used in ranking algorithm
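
Gathering counts and positions for one document is a single pass over its tokens. The output layout (a dict with `count` and `positions` keys) is just one convenient shape assumed for this sketch.

```python
from collections import defaultdict

def term_statistics(tokens):
    """Return, for each term, how often and at which positions it occurs."""
    positions = defaultdict(list)
    for position, term in enumerate(tokens):
        positions[term].append(position)
    return {term: {"count": len(p), "positions": p}
            for term, p in positions.items()}

stats = term_statistics(["to", "be", "or", "not", "to", "be"])
```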

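  • Index Creation 2. Weighting
    • computes relative weights for index terms
    • a weight reflects the relative importance of a term within a document
    • tf: term frequency
    • idf: inverse document frequency (terms that appear in fewer documents are more informative, so the inverse is multiplied in)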
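
One common way to combine the two factors is tf multiplied by log(N / df). Many tf-idf variants exist, so the exact formula below is an assumption chosen for this sketch, as are the toy documents in `DOCS` (each assumed to be non-empty).

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps doc_id -> list of tokens; returns doc_id -> {term: weight}."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return weights

DOCS = {
    "d1": ["search", "engine", "index"],
    "d2": ["search", "query", "ranking"],
    "d3": ["search", "index", "index"],
}
weights = tf_idf(DOCS)
```

In this toy collection "search" occurs in every document, so its idf is log(1) = 0 and it contributes nothing, while "index" gets a higher weight in d3, where it occurs twice.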

  • Index Creation 3. Inversion & Index Distribution
    • Inversion
      • Core of indexing process
      • converts document-term information into term-document information
      • covers every word in the vocabulary except stopwords
      • Must also handle updates
    • Index Distribution
      • Essential for fast query processing
      • both indexing and query processing can be done in parallel
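
Inversion itself, turning document-term data into term-document data, fits in a short function. The sketch keeps only document ids in each posting list and leaves out positions, weights, updates, and distribution; `invert` is a name chosen for the example.

```python
from collections import defaultdict

def invert(docs):
    """docs maps doc_id -> list of tokens; returns term -> sorted list of doc_ids."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id]):  # each document listed once per term
            index[term].append(doc_id)
    return dict(index)

index = invert({
    "d1": ["search", "engine"],
    "d2": ["search", "index", "index"],
})
```

Because documents are visited in sorted order, each posting list comes out sorted by document id, which is what lets posting lists be merged efficiently at query time.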



  • User Interaction 1. Query input
  • User Interaction 2. Query transformation
    • Spell checking, query suggestion
    • query expansion, relevance feedback
  • User Interaction 3. Results output
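
Spell checking in query transformation is often explained in terms of edit distance. The sketch below computes plain Levenshtein distance and picks the closest vocabulary word; the `suggest` helper, its `max_distance` threshold, and the tiny vocabulary are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def suggest(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or the word itself if none is close."""
    best = min(vocabulary, key=lambda v: (edit_distance(word, v), v))
    return best if edit_distance(word, best) <= max_distance else word

correction = suggest("documnet", ["document", "index", "query"])
```

Production spell checkers also weigh candidates by how frequent they are in query logs, which this sketch does not attempt.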

  • Ranking
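    • Scoring: $\sum_i q_i \cdot d_i$
      • Be clear about what q_i and d_i mean.
      • q_i is the weight of term i in the query; d_i is the weight of term i in the document
      • Many ranking algorithms exist, e.g. the vector space model
    • Performance Optimization
      • Compare term-at-a-time with document-at-a-time processing
      • Compare safe vs. unsafe optimizations
    • Distribution
      • Concerns how the search engine itself is distributed; a query can be split up and processed in parts.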
      • Caching
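
The scoring sum can be evaluated term-at-a-time: process one query term's posting list after another and keep a running score per document. The index layout (term to a list of `(doc_id, d_i)` pairs) and the weights in `INDEX` are assumptions for this sketch.

```python
from collections import defaultdict

def rank(query_weights, index, k=10):
    """Score documents as the sum of q_i * d_i over the query terms.

    query_weights: term -> q_i
    index: term -> list of (doc_id, d_i) postings
    """
    accumulators = defaultdict(float)
    for term, q_i in query_weights.items():        # one term at a time
        for doc_id, d_i in index.get(term, []):
            accumulators[doc_id] += q_i * d_i
    # Highest score first; ties broken by doc_id for a stable order.
    ranked = sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

INDEX = {
    "search": [("d1", 1.0), ("d2", 1.0)],
    "index": [("d1", 2.0), ("d3", 1.0)],
}
results = rank({"search": 1.0, "index": 1.0}, INDEX)
```

Document-at-a-time processing would instead walk all query-term posting lists in parallel and finish one document's score before moving to the next, which avoids keeping an accumulator for every candidate document.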



  • Evaluation
    • Logging
      • which queries users submitted
      • which results were shown and which rank position was clicked
      • dwell time and various other signals are recorded
    • Ranking analysis (effectiveness)
    • Performance analysis (efficiency)
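
Logged clicks can be turned into a rough effectiveness number. As one illustrative choice (not something these notes prescribe), the sketch below computes the mean reciprocal rank of the clicked result over a hypothetical log; the field names in `LOG` are invented for the example.

```python
from statistics import mean

# Hypothetical query log: clicked_rank is None when nothing was clicked.
LOG = [
    {"query": "search engine", "clicked_rank": 1, "dwell_seconds": 40.0},
    {"query": "inverted index", "clicked_rank": 2, "dwell_seconds": 10.0},
    {"query": "pagerank", "clicked_rank": None, "dwell_seconds": 0.0},
]

def mean_reciprocal_rank(log):
    """Average of 1/rank of the clicked result; queries with no click count as 0."""
    return mean(1.0 / entry["clicked_rank"] if entry["clicked_rank"] else 0.0
                for entry in log)

def mean_dwell_time(log):
    """Average dwell time over the queries that led to a click."""
    clicked = [entry["dwell_seconds"] for entry in log if entry["clicked_rank"]]
    return mean(clicked) if clicked else 0.0

mrr = mean_reciprocal_rank(LOG)
dwell = mean_dwell_time(LOG)
```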



Source: W. Bruce Croft, Donald Metzler, Trevor Strohman,
Search Engines:
Information Retrieval in Practice

