What Are Information Retrieval and Search Engines? (Information Retrieval and Search Engine)

2016. 10. 18. 16:33, Computer/Information Retrieval

The problem of IR

Find documents relevant to an information need.

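Definition

- A system that collects the information users need (the information need), analyzes its content, organizes it into a form that is easy to search, and finds and provides that information when a request for it arises.

- IR covers the structure, analysis, organization, storage, searching, and retrieval of information.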

Documents vs. DB Records

- A document has both content and structure.

- Database records are made up of well-defined fields

- Easy to compare fields

- Text is more difficult (unstructured documents)

- The query keywords have to be compared against the full text of each news story.

Defining a good match is the core issue; to match better, documents have to be ranked.
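
To make the idea of matching and ranking concrete, here is a minimal sketch, assuming the crudest possible notion of a match: count how often the query's terms occur in each story and sort by that count. The names `tokenize`, `overlap_score`, and `rank`, and the sample stories, are made up for illustration; real retrieval models are considerably more refined.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query, document):
    """Count how many times the query's distinct terms occur in the document."""
    doc_counts = Counter(tokenize(document))
    return sum(doc_counts[term] for term in set(tokenize(query)))

def rank(query, documents):
    """Return (score, doc_id) pairs sorted from best to worst match."""
    scored = [(overlap_score(query, text), doc_id) for doc_id, text in documents.items()]
    # Higher score first; ties are broken alphabetically by document id.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical news stories used only for this illustration.
stories = {
    "story1": "Bank raises interest rates as inflation climbs",
    "story2": "River bank floods after heavy rain",
    "story3": "Central bank holds interest rates steady, bank officials say",
}
ranking = rank("bank interest rates", stories)
print(ranking)  # [(4, 'story3'), (3, 'story1'), (1, 'story2')]
```

Note that "story2" still matches on the word "bank" even though it is about a river, which is why keyword matching alone is not enough and the quality of the ranking matters.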

Dimensions of IR
- IR is more than just text and more than just web services.
- It spans different media, different types of search applications, and different tasks.
- New media: video, photos, music, speech, etc.

IR Tasks

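- Ad-hoc search (query-based search)
  - Find relevant documents for an arbitrary text query
  - "Ad hoc" in the general sense: a non-generalizable method that solves only the problem at hand, a rule-of-thumb approach rather than a universally valid one
- Filtering
  - Identify relevant user profiles for a new document
  - With the query fixed, decide whether each incoming document is filtered in or filtered out
- Classification
  - Identify relevant labels for documents
- Question answering
  - Give a specific answer to a question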
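
The filtering task reverses the direction of ad-hoc search: the profiles stay fixed and documents arrive one at a time. A minimal sketch, assuming each user profile is just a set of keywords and a document is filtered in when it shares at least `min_overlap` terms with the profile (the function name, the threshold, and the sample profiles are all hypothetical):

```python
import re

def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matching_profiles(document, profiles, min_overlap=2):
    """Return the names of profiles sharing at least min_overlap terms with the document."""
    doc_terms = tokenize(document)
    return sorted(
        name for name, terms in profiles.items()
        if len(doc_terms & terms) >= min_overlap
    )

# Hypothetical standing profiles: each user's long-term interests as keywords.
profiles = {
    "alice": {"search", "engine", "ranking"},
    "bob": {"football", "league", "transfer"},
    "carol": {"index", "search", "crawler"},
}
new_doc = "A new ranking method for the search engine index"
delivered_to = matching_profiles(new_doc, profiles)
print(delivered_to)  # ['alice', 'carol']
```

The document is filtered in for "alice" and "carol" and filtered out for "bob".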

Big Issues in IR
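- Relevance
  - A relevant document is one that contains the information the user wants.
  - Many factors go into judging relevance, which makes it subjective.
  - e.g., task, context, novelty, style
  - Topical relevance vs. user relevance
  - Vocabulary mismatch problem: the same concept can be expressed with many different words. A document whose wording differs from the query is still relevant if its content matches.
  - Retrieval models: what counts as relevant depends on the retrieval model (each model takes a different view of relevance).
  - Ranking algorithms are based on retrieval models.
  - Ranking relies far more on statistical properties of text than on linguistic structure.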
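- Evaluation
  - How well does the system retrieve? Measured with a test collection, queries, and relevance judgments
  - Does it rank the documents already known to be relevant near the top?
  - Measures: recall (how many of the relevant documents are found) / precision (how many of the retrieved documents are relevant)
  - The two are in a trade-off relationship.
  - Effectiveness (quality of the results)
  - Efficiency (speed and resource use)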
- Users and information needs
  - Keywords alone are not enough to express a query.
  - Understanding user intent is important.
  - Query refinement techniques (query expansion, query suggestion, relevance feedback) improve ranking
- IR and Search Engines
  - The practical application of IR techniques to large-scale text collections
  - Web search engines are the best-known example
  - The index has to be updated frequently.
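
The recall and precision measures from the Evaluation item can be written down directly. The sketch below assumes a set of relevance judgments for one query and a ranked result list, both invented for illustration, and evaluates the ranking at three cutoff depths to show the trade-off.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list against a set of relevant ids."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty inputs so the ratios are always defined.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical ranked output of a search engine and the judged relevant set.
ranked = ["d3", "d1", "d7", "d2", "d9", "d4"]
relevant = {"d1", "d2", "d3", "d5"}

results = {k: precision_recall(ranked[:k], relevant) for k in (2, 4, 6)}
for k, (p, r) in results.items():
    print(f"top {k}: precision={p:.2f} recall={r:.2f}")
# top 2: precision=1.00 recall=0.50
# top 4: precision=0.75 recall=0.75
# top 6: precision=0.50 recall=0.75
```

Returning more results can only keep recall the same or raise it, while precision tends to fall as non-relevant documents are included.
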
Search Engine Issues
- Performance
  - Indexes are data structures designed to improve search efficiency
- Dynamic data
  - The collection is constantly changing
  - Crawling: new documents are fetched continuously.
  - Coverage (how much has been indexed) / freshness (how recently it was indexed)
- Scalability
  - Millions of users, terabytes of documents
  - Distributed processing
- Adaptability
  - Changing and tuning search engine components such as the ranking algorithm, the indexing strategy, and the user interface
- Spam
  - Adversarial IR
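
As an illustration of an index as a data structure for search efficiency, here is a minimal sketch of an inverted index, assuming a plain in-memory mapping from each term to the set of documents containing it. The class `InvertedIndex` and the sample documents are hypothetical; production engines add compression, positional information, and distribution across machines.

```python
import re
from collections import defaultdict

class InvertedIndex:
    """Toy in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, doc_id, text):
        """Index a newly crawled document without rebuilding the whole index."""
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents containing every query term."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return []
        # Intersect posting sets instead of scanning every document.
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)

index = InvertedIndex()
index.add_document("d1", "web search engines crawl pages")
index.add_document("d2", "search engines rank documents")
before = index.search("search engines")

# Dynamic data: a freshly crawled document is added incrementally.
index.add_document("d3", "new search engines index fresh pages")
after = index.search("search engines")
print(before, after)  # ['d1', 'd2'] ['d1', 'd2', 'd3']
```

The lookup touches only the posting sets of the query terms, and adding "d3" shows how coverage and freshness depend on keeping the index updated as crawling continues.
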
Conferences

ACM SIGIR: Association for Computing Machinery Special Interest Group on Information Retrieval
CIKM: Conference on Information and Knowledge Management
WSDM: Web Search and Data Mining Conference
WWW: World Wide Web Conference
HLT: Human Language Technologies
ACL, NAACL, EACL: Association for Computational Linguistics
ECIR: European Conference on Information Retrieval
AIRS: Asian Information Retrieval Symposium

Source: W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice
