Ⅳ. Index Construction

Which index strategy should we choose?

Reading from memory is much faster than reading from disk.

No data is transferred until the disk head has located the data.

Time a disk read takes

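Seek Time : time to move the disk head to the track that holds the data
Latency : time waiting for the target sector to rotate under the head (rotational latency)
Transfer Time : time to actually transfer the data

Total read time = Access Time (Seek Time + Latency) + Transfer Time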
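The formula above can be turned into a rough estimate. The following is a minimal sketch, not a measurement: the function name `read_time` and all three timing values (5 ms seek, 3 ms latency, 0.02 microseconds per byte) are illustrative assumptions, not figures from these notes. It compares reading 10 MB as one contiguous chunk against reading the same 10 MB as 100 scattered chunks.

```python
def read_time(num_bytes, num_seeks,
              seek_s=5e-3,                # assumed seek time per access (illustrative)
              latency_s=3e-3,             # assumed rotational latency per access (illustrative)
              transfer_s_per_byte=2e-8):  # assumed transfer time per byte (illustrative)
    """Estimate total read time = access time (seek + latency) + transfer time."""
    access = num_seeks * (seek_s + latency_s)
    transfer = num_bytes * transfer_s_per_byte
    return access + transfer


ten_mb = 10_000_000
one_chunk = read_time(ten_mb, num_seeks=1)    # one contiguous read
scattered = read_time(ten_mb, num_seeks=100)  # same data in 100 separate places

print(f"contiguous: {one_chunk:.3f} s")
print(f"scattered:  {scattered:.3f} s")
```

Under these assumed numbers the transfer time is identical in both cases, and the whole difference comes from the repeated access time. This is why index construction tries to read and write data in large sequential chunks.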

RCV1 Collection

Shakespeare's works are not a large enough collection.

Sufficiently large collections are usually not public, but RCV1 is.

Let's use RCV1 to try out scalable index construction algorithms.

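800,000 docs * 200 tokens per doc (including articles such as a, the) = 160,000,000 tokens (0.16 billion) <positional postings>
number of distinct terms : 400,000
average token length including spaces and punctuation such as dot(.) : 6 bytes
average token length excluding spaces and punctuation : 4.5 bytes
average term length : 7.5 bytes
<non-positional postings> : 100,000,000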
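As a quick check of the arithmetic, the statistics above can be combined in a few lines. This is a back-of-the-envelope sketch: the constant names are made up for illustration, and the derived quantities (raw text size, size of the term strings, average postings per term) are simple products and ratios of the listed numbers, not values stated in these notes.

```python
# Statistics taken from the list above.
NUM_DOCS = 800_000
AVG_TOKENS_PER_DOC = 200
BYTES_PER_TOKEN_WITH_SEP = 6      # including spaces and punctuation
NUM_TERMS = 400_000
BYTES_PER_TERM = 7.5
NONPOSITIONAL_POSTINGS = 100_000_000

# Derived quantities (rough estimates only).
total_tokens = NUM_DOCS * AVG_TOKENS_PER_DOC                     # positional postings
raw_text_bytes = total_tokens * BYTES_PER_TOKEN_WITH_SEP         # approximate text size
term_string_bytes = NUM_TERMS * BYTES_PER_TERM                   # term strings only
avg_postings_per_term = NONPOSITIONAL_POSTINGS / NUM_TERMS       # docs per term on average

print(f"total tokens:          {total_tokens:,}")
print(f"raw text size:         {raw_text_bytes / 1e9:.2f} GB")
print(f"term strings:          {term_string_bytes / 1e6:.1f} MB")
print(f"avg postings per term: {avg_postings_per_term:.0f}")
```

The token count is 160,000,000, so "0.16" refers to billions of tokens, while the text itself comes to roughly 1 GB at 6 bytes per token.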

ex> a, the, wish, wishes, a, the

token avg : (1+3+4+6+1+3)/6 = 3
term avg : (1+3+4)/3 ≈ 2.67 (distinct terms: a, the, wish; wishes is counted under wish)
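The difference between the two averages can be reproduced with a short sketch. The `NORMALIZE` mapping below is a hypothetical stand-in for whatever normalization the example has in mind; it is assumed here only because the term average uses three terms of lengths 1, 3, and 4.

```python
tokens = ["a", "the", "wish", "wishes", "a", "the"]

# Hypothetical normalization: map "wishes" to "wish" (assumption for this example).
NORMALIZE = {"wishes": "wish"}

# Terms are the distinct normalized tokens.
terms = sorted({NORMALIZE.get(t, t) for t in tokens})

# Token average counts every occurrence; term average counts each term once.
token_avg = sum(len(t) for t in tokens) / len(tokens)
term_avg = sum(len(t) for t in terms) / len(terms)

print(terms)
print(f"token avg: {token_avg:.2f}")
print(f"term avg:  {term_avg:.2f}")
```

The token average is taken over all six occurrences, so repeated words such as a and the are counted each time they appear, while the term average counts each distinct term once.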