Efficient and Effective Search Services Over Archival Webs
NSF AWARD IIS-0803605
The Web is enormous and in constant flux, causing much content to be
lost over time. Historical collections of web content are thus of
monumental value in preserving records of significant aspects of modern
society. The Internet Archive offers access to hundreds of billions of
historical web page snapshots. The scale of such archives, however,
presents tremendous challenges to making this content fully searchable.
This NSF-funded research effort investigates efficient and effective approaches to
store, index, and retrieve web content from large-scale historical
archives. In addition, the temporal content and structure of the
archives are mined to exploit temporal characteristics that can improve
search result ranking. Technological advances from this work will be
tested in collaboration with the Internet Archive on its content and
potentially integrated into its infrastructure, enabling new archival
search capabilities for the public.
Participants:
- Investigators: Brian D. Davison (PI: Lehigh University), Torsten Suel (co-PI: NYU Polytechnic), Kris Carpenter Negulescu (Internet Archive), and Gordon Mohr (Internet Archive).
- Research Assistants: Josh Attenberg, Na Dai, Shuai Ding, Jinru He, Liangjie Hong, Xiaoguang Qi, Zhenzhen Xue, Hao Yan, and Junyuan Zeng.
- Additional staff at the Internet Archive: Brad Tofel, Vinay Goel, and Aaron Binns.
Publications:
- H. Yan, S. Ding, and T. Suel. (2009) Inverted Index Compression and Query Processing with Optimized Document Ordering. In Proceedings of the 18th International World Wide Web Conference (WWW), pages 401-410, Madrid, Spain, ACM Press, April.
- N. Dai, B. D. Davison and X. Qi. (2009) Looking into the Past to Better Classify Web Spam. In Proceedings of the Fifth International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pages 1-8, Madrid, Spain, ACM Press, April.
- J. He, H. Yan, and T. Suel. (2009) Compact Full-Text Indexing of Versioned Document Collections. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), pages 415-424, Hong Kong, ACM Press, November.
- N. Dai and B. D. Davison. (2009) Vetting the Links of the Web. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), pages 1745-1748, Hong Kong, ACM Press, November.
- N. Dai and B. D. Davison. (2010) Mining Anchor Text Trends for Retrieval. In Proceedings of the 32nd European Conference on Information Retrieval (ECIR), LNCS 5993, pages 127-139, Milton Keynes, UK, Springer, March.
- S. Ding, J. Attenberg, and T. Suel. (2010) Scalable Techniques for Document Identifier Assignment in Inverted Indexes. In Proceedings of the 19th International World Wide Web Conference (WWW), pages 311-320, Raleigh, NC, ACM Press, April.
- N. Dai and B. D. Davison. (2010) Freshness Matters: In Flowers, Food, and Web Authority. In Proceedings of the 33rd Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pages 114-121, Geneva, Switzerland, ACM Press, July.
- N. Dai and B. D. Davison. (2010) Capturing Page Freshness for Web Search. In Proceedings of the 33rd Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pages 871-872, Geneva, Switzerland, ACM Press, July.
- J. He, J. Zeng and T. Suel. (2010) Improved Index Compression Techniques for Versioned Document Collections. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), pages 1239-1248, Toronto, Canada, October.
- S. Ding, J. Attenberg, R. Baeza-Yates and T. Suel. (2011) Batch Query Processing for Web Search Engines. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM), pages 137-146, Hong Kong, February.
- N. Dai, X. Qi and B. D. Davison. (2011) Enhancing Web Search with Entity Intent. In Companion Proceedings of the 20th International World Wide Web Conference, pages 29-30, Hyderabad, India, March.
- N. Dai, X. Qi and B. D. Davison. (2011) Bridging Link and Query Intent to Enhance Web Search. In Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia, pages 17-26, Eindhoven, The Netherlands, June.
- N. Dai, M. Shokouhi and B. D. Davison. (2011) Multi-Objective Optimization in Learning to Rank. In Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1241-1242, Beijing, China, July.
- J. He and T. Suel. (2011) Faster Temporal Range Queries over Versioned Text. In Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 565-574, Beijing, China, July.
- S. Ding and T. Suel. (2011) Faster Top-k Document Retrieval Using Block-Max Indexes. In Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 993-1002, Beijing, China, July.
- N. Dai, M. Shokouhi and B. D. Davison. (2011) Learning to Rank for Freshness and Relevance. In Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 95-104, Beijing, China, July.
- L. Hong, D. Yin, J. Guo and B. D. Davison. (2011) Tracking Trends: Incorporating Term Volume into Temporal Topic Models. In Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 484-492, San Diego, August.
- Y. Avcular and T. Suel. (2011) Scalable Manipulation of Archival Web Graphs. In Proceedings of the CIKM Workshop on Large-Scale and Distributed Systems for Information Retrieval (LSDS-IR), pages 27-32, Glasgow, Scotland, October.
- M. Christoforaki, J. He, C. Dimopoulos, A. Markowetz and T. Suel. (2011) Text vs. Space: Efficient Geo-Search Query Processing. In Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM), pages 423-432, Glasgow, Scotland, October.
- D. Shan, S. Ding, J. He, H. Yan and X. Li. (2012) Optimized Top-k Processing with Global Page Scores on Block-Max Indexes. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM), pages 423-432, Seattle, WA, February.
- L. Hong, A. Ahmed, S. Gurumurthy, A. Smola and K. Tsioutsiouliklis. (2012) Discovering Geographical Topics in the Twitter Stream. In Proceedings of the 21st International World Wide Web Conference (WWW), pages 769-778, Lyon, France, April.
- N. Dai. (2012) Building Contextual Anchor Text Representation Using Graph Regularization. In Proceedings of the 26th Conference on Artificial Intelligence (AAAI), pages 24-30, Toronto, Canada, July.
- L. Hong, R. Bekkerman, J. Adler, and B. D. Davison. (2012) Learning to Rank Social Update Streams. In Proceedings of the 35th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pages 651-660, Portland, OR, August.
- J. He and T. Suel. (2012) Optimizing Positional Index Structures for Versioned Document Collections. In Proceedings of the 35th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pages 245-254, Portland, OR, August.
- D. Arroyuelo, S. Gonzalez, M. Marin, M. Oyarzun, and T. Suel. (2012) To Index or not to Index: Time-Space Trade-Offs in Search Engines with Positional Ranking Functions. In Proceedings of the 35th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pages 255-265, Portland, OR, August.
- S. Ding. (2013) Index Compression and Efficient Query Processing in Large Web Search Engines. Doctoral Dissertation, Department of Computer Science and Engineering, NYU-Poly, January.
- J. He. (2013) Indexing and Querying over Versioned Text. Doctoral Dissertation, Department of Computer Science and Engineering, NYU-Poly, January.
- L. Hong, A. Doumith, and B. D. Davison. (2013) Co-Factorization Machines: Modeling User Interests and Predicting Individual Decisions in Twitter. In Proceedings of the 6th Annual ACM International Conference on Web Search and Data Mining (WSDM), pages 557-566, Rome, Italy, February. Nominated for best paper award.
- L. Hong. (2013) Mining and Understanding Online Conversational Media. Doctoral Dissertation, Department of Computer Science and Engineering, Lehigh University, May.
- N. Dai. (2013) Mining Web Dynamics for Search. Doctoral Dissertation, Department of Computer Science and Engineering, Lehigh University, May.
This research grant supports, in part, research in the WUME Lab of the
Computer Science and Engineering Department at Lehigh University and
the WEST Lab of the Computer Science and Engineering Department at
NYU Poly.
This material is based upon work supported by the National Science
Foundation under
Grant No. 0803605 (III-COR-Medium: Efficient and Effective Search Services Over
Archival Webs). Any opinions, findings, and
conclusions or recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of the
National Science Foundation.
Last modified: 13 June 2013