Weakly Supervised Extraction of Computer Security Events from Twitter

18 May 2015

conference paper
Published by Association for Computing Machinery (ACM)

p. 896-905
https://doi.org/10.1145/2736277.2741083

Abstract

Twitter contains a wealth of timely information; however, staying on top of breaking events requires that an information analyst constantly scan many sources, leading to information overload. For example, a user might wish to be made aware whenever an infectious disease outbreak takes place, when a new smartphone is announced, or when a distributed Denial of Service (DoS) attack might affect an organization's network connectivity. There are many possible event categories an analyst may wish to track, making it impossible to anticipate all those of interest in advance. We therefore propose a weakly supervised approach in which extractors for new categories of events are easy to define and train by specifying a small number of seed examples. We cast seed-based event extraction as a learning problem where only positive and unlabeled data is available. Rather than assuming unlabeled instances are negative, as is common in previous work, we propose a learning objective which regularizes the label distribution towards a user-provided expectation. Our approach greatly outperforms heuristic negatives, used in most previous work, in experiments on real-world data. Significant performance gains are also demonstrated over two novel and competitive baselines: semi-supervised EM and one-class support-vector machines. We investigate three security-related events breaking on Twitter: DoS attacks, data breaches and account hijacking. A demonstration of security events extracted by our system is available at: http://kb1.cse.ohio-state.edu:8123/events/hacked
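
The idea of learning from positive and unlabeled data while regularizing the label distribution towards a user-provided expectation can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the form of the objective, so the logistic-regression model, the KL-divergence penalty, the hyperparameters (lam, l2, lr, steps), the toy data and all function names below are assumptions chosen for illustration only.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    # Probability that feature vector x is a positive (event) instance.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_pu_expectation(pos, unl, target, lam=5.0, l2=0.01, lr=0.1, steps=500):
    """Assumed objective, minimized by plain gradient descent:
    mean negative log-likelihood on labeled positives
    + lam * KL(target || mean predicted probability on unlabeled data)
    + L2 penalty. Unlabeled instances are never treated as negatives."""
    d = len(pos[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [l2 * wi for wi in w], 0.0
        # Labeled positives: d(-log p)/dz = -(1 - p).
        for x in pos:
            g = -(1.0 - predict(w, b, x)) / len(pos)
            gb += g
            for j in range(d):
                gw[j] += g * x[j]
        # Unlabeled data: pull the average prediction towards `target`.
        probs = [predict(w, b, x) for x in unl]
        p_hat = min(max(sum(probs) / len(probs), 1e-6), 1.0 - 1e-6)
        dkl = -target / p_hat + (1.0 - target) / (1.0 - p_hat)
        for x, p in zip(unl, probs):
            g = lam * dkl * p * (1.0 - p) / len(unl)
            gb += g
            for j in range(d):
                gw[j] += g * x[j]
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b


def make_toy_data(seed=0, n_pos=40, n_unl=400, pos_rate=0.1):
    # Hypothetical 2-D data: positives near (2, 2), negatives near (-2, -2).
    # `hidden` holds the true labels of the unlabeled set (not used in training).
    rng = random.Random(seed)

    def draw(mu):
        return [rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)]

    pos = [draw(2.0) for _ in range(n_pos)]
    n_hidden_pos = int(n_unl * pos_rate)
    hidden = [1] * n_hidden_pos + [0] * (n_unl - n_hidden_pos)
    unl = [draw(2.0 if y else -2.0) for y in hidden]
    return pos, unl, hidden
```

In this sketch the user-provided expectation is the single number target, the expected fraction of unlabeled instances that are true events. Calling train_pu_expectation(pos, unl, target=0.1) on the toy data returns weights for which the average predicted probability over the unlabeled set is pulled towards 0.1, without any instance being labeled negative.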

Funding Information

Department of Defense (FA8721-05-C-0003)
DARPA (FA8750-13-2-0005)

Cited by 78 articles