Crowdsourcing IP reputation data from online forums

In this post, I discuss crowdsourcing IP reputation data from online forums. The post is inspired by a paper I read recently: Gharibshah J., Papalexakis E.E., Faloutsos M. (2018) RIPEx: Extracting Malicious IP Addresses from Security Forums Using Cross-Forum Learning. In: Phung D., Tseng V., Webb G., Ho B., Ganji M., Rashidi L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10939. Springer, Cham. The paper is also available on arXiv.

I describe some concerns I see with crowdsourcing security-related data and discuss potential solutions.

In fighting fraud and malicious behavior, organizations are usually best served by internal data sources and institutional knowledge when quantifying risk. Trust and Safety groups in organizations that deal with user-generated content, process payments or serve visitors online often have access to data science models that derive risk scores from a variety of internal data sources. These risk scores are then used to apply friction, such as throttling services or APIs, rate-limiting user behavior and, in egregious cases, banning users from the platform or from accessing the service.

The importance of internal data sources for assessing risk is obvious. One compelling argument for incorporating external data sources as well is that they act as a deterrent against bad actors who have not yet visited your service but may do so in the future. Additionally, increasing your inventory of known high-risk entities is a reasonable endeavor, if not a necessary one. For organizations that wish to harden their services and systems from the outset (for example, at a new product or service release), a blocklist is essential and can often be derived from freely available online data.

Popular crowdsourced blocklists already exist for ads, throwaway email services and hosting providers, and organizations can leverage them to improve their defenses. These lists are often published in easily consumable text formats, with accompanying code for easy integration, and should perhaps be considered a first step in consolidating and using the security information available online. Many of the concerns that plague sourcing information from online forums are mitigated when consuming lists of this kind. Some of these sources are public repositories on GitHub, which reveal the level of activity, tenure and maturity of the repository. This is invaluable when sourcing online data to incorporate into internal decision-making processes, especially considering that real users can be impacted. Denying service to real users because of false positives not only costs businesses real revenue, but also denies users access to online services that are critical to their professional workflows and personal convenience.
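As a rough illustration of how little code it takes to consume such a list, here is a minimal Python sketch. It assumes a plain-text format with one IP address or CIDR range per line and "#" for comments; real lists vary, so the format, the function names and the sample entries (taken from the reserved documentation ranges) are all assumptions of mine rather than any particular project's convention.

```python
import ipaddress


def parse_blocklist(text):
    """Parse one IP or CIDR range per line; skip blanks, comments and bad entries."""
    networks = []
    for raw_line in text.splitlines():
        entry = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not entry:
            continue
        try:
            # strict=False accepts entries such as 192.0.2.5/24 with host bits set
            networks.append(ipaddress.ip_network(entry, strict=False))
        except ValueError:
            continue  # malformed entry: ignore rather than fail the whole load
    return networks


def is_blocked(ip, networks):
    """Return True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in networks)


# Hypothetical list contents, using documentation-only address ranges.
SAMPLE_LIST = """\
# sample blocklist
192.0.2.0/24
198.51.100.7   # single host
not-an-ip
2001:db8::/32
"""

blocklist = parse_blocklist(SAMPLE_LIST)
print(is_blocked("192.0.2.55", blocklist))
```

A linear scan like this is fine for small lists; for lists with many thousands of ranges a prefix tree or a sorted interval lookup would likely be a better fit.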

Web crawling, a necessity for crowdsourcing data from online forums, isn't a trivial undertaking. Commercial scraping services can mitigate some of the challenges associated with web scraping. The tradeoff is to find (and then crawl) enough forums to generate reliable and varied data while also managing the cost of maintaining and updating crawlers.

Parsing crawled data is another challenge that requires careful quality control: verifying IP addresses, tuning the crawl frequency of target sites, dealing with the staleness of forum data, and guarding against adversarial attacks in which forum data is poisoned to spoil the integrity of anti-fraud models or blocklists. These challenges are compounded when dealing with a large number of forums that vary in structure and markup semantics, which can complicate and overwhelm the development of forum-specific parsers.

Many of these challenges are costly to solve but not insurmountable. IP addresses are well understood, and most programming languages have mature libraries that validate IP address strings against the standards. Forum integrity can largely be judged by the popularity of a forum in the security community, the volume and recency of its activity and the quality of its discussions. For security professionals, it may be relatively easy to identify which forums have a high signal-to-noise ratio and which to ignore. This reduces parsing complexity by reducing the number of forums to crawl and parse.
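To make the IP verification step concrete, here is a small Python sketch built on the standard library's ipaddress module. The regular expression, the function name and the sample post are my own illustrative assumptions, and the approach only handles plainly written IPv4 addresses; it is not the extraction method used in the RIPEx paper.

```python
import ipaddress
import re

# Candidate dotted quads that are not part of a longer dotted number
# (so "1.2.3.4.5" is skipped) but may be followed by a sentence-ending period.
CANDIDATE_IPV4 = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\.?\d)")


def extract_public_ipv4(text):
    """Return the unique, valid, publicly routable IPv4 addresses found in text."""
    found = []
    for match in CANDIDATE_IPV4.finditer(text):
        try:
            addr = ipaddress.IPv4Address(match.group())
        except ValueError:
            continue  # e.g. 999.1.1.1 looks like an IP but is not one
        # Private, loopback and other reserved ranges are useless on a blocklist.
        if addr.is_global and addr not in found:
            found.append(addr)
    return found


# Invented forum post, for illustration only.
post = "C2 seen at 8.8.8.8. Also 10.0.0.1, 999.1.1.1, 127.0.0.1 and build 1.2.3.4.5"
print([str(a) for a in extract_public_ipv4(post)])
```

A pattern like this still matches four-part version strings that happen to look like addresses, which is one reason the surrounding post context matters.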

Concerns about fake or poisoned data can be addressed by cross-referencing entries against other reliable security forums and against the well-formed lists available on GitHub. The absence of any overlap is not a sure sign of bad data, but it can be used as a signal to down-weight those entities or discard them altogether. Additionally, the usernames attached to crawled forum posts can be used to build a user reputation over time, which helps filter out data from users who post infrequently, share information that never overlaps with other reliable sources, or post messy data.
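The cross-referencing and user reputation ideas can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the (source, username, ip) record shape, the threshold of two independent sources, and the definition of reputation as the fraction of a user's reported IPs that other sources corroborate.

```python
from collections import defaultdict


def corroboration_counts(reports):
    """Map each IP to the number of distinct sources that reported it."""
    sources_by_ip = defaultdict(set)
    for source, _username, ip in reports:
        sources_by_ip[ip].add(source)
    return {ip: len(sources) for ip, sources in sources_by_ip.items()}


def user_reputation(reports, min_sources=2):
    """Fraction of each user's reported IPs seen in at least min_sources sources."""
    counts = corroboration_counts(reports)
    ips_by_user = defaultdict(set)
    for source, username, ip in reports:
        ips_by_user[(source, username)].add(ip)
    return {
        user: sum(counts[ip] >= min_sources for ip in ips) / len(ips)
        for user, ips in ips_by_user.items()
    }


# Hypothetical reports: (source, username, ip); a curated list has no username.
reports = [
    ("forum_a", "alice", "203.0.113.5"),
    ("forum_b", "bob", "203.0.113.5"),
    ("forum_a", "alice", "203.0.113.9"),
    ("curated_list", "", "203.0.113.9"),
    ("forum_a", "mallory", "198.51.100.1"),
]

print(corroboration_counts(reports))
print(user_reputation(reports))
```

An IP reported by a single source is not necessarily bad data, so a score like this is better used to down-weight entries than to drop them outright.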

While the cost of crowdsourcing data from online security forums may seem prohibitive, it has some clear advantages over the IP reputation data available from blocklists or IP databases such as MaxMind. Some advantages are:

  • Availability of information in the early phases of the attack life-cycle. A user or group of users may post details that would otherwise take much longer to percolate into more conventional data repositories.
  • Forums may contain rich details posted by the user that provide additional context around the IP addresses shared. For example, knowing whether an IP belongs to a botnet, hosting provider or anonymous proxy gives downstream systems additional features to aid decision making. Further details, such as a specific botnet name, can provide useful tags for downstream analytics.
  • Access to multiple, highly reliable oracles. Security researchers and white-hat hackers are always on the lookout for threats and threat intelligence, and a large number of them share their insights freely and openly online. The information shared is valuable and multi-faceted, and can provide additional levers with which to improve defenses.
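To illustrate how the context described in the list above could reach downstream systems, here is a hedged Python sketch of a record for a single forum sighting and a function that flattens it into model features. The field names, category labels and feature names are hypothetical choices of mine, and "examplebot" is a placeholder rather than a real botnet name.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IpSighting:
    """One IP address mentioned in a forum post, with the context around it."""
    ip: str
    forum: str
    posted_at: datetime
    category: str = "unknown"  # e.g. "botnet", "hosting_provider", "anonymous_proxy"
    tags: set = field(default_factory=set)  # free-form labels such as a botnet name


def to_features(sighting, now):
    """Flatten a sighting into simple features for a downstream risk model."""
    age_hours = (now - sighting.posted_at).total_seconds() / 3600
    return {
        "is_botnet": sighting.category == "botnet",
        "is_hosting_provider": sighting.category == "hosting_provider",
        "is_anonymous_proxy": sighting.category == "anonymous_proxy",
        "age_hours": age_hours,  # how fresh the forum report is
        "tags": sorted(sighting.tags),
    }


sighting = IpSighting(
    ip="203.0.113.5",
    forum="forum_a",
    posted_at=datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
    category="botnet",
    tags={"examplebot"},
)
print(to_features(sighting, now=datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)))
```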

Crowdsourcing data from online security forums is best suited to teams with access to cross-functional professionals, given the variety of technologies that need to be stitched together to create and maintain such a program. Alternatively, threat intelligence vendors consolidate data from a variety of sources and offer consumable feeds commercially. Organizations are also known to come together to share knowledge and threat intelligence. While the idea of crowdsourcing data is appealing, its effectiveness should be weighed against sourcing threat intelligence from these more conventional sources. Nevertheless, I believe it is a worthy endeavor to identify reliable online data sources that can be assimilated into an overall threat intelligence strategy.