Transforming Cyber Threat Intelligence with Language Models and GPT-3

digitalkarachi.com 22 November 2023

Threat Intelligence (TI) is a complex field whose aim is to understand and predict threats based on data collected across the open internet, dark-web forums, vendor advisories and post-incident reports. For two decades, the workflow has been the same: highly trained analysts read enormous amounts of unstructured text, then write neat structured reports. Large language models are about to upend every step of that pipeline.

The bottleneck nobody talks about

The hard part of TI has never been a shortage of data; there is too much of it. The hard part is converting prose into structured indicators of compromise (IoCs), tactics, techniques and procedures (TTPs) and MITRE ATT&CK mappings, and doing it fast enough to matter. A senior analyst can read about 20,000 words of threat reporting per day. The internet produces more than that every minute.

Where GPT-class models actually help

1. Few-shot IoC extraction

Given five examples of "extract every IP, domain, hash, CVE and malware family from this paragraph", GPT-3.5 reaches roughly 92% recall on standard TI benchmarks. That is good enough to use as an analyst's first pass, provided a human reviews the output and catches the roughly 8% of indicators the model misses.
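To make the first-pass idea concrete, here is a minimal sketch in Python. It assembles a few-shot prompt and then builds a review queue from whatever the model returns. The example paragraphs, the function names and the shape of `model_iocs` are all assumptions made for illustration; the call to the model itself is deliberately left out, and the regexes cover only a few indicator types (defanged indicators such as `hxxp` or `[.]` are not handled).

```python
import re

# Hypothetical few-shot examples; a real prompt would use analyst-labelled paragraphs.
FEW_SHOT = [
    ("The dropper beaconed to 203.0.113.7 and update-check.example.com.",
     "ip: 203.0.113.7\ndomain: update-check.example.com"),
    ("Exploitation of CVE-2021-44228 was observed in the wild.",
     "cve: CVE-2021-44228"),
]

def build_few_shot_prompt(paragraph):
    """Assemble instruction + labelled examples + the new paragraph."""
    parts = ["Extract every IP, domain, hash, CVE and malware family from the paragraph."]
    for text, labels in FEW_SHOT:
        parts.append(f"Paragraph: {text}\nIoCs:\n{labels}")
    parts.append(f"Paragraph: {paragraph}\nIoCs:\n")
    return "\n\n".join(parts)

# Deliberately simple patterns used as an independent cross-check on the model.
PATTERNS = {
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}

def regex_baseline(paragraph):
    """Indicators a plain regex pass finds, grouped by kind."""
    return {kind: set(p.findall(paragraph)) for kind, p in PATTERNS.items()}

def review_queue(paragraph, model_iocs):
    """model_iocs: dict mapping kind -> set of strings parsed from the model output.

    Returns two sets of (kind, value) pairs for a human to look at:
    - unsupported: values the model returned that do not appear in the text
    - missed: regex hits that the model did not return
    """
    unsupported = {(kind, value)
                   for kind, values in model_iocs.items()
                   for value in values
                   if value not in paragraph}
    missed = {(kind, value)
              for kind, values in regex_baseline(paragraph).items()
              for value in values
              if value not in model_iocs.get(kind, set())}
    return {"unsupported": unsupported, "missed": missed}
```

The design choice here is that the human does not re-read everything: the queue surfaces only the disagreements between the model and a cheap deterministic check, which is where the misses and hallucinated values are most likely to sit.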

2. ATT&CK mapping

Mapping a narrative ("the actor used a scheduled task to maintain persistence") to a specific ATT&CK technique (T1053.005) used to take an experienced analyst 15 minutes per report. A retrieval-augmented LLM does it in under three seconds, with the citation embedded.
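The retrieval half of that mapping can be sketched in a few lines of Python. This is a toy: it uses bag-of-words cosine similarity as a stand-in for real sentence embeddings, and the technique descriptions are short paraphrases written for this example, not official ATT&CK text. The names `TECHNIQUES` and `candidate_techniques` are hypothetical. In a real pipeline the shortlisted candidates would then go to the LLM, which picks one and cites it.

```python
import math
import re
from collections import Counter

# Small hand-picked subset of ATT&CK technique IDs with paraphrased keyword descriptions.
TECHNIQUES = {
    "T1053.005": "scheduled task schtasks persistence execution",
    "T1547.001": "registry run keys startup folder persistence logon autostart",
    "T1566.001": "spearphishing attachment email malicious document",
    "T1059.001": "powershell command script interpreter",
}

def tokens(text):
    """Lowercased word counts; a crude substitute for an embedding vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def candidate_techniques(narrative, k=2):
    """Return up to k (technique_id, score) pairs, best match first."""
    query = tokens(narrative)
    scored = [(cosine(query, tokens(desc)), tid) for tid, desc in TECHNIQUES.items()]
    scored.sort(reverse=True)
    return [(tid, round(score, 3)) for score, tid in scored[:k] if score > 0]

if __name__ == "__main__":
    print(candidate_techniques(
        "the actor used a scheduled task to maintain persistence"))
```

For the sentence quoted above, T1053.005 ranks first because it shares three terms with the narrative, while T1547.001 trails with only "persistence" in common.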

3. Cross-language correlation

A surprising amount of high-signal TI is in Russian, Mandarin, Persian and Arabic. Translation quality is now good enough that a single English-speaking analyst can run a multilingual workflow that previously required a small team.

The RAG pattern for TI

Retrieval-Augmented Generation is the architecture that actually works in production:

Ingest reports, blogs and forum dumps into a vector database indexed by sentence embeddings.
For a given question (e.g. "what is APT41's latest C2 infrastructure?") retrieve the top-K relevant chunks.
Pass them as context to the LLM and demand a citation for every claim.
Refuse to answer if a claim has no citation or cites a source that was not retrieved. This is the single most important guardrail.
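The last two steps can be sketched as follows, in Python. Retrieval and the model call are assumed to have happened elsewhere; the prompt wording, the one-claim-per-line convention, the `[source_id]` citation format and the function names are all assumptions made for this example.

```python
import re

REFUSAL = "INSUFFICIENT EVIDENCE"
CITATION = re.compile(r"\[([^\[\]]+)\]")

def build_rag_prompt(question, chunks):
    """chunks: list of (source_id, text) pairs already retrieved as the top-K."""
    context = "\n".join(f"[{sid}] {text}" for sid, text in chunks)
    return (
        "Answer using only the sources below. Write one claim per line and end "
        "each line with the id of its supporting source in square brackets. "
        f"If the sources do not answer the question, reply {REFUSAL}.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def enforce_citations(answer, chunks):
    """Pass the answer through only if every line cites a retrieved source.

    Any line with no citation, or with a citation to an id that was never
    retrieved, turns the whole answer into a refusal.
    """
    known = {sid for sid, _ in chunks}
    claims = [line.strip() for line in answer.splitlines() if line.strip()]
    if not claims:
        return REFUSAL
    for claim in claims:
        cited = set(CITATION.findall(claim))
        if not cited or not cited <= known:
            return REFUSAL
    return answer
```

Note what this guardrail does and does not do: it confirms that each cited source exists in the retrieved set, not that the source actually supports the claim. Checking support would need a further step, such as a second model pass or a human reviewer.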

Where it still fails

Novel actors: zero-shot reasoning about a never-before-seen group is unreliable.
Attribution: LLMs will happily over-attribute on weak evidence if you let them.
Adversarial poisoning: threat actors are already seeding fake reports to manipulate downstream RAG pipelines.

What this means for the analyst's job

None of this eliminates the analyst. It eliminates the boring 80% of the analyst's day (extraction, mapping, translation, summary) and concentrates their time on the 20% that actually requires judgement: attribution, prediction and writing the briefing the CISO will read.

The TI teams that lean into this shift in 2024 will produce more, better, faster intelligence than the teams that don't. The gap will compound.