PackageIntel: Leveraging Large Language Models for Automated Intelligence Extraction in Package Ecosystems

Guo, Wenbo; Liu, Chengwei; Wang, Limin; Wu, Jiahui; Xu, Zhengzi; Huang, Cheng; Fang, Yong; Liu, Yang

Computer Science > Software Engineering

arXiv:2409.15049 (cs)

[Submitted on 23 Sep 2024 (v1), last revised 27 Sep 2024 (this version, v2)]

Abstract: The rise of malicious packages in public registries poses a significant threat to software supply chain (SSC) security. Although academia and industry employ methods like software composition analysis (SCA) to address this issue, existing approaches often lack timely and comprehensive intelligence updates. This paper introduces PackageIntel, a novel platform that revolutionizes the collection, processing, and retrieval of malicious package intelligence. By utilizing exhaustive search techniques, snowball sampling from diverse sources, and large language models (LLMs) with specialized prompts, PackageIntel ensures enhanced coverage, timeliness, and accuracy. We have developed a comprehensive database containing 20,692 malicious NPM and PyPI packages sourced from 21 distinct intelligence repositories. Empirical evaluations demonstrate that PackageIntel achieves a precision of 98.6% and an F1 score of 92.0% in intelligence extraction. Additionally, it detects threats on average 70% earlier than leading databases like Snyk and OSV, and operates cost-effectively at $0.094 per intelligence piece. The platform has successfully identified and reported over 1,000 malicious packages in downstream package manager mirror registries. This research provides a robust, efficient, and timely solution for identifying and mitigating threats within the software supply chain ecosystem.

Subjects:	Software Engineering (cs.SE)
Cite as:	arXiv:2409.15049 [cs.SE]
	(or arXiv:2409.15049v2 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2409.15049

Submission history

From: Wenbo Guo
[v1] Mon, 23 Sep 2024 14:22:53 UTC (2,292 KB)
[v2] Fri, 27 Sep 2024 08:03:44 UTC (2,292 KB)
