<p>The two datasets here record the behavioural activity of malicious and benign executable files capable of running on a Windows 7 operating system. <br></p><p><strong>Dataset 1:</strong></p><ul><li>filename = "data_1.csv"</li><li>594 benign samples</li><li>595 malicious samples</li><li>Up to 305 seconds (5:05 min) of execution per file</li><li>The data was collected in a VirtualBox[1] virtual machine using Cuckoo Sandbox[2], with a custom package built on the Java library Sigar[3] to collect the machine activity data.</li><li>The virtual machine used 2 GB RAM, 25 GB storage, and a single CPU core running 64-bit Windows 7.</li></ul><p><br></p><p><strong>Dataset 2:</strong></p><ul><li>filename = "data_2.csv"</li><li>2345 benign samples</li><li>2286 malicious samples</li><li>Up to 20 seconds of execution per file</li><li>The data was collected in a VirtualBox[1] virtual machine using Cuckoo Sandbox[2], with a custom package built on the Python library psutil[4] to collect the machine activity data.
</li><li>The virtual machine used 8 GB RAM, 25 GB storage, and a single CPU core running 64-bit Windows 7.</li></ul><p><br></p><p><br></p><p><strong>Columns</strong></p><p><br></p><ul><li>sample_id: an identifier value for the samples (categorical)</li><li>vector: time in seconds since start of file execution (numeric)</li><li>malware: class label, 0=benign, 1=malicious (categorical)</li><li>cpu_sysem: percentage of CPU being used to run programs in the system kernel (numeric)</li><li>cpu_user: percentage of CPU being used to run programs in user space (numeric)</li><li>memory: bytes currently being used in memory (numeric)</li><li>swap: bytes currently being used in swap memory (numeric)</li><li>total_pro: total number of processes running (numeric)</li><li>max_pid: maximum process ID held by a process (numeric)</li><li>rx_bytes: number of bytes being received (numeric)</li><li>tx_bytes: number of bytes being sent (numeric)</li><li>rx_packets: number of packets being received (numeric)</li><li>tx_packets: number of packets being sent (numeric)</li><li>test_set: True=sample belongs to test set, False=sample belongs to training set</li></ul><p><br></p><p><br></p><p><strong>Dataset 2 only:</strong></p><ul><li>family: malware type - value missing if unknown or benign (categorical)</li><li>variant: malware variant - value missing if unknown or benign (categorical)</li><li>test-set: file was first seen before October (categorical)</li></ul><p><br></p><p>[1] <a href="https://www.virtualbox.org/wiki/Downloads">https://www.virtualbox.org/wiki/Downloads</a></p><p>[2] <a href="https://cuckoosandbox.org/">https://cuckoosandbox.org/</a> </p><p>[3] <a href="https://github.com/hyperic/sigar">https://github.com/hyperic/sigar</a> </p><p>[4] <a href="https://pypi.org/project/psutil/">https://pypi.org/project/psutil/</a></p><p>Research results based upon these data are published at http://doi.org/10.1016/j.cose.2018.05.010<br></p><p><br></p>
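<p>Because each row is one time step of one sample, the files are in long format. The following is a minimal sketch of one way to regroup rows into a time-ordered sequence per sample, using only the Python standard library. The function name <code>load_sequences</code>, the <code>META_COLUMNS</code> set and the tiny inline table are illustrative assumptions, not part of the datasets; the sketch also assumes the CSV headers match the column names listed above and that every feature value is numeric.</p>

```python
import csv
import io
from collections import defaultdict

# Columns treated as metadata rather than features. The names follow the
# column list in this description; the exact spelling of the headers in the
# CSV files is an assumption, so adjust this set if they differ.
META_COLUMNS = {"sample_id", "vector", "malware", "test_set",
                "family", "variant", "test-set"}


def load_sequences(handle):
    """Group long-format rows into one time-ordered sequence per sample.

    Returns (feature_names, sequences, labels), where sequences maps each
    sample_id to a list of feature rows sorted by the "vector" time column
    and labels maps each sample_id to its 0/1 "malware" class label.
    """
    reader = csv.DictReader(handle)
    feature_names = [c for c in reader.fieldnames if c not in META_COLUMNS]
    steps = defaultdict(list)
    labels = {}
    for row in reader:
        sid = row["sample_id"]
        values = [float(row[name]) for name in feature_names]
        steps[sid].append((float(row["vector"]), values))
        labels[sid] = int(float(row["malware"]))
    sequences = {
        sid: [values for _, values in sorted(rows, key=lambda r: r[0])]
        for sid, rows in steps.items()
    }
    return feature_names, sequences, labels


# Synthetic three-row table for illustration only (not real dataset values).
demo = io.StringIO(
    "sample_id,vector,malware,cpu_user,memory,test_set\n"
    "a,1,0,5.0,100,False\n"
    "a,0,0,2.0,90,False\n"
    "b,0,1,40.0,300,True\n"
)
names, seqs, labels = load_sequences(demo)
```

<p>For the real files, the same function could be called as <code>load_sequences(open("data_1.csv", newline=""))</code>. Treating every non-metadata column as a feature avoids hard-coding header spellings, at the cost of failing on any non-numeric column that is not listed in <code>META_COLUMNS</code>.</p>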
Funding
Deep Learning Methods for the Analysis of Cyber Behaviour and Detection of Cyber Risk (2016-10-01 - 2020-09-30); Rhode, Matilda. Funder: Airbus Operations Ltd, Engineering and Physical Sciences Research Council