RPKI Analysis Code (for reproducibility of the IMC’19 paper)

Preliminary

To analyze validated BGP announcements, you first need two datasets: historical RPKI objects and public BGP datasets.
Because the datasets are massive (46 billion BGP announcements and 8 years of RPKI objects), we strongly encourage you to use a distributed cluster-computing framework. We used Spark for large-scale data processing; for reference, validating all 46 billion BGP announcements against RPKI took more than 3 days on ~700 cores and ~4 TB of RAM.
During the validation process, we also use two additional CAIDA datasets, as-organization and as-relationship, to infer (1) the relationship between the AS that originated the BGP announcement and the AS that actually owns the announced IP prefix (i.e., the ASN in the ROA), and (2) the ISP and country associated with each ASN.
Lastly, we use data from the NRO (Number Resource Organization) to infer who allocated and assigned the IP prefixes advertised through BGP.

Summary of source code

Here, we provide the following source code. Instructions and usage are explained below.

filename	Download	Misc.
`produce-vrps.py`	link	You can use Ziggy to produce a set of VRPs.
`data-pruning.py`	link
`spark-verify.py`	link	This code runs on a Spark cluster.
`analysis-codes.tar.gz`	link
`plotting-script.tar.gz`	link	Gnuplot scripts.

Reproducing the figures in the IMC’19 paper

1. Generate Validated ROA Payloads (VRPs) from ROAs.

Among the historical RPKI objects, we specifically focus on ROAs to generate VRPs. The structure of a ROA is defined in RFC 6482. The script produce-vrps.py generates VRPs from each ROA in the following format; the VRPs are then used to validate the IP prefixes in BGP announcements. (We used a third-party Python library, rpki.net, to parse ROA objects.)

time	prefix	prefix-len	max-len	ASN	num-covered-ip-addresses	country code	TAL
20170601	103.205.38.0	24	24	64076	256	N/A	apnic
20170601	101.101.96.0	22	24	45932	1024	N/A	apnic
20170601	103.1.156.0	22	24	45932	1024	N/A	apnic
…	…	…	…	…	…	…	…
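
As an illustration of how rows in this format can be consumed, here is a minimal sketch that parses one tab-separated VRP row using only the Python standard library. The `VRP` class and `parse_vrp_line` function are hypothetical names introduced for this example and are not part of produce-vrps.py.

```python
import ipaddress
from typing import NamedTuple, Union


class VRP(NamedTuple):
    """One row of the VRP table (hypothetical container for this sketch)."""
    date: str  # snapshot date, YYYYMMDD
    network: Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
    max_len: int
    asn: int
    num_addresses: int
    country: str
    tal: str


def parse_vrp_line(line: str) -> VRP:
    """Parse one tab-separated VRP row in the column order shown above."""
    date, prefix, prefix_len, max_len, asn, num_ips, country, tal = (
        line.rstrip("\n").split("\t")
    )
    network = ipaddress.ip_network(f"{prefix}/{prefix_len}")
    return VRP(date, network, int(max_len), int(asn), int(num_ips), country, tal)


row = "20170601\t101.101.96.0\t22\t24\t45932\t1024\tN/A\tapnic"
vrp = parse_vrp_line(row)
# For IPv4, the number of covered addresses is 2 ** (32 - prefix_len).
assert vrp.num_addresses == vrp.network.num_addresses == 1024
```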

2. Obtain daily unique BGP prefixes from each of the BGP Datasets.

After obtaining the BGP datasets, we remove duplicate IP prefix announcements to reduce the size of the datasets and speed up validation. The script data-pruning.py removes duplicate entries and outputs the result in the following format:

vantage point	type	time	flag	peer-ip	peer-as	prefix	as-path	protocol
rrc00	BGP4MP	05/28/18 07:02:51	A	111.91.233.1	45896	0.0.0.0/0	45896 3356	IGP
rrc00	BGP4MP	05/28/18 07:03:52	A	111.91.233.1	45896	100.0.0.0/16	45896 3356 701	IGP
rrc00	BGP4MP	05/28/18 07:03:44	A	111.91.233.1	45896	1.0.0.0/24	45896 3356 6762 13335 13335	IGP
rrc00	BGP4MP	05/28/18 07:03:52	A	111.91.233.1	45896	100.10.0.0/16	45896 3356 701	IGP
rrc00	BGP4MP	05/28/18 07:03:52	A	111.91.233.1	45896	100.1.0.0/16	45896 3356 701	IGP
rrc00	BGP4MP	05/28/18 07:03:52	A	111.91.233.1	45896	100.11.0.0/16	45896 3356 701	IGP
…	…	…	…	…	…	…	…	…
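
The pruning step can be pictured with the following sketch. It is not the logic of data-pruning.py: the function name `unique_daily_prefixes` is hypothetical, and the deduplication key (day, prefix, origin AS) and the rule that the origin AS is the last AS on the path are assumptions made for illustration.

```python
from datetime import datetime


def unique_daily_prefixes(lines):
    """Yield the first row seen for each (day, prefix, origin AS) key.

    Assumption: the key below is an illustrative choice and may differ
    from the one used by data-pruning.py.
    """
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[3] != "A":
            continue  # skip malformed rows and non-announcements
        timestamp, prefix, as_path = fields[2], fields[6], fields[7]
        hops = as_path.split()
        if not hops:
            continue  # no AS path, so no origin to key on
        day = datetime.strptime(timestamp, "%m/%d/%y %H:%M:%S").strftime("%Y%m%d")
        key = (day, prefix, hops[-1])  # assumed: origin AS is the last hop
        if key not in seen:
            seen.add(key)
            yield line


# The second row is a made-up repeat of the first at a later time of day.
sample = [
    "rrc00\tBGP4MP\t05/28/18 07:03:52\tA\t111.91.233.1\t45896\t100.0.0.0/16\t45896 3356 701\tIGP",
    "rrc00\tBGP4MP\t05/28/18 09:15:00\tA\t111.91.233.1\t45896\t100.0.0.0/16\t45896 3356 701\tIGP",
    "rrc00\tBGP4MP\t05/28/18 07:03:44\tA\t111.91.233.1\t45896\t1.0.0.0/24\t45896 3356 6762 13335 13335\tIGP",
]
unique = list(unique_daily_prefixes(sample))
assert len(unique) == 2
```

At the scale described in the Preliminary section, an in-memory set like this would not be practical; the sketch only shows the keying idea.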

3. Validate BGP datasets (obtained in step 2) against VRPs (obtained in step 1).

Now you are ready to validate the BGP announcements (from step 2) using the VRPs (from step 1). To validate BGP announcements, we followed the algorithm in RFC 6811.
(For details on how VRPs are used to verify BGP announcements, please refer to BGP Prefix Origin Validation (RFC 6811).)

   // This pseudocode is adapted from RFC 6811.

   result = BGP_PFXV_STATE_NOT_FOUND;
   prefix_exists = FALSE;

   //Iterate through all the Covering entries in the local VRP
   //database, pfx_validate_table.
   entry = next_lookup_result(pfx_validate_table, route_prefix);

   while (entry != NULL) {
     prefix_exists = TRUE;

     if (route_prefix_length <= entry->max_length) {
       if (route_origin_as != NONE
           && entry->origin_as != 0
           && route_origin_as == entry->origin_as) {
         result = BGP_PFXV_STATE_VALID;
         return (result);
       }
     }
     entry = next_lookup_result(pfx_validate_table, route_prefix);
   }

   //If one or more VRP entries Covered the route prefix, but
   //none Matched, return "Invalid" validation state.
   if (prefix_exists == TRUE) {
     result = BGP_PFXV_STATE_INVALID;
   }

   return (result);
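
For readers who prefer something executable, the pseudocode above can be sketched in Python as follows. This is a minimal illustration and not the code in spark-verify.py: the function name `validate_route`, the tuple layout of `vrps`, and the linear scan over all VRPs are assumptions made for brevity.

```python
import ipaddress

VALID, INVALID, NOT_FOUND = "valid", "invalid", "not-found"


def validate_route(route_prefix, origin_as, vrps):
    """Classify a route following the RFC 6811 procedure.

    vrps is an iterable of (prefix, max_length, asn) tuples (assumed layout).
    origin_as is an int, or None when the origin cannot be determined.
    """
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp_prefix, max_length, vrp_asn in vrps:
        vrp_net = ipaddress.ip_network(vrp_prefix)
        # A VRP covers the route when the route lies inside the VRP prefix.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        if (route.prefixlen <= max_length
                and origin_as is not None
                and vrp_asn != 0
                and origin_as == vrp_asn):
            return VALID
    # Covered by at least one VRP but matched by none: Invalid.
    return INVALID if covered else NOT_FOUND


vrps = [("101.101.96.0/22", 24, 45932)]
print(validate_route("101.101.96.0/24", 45932, vrps))  # valid
print(validate_route("101.101.96.0/25", 45932, vrps))  # invalid (longer than max-len)
print(validate_route("101.101.96.0/24", 64500, vrps))  # invalid (origin mismatch)
print(validate_route("192.0.2.0/24", 45932, vrps))     # not-found
```

The linear scan is only suitable for small examples. A lookup over the full VRP set would normally use a prefix tree or a similar index.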

The script spark-verify.py produces the validation results in the following format (note that we ran it on a Spark cluster for large-scale data processing):

time	prefix-addr	prefix-len	origin	origin-isp	origin-country	Verified Information¹
20181227	99.108.0.0	14	7018	AT&T Services Inc.	US	1,7018,AT&T Services Inc.,US,None,99.108.0.0/14-14
20181227	99.112.0.0	12	7018	AT&T Services Inc.	US	1,7018,AT&T Services Inc.,US,None,99.112.0.0/12-12
20181227	99.192.128.0	17	27589	MOJOHOST	US	1,27589,MOJOHOST,US,None,99.192.128.0/17-24
20181227	99.32.0.0	12	7018	AT&T Services Inc.	US	1,7018,AT&T Services Inc.,US,None,99.32.0.0/12-12

¹ Verified Information is a list of validation results for a given BGP announcement, one per covering ROA. Each result is a tuple of (validation-index, ASN of the covering ROA, ISP of the covering ROA, country code of the covering ROA, relationship between the origin ASN of the BGP announcement and the ASN of the covering ROA, covered IP prefix); the meaning of validation-index can be found in the source code.
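
A single Verified Information tuple, as it appears in the sample rows above, could be unpacked along these lines. The names `VerifiedInfo` and `parse_verified_info` are hypothetical. Two further assumptions are made here: the number after the hyphen in the last field is the ROA's max-len, and ISP names may themselves contain commas, which is why the ISP is taken as whatever sits between the first two and last three fields. How multiple tuples are delimited in one field is not shown in the samples, so the sketch handles one tuple only.

```python
from typing import NamedTuple


class VerifiedInfo(NamedTuple):
    """One validation result (hypothetical container for this sketch)."""
    validation_index: str  # meaning is defined in the source code
    roa_asn: int
    roa_isp: str
    roa_country: str
    relationship: str
    prefix: str
    max_len: int


def parse_verified_info(field: str) -> VerifiedInfo:
    """Parse one comma-separated tuple from the Verified Information column."""
    parts = field.split(",")
    if len(parts) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(parts)}")
    index, asn = parts[0], parts[1]
    country, relationship, covered = parts[-3:]
    isp = ",".join(parts[2:-3])  # tolerate commas inside the ISP name
    prefix, max_len = covered.rsplit("-", 1)  # assumed: "<prefix>/<len>-<max-len>"
    return VerifiedInfo(index, int(asn), isp, country, relationship, prefix, int(max_len))


info = parse_verified_info("1,27589,MOJOHOST,US,None,99.192.128.0/17-24")
assert info.roa_asn == 27589
assert info.prefix == "99.192.128.0/17" and info.max_len == 24
```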

4. Analyze Validated BGP announcements

The analysis-codes.tar.gz and plotting-script.tar.gz archives contain three analysis scripts and 10 plotting scripts that generate the figures in the paper. The table below lists, for each figure, the functions that generate the dataset and the gnuplot scripts that plot it.

file	function	figures in the paper	gnuplot script
spark-rpki-object-validation.py	runSparkROAsIPCnt, runSparkROAsIPPercentage, runSparkPercentageASesInROAs	Figure 2	num-vrps-as-ip-ipv4-byIRR.plot
spark-analysis.py	runSparkCalcRPKIEnabledAdv	Figure 3	percentage-rpki-enabled-adv-ipv4.plot
spark-analysis.py	runSparkValidationUniquePrefix, runSparkValidationUniquePrefixAllPrefix	Figure 4, Figure 5	percentage-rpki-uniq-prefix-asn-invalid-ipv4-merge.plot, percentage-rpki-uniq-prefix-asn-invalid-ipv4-focus.plot
spark-analysis.py	runSparkValidationUniquePrefix	Figure 6	num-rpki-uniq-prefix-asn-invalid-adv-reasoning-ipv4.plot
spark-rpki-object-validation.py	runSparkNumPrefixWithMaxlen	Figure 7	percentage-ipprefix-with-maxlen-ipv4.plot (a), percentage-rpki-uniq-prefix-asn-merge-adv-hasMaxLen-ipv4.plot (b)
spark-analysis.py	runSparkClassifyHijackingUniquePrefix	Figure 8	num-rpki-uniq-prefix-classify-hijack-ipv4.plot
spark-analysis.py	runSparkClassifyHijackingUniquePrefixDuration	Figure 9	cdf-num-attack-duration.plot
spark-analysis.py	runSparkClassifyHijackingUniquePrefix	Figure 10	percentage-rpki-uniq-prefix-asn-invalid-suspicious-ipv4.plot
hijack-analysis.py	getPairsOfAttack	Figure 11, 12	cdf-num-hijacked-ipv4-byAS, cdf-num-attacker-ipv4-byAS.plot