RPKI Analysis Code (for reproducibility of the IMC’19 paper)

Preliminary

  1. To analyze validated BGP announcements, you first need to have two datasets: RPKI historical objects and public BGP datasets.

  2. Because the datasets are massive (46 billion BGP announcements and 8 years of RPKI objects), we strongly encourage you to use a distributed cluster-computing framework. We used Spark for large-scale data processing; for reference, verifying all 46 billion BGP announcements against RPKI took more than 3 days using ~700 cores and ~4 TB of RAM.

  3. During the validation process, we also use two additional CAIDA datasets, as-organization and as-relationship, to infer (1) the relationship between the AS that originated the BGP announcement and the AS that actually owns the announced IP prefix (i.e., the ASN in the ROA), and (2) the ISP and country of each ASN.

  4. Lastly, we also used NRO (Number Resource Organization) data to infer who allocated and assigned the IP prefixes advertised through BGP.
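
As an illustration of item 3, the sketch below looks up the relationship between an origin AS and the AS in a ROA. It is not taken from the scripts in this repository. It assumes the "serial-1" text layout of CAIDA as-relationship files (`<provider-as>|<customer-as>|-1` and `<peer-as>|<peer-as>|0`, with `#` comment lines); the function names and the returned labels are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

RelTable = Dict[Tuple[int, int], str]


def load_as_relationships(lines: Iterable[str]) -> RelTable:
    """Build a lookup table from as-relationship lines (assumed serial-1 layout)."""
    rel: RelTable = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        as1, as2, kind = line.split("|")[:3]
        a, b = int(as1), int(as2)
        if kind == "-1":
            # as1 is a provider of as2
            rel[(a, b)] = "provider-of"
            rel[(b, a)] = "customer-of"
        elif kind == "0":
            rel[(a, b)] = rel[(b, a)] = "peer"
    return rel


def relationship(rel: RelTable, origin_as: int, roa_as: int) -> Optional[str]:
    """Return how origin_as relates to roa_as, or None if no link is recorded."""
    return rel.get((origin_as, roa_as))


# Made-up sample using documentation ASNs.
sample = ["# comment", "64496|64497|-1", "64497|64498|0"]
table = load_as_relationships(sample)
print(relationship(table, 64497, 64496))
```

A dictionary keyed by the ordered AS pair keeps the lookup constant-time, which matters when the same table is consulted for every announcement.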

Summary of source code

We provide the following source code. Instructions and usage are explained below.

filename | Download | Misc.
produce-vrps.py | link | You can use Ziggy to produce a set of VRPs.
data-pruning.py | link |
spark-verify.py | link | This code runs on a Spark cluster.
analysis-codes.tar.gz | link |
plotting-script.tar.gz | link | Gnuplot scripts.

Reproducing the figures in the IMC’19 paper

1. Generate Validated ROA Payloads (VRPs) from ROAs.

Among the historical RPKI objects, we focus specifically on ROAs to generate VRPs. The structure of a ROA is defined in RFC 6482. The script produce-vrps.py generates VRPs from each ROA in the following format; these VRPs are then used to validate the IP prefixes in BGP announcements. (We used a third-party Python library, rpki.net, to parse ROA objects.)

time prefix prefix-len max-len ASN num-covered-ip-addresses country-code TAL
20170601 103.205.38.0 24 24 64076 256 N/A apnic
20170601 101.101.96.0 22 24 45932 1024 N/A apnic
20170601 103.1.156.0 22 24 45932 1024 N/A apnic
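
A minimal sketch of reading this format back in, assuming one whitespace-separated record per line with the eight columns listed above. The `VRP` type and `parse_vrp_line` are hypothetical names and are not part of produce-vrps.py.

```python
import ipaddress
from typing import NamedTuple, Union


class VRP(NamedTuple):
    date: str  # e.g. "20170601"
    prefix: Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
    max_len: int
    asn: int
    num_addresses: int
    country: str
    tal: str


def parse_vrp_line(line: str) -> VRP:
    """Parse one VRP record (assumed layout: eight whitespace-separated fields)."""
    date, addr, plen, max_len, asn, num, country, tal = line.split()
    network = ipaddress.ip_network(f"{addr}/{plen}")
    return VRP(date, network, int(max_len), int(asn), int(num), country, tal)


vrp = parse_vrp_line("20170601 101.101.96.0 22 24 45932 1024 N/A apnic")
# For IPv4, the covered-address count equals 2 ** (32 - prefix-len).
print(vrp.prefix, vrp.max_len, vrp.prefix.num_addresses == vrp.num_addresses)
```

Parsing the prefix with the standard `ipaddress` module also rejects records whose address has host bits set, which is a cheap sanity check on the input.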

2. Obtain daily unique BGP prefixes from each of the BGP Datasets.

After obtaining the BGP datasets, we remove duplicate IP prefix announcements to reduce the size of the datasets and speed up validation. The script data-pruning.py removes duplicate entries and writes its output in the following format:

vantage-point type time flag peer-ip peer-as prefix as-path protocol
rrc00 BGP4MP 05/28/18 07:02:51 A 111.91.233.1 45896 0.0.0.0/0 45896 3356 IGP
rrc00 BGP4MP 05/28/18 07:03:52 A 111.91.233.1 45896 100.0.0.0/16 45896 3356 701 IGP
rrc00 BGP4MP 05/28/18 07:03:44 A 111.91.233.1 45896 1.0.0.0/24 45896 3356 6762 13335 13335 IGP
rrc00 BGP4MP 05/28/18 07:03:52 A 111.91.233.1 45896 100.10.0.0/16 45896 3356 701 IGP
rrc00 BGP4MP 05/28/18 07:03:52 A 111.91.233.1 45896 100.1.0.0/16 45896 3356 701 IGP
rrc00 BGP4MP 05/28/18 07:03:52 A 111.91.233.1 45896 100.11.0.0/16 45896 3356 701 IGP
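
The pruning idea can be sketched as follows. This is not the logic of data-pruning.py itself: it assumes whitespace-separated records as shown above, and it assumes that two announcements are duplicates when they share the same date, prefix, and origin AS (the last hop of the AS path). The function name is hypothetical.

```python
from typing import Iterable, Iterator


def unique_daily_announcements(lines: Iterable[str]) -> Iterator[str]:
    """Yield the first record seen for each (date, prefix, origin AS) key."""
    seen = set()
    for line in lines:
        fields = line.split()
        # Layout: vantage-point, type, date, clock time, flag, peer-ip,
        # peer-as, prefix, one or more AS-path hops, protocol.
        if len(fields) < 10:
            continue  # malformed or empty AS path
        date, prefix = fields[2], fields[7]
        origin_as = fields[-2]  # last AS-path hop, just before the protocol
        key = (date, prefix, origin_as)
        if key not in seen:
            seen.add(key)
            yield line


records = [
    "rrc00 BGP4MP 05/28/18 07:03:52 A 111.91.233.1 45896 100.0.0.0/16 45896 3356 701 IGP",
    "rrc00 BGP4MP 05/28/18 09:10:00 A 111.91.233.1 45896 100.0.0.0/16 45896 3356 701 IGP",
    "rrc00 BGP4MP 05/28/18 07:03:44 A 111.91.233.1 45896 1.0.0.0/24 45896 3356 6762 13335 13335 IGP",
]
for record in unique_daily_announcements(records):
    print(record)
```

An in-memory set is only practical for a single day of data from one collector; at the scale described above, the same key would more naturally be used in a distributed de-duplication step.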

3. Validate BGP Datasets (obtained from step 2) against VRPs (obtained from step 1).

Now you are ready to validate the BGP announcements using the VRPs. We followed the algorithm below, from RFC 6811.
(For details on how VRPs are used to verify BGP announcements, please refer to BGP Prefix Origin Validation (RFC 6811).)

   // This pseudo-code is adapted from RFC 6811: prefix_exists is
   // initialized explicitly and the lookup key is route_prefix throughout.

   result = BGP_PFXV_STATE_NOT_FOUND;
   prefix_exists = FALSE;

   //Iterate through all the Covering entries in the local VRP
   //database, pfx_validate_table.
   entry = next_lookup_result(pfx_validate_table, route_prefix);

   while (entry != NULL) {
     prefix_exists = TRUE;

     if (route_prefix_length <= entry->max_length) {
       if (route_origin_as != NONE
           && entry->origin_as != 0
           && route_origin_as == entry->origin_as) {
         result = BGP_PFXV_STATE_VALID;
         return (result);
       }
     }
     entry = next_lookup_result(pfx_validate_table, route_prefix);
   }

   //If one or more VRP entries Covered the route prefix, but
   //none Matched, return "Invalid" validation state.
   if (prefix_exists == TRUE) {
     result = BGP_PFXV_STATE_INVALID;
   }

   return (result);
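
The same procedure can be sketched in plain Python. This is an illustration only and not the code in spark-verify.py: it scans a small in-memory list of VRPs linearly, whereas a real deployment would use a prefix trie or a distributed join. The `validate` function, the tuple layout, and the returned state strings are hypothetical.

```python
import ipaddress
from typing import Iterable, Optional, Tuple, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
# One VRP as (prefix, max_length, origin_as).
Vrp = Tuple[Network, int, int]


def validate(route_prefix: str, route_origin_as: Optional[int],
             vrps: Iterable[Vrp]) -> str:
    """Return "valid", "invalid", or "not-found" following RFC 6811."""
    route = ipaddress.ip_network(route_prefix)
    prefix_exists = False
    for vrp_prefix, max_length, origin_as in vrps:
        # A VRP covers the route when the route lies within the VRP prefix.
        # subnet_of() raises TypeError across IP versions, so check first.
        if route.version != vrp_prefix.version:
            continue
        if not route.subnet_of(vrp_prefix):
            continue
        prefix_exists = True
        if (route.prefixlen <= max_length
                and route_origin_as is not None
                and origin_as != 0
                and route_origin_as == origin_as):
            return "valid"
    # Covered by at least one VRP but matched by none: invalid.
    return "invalid" if prefix_exists else "not-found"


vrps = [(ipaddress.ip_network("101.101.96.0/22"), 24, 45932)]
print(validate("101.101.97.0/24", 45932, vrps))
print(validate("101.101.97.0/25", 45932, vrps))
print(validate("8.8.8.0/24", 15169, vrps))
```

The three calls exercise the three outcomes: a matching origin within max-len, a prefix longer than max-len, and a prefix that no VRP covers.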

The script spark-verify.py produces the validation results in the following format (note that we ran it on a Spark cluster for large-scale data processing):

time prefix-addr prefix-len origin origin-isp origin-country verified-information [1]
20181227 99.108.0.0 14 7018 AT&T Services Inc. US 1,7018,AT&T Services Inc.,US,None,99.108.0.0/14-14
20181227 99.112.0.0 12 7018 AT&T Services Inc. US 1,7018,AT&T Services Inc.,US,None,99.112.0.0/12-12
20181227 99.192.128.0 17 27589 MOJOHOST US 1,27589,MOJOHOST,US,None,99.192.128.0/17-24
20181227 99.32.0.0 12 7018 AT&T Services Inc. US 1,7018,AT&T Services Inc.,US,None,99.32.0.0/12-12

[1] Verified information is the list of validation results for a given BGP announcement against each ROA that covers it. Each result is a tuple of (validation-index, ASN of the covering ROA, ISP of the covering ROA, country code of the covering ROA, relationship between the origin ASN of the BGP announcement and the ASN of the covering ROA, covered IP prefix); the details of validation-index can be found in the source code.
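
A small sketch of splitting one such tuple into named fields, assuming the comma-separated layout shown in the sample rows. Because ISP names may themselves contain commas, the two leading fields are split from the left and the three trailing fields from the right. Reading the trailing number in `99.192.128.0/17-24` as the max-len of the covering ROA is an assumption, and the function and key names are hypothetical.

```python
from typing import Dict


def parse_verified_info(text: str) -> Dict[str, str]:
    """Split one verified-information tuple into named string fields."""
    validation_index, roa_asn, rest = text.split(",", 2)
    roa_isp, roa_country, relationship, covered = rest.rsplit(",", 3)
    # Assumed layout of the last field: <prefix>/<prefix-len>-<max-len>.
    covered_prefix, max_len = covered.rsplit("-", 1)
    return {
        "validation_index": validation_index,  # meaning defined in the source code
        "roa_asn": roa_asn,
        "roa_isp": roa_isp,
        "roa_country": roa_country,
        "relationship": relationship,
        "covered_prefix": covered_prefix,
        "max_len": max_len,
    }


info = parse_verified_info("1,27589,MOJOHOST,US,None,99.192.128.0/17-24")
print(info["roa_asn"], info["covered_prefix"], info["max_len"])
```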

4. Analyze Validated BGP announcements

The archives analysis-codes.tar.gz and plotting-script.tar.gz contain three analysis scripts and 10 plotting scripts, which generate the figures in the paper. The table below lists, for each figure, the functions that generate its dataset and the gnuplot script that plots it.

file | function | figures in the paper | gnuplot script
spark-rpki-object-validation.py | runSparkROAsIPCnt, runSparkROAsIPPercentage, runSparkPercentageASesInROAs | Figure 2 | num-vrps-as-ip-ipv4-byIRR.plot
spark-analysis.py | runSparkCalcRPKIEnabledAdv | Figure 3 | percentage-rpki-enabled-adv-ipv4.plot
spark-analysis.py | runSparkValidationUniquePrefix, runSparkValidationUniquePrefixAllPrefix | Figure 4, Figure 5 | percentage-rpki-uniq-prefix-asn-invalid-ipv4-merge.plot, percentage-rpki-uniq-prefix-asn-invalid-ipv4-focus.plot
spark-analysis.py | runSparkValidationUniquePrefix | Figure 6 | num-rpki-uniq-prefix-asn-invalid-adv-reasoning-ipv4.plot
spark-rpki-object-validation.py | runSparkNumPrefixWithMaxlen | Figure 7 | percentage-ipprefix-with-maxlen-ipv4.plot (a), percentage-rpki-uniq-prefix-asn-merge-adv-hasMaxLen-ipv4.plot (b)
spark-analysis.py | runSparkClassifyHijackingUniquePrefix | Figure 8 | num-rpki-uniq-prefix-classify-hijack-ipv4.plot
spark-analysis.py | runSparkClassifyHijackingUniquePrefixDuration | Figure 9 | cdf-num-attack-duration.plot
spark-analysis.py | runSparkClassifyHijackingUniquePrefix | Figure 10 | percentage-rpki-uniq-prefix-asn-invalid-suspicious-ipv4.plot
hijack-analysis.py | getPairsOfAttack | Figure 11, Figure 12 | cdf-num-hijacked-ipv4-byAS, cdf-num-attacker-ipv4-byAS.plot