Task Description: Implementing a HyperLogLog Sketch and bSktHLL Data Structure
Objective
The goal is to develop two classes to estimate the cardinality (flow spread) of network flow
data using the HyperLogLog algorithm and a specialized data structure comprising an array
of HyperLogLog sketches. The input data (o1.txt) consists of source and destination IP pairs.
The output should be a CSV file containing the estimated cardinality for each unique
destination IP, along with the true count of occurrences.
Requirements
1. HLL Class: Implement a class named HLL to represent a HyperLogLog sketch. This
class should include:
• A constructor that initializes the number of registers (m, default 128) and
register bit size (p, default 5).
• A method add for adding elements to the sketch, which involves hashing the
element, updating the appropriate register based on the hash value, and
tracking the maximum number of leading zeros.
• A method estimate to calculate and return the estimated cardinality based
on the register values.
2. bSktHLL Class: Implement a class named bSktHLL that contains an array of l HLL
sketches (l=3 as each sketch produce a candidate estimate). Each sketch should
have m=128 HLL registers with each register being 5 bits. This class should include:
• A constructor that initializes the array of HLL sketches.
• A method record to hash each flow into two different HLL sketches and
update the sketches accordingly.
• A method query to return the median estimate of the cardinality for a given
flow from the sketches it hashes into.
3. Data Processing Script: Write a script to:
• Read network flow data from a text file, where each line contains a source
and destination IP pair.
• Process each line to record the destination IP in the bSktHLL data structure.
• Generate a CSV file with the following columns: flow_label,
candidate_estimate_1, candidate_estimate_2, candidate_estimate_3,
true_spread. The candidate_estimate columns should contain the
cardinality estimates, and the true_spread should contain the actual count
of occurrences for each destination IP.
Notes: The error in the candidate estimation should not exceed 5% of the actual spread.
References: Assistance functions for recording and querying flows can be sourced from
HyperLogLog.java, while references for hash functions should be utilized from
GeneralUtil.java.
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。