联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Java编程Java编程

日期:2024-03-30 07:53

Task Description: Implementing a HyperLogLog Sketch and bSktHLL Data Structure

Objective

The goal is to develop two classes to estimate the cardinality (flow spread) of network flow

data using the HyperLogLog algorithm and a specialized data structure comprising an array

of HyperLogLog sketches. The input data (o1.txt) consists of source and destination IP pairs.

The output should be a CSV file containing the estimated cardinality for each unique

destination IP, along with the true count of occurrences.

Requirements

1. HLL Class: Implement a class named HLL to represent a HyperLogLog sketch. This

class should include:

• A constructor that initializes the number of registers (m, default 128) and

register bit size (p, default 5).

• A method add for adding elements to the sketch, which involves hashing the

element, updating the appropriate register based on the hash value, and

tracking the maximum number of leading zeros.

• A method estimate to calculate and return the estimated cardinality based

on the register values.

2. bSktHLL Class: Implement a class named bSktHLL that contains an array of l HLL

sketches (l=3 as each sketch produce a candidate estimate). Each sketch should

have m=128 HLL registers with each register being 5 bits. This class should include:

• A constructor that initializes the array of HLL sketches.

• A method record to hash each flow into two different HLL sketches and

update the sketches accordingly.

• A method query to return the median estimate of the cardinality for a given

flow from the sketches it hashes into.

3. Data Processing Script: Write a script to:

• Read network flow data from a text file, where each line contains a source

and destination IP pair.

• Process each line to record the destination IP in the bSktHLL data structure.

• Generate a CSV file with the following columns: flow_label,

candidate_estimate_1, candidate_estimate_2, candidate_estimate_3,

true_spread. The candidate_estimate columns should contain the

cardinality estimates, and the true_spread should contain the actual count

of occurrences for each destination IP.

Notes: The error in the candidate estimation should not exceed 5% of the actual spread.

References: Assistance functions for recording and querying flows can be sourced from

HyperLogLog.java, while references for hash functions should be utilized from

GeneralUtil.java.


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp