
###### Date: 2020-10-28 10:40

Text Processing: Emails For SPAM/HAM Classification

STA141B, Fall 2020

You may not look for or use code that addresses this dataset or use a package that can read an email

message into an R structure. You must implement the computational approach yourself. You can use

tutorials and stackoverflow for finding guides to general questions about text processing.

We are all familiar with SPAM email messages. Ultimately, we would like to be able to use statistics to

classify new email messages as SPAM or HAM (valid mail). There are many statistical techniques we could

use. But before we can use them we need “data”. Each message is an observation. We need to “measure”

variables on each message in a sample of email messages. We also need to know if a message is SPAM or

HAM. We can then train a statistical classifier using these variables to predict if a new message is SPAM or

HAM.

In this assignment, you will create a data.frame from a set of email messages. Each row corresponds to an

email message and will contain 16 or more variables derived from that message. One variable is whether

it is actually HAM or SPAM which can be derived from the name of the folder in which the message is

located.

The data we use come from the SpamAssassin project. The specific data you are to work with are available on Canvas.

Your job is to:

• process each email message into a structure (list) that contains the header, the body and any attachments (each of which will also have a header and a body)

• create an R data frame of “derived” variables that give various measures of the email messages

• explore the data to see which variables help to discriminate between SPAM and HAM messages using plots and numerical summaries

These derived variables might be, for example, the number of recipients to whom the mail was sent, the percentage of capital letters in the body of the text, or whether the message is a reply to another message. See below for a list of 25 possible variables. You are to write code to compute at least 15 of these, plus whether the message is SPAM or HAM.

Many of the variables can be computed using similar approaches. For the most part, you should use regular

expressions rather than using strsplit to break the strings into individual characters.
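As a rough illustration of the regular-expression approach, the sketch below computes a few of the suggested variables for one message. The `subject` and `body` values are invented for the example, and `countMatches` is a hypothetical helper name, not something supplied with the assignment. It is a minimal starting point, not a complete solution.

```r
# Hypothetical inputs: a subject line and the body as a character vector of lines.
subject <- "Re: FREE money!!! Act now?"
body <- c("HELLO there,", "This is a TEST message.")

# Count all matches of a regular expression across a character vector.
# gregexpr() returns -1 for an element with no match, so only positive
# positions are counted.
countMatches <- function(pattern, text) {
  m <- gregexpr(pattern, text)
  sum(vapply(m, function(x) sum(x > 0), integer(1)))
}

# TRUE if "Re:" is the first word of the subject.
isRe <- grepl("^[[:space:]]*Re:", subject, ignore.case = TRUE)

subjectExclamationCount <- countMatches("!", subject)
subjectQuestCount <- countMatches("\\?", subject)

# Percentage of letters in the body that are upper case; blanks, digits
# and punctuation are excluded from the denominator.
nUpper <- countMatches("[[:upper:]]", body)
nLetters <- countMatches("[[:alpha:]]", body)
percentCapitals <- if (nLetters > 0) 100 * nUpper / nLetters else NA_real_
```

The POSIX classes `[[:upper:]]` and `[[:alpha:]]` are used instead of `[A-Z]` because character ranges can behave differently across locales.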

You are very welcome to define new variables that you can compute from each email message that you think

will help to classify it as SPAM or HAM. Clearly state the meaning and specifics of the variable, why you

think it might be useful in classifying HAM or SPAM emails, and then show the code to compute it.

Once you have this data frame of “derived” variables, explore the relationships among these variables and

especially how they might be used to classify SPAM and HAM messages. In other words, look at frequency

tables and scatterplots of the variables and color code the points based on if the message is SPAM or HAM.

Which variables seem to do best at discriminating between SPAM and HAM messages?
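The kind of exploration described above might start along these lines. The `emails` data frame here is a toy stand-in with invented values; in the assignment it would be the data frame of derived variables you build yourself.

```r
# Toy data frame standing in for the real derived variables (values invented).
emails <- data.frame(
  isSpam = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  isRe = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE),
  percentCapitals = c(42, 35, 6, 9, 51, 12),
  bodyCharacterCount = c(800, 1500, 400, 2300, 950, 600)
)

# Frequency table of a logical variable against the class label.
reBySpam <- table(isRe = emails$isRe, isSpam = emails$isSpam)
print(reBySpam)

# Numerical summary of a numeric variable within each class.
capsByClass <- tapply(emails$percentCapitals, emails$isSpam, median)
print(capsByClass)

# Scatterplot with points color coded by class (red = SPAM, blue = HAM).
plot(log(emails$bodyCharacterCount), emails$percentCapitals,
     col = ifelse(emails$isSpam, "red", "blue"), pch = 19,
     xlab = "log(bodyCharacterCount)", ylab = "percentCapitals")
legend("topright", legend = c("SPAM", "HAM"),
       col = c("red", "blue"), pch = 19)
```

A variable discriminates well when the two classes separate clearly in the table or plot. In this toy data, for instance, every message with `isRe` equal to TRUE is HAM.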

Your report should be similar in structure to the first assignment in that it has two primary sections:

• describe the computational approach and high- and intermediate-level details of how you implemented this approach.

• explore the data and interpret them in the context of

Submit both the report (as a PDF document) and the R scripts for creating the data.frame and for exploring

the data, and, importantly, the R files containing functions you write and use in the script(s).

Some functions you might find useful include: grep, gsub, gregexpr, substring, nchar, strsplit.

sprintf, paste.

table, plot, scatter.smooth, hexbin.

1 The Anatomy of an E-mail message

Electronic mail, usually called e-mail, consists of simple text messages – a piece of text sent to a recipient

via the internet. An e-mail message consists of two parts, the header and the body. The body of the e-mail

message is separated from the header by a single blank line. When an attachment is added to an e-mail

message, the attachment is included in the body of the message. Even with attachments, e-mail messages

are still only text messages.
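Because the header and body are separated by the first blank line, one possible first step is to split the lines of a message at that point. The sketch below assumes the message has already been read into a character vector of lines (for example with `readLines`) and that the separator is a completely empty line. `splitMessage` and the sample lines are hypothetical names and data for illustration.

```r
# Split a message (character vector of lines) at the first blank line.
splitMessage <- function(lines) {
  blank <- which(lines == "")
  if (length(blank) == 0) {
    # No blank line: treat everything as header, with an empty body.
    return(list(header = lines, body = character(0)))
  }
  first <- blank[1]
  list(header = lines[seq_len(first - 1)],  # lines before the blank line
       body = lines[-seq_len(first)])       # lines after the blank line
}

# Made-up message: two header lines, a blank line, then the body.
msg <- c("From: whisper@oz.net (David LeBlanc)",
         "Subject: [Spambayes] Deployment",
         "",
         "First line of the body.",
         "",
         "Second paragraph.")
parts <- splitMessage(msg)
```

Only the first blank line ends the header; later blank lines belong to the body, which is why the split uses `blank[1]` rather than every blank line.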

The header contains information about the message, such as the date of transmission. This information is relayed in a special format that consists of KEY:VALUE pairs.

Below is a sample header from a message found on the SpamAssassin website.

Return-Path: whisper@oz.net

Delivery-Date: Fri Sep 6 20:53:36 2002

From: whisper@oz.net (David LeBlanc)

Date: Fri, 6 Sep 2002 12:53:36 -0700

Subject: [Spambayes] Deployment

Notice the keys are Return-Path, Delivery-Date, From, Date, Subject, In-Reply-To, and Message-ID. The value follows the key. For example, in the above header, the value of the From key is

whisper@oz.net (David LeBlanc).

Some of these keys are mandatory such as Date, From, and To (or In-Reply-To, or Bcc). Other

keys are optional but widely used, such as Subject, Cc, Received, and Message-ID. Many keys

are ignored by the mail system, but the entire header is relayed on to the recipient’s server whether or not

it is recognized. For example, keys starting with “X-” are for personal application or institution use and are ignored by other applications. The Received header lines are important because they allow the message to be tracked. As a message makes its way to the intended recipient, servers add additional Received lines.

Description of Variables

| Variable | Type | Description |
|---|---|---|
| isSpam | logical | whether mail is Spam (TRUE) or Ham (FALSE) |
| isRe | logical | if the string Re: appears as the first word in the subject of the message |
| numLinesInBody | integer | a count of the number of lines in the body of the email message |
| bodyCharacterCount | integer | the number of characters in the body of the email message |
| subjectExclamationCount | integer | a count of the number of exclamation marks (!) in the subject of the message |
| subjectQuestCount | integer | the number of question marks in the subject |
| numAttachments | integer | the number of attachments in the message |
| … | … | … that was set to high |
| numRecipients | integer | the number of recipients in the To, Cc fields |
| percentCapitals | numeric | the percentage of the characters in the body of the email that are upper case (excluding blanks, numbers, and punctuation) |
| sortedRecipients | logical | the recipient list is sorted by address |
| subjectPunctuationCheck | logical | whether the subject has punctuation or digits surrounded by characters, e.g. V?agra and pay1ng, but not New! |
| hourSent | integer | the hour in the day the mail was sent (0 – 23) |
| multipartText | logical | whether the header states that the message is a multipart/text, i.e. with attachments |
| containsImages | logical | whether the message contains images (in HTML) |
| isPGPsigned | logical | indicates whether the mail was digitally signed (e.g. using PGP or GPG) |
| percentHTMLTags | numeric | the proportion of any HTML text in the message’s body that is made up of HTML markup and not content |
| subjectSpamWords | logical | whether the subject contains one of the following phrases: viagra, pounds, free, weight, guarantee, millions, dollars, credit, risk, prescription, generic, drug, money back, credit card |
| percentSubjectBlanks | numeric | the percentage of blanks in the subject |
| messageIdHasNoHostname | logical | whether the message id that uniquely identifies the message has no component identifying the machine from which it was sent |
| fromNumericEnd | logical | whether the user login in the From: field ends in numbers |
| isYelling | logical | whether the Subject of the mail is in capital letters |
| percentForwards | numeric | percent of the message’s body that is made up of content included from other messages |
| isOriginalMessage | logical | body does not contain the phrase “original message” or something similar |
| isDear | logical | whether the message body contains a form of the introduction Dear ... |
| isWrote | logical | whether the text includes a line indicating an included message as identified by the word wrote: in several different possible languages |
| averageWordLength | numeric | the average length of the words in the body of the message |
| numDollarSigns | integer | the number of dollar signs in the body of the message |

Below are some typical header keys:

• Message-Id: a unique identifier for the message, assigned by the originating server;

• Return-Path: specifies the sender’s address and bounced mail gets sent to this address;

• Date: added by the e-mail client;

• Cc: lists the recipients of a “carbon copied” message;

• MIME-Version: used for encoding binary content as attachments.

A value may be continued on a second line of the header, in which case the line will be indented and begin

with a tab character or blank spaces. Consider this header

From irregulars-admin@tb.tf Thu Aug 22 14:23:39 2002

Delivered-To: zzzz@localhost.netnoteinc.com

by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9DAE147C66

for <zzzz@localhost>; Thu, 22 Aug 2002 09:23:38 -0400 (EDT)

by localhost with IMAP (fetchmail-5.9.0)

for zzzz@localhost (single-drop); Thu, 22 Aug 2002 14:23:38 +0100 (IST)

MIME-Version: 1.0

After the 'From ' row, there are two fields (Return-Path and Delivered-To) on separate lines. The next two fields however are 'Received' fields and both span 3 rows. The first row of each starts in the first column and so indicates the start of a new field, and the conclusion of the previous field.
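One way to handle continuation rows is to group each indented row with the field that precedes it before separating keys from values. The sketch below is a minimal illustration under that assumption. `parseHeader` is a hypothetical name, and the first row of the Received field in the sample is invented because the example needs a complete field.

```r
# Turn header lines into a named character vector of KEY -> VALUE,
# folding indented continuation rows into the field they continue.
parseHeader <- function(headerLines) {
  # Drop a leading mbox-style "From " row, which is not a KEY: VALUE pair.
  if (length(headerLines) > 0 && grepl("^From ", headerLines[1])) {
    headerLines <- headerLines[-1]
  }
  if (length(headerLines) == 0) return(character(0))
  # A row that begins with a blank or tab continues the previous field.
  isContinuation <- grepl("^[ \t]", headerLines)
  fieldId <- cumsum(!isContinuation)
  # Join the rows of each field into one string.
  groups <- split(sub("^[ \t]+", "", headerLines), fieldId)
  fields <- unname(vapply(groups, paste, character(1), collapse = " "))
  # Everything before the first colon is the key; the rest is the value.
  values <- sub("^[^:]+:[[:space:]]*", "", fields)
  names(values) <- sub("^([^:]+):.*$", "\\1", fields)
  values
}

hdr <- c("From irregulars-admin@tb.tf Thu Aug 22 14:23:39 2002",
         "Delivered-To: zzzz@localhost.netnoteinc.com",
         "Received: from localhost (hypothetical first row)",
         "\tby localhost with IMAP (fetchmail-5.9.0)",
         "\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 14:23:38 +0100 (IST)",
         "MIME-Version: 1.0")
h <- parseHeader(hdr)
```

Keys such as Received can occur several times, so the result may contain repeated names. `h[names(h) == "Received"]` returns all of them, whereas `h[["Received"]]` returns only the first.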

1.2 The Body of the Email

The body of the email is all the text after the first blank line following the header and up to any attachments.

If the message has no attachments, then the body is everything excluding the header. If the message has

attachments, we need to find where they begin to find the body. So we look at attachments next.

1.3 E-mail Attachments

An Internet standard called MIME, Multipurpose Internet Mail Extensions, specifies how messages may be

formatted and how to separate the attachments from the message. Information about the MIME encoding is

provided through header fields, which are specified in an RFC.

The Content-Type key is used to describe the content of a component or of the entire body. The value

provides the top-level type and subtype using the syntax:

top-level/subtype; parameter.

Parameters may be required or optional. Below is an example of a content-type where the top-level is

multipart, which indicates there will be several documents in the body of the message, the mixed

subtype tells us that the documents may be of different types, and the boundary parameter provides a

special character string for delimiting the start and end of the message parts.

Content-Type: multipart/mixed;

boundary="----=_NextPart_000_00DE_01511A02.DB1A02A0"

The Content-Type field in this example tells the receiving e-mail program that this message has more than

one component, and each component will be separated by the string of characters

----=_NextPart_000_00DE_01511A02.DB1A02A0

The boundary string marks the beginning of each component. It is prefaced with two additional hyphens in

all instances. The boundary string is also used to denote the end of the message, where it is both prefaced

by two hyphens and followed immediately by two hyphens. The receiving email program knows when the

last component of the message has been read when it reads the boundary string with two additional hyphens

on either end of the string,

------=_NextPart_000_00DE_01511A02.DB1A02A0--

Each component of a message must be prefaced by the boundary string and a blank line. It may also contain

MIME information. If the blank line is missing, the recipient’s e-mail client may have difficulty telling

where the header information stops and the text of the message begins.
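The boundary rules above suggest a two-step approach: extract the boundary parameter from the Content-Type value, then cut the body at the rows that equal the boundary with its two-hyphen prefix. The following is a sketch under the assumption that boundary rows contain nothing else apart from possible trailing whitespace. `getBoundary`, `splitParts`, and the short sample body are hypothetical.

```r
# Pull the boundary parameter out of a Content-Type value (quotes optional).
getBoundary <- function(contentType) {
  pattern <- "boundary[[:space:]]*=[[:space:]]*\"?([^\";]+)\"?"
  m <- regmatches(contentType,
                  regexec(pattern, contentType, ignore.case = TRUE))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Split body lines into components, one per opening boundary row.
splitParts <- function(bodyLines, boundary) {
  trimmed <- sub("[[:space:]]+$", "", bodyLines)
  openMarker <- paste0("--", boundary)
  closeMarker <- paste0("--", boundary, "--")
  # Exact string comparison avoids treating the boundary as a regex.
  closing <- which(trimmed == closeMarker)
  end <- if (length(closing) > 0) closing[1] else length(bodyLines) + 1
  starts <- which(trimmed == openMarker)
  starts <- starts[starts < end]
  if (length(starts) == 0) return(list())
  stops <- c(starts[-1], end) - 1
  # Each component runs from the row after its marker to the next marker.
  Map(function(s, e) bodyLines[seq_len(e - s) + s], starts, stops)
}

ct <- "multipart/mixed; boundary=\"----=_NextPart_000_00DE_01511A02.DB1A02A0\""
b <- getBoundary(ct)
body <- c("preamble",
          paste0("--", b),
          "Content-Type: text/plain", "", "Hello",
          paste0("--", b),
          "Content-Type: text/html", "", "<p>Hello</p>",
          paste0("--", b, "--"))
parts <- splitParts(body, b)
```

Each element of `parts` is itself a small message with its own header and body, so the same header/body logic used for the whole message can be applied to it.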

There are seven top-level types of attachments: text, image, audio, video, application, multipart, and message.

Other examples of Content-Type values follow:

Content-type: text/html; charset=euc-kr;

Content-Type: application/zip; name="testFile.zip"

The first example indicates that the message is in HTML format using a Korean character set. The second

indicates that the component is a zip file, and the sender named it testFile.zip. Binary files (such as a

compressed archive) can be sent as attachments. In such cases, the sender’s software must first encode the

binary file so that it can be sent over the Internet. One common encoding scheme is known as base64.

We conclude by providing two sample e-mail messages. The first is a plain text e-mail with no attachments.

It consists of an instructor’s response to an e-mail inquiry sent by a student. The second e-mail message

consists of a text message and two attachments sent by a student to the instructor. This e-mail message

has then been forwarded by the instructor to the teaching assistant. The three periods at the end of each attachment indicate that only part of the attachment has been displayed. The first attachment is a PDF file and the second is an HTML file. The forwarded message is a plain text file.

From nt@stat.Berkeley.EDU Mon Feb 2 22:16:19 2004 -0800

Date: Mon, 2 Feb 2004 22:16:19 -0800 (PST)

From: nt@stat.Berkeley.EDU

X-X-Sender: nt@kestrel.Berkeley.EDU

Subject: Re: prof: did you receive my hw?

Message-ID: <Pine.SOL.4.50.0402022216120.2296-100000@kestrel.Berkeley.EDU>

References: <web-569552@calmail-st.berkeley.edu>

MIME-Version: 1.0

Content-Type: TEXT/PLAIN; charset=US-ASCII

Status: O

X-Status:

X-Keywords:

X-UID: 9079

-------------------------------------

On Mon, 2 Feb 2004, txxxx wrote:

> hey prof .nt,

>

> i sent out my hw on sunday night. i just wonder did you receive it.

> because i am kinda scared thatyou didnt’ receive it.

> like i just wonder how do i know if you got it or not, since the cal

> mail system is kinda weird sometimes. thanks

>

> txxxx

>

Figure 1: Sample email message with no attachments. The header includes fourteen key:value pairs. Note the Date key includes a time-zone offset, the Message-ID key gives the unique ID to track the mail from the stat.berkeley.edu mail server, the Content-Type key indicates it is a plain text message with no attachments, and there are four X- keys.

From nt@stat.Berkeley.EDU Mon Feb 2 22:18:56 2004 -0800

Date: Mon, 2 Feb 2004 22:18:55 -0800 (PST)

From: nt@stat.Berkeley.EDU

X-X-Sender: nt@kestrel.Berkeley.EDU

To: Gang Liang <liang@stat.Berkeley.EDU>

Subject: Assignment 1 sorry (fwd)

Message-ID: <Pine.SOL.4.50.0402022218470.2296-201000@kestrel.Berkeley.EDU>

MIME-Version: 1.0

Content-Type: MULTIPART/Mixed; BOUNDARY="_===669732====calmail-me.berkeley.edu===_"

Content-ID: <Pine.SOL.4.50.0402022218471.2296@kestrel.Berkeley.EDU>

Status: RO

X-Status:

X-Keywords:

X-UID: 9080

--_===669732====calmail-me.berkeley.edu===_

Content-Type: TEXT/PLAIN; CHARSET=US-ASCII; FORMAT=flowed

Content-ID: <Pine.SOL.4.50.0402022218472.2296@kestrel.Berkeley.EDU>

Figure 2: This sample email (split over two figures) has two attachments, a PDF file and an HTML file. The Content-Type key indicates that the attachments are separated by _===669732====calmail-me.berkeley.edu===_ with a -- prefix. The first part of the email body is a forwarded message. Note that it has its own header indicating the content type is plain text. Next is a PDF attachment which the owner has named PLOTS.pdf, and the third part is an HTML attachment. Both attachments are encoded in base64.

---------- Forwarded message ----------

Date: Mon, 02 Feb 2004 21:50:47 -0800

To: nt@stat.Berkeley.EDU

Subject: Assignment 1 sorry

I am sorry to send this email again, but my outbox told me that

the last email only send 1 attached file.

I am send ing this again to make sure you recieve the all

the necessary files.

Thank You and sorry for the inconvenience.

--_===669732====calmail-me.berkeley.edu===_

Content-Type: APPLICATION/PDF; CHARSET=US-ASCII

Content-Transfer-Encoding: BASE64

Content-ID: <Pine.SOL.4.50.0402022218473.2296@kestrel.Berkeley.EDU>

Content-Description:

Content-Disposition: ATTACHMENT; FILENAME="PLOTS.pdf"

JVBERi0xLjEKJYHigeOBz4HTDQoxIDAgb2JqCjw8Ci9DcmVhdGlvbkRhdGUgKEQ6MjAwNDAy

MDIxMTIwMTEpCi9Nb2REYXRlIChEOjIwMDQwMjAyMTEyMDExKQovVGl0bGUgKFIgR3JhcGhp

Y3MgT3V0cHV0KQovUHJvZHVjZXIgKFIgMS44LjEpCi9DcmVhdG9yIChSKQo+PgplbmRvYmoK

...

--_===669732====calmail-me.berkeley.edu===_

Content-Type: TEXT/HTML; CHARSET=US-ASCII

Content-Transfer-Encoding: BASE64

Content-ID: <Pine.SOL.4.50.0402022218474.2296@kestrel.Berkeley.EDU>

Content-Description:

Content-Disposition: ATTACHMENT; FILENAME="Stat133HW1.htm"

PGh0bWwgeG1sbnM6bz0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6b2ZmaWNl?M

PGh0bWwgeG1sbnM6bz0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6b2ZmaWNl?M

Ig0KeG1sbnM6dz0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6d29yZCINCnht?M

...

--_===669732====calmail-me.berkeley.edu===_--

Figure 3: This sample email (split over two figures) has two attachments, a PDF file and an HTML file. The Content-Type key indicates that the attachments are separated by _===669732====calmail-me.berkeley.edu===_ with a -- prefix. The first part of the email body is a forwarded message. Note that it has its own header indicating the content type is plain text. Next is a PDF attachment which the owner has named PLOTS.pdf, and the third part is an HTML attachment. Both attachments are encoded in base64.
