Text Processing: Emails For SPAM/HAM Classification
STA141B, Fall 2020
You may not look for or use code that addresses this dataset or use a package that can read an email
message into an R structure. You must implement the computational approach yourself. You can use
tutorials and Stack Overflow to find guidance on general questions about text processing.
We are all familiar with SPAM email messages. Ultimately, we would like to be able to use statistics to
classify new email messages as SPAM or HAM (valid mail). There are many statistical techniques we could
use. But before we can use them we need “data”. Each message is an observation. We need to “measure”
variables on each message in a sample of email messages. We also need to know if a message is SPAM or
HAM. We can then train a statistical classifier using these variables to predict if a new message is SPAM or
HAM.
In this assignment, you will create a data.frame from a set of email messages. Each row corresponds to an
email message and will contain 16 or more variables derived from that message. One variable is whether
the message is actually HAM or SPAM, which can be derived from the name of the folder in which the
message is located.
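As a rough illustration of deriving the SPAM/HAM label from the folder name, the sketch below assumes (hypothetically) that every folder holding SPAM has "spam" somewhere in its name and that no HAM folder does. The function name labelFromPath, the top-level directory SpamAssassinMessages and the file names are made up for the example; check the actual folder names in the data before relying on this rule.

```r
# Assumption: the immediate parent folder of each message file identifies its
# class, and only SPAM folders contain the string "spam" in their name.
labelFromPath <- function(paths) {
  folder <- basename(dirname(paths))       # name of the containing folder
  grepl("spam", folder, ignore.case = TRUE)
}

# Made-up paths for illustration; no files are read here.
paths <- c("SpamAssassinMessages/easy_ham/00001.abc",
           "SpamAssassinMessages/spam_2/00017.def")
isSpam <- labelFromPath(paths)

# With real data the paths might come from something like
#   list.files("SpamAssassinMessages", recursive = TRUE, full.names = TRUE)
```

Using only the immediate parent folder, rather than the whole path, avoids a false match when the top-level directory itself contains the word "spam".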
The data we use come from the SpamAssassin project. The specific data you are to work with are available
on Canvas.[1]
Your job is to
• process each email message into a structure (list) that contains the header, the body and any
attachments (each of which will also have a header and a body)
• create an R data frame of “derived” variables that give various measures of the email messages
• explore the data to see which variables help to discriminate between SPAM and HAM messages using
plots and numerical summaries
These derived variables might be, for example, the number of recipients to whom the mail was sent, the
percentage of capital letters in the body of the text, or whether the message is a reply to another message.
See below for a list of 25 possible variables. You are to write code to compute at least 15 of these, plus
whether the message is SPAM or HAM.
Many of the variables can be computed using similar approaches. For the most part, you should use regular
expressions rather than using strsplit to break the strings into individual characters.
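To make the regular-expression approach concrete, here is a minimal sketch that counts pattern matches with gregexpr and strips unwanted characters with gsub instead of splitting strings into individual characters. The helper names countMatches and percentCapitalLetters and the sample subject lines are invented for the example; this is one possible way to compute such quantities, not the required one.

```r
# Count the matches of a regular expression in each element of a character vector.
countMatches <- function(pattern, x) {
  m <- gregexpr(pattern, x)
  # gregexpr() reports -1 when there is no match, so count only positive positions.
  sapply(m, function(pos) sum(pos > 0))
}

# Made-up subject lines for illustration.
subject <- c("Re: FREE money!!! Act now!", "Meeting notes")
nExclaim <- countMatches("!", subject)

# Percentage of letters that are upper case, ignoring digits, blanks and punctuation.
percentCapitalLetters <- function(txt) {
  txt <- paste(txt, collapse = "\n")
  lettersOnly <- gsub("[^[:alpha:]]", "", txt)       # keep letters only
  if (nchar(lettersOnly) == 0) return(NA_real_)      # no letters at all
  upperOnly <- gsub("[^[:upper:]]", "", lettersOnly) # keep upper-case letters only
  100 * nchar(upperOnly) / nchar(lettersOnly)
}

pct <- percentCapitalLetters(c("HELLO there", "ok"))
```

The same counting helper can be reused with other patterns, which is why many of the variables can be computed in a similar way.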
You are very welcome to define new variables that you can compute from each email message that you think
will help to classify it as SPAM or HAM. Clearly state the meaning and specifics of the variable, why you
think it might be useful in classifying HAM or SPAM emails, and then show the code to compute it.
[1] https://canvas.ucdavis.edu/files/10090382/download
Once you have this data frame of “derived” variables, explore the relationships among these variables and
especially how they might be used to classify SPAM and HAM messages. In other words, look at frequency
tables and scatterplots of the variables, and color-code the points according to whether the message is SPAM or HAM.
Which variables seem to do best at discriminating between SPAM and HAM messages?
Your report should be similar in structure to the first assignment in that it has two primary sections:
• describe the computational approach and high- and intermediate-level details of how you implemented
this approach.
• explore the data and interpret them in the context of
Submit both the report (as a PDF document) and the R scripts for creating the data.frame and for exploring
the data, and, importantly, the R files containing functions you write and use in the script(s).
Some functions you might find useful include:
• grep, gsub, gregexpr, substring, nchar, strsplit
• sprintf, paste
• read.dcf
• table, plot, scatter.smooth, hexbin
1 The Anatomy of an E-mail message
Electronic mail, usually called e-mail, consists of simple text messages – a piece of text sent to a recipient
via the internet. An e-mail message consists of two parts, the header and the body. The body of the e-mail
message is separated from the header by a single blank line. When an attachment is added to an e-mail
message, the attachment is included in the body of the message. Even with attachments, e-mail messages
are still only text messages.
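The header/body rule described above can be sketched as follows, assuming the message has already been read into a character vector with one element per line (for instance with readLines, which is an assumption about how you load the files). The function name splitMessage and the body lines in the example are made up; the two header lines are taken from the sample header in the next section.

```r
# Split a message (one element per line) at the first blank line.
splitMessage <- function(lines) {
  blank <- grep("^[[:space:]]*$", lines)   # empty or whitespace-only lines
  if (length(blank) == 0) {
    # No blank line: treat everything as header.
    return(list(header = lines, body = character(0)))
  }
  first <- blank[1]
  list(header = lines[seq_len(first - 1)],  # everything before the blank line
       body   = lines[-seq_len(first)])     # everything after it
}

msg <- c("From: whisper@oz.net (David LeBlanc)",
         "Subject: [Spambayes] Deployment",
         "",
         "First line of the body.",
         "",
         "Second paragraph.")
parts <- splitMessage(msg)
```

Only the first blank line separates header from body; later blank lines belong to the body and are kept.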
1.1 The E-mail Header
The header contains information about the message such as the sender’s address, the recipient’s address, and
the date of transmission. This information is relayed in a special format that consists of KEY:VALUE pairs.
Below is a sample header from a message found on the SpamAssassin website.
Return-Path: whisper@oz.net
Delivery-Date: Fri Sep 6 20:53:36 2002
From: whisper@oz.net (David LeBlanc)
Date: Fri, 6 Sep 2002 12:53:36 -0700
Subject: [Spambayes] Deployment
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEJABCAB.tim.one@comcast.net>
Message-ID: <GCEDKONBLEFPPADDJCOECEHJENAA.whisper@oz.net>
Notice the keys are Return-Path, Delivery-Date, From, Date, Subject, In-Reply-To, and Message-ID. The
value follows the key. For example, in the above header, the value of the From key is
whisper@oz.net (David LeBlanc).
Description of Variables
Variable                  Type      Description
isSpam                    logical   whether the mail is Spam (TRUE) or Ham (FALSE)
isRe                      logical   whether the string Re: appears as the first word in the subject of the message
numLinesInBody            integer   a count of the number of lines in the body of the email message
bodyCharacterCount        integer   the number of characters in the body of the email message
replyUnderline            logical   whether the Reply-To field in the header has an underline and numbers/letters
subjectExclamationCount   integer   a count of the number of exclamation marks (!) in the subject of the message
subjectQuestCount         integer   the number of question marks in the subject
numAttachments            integer   the number of attachments in the message
priority                  logical   whether the message’s header had an X-Priority or X-SmellPriority that was set to high
numRecipients             integer   the number of recipients in the To, Cc fields
percentCapitals           numeric   the percentage of the characters in the body of the email that are upper case (excluding blanks, numbers, and punctuation)
isInReplyTo               logical   whether the header of the message has an In-Reply-To field
sortedRecipients          logical   whether the recipient list is sorted by address
subjectPunctuationCheck   logical   whether the subject has punctuation or digits surrounded by characters, e.g. V?agra and pay1ng, but not New!
hourSent                  integer   the hour in the day the mail was sent (0 – 23)
multipartText             logical   whether the header states that the message is a multipart/text, i.e. with attachments
containsImages            logical   whether the message contains images (in HTML)
isPGPsigned               logical   whether the mail was digitally signed (e.g. using PGP or GPG)
percentHTMLTags           numeric   the proportion of any HTML text in the message’s body that is made up of HTML markup and not content
subjectSpamWords          logical   whether the subject contains one of the following phrases: viagra, pounds, free, weight, guarantee, millions, dollars, credit, risk, prescription, generic, drug, money back, credit card
percentSubjectBlanks      numeric   the percentage of blanks in the subject
messageIdHasNoHostname    logical   whether the message id that uniquely identifies the message has no component identifying the machine from which it was sent
fromNumericEnd            logical   whether the user login in the From: field ends in numbers
isYelling                 logical   whether the Subject of the mail is in capital letters
percentForwards           numeric   percent of the message’s body that is made up of content included from other messages
isOriginalMessage         logical   body does not contain the phrase “original message” or something similar
isDear                    logical   whether the message body contains a form of the introduction Dear ...
isWrote                   logical   whether the text includes a line indicating an included message as identified by the word wrote: in several different possible languages
averageWordLength         numeric   the average length of the words in the body of the message
numDollarSigns            integer   the number of dollar signs in the body of the message
Some of these keys are mandatory such as Date, From, and To (or In-Reply-To, or Bcc). Other
keys are optional but widely used, such as Subject, Cc, Received, and Message-ID. Many keys
are ignored by the mail system, but the entire header is relayed on to the recipient’s server whether or not
it is recognized. For example, keys starting with “X-” are for personal application or institution use and are
ignored by other applications. The Received header lines are important because they allow the message
to be tracked. As a message makes its way to the intended recipient, servers add additional Received lines
to the header.
Below are some typical header keys:
• Message-Id: a unique identifier for the message, assigned by the originating server;
• Return-Path: specifies the sender’s address and bounced mail gets sent to this address;
• Date: added by the e-mail client;
• Cc: lists the recipients of a “carbon copied” message;
• Reply-To: the address set by the sender to which the recipient can reply;
• MIME-Version: used for encoding binary content as attachments.
A value may be continued on a second line of the header, in which case the line will be indented and begin
with a tab character or blank spaces. Consider this header:
From irregulars-admin@tb.tf Thu Aug 22 14:23:39 2002
Return-Path: <irregulars-admin@tb.tf>
Delivered-To: zzzz@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9DAE147C66
for <zzzz@localhost>; Thu, 22 Aug 2002 09:23:38 -0400 (EDT)
Received: from phobos [127.0.0.1]
by localhost with IMAP (fetchmail-5.9.0)
for zzzz@localhost (single-drop); Thu, 22 Aug 2002 14:23:38 +0100 (IST)
MIME-Version: 1.0
After the ’From ’ row, there are two fields (Return-Path and Delivered-To) on separate lines. The next two
fields however are ’Received’ fields and both span 3 rows. The first row of each starts in the first column
and identifies the key ’Received’. The subsequent lines start with white space. The second ’Received’ starts
in the first column and so indicates the start of a new field, and the conclusion of the previous field.
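The continuation rule can be sketched in R as follows. This is a minimal illustration rather than a complete parser: the function name parseHeader is made up, the sample header above is reproduced with the leading tab on continuation lines written explicitly as "\t" (an assumption about the original indentation), and lines without a colon other than the initial 'From ' row are not handled specially.

```r
# Turn header lines into a named character vector of values, one per field.
parseHeader <- function(header) {
  # The initial "From " row has no colon after the key, so drop it.
  if (length(header) > 0 && grepl("^From ", header[1])) header <- header[-1]
  if (length(header) == 0) return(character(0))
  # A line that starts with a blank or tab continues the previous field.
  isCont  <- grepl("^[ \t]", header)
  fieldId <- cumsum(!isCont)
  # Glue the lines of each field back together into a single string.
  fields <- vapply(split(header, fieldId),
                   function(x) paste(trimws(x), collapse = " "),
                   character(1))
  fields <- unname(fields)
  keys   <- sub("^([^:]+):.*$", "\\1", fields)       # text before the first colon
  values <- sub("^[^:]+:[[:space:]]*", "", fields)   # text after it
  names(values) <- keys   # repeated keys such as Received are kept as duplicates
  values
}

header <- c(
  "From irregulars-admin@tb.tf Thu Aug 22 14:23:39 2002",
  "Return-Path: <irregulars-admin@tb.tf>",
  "Delivered-To: zzzz@localhost.netnoteinc.com",
  "Received: from localhost (localhost [127.0.0.1])",
  "\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9DAE147C66",
  "\tfor <zzzz@localhost>; Thu, 22 Aug 2002 09:23:38 -0400 (EDT)",
  "Received: from phobos [127.0.0.1]",
  "\tby localhost with IMAP (fetchmail-5.9.0)",
  "\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 14:23:38 +0100 (IST)",
  "MIME-Version: 1.0")
h <- parseHeader(header)
```

A named character vector is used here because keys can repeat; selecting h[names(h) == "Received"] returns both Received values.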
1.2 The Body of the Email
The body of the email is all the text after the first blank line following the header and up to any attachments.
If the message has no attachments, then the body is everything excluding the header. If the message has
attachments, we need to find where they begin to find the body. So we look at attachments next.
1.3 E-mail Attachments
An Internet standard called MIME, Multipurpose Internet Mail Extensions, specifies how messages may be
formatted and how to separate the attachments from the message. Information about the MIME encoding is
provided through header fields, which are specified in an RFC.
The Content-Type key is used to describe the content of a component or of the entire body. The value
provides the top-level type and subtype using the syntax:
top-level/subtype; parameter.
Parameters may be required or optional. Below is an example of a content-type where the top-level is
multipart, which indicates there will be several documents in the body of the message, the mixed
subtype tells us that the documents may be of different types, and the boundary parameter provides a
special character string for delimiting the start and end of the message parts.
Content-Type: multipart/mixed;
boundary="----=_NextPart_000_00DE_01511A02.DB1A02A0"
The Content-Type field in this example tells the receiving e-mail program that this message has more than
one component, and each component will be separated by the string of characters
----=_NextPart_000_00DE_01511A02.DB1A02A0
The boundary string marks the beginning of each component. It is prefaced with two additional hyphens in
all instances. The boundary string is also used to denote the end of the message, where it is both prefaced
by two hyphens and followed immediately by two hyphens. The receiving email program knows when the
last component of the message has been read when it reads the boundary string with two additional hyphens
on either end of the string,
------=_NextPart_000_00DE_01511A02.DB1A02A0--
Each component of a message must be prefaced by the boundary string and a blank line. It may also contain
MIME information. If the blank line is missing, the recipient’s e-mail client may have difficulty telling
where the header information stops and the text of the message begins.
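The boundary mechanics described above might be sketched like this. The helper names getBoundary and splitParts are hypothetical, the Content-Type value and boundary string are the ones from the example above, and the contents of the two parts are placeholders invented for the illustration. The sketch assumes the body is a character vector with one element per line and that boundary lines carry nothing but the boundary string and optional trailing white space.

```r
# Extract the boundary parameter from a Content-Type value (quoted or not).
getBoundary <- function(contentType) {
  pattern <- "boundary[[:space:]]*=[[:space:]]*\"?([^\";]+)\"?"
  m <- regmatches(contentType,
                  regexec(pattern, contentType, ignore.case = TRUE))[[1]]
  if (length(m) < 2) NA_character_ else trimws(m[2])
}

# Split the body lines into the components delimited by the boundary.
splitParts <- function(body, boundary) {
  trimmed <- sub("[[:space:]]+$", "", body)             # ignore trailing white space
  starts  <- which(trimmed == paste0("--", boundary))   # start of each component
  end     <- which(trimmed == paste0("--", boundary, "--"))  # closing marker
  if (length(starts) == 0) return(list())
  last  <- if (length(end) > 0) end[1] else length(body) + 1
  stops <- c(starts[-1], last)
  # Each component is the run of lines strictly between two markers.
  Map(function(from, to) body[from + seq_len(max(to - from - 1, 0))],
      starts, stops)
}

ct <- "multipart/mixed; boundary=\"----=_NextPart_000_00DE_01511A02.DB1A02A0\""
b  <- getBoundary(ct)

# Placeholder body: the part contents are invented for the example.
body <- c(paste0("--", b),
          "Content-Type: text/plain",
          "",
          "Hello.",
          paste0("--", b),
          "Content-Type: application/zip; name=\"testFile.zip\"",
          "",
          "<base64 text>",
          paste0("--", b, "--"))
parts <- splitParts(body, b)
```

The markers are compared as exact strings rather than as regular expressions because boundary strings often contain characters such as . and = that would otherwise need escaping. Each returned component still contains its own header lines, a blank line and its content.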
There are seven top-level types of attachments: text, image, audio, video, application, multipart, and message.
Other examples of Content-Type values follow:
Content-type: text/html; charset=euc-kr;
Content-Type: application/zip; name="testFile.zip"
The first example indicates that the message is in HTML format using a Korean character set. The second
indicates that the component is a zip file, and the sender named it testFile.zip. Binary files (such as a
compressed archive) can be sent as attachments. In such cases, the sender’s software must first encode the
binary file so that it can be sent over the Internet. One common encoding scheme is known as base64.
We conclude by providing two sample e-mail messages. The first is a plain text e-mail with no attachments.
It consists of an instructor’s response to an e-mail inquiry sent by a student. The second e-mail message
consists of a text message and two attachments sent by a student to the instructor. This e-mail message
has then been forwarded by the instructor to the teaching assistant. The three periods at the end of each
attachment indicate that only part of the attachment has been displayed. The first attachment is a PDF file
and the second is an HTML file. The forwarded message is a plain text file.
From nt@stat.Berkeley.EDU Mon Feb 2 22:16:19 2004 -0800
Date: Mon, 2 Feb 2004 22:16:19 -0800 (PST)
From: nt@stat.Berkeley.EDU
X-X-Sender: nt@kestrel.Berkeley.EDU
To: Txxxx Uxxx <txxxx@uclink.berkeley.edu>
Subject: Re: prof: did you receive my hw?
In-Reply-To: <web-569552@calmail-st.berkeley.edu>
Message-ID: <Pine.SOL.4.50.0402022216120.2296-100000@kestrel.Berkeley.EDU>
References: <web-569552@calmail-st.berkeley.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: O
X-Status:
X-Keywords:
X-UID: 9079
Yes it was received.
-------------------------------------
On Mon, 2 Feb 2004, txxxx wrote:
> hey prof .nt,
>
> i sent out my hw on sunday night. i just wonder did you receive it.
> because i am kinda scared thatyou didnt’ receive it.
> like i just wonder how do i know if you got it or not, since the cal
> mail system is kinda weird sometimes. thanks
>
> txxxx
>
Figure 1: Sample email message with no attachments. The header includes fourteen key:value pairs. Note
the Date key includes a time-zone offset, the Message-ID key gives the unique ID to track the mail from
the stat.berkeley.edu mail server, the Content-Type key indicates it is a plain text message with
no attachments, and there are four X- keys.
From nt@stat.Berkeley.EDU Mon Feb 2 22:18:56 2004 -0800
Date: Mon, 2 Feb 2004 22:18:55 -0800 (PST)
From: nt@stat.Berkeley.EDU
X-X-Sender: nt@kestrel.Berkeley.EDU
To: Gang Liang <liang@stat.Berkeley.EDU>
Subject: Assignment 1 sorry (fwd)
Message-ID: <Pine.SOL.4.50.0402022218470.2296-201000@kestrel.Berkeley.EDU>
MIME-Version: 1.0
Content-Type: MULTIPART/Mixed; BOUNDARY="_===669732====calmail-me.berkeley.edu===_"
Content-ID: <Pine.SOL.4.50.0402022218471.2296@kestrel.Berkeley.EDU>
Status: RO
X-Status:
X-Keywords:
X-UID: 9080
--_===669732====calmail-me.berkeley.edu===_
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII; FORMAT=flowed
Content-ID: <Pine.SOL.4.50.0402022218472.2296@kestrel.Berkeley.EDU>
Figure 2: This sample email (split over two figures) has two attachments, a PDF file and
an HTML file. The Content-Type key indicates that the attachments are separated by
_===669732====calmail-me.berkeley.edu===_ with a -- prefix. The first part of the email
body is a forwarded message. Note that it has its own header indicating the content type is plain text.
Next is a PDF attachment, which the owner has named PLOTS.pdf, and the third part is an HTML attachment.
Both attachments are encoded in base64.
---------- Forwarded message ----------
Date: Mon, 02 Feb 2004 21:50:47 -0800
From: Yyyy Zzz <Zzz@uclink.berkeley.edu>
To: nt@stat.Berkeley.EDU
Subject: Assignment 1 sorry
I am sorry to send this email again, but my outbox told me that
the last email only send 1 attached file.
I am send ing this again to make sure you recieve the all
the necessary files.
Thank You and sorry for the inconvenience.
--_===669732====calmail-me.berkeley.edu===_
Content-Type: APPLICATION/PDF; CHARSET=US-ASCII
Content-Transfer-Encoding: BASE64
Content-ID: <Pine.SOL.4.50.0402022218473.2296@kestrel.Berkeley.EDU>
Content-Description:
Content-Disposition: ATTACHMENT; FILENAME="PLOTS.pdf"
JVBERi0xLjEKJYHigeOBz4HTDQoxIDAgb2JqCjw8Ci9DcmVhdGlvbkRhdGUgKEQ6MjAwNDAy
MDIxMTIwMTEpCi9Nb2REYXRlIChEOjIwMDQwMjAyMTEyMDExKQovVGl0bGUgKFIgR3JhcGhp
Y3MgT3V0cHV0KQovUHJvZHVjZXIgKFIgMS44LjEpCi9DcmVhdG9yIChSKQo+PgplbmRvYmoK
...
--_===669732====calmail-me.berkeley.edu===_
Content-Type: TEXT/HTML; CHARSET=US-ASCII
Content-Transfer-Encoding: BASE64
Content-ID: <Pine.SOL.4.50.0402022218474.2296@kestrel.Berkeley.EDU>
Content-Description:
Content-Disposition: ATTACHMENT; FILENAME="Stat133HW1.htm"
PGh0bWwgeG1sbnM6bz0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6b2ZmaWNlˆM
PGh0bWwgeG1sbnM6bz0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6b2ZmaWNlˆM
Ig0KeG1sbnM6dz0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6d29yZCINCnhtˆM
...
--_===669732====calmail-me.berkeley.edu===_--
Figure 3: This sample email (split over two figures) has two attachments, a PDF file and
an HTML file. The Content-Type key indicates that the attachments are separated by
_===669732====calmail-me.berkeley.edu===_ with a -- prefix. The first part of the email
body is a forwarded message. Note that it has its own header indicating the content type is plain text.
Next is a PDF attachment, which the owner has named PLOTS.pdf, and the third part is an HTML attachment.
Both attachments are encoded in base64.