Title: | Phrase Mining |
---|---|
Description: | Functions to extract and handle commonly occurring principal phrases obtained from collections of texts. |
Authors: | Ellie Small [aut, cre] |
Maintainer: | Ellie Small <[email protected]> |
License: | GPL-3 |
Version: | 1.1.2 |
Built: | 2024-11-03 04:56:06 UTC |
Source: | https://github.com/cran/phm |
Obtain all principal phrases from a corpus or a collection of texts and their frequencies in each of those texts. A principal phrase is a phrase that does not cross punctuation marks, does not start with a stop word, with the exception of the stop words "not" and "no", does not end with a stop word, is frequent within those texts without being double counted, and is meaningful to the user.
The function phraseDoc extracts all principal phrases from a corpus of documents or a character vector of texts and creates an object of class phraseDoc. The method as.matrix converts a phraseDoc object to a term-frequency matrix. The function freqPhrases displays the most frequent principal phrases in a phraseDoc object. The function getDocs creates a frequency matrix with all documents/texts that contain certain phrases, while getPhrases creates a frequency matrix with all phrases present in a specific collection of documents/texts.
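The start and end rules in the definition of a principal phrase can be illustrated with a small base R sketch. This is not the package's implementation: the function is_candidate and the short stop-word list are hypothetical and exist only for this illustration; the package's own lists come from stopStartWords() and stopEndWords().

```r
# Hypothetical, tiny stop-word list used only for this illustration.
stop_words <- c("a", "an", "the", "is", "of", "to", "not", "no")

# TRUE if a phrase (a character vector of words) satisfies the start/end
# rules: it may start with "not" or "no" but with no other stop word,
# and it may not end with any stop word.
is_candidate <- function(words, stops = stop_words) {
  words <- tolower(words)
  first <- words[1]
  last <- words[length(words)]
  start_ok <- !(first %in% setdiff(stops, c("not", "no")))
  end_ok <- !(last %in% stops)
  start_ok && end_ok
}

is_candidate(c("phrase", "mining"))    # passes both rules
is_candidate(c("not", "significant"))  # allowed: starts with "not"
is_candidate(c("the", "phrase"))       # rejected: starts with a stop word
is_candidate(c("mining", "of"))        # rejected: ends with a stop word
```

The sketch ignores the remaining conditions (punctuation, frequency, and double counting), which depend on the whole collection of texts rather than on a single phrase.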
Convert a phraseDoc Object to a Matrix
## S3 method for class 'phraseDoc'
as.matrix(x, ids = TRUE, sparse = FALSE, ...)
x | A phraseDoc object. |
ids | A logical value: TRUE (default) to use ids (if available), FALSE to use indices. |
sparse | A logical value indicating whether a sparse matrix should be returned (default FALSE). |
... | Additional arguments. |
A matrix with phrases as rows, texts as columns, and elements containing the number of times the phrase occurs in the text.
tst=c("This is a test text", "This is a test text 2",
      "This is another test text", "This is another test text 2",
      "This girl will test text that man", "This boy will test text that man")
pd=phraseDoc(tst)
as.matrix(pd)
Find the documents in a corpus that contain the most high frequency phrases and return a corpus with just those documents.
bestDocs(co, num = 3L, n = 10L, pd = NULL)
co | A corpus with documents. |
num | Integer with the number of documents to return. |
n | Integer with the number of high frequency phrases to use. |
pd | phraseDoc object for the corpus in co. |
A corpus with the num documents that have the most high frequency phrases, in order of the number of high frequency phrases. The corpus returned will have the meta field oldIdx set to the index of the document in the original corpus, and the meta field hfPhrases to the number of high frequency phrases it contains.
v1=c("Here is some text to test phrase mining", "phrase mining is fun",
     "Some text is better than no text", "No text, no phrase mining")
co=tm::VCorpus(tm::VectorSource(v1))
pd=phraseDoc(co,min.freq=2)
bestDocs(co,2,2,pd)
When two vectors are given, this calculates the Canberra distance between them. For each pair of corresponding elements that are not both zero, the absolute difference is divided by the sum of their absolute values; the Canberra distance is the sum of these ratios.
canberra(x, y)
x | A numeric vector. |
y | A numeric vector of the same dimension as x. |
The Canberra distance between x and y. For example, for the vectors (1,2,0) and (0,1,1), position 1 contributes abs(1-0)/(1+0) = 1, position 2 contributes abs(2-1)/(2+1) = 1/3, and position 3 contributes abs(0-1)/(0+1) = 1; added together this results in 2 1/3, or about 2.33. For comparison, a text distance (see textDist) of zero indicates that the two vectors are equal, while a text distance of 1 indicates that they have no terms in common.
canberra(c(1,2,0),c(0,1,1))
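The calculation described above can be written in a few lines of base R. This is only a sketch of the formula as stated in the text, not the package source; the name canberra_sketch is hypothetical.

```r
# Sketch of the Canberra distance as described in the text
# (hypothetical helper, not the package's canberra function).
canberra_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  denom <- abs(x) + abs(y)
  keep <- denom > 0  # skip positions where both elements are zero
  sum(abs(x - y)[keep] / denom[keep])
}

# Worked example from the text: 1/1 + 1/3 + 1/1
canberra_sketch(c(1, 2, 0), c(0, 1, 1))
```

Positions where both elements are zero are dropped before dividing, which avoids a 0/0 term.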
This function will create a DFSource object from a data frame that contains at least columns id and text, but may contain several more. VCorpus will use this to read in each row from the data frame into a PlainTextDocument, storing additional variables in its metadata. It will then combine all those PlainTextDocuments in a VCorpus object.
DFSource(x)
x | A data frame with at least a text and an id column, and a row for each document to be stored in a corpus. |
A DFSource object containing the encoding set to "", the number of rows (length), the current position (position=0), the type of reader to use (reader=readDF), and the content (x).
(df=data.frame(id=1:3, text=c("First text","Second text","Third text"),
               title=c("N1","N2","N3")))
DFSource(df)
Calculate a distance matrix for a numeric matrix, where a distance function is used to calculate the distance between all combinations of the columns of the matrix M.
distMatrix(M, fn = "textDist", ...)
M | A numeric matrix. |
fn | The name of a distance function, default is "textDist". |
... | Additional arguments to be passed to the distance function. |
The distance matrix with the distance between all combinations of the columns of M according to the distance function in fn.
M=matrix(c(0,1,0,2,0,10,0,14,12,0,8,0,1,0,1,0),4)
colnames(M)=1:4;rownames(M)=c("A","B","C","D")
M
#Text distance matrix
distMatrix(M)
#Canberra distance matrix
distMatrix(M,"canberra")
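The idea of applying a named distance function to every pair of columns can be sketched in base R as follows. The helpers dist_matrix_sketch and prop_unmatched are hypothetical stand-ins written for this illustration; the sketch returns a plain square matrix, and the return type of the package's distMatrix may differ.

```r
# Hypothetical stand-in distance: proportion of unmatched frequencies.
prop_unmatched <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Apply a distance function, given by name, to all pairs of columns of M.
dist_matrix_sketch <- function(M, fn, ...) {
  f <- match.fun(fn)  # look up the function from its name
  n <- ncol(M)
  D <- matrix(0, n, n, dimnames = list(colnames(M), colnames(M)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      D[i, j] <- f(M[, i], M[, j], ...)
    }
  }
  D
}

M <- matrix(c(0, 1, 0, 2, 0, 10, 0, 14, 12, 0, 8, 0, 1, 0, 1, 0), 4)
colnames(M) <- 1:4
rownames(M) <- c("A", "B", "C", "D")
D <- dist_matrix_sketch(M, "prop_unmatched")
D
```

Passing the function by name mirrors the fn argument above; any additional arguments are forwarded through the dots.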
Display the most frequent principal phrases in a phraseDoc object.
freqPhrases(pd, n = 10)
pd | A phraseDoc object. |
n | Number of principal phrases to display. |
A vector with the n most frequent principal phrases and their frequencies.
tst=c("This is a test text", "This is a test text 2",
      "This is another test text", "This is another test text 2",
      "This girl will test text that man", "This boy will test text that man")
pd=phraseDoc(tst)
freqPhrases(pd, 2)
Display a frequency matrix containing all the documents that contain any of the phrases in phrs and the number of times they occur in that document.
getDocs(pd, phrs, ids = TRUE)
pd | A phraseDoc object. |
phrs | A set of phrases. |
ids | A logical value: TRUE (default) to return ids (if available), FALSE to return indices. |
A matrix with the documents and the number of occurrences for the phrases in phrs.
tst=c("This is a test text", "This is a test text 2",
      "This is another test text", "This is another test text 2",
      "This girl will test text that man", "This boy will test text that man")
pd=phraseDoc(tst)
getDocs(pd, c("test text","another test text"))
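The selection that getDocs is described as performing can be pictured on a plain frequency matrix with phrases as rows and documents as columns. The following base R sketch uses a made-up matrix and a hypothetical helper get_docs_sketch; the layout of the matrix actually returned by getDocs may differ.

```r
# Made-up phrase-by-document frequency matrix for illustration only.
m <- matrix(c(1, 0, 1,
              0, 0, 1,
              0, 2, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("test text", "another test text", "text mining"),
                            c("d1", "d2", "d3")))

# Keep the requested phrases, then keep only documents containing any of them.
get_docs_sketch <- function(m, phrs) {
  sub <- m[rownames(m) %in% phrs, , drop = FALSE]
  sub[, colSums(sub) > 0, drop = FALSE]
}

res <- get_docs_sketch(m, c("test text", "another test text"))
res
```

Document d2 is dropped because it contains neither of the requested phrases.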
Using the position field of x to indicate the index of the current row, we retrieve the current row of the content of a DFSource. This function is mainly used by the VCorpus function.
## S3 method for class 'DFSource'
getElem(x)
x | A DFSource object. |
A list with the current row in the content of a DFSource object. The current row index is the position in the DFSource object.
library(tm)
df=data.frame(id=1:3, text=c("First text","Second text","Third text"),
              title=c("N1","N2","N3"))
getElem(stepNext(DFSource(df)))
Display a frequency matrix containing all the documents for which the indices are given in doc, with their principal phrases and the number of times they occur in each document.
getPhrases(pd, doc, ids = TRUE)
pd | A phraseDoc object. |
doc | An integer vector containing indices of documents, or a character vector containing the ids of documents (column names). |
ids | A logical value: TRUE (default) to return ids (if available), FALSE to return indices. |
A matrix with the documents and the number of occurrences of principal phrases for the documents in doc.
tst=c("This is a test text", "This is a test text 2",
      "This is another test text", "This is another test text 2",
      "This girl will test text that man", "This boy will test text that man")
pd=phraseDoc(tst)
getPhrases(pd, c(1,3))
This function takes as input a file produced via PubMed in PubMed format and outputs a data frame with the id equal to the PMID, text equal to the abstract, date, title, and author for each publication in the file.
getPubMed(file)
file | Path to the PubMed file. |
A data table with a row for each publication holding the id equal to the PMID, text equal to the abstract, date, title, and author for that publication.
#Go to PubMed and enter search criteria, save the result to PubMed format.
#If the file is called pubmed_result.txt and located in the current
#directory:
#PM=getPubMed("pubmed_result.txt")
#Will load the data from the search into a data table called PM
Create an object of class phraseDoc. This will hold all principal phrases of a collection of texts that occur a minimum number of times, plus the texts they occur in and their position within those texts.
phraseDoc(
  co,
  mn = 2,
  mx = 8,
  ssw = stopStartWords(),
  sew = stopEndWords(),
  sp = stopPhrases(),
  min.freq = 2,
  principal = function(phrase, freq) {
    freq >= min.freq
  },
  max.phrases = 1500,
  shiny = FALSE,
  silent = FALSE
)
co | A corpus or a character vector with each element the text of a document. |
mn | Minimum number of words in a phrase. |
mx | Maximum number of words in a phrase. |
ssw | A set of words no phrase should start with. |
sew | A set of words no phrase should end with. |
sp | A set of phrases to be excluded. |
min.freq | The minimum frequency of phrases to be included. |
principal | Function that determines if a phrase is a principal phrase. By default, FALSE is returned if the phrase occurs less often than the number in min.freq. |
max.phrases | Maximum number of phrases to be included. |
shiny | TRUE if called from a shiny program. This will allow progress to be recorded on a progress meter; the function uses about 100 progress steps, so it should be created inside a withProgress function with the argument max set to at least 100. |
silent | TRUE if you do not want progress messages. |
Object of class phraseDoc
tst=c("This is a test text", "This is a test text 2",
      "This is another test text", "This is another test text 2",
      "This girl will test text that man", "This boy will test text that man")
phraseDoc(tst)
Print a phraseDoc Object
## S3 method for class 'phraseDoc'
print(x, ...)
x | Object of type phraseDoc. |
... | Additional arguments. |
tst=c("This is a test text", "This is a test text 2",
      "This is another test text", "This is another test text 2",
      "This girl will test text that man", "This boy will test text that man")
(pd=phraseDoc(tst))
Print a textCluster Object
## S3 method for class 'textCluster'
print(x, ...)
x | Object of type textCluster. |
... | Additional arguments. |
The total number of clusters and total number of documents are printed. There is no return value.
M=matrix(c(0,1,0,2,0,10,0,14,12,0,8,0,1,0,1,0),4)
colnames(M)=1:4;rownames(M)=c("A","B","C","D")
tc=textCluster(M,2)
tc
Read a row of the content of a DFSource object into a PlainTextDocument.
readDF(elem, language, id = "1")
elem | A list containing the field content containing one row with data from a data frame containing at least the columns id and text, but possibly more. |
language | Abbreviation of the language used; "en" for English. |
id | Not used, but needed for VCorpus. |
A PlainTextDocument with content equal to the contents of the text field, and meta data containing the information in the remaining fields, including the id field
(df=data.frame(id=1:3, text=c("First text","Second text","Third text"),
               title=c("N1","N2","N3")))
readDF(list(content=df[1,]),"en")
Remove a set of phrases from a phraseDoc object.
removePhrases(pd, phrs)
pd | A phraseDoc object. |
phrs | A set of phrases. |
A phraseDoc object with the phrases in phrs removed.
tst=c("This is a test text", "This is a test text 2",
      "This is another test text", "This is another test text 2",
      "This girl will test text that man", "This boy will test text that man")
pd=phraseDoc(tst)
removePhrases(pd, c("test text","another test text"))
Show all documents and their non-zero terms in a cluster, with the terms first ordered by highest number of documents the term appears in, then total frequency.
showCluster(tdm, clust, cl, n = 10L)
tdm | A term frequency matrix. |
clust | A vector indicating for each column in tdm the cluster it belongs to. |
cl | Cluster number. |
n | Integer with the maximum number of terms to be returned (default 10). |
A matrix with the document names of tdm on the columns and terms on the rows, for all columns in the cluster. Terms that appear in the most documents (columns), and within that have the highest frequency in the cluster, are shown first. Two columns are added at the end of the matrix with the number of documents each term appears in and its total frequency in the cluster. The number of terms displayed equals the number in n, or fewer if there are fewer terms in the cluster.
If there are no terms at all in the cluster, a list is output with the items docs and note, where docs is a vector with the names of all documents in the cluster, and note states that the cluster has no terms.
M=matrix(c(0,1,0,2,0,10,0,14,12,0,8,0,1,0,1,0),4)
colnames(M)=1:4;rownames(M)=c("A","B","C","D")
tc=textCluster(M,2)
showCluster(M,tc$cluster,1)
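The ordering described for the returned matrix (number of documents first, then total frequency) can be sketched in base R. The helper show_cluster_sketch, the column names nDocs and totFreq, the wording of the note, and the cluster assignment c(1, 1, 2, 2) are all assumptions made for this illustration; they are not taken from the package.

```r
# Hypothetical sketch of the ordering described above.
show_cluster_sketch <- function(tdm, clust, cl, n = 10L) {
  sub <- tdm[, clust == cl, drop = FALSE]        # documents in the cluster
  sub <- sub[rowSums(sub) > 0, , drop = FALSE]   # terms present in the cluster
  if (nrow(sub) == 0) {
    return(list(docs = colnames(sub), note = "This cluster has no terms"))
  }
  n_docs <- rowSums(sub > 0)   # number of documents each term appears in
  tot_freq <- rowSums(sub)     # total frequency of each term in the cluster
  ord <- order(-n_docs, -tot_freq)
  out <- cbind(sub, nDocs = n_docs, totFreq = tot_freq)[ord, , drop = FALSE]
  out[seq_len(min(n, nrow(out))), , drop = FALSE]
}

M <- matrix(c(0, 1, 0, 2, 0, 10, 0, 14, 12, 0, 8, 0, 1, 0, 1, 0), 4)
colnames(M) <- 1:4
rownames(M) <- c("A", "B", "C", "D")
clust <- c(1, 1, 2, 2)  # assumed assignment, not the output of textCluster
res <- show_cluster_sketch(M, clust, 1)
res
```

In this example both remaining terms appear in two documents, so the tie is broken by total frequency and D is listed before B.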
Create a vector with words that principal phrases should not end with.
stopEndWords()
vector with words
stopEndWords()
Create a vector with phrases that are not principal phrases.
stopPhrases()
vector with phrases
stopPhrases()
Create a vector with words that principal phrases should not start with.
stopStartWords()
vector with words
stopStartWords()
Combine documents (columns) into k clusters that have texts that are most similar based on their text distance. Documents with no terms are assigned to the last cluster.
textCluster(tdm, k, mx = 100, md = 5 * k)
tdm | A term document matrix with terms on the rows and documents on the columns. |
k | A positive integer with the number of clusters needed. |
mx | Maximum number of times to iterate (default 100). |
md | Maximum number of documents to use for the initial setup (default 5*k). |
A textCluster object with three items: cluster, centroids, and size. Here cluster contains a vector indicating for each column in tdm which cluster it has been assigned to, centroids contains a matrix in which each column is the centroid of a cluster, and size is a named vector with the size of each cluster.
M=matrix(c(0,1,0,2,0,10,0,14,12,0,8,0,1,0,1,0),4)
colnames(M)=1:4;rownames(M)=c("A","B","C","D")
textCluster(M,2)
When two vectors are given, this calculates the text distance between them; text distance is calculated as the proportion of unmatched frequencies, i.e., the number of unmatched frequencies divided by the total frequencies among the two vectors. However, if neither vector has any values at all, their distance equals the number provided in the zeroes argument, which is .5 by default. When two matrices are given, the text distance between corresponding columns is calculated.
textDist(x, y, zeroes = 0.5)
x | A numeric vector or matrix. |
y | A numeric vector or matrix of the same dimension as x. |
zeroes | Text distance when both vectors are zero vectors; default is .5. |
When x and y are vectors, the text distance between them. For example, the vectors (1,2,0) and (0,1,1) contain a total of 5 frequencies. Position 1 matches nothing when it could have matched 1 frequency, so 1 remains unmatched. Position 2 matches 1 frequency when it could have matched 2, so 1 remains unmatched. Position 3 matches nothing when it could have matched 1, so 1 remains unmatched. This gives 3 unmatched frequencies out of 5, resulting in a text distance of 3/5=.6.
If x and y are matrices, a vector with the text distance between corresponding columns is returned. So for two 4x2 matrices, a vector with two values is returned: the first is the text distance between the first columns of the matrices, and the second is the text distance between the second columns. For large sets of data, it is recommended to use matrices, as this is much more efficient than calculating column by column.
#text distance between two vectors
textDist(c(1,2,0),c(0,1,1))
(M1=matrix(c(0,1,0,2,0,10,0,14),4))
(M2=matrix(c(12,0,8,0,1,3,1,2),4))
#text distance between corresponding columns of M1 and M2
textDist(M1,M2)
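The proportion of unmatched frequencies can be sketched in base R as the sum of absolute differences divided by the sum of all frequencies. This is an illustration of the description above rather than the package source; the name text_dist_sketch is hypothetical, and the sketch assumes non-negative frequencies.

```r
# Sketch of the text distance as described in the text
# (hypothetical helper; assumes non-negative frequencies).
text_dist_sketch <- function(x, y, zeroes = 0.5) {
  x <- as.matrix(x)  # a vector becomes a one-column matrix
  y <- as.matrix(y)
  stopifnot(identical(dim(x), dim(y)))
  unmatched <- colSums(abs(x - y))  # unmatched frequencies per column
  total <- colSums(x + y)           # total frequencies per column
  ifelse(total == 0, zeroes, unmatched / total)
}

# Worked example from the text: 3 unmatched out of 5 frequencies
text_dist_sketch(c(1, 2, 0), c(0, 1, 1))

M1 <- matrix(c(0, 1, 0, 2, 0, 10, 0, 14), 4)
M2 <- matrix(c(12, 0, 8, 0, 1, 3, 1, 2), 4)
# One distance per pair of corresponding columns
text_dist_sketch(M1, M2)
```

Treating vectors as one-column matrices lets the same column-wise code cover both the vector and the matrix case.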
Calculate a distance matrix for a numeric matrix, using the textDist function. It is used to calculate the text distance between all combinations of the columns of the matrix M.
textDistMatrix(M, zeroes = 0.5)
M | A numeric matrix. |
zeroes | Text distance when both vectors are zero vectors; default is .5. |
The text distance matrix with the text distance between all combinations of the columns of M. This will give the same result as the function distMatrix when run with its default distance function "textDist"; however, for large matrices textDistMatrix is much more efficient. In addition, for very large matrices distMatrix may not run, while textDistMatrix will.
M=matrix(c(0,1,0,2,0,10,0,14,12,0,8,0,1,0,1,0),4)
colnames(M)=1:4;rownames(M)=c("A","B","C","D")
M
#Text distance matrix
textDistMatrix(M)
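One way a dedicated text distance matrix function could avoid calling a distance function once per pair is to compare each column against all columns in a single vectorized step. The base R sketch below shows that idea only; it is an assumption about how such a speed-up might work, not a description of the package's implementation, and the name text_dist_matrix_sketch is hypothetical.

```r
# Hypothetical sketch: one vectorized pass per column instead of one
# function call per pair of columns. Assumes non-negative frequencies.
text_dist_matrix_sketch <- function(M, zeroes = 0.5) {
  totals <- colSums(M)
  n <- ncol(M)
  D <- matrix(0, n, n, dimnames = list(colnames(M), colnames(M)))
  for (i in seq_len(n)) {
    unmatched <- colSums(abs(M - M[, i]))  # column i against every column
    total <- totals + totals[i]
    D[i, ] <- ifelse(total == 0, zeroes, unmatched / total)
  }
  D
}

M <- matrix(c(0, 1, 0, 2, 0, 10, 0, 14, 12, 0, 8, 0, 1, 0, 1, 0), 4)
colnames(M) <- 1:4
rownames(M) <- c("A", "B", "C", "D")
D <- text_dist_matrix_sketch(M)
D
```

In this sketch an all-zero column would receive the zeroes value on the diagonal as well; whether the package does the same is not stated in the text.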