Package 'lingdist'

Title: Fast Linguistic Distance and Alignment Computation
Description: Fast generalized edit distance and string alignment computation, mainly for linguistic purposes. As a generalization of the classic edit distance algorithms, the package lets users define a custom cost for the insertion, deletion, and substitution of every symbol. It also allows character combinations of any length to be treated as a single symbol, which is very useful for International Phonetic Alphabet (IPA) transcriptions with diacritics. In addition to the edit distance itself, users can obtain detailed alignment information, such as all possible alignment scenarios between two strings, which is useful for testing, illustration, or further processing. The result can be obtained either as a distance matrix or in long table form, and tools for such conversions are provided. All functions in the package are implemented in 'C++', and the distance matrix computation is parallelized using the 'RcppThread' package.
Authors: Chao Kong [aut, cre]
Maintainer: Chao Kong <[email protected]>
License: GPL (>= 2)
Version: 1.0
Built: 2025-03-07 04:03:31 UTC
Source: https://github.com/fncokg/lingdist

Check whether any characters are missing from the cost matrix

Description

Check whether any characters used in the data are missing from the cost matrix, and return the missing characters.

Usage

check_cost_defined(data, cost_mat, delim = "")

Arguments

data

DataFrame containing the strings whose symbols are to be checked.

cost_mat

Cost matrix to be checked.

delim

The delimiter separating atomic symbols.

Value

A string vector containing the missing characters; an empty vector indicates that no characters are missing.

Examples

df <- as.data.frame(rbind(a=c("a_bc_d","d_bc_a"),b=c("b_bc_d","d_bc_a")))
cost.mat <- data.frame()
chars.not.found <- check_cost_defined(df, cost.mat, "_")
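
The idea behind this check can be sketched in base R. The helper `find_missing_symbols` below is hypothetical and is not part of the package. It assumes that a symbol counts as defined only when it appears in both the rownames and the colnames of the cost matrix, which the documentation above does not state explicitly.

```r
# Hypothetical helper (not part of lingdist): list the symbols used in `data`
# that are not defined in `cost_mat`.
find_missing_symbols <- function(data, cost_mat, delim = "") {
  # Flatten all cells to one character vector and drop NA / empty cells
  cells <- unlist(lapply(data, as.character), use.names = FALSE)
  cells <- cells[!is.na(cells) & nzchar(cells)]
  # fixed = TRUE treats the delimiter literally; "" splits into characters
  symbols <- unique(unlist(strsplit(cells, delim, fixed = TRUE)))
  # Assumption: a symbol is "defined" if it has both a row and a column
  defined <- intersect(rownames(cost_mat), colnames(cost_mat))
  setdiff(symbols, defined)
}

df <- as.data.frame(rbind(a = c("a_bc_d", "d_bc_a"), b = c("b_bc_d", "d_bc_a")))
# A toy cost matrix that only knows the symbols "a" and "d"
cost <- matrix(c(0, 1, 1, 0), nrow = 2,
               dimnames = list(c("a", "d"), c("a", "d")))
missing_syms <- find_missing_symbols(df, cost, "_")
print(missing_syms)
```

Here the symbols "bc" and "b" occur in the data but not in the toy cost matrix, so they are the ones reported.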

Compute edit distance between all row pairs of a dataframe

Description

Compute the average edit distance between all row pairs of a dataframe; empty or NA cells are ignored. If none of the values in a row is a valid string, all average distances involving that row are set to -1.

Usage

edit_dist_df(
  data,
  cost_mat = NULL,
  delim = "",
  squareform = FALSE,
  symmetric = TRUE,
  parallel = FALSE,
  n_threads = 2L
)

Arguments

data

DataFrame with n rows and m columns, where the n rows are the languages or dialects involved in the calculation and the m columns are the words (at most m per language) on which the comparison is based. The rownames are the language ids.

cost_mat

Dataframe in squareform indicating the cost values when one symbol is deleted, inserted or substituted by another. Rownames and colnames are symbols. 'cost_mat[char1,"_NULL_"]' indicates the cost value of deleting char1 and 'cost_mat["_NULL_",char1]' is the cost value of inserting it. When an operation is not defined in the cost_mat, its cost is set to 0 when the two symbols are the same, and to 1 otherwise.

delim

The delimiter separating atomic symbols.

squareform

Whether to return a dataframe in squareform.

symmetric

Whether the result matrix is symmetric. This depends on whether the 'cost_mat' is symmetric.

parallel

Whether to parallelize the computation.

n_threads

The number of threads used to parallelize the computation. Only meaningful if 'parallel' is TRUE.

Value

A dataframe in long table form if 'squareform' is FALSE, otherwise in squareform. If 'symmetric' is TRUE, the long table form has $C_n^2 = n(n-1)/2$ rows; otherwise it has $n^2$ rows.

Examples

df <- as.data.frame(rbind(a=c("a_bc_d","d_bc_a"),b=c("b_bc_d","d_bc_a")))
cost.mat <- data.frame()
result <- edit_dist_df(df, cost_mat=cost.mat, delim="_")
result <- edit_dist_df(df, cost_mat=cost.mat, delim="_", squareform=TRUE)
result <- edit_dist_df(df, cost_mat=cost.mat, delim="_", parallel=TRUE, n_threads=4)
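
The averaging rule and the size of the long table can be illustrated with a small base-R sketch. This is not the package's implementation: it uses `utils::adist()` (unit-cost Levenshtein distance on single characters) instead of a custom cost matrix, and it assumes that a word pair is skipped whenever either cell is NA or empty. The helper `avg_pair_dist` and the column names `lab1`, `lab2`, `dist` are hypothetical.

```r
# Hypothetical base-R sketch, not the lingdist implementation.
# Average distance between two rows of words; -1 if no word pair is usable.
avg_pair_dist <- function(x, y) {
  # Assumption: a pair is ignored if either cell is NA or empty
  ok <- !is.na(x) & !is.na(y) & nzchar(x) & nzchar(y)
  if (!any(ok)) return(-1)
  d <- mapply(function(s, t) as.numeric(utils::adist(s, t)), x[ok], y[ok])
  mean(d)
}

words <- data.frame(
  w1 = c("leaf", "loaf", NA),
  w2 = c("tree", "", NA),
  row.names = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Symmetric case: one row per unordered pair, i.e. choose(n, 2) rows
pairs <- t(combn(rownames(words), 2))
long <- data.frame(
  lab1 = pairs[, 1],
  lab2 = pairs[, 2],
  dist = apply(pairs, 1, function(p) {
    avg_pair_dist(unlist(words[p[1], ]), unlist(words[p[2], ]))
  })
)
print(long)
```

With three rows the symmetric long table has `choose(3, 2)` = 3 rows, and both pairs involving row "C", which holds no valid string, receive -1.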

Compute edit distance between two strings

Description

Compute the edit distance between two strings and, optionally, get all possible alignment scenarios. A custom cost matrix and symbols separated by a custom delimiter are supported.

Usage

edit_dist_string(
  str1,
  str2,
  cost_mat = NULL,
  delim = "",
  return_alignments = FALSE
)

Arguments

str1

First string to be compared.

str2

Second string to be compared.

cost_mat

Dataframe in squareform indicating the cost values when one symbol is deleted, inserted or substituted by another. Rownames and colnames are symbols. 'cost_mat[char1,"_NULL_"]' indicates the cost value of deleting char1 and 'cost_mat["_NULL_",char1]' is the cost value of inserting it. When an operation is not defined in the cost_mat, its cost is set to 0 when the two symbols are the same, and to 1 otherwise.

delim

The delimiter in 'str1' and 'str2' separating atomic symbols.

return_alignments

Whether to return alignment details.

Value

A list containing a 'distance' element that stores the distance result. If 'return_alignments' is TRUE, an 'alignments' element is also present; it is a list of dataframes, each storing one possible best alignment scenario.

Examples

cost.mat <- data.frame()
dist <- edit_dist_string("leaf","leaves")$distance
dist <- edit_dist_string("ph_l_i_z","p_l_i_s",cost_mat=cost.mat,delim="_")$distance
alignments <- edit_dist_string("ph_l_i_z","p_l_i_s",delim="_",return_alignments=TRUE)$alignments
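
The cost rules described for 'cost_mat' can be made concrete with a short dynamic-programming sketch in base R. It is only an illustration of a generalized edit distance under those rules, not the package's C++ code, and it returns the distance only (no alignments). The names `lookup_cost` and `weighted_edit_distance` are hypothetical.

```r
# Hypothetical base-R sketch; lingdist itself is implemented in C++.
NULL_SYM <- "_NULL_"

# Cost of turning symbol `from` into symbol `to`. Undefined operations
# fall back to 0 for identical symbols and 1 otherwise.
lookup_cost <- function(cost_mat, from, to) {
  if (!is.null(cost_mat) &&
      from %in% rownames(cost_mat) && to %in% colnames(cost_mat)) {
    return(cost_mat[from, to])
  }
  if (from == to) 0 else 1
}

weighted_edit_distance <- function(str1, str2, cost_mat = NULL, delim = "") {
  a <- strsplit(str1, delim, fixed = TRUE)[[1]]
  b <- strsplit(str2, delim, fixed = TRUE)[[1]]
  n <- length(a)
  m <- length(b)
  # d[i + 1, j + 1] = cheapest way to turn a[1..i] into b[1..j]
  d <- matrix(0, n + 1, m + 1)
  for (i in seq_len(n)) d[i + 1, 1] <- d[i, 1] + lookup_cost(cost_mat, a[i], NULL_SYM)
  for (j in seq_len(m)) d[1, j + 1] <- d[1, j] + lookup_cost(cost_mat, NULL_SYM, b[j])
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + lookup_cost(cost_mat, a[i], NULL_SYM),  # delete a[i]
        d[i + 1, j] + lookup_cost(cost_mat, NULL_SYM, b[j]),  # insert b[j]
        d[i, j] + lookup_cost(cost_mat, a[i], b[j])           # substitute or match
      )
    }
  }
  d[n + 1, m + 1]
}

# Default costs: every insertion, deletion and substitution costs 1
d_default <- weighted_edit_distance("ph_l_i_z", "p_l_i_s", delim = "_")

# Custom costs: make ph -> p and z -> s cheap substitutions
cost <- matrix(c(0.5, 1, 1, 0.5), nrow = 2,
               dimnames = list(c("ph", "z"), c("p", "s")))
d_custom <- weighted_edit_distance("ph_l_i_z", "p_l_i_s", cost, delim = "_")
print(c(default = d_default, custom = d_custom))
```

Because "ph" is treated as one symbol, the two strings differ by two substitutions, so lowering the cost of those substitutions lowers the distance accordingly.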

Generate a default cost matrix

Description

Generate a default cost matrix containing all characters that occur in the raw data, with all diagonal values set to 0 and all others set to 1. This saves you from constructing the matrix from scratch.

Usage

generate_default_cost_matrix(data, delim = "")

Arguments

data

DataFrame containing the strings from which the symbols are collected.

delim

The delimiter separating atomic symbols.

Value

A cost matrix containing all characters that occur in the raw data, with all diagonal values set to 0 and all others set to 1.

Examples

df <- as.data.frame(rbind(a=c("a_bc_d","d_bc_a"),b=c("b_bc_d","d_bc_a")))
default.cost <- generate_default_cost_matrix(df, "_")
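
A default matrix is typically a starting point for adjusting individual costs. The base-R sketch below builds a matrix of the same shape by hand and then changes one entry. It is an illustration only: the symbol list is typed in manually, and including a "_NULL_" row and column is an assumption, since the documentation above does not say whether the generated matrix contains it.

```r
# Hypothetical base-R sketch, not the lingdist implementation.
symbols <- c("a", "b", "bc", "d", "_NULL_")  # "_NULL_" here is an assumption

# 0 on the diagonal (same symbol), 1 everywhere else
m <- matrix(1, nrow = length(symbols), ncol = length(symbols),
            dimnames = list(symbols, symbols))
diag(m) <- 0

# check.names = FALSE keeps column names such as "_NULL_" unchanged
cost <- data.frame(m, check.names = FALSE)

# Make substituting "b" by "bc" cheaper than the default
cost["b", "bc"] <- 0.5
print(cost)
```

After this change the matrix is no longer symmetric, because `cost["bc", "b"]` is still 1. That is the situation in which 'symmetric = FALSE' becomes relevant for the distance functions.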

Convert long table to square form

Description

Convert a distance dataframe in long table form to a square matrix form.

Usage

long2squareform(data, symmetric = TRUE)

Arguments

data

Dataframe in long table form. The first and second columns are labels and the third column stores the distance values.

symmetric

Whether the distance matrix is symmetric (if the cost matrix is not symmetric, the distance matrix is not symmetric either).

Value

Dataframe in square matrix form; rownames and colnames are labels. If the long table only contains $C_n^2$ rows and 'symmetric' is set to FALSE, then only the lower triangle positions of the result are filled.

Examples

data <- as.data.frame(list(chars1=c("a","a","b"),chars2=c("b","c","c"),dist=c(1,2,3)))
mat <- long2squareform(data)
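
The conversion itself can be sketched in base R as follows. The function `long_to_square` is hypothetical and only approximates the behaviour described above: it returns a plain matrix rather than a dataframe, and setting the diagonal to 0 in the symmetric case is an assumption.

```r
# Hypothetical base-R sketch, not the lingdist implementation.
long_to_square <- function(long, symmetric = TRUE) {
  lab1 <- as.character(long[[1]])
  lab2 <- as.character(long[[2]])
  labels <- unique(c(lab1, lab2))
  sq <- matrix(NA_real_, length(labels), length(labels),
               dimnames = list(labels, labels))
  # Row = second label, column = first label: this fills the lower triangle
  # when each pair is listed with the earlier label first
  sq[cbind(lab2, lab1)] <- long[[3]]
  if (symmetric) {
    sq[cbind(lab1, lab2)] <- long[[3]]  # mirror into the upper triangle
    diag(sq) <- 0                       # assumption: self-distance is 0
  }
  sq
}

long_tbl <- data.frame(chars1 = c("a", "a", "b"),
                       chars2 = c("b", "c", "c"),
                       dist = c(1, 2, 3))
sq_sym <- long_to_square(long_tbl)
sq_low <- long_to_square(long_tbl, symmetric = FALSE)
print(sq_sym)
print(sq_low)
```

With 'symmetric = FALSE' the three input rows only populate the cells below the diagonal, and the remaining cells stay NA in this sketch.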