TextParser is an Elixir library for extracting and validating structured tokens from text, such as URLs, hashtags, and mentions. It provides built-in token types and allows you to define custom parsers with specific validation rules.
This library was extracted from justcrosspost.app, where processing tags, mentions, and URLs for Bluesky can be a bit tricky.
Add `text_parser` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:text_parser, "~> 0.1"}
  ]
end
```

By default, TextParser extracts URLs, hashtags, and mentions:
text = "Check out https://elixir-lang.org #elixir @joe"
result = TextParser.parse(text)
# Get URLs
urls = TextParser.get(result, TextParser.Tokens.URL)
# => [%TextParser.Tokens.URL{value: "https://elixir-lang.org", position: {10, 32}}]
# Get hashtags
tags = TextParser.get(result, TextParser.Tokens.Tag)
# => [%TextParser.Tokens.Tag{value: "#elixir", position: {33, 40}}]
# Get mentions
mentions = TextParser.get(result, TextParser.Tokens.Mention)
# => [%TextParser.Tokens.Mention{value: "@joe", position: {41, 45}}]You can specify which token types to extract:
```elixir
alias TextParser.Tokens.{URL, Tag}

# Extract only URLs and hashtags
result = TextParser.parse("Check out https://elixir-lang.org #elixir @joe", extract: [URL, Tag])

TextParser.get(result, URL) # => [%URL{...}]
TextParser.get(result, Tag) # => [%Tag{...}]
```

You can create custom token types to extract different patterns from text. Here's an example of a token that extracts ISO 8601 dates:
```elixir
defmodule MyParser.Tokens.Date do
  use TextParser.Token,
    # Match YYYY-MM-DD format, requiring a space or start of string before the date
    pattern: ~r/(?:^|\s)(\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01]))/,
    trim_chars: [",", ".", "!", "?"]

  @impl true
  def is_valid?(date_text) when is_binary(date_text) do
    case Date.from_iso8601(date_text) do
      {:ok, _date} -> true
      _ -> false
    end
  end

  def is_valid?(_), do: false
end
```
```elixir
# Usage
text = "Meeting on 2024-01-15, conference on 2024-02-30, party on 2024-12-31!"

result = TextParser.parse(text, extract: [MyParser.Tokens.Date])

dates = TextParser.get(result, MyParser.Tokens.Date)
# => [
#      %MyParser.Tokens.Date{value: "2024-01-15", position: {11, 21}},
#      %MyParser.Tokens.Date{value: "2024-12-31", position: {58, 68}}
#    ]
# Note: 2024-02-30 is filtered out because it is not a valid date
```

Custom tokens require:
- A regex `pattern` that captures the token in the first capture group
- An optional `trim_chars` list to remove trailing punctuation
- An `is_valid?/1` function that validates the extracted value
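
For example, here's a minimal sketch of a second token following the same checklist; the `MyParser.Tokens.Ticket` module and its `ABC-123`-style pattern are purely hypothetical, but the `use TextParser.Token` options and `is_valid?/1` callback mirror the `Date` example above:

```elixir
defmodule MyParser.Tokens.Ticket do
  use TextParser.Token,
    # First capture group holds the token itself (hypothetical ABC-123 style IDs)
    pattern: ~r/(?:^|\s)(ABC-\d+)/,
    # Strip trailing punctuation from the match
    trim_chars: [",", ".", "!", "?"]

  @impl true
  def is_valid?(ticket_text) when is_binary(ticket_text),
    do: Regex.match?(~r/^ABC-\d+$/, ticket_text)

  def is_valid?(_), do: false
end
```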
You can mix custom tokens with built-in ones:
```elixir
alias TextParser.Tokens.{URL, Tag}
alias MyParser.Tokens.Date

result = TextParser.parse(
  "Meeting on 2024-01-15 at https://example.com #elixir",
  extract: [URL, Tag, Date]
)

TextParser.get(result, Date) # => [%Date{value: "2024-01-15", position: {11, 21}}]
TextParser.get(result, URL)  # => [%URL{value: "https://example.com", position: {25, 44}}]
TextParser.get(result, Tag)  # => [%Tag{value: "#elixir", position: {45, 52}}]
```

You can create custom parsers with specific validation rules:
```elixir
defmodule BlueskyParser do
  use TextParser

  alias TextParser.Tokens.Tag

  # Enforce Bluesky's 64-character limit for hashtags
  # (the value includes the leading "#", so values of 66+ characters are rejected)
  def validate(%Tag{value: value} = tag) do
    if String.length(value) >= 66,
      do: {:error, "tag too long"},
      else: {:ok, tag}
  end

  # Allow other token types to pass through
  def validate(token), do: {:ok, token}
end
```

```elixir
# Usage
text = "Check out #elixir and #this_is_a_very_long_hashtag_that_definitely_exceeds_the_sixty_four_character_bluesky_limit"

result = BlueskyParser.parse(text)

# Only valid tags are included
TextParser.get(result, TextParser.Tokens.Tag)
# => [%Tag{value: "#elixir", position: {10, 17}}]
```

Each token includes its value and position in the original text:
text = "Check https://elixir-lang.org #elixir"
result = TextParser.parse(text)
[url] = TextParser.get(result, TextParser.Tokens.URL)
url.value # => "https://elixir-lang.org"
url.position # => {6, 29} (start and end byte positions)position tuple uses a range format where:
- The start position is inclusive (the first byte of the token)
- The end position is exclusive (one byte past the last byte of the token)
For example, with `position: {6, 29}`, the token starts at byte 6 and ends at byte 28. This format makes it easy to:

- Calculate the token length: `end - start` (29 - 6 = 23 bytes)
- Extract the token from the text: `binary_part(text, start, end - start)`
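
Here's a small sketch putting both operations together, reusing the values from the snippet above (since `end` is a reserved word in Elixir, the tuple is unpacked into `start` and `stop`):

```elixir
text = "Check https://elixir-lang.org #elixir"
result = TextParser.parse(text)
[url] = TextParser.get(result, TextParser.Tokens.URL)

{start, stop} = url.position

stop - start
# => 23 (token length in bytes)

binary_part(text, start, stop - start)
# => "https://elixir-lang.org"
```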
The full documentation can be found at https://hexdocs.pm/text_parser.