willf/segment: A tool to segment text based on frequencies and the Viterbi algorithm. Example: "#TheBoyWhoLived" => ['#', 'The', 'Boy', 'Who', 'Lived']

This module segments text according to word frequency, using the Viterbi algorithm to choose the most probable split. The approach is probably due to Peter Norvig.
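To make the idea concrete, here is a minimal sketch of frequency-based Viterbi segmentation. It is not this module's code: the `TOY_COUNTS` table, the `word_logprob` and `segment` helpers, and the length-based penalty for unseen words are all assumptions chosen for illustration. The real module draws its counts from the bundled corpora.

```python
import math

# Toy unigram counts, invented for this illustration only.
TOY_COUNTS = {"the": 500, "boy": 50, "who": 120, "lived": 20, "live": 40, "d": 5}
TOTAL = sum(TOY_COUNTS.values())


def word_logprob(word):
    """Log probability of a candidate word under the toy unigram model."""
    count = TOY_COUNTS.get(word.lower())
    if count is not None:
        return math.log(count / TOTAL)
    # Assumed smoothing: unseen strings get a probability that shrinks
    # tenfold with every extra character, so long unknown chunks are unlikely.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def segment(text, max_word_len=20):
    """Split text into the word sequence with the highest total log probability."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word in that split
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start] + word_logprob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


if __name__ == "__main__":
    print(segment("TheBoyWhoLived"))  # ['The', 'Boy', 'Who', 'Lived']
```

The dynamic program keeps only the best-scoring split for each prefix, so the cost is proportional to the text length times `max_word_len` rather than to the number of possible splits. Swapping in a different count table changes which split wins, which is why the choice of corpus matters.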

Three sources of frequency information are provided.

One is from the Google NGram corpus, a general web corpus.

The second is from the Rovereto Twitter N-Gram Corpus, which is better for some Twitter data.

The third is from a web crawl dataset of anchor text provided by Vinay Goel of the Internet Archive.

> from segment.segmenter import Analyzer
> e = Analyzer('en')
> e.segment("AbeLincoln")
['Abe', 'Lincoln']
> e.segment("BieberHeartsBeliebers")
['Bi', 'e', 'ber', 'Hearts', 'Be', 'lieber', 's']
> t = Analyzer('twitter')
> t.segment("BieberHeartsBeliebers")
['Bieber', 'Hearts', 'Beliebers']
> t = Analyzer('anchor')
> t.segment("wordpress&sex")
['wordpress', '&', 'sex']

Latest commit: willf, "Add anchor text corpus", 0e41d48, Apr 23, 2016 (9 commits)

Name	Last commit message	Last commit date
segment	Add anchor text corpus	Apr 23, 2016
.gitignore	initial commit	Apr 20, 2015
README.md	Add anchor text corpus	Apr 23, 2016
requirements.txt	initial commit	Apr 20, 2015
setup.py	add datafiles to setup	Apr 20, 2015
