Python split text on sentences

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

My old regexp works badly.

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

How to split or tokenize Arabic text into sentences in Python

My question: I need to split or tokenize the Arabic text into sentences, where every sentence ends with (.), and then tokenize each sentence into words. The output is as you see below. How can I fix it? text =

Split text into sentences

I wish to split text into sentences. Can anyone help me? I also need to handle abbreviations. However my plan is to replace these at an earlier stage. Mr. -> Mister import re import unittest class

Split a text into sentences

How can I split a text into an array of sentences? Example text: Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers… End Should output: 0 => Fry me a B

regex split text document into sentences

I have a big text string and I am trying to split it into the sentences based on . ? !. But my regex is not working somehow, can somebody guide me to detect the error? String str = When my friend s

Split a text into multiple sentences in bash

I would like to split text into sentences. A sentence ends with a dot followed by a whitespace character.

Improve regex to Split large text into sentences [duplicate]

Possible Duplicate: What is a regular expression for parsing out individual sentences? I want to split large text into sentences. The regex expression I got from an answer here: string[] sentences = Re

Split text into sentences, but skip quoted content

I want to split some text into sentences using regular expression (using Ruby). It does not need to be accurate, so cases such as Washington D.C. can be ignored. However I have a requirement that,

Split text into sentences in C#

I want to divide a text into sentences. A sentence ends with . (dot) or ? or !, followed by one or more whitespace characters, and the next sentence starts with an uppercase letter. For example:

RegexpTokenize Japanese sentences – python

I’m trying to split the Japanese sentences up using RegexpTokenizer but it is returning null sets. Can someone tell me why, and how to split the Japanese sentences up? #!/usr/bin/python # -*- enco

How can I split a text into sentences using the Stanford parser?

How can I split a text or paragraph into sentences using Stanford parser? Is there any method that can extract sentences, such as getSentencesFromString() as it’s provided for Ruby?

Answers

The Natural Language Toolkit (http://www.nltk.org/) has what you need. This group posting indicates this does it:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven’t tried it!)
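If the Punkt model is not installed locally, it has to be downloaded once; newer NLTK versions also expose the same tokenizer through nltk.sent_tokenize. A minimal sketch along those lines (assuming a local test.txt; the resource name may differ in very recent NLTK releases):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # fetch the pre-trained Punkt model once (no-op if already cached)

with open("test.txt") as fp:
    data = fp.read()

# sent_tokenize uses the English Punkt tokenizer under the hood
for sentence in sent_tokenize(data):
    print(sentence)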

For simple cases (where sentences are terminated normally), this should work:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The regex is *\. +, which matches a period surrounded by 0 or more spaces to the left and 1 or more to the right (to prevent something like the period in re.split being counted as a change in sentence).

Obviously, not the most robust solution, but it’ll do fine in most cases. The only case this won’t cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?)
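As a rough sketch of that suggestion (a hypothetical helper, not from the answer above): split as before, then glue a fragment back onto its predecessor whenever it does not start with an uppercase letter, which usually means the split happened at an abbreviation.

import re

def naive_split(text):
    parts = [p for p in re.split(r' *[\.\?!][\'"\)\]]* *', text) if p]
    merged = []
    for part in parts:
        # a fragment starting with a lowercase letter was probably cut off
        # at an abbreviation, so re-attach it to the previous sentence
        if merged and not part[:1].isupper():
            merged[-1] += ' ' + part
        else:
            merged.append(part)
    return merged

Like the split itself, this drops the terminators, so the abbreviation's own dot is lost in the re-joined sentence.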

@Artyom,

Hi! You could make a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('/"', ' /" ')
    result = result.replace('/'', ' /' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

and then call it in this way:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)

Good luck, Marilena.
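As a side note, the repeated replace('  ', ' ') calls only collapse a limited number of doubled spaces; calling split() with no arguments handles any run of whitespace. A more compact sketch of the same padding idea (a hypothetical variant, not part of the answer above):

import re

def russian_tokenizer(text):
    # protect ellipses so the generic rule below does not break them apart
    text = text.replace('...', ' <ellipsis> ')
    # pad every punctuation mark with spaces so it becomes its own token
    text = re.sub(r'([.,:;!?"\'()])', r' \1 ', text)
    text = text.replace('<ellipsis>', '...')
    # split() with no arguments collapses any run of whitespace
    return text.split()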

No doubt NLTK is the most suitable for the purpose. But getting started with NLTK is quite painful (though once you install it, you just reap the rewards).

So here is a simple re-based approach, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html

# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multiple characters,
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

#output:
#   This is a sentence
#   This is an excited sentence

#   And do you think this is a question 
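Note that splitting on [.!?] discards the punctuation and also fires on dots inside abbreviations. If you want each terminator to stay attached to its sentence, splitting on the whitespace after it with a lookbehind is a small improvement (a sketch, not part of the linked post):

import re

def splitKeepingTerminators(paragraph):
    # split only on whitespace that follows ., ! or ?, so the
    # punctuation stays with its sentence
    return [s for s in re.split(r'(?<=[.!?])\s+', paragraph) if s]

print(splitKeepingTerminators(
    "This is a sentence.  This is an excited sentence! And do you think this is a question?"))
# ['This is a sentence.', 'This is an excited sentence!',
#  'And do you think this is a question?']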

Here is a middle of the road approach that doesn’t rely on any external libraries. I use list comprehension to exclude overlaps between abbreviations and terminators as well as to exclude overlaps between variations on terminations, for example: '.' vs. '."'

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl’s find_all function from this entry: Find all occurrences of a substring in Python
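For reference, a quick usage sketch of the functions above (a made-up sample; output as traced by hand, not exhaustively tested):

text = "i.e. this should stay together. And this is a second sentence! Right?"
for sentence in find_sentences(text):
    print(sentence)
# i.e. this should stay together.
# And this is a second sentence!
# Right?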

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. “Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst.”

# -*- coding: utf-8 -*-
import re
caps = "([A-Z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + caps + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + caps + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
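
A short usage sketch with the example text from above; as described, the abbreviations and the website address should survive intact:

text = ("Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in "
        "Israel before joining Nike Inc. as an engineer. He also worked at "
        "craigslist.org as a business analyst.")
for sentence in split_into_sentences(text):
    print(sentence)
# Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer.
# He also worked at craigslist.org as a business analyst.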