
nicolemon's solution

to Word Count in the Python Track

Published at Jul 13 2018 · 6 comments

Given a phrase, count the occurrences of each word in that phrase.

For the purposes of this exercise you can expect that a word will always be one of:

  1. A number composed of one or more ASCII digits (e.g. "0" or "1234") OR
  2. A simple word composed of one or more ASCII letters (e.g. "a" or "they") OR
  3. A contraction of two simple words joined by a single apostrophe (e.g. "it's" or "they're")

When counting words you can assume the following rules:

  1. The count is case insensitive (i.e. "You", "you", and "YOU" are 3 uses of the same word)
  2. The count is unordered; the tests will ignore how words and counts are ordered
  3. Other than the apostrophe in a contraction, all forms of punctuation are ignored
  4. The words can be separated by any form of whitespace (e.g. "\t", "\n", " ")

For example, for the phrase

"That's the password: 'PASSWORD 123'!", cried the Special Agent.\nSo I fled.

the count would be:

that's: 1
the: 2
password: 2
123: 1
cried: 1
special: 1
agent: 1
so: 1
i: 1
fled: 1
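
The three word forms listed above can be written down as regular expressions. The following is only a sketch of one way to read those rules; the `classify` helper and the pattern names are hypothetical and not part of the exercise.

```python
import re

# Hypothetical patterns, one per word form described in the instructions.
NUMBER = re.compile(r"[0-9]+")
SIMPLE_WORD = re.compile(r"[A-Za-z]+")
CONTRACTION = re.compile(r"[A-Za-z]+'[A-Za-z]+")


def classify(token):
    """Return the kind of word a token is, or None if it is not a word."""
    if NUMBER.fullmatch(token):
        return "number"
    if SIMPLE_WORD.fullmatch(token):
        return "simple word"
    if CONTRACTION.fullmatch(token):
        return "contraction"
    return None


for token in ["1234", "they", "it's", "'large'"]:
    print(token, classify(token))
```

A quoted token such as 'large' matches none of the three forms, which is why the surrounding apostrophes have to be stripped before counting.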

Exception messages

Sometimes it is necessary to raise an exception. When you do this, you should include a meaningful error message to indicate what the source of the error is. This makes your code more readable and helps significantly with debugging. Not every exercise will require you to raise an exception, but for those that do, the tests will only pass if you include a message.

To raise a message with an exception, just write it as an argument to the exception type. For example, instead of raise Exception, you should write:

raise Exception("Meaningful message indicating the source of the error")
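
As a rough illustration of the same idea with a more specific exception type, the sketch below raises a ValueError with a message and then reads that message back. The `parse_count` helper is hypothetical and is not required by this exercise.

```python
def parse_count(value):
    """Hypothetical helper: return value if it is a non-negative int."""
    if not isinstance(value, int) or value < 0:
        # The message is passed as the argument to the exception type.
        raise ValueError("count must be a non-negative integer")
    return value


try:
    parse_count(-1)
except ValueError as error:
    # str(error) gives back the message passed to the exception.
    print(str(error))
```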

Running the tests

To run the tests, run pytest word_count_test.py

Alternatively, you can tell Python to run the pytest module: python -m pytest word_count_test.py

Common pytest options

  • -v : enable verbose output
  • -x : stop running tests on first failure
  • --ff : run the failures from the previous test run before running the other test cases

For other options, see python -m pytest -h

Submitting Exercises

When submitting an exercise, make sure the solution is in the $EXERCISM_WORKSPACE/python/word-count directory.

You can find your Exercism workspace by running exercism debug and looking for the line that starts with Workspace.

For more detailed information about running tests, code style and linting, please see Running the Tests.

Source

This is a classic toy problem, but we were reminded of it by seeing it in the Go Tour.

Submitting Incomplete Solutions

It's possible to submit an incomplete solution so you can see how others have completed the exercise.

word_count_test.py

import unittest

from word_count import count_words

# Tests adapted from `problem-specifications//canonical-data.json`


class WordCountTest(unittest.TestCase):
    def test_count_one_word(self):
        self.assertEqual(count_words("word"), {"word": 1})

    def test_count_one_of_each_word(self):
        self.assertEqual(count_words("one of each"), {"one": 1, "of": 1, "each": 1})

    def test_multiple_occurrences_of_a_word(self):
        self.assertEqual(
            count_words("one fish two fish red fish blue fish"),
            {"one": 1, "fish": 4, "two": 1, "red": 1, "blue": 1},
        )

    def test_handles_cramped_lists(self):
        self.assertEqual(count_words("one,two,three"), {"one": 1, "two": 1, "three": 1})

    def test_handles_expanded_lists(self):
        self.assertEqual(
            count_words("one,\ntwo,\nthree"), {"one": 1, "two": 1, "three": 1}
        )

    def test_ignore_punctuation(self):
        self.assertEqual(
            count_words("car: carpet as java: javascript!!&@$%^&"),
            {"car": 1, "carpet": 1, "as": 1, "java": 1, "javascript": 1},
        )

    def test_include_numbers(self):
        self.assertEqual(
            count_words("testing, 1, 2 testing"), {"testing": 2, "1": 1, "2": 1}
        )

    def test_normalize_case(self):
        self.assertEqual(count_words("go Go GO Stop stop"), {"go": 3, "stop": 2})

    def test_with_apostrophes(self):
        self.assertEqual(
            count_words("First: don't laugh. Then: don't cry."),
            {"first": 1, "don't": 2, "laugh": 1, "then": 1, "cry": 1},
        )

    def test_with_quotations(self):
        self.assertEqual(
            count_words("Joe can't tell between 'large' and large."),
            {"joe": 1, "can't": 1, "tell": 1, "between": 1, "large": 2, "and": 1},
        )

    def test_substrings_from_the_beginning(self):
        self.assertEqual(
            count_words("Joe can't tell between app, apple and a."),
            {
                "joe": 1,
                "can't": 1,
                "tell": 1,
                "between": 1,
                "app": 1,
                "apple": 1,
                "and": 1,
                "a": 1,
            },
        )

    def test_multiple_spaces_not_detected_as_a_word(self):
        self.assertEqual(
            count_words(" multiple   whitespaces"), {"multiple": 1, "whitespaces": 1}
        )

    def test_alternating_word_separators_not_detected_as_a_word(self):
        self.assertEqual(
            count_words(",\n,one,\n ,two \n 'three'"), {"one": 1, "two": 1, "three": 1}
        )

    # Additional tests for this track

    def test_tabs(self):
        self.assertEqual(
            count_words(
                "rah rah ah ah ah\troma roma ma\tga ga oh la la\twant your bad romance"
            ),
            {
                "rah": 2,
                "ah": 3,
                "roma": 2,
                "ma": 1,
                "ga": 2,
                "oh": 1,
                "la": 2,
                "want": 1,
                "your": 1,
                "bad": 1,
                "romance": 1,
            },
        )

    def test_non_alphanumeric(self):
        self.assertEqual(
            count_words("hey,my_spacebar_is_broken"),
            {"hey": 1, "my": 1, "spacebar": 1, "is": 1, "broken": 1},
        )

    def test_multiple_apostrophes_ignored(self):
        self.assertEqual(count_words("''hey''"), {"hey": 1})


if __name__ == "__main__":
    unittest.main()

word_count.py

import string

IGNORE = string.punctuation + string.whitespace


def split_phrase(phrase):  # I AM NOT PROUD OF THIS
    """Recursively split phrase by delimiters."""
    if ',' in phrase:
        words = [word for word in phrase.split(',')]
        return split_phrase(' '.join(words))

    if '_' in phrase:
        words = [word for word in phrase.split('_')]
        return split_phrase(' '.join(words))

    words = [word.strip(IGNORE) for word in phrase.split() if
             len(word.strip(IGNORE)) > 0]

    return words


def count_words(phrase):
    """Count the occurrences of each word in phrase."""
    words = split_phrase(phrase)

    word_dictionary = {}
    for item in words:
        normalized = item.lower()
        if normalized not in word_dictionary:
            word_dictionary[normalized] = 1
        else:
            word_dictionary[normalized] = word_dictionary[normalized] + 1

    return word_dictionary

Community comments

nicolemon:

Is there a better, less hacky way to split the words by multiple possible delimiters?

mrainne:

I used the string methods translate and maketrans for cleaning the phrase, but I wonder if there is a better way to do it.

eowsek:

I don't think it's less hacky, but the way I did it was to use maketrans to replace all punctuation with spaces instead of just getting rid of it. That solves the _ problem and generalises if the _ test were changed to use some other form of punctuation.

import string

def word_count(phrase):
    phrase = phrase.lower()
    phraselist = phrase.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    second = phraselist.split()
    final = {}
    for i in range(len(second)):
        final[second[i]] = second.count(second[i])
    return final
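
Note that string.punctuation includes the apostrophe, so the snippet above would split a contraction such as "don't" into two tokens. A minimal sketch of a variant that keeps apostrophes inside words follows. The name `count_words_translate` is hypothetical, and this is one possible adjustment, not eowsek's code.

```python
import string

# Every punctuation character except the apostrophe maps to a space.
_SEPARATORS = string.punctuation.replace("'", "")
_TABLE = str.maketrans(_SEPARATORS, " " * len(_SEPARATORS))


def count_words_translate(phrase):
    """Hypothetical variant: count words, keeping contractions intact."""
    counts = {}
    for token in phrase.lower().translate(_TABLE).split():
        word = token.strip("'")  # drop quoting apostrophes, as in 'large'
        if word:  # skip tokens that were only apostrophes
            counts[word] = counts.get(word, 0) + 1
    return counts


print(count_words_translate("First: don't laugh. Then: don't cry."))
```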

edbrook:

@nicolemon commented:

Is there a better, less hacky way to split the words by multiple possible delimiters?

Take a look at re.split() - https://docs.python.org/3/library/re.html#re.split

A quick example:

>>> import re
>>> s = 'test,12@45'
>>> re.split('[,@]', s)
['test', '12', '45']
>>> re.split('[^a-z0-9]', s)
['test', '12', '45']
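
Applied to this exercise, the re.split() idea might look like the sketch below. It assumes that anything other than a lowercase letter, a digit, or an apostrophe counts as a separator; the name `count_words_split` is hypothetical.

```python
import re


def count_words_split(phrase):
    """Hypothetical sketch: split on runs of non-word characters."""
    counts = {}
    for token in re.split(r"[^a-z0-9']+", phrase.lower()):
        word = token.strip("'")  # remove quoting apostrophes
        if word:  # re.split can yield empty strings at the edges
            counts[word] = counts.get(word, 0) + 1
    return counts


print(count_words_split(",\n,one,\n ,two \n 'three'"))
```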

DUznanski:

You could also use re.findall instead, and make the regex say what you do want it to match.
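
A minimal sketch of that suggestion, assuming a word is a run of letters or digits with at most one inner apostrophe. The `WORD` pattern and the name `count_words_findall` are hypothetical, and collections.Counter is swapped in for the manual counting loop.

```python
import re
from collections import Counter

# Letters or digits, optionally followed by one apostrophe and more of them.
WORD = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")


def count_words_findall(phrase):
    """Hypothetical sketch: describe the words to keep, then count them."""
    return dict(Counter(WORD.findall(phrase.lower())))


print(count_words_findall("Joe can't tell between 'large' and large."))
```

Because the pattern only matches what a word looks like, separators such as commas, underscores, and surrounding quotes never need to be listed.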

nicolemon:

This is exactly what I was looking for. I should read documentation more slowly. Appreciate it!

What can you learn from this solution?

A huge amount can be learned from reading other people's code. This is why we wanted to give Exercism users the option of making their solutions public.

Here are some questions to help you reflect on this solution and learn the most from it.

  • What compromises have been made?
  • Are there new concepts here that you could read more about to improve your understanding?