This was originally a post made in r/python.

So a bit of history

Recently the thought of learning to program went from something I only wished I was capable in order to automate my life, into becoming a very remarkable and enjoyable part of my every day life. For maybe the past 2 years, mostly speeding up in the last year it’s become one of the most remarkable learning experiences.

I can definitely say though, Python was the reason for my breakthrough.

Python was the language that taught me what it means to write good, maintainable, and beautiful code. I’ve fiddled a little bit with bash and shell programs, as well as C, JS and Matlab on and off for a while.

But what really accelerated my journey was the amazing community that was very active and passionate about teaching and refining each other as fellow programmers, hackers no matter what background. For someone like me whose learning model isn’t very compatible with a spoon-fed style lectures online, or drilling codeacademy programs that were instructional but still quite unremarkable, it was really the perfect place for me to grow and thrive.

I learned by building things, failing, trying new things, wading through blogs and technical articles. I slowly started to understand them, and eventually what came to fruition was finding some quite remarkable people who through their words taught me many amazing things.

So finally, just wanted to say thank everyone in this remarkable community!

I’ve finally started to putting in my resume, and a remarkable thing is I’ve heard back from some. I’m pretty excited to say I’ve just submitted one of my first serious programming challenges and probably won’t hear back till Monday. But super anxious since being fairly new to showing my code to others, and want to get some feedback on how I can improve and what steps I can take to be competitive people with more conventional education.

Won’t say what company it is I’m submitting this for since I don’t know if that’s okay to disclose. But since my code is submitted, just thought it would be alright to share it here!


So Heres the full code and the prompt!

Mostly I’m writing a spellchecker, it’s not based on the frequency of words in english, I just need to check membership which makes life much easier and is different from the famous norvig spellchecker.

Data Structures

I just use a hash table based data structure like a set, so we’re already O(1) for membership test where n is the size of the hash.

I could have easily just used an actual dictionary, and honestly it’s probably going to be just as fast, or even faster. However, I chose a set over a dict because of the set mathematical operators that have a feeling of elegance much like they do in mathematics.
class Dictionary(object):
    """
    Dictionary class used as a delegate for the expensive IO operation
    of generating the dictionary.

    Returns a mutible `set` container type and so membership testing
    against it is an O(1) operation in terms of the amount of values stored in
    the hash table, and a O(n) operation in terms of the size hash function.
    """

    def __init__(self, filepath):
        self.filepath = filepath

    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__, self.filepath)

    def __getattr__(self, attr):
        self.__dict__[attr] = getattr(self.words, attr)
        return self.__dict__[attr]

    @property
    def words(self):
        with open(self.filepath, 'rt') as corpus:
            raw_text = corpus.read()
        return set(raw_text.splitlines())

The important part here is the ``@property`` decorator means that IO
is computed ``lazily`` and only performed once. Otherwise we would
be creating the entire corpus of words every time the method is
called without the decorator.

    **Edit** *Thanks to /u/bnorick pointing out that this isn't
    actually true!* *Interesting Discussion in the comments
    regarding the actual purpose of the ``@property`` decorator!*

I know I could just have used a function.

I can be honest and say I don’t normally do this, but here I built a class mostly for show ;)

SpellChecker

Then now I need to build the actual spellchecker.

My train of thought here is.

  1. Membership test against the dictionary.
  2. If word is in the set, return the word
  3. If word is not in the set, calculate all the possible permutations of edits within a certain criteria.

Eventually what I chose was a really naive implementation, but using recursion to show that I understand how to work with recursive programs.

class Correct(object):
    """
    Tries to correct for an misspelled word by accounting for the
    types of corrections.

    Accounts for.
        * Case errors
        * Repeat letters
        * Incorrect vowels
    """

    def __init__(self, text):
        self.text = text.lower()

    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__, self.word)

    def edit_repeats(self):
        """
        Reduces any sequentially repeating words and their permutations.
        """
        # a == b offset by 1
        a = self.text
        b = a[1:] + a[0]

        # ord(x) - ord(y) == 0 for repeats
        c = Counter(x for (x,y) in zip(a,b) if not ord(x) - ord(y))

        reductions = set()
        for word, rep in c.items():

            # expand and contract string each time remove one repeat
            for trampoline in range(rep):
                expand = list(a)
                expand.remove(word)
                a = ''.join(expand)
                reductions.add(a)

        return reductions

    def edit_vowels(self):
        """
        Generates all vowel substitutions.

        Algorithm
        =========
        - Find all vowel in initial text
        - Generate every permutation of vowel substitutions
        - Create a translation table for all possible vowel sets.
        = Map translation tables back to initial text
        """
        t, v   = self.text, 'aeiou'
        edits  = [w for (n,w) in enumerate(t) if w in v]
        perms  = permutations(v, len(edits))
        maps   = [maketrans(''.join(edits), ''.join(perm)) for perm in perms]
        trans  = [t.translate(m) for m in maps]
        return set(trans)

    def edits(self):
        """
        Recursively finds all edits possible within set parameters.
        """
        edit, repeats = self.edit_vowels(), self.edit_repeats()
        if repeats:
            for repeating in repeats:
                edit |= Correct(repeating).edit_vowels()
        return edit

Edit Distance

Now that I have an object that will build all the possible corrections possible, I need to choose the best one.

Meaning I need to calculate the minimum leveinstein distance to find the shortest edit distance.

I originally tried doing hamming, but it got me weird results. My implementation is recursive so it’s not very efficient, out of the box. But that’s okay, because I can memoize using a cache either through the @lru_cache() decorator in Python 3, but since I wanted to support both Python 2 as well, I built my own memoization decorator.

It also runs on PyPy! And the tracing JIT is pretty good optimizing for these slow regions if you really need the speed

Also decorators are awesome! They’re very versatile, and plug and play. It took me a while to really understand closures, but once you get to learn them it’s a great feeling. Unfortunately I don’t find many places where I use decorators, so it was a real pleasure using this one.
def memoize(func):
    """
    Memoizing decorator for functions, methods, or classes, exposing
    the cache publicly.
    Currently only works on positional arguments.
    """
    cache = func.cache = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if args not in cache:
            cache[args] = func(*args, **kwargs)
        return cache[args]

    return wrapper

@memoize
def leveinstein_distance(s, t):
    """
    Memoized Recursive Implementation of Leveinstein Distance
    For Calculating Edit Distances
    """
    if not s: return len(t)
    if not t: return len(s)

    if s[0] == t[0]:
        return leveinstein_distance(s[1:], t[1:])

    splt = leveinstein_distance(s,     t[1:]) # Splits
    dels = leveinstein_distance(s[1:], t    ) # Deletes
    tras = leveinstein_distance(s[1:], t[1:]) # Transposes

    return 1 + min(splt, dels, tras)

if __name__ == '__main__':

And finally, I built a class that encapsulates the state of the program and just ran it!

class SpellingBee(object):
    """
    Manages the state the spelling bee application.

    Read from stdin:
        If spelling is incorrect, returns correction
        If no suggestion can be returns "NO SUGGESTION"
    """
    def __init__(self, dictionary):
        self.words = dictionary.words

    def run(self):
        try:
            self()
        except (EOFError, KeyboardInterrupt):
            print('Thanks for Playing!')
            return

    def __call__(self):
        while True:
            spelling = input('> ')
            if args.test:
                print('Input: ' + spelling)
                time.sleep(1)
                sys.stdout.flush()

            if spelling in self.words:
                print(spelling)
                continue

            distance = functools.partial(leveinstein_distance, spelling)
            correction = Correct(spelling)
            candidates = list(correction.edits() & self.words)

            if not candidates:
                print('NO SUGGESTION')
                continue

            print(min((c for c in candidates), key=lambda x: distance(x)))


def main():
    words  = Dictionary('/usr/share/dict/words')

    if args.test:
        dictionary = list(words.words)
        incorrect = [Incorrect(random.choice(dictionary)) for _ in range(10)]
        mispelled = '\n'.join([split.edits() for split in incorrect])
        sys.stdin = StringIO(mispelled)

    spelling_bee = SpellingBee(words)
    spelling_bee.run()

Comments

comments powered by Disqus