7. Working with External Libraries

In this lesson, I'll be talking about imports in Python, giving some tips for working with unfamiliar libraries (and the objects they return), and digging into the guts of Python just a bit to talk about operator overloading.

# Imports

So far we've talked about types and functions which are built into the language.

But one of the best things about Python (especially if you're a data scientist) is the vast number of high-quality custom libraries that have been written for it.

Some of these libraries are in the "standard library", meaning you can find them anywhere you run Python. Other libraries can be easily added, even if they aren't always shipped with Python.

Either way, we'll access this code with imports.

We'll start our example by importing math from the standard library.

import math

print("It's math! It has type {}".format(type(math)))

math is a module. A module is just a collection of variables (a namespace, if you like) defined by someone else. We can see all the names in math using the built-in function dir().

print(dir(math))
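
The list that dir() prints is long, and it includes "private" names that start with an underscore. As a small sketch (the variable name public_names is my own, not something defined by math), we can filter it down to the public names:

```python
import math

# Keep only the names that don't start with an underscore
public_names = [name for name in dir(math) if not name.startswith('_')]
print(len(public_names), "public names, including:", public_names[:5])

# Familiar names such as pi and log are in there
print('pi' in public_names, 'log' in public_names)
```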

We can access these variables using dot syntax. Some of them refer to simple values, like math.pi:

print("pi to 4 significant digits = {:.4}".format(math.pi))

But most of what we'll find in the module are functions, like math.log:

math.log(32, 2) # $log_2 32$

Of course, if we don't know what math.log does, we can call help() on it:

help(math.log)

We can also call help() on the module itself. This will give us the combined documentation for all the functions and values in the module (as well as a high-level description of the module).

help(math)

# Other import syntax

If we know we'll be using functions in math frequently we can import it under a shorter alias to save some typing (though in this case "math" is already pretty short).

import math as mt
mt.pi

You may have seen code that does this with certain popular libraries like Pandas, Numpy, Tensorflow, or Matplotlib. For example, it's a common convention to import numpy as np and import pandas as pd.

The as simply renames the imported module. It's equivalent to doing something like:

import math
mt = math
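
A quick way to convince ourselves that the alias is just a second name for the same module (a small check, not something you'd normally write in real code) is to compare the two with is:

```python
import math
import math as mt

# Both names point at the very same module object
print(mt is math)        # True
print(mt.pi == math.pi)  # True
```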

Wouldn't it be great if we could refer to all the variables in the math module by themselves? i.e. if we could just refer to pi instead of math.pi or mt.pi? Good news: we can do that.

from math import *
print(pi, log(32, 2))

import * makes all the module's variables directly accessible to you (without any dotted prefix).

Bad news: some purists might grumble at you for doing this.

Worse: they kind of have a point.

from math import *
from numpy import *
print(pi, log(32, 2)) # raises a TypeError

What the what? But it worked before!

These kinds of "star imports" can occasionally lead to weird, difficult-to-debug situations.

The problem in this case is that the math and numpy modules both have functions called log, but they have different semantics. Because we import from numpy second, its log overwrites (or "shadows") the log variable we imported from math.
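
We can reproduce the same kind of shadowing using only the standard library. In this sketch I've swapped numpy for cmath (the standard library's complex-number counterpart to math), which also defines a function called log. first_log is just a name I made up to hold on to the original function.

```python
from math import log
first_log = log            # remember math's log before it gets replaced

from cmath import log      # same name, different function

print(log is first_log)        # False: log now refers to cmath's version
print(type(first_log(32, 2)))  # <class 'float'>
print(type(log(32, 2)))        # <class 'complex'>
```

Nothing warned us that the second import replaced the first. The only visible symptom is that log starts behaving differently.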

A good compromise is to import only the specific things we'll need from each module:

from math import log, pi
from numpy import asarray

# Submodules

We've seen that modules contain variables which can refer to functions or values. Something to be aware of is that they can also have variables referring to other modules.

import numpy
print("numpy.random is a", type(numpy.random))
print("it contains names such as...",
      dir(numpy.random)[-15:]
     )

So if we import numpy as above, then calling a function in the random "submodule" will require two dots.

# Roll 10 dice (high is exclusive, so high=7 gives values from 1 to 6)
rolls = numpy.random.randint(low=1, high=7, size=10)
rolls
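
The standard library has modules nested inside modules too. As a sketch that doesn't need numpy, os.path is a module reachable as an attribute of os, and it can also be imported directly to save a dot. The file name below is made up for illustration.

```python
import os
from os import path   # import the nested module under its own name

print(type(os.path))    # <class 'module'>
print(path is os.path)  # True

# Two dots when going through the parent module...
print(os.path.basename("data/rolls.csv"))
# ...one dot when the nested module was imported directly
print(path.basename("data/rolls.csv"))
```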

# Oh the places you'll go, oh the objects you'll see

So after 6 lessons, you're a pro with ints, floats, bools, lists, strings, and dicts (right?).

Even if that were true, it doesn't end there. As you work with various libraries for specialized tasks, you'll find that they define their own types which you'll have to learn to work with. For example, if you work with the graphing library matplotlib, you'll be coming into contact with objects it defines which represent Subplots, Figures, TickMarks, and Annotations. pandas functions will give you DataFrames and Series.

In this section, I want to share with you a quick survival guide for working with strange types.

# Three tools for understanding strange objects

In the cell above, we saw that calling a numpy function gave us an "array". We've never seen anything like this before (not in this course anyways). But don't panic: we have three familiar built-in functions to help us here.

1: type() (what is this thing?)

type(rolls)

2: dir() (what can I do with it?)

print(dir(rolls))

# What am I trying to do with this dice roll data? Maybe I want the average roll, in which case the "mean"
# method looks promising...
rolls.mean()

# Or maybe I just want to get back on familiar ground, in which case I might want to check out "tolist"
rolls.tolist()

3: help() (tell me more)

# That "ravel" attribute sounds interesting. I'm a big classical music fan.
help(rolls.ravel)

# Okay, just tell me everything there is to know about numpy.ndarray
# (Be warned: the output is novel-length)
help(rolls)

(Of course, you might also prefer to check out the online docs.)
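
The same three steps work on any unfamiliar object, not just numpy arrays. Here's a sketch using collections.Counter from the standard library as the "strange" object. The dice values and variable names are made up.

```python
from collections import Counter

counts = Counter([3, 4, 1, 2, 2, 1])

# 1: what is this thing?
print(type(counts))

# 2: what can I do with it? (hiding the underscore names for readability)
print([name for name in dir(counts) if not name.startswith('_')])

# 3: tell me more (the docstring is the text that help() displays)
print(counts.most_common.__doc__)

# Having read the docs, try it out: the two most frequent rolls
top_two = counts.most_common(2)
print(top_two)
```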

# Operator overloading

What's the value of the below expression?

[3, 4, 1, 2, 2, 1] + 10 # TypeError: can only concatenate list (not "int") to list

What a silly question. Of course it's an error.

But what about...

rolls + 10

We might think that Python strictly polices how pieces of its core syntax behave such as +, <, in, ==, or square brackets for indexing and slicing. But in fact, it takes a very hands-off approach. When you define a new type, you can choose how addition works for it, or what it means for an object of that type to be equal to something else.

The designers of lists decided that adding them to numbers wasn't allowed. The designers of numpy arrays went a different way (adding the number to each element of the array).
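
We can already see this with the built-in types: the same + symbol means something different depending on what's on either side of it. A quick sketch (the name shifted is my own):

```python
print(1 + 2)          # numbers: arithmetic addition -> 3
print("ab" + "cd")    # strings: concatenation -> abcd
print([1, 2] + [3])   # lists: concatenation -> [1, 2, 3]

# Lists refuse to be added to a number...
try:
    [3, 4, 1, 2, 2, 1] + 10
except TypeError as err:
    print("TypeError:", err)

# ...so adding 10 to every element of a plain list needs an explicit loop
shifted = [roll + 10 for roll in [3, 4, 1, 2, 2, 1]]
print(shifted)        # [13, 14, 11, 12, 12, 11]
```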

Here are a few more examples of how numpy arrays interact unexpectedly with Python operators (or at least differently from lists).

# At which indices are the dice less than or equal to 3?
rolls <= 3

xlist = [[1,2,3],[2,4,6],]
# Create a 2-dimensional array
x = numpy.asarray(xlist)
print("xlist = {}\nx =\n{}".format(xlist, x))

# Get the last element of the second row of our numpy array
x[1,-1]

# Get the last element of the second sublist of our nested list?
xlist[1,-1] # TypeError: list indices must be integers or slices, not tuple

numpy's ndarray type is specialized for working with multi-dimensional data, so it defines its own logic for indexing, allowing us to index by a tuple to specify the index at each dimension.
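
Under the hood, an expression like x[1, -1] hands the tuple (1, -1) to the object as a single index, and it's up to the object's type to decide what that means. Lists don't accept tuples, so with a nested list we index one level at a time. A sketch using only the nested list (last_of_second and index are names I made up):

```python
xlist = [[1, 2, 3], [2, 4, 6]]

# Index one level at a time: second sublist, then its last element
last_of_second = xlist[1][-1]
print(last_of_second)   # 6

# Writing 1, -1 inside the brackets is the same as passing the tuple (1, -1)
index = (1, -1)
try:
    xlist[index]
except TypeError as err:
    print("TypeError:", err)
```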

# When does 1 + 1 not equal 2?

Things can get weirder than this. You may have heard of (or even used) tensorflow, a Python library popularly used for deep learning. It makes extensive use of operator overloading.

import tensorflow as tf
# Create two constants, each with value 1
a = tf.constant(1)
b = tf.constant(1)
# Add them together to get...
a + b

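In TensorFlow 1.x, a + b isn't 2, it is (to quote tensorflow's documentation)...

a symbolic handle to one of the outputs of an Operation. It does not hold the values of that operation's output, but instead provides a means of computing those values in a TensorFlow tf.Session.

(TensorFlow 2.x runs operations eagerly by default, so there a + b does hold the value 2. Even so, the result is still a Tensor object rather than a plain Python int.)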

It's important just to be aware of the fact that this sort of thing is possible and that libraries will often use operator overloading in non-obvious or magical-seeming ways.

Understanding how Python's operators work when applied to ints, strings, and lists is no guarantee that you'll be able to immediately understand what they do when applied to a tensorflow Tensor, or a numpy ndarray, or a pandas DataFrame.

Once you've had a little taste of DataFrames, for example, an expression like the one below starts to look appealingly intuitive:

# Get the rows with population over 1m in South America
df[(df['population'] > 10**6) & (df['continent'] == 'South America')]

But why does it work? The example above features something like 5 different overloaded operators. What's each of those operations doing? It can help to know the answer when things start going wrong.

# Curious how it all works?

Have you ever called help() or dir() on an object and wondered what the heck all those names with the double-underscores were?

print(dir(list))

This turns out to be directly related to operator overloading.

When Python programmers want to define how operators behave on their types, they do so by implementing methods with special names beginning and ending with 2 underscores such as __lt__, __setattr__, or __contains__. Generally, names that follow this double-underscore format have a special meaning to Python.

So, for example, the expression x in [1, 2, 3] is actually calling the list method __contains__ behind-the-scenes. It's equivalent to (the much uglier) [1, 2, 3].__contains__(x).
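
To make that concrete, here's a sketch. The first part shows the method calls behind a couple of familiar expressions. The second part defines a toy type, which I've called Rolls (it's made up for this example, and is not how numpy implements arrays), whose __add__ method adds a number to every element.

```python
# Operators on built-in types are method calls in disguise
print(2 in [1, 2, 3], [1, 2, 3].__contains__(2))   # True True
print(3 < 5, (3).__lt__(5))                        # True True

class Rolls:
    """A toy list-like type with numpy-style addition (hypothetical)."""
    def __init__(self, values):
        self.values = list(values)

    def __add__(self, number):
        # Called for the expression: rolls + number
        return Rolls([v + number for v in self.values])

    def __contains__(self, value):
        # Called for the expression: value in rolls
        return value in self.values

shifted = Rolls([3, 4, 1]) + 10
print(shifted.values)   # [13, 14, 11]
print(13 in shifted)    # True
```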

If you're curious to learn more, you can check out Python's official documentation, which describes many, many more of these special "underscores" methods.

We won't be defining our own types in these lessons (if only there was time!), but I hope you'll get to experience the joys of defining your own wonderful, weird types later down the road.

# Exercise: Working with External Libraries

Head over to the final coding exercise for one more round of coding questions involving imports, working with unfamiliar objects, and, of course, more gambling.

# 1.

After completing the exercises on lists and tuples, Jimmy noticed that, according to his estimate_average_slot_payout function, the slot machines at the Learn Python Casino are actually rigged against the house, and are profitable to play in the long run.

Starting with $200 in his pocket, Jimmy has played the slots 500 times, recording his new balance in a list after each spin. He used Python's matplotlib library to make a graph of his balance over time:

# Import the jimmy_slots submodule
from learntools.python import jimmy_slots
# Call the get_graph() function to get Jimmy's graph
graph = jimmy_slots.get_graph()
graph

As you can see, he's hit a bit of bad luck recently. He wants to tweet this along with some choice emojis, but, as it looks right now, his followers will probably find it confusing. He's asked if you can help him make the following changes:

Add the title "Results of 500 slot machine pulls"
Make the y-axis start at 0.
Add the label "Balance" to the y-axis

After calling type(graph) you see that Jimmy's graph is of type matplotlib.axes._subplots.AxesSubplot. Hm, that's a new one. By calling dir(graph), you find three methods that seem like they'll be useful: .set_title(), .set_ylim(), and .set_ylabel().

Use these methods to complete the function prettify_graph according to Jimmy's requests. We've already checked off the first request for you (setting a title).

(Remember: if you don't know what these methods do, use the help() function!)

def prettify_graph(graph):
    """Modify the given graph according to Jimmy's requests: add a title, make the y-axis
    start at 0, label the y-axis. (And, if you're feeling ambitious, format the tick marks
    as dollar amounts using the "$" symbol.)
    """
    graph.set_title("Results of 500 slot machine pulls")
    # Complete steps 2 and 3 here
    # Make the y-axis begin at 0
    graph.set_ylim(bottom=0)
    # Label the y-axis
    graph.set_ylabel("Balance")
    # Bonus: format the numbers on the y-axis as dollar amounts
    # An array of the values displayed on the y-axis (150, 175, 200, etc.)
    ticks = graph.get_yticks()
    # Format those values into strings beginning with dollar sign
    new_labels = ['${}'.format(int(amt)) for amt in ticks]
    # Set the new labels
    graph.set_yticklabels(new_labels)

graph = jimmy_slots.get_graph()
prettify_graph(graph)
graph

# 2. 🌶️🌶️

This is a very hard problem. Feel free to skip it if you are short on time:

Luigi is trying to perform an analysis to determine the best items for winning races on the Mario Kart circuit. He has some data in the form of lists of dictionaries that look like...

[
    {'name': 'Peach', 'items': ['green shell', 'banana', 'green shell',], 'finish': 3},
    {'name': 'Bowser', 'items': ['green shell',], 'finish': 1},
    # Sometimes the racer's name wasn't recorded
    {'name': None, 'items': ['mushroom',], 'finish': 2},
    {'name': 'Toad', 'items': ['green shell', 'mushroom'], 'finish': 1},
]

'items' is a list of all the power-up items the racer picked up in that race, and 'finish' was their placement in the race (1 for first place, 3 for third, etc.).

He wrote the function below to take a list like this and return a dictionary mapping each item to how many times it was picked up by first-place finishers.

# Import luigi's full dataset of race data
from learntools.python.luigi_analysis import full_dataset

# Fix me!
def best_items(racers):
    winner_item_counts = {}
    for i in range(len(racers)):
        # The i'th racer dictionary
        racer = racers[i]
        # We're only interested in racers who finished in first
        if racer['finish'] == 1:
            for item in racer['items']:
                # Add one to the count for this item (adding it to the dict if necessary)
                if item not in winner_item_counts:
                    winner_item_counts[item] = 0
                winner_item_counts[item] += 1

        # Data quality issues :/ Print a warning about racers with no name set. We'll take care of it later.
        if racer['name'] is None:
            print("WARNING: Encountered racer with unknown name on iteration {}/{} (racer = {})".format(
                i+1, len(racers), racer['name'])
                 )
    return winner_item_counts

# Try analyzing the imported full dataset
best_items(full_dataset)

Solution: Luigi used the variable name i to represent each item in racer['items']. However, he also used i as the loop variable for the outer loop (for i in range(len(racers))). These i's are clobbering each other. This becomes a problem only if we encounter a racer with a finish of 1 and a name of None. If that happens, when we try to print the "WARNING" message, i refers to a string like "green shell", which Python can't add to an integer, hence a TypeError.

This is similar to the issue we saw when we imported * from math and numpy. They both contained variables called log, and the one we got when we tried to call it was the wrong one.

We can fix this by using different loop variables for the inner and outer loops. i wasn't a very good variable name for the inner loop anyways. for item in racer['items'] fixes the bug and is easier to read.
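
Here's a stripped-down sketch of the same clobbering, using made-up data rather than Luigi's dataset. After the inner loop finishes, i no longer holds the outer loop's index.

```python
racers_items = [['green shell', 'banana'], ['mushroom']]

seen_types = []
for i in range(len(racers_items)):
    for i in racers_items[i]:   # oops: reuses the outer loop's variable
        pass
    # Back in the outer loop body, i is now the last item, not an index
    seen_types.append(type(i).__name__)

print(seen_types)   # ['str', 'str']

# Using a different name for the inner loop keeps the index intact
for i in range(len(racers_items)):
    for item in racers_items[i]:
        pass
    assert isinstance(i, int)
```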

Variable shadowing bugs like this don't come up super often, but when they do they can take an infuriating amount of time to diagnose!

# 3. 🌶️

Suppose we wanted to create a new type to represent hands in blackjack. One thing we might want to do with this type is overload the comparison operators like > and <= so that we could use them to check whether one hand beats another. e.g. it'd be cool if we could do this:

>>> hand1 = BlackjackHand(['K', 'A'])
>>> hand2 = BlackjackHand(['7', '10', 'A'])
>>> hand1 > hand2
True

Well, we're not going to do all that in this question (defining custom classes is a bit beyond the scope of these lessons), but the code we're asking you to write in the function below is very similar to what we'd have to write if we were defining our own BlackjackHand class. (We'd put it in the __gt__ magic method to define our custom behaviour for >.)

Fill in the body of the blackjack_hand_greater_than function according to the docstring.

def hand_total(hand):
    """Helper function to calculate the total points of a blackjack hand.
    """
    total = 0
    # Count the number of aces and deal with how to apply them at the end.
    aces = 0
    for card in hand:
        if card in ['J', 'Q', 'K']:
            total += 10
        elif card == 'A':
            aces += 1
        else:
            # Convert number cards (e.g. '7') to ints
            total += int(card)
    # At this point, total is the sum of this hand's cards *not counting aces*.

    # Add aces, counting them as 1 for now. This is the smallest total we can make from this hand
    total += aces
    # "Upgrade" aces from 1 to 11 as long as it helps us get closer to 21
    # without busting
    while total + 10 <= 21 and aces > 0:
        # Upgrade an ace from 1 to 11
        total += 10
        aces -= 1
    return total

def blackjack_hand_greater_than(hand_1, hand_2):
    """
    Return True if hand_1 beats hand_2, and False otherwise.
    
    In order for hand_1 to beat hand_2 the following must be true:
    - The total of hand_1 must not exceed 21
    - The total of hand_1 must exceed the total of hand_2 OR hand_2's total must exceed 21
    
    Hands are represented as a list of cards. Each card is represented by a string.
    
    When adding up a hand's total, cards with numbers count for that many points. Face
    cards ('J', 'Q', and 'K') are worth 10 points. 'A' can count for 1 or 11.
    
    When determining a hand's total, you should try to count aces in the way that 
    maximizes the hand's total without going over 21. e.g. the total of ['A', 'A', '9'] is 21,
    the total of ['A', 'A', '9', '3'] is 14.
    
    Examples:
    >>> blackjack_hand_greater_than(['K'], ['3', '4'])
    True
    >>> blackjack_hand_greater_than(['K'], ['10'])
    False
    >>> blackjack_hand_greater_than(['K', 'K', '2'], ['3'])
    False
    """
    total_1 = hand_total(hand_1)
    total_2 = hand_total(hand_2)
    return total_1 <= 21 and (total_1 > total_2 or total_2 > 21)

q3.check()
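
For the curious, here's a rough sketch of how the same logic could be wrapped in the BlackjackHand class imagined at the start of this question. The class layout (the cards attribute and the total method) is my own assumption. Only __gt__ is a name Python itself looks for when it evaluates >.

```python
class BlackjackHand:
    """Hypothetical wrapper around a list of card strings."""
    def __init__(self, cards):
        self.cards = cards

    def total(self):
        # Count aces as 1 to begin with, then upgrade them to 11 while it's safe
        total = 0
        aces = 0
        for card in self.cards:
            if card in ['J', 'Q', 'K']:
                total += 10
            elif card == 'A':
                total += 1
                aces += 1
            else:
                total += int(card)
        while aces > 0 and total + 10 <= 21:
            total += 10
            aces -= 1
        return total

    def __gt__(self, other):
        # Called for the expression: self > other
        mine, theirs = self.total(), other.total()
        return mine <= 21 and (mine > theirs or theirs > 21)

hand1 = BlackjackHand(['K', 'A'])
hand2 = BlackjackHand(['7', '10', 'A'])
print(hand1 > hand2)   # True
```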

# The end

You've finished the Python micro-course. Congrats!

As always, if you have any questions about these exercises, or anything else you encountered in the course, come to the Learn Forum.

You probably didn't put in all these hours of learning Python just to play silly games of chance, right? If you're interested in applying your newfound Python skills to some data science tasks, check out some of our other Kaggle Courses.

Happy Pythoning!

Last updated: 2021/02/16, 02:45:37
