"Python is a high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation."
high-level - easy for humans to understand
interpreted - we do not have to compile a whole program before running it
general-purpose - we will use it for bioinformatics, but you could also build web services, games, etc.
If you have a Google account: use Google Colab in your browser
Run an online notebook with Binder
Run Python Jupyter Notebooks on your PC: install the free version of Anaconda (follow this installation guide)
How we will work together
Before you start, put the red card on top. This indicates that you are still working on the challenge
are simple practical tasks you should try on your own
are more challenging practical tasks, where you can work in a group
are optional tasks, if you want to learn more. Only start them if you have finished all non-optional tasks until the next
Once you reach the recap mark, switch the cards. A green card indicates that everything is clear, a yellow card that we should discuss the solution together
At any time, if you have a question: raise your hand and work with your classmates
# importing the math library
import math
# here the user can define the input
...
# here the calculation is made
...
# here, we print out the results
...
2.2 Lists and loops
2.2.1 Special Data Types
Learning Objectives
You will be able to
use Python to formulate and evaluate Boolean expressions
use None type variables
Difference between Assignment (=) and Comparison (==)
eve_is_here = True  # Assignment: eve_is_here is set to True
adam_is_here = False  # Assignment
print(adam_is_here == eve_is_here)  # Comparison of the values
#> False
Case Study: Use Python to break up an RNA sequence into codons
You are given the following RNA sequence:
dna_sequence = ["A", "U", "C", "C", "G", "A", "G", "C", "U", "E", "G", "A", "G", "C", "U", "G", "Z", "G", "A", "G", "C", "U", "U"]
What sequence of amino acids does this RNA sequence encode?
Bonus: Find an open reading frame first
Task
As you can see, the data contains some errors. It should contain only the RNA nucleotides A, U, C, and G. First clean the data by removing all corrupted items from the list.
Next, create a loop that prints out all three-letter codons of the remaining sequence and stores them in a list:
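One possible way to approach the two steps above is sketched here. It assumes that every item other than A, U, C, or G counts as corrupted, and that an incomplete codon at the end is dropped. The names VALID_NUCLEOTIDES, cleaned_sequence, and codons are chosen for illustration and are not part of the original task.

```python
# Assumption: only these four letters are valid RNA nucleotides
VALID_NUCLEOTIDES = {"A", "U", "C", "G"}

dna_sequence = ["A", "U", "C", "C", "G", "A", "G", "C", "U", "E", "G", "A",
                "G", "C", "U", "G", "Z", "G", "A", "G", "C", "U", "U"]

# Step 1: keep only the valid items
cleaned_sequence = []
for nucleotide in dna_sequence:
    if nucleotide in VALID_NUCLEOTIDES:
        cleaned_sequence.append(nucleotide)

# Step 2: walk through the cleaned list in steps of three
codons = []
for start in range(0, len(cleaned_sequence) - 2, 3):
    codon = "".join(cleaned_sequence[start:start + 3])
    print(codon)
    codons.append(codon)
```

With the data above, the two corrupted items ("E" and "Z") are removed and seven complete codons remain.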
pin_correct = True
if pin_correct:
    print('Pin is correct!')
if-else-clause - two-sided
if <logical expression>:
    <statement a>
else:
    <statement b>
logical expression is evaluated
TRUE: statement a is executed, statement b is ignored
FALSE: statement b is executed, statement a is ignored
# Example: the result is printed
a = 4
b = 3
if a > b:
    print("a is larger than b")
else:
    print("b is larger than a")
Case Study: Write a program that manages lab access
Write a program that checks the lab clearance of anyone wanting to enter the lab:
Make a list of five or more usernames called users_with_clearance.
Make another list of five usernames called users_wanting_to_enter. Make sure one or two of the new usernames are also in the users_with_clearance list.
Loop through the users_wanting_to_enter list to see whether each user has lab clearance. Print a message that greets every user who has clearance. If a user does not have clearance, send them to a supervisor.
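A minimal sketch of the three steps could look like this. The usernames and the wording of the messages are made up for illustration, and the messages list is only added so the result can be inspected afterwards.

```python
# Hypothetical usernames, chosen only for this example
users_with_clearance = ["alice", "bob", "carol", "dave", "eve"]
users_wanting_to_enter = ["bob", "frank", "eve", "grace", "heidi"]

messages = []
for user in users_wanting_to_enter:
    if user in users_with_clearance:
        message = "Welcome to the lab, " + user + "!"
    else:
        message = "Sorry " + user + ", please see a supervisor."
    print(message)
    messages.append(message)
```

The membership test `user in users_with_clearance` scans the list, which is fine for a handful of names.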
Case Study: Write a program to analyze logistic growth
Write a simulation for logistic growth that evaluates the population size for all time steps starting from zero and stops only once the population is within ϵ=1 of the carrying capacity.
Sometimes it is unclear when to terminate the algorithm
Solution with for loop
for t in range(1, 100):
    current_population = <...>
Solution with while-loop
t = 0
while epsilon > 1:
    epsilon = carrying_capacity - current_population
    t = t+1
    current_population = <...>
Store the results for each time step in a list. The list should contain dictionaries that have the time step, current population size and the population growth since the last time step.
Find the time step with the maximum growth in population.
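The following sketch puts the while-loop, the result list, and the search for the maximum growth together. The update rule is an assumption, since the slides do not state it: it uses the common discrete logistic model P(t+1) = P(t) + r * P(t) * (1 - P(t) / K). The parameter values (growth_rate, carrying_capacity, starting population) and the dictionary keys are hypothetical.

```python
# Assumed parameters, not taken from the course material
growth_rate = 0.5
carrying_capacity = 1000
current_population = 10

t = 0
epsilon = carrying_capacity - current_population
results = [{"time_step": t, "population": current_population, "growth": 0.0}]

# Stop once the population is within 1 of the carrying capacity
while epsilon > 1:
    # Assumed update rule: discrete logistic growth
    growth = growth_rate * current_population * (1 - current_population / carrying_capacity)
    current_population = current_population + growth
    t = t + 1
    epsilon = carrying_capacity - current_population
    results.append({"time_step": t, "population": current_population, "growth": growth})

# Find the entry with the largest growth since the previous time step
max_growth_step = max(results, key=lambda entry: entry["growth"])
print(max_growth_step)
```

The while-loop fits here because the number of time steps is not known in advance; it depends on how quickly the population approaches the carrying capacity.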
counter = 0
for k_mer_1 in seq_1:  # Create all possible k-mers from seq_1
    for k_mer_2 in seq_2:  # Create all possible k-mers from seq_2
        if k_mer_1 == k_mer_2:  # Compare them
            counter = counter + 1  # Count the identical k-mers
seq_1 = "ABC"
seq_2 = "ABD"
"AB" == "AB"  # 1/4 of the k-mer pairs match
"AB" == "BD"
"BC" == "AB"
"BC" == "BD"
# H = 1
# L = 1
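The counting idea above can be turned into a runnable sketch. The helper names get_k_mers and count_matching_k_mers are hypothetical, and the sketch assumes that k-mers are all overlapping substrings of length k.

```python
def get_k_mers(sequence, k):
    """Return all overlapping substrings of length k (assumed k-mer definition)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_matching_k_mers(seq_1, seq_2, k):
    """Compare every k-mer of seq_1 with every k-mer of seq_2 and count matches."""
    counter = 0
    for k_mer_1 in get_k_mers(seq_1, k):
        for k_mer_2 in get_k_mers(seq_2, k):
            if k_mer_1 == k_mer_2:
                counter = counter + 1
    return counter


matches = count_matching_k_mers("ABC", "ABD", 2)
print(matches)  # prints 1
```

For "ABC" and "ABD" with k = 2, four pairs are compared and only "AB" == "AB" matches.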
When several algorithms solve a problem, how do you know which one is best?
Is it the simplest?
The fastest?
The smallest?
Or something else?
One way to judge an algorithm is by its run time. An algorithm's run time is the amount of time a computer needs to execute it.
import time
start = time.perf_counter()
for i in range(1, 6):
    print(i)
end = time.perf_counter()
print(end - start)
Linear time
The same algorithm takes (about) twice as long if we double the number of elements
e.g., looping through a range
import time
start = time.perf_counter()
for i in range(1, 12):
    print(i)
end = time.perf_counter()
print(end-start)
Constant Time
Finding the value for a key in a dictionary takes constant time, independent of the size of the dictionary. The computer only has to compute the hash function of the key to find the memory address.
Finding data in a linked list is more expensive, as the computer has to traverse the list to find the right position in memory
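The difference can be illustrated with a small sketch. A Python list of pairs stands in for the linked list here, because both have to be searched item by item. The function find_in_list and the step counter are hypothetical helpers added only to make the work visible.

```python
# Dictionary: the key is hashed, so the lookup does not scan the other entries
codon_table = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly"}
amino_acid_from_dict = codon_table["GGC"]

# List of pairs: we have to walk through the items until we find the key
codon_pairs = [("AUG", "Met"), ("UUU", "Phe"), ("GGC", "Gly")]


def find_in_list(pairs, key):
    """Return the value for key and the number of items inspected."""
    steps = 0
    for current_key, value in pairs:
        steps = steps + 1
        if current_key == key:
            return value, steps
    return None, steps


amino_acid_from_list, steps_needed = find_in_list(codon_pairs, "GGC")
print(amino_acid_from_dict, amino_acid_from_list, steps_needed)
```

In the worst case the list search inspects every item, so its cost grows with the length of the list, while the dictionary lookup does not.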
Quadratic Time
Nesting two for loops that each run over a sequence results in quadratic time complexity
if the sequences are twice as long, the algorithm takes about four times as long
counter = 0
for k_mer_1 in seq_1:  # Create all possible k-mers from seq_1
    for k_mer_2 in seq_2:  # Create all possible k-mers from seq_2
        if k_mer_1 == k_mer_2:  # Compare them
            counter = counter + 1  # Count the identical k-mers
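Instead of measuring time, which varies between runs, the quadratic growth can be made visible by counting the comparisons. This is a sketch with a hypothetical helper count_comparisons; the example sequences are made up.

```python
def count_comparisons(seq_1, seq_2):
    """Run the nested loops and return how many comparisons were made."""
    comparisons = 0
    for item_1 in seq_1:
        for item_2 in seq_2:
            comparisons = comparisons + 1  # one comparison per pair of items
    return comparisons


short_run = count_comparisons("ACGT", "ACGT")
long_run = count_comparisons("ACGTACGT", "ACGTACGT")
print(short_run, long_run)  # prints 16 64
```

Doubling the length of both sequences from 4 to 8 items raises the number of comparisons from 16 to 64, a factor of four.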
O-Notation
gives us the order of magnitude of the time complexity of an algorithm
with larger data sets (larger n), this becomes very important
Case Study: Develop an Algorithm for finding Similarity
We have three Sequences from three different Influenza viruses
The sequences encode the same protein but have some variations
Your task is to find the two samples that are most closely related to each other
this might be the most challenging task so far, as it requires some creativity
pen and paper are useful tools to structure your ideas
this is good preparation for your final project
60 minutes
2.7 Testing and Final Challenge
2.7.1 Testing
Learning Objectives
You will be able to
use unit tests to test the functionality of your functions
develop and test your own algorithms and code to find practical solutions for sequence analysis
Functions
def get_formatted_name(first, last):
    """Generate a neatly formatted full name."""
    full_name = first + ' ' + last
    return full_name.title()
must fulfill certain requirements
grow and change over time
Unittests
import unittest
# We define a class that holds the different tests for this specific function
class NamesTestCase(unittest.TestCase):
    """Tests for 'name_function.py'."""

    # We define one of several test functions for the function
    def test_first_last_name(self):
        """Do names like 'Janis Joplin' work?"""
        # we call the function
        formatted_name = get_formatted_name('janis', 'joplin')
        # we check whether we get the expected result
        self.assertEqual(formatted_name, 'Janis Joplin')

unittest.main(argv=[''], verbosity=2, exit=False)
provide an automatic test case for a unit of code
you can run them after each change to the function to check whether it still works as expected
2.7.2 Final Project: Reconstruction in Shotgun-Sequencing
So far, we have assumed that full-length DNA sequences are available from a database or another source. In practice, the technology to read long strands of DNA directly and store them in a database is still being developed. Instead, the DNA strand is broken down into shorter sequences that are read individually. Afterwards, we have to bring the short sequences back into the right order.
"Shotgun sequencing is a laboratory technique for determining the DNA sequence of an organism’s genome. The method involves randomly breaking up the genome into small DNA fragments that are sequenced individually. A computer program looks for overlaps in the DNA sequences, using them to reassemble the fragments in their correct order to reconstitute the genome." genome.gov
You have a list of DNA fragments that have to be put into the right order. Your task is to implement an algorithm that brings the fragments of the DNA sequence into the right order.
Verbal Description of the Algorithm
Randomly pick a sequence of five to eight nucleotides from the following original sequence and write it on a piece of paper: "TAGCTAGCTAGCTTTTAGTTAGCAGCC"
write Your name on the other side of the paper
Algorithm assemble_sequence()
Input: List of sequence fragments
Output: Assembled sequence
1) draw a random seed sequence
2) while sequences are left in the list
    2.1) draw next sequence
    2.2) compare_beginning(overlap)
        if exact match
            glue_fragments_end()
    2.3) compare_end(overlap)
        if exact match
            glue_fragments_end()
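The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions, not the notebook's reference solution: the function signatures, the fixed overlap length, the seed parameter, and the toy fragments are all hypothetical, and the gluing is done inline rather than in separate functions. The sketch also assumes every fragment is longer than the overlap.

```python
import random


def compare_beginning(assembled, fragment, overlap):
    """True if the end of fragment matches the beginning of assembled."""
    return fragment[-overlap:] == assembled[:overlap]


def compare_end(assembled, fragment, overlap):
    """True if the beginning of fragment matches the end of assembled."""
    return assembled[-overlap:] == fragment[:overlap]


def assemble_sequence(fragments, overlap, seed=None):
    """Greedy assembly: glue fragments until no new match is found."""
    rng = random.Random(seed)
    remaining = list(fragments)
    rng.shuffle(remaining)
    assembled = remaining.pop()  # 1) draw a random seed sequence
    progress = True
    while remaining and progress:  # 2) stop when nothing matches anymore
        progress = False
        for fragment in list(remaining):
            if compare_beginning(assembled, fragment, overlap):
                assembled = fragment[:-overlap] + assembled  # glue in front
            elif compare_end(assembled, fragment, overlap):
                assembled = assembled + fragment[overlap:]  # glue at the end
            else:
                continue
            remaining.remove(fragment)
            progress = True
    return assembled


# Toy data with unambiguous overlaps of length 2 (not the course sequence)
toy_fragments = ["ATGCG", "CGTACC", "CCTGA"]
print(assemble_sequence(toy_fragments, overlap=2, seed=1))
```

With repetitive sequences, a short overlap can match in the wrong place, which is one reason to repeat the assembly with different random orders.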
does the algorithm terminate?
yes, at some point we cannot find any new matches
but we cannot be sure that we have found the whole original sequence
is the algorithm deterministic?
it depends on how we select the sequences
we can get stuck if we do it in the wrong order
hence, we should draw at random and do it multiple times
Your task
You will implement this algorithm in Python
The starting point is given in the notebook
Your submission is a *.ipynb file showing all the results in the sakai-task
the naming convention must be
<lastname-member1>.pdf
e.g. huber.pdf
Deadline: 27.11.2023 (24:00)
What is given to you
Jupyter Notebook with the outline
the original DNA-sequence and the list of fragments
four functions with name, parameters and docstring
three test cases, which must be passed to get the points
Grading
This is the basis for the final grade.
Passing the unit tests for the following functions:
CompareBeginningsTestCase - 20 %
CompareEndingsTestCase - 20 %
GlueFragmentsTestCase - 20 %
Implement the following function without unit test:
assemble_sequence() - 10 %
Coding style and comments - 20 %
Length of final sequence and speed of execution of assemble_sequence() - 10 %