The first is just a simple count of all nucleotides within a string s. This problem can be seen as simply loading a sequence from a file (and stripping out the newline), and then finding and reporting frequencies. I am doing everything within an instarepl tab in LightTable, so there will not be any main function.
Let's look at a first version of the counting code, and see whether it can be cleaned up once the functionality is correct.
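(defn displayFrequencies [s]
  {:pre [(string? s)]} ; preconditions must be given as a vector of expressions
  ;; count each character, then sort the counts by key
  (let [f (into (sorted-map) (frequencies (clojure.string/upper-case s)))]
    (println (vals f))))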
Here we first take a string called s, using a precondition to check its type before the body runs. For this application we expect every sequence to be represented as a string, though a vector of character literals would be equally valid. The frequencies function counts the occurrences of each character in the string. Its output is an ordinary map, whose keys are not guaranteed to be in any particular order. It is helpful, however, to have the entries in lexicographical order (partly because that is the form the problem expects). A sorted-map is like a hash-map, except that its entries are kept sorted by key.
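As a rough illustration of those two steps at the REPL, here is a small sketch on a made-up sequence (the sample string and the names counts and sorted-counts are only for illustration, not part of the solution):

```clojure
(require '[clojure.string :as str])

;; frequencies returns a map of character -> count; key order is not guaranteed
(def counts (frequencies (str/upper-case "agcttttcattctgac")))

;; pouring the entries into a sorted-map orders them by key: \A, \C, \G, \T
(def sorted-counts (into (sorted-map) counts))

(println sorted-counts)        ; {A 3, C 4, G 2, T 7}
(println (vals sorted-counts)) ; (3 4 2 7)
```

Because the keys are sorted, vals always returns the counts in A, C, G, T order, which is what makes the printed output predictable.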
While this first version of displayFrequencies works, it binds f only for the sake of clarity and still leaves a run of nested calls. It can be written more clearly using the thread-last macro (->>), as below.
(defn displayFrequencies [s]
  {:pre [(string? s)]}
  (->> s
       clojure.string/upper-case
       frequencies
       (into (sorted-map))
       vals
       println))
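To convince yourself that the threaded form does the same work as the nested one, a quick sketch like the following compares the two on a short string (s, nested and threaded are illustrative names only, and the println step is left off so the results can be compared):

```clojure
(require '[clojure.string :as str])

(def s "acgtac")

;; nested calls, read from the inside out
(def nested
  (vals (into (sorted-map) (frequencies (str/upper-case s)))))

;; the same pipeline with ->>, read top to bottom; each result is passed
;; as the last argument of the next form
(def threaded
  (->> s
       str/upper-case
       frequencies
       (into (sorted-map))
       vals))

(println (= nested threaded)) ; true
```

Note that (into (sorted-map)) needs its parentheses inside the threading form, since the map of frequencies has to be inserted as its final argument.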
We really want to be able to apply this function, or another one, to the contents of a file containing a sequence. To make that possible, we will create a function that accepts a filename and a function to apply to the string read from the file.
(defn processFile [filename func]
  ;; fn? lives in clojure.core, so clojure.test does not need to be loaded
  {:pre [(string? filename) (fn? func)]}
  (->> filename
       clojure.java.io/file
       slurp
       clojure.string/trim
       func))
(processFile "sequence.txt" displayFrequencies)
This makes it easy to check the GC content of the sequence once that function is available.
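As a side note on why the trim step matters, here is a minimal sketch that uses a temporary file in place of sequence.txt (the temp file, the sample sequence, and the names tmp, raw and trimmed are assumptions made for the example):

```clojure
(require '[clojure.string :as str])

;; hypothetical stand-in for sequence.txt, removed when the JVM exits
(def tmp (doto (java.io.File/createTempFile "sequence" ".txt")
           (.deleteOnExit)))
(spit tmp "AGCTTTTCATTCTGAC\n")

(def raw (slurp tmp))        ; still carries the trailing newline
(def trimmed (str/trim raw)) ; what the processing function should see

(println (count raw) (count trimmed))            ; 17 16
(println (contains? (frequencies raw) \newline)) ; true
```

Without the trim, the newline would be counted as a fifth "nucleotide", and because it sorts before the letters it would show up first in the printed counts.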