Project: tools.jsephler.co.uk

tools.jsephler.co.uk is a project site I have set up to house a range of tools, from converters and calculators to more general links to reference sites. It was built with ease of use and portability in mind and is fully compatible with mobile devices too!

I think of the site as my coding playground: it allows me to create and share tools which I find not only interesting but, more importantly, useful. Thanks to Django, the site is not limited to what you see today and will hopefully be expanded over time.

I have some existing projects that currently live in a myriad of Python files, which I hope to convert and eventually host here.

Suggestions or comments welcome! Please get involved.

tools.jsephler.co.uk

Compile and install Python’s MOD_WSGI for Apache2 in Ubuntu 18.04

Having installed mod_wsgi a couple of times on different machines, I decided to note down the steps I took and put together a guide. There are also some solutions to some issues that you may encounter, which left me clueless for a while. If you are looking for a guide to deploying a Flask/Django project, you will have to look elsewhere (for the time being).

This is for use with python3 using Ubuntu 18.04’s shipped version. You may need to tweak this guide to point to non-standard versions of python3 that you specifically want to use.

I believe some Ubuntu versions have pre-packaged mod_wsgi installations available. However, as noted in a previous blog post, apache2 may then flood its error logs with complaints about a mismatch between python3 or apache2 versions. In my view, a clean compilation ensures a tailored fit to your current system without the worry of logging and runtime errors.

Things to note:

  • I prefer to compile inside the “/opt” directory. Please amend the guide to suit your preference
  • I will refer to the current latest version of mod_wsgi (4.6.4 at time of writing). This is for illustrative purposes only, and some directory paths will be different with different versions
  • The guide is built around the CLI, and assumes you are comfortable with the basics

Pre-Requisites

You will need to have installed some additional packages:

  • python3-dev  –  mainly for python3 header files for compilation
  • apache2
  • apache2-dev
  • gcc  –  *or an equivalent C compiler*

You may use apt install for all the above packages
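Before compiling, it can help to confirm that the build tools are actually present. The following is only a rough sketch of such a check, not part of mod_wsgi itself: the function name check_build_prerequisites is made up for illustration, and the check assumes the tools are on your PATH under their usual names.

```python
import os
import shutil
import sysconfig


def check_build_prerequisites():
    """Return a dict mapping each build requirement to True or False."""
    # Directory where the running interpreter expects its C header files.
    include_dir = sysconfig.get_paths()["include"]
    return {
        # Any C compiler will do; gcc and cc are the usual command names.
        "C compiler": any(shutil.which(name) for name in ("gcc", "cc")),
        # apxs is provided by apache2-dev (assumed to also appear as apxs2 on some systems).
        "apxs": any(shutil.which(name) for name in ("apxs", "apxs2")),
        # Python.h is provided by python3-dev.
        "Python headers": os.path.isfile(os.path.join(include_dir, "Python.h")),
    }


if __name__ == "__main__":
    for requirement, present in check_build_prerequisites().items():
        print(f"{requirement}: {'found' if present else 'MISSING'}")
```

Run it with the same python3 you intend to build against, since the header check looks at the include directory of whichever interpreter executes the script.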

Next, you will need to locate the latest mod_wsgi source files (hosted in the GrahamDumpleton/mod_wsgi repository on GitHub).

Downloading

Navigate to /opt and download the latest version of mod_wsgi:

$ cd /opt
$ wget https://github.com/GrahamDumpleton/mod_wsgi/archive/4.6.4.tar.gz

Unpack the tar:

$ tar -xzf 4.6.4.tar.gz mod_wsgi-4.6.4/

Configure Make file

Change into the unpacked directory:

$ cd mod_wsgi-4.6.4/

Complete a “test” run of the configure script so you can debug any errors:

$ ./configure

  • Error: “configure: error: no acceptable C compiler found in $PATH”
    Solution: install a C compiler:

    $ sudo apt install gcc

  • Error: “Checking Apache Version.. ./configure: line 2765: apxs: command not found”
    Solution: install apache2-dev

    $ sudo apt install apache2-dev

  • Error: “checking for python… no”
    Solution: add argument to ./configure to point to python3 (global) or full path

    $ ./configure --with-python=python3

If you run configure without any errors, you’re ready to compile!

Compiling mod_wsgi

The Makefile is ready. Run make:

$ make

This will now compile mod_wsgi. It should complete without error.

We finish the process by installing mod_wsgi into our apache2 installation (root privileges are normally needed to write to the Apache modules directory):

$ sudo make install

Configure Apache2

Finally, we need to configure apache2 correctly to load the mod_wsgi module. There are 2 ways we can do this:

  1. The “lazy way”:
    Add the following to the end of the apache2.conf file:
    (located: “/etc/apache2/apache2.conf”)

    LoadModule wsgi_module /usr/lib/apache2/modules/mod_wsgi.so

    Restart Apache2:

    $ sudo systemctl restart apache2

    If apache2 restarts without errors, you have successfully installed and loaded mod_wsgi!

  2. The “proper” way:
    Navigate to “/etc/apache2/mods-available” and create “mod_wsgi.conf”:

    $ sudo touch mod_wsgi.conf

    Add the following to “mod_wsgi.conf”:

    <IfModule mod_wsgi.c>
    </IfModule>

    For further information about the configuration options that can be included in “mod_wsgi.conf”, refer to the docs: https://modwsgi.readthedocs.io/en/develop/configuration.html

    Create “mod_wsgi.load”:

    $ sudo touch mod_wsgi.load

    Add the following to “mod_wsgi.load”:

    LoadModule wsgi_module /usr/lib/apache2/modules/mod_wsgi.so

    Activate module in Apache and follow onscreen instructions:

    $ sudo a2enmod mod_wsgi

    If apache2 restarts without errors, you have successfully installed and loaded mod_wsgi!
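A clean restart shows the module loads, but not that it can run Python. One way to check that, without going into a full Flask/Django deployment, is a tiny WSGI script. This is only a sketch: the response text and the call_locally helper are made up for illustration, and you would still need to point a WSGIScriptAlias directive at wherever you save the file. mod_wsgi looks for a callable named application by default.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Minimal WSGI callable that returns a fixed plain-text response."""
    body = b"mod_wsgi is working\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_locally(app):
    """Hypothetical helper: call the app once without Apache, for a quick sanity check."""
    environ = {}
    setup_testing_defaults(environ)  # fill in a minimal, valid WSGI environ
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Calling call_locally(application) confirms the script itself is sound before Apache is involved, so any remaining error is more likely to be in the Apache configuration.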

SQLite3 and Python3 : Generating Statements

I’ve been working on a project that interacts with a database, and happened upon some interesting problems.

The data I want to input into the database is initially stored in a dict variable.

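  • If you weren’t already aware, the order of a Python dict is not something to rely on, even if you use a blueprint or template. Before Python 3.7 the order was not guaranteed at all, and since then it simply follows insertion order, which can differ from one record to the next. This means that a pre-prepared database statement wouldn’t necessarily align with the values of the dict, causing the data to be written into the wrong fields in a database.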

The dict may contain some or all of the tables’ fields

  • My program will acquire as many data values from input as possible to fully populate the dict. However, I have allowed this to be dynamic, in the sense that if the data is irretrievable, or doesn’t match a regex for the field, it will ignore said field and move on, leaving the dict value at its default of “None”.

I want to make correct use of the API’s escape mechanism

  • This is most important, not only to make fully sure that I’m not inserting “unclean and potentially harmful” data, but also to follow best practice. I’m really sure (like 99%) that the data will be clean. However, mitigating the risk further will bulletproof the commit.

I don’t want to bore you with many lines of code, so I will explain the methodology behind my trials instead.

Many Loops

At first, I penned down the rough idea of the function in longhand.
I started with a loop to store the key/value pairs in 2 separate lists. I had to include a couple of special cases: ignoring “None” values, and ignoring some keys which weren’t destined for the intended table.

For each list, I wrote a loop that builds a string that had to be properly formatted for the final statement.

INSERT INTO x (f1, f2) VALUES(v1, v2)

This worked well but the code really was too verbose. 30~50 lines in fact. I thought about it and lessened the load.

Fewer loops, messy code

Trying to lessen the loops, I went with the same loop to extract the dict key/values as above.

For reference:

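_ = [[], []]  # _[0] collects the keys, _[1] collects the values

ignore = ["ig1", "ig2"]  # keys which aren't destined for the table

for key, value in dict.items():  # .items() is needed to loop over key/value pairs
    if key in ignore:
        continue
    _[0].append(key)
    _[1].append(value)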

I decided to use the string .join() method to create the raw strings from each list, which were then concatenated to make the final statement.

It worked and lessened the loop burden. However, I had forgotten an extremely important step: the SQLite3 escape mechanism.

Success in fewer words

It suddenly dawned on me that with these statement-building methods, I was joining raw data into the string for input without passing it through the escape mechanism. I had also tried:

"{v0}".format(v0=data[0])

This of course gave me the same result.

The fix was fairly simple, however.
The values in the statement needed to be changed from the “data values” to “?”.
The execute method then takes the actual values as a second argument (a sequence such as a tuple) and handles the escaping and “?” substitution itself. I simply converted the list to a tuple.
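To see why this matters, here is a small sketch using an in-memory database. The table x and its single column f1 are stand-ins, and the value is just an ordinary name that happens to contain a quote.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the demonstration
conn.execute("CREATE TABLE x (f1 TEXT)")

value = "O'Brien"  # a perfectly ordinary value that contains a quote

# Building the statement by hand leaves the quote unescaped, which breaks the SQL.
unsafe = "INSERT INTO x (f1) VALUES('{v0}')".format(v0=value)
try:
    conn.execute(unsafe)
    unsafe_worked = True
except sqlite3.OperationalError:
    unsafe_worked = False

# With a "?" placeholder, the sqlite3 module handles the value itself.
conn.execute("INSERT INTO x (f1) VALUES(?)", (value,))
stored = conn.execute("SELECT f1 FROM x").fetchall()
```

The hand-built statement fails on the stray quote, while the placeholder version stores the value unchanged. A malicious value could do far worse than raise an error, which is the real reason to use placeholders.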

The code is now down from around 50 to 15 lines which I’m happy about.

ignore = ["ig1", "ig2"]  # list of dict keys to ignore
col = []  # list of column headings
val = []  # list of column values

# loop over the dict, appending each usable key/value to its list
for key, value in dict.items():
    if key not in ignore and value is not None:  # skip ignored keys and "None" values
        col.append(key)
        val.append(value)

val = tuple(val)  # convert list to tuple

f = ", ".join(col)  # make string of fields
v = ",".join(["?"] * len(col))  # make string of "?"s, one per field
statement = "INSERT INTO x (" + f + ") VALUES(" + v + ")"  # finalise statement

dbc.execute(statement, val)

This code sits right with me. It has taken a few attempts, but the final result gets the job done correctly:

INSERT INTO x (f1, f2, f3) VALUES(?,?,?)
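For anyone who wants to try the idea end to end, here is a self-contained sketch of the same approach wrapped in a function and run against an in-memory database. The build_insert name, the table layout and the sample record are all made up for illustration. Note that table and column names cannot be passed as “?” parameters, so they must come from your own code and never from untrusted input.

```python
import sqlite3


def build_insert(table, record, ignore=()):
    """Build an INSERT statement and matching value tuple from a dict."""
    # Keep only keys that are wanted and actually have a value.
    cols = [k for k, v in record.items() if k not in ignore and v is not None]
    if not cols:
        raise ValueError("nothing to insert")
    placeholders = ",".join("?" for _ in cols)
    statement = f"INSERT INTO {table} ({', '.join(cols)}) VALUES({placeholders})"
    # Values are read in the same order as cols, so they always line up.
    return statement, tuple(record[c] for c in cols)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (f1 TEXT, f2 INTEGER, f3 TEXT)")

record = {"f3": "c", "ig1": "skip me", "f1": "a", "f2": None}
statement, values = build_insert("x", record, ignore=("ig1", "ig2"))
conn.execute(statement, values)
row = conn.execute("SELECT f1, f2, f3 FROM x").fetchone()
```

Because the column list and the value tuple are generated from the same dict in the same pass, the order of the dict stops mattering.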

It goes to show that even something as seemingly trivial as chucking information into a database can require some careful planning, especially when taking limitations (in this case datatypes) and the nature of the task into account. It has taught me some important lessons.

First look at building a configuration file parser – Python3

Intro and context

The project that I’m working on is actually based on a previous (now defunct) project that had to be re-written. I was in the middle of creating a scrape tool to pull data from a website.

The original (I’ll refer to it as MK1) worked really well, until the site was completely re-designed. I always knew this was a risk, but continued with it regardless. Looking back, I could have done more to lessen the impact of unexpected changes. This post is less about the details of the project and more about future-proofing against expected changes, and creating an easier way to do so.

Anyone who has worked with an HTML parser knows that they can only work with what they are given, and if the HTML changes, so does the way the rest of the script behaves. I thought long and hard whilst rethinking the program, and settled on the 3 main objectives I wanted to achieve.

  1. Get data (input)
  2. Extract and order data (process)
  3. Save data (output)

I wanted this to be an automated, unsupervised process. There are (will be) many test cases for when things go wrong, but I still want (and need) to store the breadcrumbs of “broken” data records for completeness.

Being a cup half full kinda guy, I broke MK1 down bit by bit looking for worst case scenarios and weaknesses.


Input

The webpage is the input; it can’t be changed after it’s received. It’s fairly simple to programmatically grab HTML from the internet, but what if I needed multiple pages? URLs change all the time, so how could I speed up the process of changing a list of hard-coded sites in a script? What if I wanted to add a new site entirely?

Ideally, I needed a simpler way, with as little hard-coding as possible, to pull raw data and push it on to the process stage. If things change, the impact will be minimal. I also needed an accessible list of URLs to queue, which can be changed whenever needed.

Process

From the HTML, I want to focus only on the useful elements: the things I need. I look for similarities in lines of text, and I find many different words, phrases and numbers expressing the same things differently. I could dedicate an entire function to each group of data I want to extract, adding to an ever-growing list of if statements or switch cases, some of which may stretch to literally hundreds of lines of code for 25 different cases (like an exaggerated MK1). What if these 25 cases suddenly change? It could mean 100+ lines of code need to be reworked. What if the phrases that I was originally looking for also change?

I opted for files to hold these rules. They can all be read, loaded and used within a single loop, without the need to build them into the script.
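As a rough sketch of what rules-in-a-file could look like, the example below keeps one “field=regular expression” rule per line and applies them all in a single loop. The rule syntax, the field names and the sample text are invented for illustration and are not the actual rules used in the project.

```python
import re

# Stand-in for the contents of a rules file; the syntax is hypothetical.
RULES_TEXT = r"""
# field=regular expression
year=\b((?:19|20)\d{2})\b
price=(\d+\.\d{2})
"""


def load_rules(text):
    """Turn 'field=pattern' lines into a dict of compiled regexes."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and comments
            continue
        field, pattern = line.split("=", 1)  # the pattern itself may contain "="
        rules[field] = re.compile(pattern)
    return rules


def extract(rules, text):
    """Apply every rule to the text; unmatched fields are left as None."""
    found = {}
    for field, pattern in rules.items():
        match = pattern.search(text)
        found[field] = match.group(1) if match else None
    return found


rules = load_rules(RULES_TEXT)
result = extract(rules, "First registered 2014, price 1250.00")
```

When the site changes its wording, only the rules file needs editing; the loop in extract stays the same.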

Output

I know what the output should be and how it should be stored. What if I wanted to add more data to the data set? Maybe I have fragments of data that processing has missed? Should I just discard them?

Here, I decided to include a list of values inside some of the config files used in the input stage. These will correspond to database columns and can be added/changed whenever needed.

Eventually I was able to group the problems together and create a logical solution for them all.

If you notice, there’s a lot of “daisy chaining” going on. I don’t mind this, as config files are a lot easier to manipulate than creating a database and a front end to manage it, and easier still than hardcoding the majority of the variables that are needed.

Creating a parser in python

Essentially, the txt configuration files will contain your own little language and syntax in order to sort and use the data appropriately.

Let’s take this simple configuration:

urls.txt

#this is a configuration file
#syntax: sitename=url

#url1
jsephler=https://jsephler.co.uk

#url2
time50=https://time50.jsephler.co.uk

In the example, we have some familiar sights of a typical config file.

  1. “#” for block comments. We must tell the parser to ignore these
  2. Empty lines (or new line characters) to make the file more human readable. We must tell the parser to ignore these too.
  3. Finally, the configurations. “jsephler=https://jsephler.co.uk”

In Python, we first need to open the configuration file. Let’s assume that “urls.txt” is in the same directory as our script.

openconfig.py

def main():
    urllist = []  # a list for config data
    filename = "urls.txt"  # path to file
    with open(filename, "r") as urlfile:  # open file
        for line in urlfile:  # iterate through file
            if not line.startswith("#") and line.strip():  # ignore "#" comments and blank lines
                tmp = line.strip().split("=", 1)  # strip "whitespace", then split on the first "=" only
                urllist.append(tmp)  # append to list

    print(urllist)  # print list to console

if __name__ == "__main__":  # run "main()" only when executed as a script
    main()

If permissions allow, the script will open our file and refer to it as “urlfile”.

The loop will iterate through every line in the file, while the if statement skips any line that starts with “#” or is blank (just a “\n” new line character).

Before we store our data, we remove whitespace (strip) and separate (split) the string by the “=” character.

Only after this can we append it to our urllist list.

Output should look like this:

[['jsephler', 'https://jsephler.co.uk'], ['time50', 'https://time50.jsephler.co.uk']]

A list of lists, where for each member of urllist:

[0] is the sitename, and [1] is the url.

Breaking this down further, you could have a configuration like this:

jsephler=https://jsephler.co.uk|copy

After the first “=” split, you could split the second member of the list a second time using the “|” character to end up with another 2 pieces of data. “copy” could call a function to do just that: copy!
Obviously, plan ahead and choose the characters wisely. You do not want to use a character that could be included in a URL, as you may need to use it in the future.

By doing this, you can create a config file that’s not only simple but powerful too.
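The two-stage split could be sketched as below. The parse_line and run_actions names, the copy_site function and the ACTIONS table are hypothetical, and the “copy” action only returns a message rather than copying anything. Splitting on the first “=” only is a deliberate choice here, because “=” can legitimately appear later in a URL (in a query string, for example).

```python
def parse_line(line):
    """Return (sitename, url, actions), or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    sitename, rest = line.split("=", 1)  # only the first "=" separates name and value
    url, *actions = rest.split("|")  # anything after a "|" is treated as an action
    return sitename, url, actions


def copy_site(sitename, url):
    """Placeholder action: a real version would do the copying."""
    return f"copying {sitename} from {url}"


ACTIONS = {"copy": copy_site}  # maps action names in the config to functions


def run_actions(parsed):
    """Call the function registered for each known action on a parsed line."""
    sitename, url, actions = parsed
    return [ACTIONS[name](sitename, url) for name in actions if name in ACTIONS]


parsed = parse_line("jsephler=https://jsephler.co.uk|copy")
messages = run_actions(parsed)
```

Adding a new behaviour then means writing one function and adding one entry to ACTIONS, with no change to the parsing code.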

Conclusion

There is a Python config parser library, however I preferred to create my own. My reasons for doing so:

  1. I didn’t actually know it existed until I started writing this post.
  2. It is fairly simple logic and I can tailor the syntax and uses.
  3. You could potentially save on overheads instead of loading and using a separate module.
  4. It’s a lot of fun to experiment with!

For reference, here is the documentation for the standard parser library: https://docs.python.org/3/library/configparser.html
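For comparison, here is a minimal sketch of reading the same two URLs with the standard library. configparser expects settings to sit under a section header, so the “[sites]” section is an addition that the original urls.txt does not have. Interpolation is switched off so that a “%” in a URL would be left alone.

```python
import configparser

# Stand-in for the file contents, with a section header added for configparser.
CONFIG_TEXT = """\
# this is a configuration file
[sites]
jsephler = https://jsephler.co.uk
time50 = https://time50.jsephler.co.uk
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)  # parser.read("urls.txt") would load a real file instead

# Build the same list-of-lists shape as the hand-written parser produces.
urllist = [[name, url] for name, url in parser["sites"].items()]
```

Comments and blank lines are skipped for you, at the cost of following the library's syntax rather than your own.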