CKAN

This chapter documents a CKAN install using Datacats. Features:

  • Datacats will be installed from source inside a virtualenv.
  • The virtualenv will live in /var/venvs/ckan.
  • The datacats installation and environments will live in /var/projects/ckan.
  • The datacats data dir ~/.datacats is symlinked to /mnt/btrfsvol/datacats_data.
  • The directory /var/lib/docker contains all docker images.
  • The directories /var/venvs, /var/projects and /var/lib/docker are symlinked to the external 100 GB volume /mnt/btrfsvol/.
  • Nginx will be configured to reverse-proxy custom subdomains to servers running on local ports.
  • The domain host directs requests for the custom subdomains to the VM’s static IP.
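
The layout above relies on several symlinks into the external volume. A minimal sketch for verifying them follows; it is not part of datacats, the helper name links_into is hypothetical, and it assumes any Python 3 on the host:

```python
import os


def links_into(volume, paths):
    """Map each path to True if it is a symlink resolving to a location under volume."""
    volume = os.path.realpath(volume)
    result = {}
    for path in paths:
        target = os.path.realpath(path)
        inside = target == volume or target.startswith(volume + os.sep)
        result[path] = os.path.islink(path) and inside
    return result


if __name__ == "__main__":
    # Paths and volume as listed in the feature overview above.
    expected = ["/var/venvs", "/var/projects", "/var/lib/docker",
                os.path.expanduser("~/.datacats")]
    for path, ok in links_into("/mnt/btrfsvol", expected).items():
        print(path, "ok" if ok else "NOT linked to the volume")
```

A path that exists as a plain directory is reported as not linked, which is the case to catch before Docker images start filling the root disk.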

A note on conflicting pip and requests packages: if pip fails with ImportError: cannot import name IncompleteRead, run sudo easy_install requests==2.2.1. To avoid this bug, we’ll install datacats (and every other Python-based project) into its own virtualenv, where each project can have its preferred requests version while the system keeps its own, pip-compatible version (e.g. requests==2.2.1).

Datacats install

With the datacats virtualenv activated, clone the datacats repo and pull the Docker images:

workon ckan
(ckan)ubuntu@ip:/var/projects/ckan$

git clone https://github.com/datacats/datacats.git
cd datacats
python setup.py install
datacats pull -a

Datacats environments

Create an environment as per datacats docs:

(ckan)ubuntu@ip:/var/projects/ckan$
datacats create --ckan latest --site-url http://catalogue.alpha.data.wa.gov.au datawagovau 5000

This will create /var/projects/ckan/datawagovau, install ckan and run the server on the given port (here: 5000).
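
To confirm from the host that the new environment is listening, a check along these lines may help. It is a sketch only: port_is_open is a hypothetical helper, not a datacats command, and 5000 is simply the port from the example above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    state = "listening" if port_is_open("127.0.0.1", 5000) else "not reachable"
    print("CKAN on port 5000 is", state)
```

An open port only shows that something accepts connections; it does not prove that CKAN itself rendered a page.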

Reverse proxy the datacats environment

If the environment runs on e.g. port 5000, add this section to /etc/nginx/sites-enabled/base.conf to host the environment on a subdomain:

proxy_cache_path /tmp/nginx_cache levels=1:2 keys_zone=cache:30m max_size=250m;
proxy_temp_path /tmp/nginx_proxy 1 2;

server {
    server_name catalogue.alpha.data.wa.gov.au;
    listen 80;
    client_max_body_size 2G;
    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header Host $host;
        proxy_cache cache;
        proxy_cache_bypass $cookie_auth_tkt;
        proxy_no_cache $cookie_auth_tkt;
        proxy_cache_valid 30m;
        proxy_cache_key $host$scheme$proxy_host$request_uri;
    }
}

Test and apply with sudo service nginx configtest and sudo service nginx reload. This will create a working CKAN without any further extensions. To enable the extensions, follow the next chapter.
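
On a host with several datacats environments, the server block from this section has to be repeated per subdomain and port. The sketch below generates such a block; render_server_block is a hypothetical helper, and for brevity it omits the proxy_cache directives shown above, which would need to be added back for a cached setup.

```python
def render_server_block(server_name, port, max_body="2G"):
    """Return an nginx server block that proxies server_name to a local port."""
    lines = [
        "server {",
        "    server_name %s;" % server_name,
        "    listen 80;",
        "    client_max_body_size %s;" % max_body,
        "    location / {",
        "        proxy_pass http://127.0.0.1:%d;" % port,
        "        proxy_set_header X-Forwarded-For $remote_addr;",
        "        proxy_set_header Host $host;",
        "    }",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Subdomain and port as used in the example above.
    print(render_server_block("catalogue.alpha.data.wa.gov.au", 5000))
```

The output can be reviewed and pasted into the nginx config by hand; the sketch deliberately does not write to /etc/nginx itself.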

Extensions

The following list of extensions displays their installation status on our example CKAN.

The installation process is:

  • installed: extension repo is downloaded and installed into the datacats environment
  • active: extension is enabled in CKAN config
  • working: extension actually works

Between the last two steps lies a varying amount of configuration to the environment, including but not limited to:

  • database additions,
  • running of servers (celery task queue, redis message queue, pycsw server etc.),
  • addition of config files (pycsw, harvester),
  • writing to weird and wonderful locations outside the installation directory (flickrapi being the worst offender).

All these additions have to be applied within the constraints of datacats’ docker-based deployment approach.

Extension                 Functionality                                 Status
ckanext-dcat              Metadata export as RDF                        working
ckanext-pages             Static pages                                  working
ckanext-spatial           Georeferencing (DPaW widget), spatial search  fork working
ckanext-scheming          Custom metadata schema                        fork working
ckanext-pdfview           PDF resource preview                          working
ckanext-geoview           Spatial resource preview                      working
ckanext-cesiumpreview     NationalMap preview                           working
ckanext-harvest           Metadata harvesting                           in dev, currently scripted
pycsw                     CSW endpoint for CKAN                         working
ckan-galleries            Image hosting on CKAN                         some issues
ckanext-doi               DOI minting                                   in dev
ckanext-archiver          Resource file archiving                       working
ckanext-qa                QA checks (e.g. has DOI)                      working
ckanext-hierarchy         Hierarchical organisations                    working
WA data licenses          WA data licensing                             pending license list
ckanext-geopusher         SHP and KML to GeoJSON converter              working
ckanext-featuredviews     Showcase resource views                       works in layout 1
ckanext-showcase          Replace featured items                        working
ckanext-disqus            User comments                                 working
ckanext-datawagovautheme  Data.wa.gov.au theme                          working
ckanapi                   Python client for CKAN API                    working
ckanR                     R client for CKAN API                         working
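
Whether an extension has reached the "active" stage can be checked against the running site. The sketch below assumes that the site exposes CKAN's status_show API action and that its result carries an extensions list mirroring ckan.plugins; the helper names and the sample data are made up for illustration.

```python
import json
from urllib.request import urlopen


def missing_plugins(status, expected):
    """Return the expected plugin names that are absent from a status_show result."""
    active = set(status.get("extensions", []))
    return sorted(set(expected) - active)


def fetch_status(site_url):
    """Fetch the result of the status_show action from a running CKAN site."""
    with urlopen(site_url.rstrip("/") + "/api/3/action/status_show") as resp:
        return json.load(resp)["result"]


if __name__ == "__main__":
    # Made-up sample; on a live site use fetch_status(site_url) instead.
    sample = {"extensions": ["stats", "harvest", "dcat"]}
    print(missing_plugins(sample, ["harvest", "dcat", "spatial_metadata"]))
```

A plugin that shows up here is only "active"; whether it is "working" still has to be tested through the site itself.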

Note: Unless specified otherwise, all code examples are executed as non-root user “ubuntu” (who must be in the docker group) in the CKAN environment’s directory, e.g.:

workon ckan
(ckan)ubuntu@ip:/var/projects/ckan$
# cd into datacats environment "test"
cd test/
(ckan)ubuntu@ip:/var/projects/ckan/test$

Download extensions

Run:

git config --global push.default matching

datacats install

# ckanext-spatial custom fork
git clone git@github.com:datawagovau/ckanext-spatial.git
cd ckanext-spatial
git remote add upstream https://github.com/ckan/ckanext-spatial.git
git fetch upstream
git merge upstream/master master -m 'merge upstream'
git push
cd ..

# ckanext-scheming custom fork
git clone git@github.com:florianm/ckanext-scheming.git
cd ckanext-scheming
git remote add upstream https://github.com/open-data/ckanext-scheming.git
git fetch upstream
git merge upstream/master master -m 'merge upstream'
git push
cd ..

#git clone https://github.com/datawagovau/ckanext-datawagovautheme.git
git clone git@github.com:datawagovau/ckanext-datawagovautheme.git

#git clone https://github.com/ckan/ckanext-pages.git
git clone https://github.com/datawagovau/ckanext-pages.git

# git clone https://github.com/ckan/ckanext-harvest.git
git clone git@github.com:datawagovau/ckanext-harvest.git

git clone https://github.com/ckan/ckanext-archiver.git
git clone https://github.com/datagovau/ckanext-cesiumpreview.git
git clone https://github.com/ckan/ckanext-dcat.git
git clone https://github.com/ckan/ckanext-disqus.git
git clone https://github.com/NaturalHistoryMuseum/ckanext-doi.git
git clone https://github.com/datacats/ckanext-featuredviews.git
#git clone https://github.com/DataShades/ckan-galleries.git
git clone https://github.com/ckan/ckanext-geoview.git
git clone https://github.com/datacats/ckanext-geopusher.git
git clone https://github.com/datagovuk/ckanext-hierarchy.git
git clone https://github.com/ckan/ckanext-pdfview.git
git clone https://github.com/ckan/ckanext-qa.git
git clone https://github.com/ckan/ckanext-showcase.git

git clone https://github.com/ckan/ckanapi.git
git clone https://github.com/geopython/pycsw.git

# pycsw dependencies
sudo apt-get install -y python-dev libxml2-dev libxslt-dev libgeos-dev

Manage dependency conflicts

Before running through this section, note that dependency conflicts are caused by the multiple, independently developed code bases of CKAN and its plugins. Each code base pins the third-party library versions known to work at the time of its release. Naturally, the most established extensions, e.g. spatial and harvesting, have the oldest dependencies, while brand new extensions, e.g. agls, require much newer libraries.

Note: currently, the setup works without this section.

Review possible collisions at http://rshiny.yes-we-ckan.org/ckan-pip-collisions/. Note that the following example lists dependencies current as of October 2015 and will go out of date quickly. We recommend researching your own version conflicts and using this example as a how-to guide, but with your own dependencies. In our example the following packages have differing, hard-coded requirements:

grep -rn --include="*requirements*" 'requests' .
grep -rn --include="*requirements*" 'six' .
grep -rn --include="*requirements*" 'lxml' .
grep -rn --include="*requirements*" 'python-dateutil' .
grep -rn --include="*requirements*" 'SQLAlchemy' .

We’ll need to update all colliding requirement versions to one that works across all extensions. In our case, a simple bump to the highest mentioned version will work, such as with the perfectly backwards compatible requests library. In other cases, breaking changes between different dependency versions could require an upgrade to an actual extension.
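
Instead of grepping for one package at a time, the colliding pins can also be collected in one pass. The following is a sketch under the assumption that all relevant files have "requirements" in their name and use plain pip specifiers; find_conflicts is a hypothetical helper, and lines such as "-e git+..." are skipped.

```python
import os
import re
from collections import defaultdict

# Matches lines such as "requests==2.7.0" or "six>=1.9"; other lines are skipped.
PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*((?:==|>=|<=|~=|>|<)\s*[^\s#;]+)")


def find_conflicts(root):
    """Map package name -> {version spec: [files]} for packages with differing specs."""
    seen = defaultdict(lambda: defaultdict(list))
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if "requirements" not in name:
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as handle:
                for line in handle:
                    match = PIN.match(line)
                    if match:
                        spec = match.group(2).replace(" ", "")
                        seen[match.group(1).lower()][spec].append(path)
    return {pkg: dict(specs) for pkg, specs in seen.items() if len(specs) > 1}
```

Calling find_conflicts(".") from the environment directory returns only the packages whose version specifiers differ between files, together with the files that mention each specifier.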

The batch modifications of version numbers shown here work on our listed extensions at the time of writing; adapt them to your actual needs. Warning: a mistake in this step could corrupt your installed code (including the CKAN source), requiring you to git checkout the incorrectly modified files in each repo:

grep -rl --include="*requirements*" 'requests' . | xargs sed -i 's/^.*requests.*$/requests==2.7.0/g'
grep -rl --include="*requirements*" 'six' . | xargs sed -i 's/^.*six.*$/six==1.9.0/g'
grep -rl --include="*requirements*" 'lxml' . | xargs sed -i 's/^.*lxml.*$/lxml==3.4.4/g'
grep -rl --include="*requirements*" 'python-dateutil' . | xargs sed -i 's/^.*python-dateutil.*$/python-dateutil==2.4.2/g'
grep -rl --include="*requirements*" 'SQLAlchemy' . | xargs sed -i 's/^.*SQLAlchemy.*$/SQLAlchemy==0.9.6/g'

# review version numbers
grep -rn --include="*requirements*" 'requests' .
grep -rn --include="*requirements*" 'six' .
grep -rn --include="*requirements*" 'lxml' .
grep -rn --include="*requirements*" 'python-dateutil' .
grep -rn --include="*requirements*" 'SQLAlchemy' .

# any other requirements conflicts?
cat `find . -name '*requirements*'` | sort | uniq
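
The sed one-liners above replace every line that merely contains the package name, so a pin for requests would also overwrite a line for a differently named package such as requests-oauthlib. A more conservative variant is sketched below; pin_package is a hypothetical helper that only rewrites lines starting with the exact package name, and it assumes the same "requirements" file naming as the grep commands.

```python
import os
import re


def pin_package(root, package, version):
    """Pin `package` to `version` in requirements files under root; return changed paths."""
    # Package name at line start, then a version operator or end of line,
    # so "requests-oauthlib" is left alone when pinning "requests".
    pattern = re.compile(r"^%s\s*(?:[=<>~!]=?.*)?$" % re.escape(package), re.IGNORECASE)
    pinned = "%s==%s" % (package, version)
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if "requirements" not in name:
                continue
            path = os.path.join(dirpath, name)
            with open(path) as handle:
                lines = handle.read().splitlines()
            new = [pinned if pattern.match(line.strip()) else line for line in lines]
            if new != lines:
                with open(path, "w") as handle:
                    handle.write("\n".join(new) + "\n")
                changed.append(path)
    return changed
```

For example, pin_package(".", "requests", "2.7.0") mirrors the first sed command. The warning above still applies: the files are modified in place, so review the result with git diff in each repo.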

To fix issues with any dependency versions:

datacats shell
pip freeze | grep lchemy
pip install SQLAlchemy==0.9.6
exit

E.g., this is necessary when receiving this error on datacats reload:

File "/usr/lib/ckan/local/lib/python2.7/site-packages/geoalchemy2/comparator.py", line 52, in <module>
class BaseComparator(UserDefinedType.Comparator):
AttributeError: type object 'UserDefinedType' has no attribute 'Comparator'
Starting subprocess with file monitor

Install extensions

To install all extensions and their dependencies in the site’s environment, run:

datacats install

Modify datacats containers

Some extensions require modifications to the database, or additional servers, such as a message queue (redis) or a task runner (celery). The commands below follow the ckanext-spatial and ckanext-harvest docs, run through datacats’ paster command:

# (re)install postgis, add redis
datacats tweak --install-postgis
datacats tweak --add-redis
# datacats tweak --add-pycsw # soon
datacats reload
# pulls redis image

# initdb for spatial
cd ckanext-spatial
datacats paster spatial initdb
cd ..

# initdb for harvester, plus two celery containers, see also below
cd ckanext-harvest
datacats paster harvester initdb
datacats paster -d harvester gather_consumer
datacats paster -d harvester fetch_consumer
cd ..

Note: git init the theme extension (ckanext-SITEtheme) to preserve significant customisations.

Config

General procedure:

  • Edit the config (vim development.ini) and replace everything from “Authorization Settings” onwards with the settings below.
  • Apply changes with datacats reload. That should be it!

development.ini:

## Authorization Settings
 ckan.auth.anon_create_dataset = false
 ckan.auth.create_unowned_dataset = false
 ckan.auth.create_dataset_if_not_in_organization = false
 ckan.auth.user_create_groups = true
 ckan.auth.user_create_organizations = false
 ckan.auth.user_delete_groups = true
 ckan.auth.user_delete_organizations = false
 ckan.auth.create_user_via_api = true
 ckan.auth.create_user_via_web = true
 ckan.auth.roles_that_cascade_to_sub_groups = admin editor member

 ## Search Settings
 ckan.site_id = default
 solr_url = http://solr:8080/solr

 ## CORS Settings
 ckan.cors.origin_allow_all = true

 ## Plugins Settings
 base = cesium_viewer resource_proxy datastore datapusher datawagovau_theme stats archiver qa featuredviews showcase disqus
 sch = scheming_datasets
 rcl = recline_grid_view recline_graph_view recline_map_view
 prv = text_view image_view recline_view pdf_view webpage_view
 geo = geo_view geojson_view
 spt = spatial_metadata spatial_query geopusher
 hie = hierarchy_display hierarchy_form
 dcat = dcat dcat_rdf_harvester dcat_json_harvester dcat_json_interface
 hrv = harvest ckan_harvester csw_harvester
 pkg = datapackager downloadtdf
 ckan.plugins = %(base)s %(sch)s %(rcl)s %(prv)s %(dcat)s %(geo)s %(spt)s %(hrv)s %(hie)s
 #%(pkg)s ## missing ckan branch datapackager

 ckanext.geoview.ol_viewer.formats = wms wfs gml kml arcgis_rest gft
 ckan.views.default_views = cesium_view %(prv)s geojson_view


 # ckanext-scheming
 scheming.dataset_schemas = ckanext.datawagovautheme:datawagovau_dataset.json
 #scheming.organization_schemas = ckanext.datawagovautheme:datawagovau_organization.json

 # ckanext-harvest
 ckan.harvest.mq.type = redis
 ckan.harvest.mq.hostname = redis
 ckanext.spatial.harvest.continue_on_validation_errors= True

 # ckanext-pages
 ckanext.pages.organization = True
 ckanext.pages.group = True
 # disable to make space for static pages:
 ckanext.pages.about_menu = True
 ckanext.pages.group_menu = True
 ckanext.pages.organization_menu = True

 # ckanext-disqus
 # add Engage to site > add a subaccount to your disqus account for this CKAN
 # choose name = disqus.name
 # settings > advanced >
 # add %(site_url)s to trusted domains, e.g. catalogue.beta.data.wa.gov.au
 disqus.name = xxxx

 ## Front-End Settings
 ckan.site_title = Parks & Wildlife Data
 ckan.site_logo = /logo.png
 ckan.site_description =
 ckan.favicon = /favicon.ico
 ckan.gravatar_default = identicon
 ckan.preview.direct = png jpg gif
 ckan.preview.loadable = html htm rdf+xml owl+xml xml n3 n-triples turtle plain atom csv tsv rss txt json
 ckan.display_timezone = server
 # package_hide_extras = for_search_index_only
 #package_edit_return_url = http://another.frontend/dataset/<NAME>
 #package_new_return_url = http://another.frontend/dataset/<NAME>
 #licenses_group_url = http://licenses.opendefinition.org/licenses/groups/ckan.json
 # ckan.template_footer_end =
 ckan.recaptcha.version = 1
 ckan.recaptcha.publickey = xxxx
 ckan.recaptcha.privatekey = xxxx

 ## Internationalisation Settings
 ckan.locale_default = en_AU
 ckan.locale_order = en_AU pt_BR ja it cs_CZ ca es fr el sv sr sr@latin no sk fi ru de pl nl bg ko_KR hu sa sl lv
 ckan.locales_offered =
 ckan.locales_filtered_out = en_GB

 ## Feeds Settings
 ckan.feeds.authority_name =
 ckan.feeds.date =
 ckan.feeds.author_name =
 ckan.feeds.author_link =

 ## Storage Settings
 ckan.storage_path = /var/www/storage
 #ckan.max_resource_size = 10

 ## Datapusher settings
 # Make sure you have set up the DataStore
 ckan.datapusher.formats = csv xls xlsx tsv application/csv application/vnd.ms-excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
 ckan.datapusher.url = http://datapusher:8800

 # Resource Proxy settings
 ckan.max_resource_size = 1000000
 ckan.max_image_size = 200000
 ckan.resource_proxy.max_file_size = 31457280

 ## Activity Streams Settings
 ckan.activity_streams_enabled = true
 ckan.activity_list_limit = 31
 #ckan.activity_streams_email_notifications = true
 #ckan.email_notifications_since = 2 days
 ckan.hide_activity_from_users = %(ckan.site_id)s

 ## Email settings
 email_to = xxxx
 error_email_from = xxxx
 smtp.server = smtp.gmail.com:587
 smtp.starttls = True
 smtp.user = xxxx
 smtp.password = xxxx
 smtp.mail_from = xxxx

 ## Logging configuration
 [loggers]
 keys = root, ckan, ckanext
 [handlers]
 keys = console
 [formatters]
 keys = generic
 [logger_root]
 level = WARNING
 handlers = console
 [logger_ckan]
 level = INFO
 handlers = console
 qualname = ckan
 propagate = 0
 [logger_ckanext]
 level = INFO
 handlers = console
 qualname = ckanext
 propagate = 0
 [handler_console]
 class = StreamHandler
 args = (sys.stderr,)
 level = NOTSET
 formatter = generic
 [formatter_generic]
 format = %(asctime)s %(levelname)-5.5s [%(name)s] %(message)s
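
The plugin groups in the config above (base, sch, rcl, ...) are ordinary ini options that are spliced into ckan.plugins through %(name)s interpolation. The sketch below illustrates that expansion with Python 3's configparser on a trimmed sample; it assumes the settings live in an [app:main] section, and it only mimics the behaviour rather than using CKAN's own config loader. active_plugins is a hypothetical helper.

```python
import configparser

# Trimmed sample in the style of the config above, not the full file.
SAMPLE = """
[app:main]
base = stats archiver qa
geo = geo_view geojson_view
hrv = harvest ckan_harvester csw_harvester
ckan.plugins = %(base)s %(geo)s %(hrv)s
"""


def active_plugins(ini_text, section="app:main"):
    """Return the expanded ckan.plugins list from ini-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # get() applies %(name)s interpolation using options of the same section.
    return parser.get(section, "ckan.plugins").split()


if __name__ == "__main__":
    print(active_plugins(SAMPLE))
```

A group that is defined but not referenced in ckan.plugins, like pkg in the config above, stays inactive, which makes it easy to switch whole groups of plugins on and off.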

PyCSW

While our contribution is in development, we’ll manually build and run a dockerised pycsw using our datacats fork:

cd /var/projects/ckan/datacats/docker/pycsw/
docker build -t datacats/pycsw .
docker run -d -p 9000:8000 -it datacats/pycsw python /var/www/pycsw/csw.wsgi

This will build a pycsw server image with harvesting enabled (transactions) for non-local IPs and run a pycsw server on localhost:9000. See also nginx settings in Deployment to expose the csw server publicly.
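
A quick smoke test for the container is a CSW GetCapabilities request. The sketch below builds the request URL and, optionally, fetches it; it assumes the endpoint is served at the root path of localhost:9000 and speaks CSW 2.0.2, and both helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def capabilities_url(base_url):
    """Build a CSW 2.0.2 GetCapabilities request URL for the given endpoint."""
    query = urlencode({"service": "CSW", "version": "2.0.2",
                       "request": "GetCapabilities"})
    return base_url.rstrip("?") + "?" + query


def fetch_capabilities(base_url, timeout=10):
    """Return the raw GetCapabilities response body (requires a running server)."""
    with urlopen(capabilities_url(base_url), timeout=timeout) as resp:
        return resp.read()


if __name__ == "__main__":
    # Only prints the URL; call fetch_capabilities() on the host running the container.
    print(capabilities_url("http://localhost:9000/"))
```

If the server is up, the response should be an XML capabilities document; an error or timeout points to the container or the port mapping.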