CKAN¶
This chapter documents a CKAN install using Datacats. Features:
- Datacats will be installed from source inside a virtualenv.
- The virtualenv will live in /var/venvs/ckan.
- The datacats installation and environments will live in /var/projects/ckan.
- The datacats data dir ~/.datacats is symlinked to /mnt/btrfsvol/datacats_data.
- The directory /var/lib/docker contains all docker images.
- The directories /var/venvs, /var/projects and /var/lib/docker are symlinked to the external 100 GB volume /mnt/btrfsvol/.
- Nginx will be configured to reverse-proxy custom subdomains to servers running on local ports.
- The domain hosting redirects requests to the custom subdomains to the VM’s static IP.
A note on conflicting pip and requests packages:
If pip gets ImportError: cannot import name IncompleteRead, run sudo easy_install requests==2.2.1.
To avoid this bug, we’ll install datacats (and every other Python-based project) into its own virtualenv, where each project can have its preferred requests version and the system can keep its own, pip-compatible version (e.g. requests==2.2.1).
Directories and symlinks¶
With virtualenvwrapper installed and sourced from ~/.bashrc, create virtualenv and project directories for datacats:
mkproject ckan
With our custom settings, this will create /var/projects/ckan as project directory, and /var/venvs/ckan for the virtualenv.
It will also enable the virtualenv. Deactivate and reactivate to use the virtualenv’s binaries rather than the system-wide ones.
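The custom settings themselves are not listed in this chapter. A minimal sketch of what they could look like in ~/.bashrc, assuming virtualenvwrapper's standard WORKON_HOME and PROJECT_HOME variables; the location of virtualenvwrapper.sh is an assumption and varies between systems:

```shell
# Sketch of ~/.bashrc additions. The two directories match this chapter;
# the script path below is hypothetical and must be adapted.
export WORKON_HOME=/var/venvs        # where virtualenvs are created
export PROJECT_HOME=/var/projects    # where mkproject creates project directories

VIRTUALENVWRAPPER_SCRIPT=/usr/local/bin/virtualenvwrapper.sh   # assumed location
if [ -f "$VIRTUALENVWRAPPER_SCRIPT" ]; then
    # Provides mkproject, workon, deactivate etc.
    . "$VIRTUALENVWRAPPER_SCRIPT"
fi
```

With these in place, mkproject ckan would put the project under $PROJECT_HOME and the virtualenv under $WORKON_HOME.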
Create the symlink from ~/.datacats to /mnt/btrfsvol/datacats_data before any datacats environment is created. Otherwise, ~/.datacats will contain files owned by other users (root, postgres) and will have to be moved by the root user and chowned to the current user, while all datacats environments are stopped.
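A minimal sketch of that symlink step, assuming the data directory named in the feature list and that ~/.datacats does not exist yet. link_data_dir is a hypothetical helper for illustration, not part of datacats:

```shell
# Hypothetical helper: create the target directory if needed and link to it.
# It refuses to touch an existing path, so no data is overwritten.
link_data_dir() {
    target_dir="$1"   # real directory, e.g. /mnt/btrfsvol/datacats_data
    link_path="$2"    # symlink to create, e.g. ~/.datacats
    if [ -e "$link_path" ] || [ -L "$link_path" ]; then
        echo "refusing to replace existing $link_path" >&2
        return 1
    fi
    mkdir -p "$target_dir" && ln -s "$target_dir" "$link_path"
}

# Usage, before the first "datacats create":
#   link_data_dir /mnt/btrfsvol/datacats_data "$HOME/.datacats"
```

The existence check matters because ln -s into an existing directory would create the link inside that directory instead of replacing it.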
Datacats install¶
With the datacats virtualenv activated, clone the datacats repo, install it from source and pull the Docker images:
workon ckan
(ckan)ubuntu@ip:/var/projects/ckan$
git clone https://github.com/datacats/datacats.git
cd datacats
python setup.py install
datacats pull -a
Datacats environments¶
Create an environment as per datacats docs:
(ckan)ubuntu@ip:/var/projects/ckan$
datacats create --ckan latest --site-url http://catalogue.alpha.data.wa.gov.au datawagovau 5000
This will create /var/projects/ckan/datawagovau, install ckan and run the server on the given port (here: 5000).
Reverse proxy the datacats environment¶
If the environment runs on e.g. port 5000, add this section to /etc/nginx/sites-enabled/base.conf to host the environment on a subdomain:
proxy_cache_path /tmp/nginx_cache levels=1:2 keys_zone=cache:30m max_size=250m;
proxy_temp_path /tmp/nginx_proxy 1 2;
server {
    server_name catalogue.alpha.data.wa.gov.au;
    listen 80;
    client_max_body_size 2G;
    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header Host $host;
        proxy_cache cache;
        proxy_cache_bypass $cookie_auth_tkt;
        proxy_no_cache $cookie_auth_tkt;
        proxy_cache_valid 30m;
        proxy_cache_key $host$scheme$proxy_host$request_uri;
    }
}
Test and apply with sudo service nginx configtest and sudo service nginx reload.
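Once several environments share one nginx config, it helps to see which subdomain points at which local port. A small sketch using awk, assuming one server_name and one proxy_pass directive per server block as in the example above; list_proxy_targets is a hypothetical helper:

```shell
# Hypothetical helper: print "subdomain upstream-url" pairs from an nginx config.
# Assumes each server block has a single server_name followed by one proxy_pass.
list_proxy_targets() {
    conf_file="${1:-/etc/nginx/sites-enabled/base.conf}"
    awk '
        # remember the most recent server_name, without the trailing semicolon
        $1 == "server_name" { name = $2; sub(/;$/, "", name) }
        # print it together with the upstream it is proxied to
        $1 == "proxy_pass"  { url = $2;  sub(/;$/, "", url); print name, url }
    ' "$conf_file"
}

# Usage:
#   list_proxy_targets /etc/nginx/sites-enabled/base.conf
```

This makes it easier to pick an unused port before creating another datacats environment.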
This will create a working CKAN without any further extensions. To enable the extensions, follow the next chapter.
Extensions¶
The following list of extensions displays their installation status on our example CKAN.
The installation process is:
- installed: extension repo is downloaded and installed into the datacats environment
- active: extension is enabled in CKAN config
- working: extension actually works
Between the last two steps lies a varying amount of configuration to the environment, including but not limited to:
- database additions,
- running of servers (celery task queue, redis message queue, pycsw server etc.),
- addition of config files (pycsw, harvester),
- writing to weird and wonderful locations outside the installation directory (flickrapi being the worst offender).
All these additions have to be applied within the constraints of datacats’ docker-based deployment approach.
Extension | Functionality | Status |
---|---|---|
ckanext-dcat | Metadata export as RDF | working |
ckanext-pages | Static pages | working |
ckanext-spatial | Georeferencing (DPaW widget), spatial search | fork working |
ckanext-scheming | Custom metadata schema | fork working |
ckanext-pdfview | PDF resource preview | working |
ckanext-geoview | Spatial resource preview | working |
ckanext-cesiumpreview | NationalMap preview | working |
ckanext-harvest | Metadata harvesting | in dev, currently scripted |
pycsw | CSW endpoint for CKAN | working |
ckan-galleries | Image hosting on CKAN | some issues |
ckanext-doi | DOI minting | in dev |
ckanext-archiver | Resource file archiving | working |
ckanext-qa | QA checks (e.g. has DOI) | working |
ckanext-hierarchy | Hierarchical organisations | working |
WA data licenses | WA data licensing | pending license list |
ckanext-geopusher | SHP and KML to GeoJSON converter | working |
ckanext-featuredviews | Showcase resource views | works in layout 1 |
ckanext-showcase | Replace featured items | working |
ckanext-disqus | User comments | working |
ckanext-datawagovautheme | Data.wa.gov.au theme | working |
ckanapi | Python client for CKAN API | working |
ckanR | R client for CKAN API | working |
Note: Unless specified otherwise, all code examples are executed as non-root user “ubuntu” (who must be in the docker group) in the CKAN environment’s directory, e.g.:
workon ckan
(ckan)ubuntu@ip:/var/projects/ckan/
# cd into datacats environment "test"
cd test/
(ckan)ubuntu@ip:/var/projects/ckan/test$
Download extensions¶
Run:
git config --global push.default matching
datacats install
# ckanext-spatial custom fork
git clone git@github.com:datawagovau/ckanext-spatial.git
cd ckanext-spatial
git remote add upstream https://github.com/ckan/ckanext-spatial.git
git fetch upstream
git merge upstream/master master -m 'merge upstream'
git push
cd ..
# ckanext-scheming custom fork
git clone git@github.com:florianm/ckanext-scheming.git
cd ckanext-scheming
git remote add upstream https://github.com/open-data/ckanext-scheming.git
git fetch upstream
git merge upstream/master master -m 'merge upstream'
git push
cd ..
#git clone https://github.com/datawagovau/ckanext-datawagovautheme.git
git clone git@github.com:datawagovau/ckanext-datawagovautheme.git
#git clone https://github.com/ckan/ckanext-pages.git
git clone https://github.com/datawagovau/ckanext-pages.git
# git clone https://github.com/ckan/ckanext-harvest.git
git clone git@github.com:datawagovau/ckanext-harvest.git
git clone https://github.com/ckan/ckanext-archiver.git
git clone https://github.com/datagovau/ckanext-cesiumpreview.git
git clone https://github.com/ckan/ckanext-dcat.git
git clone https://github.com/ckan/ckanext-disqus.git
git clone https://github.com/NaturalHistoryMuseum/ckanext-doi.git
git clone https://github.com/datacats/ckanext-featuredviews.git
#git clone https://github.com/DataShades/ckan-galleries.git
git clone https://github.com/ckan/ckanext-geoview.git
git clone https://github.com/datacats/ckanext-geopusher.git
git clone https://github.com/datagovuk/ckanext-hierarchy.git
git clone https://github.com/ckan/ckanext-pdfview.git
git clone https://github.com/ckan/ckanext-qa.git
git clone https://github.com/ckan/ckanext-showcase.git
git clone https://github.com/ckan/ckanapi.git
git clone https://github.com/geopython/pycsw.git
# pycsw dependencies
sudo apt-get install -y python-dev libxml2-dev libxslt-dev libgeos-dev
Manage dependency conflicts¶
Before running through this section, note that dependency conflicts are caused by multiple independently developed code bases of ckan and its plugins. Each code base pins third party library versions known to work at the time of release. Naturally, the most established extensions, e.g. spatial and harvesting, have the oldest dependencies, while brand new extensions, e.g. agls, require much newer libraries.
Note: currently, the setup works without this section.
Review possible collisions at http://rshiny.yes-we-ckan.org/ckan-pip-collisions/. Note that the following example lists dependencies current as of October 2015 and will go out of date quickly. We recommend researching your own version conflicts and using this example as a how-to guide with your own dependencies. In our example the following packages have differing, hard-coded requirements:
grep -rn --include="*requirements*" 'requests' .
grep -rn --include="*requirements*" 'six' .
grep -rn --include="*requirements*" 'lxml' .
grep -rn --include="*requirements*" 'python-dateutil' .
grep -rn --include="*requirements*" 'SQLAlchemy' .
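The greps above only cover packages already known to collide. To discover colliding pins without naming them first, a sketch like the following could be used. It only looks at exact pins (name==version) and ignores ranges and editable installs; find_pin_conflicts is a hypothetical helper:

```shell
# Hypothetical helper: list packages pinned with "==" to more than one
# version in any *requirements* file below a directory.
find_pin_conflicts() {
    search_dir="${1:-.}"
    find "$search_dir" -type f -name '*requirements*' -exec cat {} + |
        sed -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]//g' |   # drop comments and blanks
        grep -E '^[A-Za-z0-9._-]+==[^=]+$' |                     # keep exact pins only
        tr '[:upper:]' '[:lower:]' |                             # package names are case-insensitive
        sort -u |                                                # identical pins collapse to one line
        awk -F'==' '
            { versions[$1] = versions[$1] " " $2; count[$1]++ }
            END { for (pkg in count) if (count[pkg] > 1) print pkg ":" versions[pkg] }
        ' |
        sort
}

# Usage, from the datacats environment directory:
#   find_pin_conflicts .
```

Each output line names one package followed by all versions it is pinned to, which gives the list of packages that need a common version.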
We’ll need to update all colliding requirement versions to one that works across all extensions.
In our case, a simple bump to the highest mentioned version will work, such as with the perfectly backwards compatible requests library.
In other cases, breaking changes between different dependency versions could require an upgrade to an actual extension.
The batch modifications of version numbers shown here work on our listed extensions at the time of writing.
Modify to your actual needs. Warning: a mistake in this step could corrupt your installed code (including CKAN source), requiring you to git checkout the incorrectly modified files in each repo:
grep -rl --include="*requirements*" 'requests' . | xargs sed -i 's/^.*requests.*$/requests==2.7.0/g'
grep -rl --include="*requirements*" 'six' . | xargs sed -i 's/^.*six.*$/six==1.9.0/g'
grep -rl --include="*requirements*" 'lxml' . | xargs sed -i 's/^.*lxml.*$/lxml==3.4.4/g'
grep -rl --include="*requirements*" 'python-dateutil' . | xargs sed -i 's/^.*python-dateutil.*$/python-dateutil==2.4.2/g'
grep -rl --include="*requirements*" 'SQLAlchemy' . | xargs sed -i 's/^.*SQLAlchemy.*$/SQLAlchemy==0.9.6/g'
# review version numbers
grep -rn --include="*requirements*" 'requests' .
grep -rn --include="*requirements*" 'six' .
grep -rn --include="*requirements*" 'lxml' .
grep -rn --include="*requirements*" 'python-dateutil' .
grep -rn --include="*requirements*" 'SQLAlchemy' .
# any other requirements conflicts?
cat `find . -name '*requirements*'` | sort | uniq
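The batch edit above rewrites every line that merely contains the package name, so a line for a differently named package such as requests-oauthlib would be overwritten too. A more conservative sketch, assuming GNU grep, sed and xargs (as on Ubuntu) and file names without spaces; pin_requirement is a hypothetical helper:

```shell
# Hypothetical helper: pin one package to one version in all *requirements*
# files below a directory, touching only lines that start with exactly that
# package name (followed by a version specifier, a space, or end of line).
pin_requirement() {
    pkg="$1"              # exact package name, e.g. requests
    version="$2"          # version to pin, e.g. 2.7.0
    search_dir="${3:-.}"
    grep -rlE --include='*requirements*' "^${pkg}([=<>!~ ]|\$)" "$search_dir" |
        xargs -r sed -i -E "s/^${pkg}([=<>!~ ].*)?\$/${pkg}==${version}/"
}

# Usage:
#   pin_requirement requests 2.7.0 .
#   pin_requirement SQLAlchemy 0.9.6 .
```

The same warning applies: review the result with git diff in each repo before running datacats install.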
To fix issues with any dependency versions:
datacats shell
pip freeze | grep lchemy
pip install SQLAlchemy==0.9.6
exit
E.g., this is necessary when receiving this error on datacats reload:
File "/usr/lib/ckan/local/lib/python2.7/site-packages/geoalchemy2/comparator.py", line 52, in <module>
class BaseComparator(UserDefinedType.Comparator):
AttributeError: type object 'UserDefinedType' has no attribute 'Comparator'
Starting subprocess with file monitor
Install extensions¶
To install all extensions and their dependencies in the site’s environment, run:
datacats install
Modify datacats containers¶
Some extensions require modifications to the database, or additional servers, such as a message queue (redis) or a task runner (celery). The following steps adapt the ckanext-spatial docs and ckanext-harvest docs to datacats’ paster command:
# (re)install postgis, add redis
datacats tweak --install-postgis
datacats tweak --add-redis
# datacats tweak --add-pycsw # soon
datacats reload
# pulls redis image
# initdb for spatial
cd ckanext-spatial
datacats paster spatial initdb
cd ..
# initdb for harvester, plus two celery containers, see also below
cd ckanext-harvest
datacats paster harvester initdb
datacats paster -d harvester gather_consumer
datacats paster -d harvester fetch_consumer
cd ..
Note: git init the theme extension (ckanext-SITEtheme) to preserve significant customisations.
Config¶
General procedure:
- Edit the config with vim development.ini and replace everything from “Authorization Settings” onwards with the settings below.
- Apply changes with datacats reload. That should be it!
development.ini:
## Authorization Settings
ckan.auth.anon_create_dataset = false
ckan.auth.create_unowned_dataset = false
ckan.auth.create_dataset_if_not_in_organization = false
ckan.auth.user_create_groups = true
ckan.auth.user_create_organizations = false
ckan.auth.user_delete_groups = true
ckan.auth.user_delete_organizations = false
ckan.auth.create_user_via_api = true
ckan.auth.create_user_via_web = true
ckan.auth.roles_that_cascade_to_sub_groups = admin editor member
## Search Settings
ckan.site_id = default
solr_url = http://solr:8080/solr
## CORS Settings
ckan.cors.origin_allow_all = true
## Plugins Settings
base = cesium_viewer resource_proxy datastore datapusher datawagovau_theme stats archiver qa featuredviews showcase disqus
sch = scheming_datasets
rcl = recline_grid_view recline_graph_view recline_map_view
prv = text_view image_view recline_view pdf_view webpage_view
geo = geo_view geojson_view
spt = spatial_metadata spatial_query geopusher
hie = hierarchy_display hierarchy_form
dcat = dcat dcat_rdf_harvester dcat_json_harvester dcat_json_interface
hrv = harvest ckan_harvester csw_harvester
pkg = datapackager downloadtdf
ckan.plugins = %(base)s %(sch)s %(rcl)s %(prv)s %(dcat)s %(geo)s %(spt)s %(hrv)s %(hie)s
#%(pkg)s ## missing ckan branch datapackager
ckanext.geoview.ol_viewer.formats = wms wfs gml kml arcgis_rest gft
ckan.views.default_views = cesium_view %(prv)s geojson_view
# ckanext-scheming
scheming.dataset_schemas = ckanext.datawagovautheme:datawagovau_dataset.json
#scheming.organization_schemas = ckanext.datawagovautheme:datawagovau_organization.json
# ckanext-harvest
ckan.harvest.mq.type = redis
ckan.harvest.mq.hostname = redis
ckanext.spatial.harvest.continue_on_validation_errors= True
# ckanext-pages
ckanext.pages.organization = True
ckanext.pages.group = True
# disable to make space for static pages:
ckanext.pages.about_menu = True
ckanext.pages.group_menu = True
ckanext.pages.organization_menu = True
# ckanext-disqus
# add Engage to site > add a subaccount to your disqus account for this CKAN
# choose name = disqus.name
# settings > advanced >
# add %(site_url)s to trusted domains, e.g. catalogue.beta.data.wa.gov.au
disqus.name = xxxx
## Front-End Settings
ckan.site_title = Parks & Wildlife Data
ckan.site_logo = /logo.png
ckan.site_description =
ckan.favicon = /favicon.ico
ckan.gravatar_default = identicon
ckan.preview.direct = png jpg gif
ckan.preview.loadable = html htm rdf+xml owl+xml xml n3 n-triples turtle plain atom csv tsv rss txt json
ckan.display_timezone = server
# package_hide_extras = for_search_index_only
#package_edit_return_url = http://another.frontend/dataset/<NAME>
#package_new_return_url = http://another.frontend/dataset/<NAME>
#licenses_group_url = http://licenses.opendefinition.org/licenses/groups/ckan.json
# ckan.template_footer_end =
ckan.recaptcha.version = 1
ckan.recaptcha.publickey = xxxx
ckan.recaptcha.privatekey = xxxx
## Internationalisation Settings
ckan.locale_default = en_AU
ckan.locale_order = en_AU pt_BR ja it cs_CZ ca es fr el sv sr sr@latin no sk fi ru de pl nl bg ko_KR hu sa sl lv
ckan.locales_offered =
ckan.locales_filtered_out = en_GB
## Feeds Settings
ckan.feeds.authority_name =
ckan.feeds.date =
ckan.feeds.author_name =
ckan.feeds.author_link =
## Storage Settings
ckan.storage_path = /var/www/storage
#ckan.max_resource_size = 10
## Datapusher settings
# Make sure you have set up the DataStore
ckan.datapusher.formats = csv xls xlsx tsv application/csv application/vnd.ms-excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
ckan.datapusher.url = http://datapusher:8800
# Resource Proxy settings
ckan.max_resource_size = 1000000
ckan.max_image_size = 200000
ckan.resource_proxy.max_file_size = 31457280
## Activity Streams Settings
ckan.activity_streams_enabled = true
ckan.activity_list_limit = 31
#ckan.activity_streams_email_notifications = true
#ckan.email_notifications_since = 2 days
ckan.hide_activity_from_users = %(ckan.site_id)s
## Email settings
email_to = xxxx
error_email_from = xxxx
smtp.server = smtp.gmail.com:587
smtp.starttls = True
smtp.user = xxxx
smtp.password = xxxx
smtp.mail_from = xxxx
## Logging configuration
[loggers]
keys = root, ckan, ckanext
[handlers]
keys = console
[formatters]
keys = generic
[logger_root]
level = WARNING
handlers = console
[logger_ckan]
level = INFO
handlers = console
qualname = ckan
propagate = 0
[logger_ckanext]
level = INFO
handlers = console
qualname = ckanext
propagate = 0
[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic
[formatter_generic]
format = %(asctime)s %(levelname)-5.5s [%(name)s] %(message)s
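The ckan.plugins line above is assembled from helper variables such as %(base)s and %(hrv)s. A reference to a variable that is misspelled or was deleted is easy to miss when editing the file by hand. A small sketch that lists such dangling references, assuming the helper variables are defined as plain "name = value" lines in the same file; undefined_plugin_groups is a hypothetical helper:

```shell
# Hypothetical helper: print every %(name)s used in the ckan.plugins line
# that has no "name = ..." definition in the same ini file.
undefined_plugin_groups() {
    ini_file="$1"
    grep -E '^ckan\.plugins[[:space:]]*=' "$ini_file" |
        grep -oE '%\([A-Za-z0-9_.]+\)s' |       # each %(name)s reference
        sed -e 's/^%(//' -e 's/)s$//' |         # strip the %( )s wrapper
        sort -u |
        while read -r name; do
            grep -qE "^${name}[[:space:]]*=" "$ini_file" || echo "$name"
        done
}

# Usage, before "datacats reload":
#   undefined_plugin_groups development.ini
```

No output means every referenced group is defined. It does not check whether the plugins themselves are installed.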
PyCSW¶
While our contribution is in development, we’ll manually build and run a dockerised pycsw using our datacats fork:
cd /var/projects/ckan/datacats/docker/pycsw/
docker build -t datacats/pycsw .
docker run -d -p 9000:8000 -it datacats/pycsw python /var/www/pycsw/csw.wsgi
This will build a pycsw server image with harvesting enabled (transactions) for non-local IPs and run a pycsw server on localhost:9000. See also nginx settings in Deployment to expose the csw server publicly.
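To check that the container answers, a standard CSW GetCapabilities request can be sent to it. The sketch below only builds the request URL; that the endpoint is served at the root path of localhost:9000 is an assumption based on the docker run line above, and csw_capabilities_url is a hypothetical helper:

```shell
# Hypothetical helper: build a CSW 2.0.2 GetCapabilities URL for a base endpoint.
csw_capabilities_url() {
    base_url="${1:-http://localhost:9000/}"   # assumed endpoint, adjust as needed
    printf '%s?service=CSW&version=2.0.2&request=GetCapabilities\n' "$base_url"
}

# Usage (requires the pycsw container to be running):
#   curl -s "$(csw_capabilities_url)" | head -n 5
```

A running server should return an XML capabilities document; an error page or an empty reply points to a problem with the container or the port mapping.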