
Migrate Process Newspaper3k

Submitted by daniel on

Migrate Process Newspaper3k provides a Migrate process plugin that requests and extracts article data using Newspaper3k, the Python-based article download framework.

Newspaper3k Features

  1. Multi-threaded article download framework
  2. News URL identification
  3. Text extraction from HTML
  4. Top image extraction from HTML
  5. All image extraction from HTML
  6. Keyword extraction from text
  7. Summary extraction from text
  8. Author extraction from text
  9. Google trending terms extraction
  10. Works in 10+ languages (English, Chinese, German, Arabic, …)
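For orientation, the features above correspond roughly to the following calls in Newspaper3k's Python API. This is a minimal sketch and not the code this module runs; the `article_to_record` and `fetch_article` helpers, the choice of fields, and the example URL are assumptions made for illustration.

```python
from types import SimpleNamespace

# Fields copied from an Article-like object; a subset chosen for illustration.
FIELDS = ("url", "title", "authors", "text", "top_image", "keywords", "summary")


def article_to_record(article):
    """Copy selected attributes into a plain dict (hypothetical helper)."""
    return {name: getattr(article, name, None) for name in FIELDS}


def fetch_article(url):
    """Download and parse one article with Newspaper3k.

    Needs `pip3 install newspaper3k` and the NLTK corpora used by nlp().
    """
    # Imported lazily so article_to_record() can be used without the library.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the HTML
    article.parse()     # extract title, authors, text and images
    article.nlp()       # derive keywords and summary
    return article_to_record(article)


# Offline demonstration with a stand-in object instead of a real download.
stub = SimpleNamespace(
    url="https://example.com/story",
    title="Example",
    authors=["A. Writer"],
)
record = article_to_record(stub)
```

Calling `fetch_article()` with a real article URL performs the network request; the stub at the end only shows the shape of the resulting dictionary.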


Requirements

LAMP server with Python 3 support.

See for installation and setup instructions.

Example Usage

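       # 'field_summary' is a placeholder destination field; substitute your own.
       field_summary:
         - plugin: migrate_process_js_redirect_link
           source: link
         - plugin: migrate_process_newspaper3k
         - plugin: skip_on_empty
           method: row
           message: 'migrate_process_newspaper3k import failed'
         - plugin: extract
           index:
             - summary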

Schema Keys

See the included stub JSON for more information. Keys of note include:

summary (string)
source_url (string)
url (string)
title (string)
top_img (string)
top_image (string)
meta_img (string)
imgs (array)
images (array)
movies (array)
text (string)
keywords (array)
meta_keywords (array)
tags (array)
authors (array)
publish_date (string)
html (string)
meta_data (array object)
meta_description (string)
article_html (string)
top_node (string)
doc (string)
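The sketch below illustrates, in Python, how a value such as summary might be picked out of the decoded result by key, which is roughly what the extract step in the example usage does. The sample data and the `extract_value` helper are invented for illustration; the real stub JSON ships with the module and may differ.

```python
import json

# Invented sample shaped like a few of the keys listed above; not real output.
SAMPLE = json.loads("""
{
  "title": "Example headline",
  "summary": "A short summary.",
  "authors": ["A. Writer"],
  "meta_data": {"og": {"type": "article"}}
}
""")


def extract_value(data, index, default=None):
    """Follow the keys in `index` through nested dicts/lists (hypothetical helper)."""
    current = data
    for key in index:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            # Key missing or value not subscriptable: fall back to the default.
            return default
    return current


summary = extract_value(SAMPLE, ["summary"])
first_author = extract_value(SAMPLE, ["authors", 0])
og_type = extract_value(SAMPLE, ["meta_data", "og", "type"])
missing = extract_value(SAMPLE, ["movies", 0])
```

Returning a default for absent keys mirrors the idea of pairing the extraction with skip_on_empty, so that rows without usable data can be skipped rather than imported half-filled.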

DDEV support

If you are running DDEV locally, you can add the following to the config.yaml in your project's .ddev folder to install Newspaper3k, its supporting libraries, and the necessary natural language data sets or corpora.

        - exec: 'pip3 install newspaper3k'
        - exec: 'curl | python3'
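To confirm that the install succeeded, a small check along the following lines could be run with python3 inside the container. It is a sketch under assumptions: the `module_available` and `report` helpers are hypothetical, and the check only verifies that the `newspaper` and `nltk` packages are importable, not that the corpora were downloaded.

```python
import importlib.util
import shutil


def module_available(name):
    """Return True if `name` can be imported by this interpreter."""
    return importlib.util.find_spec(name) is not None


def report():
    """Collect a few facts about the Python environment."""
    return {
        "python3_on_path": shutil.which("python3") is not None,
        "newspaper": module_available("newspaper"),  # import name of newspaper3k
        "nltk": module_available("nltk"),            # dependency used for nlp()
    }


if __name__ == "__main__":
    for check, ok in report().items():
        print(f"{check}: {'ok' if ok else 'missing'}")
```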

More info

For more info on how Newspaper3k parses articles please see


Date Created
4 months ago